Again, as someone who actually works on implementing these solutions, the biggest detriment to me and to others trying to apply these (very useful, though absolutely brittle) techniques in production is that so many people, some well meaning, some hype-jacking, some outright profiteering, are misrepresenting what DL can do and how easily it can be done, and making ill-formed blanket statements that all we need is “more data” without stopping to consider all the potentially flawed and biased assumptions that data brings with it.
The data might be utter garbage: variables completely unrelated to each other (or just happenstance correlations), mislabeled (if labeled at all), or shifting in concept or use randomly throughout the dataset (where some developer kept changing their mind about what a column was supposed to mean, its categorical vs. numerical nature, its range, its interval, etc.). And that’s just the data side. Then there are the algorithms themselves, which are just clever mathematical tricks that attempt to force certain “shapes” or “boundaries” onto the jumbled mess. The algorithms and parameters we set, by the nature of their numerical embodiment, have unintended effects on the output shape, such as creating clusters and divisions between groups that really shouldn’t be there, and yet we accept it because 80% recall is “good enough” for a certain application.
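To make the “happenstance correlations” point concrete, here’s a toy sketch (my own illustration; it assumes scikit-learn is available, and every number in it is made up): with far more random features than samples, a model will happily “find” structure in pure noise and post a great-looking training recall that evaporates on fresh data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n_samples, n_features = 100, 500               # far more features than samples

X = rng.normal(size=(n_samples, n_features))   # pure noise "measurements"
y = rng.integers(0, 2, size=n_samples)         # labels unrelated to X

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train recall:", recall_score(y, clf.predict(X)))    # near 1.0 on noise

X_new = rng.normal(size=(n_samples, n_features))           # fresh noise
y_new = rng.integers(0, 2, size=n_samples)
print("holdout recall:", recall_score(y_new, clf.predict(X_new)))  # ~0.5
```

That gap between the near-perfect training number and the coin-flip holdout number is exactly the sort of thing that gets papered over when a recall figure on a favorable split is declared good enough.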
A production Deep Learning system is (oversimplified) just numerical manipulation through a set of fixed-weight matrices which feed into functions. Our ability to get these systems to train, even with “clean” data, presumes that some real relationship exists between the input variables. Updating weights through the brute force of backpropagation, though it sometimes works, isn’t guaranteed to find a working, or even good, solution consistently. There’s so much randomness and non-deterministic behavior that even with the same architecture, same shapes, same data, and even the same learning rate and other hyperparameters, you might not consistently arrive at a working model, which means you’ll spend more time, energy, electricity, all of it, just to attempt to maybe get one.
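For the “fixed-weight matrices feeding into functions” view, here’s a minimal sketch of what inference actually is once training is done (the weights and shapes below are arbitrary stand-ins, not any particular model):

```python
import numpy as np

rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # weights frozen after training
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x):
    # Inference is just matrix multiplies piped through nonlinearities.
    h = relu(x @ W1 + b1)
    return softmax(h @ W2 + b2)

print(forward(rng.normal(size=4)))  # two class scores summing to 1.0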
So it’s often fine when you can get it to work and buttress it with all the required constraints and expectations, but far too many folks and companies out there are making far too large claims about the ability of DL to solve problems, much less lead to AGI. Often, the people who talk the loudest about it know the least about how to actually implement any of it; they’re just hucksters looking to profit off the hopes and dreams of the gullible and the ignorant-but-well-meaning.
DL is applied calculus, and it IS pretty neat when it works. But network-wide backpropagation is a terribly inefficient way to conduct learning, one that produces fascinating yet brittle results, and I’ll stick by that. Even those impressive massive models that have memorized troves of written data (GPT-3, for example) are still terribly brittle and temperamental beasts, around which folks are working hard to place hand-written filters and limiters so that only semi-correct answers are allowed to fall out.
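Those “hand-written filters and limiters” are usually nothing more exotic than this kind of wrapper (a hedged sketch; generate() and the denylist here are hypothetical placeholders, not any real API):

```python
import re

DENYLIST = [re.compile(p, re.IGNORECASE)
            for p in (r"\bguaranteed cure\b", r"\bwire transfer\b")]

def generate(prompt: str) -> str:
    # Stand-in for a call to some large language model.
    return "model output for: " + prompt

def guarded_generate(prompt: str, retries: int = 3) -> str:
    # Re-sample until an output passes every filter, then give up.
    for _ in range(retries):
        text = generate(prompt)
        if not any(p.search(text) for p in DENYLIST):
            return text
    return "[response withheld by output filter]"
```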
Simply repeating the flawed approach over and over while scaling up to power-plant-dependent levels of electricity is not going to cut it. Instead we’ll need Numenta (and others) pushing the boundaries of biologically mimicking systems: more efficient basic operations, a different approach to the math, corresponding ASICs, and a rethink of how we pick and choose which connections to update and when.
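As a toy illustration of that last point, and emphatically my own sketch rather than Numenta’s actual algorithm, here’s what a local, sparse update rule can look like: only the k most active units learn on each step, rather than error being backpropagated through every weight in the network.

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(scale=0.1, size=(16, 32))    # input -> 32 units
k, lr = 4, 0.05

def sparse_step(x):
    acts = W.T @ x                          # unit activations
    winners = np.argsort(acts)[-k:]         # the k most active units
    # Hebbian-style local update: nudge only the winners toward the input.
    W[:, winners] += lr * (x[:, None] - W[:, winners])
    return winners

for _ in range(100):
    sparse_step(rng.normal(size=16))
```

The point of the sketch is the locality: each update touches a handful of columns, decided by activity, with no global gradient pass.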
Attention mechanisms help in DL, but if we take a step back, the entire HTM (Hierarchical Temporal Memory) approach was already a multilayered, multi-headed attention mechanism long before the DL community ever considered attention heads.
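For reference, the textbook DL formulation being compared against, a minimal single-head scaled dot-product attention (shown only to make the comparison concrete):

```python
import numpy as np

def attention(Q, K, V):
    # Softmax over key similarities, then a weighted sum of values.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(attention(Q, K, V).shape)   # (5, 8)
```

A multi-head version just runs several of these in parallel on learned projections and concatenates the results.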