Limits of deep learning


I kinda liked this article about the limits of deep learning:
It is true that each layer in a neural net is a relatively simple vector mapping that stretches some input points away from each other and squeezes others closer together.
That raises the question of how complex the learning algorithm really needs to be. Maybe back propagation is more than is needed? I’ll try out some simpler ideas in code and see how things work out.
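To make “simpler” concrete, here is a minimal sketch with no gradients at all: plain random-perturbation hill climbing on a tiny two-layer net. The task (XOR) and every setting are my own illustrative choices, not anything from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: learn XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def forward(params):
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)        # one layer: a simple vector mapping
    return (h @ W2 + b2).ravel()

def loss(params):
    return np.mean((forward(params) - y) ** 2)

params = [rng.normal(0, 0.5, (2, 8)), np.zeros(8),
          rng.normal(0, 0.5, (8, 1)), np.zeros(1)]
best = loss(params)

for _ in range(30000):
    trial = [p + rng.normal(0, 0.05, p.shape) for p in params]
    l = loss(trial)
    if l < best:                    # keep only improving perturbations
        params, best = trial, l

print(best)  # small mean squared error on XOR
```

No back propagation anywhere, yet it solves XOR; the open question is how badly this scales compared to gradient methods.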


I think another way to understand deep learning’s limits is to understand why it works so well. I recently reported on a paper by Lin & Tegmark that tries to understand mathematically why deep learning works so well in practice.

The long and short of it is basically this: 1) the phenomena we’re interested in can generally be described with low-order polynomials; 2) deep learning is good at approximating low-order polynomials; 3) therefore, deep learning works well in practice.
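Point 2) can be checked directly. Lin & Tegmark give a construction where a handful of smooth neurons approximate the product x*y; this little sketch (my own paraphrase of it, using sigma = exp, though any smooth activation with a nonzero second derivative at 0 works) verifies it numerically:

```python
import numpy as np

def neuron_product(x, y, lam=0.01):
    sigma = np.exp
    # Taylor-expanding sigma around 0, the four terms cancel everything
    # except 4 * lam**2 * sigma''(0) * x * y, so dividing recovers x * y.
    return (sigma(lam * (x + y)) + sigma(-lam * (x + y))
            - sigma(lam * (x - y)) - sigma(-lam * (x - y))) / (4 * lam**2)

print(neuron_product(3.0, 2.0))  # ~ 6.0
```

Four neurons to multiply two numbers; stack layers of these and you get low-order polynomials cheaply, which is the heart of their argument.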

Of course, there’s a whole host of issues it doesn’t address, such as the enormous training set requirements, the time dimension, continuous learning, etc. But I think it really helps in understanding things.

So, knowing what deep learning is good at, you can easily see how to break it.


That was a terribly interesting paper. I’ve been thinking recently about how simple preprocessing applied to the input of a deep net irrevocably alters the nature of the calculations done by the net.
For example, doing a 2D FFT of an image concentrates most of the information into a small number of low-frequency components, so the net then mainly responds to only a small subset of its input.
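A quick sketch of that concentration effect, using a smooth Gaussian bump as a stand-in for a natural image (my own toy example):

```python
import numpy as np

N = 64
yy, xx = np.mgrid[0:N, 0:N]
# A smooth "image": a Gaussian bump centered in the frame.
img = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / (2 * 8.0 ** 2))

F = np.fft.fftshift(np.fft.fft2(img))     # DC moved to the center
power = np.abs(F) ** 2
low = power[24:40, 24:40].sum()           # central 16x16 low-frequency block
print(low / power.sum())                  # nearly all the energy
```

Almost the entire spectrum lives in the central 16x16 block of a 64x64 transform, so a net fed the FFT effectively sees a 256-dimensional input rather than a 4096-dimensional one.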
One possible alternative to back propagation through parameters is back propagation through shallow nets. For each layer (a shallow net in itself) you learn a separate shallow inverse net, and use the inverse net for back propagating rather than going through the awkward (and biologically unrealistic) process of inverting through the forward layer parameters. The question is whether such a system can usefully learn by reaching equilibrium during training, and under what exact conditions.
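Here is a minimal sketch of the learned-inverse idea on a single linear layer y = W x, with all the specifics below being my own assumptions: instead of inverting through W’s parameters, learn a separate matrix B that reconstructs x from y, and send errors back through B.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(5, 8))              # forward layer: 8 -> 5
B = np.zeros((8, 5))                     # learned inverse: 5 -> 8

X = rng.normal(size=(1000, 8))           # white training inputs
Y = X @ W.T
lr = 0.02
for _ in range(5000):
    B += lr * (X - Y @ B.T).T @ Y / len(X)   # descend ||B y - x||^2

# At the optimum B acts as a right inverse of W, so an output error sent
# back through B and then forward through W comes back unchanged:
print(np.abs(W @ B - np.eye(5)).max())   # close to zero
```

Each layer could learn its own B this way purely from local reconstruction, which is what makes the scheme biologically more plausible than transposing the forward weights.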


The answer is that finding an equilibrium is not helpful. It seems that the actual optimization effect of gradient descent is the more important thing.
I still have an analog version of a context-sensitive decision/prediction tree to try that would substantively dodge the need for search/optimization behavior during learning. If that can’t be avoided, then there are difficult questions about how the biological brain implements optimization for learning.


One way of looking at deep networks is to regard each layer as a relatively simple vector mapping, with different input points being squeezed together or pulled apart from each other.
Each layer is part analog hash table, part separating classifier. It is possible that the term “back propagation” is a misnomer, and that actually only the optimization effect of gradient descent is important. The lack of local minima in deep networks would allow gradient descent to work unhindered, finding suitable hash mappings and separations.


For those interested and less inclined to read papers, Max did a presentation about their work:


I actually didn’t buy into that video when I saw it before, or maybe I just didn’t like its style. Far more exhaustive is this paper:

I like the idea that you provide (position) evaluation functions, as in chess programming, and the entire system optimizes itself to fulfill them (e.g. homeostasis).


I’ll try the idea of feedback alignment mentioned in the above paper. I remember Hinton mentioning something about it in one of his videos, but I didn’t follow it up.


I definitely shouldn’t post code I only finished 5 minutes ago. It might be incorrect or whatever. But that’s the hobbyist advantage: I can just post it anyway.


@Sean_O_Connor “Toward an Integration of Deep Learning and Neuroscience” is extremely comprehensive and informative. I’ve only completed up to section 3, and I’m glad to see HTM theory is included in the paper (one of the few things in ML I’m confident I understand). Thank you for sharing it.

As for feedback alignment, I’d be interested to see if this solves the vanishing gradient problem of RNNs.


It’s a good overview paper for sure. I’m experimenting with feedback alignment at the moment. Apparently I have to include weight decay.
I previously looked at sending back Gaussian noise proportional to the length of the error vector. There was some discussion that conventional back propagation just scrambles the error into Gaussian noise after going back through 2 or 3 layers anyway. If that is true, then the “alignment” part may be unnecessary or sub-optimal. I’ll do some experiments and compare.


Using feedback alignment looks far better than proportional Gaussian noise after letting the comparison code run for a few hours. I’ll experiment some more with random weight initialization and different activation functions.


Feedback alignment with fully linear net:
You get the idea anyway. I’ll leave it there and anyone interested can do their own assessment.
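In outline, a fully linear feedback-alignment net looks something like this (a minimal sketch rather than the exact code, with all dimensions and learning rates being illustrative): the hidden error is sent back through a FIXED random matrix B instead of through W2 transposed.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid, n_out = 10, 20, 5
T = rng.normal(size=(n_out, n_in))       # target linear map to learn

W1 = rng.normal(0, 0.5, (n_hid, n_in))
W2 = rng.normal(0, 0.1, (n_out, n_hid))
B = rng.normal(0, 0.5, (n_hid, n_out))   # fixed random feedback weights

lr = 0.002
for _ in range(50000):
    x = rng.normal(size=n_in)
    h = W1 @ x
    e = T @ x - W2 @ h                   # output error
    W2 += lr * np.outer(e, h)            # ordinary delta rule
    W1 += lr * np.outer(B @ e, x)        # feedback alignment step

err = np.linalg.norm(W2 @ W1 - T) / np.linalg.norm(T)
print(err)                               # relative error after training
```

Despite B being random and never updated, the composed map W2 @ W1 still converges toward the target.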


The output of feedback alignment nets looks very similar to the output of nets I did that had unsupervised feature learning followed by a readout layer. I don’t want to give feedback alignment a thumbs up or a thumbs down; I don’t know. You be the judge.



Actually, scale-free mutations would make a good probe for estimating the gradient in a deep neural network, whereas normal Gaussian (etc.) mutations would not:

You would have to keep some sort of rolling average of the gradient direction determined from those mutations that gave an improvement. It seems like it would make a neat way to mix evolution and conventional neural network learning, perhaps getting the best of both. I’ll have to try it.
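A rough sketch of what that could look like, with the objective and all constants being placeholder guesses: perturb the parameters with Cauchy noise (mutations at every scale), keep a rolling average of the unit directions that improved the loss, and follow that average as a crude gradient estimate.

```python
import numpy as np

rng = np.random.default_rng(4)

def loss(w):
    return np.sum((w - 3.0) ** 2)        # toy objective, minimum at 3

w = np.zeros(10)
grad_est = np.zeros(10)
decay = 0.99

for _ in range(5000):
    m = rng.standard_cauchy(10) * 0.01   # scale-free (Cauchy) mutation
    grad_est *= decay                    # let stale directions fade
    if loss(w + m) < loss(w):            # improving mutations only
        grad_est += (1 - decay) * m / np.linalg.norm(m)
    w += 0.05 * grad_est                 # follow the rolling average

print(loss(w))  # far below the starting loss of 90
```

The heavy tail is the point: Cauchy mutations keep producing useful small-scale probes near the optimum, where fixed-scale Gaussian mutations would almost never be accepted.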