Bengio's view of deep learning in the brain


I’m exploring a way of organizing an AI brain that runs contrary to the one in this paper:
It could be, though, that the brain has some massively parallel deep learning capabilities.


Moved from #htm-theory:neuroscience because this is not a neuroscience paper.


Was MNIST learnt by unsupervised learning in this paper? It still looks like supervised learning, but I only glanced through the paper. It also seems that Bengio only helped implement some of the paper's backprop-alternative algorithms and wasn’t the main contributor. Regardless, it seems like an interesting paper. The math is an indexing hell, though, which makes it hard to follow.


I just scanned the paper to get the gist of it, too. The main point I am interested in is whether dynamic online deep learning is necessary for human-like general intelligence. If it is, current hardware isn’t sufficient to do that in real time, so the capability of AI will grow slowly, giving the human species plenty of time to adapt and to decide how far to take things.

If all that is required is to train a memory-utilizing deep network for a few weeks on a supercomputer, and that network can then go on to use memory dynamically for learning and recall via its intrinsic intelligence, then there will be no time for a coordinated response.

Anyway, it only adds slightly to the list of existential risks that humans face and are so cavalier about. In evolutionary biology there is the notion of “kill the winner”; otherwise one species would tend to completely dominate the environment. Up until a few thousand years ago that rule held. But hey, we’re here. Let’s see if we can continue to defy evolutionary gravity.


It’s hard to say what is and isn’t required for intelligence. But if I were to bet, I’d put my money on dynamic online learning. HTM is by far the closest to how the human brain works, hands down. But still, as Jeff keeps reminding us, it’s evolving.

To be honest, I think that for practical reasons, and to prove the power of HTM, Numenta should combine it with some current efficient algorithms.

One application that comes to mind is using k-sparse autoencoders as an encoder that feeds images/video to the HTM cortical algorithm.

A k-sparse autoencoder with a 2048-dimensional latent space can create exactly the SDRs HTM is looking for, which can then be fed into an HTM. It wouldn’t be cheating, since it’s only for the encoder part. Such a system may even be able to learn intuitive physics better than one-frame-predicting GANs.
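To make the idea concrete, here is a minimal sketch of just the sparsification step: take a 2048-wide latent vector, keep the k largest activations, and binarize. The function name `k_sparse_sdr`, the random stand-in for the encoder output, and k=40 (roughly 2% active) are my assumptions for illustration, not anything from the paper or from Numenta's code.

```python
import random

def k_sparse_sdr(activations, k):
    """Keep the k largest activations and return a binary SDR (list of 0/1)."""
    # Indices sorted by activation, largest first; keep the first k.
    top = sorted(range(len(activations)),
                 key=lambda i: activations[i], reverse=True)[:k]
    sdr = [0] * len(activations)
    for i in top:
        sdr[i] = 1
    return sdr

rng = random.Random(0)
# Stand-in for the autoencoder's latent layer (assumption: a trained
# encoder would produce this vector from an image or video frame).
latent = [rng.gauss(0.0, 1.0) for _ in range(2048)]
sdr = k_sparse_sdr(latent, k=40)  # k=40 is an assumed sparsity level
```

The resulting list always has exactly k active bits, which is the fixed-sparsity property an HTM input would want; whether the learned bits carry useful semantic overlap is a separate question that depends on training.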

Hey @rhyolight has anyone thought of using k-sparse autoencoders as a visual encoder?



To understand what I’m saying, you may want to read reviewer one’s comment here:
On the detail of making it sparse.


I should be putting the washing out, but it’s raining. I am also procrastinating about other things, like coding schemes I have in my head.
If you have a large data reservoir that you put sensory information (and more) into, and that you can read from and write to, you can weight everything in the reservoir to soft-select what you need, do a dimension-reducing set of random projections, and feed that to a specific neural network (or associative memory). Then you can take the output of the neural network, do a dimension-increasing set of random projections, and weight that to soft-select what to affect in the reservoir. You do that multiple times with multiple different small networks and associative memory systems. There are no hard edges in such a scheme; everything is smooth, finely quantized and easy to evolve.
I don’t think back-propagation could work here; you must use evolution, but that is not so impossible.
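Roughly, one read-process-write step of that scheme could look like the sketch below. All the sizes, the sign-based random projections, the tanh layer and the additive write are assumptions I picked to keep it short; in the scheme described, the selection weights and network weights would be evolved rather than drawn at random as they are here.

```python
import math
import random

rng = random.Random(1)

def random_sign_matrix(rows, cols):
    """Random projection matrix with entries +1 or -1."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

R, D, H = 64, 8, 8  # reservoir size, reduced dimension, net width (all assumed)

reservoir = [rng.uniform(-1, 1) for _ in range(R)]
read_w = [rng.random() for _ in range(R)]   # soft read-selection weights
write_w = [rng.random() for _ in range(R)]  # soft write-selection weights
down = random_sign_matrix(D, R)             # dimension-reducing projection
up = random_sign_matrix(R, H)               # dimension-increasing projection
net_w = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(H)]  # tiny network

def step(reservoir):
    selected = [w * x for w, x in zip(read_w, reservoir)]          # soft select
    reduced = [s / math.sqrt(R) for s in matvec(down, selected)]   # R -> D
    hidden = [math.tanh(h) for h in matvec(net_w, reduced)]        # small net
    expanded = [e / math.sqrt(H) for e in matvec(up, hidden)]      # H -> R
    # Soft-selected additive write back into the reservoir.
    return [x + w * e for x, w, e in zip(reservoir, write_w, expanded)]

new_reservoir = step(reservoir)
```

Running several such modules against the same reservoir, each with its own weights and projections, would give the "multiple small networks" arrangement.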


If you evolve rather than back-propagate, then you can use x*x as a sparsity-inducing activation function. Anyway, whatever form of sparsity you use, you are throwing away a lot of the input information at each layer, which you must compensate for in some way. You can do what ResNets do, or maintain some weights back to the input at each layer. Or, with a global data reservoir, the sensory input is always available, presuming you write it back in at each step.


If you have 4 numbers a, b, c and d, and you arbitrarily add and subtract them,
x = a-b+c+d or y = -a+b-c+d, then those are 2 simple random projections of the 4 numbers.

If you have 32 numbers then there are 2^32 different simple random projections through which you can view the original data, and with high probability 2 randomly chosen RPs are approximately orthogonal. Therefore, if you randomly pick 32 random projections of the input data you will hardly lose any information about it. I.e., you have enough information to reconstruct the 32 numbers to good accuracy, almost as if you had deliberately created 32 new orthogonal basis vectors to do a simple linear algebra change of basis.
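Here is a small sketch you can run to check this for yourself. It draws 32 random sign patterns, projects 32 numbers through them, measures the cosine between two of the patterns, and recovers the original numbers by solving the linear system. The helper names and the seed are mine; for 32 dimensions the cosine between two random sign vectors is typically on the order of 1/sqrt(32), so "approximately orthogonal" is fairly loose at this size and gets tighter as the dimension grows.

```python
import random

rng = random.Random(42)
n = 32

def sign_vector(n):
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("projections are linearly dependent; redraw them")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            f = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        s = m[r][size] - sum(m[r][c] * x[c] for c in range(r + 1, size))
        x[r] = s / m[r][r]
    return x

data = [rng.uniform(-1, 1) for _ in range(n)]
projections = [sign_vector(n) for _ in range(n)]  # 32 random sign patterns
views = [dot(p, data) for p in projections]       # the 32 projected values

# Each sign vector has length sqrt(n), so the cosine is dot / n.
cosine = dot(projections[0], projections[1]) / n

recovered = solve(projections, views)
max_error = max(abs(r - d) for r, d in zip(recovered, data))
```

The reconstruction works whenever the 32 sign patterns happen to be linearly independent, which is overwhelmingly likely for random draws; the `ValueError` branch covers the rare case where they are not.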

Then you can have more complicated RPs such as z = c1*a + c2*b + c3*c + c4*d, where c1, c2, c3 and c4 are random constants.

RPs provide you with many, many different windows on the same underlying data.


To me the Bengio et al. paper is exactly Jeff’s neurons missing one BIG piece. That is, they have no distal dendritic compartments that have their own local thresholding and put the soma into a “predictive state.” They treat all basal dendritic synapses the same: just a sum and threshold at the soma.

They are supposed to be physiologists. How can they not include this major feature?
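To spell out the difference I mean, here is a heavily simplified sketch. The function names, thresholds and the set-based segment representation are hypothetical, chosen only for illustration; this is not Numenta's implementation and not the model from the paper.

```python
def point_neuron(inputs, weights, threshold):
    """All synapses are summed at the soma, then thresholded once."""
    return sum(w * x for w, x in zip(weights, inputs)) >= threshold

def htm_style_neuron(proximal_inputs, proximal_weights, soma_threshold,
                     distal_segments, active_cells, segment_threshold):
    """Return (active, predictive).

    distal_segments: list of sets of presynaptic cell ids, one set per segment.
    active_cells: set of cell ids that are currently active.
    """
    drive = sum(w * x for w, x in zip(proximal_weights, proximal_inputs))
    active = drive >= soma_threshold
    # Each distal segment thresholds locally. One matching segment is enough
    # to put the cell into the predictive state without making it fire.
    predictive = any(len(seg & active_cells) >= segment_threshold
                     for seg in distal_segments)
    return active, predictive

segments = [{1, 2, 3, 4}, {10, 11, 12, 13}]
# Three active cells land on the first segment: predictive, but not firing.
state = htm_style_neuron([0, 0, 0], [1.0, 1.0, 1.0], 2.0,
                         segments, {1, 2, 3, 20}, 3)
# Four active cells spread over two segments: no segment reaches threshold.
spread = htm_style_neuron([0, 0, 0], [1.0, 1.0, 1.0], 2.0,
                          segments, {1, 2, 10, 11}, 3)
```

The second call is the point: four active distal synapses spread across two segments do nothing, whereas a single sum at the soma would count all four together.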