Deep Predictive Learning: A Comprehensive Model of Three Visual Streams

I skimmed through the article; here are some highlights:

it is worth observing that the odds of discovering a model of this complexity through a purely bottom up, empirically-driven approach seem rather small.

we assume that an individual simulated neuron in our model corresponds to a population of roughly 100 spiking neurons organized into microcolumns in the neocortex

[The model is] trained with a strong influence from a common predictive error signal represented as a temporal difference over the pulvinar. […] To measure network learning, we compute the cosine difference between the minus-phase prediction and plus-phase actual input over this V1p layer (cosine is computed as the normalized dot product between the two vectors, separately mean-normalized).
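
A minimal sketch of the metric described in that excerpt, assuming NumPy and made-up activation vectors (the function name and the values are illustrative, not taken from the paper's code). Each vector has its own mean subtracted, then the normalized dot product is taken:

```python
import numpy as np

def mean_normalized_cosine(prediction, actual):
    """Cosine between two vectors after subtracting each vector's own mean."""
    p = np.asarray(prediction, dtype=float)
    a = np.asarray(actual, dtype=float)
    p = p - p.mean()
    a = a - a.mean()
    denom = np.linalg.norm(p) * np.linalg.norm(a)
    if denom == 0.0:
        # Assumption: a constant vector carries no pattern, so report 0.
        return 0.0
    return float(np.dot(p, a) / denom)

# Hypothetical minus-phase prediction and plus-phase input over a tiny layer.
minus_phase = np.array([0.1, 0.8, 0.3, 0.0])
plus_phase = np.array([0.0, 1.0, 0.2, 0.0])
similarity = mean_normalized_cosine(minus_phase, plus_phase)
```

A value near 1 means the prediction matches the pattern of the actual input; the mean subtraction keeps a uniformly active prediction from scoring well by default.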

I think that this excerpt describes how the model forms viewpoint invariant representations:

The V4d and TEOd deep layers also receive a self-context projection, which integrates across the prior deep layer activations in addition to the superficial layers. This supports more enduring activation states over time.

I stand corrected; they do use Hebbian learning, among other things:

Importance of Temporal Context, Hebbian Learning, Momentum

Finally, we report the effects of various important elements of the DeepLeabra computational framework, including the deep-layer temporal context mechanism, the combination of BCM-like Hebbian learning along with error-driven learning, and the effects of using momentum in the learning rule. Figure 16 shows that each of these factors plays an important role in contributing to the overall performance of the intact network. For the Hebbian and momentum factors, both of these produced more “dead” units (the flip-side of the hog units mentioned above — these are easier to quantify), particularly in the higher layers, with hebbian being particularly important for TEO while momentum was more important for V4. [Emphasis is mine]

There are several different learning methods employed in this model.
I can see room to add the HTM model and speed the learning up something wonderful.

Are you not cautious or worried about the mishmash of methods, such as backpropagation and Hebbian learning, and the seemingly well-organized information-processing pipeline presented here? If the brain works this way, then it resembles the flowchart of a procedural program, where everything seems very organized and follows an obvious pattern, which I don’t think is the case for the brain. I would just like to know your thoughts on this, as backprop plus NS bothers me a bit. NS evidence is presented, but IMO it seems to be there to complement or justify a deep-learning-style technique (e.g. backprop).

One question I have about this paper is whether the NS evidence was used as a set of constraints, similar to how HTM was created, or whether it is more the icing on the cake, where the cake is simply the predictive autoencoder.

I would also like to admit that the paper made my nose bleed; I was not able to digest all of it, so I will need to reread it.

No! The brain is full of different types of learning. In fact, I fully expect this is going to be a requirement for the “way forward” in larger systems. As I have posted elsewhere on the forum, the hippocampus and subcortical structures are very different. I see the various layers in the cortex as also being different. This is the first place that I have seen that took the timing behavior of the lower layers and explained how that was useful - a predictive mechanism!

One of the things that brought me here was the temporal aspect of HTM. My understanding of how the brain works demanded this temporal mechanism and the classic RNN from the DL community really did not excite me. Here is a second and very biologically plausible mechanism - an embarrassment of riches really.

This paper is one of the first that embraces that and uses it in a fairly biological way.
I also am a fan of this overall layout, but expect to see more lateral map-2-map connections; much more.

The counter-flowing streams closely map onto how I know the brain to be laid out, so the recognition that this supports the local equivalent of backprop is a massive breakthrough in understanding WHY it is laid out this way.

This paper took me about 6 passes to finally get all they were saying, and that was after I had read some of the earlier papers that led up to it. It packs a massive amount of information into a very small space. Nobody said understanding how the brain works was going to be easy.

When I could finally understand exactly what was going on in Figure 18, I became a total fanboy.

I am not sure, and I cannot prove this, but my personal opinion is that there is likely one type of learning, and that this learning emerges from simple local rules. We humans are trying to program this emergence in a sequential or organized manner, much like software. As far as I understand, CHL is still biologically implausible, and I can’t find where this paper addresses that implausibility, or maybe I’ve missed it (in which case, sorry). Even though it is Hebbian-like learning, it is still inspired by backpropagation, and in my understanding it needs a symmetric set of neural pathways to update the weights during feedback, which I believe is not true in the brain, as the brain is much messier, at least chemically and in its neuronal connections.
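
For readers who have not met CHL before, here is a minimal sketch of the textbook contrastive Hebbian update, where the weight change is the difference between the plus-phase and minus-phase co-activity of sender and receiver. The function name, learning rate, and activations are illustrative assumptions, not values from the paper:

```python
import numpy as np

def chl_update(x_minus, y_minus, x_plus, y_plus, lrate=0.1):
    """Textbook CHL: dW = lrate * (x+ y+^T - x- y-^T).

    x_* are sending activations, y_* are receiving activations;
    the result has shape (n_senders, n_receivers).
    """
    x_minus = np.asarray(x_minus, dtype=float)
    y_minus = np.asarray(y_minus, dtype=float)
    x_plus = np.asarray(x_plus, dtype=float)
    y_plus = np.asarray(y_plus, dtype=float)
    return lrate * (np.outer(x_plus, y_plus) - np.outer(x_minus, y_minus))

# Hypothetical case: the receiver was at 0.2 in the minus (prediction) phase
# and at 1.0 in the plus (outcome) phase, with only the first sender active.
dw = chl_update([1.0, 0.0], [0.2], [1.0, 0.0], [1.0])
```

Note that the update itself only uses activity available at the synapse in the two phases; the symmetry concern raised above is about how the plus-phase signal reaches the receiving units in the first place.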

You are missing a key feature of the paper: that each of the two streams serves to distribute the error to the other stream. There is no overarching teacher function - the streams themselves are both the teacher and students. The paper describes the method and spends considerable time showing the progress of the process and evolution of the location of training as the learning progresses.
Please don’t get hung up on the fact that they use the same description that the non-biological method uses - this is a biologically plausible method to achieve the same type of function and performance.

The same kind of progression of connections and learning is observed in humans.

Shame, I think I missed this :cold_face: and thanks for pointing it out. So CHL was approximated by XCAL, which is a much more biologically plausible function. One thing to note, though, is that XCAL is somewhat inspired by backprop.
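
A minimal sketch of the XCAL "checkmark" function as I recall it from the CCNBook Learning chapter, with the reversal parameter θ_d = 0.1; treat the exact form and the default as assumptions to verify against the book, and the example activations as made up. In its error-driven use, the plus-phase co-activity is compared against a floating threshold set by the minus-phase co-activity:

```python
def xcal(xy, theta_p, theta_d=0.1):
    """Piecewise-linear XCAL function.

    xy      : sender * receiver co-activity driving the change
    theta_p : floating threshold between weight decrease and increase
    theta_d : fraction of theta_p where the function reverses direction
    """
    if xy > theta_p * theta_d:
        return xy - theta_p
    # Below the reversal point the function returns linearly to zero.
    return -xy * (1.0 - theta_d) / theta_d

# Hypothetical activations: sender fully active, receiver at 0.4 in the
# minus phase (prediction) and 0.9 in the plus phase (outcome).
plus_coactivity = 1.0 * 0.9
minus_coactivity = 1.0 * 0.4
dw_up = xcal(plus_coactivity, minus_coactivity)    # outcome above prediction
dw_down = xcal(minus_coactivity, plus_coactivity)  # outcome below prediction
```

In the linear region the result is simply the plus-minus difference in co-activity, which is the CHL term; the two rules differ only near zero activity, where XCAL returns to zero instead of continuing downward.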

Agreed, though this is not easy to do. At the end of the day, ML aims to build computational models that scale and apply to real-world problems, and today it is backprop that is very successful at error-driven learning; the paper even emphasizes that the XCAL function can approximate it. So it is hard not to think of backprop as being indirectly used as a benchmark, or at least treated as the optimal error-driven learning technique, at least in this paper.

My key takeaway from this paper is that, computationally, there is this XCAL function that approximates backprop and is at the same time biologically plausible. However, that plausibility rests on the assumption that backprop is the optimal error-driven learning mechanism.

Optimal?
Backprop?
No - I think you are mixing some things up here. Backprop is a method for delivering some error function to the right place in a hierarchy. This is distinctly different from the cost/error calculation.
The 3VS delivers the error to the right place in a localized way - backprop is a more global way that may be worse as the local error computation may actually be different in different places in the system.
I would offer that this localized method has different dynamics during the training operation but should be superior in generalization and cross training.

Since the authors are “selling” this method they had to put it in a context that is relatable to the general ML community. Like any new idea - it does not have to “just work,” it has to work as well as the technique most readers are familiar with to be taken seriously. This is one of the stumbling blocks that HTM is struggling with now.

It depends on perspective: it is not just the delivery, it is also about the method of error calculation. I should have specified gradient descent, which is inherent in backprop. I believe it is by far the most successful approach today, at least in mainstream ML. By optimal I mean it has been shown to be quicker, more effective, and easier to control and converge.

And not biologically plausible.
This is a showstopper for me.

Just an FYI that some of the functions mentioned here, such as XCAL, CHL, backprop, and BCM, are discussed in more detail in the CCNBook (https://grey.colorado.edu/CompCogNeuro/index.php/CCNBook/Main). I’ve been using it as one of my default NS references. This paper also reminds me to continue reading it, especially the Learning part, and probably to try the LEABRA sims. As an extreme NS newbie with a computing background, I find it not very difficult to read.

I was about to ask this as a question. Which is more likely: a messy propagation of the error signal, or a much more organized one as described in this paper? My intuition tells me it’s the former, and that the system can likely diverge. What do you think?

I see critical periods as a possible answer to this problem. Successive critical periods give specific time windows in which neuronal properties are particularly prone to modification by external experience.

Those critical periods are sequenced in time, so that the errors are first learned via Hebbian learning in every area, and then the correction is focused more and more on areas that haven’t passed their critical period yet.

It is like having a different decreasing learning rate for different areas of the network.
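
A toy sketch of that analogy, assuming each area keeps a full learning rate until its critical period closes and then decays exponentially. The area names come from the paper, but the step counts, base rate, and decay constant are entirely hypothetical:

```python
import math

# Hypothetical training step at which each area's critical period closes;
# earlier visual areas are assumed to close first.
CRITICAL_PERIOD_END = {"V1": 1000, "V4": 3000, "TEO": 6000}

def area_lrate(step, period_end, base=0.1, tail=500.0):
    """Full rate inside the critical period, exponential decay afterwards."""
    if step <= period_end:
        return base
    return base * math.exp(-(step - period_end) / tail)

def rates_at(step):
    """Learning rate of every area at a given training step."""
    return {area: area_lrate(step, end)
            for area, end in CRITICAL_PERIOD_END.items()}

# At step 2000, V1 has mostly stopped changing while V4 and TEO still learn.
rates = rates_at(2000)
```

With a schedule like this, error correction ends up concentrated in whichever areas are still plastic, which is the staggered effect described above.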

PS: All the discussions here remind me that I need to give the 3VS paper another read! There are still a bunch of ideas that are unclear to me.

My response in a different thread:

This is my attempt to reconcile the 3VS model of CT interaction with the current Numenta model. It is a work in progress. :slight_smile:

This is a good start at diagramming at the cortical column level.
You may want to add in Matthieu’s excellent diagrams here:

to support the up & down stream flows.
Use the L2/3 to support the TBT lateral voting. Or my hex-grid thing to really stir the pot!

The more I read the Network and Learning chapters in the CCNBook, the more convinced I get that gradient descent is biologically plausible. Of course, besides being nice to read (it gives intuitions too), this book is my only reliable reference for computational neuroscience.

The following is probably the best illustration I’ve seen so far of how a gradient-descent-like calculation, or “error-driven learning”, is propagated and computed (more details in the book). I have always believed that there is some kind of “attractor dynamics” happening between groups of neurons; however, I see this as changing constraint-satisfaction solutions.

I’m fairly satisfied with this and with the detailed explanation of XCAL, which is used in this paper. It also provides a theory that answers my questions about chaos in synaptic weight self-organization, although XCAL is a bit more controlled.

Now I’m wondering why this type of error-driven learning hasn’t been brought up (or maybe I am just naive, sorry) in Numenta’s research, when error-driven learning is really an old idea. I don’t know, this is just a thought. I also think that in mainstream ML, XCAL is less likely to be preferred over straightforward gradient descent (e.g. backpropagation in the ML world) because, intuitively, XCAL may be more likely to lead to divergence…

Just a reminder that I will be presenting this paper at tomorrow’s research meeting. Wish me luck.

I think you’ll do well. Looking forward to your report.

Good job.

In any case this:

We depart from the modulatory notion of Sherman and Guillery (2006), and argue that these weaker 6CT inputs are capable of driving TRC activation by themselves,

As they told you, this is a no-no. Anatomically, it is not possible to do that. Not only because they are at the end of the dendrite and maintain state across time (both facilitating and depressing), but because they are orders of magnitude more frequent than “driver” synapses (both in the thalamus and L6->L4).
