Deep Predictive Learning: A Comprehensive Model of Three Visual Streams

You are missing a key feature of the paper: each of the two streams serves to distribute the error to the other stream. There is no overarching teacher function - the streams themselves are both the teachers and the students. The paper describes the method and spends considerable time showing the progress of the process and how the locus of training evolves as the learning progresses.
Please don’t get hung up on the fact that they use the same description as the non-biological method - this is a biologically plausible method to achieve the same type of function and performance.

The same kind of progression of connections and learning is observed in humans.

Shame, I think I missed this :cold_face: and thanks for pointing this out. So CHL was approximated by XCAL, which is a much more biologically plausible function. A thing to note, though, is that XCAL is somewhat inspired by backprop.

Agree, though this is not easy to do. At the end of the day, ML aims to build computational models that scale and apply to real-world problems, and today it is backprop that is very successful at error-driven learning - even the XCAL function was emphasized as being able to approximate it. So it’s hard not to think about this learning technique, especially backprop, which is indirectly used as a benchmark, or at least treated as the optimal error-driven learning technique, at least in this paper.

My key takeaway from this paper is that there is this XCAL function that approximates backprop while at the same time being biologically plausible. However, that framing rests on the assumption that backprop is the optimal error-driven learning mechanism.
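For concreteness, here is a minimal sketch of the XCAL dWt function as presented in the CCNBook: a piecewise-linear “check mark” that yields depression below a floating threshold and potentiation above it. The θ_d = 0.1 reversal point is the book’s default parameter, not something taken from this paper.

```python
def xcal(xy, theta_p, theta_d=0.1):
    """XCAL 'check mark' weight-change function (CCNBook form).

    xy      -- product of sender and receiver activity
    theta_p -- floating plasticity threshold
    theta_d -- fraction of theta_p where LTD reverses (book default 0.1)
    """
    if xy > theta_p * theta_d:
        return xy - theta_p                 # LTD below theta_p, LTP above it
    return -xy * (1.0 - theta_d) / theta_d  # near zero activity: change fades out
```

The error-driven character comes from θ_p floating with recent outcome-phase activity, which is what lets this purely local rule approximate backprop-like corrections.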


No - I think you are mixing some things up here. Backprop is a method for delivering an error signal to the right place in a hierarchy. This is distinctly different from the cost/error calculation itself.
The 3VS delivers the error to the right place in a localized way - backprop is a more global way that may be worse, since the local error computation may actually differ from place to place in the system.
I would offer that this localized method has different dynamics during training but should be superior in generalization and cross-training.

Since the authors are “selling” this method they had to put it in a context that is relatable to the general ML community. Like any new idea - it does not have to “just work,” it has to work as well as the technique most readers are familiar with to be taken seriously. This is one of the stumbling blocks that HTM is struggling with now.


It depends on perspective: it is not just the delivery, it’s also about the method of error calculation. I should’ve specified gradient descent, which is inherent in backprop. I believe it is by far the most successful today, at least in mainstream ML. By optimal I mean it has been shown to be much quicker, more effective, and easier to control and converge.
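Since gradient descent keeps coming up, here is the bare update in one dimension, just as a reminder of what “inherent in backprop” means mechanically (a toy example, not anything from the paper):

```python
# Minimal gradient descent on a squared error, E = 0.5 * (w*x - t)**2.
# Any gradient-based learner, backprop included, repeats this same update.
def gd_step(w, x, t, lr=0.1):
    err = w * x - t       # signed error
    grad = err * x        # dE/dw
    return w - lr * grad  # step downhill along the gradient

w = 0.0
for _ in range(100):
    w = gd_step(w, x=2.0, t=1.0)
# w converges toward the solution t/x = 0.5
```

Backprop’s contribution is exactly the “delivery” part discussed above: computing that same dE/dw for weights buried deep in a hierarchy.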


And not biologically plausible.
This is a showstopper for me.


Just an FYI that some of the functions here, such as XCAL, CHL, backprop, and BCM, are discussed in more detail in the CCNBook (I’ve been using it as one of my default NS references). This paper also reminds me to continue reading that reference, especially the Learning part, and probably to try the LEABRA sims. As an extreme NS newbie with a computing background, I find it not very difficult to read.


I was about to ask this as a question. Which is more likely: a messy propagation of the error signal, or the much more organized one described in this paper? My intuition tells me it’s the former, and that the system can likely diverge. What do you think?

I see critical periods as a possible answer to this problem. Successive critical periods give specific time windows in which neuronal properties are particularly prone to modification by external experience.

Those critical periods are sequenced in time so that errors are first learned via Hebbian learning in every area, and then the correction becomes more and more focused on the areas that haven’t yet passed their critical period.

It is like having a different decreasing learning rate for different areas of the network.
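As a toy illustration of that idea (all names and numbers below are invented, not from the paper): each area gets a learning rate that stays high inside its critical window and decays once the window closes, with earlier areas closing first.

```python
import math

# Toy sketch: critical periods as staggered, decaying learning rates.
# `open_until` and `tau` are made-up parameters for illustration only.
def area_lr(t, open_until, tau, base=1.0):
    """Learning rate of one area at time t: fully plastic while its
    critical window is open, decaying exponentially once it closes."""
    if t < open_until:
        return base                                    # window still open
    return base * math.exp(-(t - open_until) / tau)    # window closing

# An early (sensory) area closes at t=5; a later area stays open until t=20.
for t in (0, 10, 20, 30):
    print(t, round(area_lr(t, open_until=5, tau=5), 3),
             round(area_lr(t, open_until=20, tau=5), 3))
```

By t=10 the early area’s rate has already decayed while the later area is still fully plastic, which matches the “correction focuses on areas that haven’t closed yet” picture.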

PS: All the discussions here remind me that I need to give the 3VS paper another read! There are still a bunch of ideas that are unclear to me.


My response in a different thread:

This is my attempt to reconcile the 3VS model of CT interaction with the current Numenta model. It is a work in progress. :slight_smile:


This is a good start at diagramming at the cortical-column level.
You may want to add in Matthieu’s excellent diagrams here:

to support the upstream & downstream flows.
Use the L2/3 to support the TBT lateral voting. Or my hex-grid thing to really stir the pot!


The more I read the chapters (Networks & Learning) in the CCNBook, the more convinced I get that gradient descent is biologically plausible. Besides being nice to read (it offers intuitions too), this book is my only reliable reference for computational neuroscience.

The following is probably the best illustration I’ve seen so far of how gradient-descent-like calculation, or “error-driven learning”, is propagated and computed (more details in the book). I’ve always believed that some kind of “attractor dynamics” happens between groups of neurons; however, I now see this as changing constraint-satisfaction solutions.
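For reference, the bidirectional error-driven rule the book builds up to (and that XCAL approximates) is usually written as Contrastive Hebbian Learning: the weight change is the difference in co-activity between the outcome (“plus”) phase and the expectation (“minus”) phase. A minimal sketch:

```python
# Contrastive Hebbian Learning (CHL): dW = lr * (x+ y+  -  x- y-).
# The "minus" phase is the network's own expectation; the "plus" phase
# is the same network settled with the outcome/teacher signal present.
def chl_dw(x_minus, y_minus, x_plus, y_plus, lr=0.1):
    return lr * (x_plus * y_plus - x_minus * y_minus)

# If the outcome drives the receiver harder than the expectation did,
# the weight grows; if the expectation overshot, it shrinks.
chl_dw(0.5, 0.2, 0.5, 0.8)   # positive: strengthen
chl_dw(0.5, 0.8, 0.5, 0.2)   # negative: weaken
```

Everything here is local to the two connected neurons, which is the biological-plausibility selling point over explicitly propagated gradients.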

I’m fairly satisfied with this, and with the detailed explanation of XCAL, which is used in this paper. It also provides a theory that answers my questions about chaos in synaptic-weight self-organization, albeit XCAL is a bit more controlled.

Now I’m wondering why this type of error-driven learning wasn’t brought up (or maybe I am just naive, sorry) in Numenta research, when error-driven learning is really an old idea. I also think that in mainstream ML, XCAL is less likely to be preferred over straightforward gradient descent (e.g. backpropagation in the ML world), because intuition suggests XCAL may be more prone to divergence…

Just a reminder that I will be presenting this paper at tomorrow’s research meeting. Wish me luck.


I think you’ll do well. Looking forward to your report.


Good job.

In any case this:

We depart from the modulatory notion of Sherman and Guillery (2006), and argue that these weaker 6CT inputs are capable of driving TRC activation by themselves,

As they told you, this is a no-no. Anatomically it is not possible. Not only because these synapses are at the end of the dendrite and maintain their state across time (both facilitating and depressing), but because they are orders of magnitude more frequent than “driver” synapses (both in the thalamus and L6->L4).


Do you have references on this particular point? I am looking for materials on connections between L6 CT axon terminals and the dendrites of thalamic relay cells (TRC).

Nearly every paper I read takes for granted that L6 CT cells have a modulatory role, but I’m not sure this point has been fully investigated.


This paper from Sherman and Guillery (1998) seems to be the most cited paper when referring to the modulatory vs driver role of L5 PT cells and L6 CT cells.

On the actions that one nerve cell can have on another: Distinguishing “drivers” from “modulators”:

Extract about the axonal termination on proximal (from L5 PT cells) or distal dendrites (from L6 CT cells):

A first order relay receives its driver inputs on proximal dendrites from subcortical sources via ascending pathways, whereas a higher order relay receives its driver inputs from cells in cortical layer 5 (see ref. 40). The first order relay sends a driver input to layer 4 of cortical area A (thick line), and that same cortical area sends a modulator input (thin line with small terminals onto distal dendrites of the thalamic relay cell) from layer 6 back to the same first order thalamic nucleus.

Next question: if we admit this anatomical observation, could many L6 CT cells together give an above-threshold input that drives the thalamic relay cells? Quantity may compensate for individual strength.

Randall O’Reilly (the author of the 3VS paper) would answer “yes”!
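Just to make the “quantity compensates for strength” intuition concrete, a back-of-envelope sketch. Every number below is a made-up placeholder, not a measured value from the paper or the literature:

```python
# Back-of-envelope check: can many weak L6 CT synapses sum above threshold?
# All values are illustrative placeholders (assumptions, not data).
n_active = 200        # L6 CT synapses active on one relay cell (assumed)
epsp_mv = 0.1         # depolarization per weak distal synapse, mV (assumed)
threshold_mv = 15.0   # depolarization needed to drive a spike (assumed)

total_mv = n_active * epsp_mv
drives = total_mv >= threshold_mv   # True with these placeholder numbers
```

Whether real L6 CT EPSP amplitudes and firing sparsity actually land on the driving side of that inequality is exactly the empirical question being debated here.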


I haven’t heard of any studies activating many L6 CT cells and causing thalamocortical cells to fire. It’s hard to tell how many L6 CT cells such studies activate, though. L6 CT cells also strongly activate the thalamic reticular nucleus, which can cause overall inhibition, so that’s different from activating only the L6 CT cells which target one particular thalamocortical cell. Activating just those CT cells might not cause much disynaptic inhibition via the thalamic reticular nucleus.

A Corticothalamic Circuit for Dynamic Switching between Feature Detection and Discrimination
A corticothalamic switch: controlling the thalamus with dynamic synapses

If you’re looking for search terms, there are loads of studies which use NTSR1-cre mice to optogenetically activate L6 CT cells.

One issue is that L6 CT cells might fire too sparsely to cause thalamocortical cells to fire. L6 CT cells have fairly broad axons in the thalamus, so thalamocortical cells might receive input from a good chunk of L6 CT cells if they form a lot of synapses. For example, in one case their axon arbors cover the whole part of the thalamus corresponding to the CT cell’s cortical column.
Activating a lot of the L6 CT cells which are presynaptic to a given thalamocortical cell might not happen outside experiments, since that could be like 1/2 of all L6 CT cells. I have no idea what the numbers are like though.


@Casey: Thanks for your answer, it helps!

The most enlightening info is in this series of videos.

In any case, the most interesting paper is:

[1] S. M. Sherman, “Functioning of circuits connecting thalamus and cortex,” Compr. Physiol., vol. 7, no. 2, pp. 713–739, 2017.