Deep Predictive Learning: A Comprehensive Model of Three Visual Streams

This is the 3rd time I’ve tried reading this paper (this time over holiday “break”) and I must say it is not an easy read. Red is stuff I don’t understand yet.

I’m 6 pages into 50+ pages. No pictures yet.


It is the wrap-up of at least half a dozen prior papers, and if you have not absorbed them first AND are not reasonably familiar with the layout of both the thalamus and the visual system - well - all they give is pointers and breadcrumbs to the prior material.
I have been following their work and have a nodding familiarity with the referenced material, and it was still pretty dense for me. That said, it does show how ALL of that comes together in a tour de force!
It really is worth the work.


If you get NOTHING else from this - note that they use the counter-flowing information channels to do some of what back-prop does AND to dramatically constrain the degrees of freedom in learning. This speeds up hierarchical learning something wonderful. Unlike the usual "random" exploration of the hierarchical search space, the learning has a direction. Since this is the prodding of the (sorry, Dr. Cisek) lizard brain, it is guided toward the most useful functions to support the dumb boss.
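A toy sketch of why this removes the need for a global backward pass (my own illustration, not the paper's DeepLeabra algorithm): if the counter-flowing feedback stream delivers a target right at the layer, the weight update becomes a purely local delta rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single layer: bottom-up input meets a top-down target.
W = rng.normal(scale=0.1, size=(4, 3))   # feedforward weights
x = np.array([0.2, 0.5, 0.3])            # bottom-up input
target = np.array([0.1, 0.9, 0.4, 0.6])  # top-down expectation

lr = 0.5
for _ in range(200):
    y = W @ x                  # feedforward prediction
    err = target - y           # the error is available locally
    W += lr * np.outer(err, x) # local delta-rule update, no backprop pass

assert np.allclose(W @ x, target, atol=1e-6)
```

Nothing here had to be propagated backward through a stack; the "direction" of learning came from the feedback channel itself.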


Minus/plus - the two phases break up every alpha cycle.

This is the core of the 3VS prediction model.

It is a very different prediction method from HTM but very viable and well supported in biology. The slow response of the lower layer is well documented.

  • HTM does prediction from the last cycle to the current cycle using depression of firing potential and through that, winning the column election;
  • the minus/plus thing does it within the cycle between the layers and is used to provide a match between prediction and ground truth, every single cycle. The action is local to the cell.
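As a concrete (and heavily simplified) sketch of that second bullet - this is a generic contrastive-Hebbian-style update, my own stand-in for DeepLeabra's actual XCAL rule - the learning signal for a connection is just the difference of two coactivity products the cell can measure itself, one per phase:

```python
import numpy as np

def chl_update(w, pre_minus, post_minus, pre_plus, post_plus, lr=0.01):
    """Contrastive-Hebbian-style update: plus-phase coactivity minus
    minus-phase coactivity, both measurable locally at the synapse."""
    return w + lr * (np.outer(post_plus, pre_plus)
                     - np.outer(post_minus, pre_minus))

# One alpha cycle, made-up activity levels:
pre        = np.array([0.2, 0.8])  # sender activity (same in both phases here)
post_minus = np.array([0.5])       # receiver during the prediction (minus) phase
post_plus  = np.array([0.9])       # receiver once the outcome (plus) arrives

w = chl_update(np.zeros((1, 2)), pre, post_minus, pre, post_plus)
# Both weights grow, because the outcome drove the cell harder than
# its own prediction did - all computed locally, every single cycle.
```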

I see the two methods as complementary.


Do I need to go deeper and try understanding the DeepLeabra algorithm (O’Reilly 2015)? Or can I think of that as a black box?

I did, but you can get a lot out of the 3VS paper without it. I would go the BB (black-box) path unless it seems to be getting in the way of understanding.


Here is an open question I keep coming back to. @Bitking Maybe you can help.

This connection from L2/3 to L5…

What is this? I understand the 5IB FF connections to the thalamus, but this intracortical connection confuses me. My model has lateral connections between CCs in L2/3 to L2/3, and more in L5 to L5. It must be a link to the past, correct? The previous state of V2 Super, but how does it get to the deep layers?


Ok I think I understand how it could happen.

If you separate a CC up into superficial layers and deep layers, there seems to be only one pathway between them, between L4 and L6b. The path I questioned above could be from L4 to L6b, a reference to the previous input state.

Am I understanding this correctly?


@mthiboust - Would you like to take a try at Matt’s question?

Oh, the famous 3VS paper! :)

I was planning to return to this paper at some point. I was waiting for Randall's new paper, but it is not yet published on arXiv.

@rhyolight: not sure I understand your question. Are you asking about the existing connections between the upper and deep layers?

In addition to local and lateral projections within the same layer, each L2/3 pyramidal cell also projects to other, distant cortical areas. On the way to those distant areas, its axon makes numerous synapses in L5, underneath the soma.

Those connections from L2/3 to L5 cells can trigger an AP in the L5 cells only during a specific phase of the alpha oscillation, when inhibition is removed.


Thanks, you guys. Another question.

Is DeepLeabra a DL model that implements hypercolumns as localized DL blocks arranged topologically? Cause that is pretty cool.

The way I visualized this is with the two streams (feedforward and feedback) laid on top of each other. One end starts at the senses, the other in the hard-wired subcortical structures.

The information that is available locally is the comparison of the two streams - at that point.

So instead of using backprop to distribute errors to the right layer in the hierarchical stack, the information is automatically in the right place for local computation.
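A toy version of that point (my framing, not code from the paper): if both streams are present at every level of the stack, the training signal for each level can be computed in place, with no backward error pass at all.

```python
import numpy as np

rng = np.random.default_rng(1)

# Both streams exist at every level of the stack simultaneously.
levels = 3
feedforward = [rng.random(5) for _ in range(levels)]  # sense-driven stream
feedback    = [rng.random(5) for _ in range(levels)]  # expectation-driven stream

# Level i's error is computed right where level i lives:
local_error = [ff - fb for ff, fb in zip(feedforward, feedback)]
```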


Definitely going to present this at a research meeting before Jeff, but I still have a lot more understanding to do, because he is going to ask a lot of questions. I plan on putting together an outline of the main points, and a diagram or two of the architecture. I’ll post here for all your review. I am finally getting into the model details.


Much of the paper digs deep into the details of the implementation of these two streams and local interactions. You may want to also look at the paper I referenced in post 5 for more background outside of the 3VS paper. This is important to get past the hand waving and show that the system does work, and “it works just like this.” This is critical to make the work reproducible.

Another big chunk of the paper is working through how these streams train up following this feedforward/feedback interaction stream. As you look at the models you can see that the learning has an evolution, as the training at different levels acts to discipline the "next" stages. Some of this is the development of the "third" stream as the system starts to extract the "meaning" of the data.

Stepping back and looking at the big picture, you can see that the general principle of the feedforward and feedback path interaction could be applied to many different implementations and achieve the same benefits. This is what I was getting at in post 22 above.


It also seems to me that any lateral learning we can apply across cortical columns will only accelerate the learning of the system (TBT).


I totally agree with your lateral learning comment.

In light of the current focus from Numenta on “the location signal” that is communicated from higher levels in the feedback path, this takes on a slightly different, but complementary, focus.

Instead of the sharp distinction that some object is being learned and recognized, you get to a more general learning mechanism. With some tortured meaning of "object" you could say that it is objects all the way down.

Thinking of what is known about the early visual system I think that it makes more sense to say that the object-ness is distributed over several layers. The “object” is really the features of an object.

As I have been saying for some time - an object is really a cluster of features that are sensed at the same time. Some of these features are spatial relationships.


More specifically, “object” should encapsulate both internal and positional parameters: what and where.


I have some doubts about this paper, although I have not read the whole thing.

[W]e are now convinced that predictive learning must start with as much high-level abstract representation as possible, and focus on learning further such representations as quickly as possible thereafter, because central, compact, abstract representations of things like spatial motion and object properties are essential for successful predictive models. Without these coherent, central, higher-level representations, the lower-level predictions are doomed to mediocrity — they will learn a vague, muddled and incoherent predictive model, which does not then provide a good basis for developing higher level abstract representations at a later stage of learning.

High-level abstract representations are essential because they consolidate and concentrate learning within a centralized set of representations (e.g., about the nature and relationship of different features of an object for the What pathway). These central representations can much more easily maintain this essential information over time, to support consistent, stable predictions about how an object will appear in the next moment. By contrast, lower-level areas such as V1 or V2 are huge and strongly retinotopically organized, such that any given set of neurons only encodes a relatively small portion of the visual world (e.g., around 1 degree of visual angle). Therefore, the encoding of object properties and motion trajectories in such areas must inevitably be highly diffuse and disconnected, with entirely different populations of neurons representing an object at one moment to the next. Such representations provide a poor basis for accurate predictions, given the underlying stability of object properties, and their current motion trajectories, over time (at least over the 100 msec alpha timescale of relevance here). [My Emphasis]

This appears to contradict the Thousand Brains Theory!

This paper disregards Hebbian learning:

Although the biological data, and locality constraints, appear to favor some variant of a Hebbian learning mechanism, computational models consistently show that this form of learning is incapable of solving real-world problems, and that instead some form of error-driven learning is required.

However, unlike Hebbian-learning based self-organizing models, predictive learning can leverage the power of error backpropagation to drive learning in a deep hierarchy of areas, in a coordinated fashion, to produce much more powerful results.

Just because no one has solved a problem yet, does not mean that it is unsolvable. See Fermat’s Last Theorem. Furthermore, even if one theory is wrong, that does not mean that another theory is correct.
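For what it's worth, the contrast the quoted passage draws can be made concrete with toy versions of the two rules (standard textbook forms of my own choosing, not the paper's XCAL/GeneRec equations). Suppose the correct response to an input is silence: a pure Hebbian rule keeps strengthening the coactive weight anyway, while an error-driven rule shrinks it.

```python
import numpy as np

def hebbian(w, x, y, lr=0.1):
    # Strengthens any coactivity, whether or not the output is right.
    return w + lr * np.outer(y, x)

def error_driven(w, x, y, target, lr=0.1):
    # Changes weights only in the direction that reduces prediction error.
    return w + lr * np.outer(target - y, x)

x = np.array([1.0, 0.0])
target = np.array([0.0])            # the "correct" response is silence

w_hebb = np.array([[1.0, 0.0]])
w_err  = np.array([[1.0, 0.0]])
for _ in range(10):
    w_hebb = hebbian(w_hebb, x, w_hebb @ x)             # runs away from the target
    w_err  = error_driven(w_err, x, w_err @ x, target)  # decays toward the target
```

Whether that gap is fundamental or just unsolved so far is exactly the open question.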


Please read the whole thing; they do address the issues that you raise.

Totally agree. Hebbian learning works through positive reinforcement, backprop through inverted negative reinforcement, but both are feedback-first. That is an intrinsically utilitarian vs. cognitive drive. Primarily cognitive would be feedforward-first lateral learning, such as in retina and grid interactions.