Deep Predictive Learning: A Comprehensive Model of Three Visual Streams

jakebruce · November 19, 2017, 11:50pm

A very recent competing / complementary model of deep predictive coding in the brain.

O’Reilly, Randall C., Dean R. Wyatte, and John Rohrlich. "Deep Predictive Learning: A Comprehensive Model of Three Visual Streams."

Abstract:

How does the neocortex learn and develop the foundations of all our high-level cognitive abilities? We present a comprehensive framework spanning biological, computational, and cognitive levels, with a clear theoretical continuity between levels, providing a coherent answer directly supported by extensive data at each level. Learning is based on making predictions about what the senses will report at 100 msec (alpha frequency) intervals, and adapting synaptic weights to improve prediction accuracy. The pulvinar nucleus of the thalamus serves as a projection screen upon which predictions are generated, through deep-layer 6 corticothalamic inputs from multiple brain areas and levels of abstraction. The sparse driving inputs from layer 5 intrinsic bursting neurons provide the target signal, and the temporal difference between it and the prediction reverberates throughout the cortex, driving synaptic changes that approximate error backpropagation, using only local activation signals in equations derived directly from a detailed biophysical model. In vision, predictive learning requires a carefully-organized developmental progression and anatomical organization of three pathways (What, Where, and What * Where), according to two central principles: top-down input from compact, high-level, abstract representations is essential for accurate prediction of low-level sensory inputs; and the collective, low-level prediction error must be progressively and opportunistically partitioned to enable extraction of separable factors that drive the learning of further high-level abstractions. Our model self-organized systematic invariant object representations of 100 different objects from simple movies, accounts for a wide range of data, and makes many testable predictions.

The authors note similarity to an older incarnation of HTM:

Hawkins’ Model
The importance of predictive learning and temporal context are central to the theory advanced by Jeff Hawkins (Hawkins & Blakeslee, 2004). This theoretical framework has been implemented in various ways, and mapped onto the neocortex (George & Hawkins, 2009). In one incarnation, the model is similar to the Bayesian generative models described above, and many of the same issues apply (e.g., this model predicts explicit error coding neurons, among a variety of other response types). Another more recent incarnation diverges from the Bayesian framework, and adopts various heuristic mechanisms for constructing temporal context representations and performing inference and learning. We think our model provides a computationally more powerful mechanism for learning how to use temporal context information, and learning in general, based on error-driven learning mechanisms. At the biological level, the two frameworks appear to make a number of distinctive predictions that could be explicitly tested, although enumerating these is beyond the scope of this paper.

bkutt · November 20, 2017, 2:46am

I do wish they had provided reasoning for this claim:

We think our model provides a computationally more powerful mechanism for learning how to use temporal context information, and learning in general, based on error-driven learning mechanisms.

Bitking · December 8, 2017, 3:40pm

I am reading this in the context of my “dumb boss, smart advisor” model and I have to say - it’s sending shivers down my spine.

One of the parts that the authors point out as needing more work is the source of the high-level training patterns to generate the seed errors.

If you assume that the older lizard brain is going about its normal behavior in a mewling infant - looking, feeling, tasting, and living in general - and the cortex is getting this as the higher order input for the pattern to seed training - the explanations match up very nicely.

rhyolight · December 8, 2017, 4:06pm

Does the learning model in this paper include the HTM Neuron? Meaning is there a predictive state (dendritic spike modeling)?

Bitking · December 8, 2017, 8:01pm

Very similar, perhaps enough to use it directly. Since it is related to the phase between the prediction and the update inside a single wave, the order of evaluation would serve much the same function.
It also works with a scanning pattern that is similar to the biological model that convolving tries to emulate.
This may well be the first “killer app” that deep learning HTM nay-sayers need to see to be shown that the biologically based model is as capable as the applications that statistically based point neurons are typically used for. It learns in an “unsupervised” manner in a few hundred presentations, not epochs of thousands. And without the forms of back-prop that I think everyone can agree are somewhat of a crutch.

I posted something to Jeff recently that ties in with this:

Please note that this also includes some of the oscillatory/phase involvement we were touching on in a different thread:

In this you stated that your research is looking into phase related processing. This model has it in spades.

Lastly - they mention in passing that this same general system would be applicable to sensorimotor systems. When you get the overall scheme it does seem extensible.

How extensible?

I am struggling to see how one could combine the cortical-IO system with speech hearing & production using this general approach. It may take some time to work this out but I will be mulling to see if it could make sense.

Emotional coloring from the amygdala seems to be an important feature in what has to be stored in the word sequence-grammar & word store.

What goes on in the lizard brain grows ever more important to understanding the cortex.

Bitking · June 26, 2018, 6:58pm

This paper adds some powerful support to the proposed “three visual streams” model.

Now if I can find some papers supporting the proposed plus/minus-phase temporal learning mechanism …

Bitking · February 7, 2019, 4:38pm

Nobody ever said that understanding what the brain does is going to be simple or easy to explain.

I find it hard to imagine that modeling the entire visual hierarchy and all the sub-mechanisms at each level could be much simpler. This is an enormously complicated task and this paper combines years of prior papers into a masterwork showing how all the parts can work together as a system.

The emergent property of counterflowing streams of information interacting to guide memory formation can only emerge at this level of implementation. This is a highly desirable goal.

I am impressed by how much of the visual system they have captured and how well it does work. I am trying to see how to incorporate many of the features of this system in my own work. I think that combining that insight with the HTM SDRs and predictive behavior should work even better. It would be great to see the training time drop from hundreds of thousands of trials to a few dozen.

vpuente · February 7, 2019, 5:08pm

I guess we disagree on that. It should be easy. Otherwise, a minor change in the operating conditions will make it fall apart. Robustness and flexibility come from simplicity.

My best analogy is a modern processor: if you take it apart, it looks extremely complex with many billions of transistors. But the basic architecture (stored-program) can be explained in a couple of sentences. You need all those transistors to solve an “engineering” problem, mainly the memory and bandwidth walls. You will never be able to understand how the stored-program concept works from the latest IBM Power9 (with 8 B transistors). Numenta’s approach is much better: try to replicate how Power9 behaves by connecting sparse pieces of knowledge together.

Bitking · February 7, 2019, 5:36pm

If we are arguing just to score points then you win. The brain can be described in a child’s picture book.

I’m not sure if that level of description will allow you to build an entire visual learning system but - yes - there should always be a simple overview. As a teacher of technical topics I will agree with you that the Feynman principle should apply: “If you can’t explain something in simple terms, you don’t understand it”

I was assuming that forum members are the ones that are digging into the 8 B transistor level models to make things work and are still working out how to pull this off. At this level, the explanations can get complicated. As you well know, elsewhere on the forum they are hashing out how to do branch prediction, and while the basics of a CPU instruction execution unit are conceptually straightforward, the tweaks to make it work better are proving to be less so.

vpuente · February 7, 2019, 6:05pm

No … just to learn a bit more

I know something about that; I’ve worked on cache coherence… Although at first sight it is really complex, the basic principles, and why something will or won’t work, aren’t that hard. (Computer architecture is half based on intuition/experience, half on engineering.)

Bitking · February 7, 2019, 6:22pm

It’s been 40 years since I did bit-slice level CPU design; I made a nice little 8/16-bit CPU based on the 74170/74181/74182/74200 chips. Microcode used to be chip logic and not Verilog code. FPGAs were not really a thing yet. Things have gotten much more complicated since then.

That said - do you have a pointer to different papers that do deep predictive learning?
As far as I know - this is an emergent property of a “larger” multi-layered processing system.

I have been a long-time fan of the Global workspace theory, in particular the work of Stanislas Dehaene. This is also a large scale model of interconnected maps. There are also interactions between the counter-flowing streams that drive much of the interesting behavior in these systems. I see this emergent behavior as a recurring theme in larger systems.

My focus in all this is more oriented to the larger system level engineering and how the maps have to work together. The HTM model is surely part of this on one end of the size scale. The engineering scale at this level covers several orders of magnitude. You have to consider everything from individual synapses to large scale fiber tracts and population densities of fiber projections.

I could be way off base but I do think it’s time to take off the training wheels and put the HTM model in play as part of larger systems. I expect that this will change some of the current notions of what various parts are doing and refine the model.

In particular, I expect that the focus will shift from object representation to communication of object representations and deeper consideration of the distributed nature of those representations. I also expect that the coordination functions of cortical waves will take on more importance than they currently hold in the HTM canon.

vpuente · February 8, 2019, 11:51am

Nope… sorry but I don’t.

The “theater metaphor” is nice.

Indeed… we need to advance, but step-by-step.

mthiboust · September 18, 2019, 10:10am

The reading is not easy, but this “3 visual streams” paper combines lots of interesting ideas about L4/L5/L6/thalamus functions, temporal learning, error coding, alpha oscillation, top-down shortcuts increasing the learning rate, sequential development of the visual pathways (“where”, “where & what”, then “what”).

Combined with other readings, it helps me to conceptualize a different potential explanation of cortico-thalamic interactions (mainly based on an action-based interpretation of perception, and on cortical efferent copies to the thalamus). I’ll post about it once I have formalized the mess in my mind.

Concerning the paper, I would like to mention some key points that would deserve a discussion:

Intro

The thalamus has a double function: attention (not studied in the paper) and cortical learning (the main focus of the paper). The authors take the visual cortex to explain their theory, but it could be generalized to other areas as well.

Learning

There are two kinds of learning:

Self-organizing learning, which extracts statistical regularities (like classic auto-encoders). This is the kind of learning used to create our internal models of our body and the environment.
Error-driven learning, which leverages differences between expectations and outcomes. This is the kind of learning used to shape our “alpha-long” predictive abilities (the term is mine) and our longer-term reward system.

In predictive learning, the learning could be both self-organizing and error-driven.
At the beginning of development, cortical areas learn slowly and independently in this manner, until some “fast-learning” cortical areas can help the other ones by providing top-down inputs for more efficient error-driven learning (more on this later).

Learning over alpha oscillations

Let’s focus on the “alpha-long” predictive ability of the neocortex.
By “alpha-long”, I mean the very short-term predictions of what will be experienced in the next 100ms (alpha oscillation).

The alpha oscillation creates two timeframes:

A 75ms “minus-phase” (= 3 gamma ticks) where the deep layers are isolated from L4 inputs to allow the computation of a pure prediction in L5/L6
A 25ms “plus-phase” (= 1 gamma tick) where the current state of the environment and the ongoing internal mental state of the organism are shared with deep layers.
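To make the timing concrete for myself, here is a minimal Python sketch of the two timeframes. It only encodes the numbers quoted above (100 ms alpha cycle, 25 ms gamma ticks, 3 minus ticks followed by 1 plus tick); the function names are my own and do not come from the paper or from any Leabra code.

```python
ALPHA_MS = 100   # one alpha cycle, as described above
GAMMA_MS = 25    # one gamma tick, so 4 ticks per alpha cycle
MINUS_TICKS = 3  # the first 3 ticks (75 ms) form the prediction ("minus") phase


def phase_at(t_ms):
    """Return (alpha_cycle_index, gamma_tick_index, phase_name) for an integer time in ms."""
    cycle, offset = divmod(t_ms, ALPHA_MS)
    tick = offset // GAMMA_MS
    phase = "minus" if tick < MINUS_TICKS else "plus"
    return cycle, tick, phase


def l4_reaches_deep_layers(t_ms):
    """Hypothetical helper: deep layers see the current state only in the plus phase."""
    return phase_at(t_ms)[2] == "plus"


if __name__ == "__main__":
    # Walk through two alpha cycles, one gamma tick at a time.
    for t in range(0, 2 * ALPHA_MS, GAMMA_MS):
        print(t, phase_at(t), l4_reaches_deep_layers(t))
```

The hard switch between phases is a simplification on my side; in the biology the transition is presumably more gradual.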

Error-signal

The error-signal is then computed in high-order nuclei of the thalamus (pulvinar in this example for vision) which receive both:

The current state from L5IB cells (which relay the “ground truth” signal of the current moment coming from superficial layers L2/3 & L4)
The prediction from L6CT cells (which is the result of an interaction with local L5IB and L6CC cells, and long distance top-down inputs from L6CC).

Importantly, this error-signal is encoded by the thalamus as STDP-dedicated temporal differences so that the cortex knows which synaptic weights to increase or decrease. Spike Timing Dependent Plasticity (STDP) is a biologically-supported learning process allowing local backpropagation of errors to neighboring neurons.
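As a toy illustration of the idea (not the paper's equations), the sketch below treats the pulvinar error as the per-unit difference between the plus-phase outcome and the minus-phase prediction, and uses a plain delta rule as a stand-in for the biophysically derived learning rule. The linear prediction, the delta rule, and all names are my assumptions.

```python
def temporal_difference(prediction, outcome):
    """Per-unit difference: plus-phase outcome minus the minus-phase prediction."""
    return [o - p for p, o in zip(prediction, outcome)]


def predict(weights, sender):
    """Toy linear minus-phase prediction; weights[i][j] links sender i to unit j."""
    n_out = len(weights[0])
    return [sum(sender[i] * weights[i][j] for i in range(len(sender)))
            for j in range(n_out)]


def delta_update(weights, sender, prediction, outcome, lr=0.1):
    """Return new weights, changed using only sender activity and the local error."""
    err = temporal_difference(prediction, outcome)
    return [[w + lr * s * e for w, e in zip(row, err)]
            for row, s in zip(weights, sender)]


if __name__ == "__main__":
    sender = [1.0, 0.0]    # stand-in for the predicting (L6CT-like) activity
    outcome = [1.0, 0.0]   # stand-in for the driving (L5IB-like) ground truth
    w = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(50):    # 50 alpha cycles of predict-then-correct
        w = delta_update(w, sender, predict(w, sender), outcome)
    print(predict(w, sender))
```

Each update only needs the sender's activity and the difference between the two phases at the receiving unit, which is what I understand by "local" here.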

Getting around the credit assignment problem with development phases

However, we have to keep in mind that an error could have several causes originating either from the given area, from other areas, or from both!

If all cortical areas correct their synaptic weights in parallel, we could end up with a non-convergent system where connected areas iteratively adapt and unadapt their weights in response to each other’s changes. That would explain why we have critical periods in brain development.

The authors take the example of the different visual pathways:

First, the “where” pathway can learn pretty easily on its own
Then, it helps the development of a hypothesized “where & what” intermediate pathway thanks to top-down inputs from the stabilized “where” pathway
Finally, the “what” pathway can construct complex spatially-invariant object representations
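The progression above can be written down as a simple staged schedule. This is only my reading of the ordering, sketched in Python: the three status labels and the idea that later pathways simply wait are simplifying assumptions, and nothing here is taken from the authors' implementation.

```python
# Order in which the pathways are described as maturing.
PATHWAY_ORDER = ["where", "where_what", "what"]


def plasticity_schedule(stage):
    """Map each pathway to 'stable', 'learning', or 'waiting' for a stage index (0, 1, 2)."""
    schedule = {}
    for i, name in enumerate(PATHWAY_ORDER):
        if i < stage:
            schedule[name] = "stable"     # already matured, weights mostly settled
        elif i == stage:
            schedule[name] = "learning"   # currently doing most of the learning
        else:
            schedule[name] = "waiting"    # simplification: not yet learning quickly
    return schedule


def top_down_sources(stage):
    """Pathways that are already stable and can supply top-down input at this stage."""
    return PATHWAY_ORDER[:stage]


if __name__ == "__main__":
    for stage in range(len(PATHWAY_ORDER)):
        print(stage, plasticity_schedule(stage), top_down_sources(stage))
```

The point of the ordering is that only one pathway is changing quickly at a time, so the others give it a stable target instead of a moving one.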

Conclusion

Overall, their explanations and their computer implementation are full of smart ideas that look very promising. And their model would fit well on top of a L2/3 object representation model (for instance Bitking’s hexgrid or other).

My doubts

However, I still have some reservations on my side, mainly on these two points:

Given my other readings, I don’t buy their interpretation of L6CT cells. For now, I prefer to stick to Sherman & Guillery’s view of a modulation role of L6 cortico-thalamic cells. I may have a different view that could reconcile those two views, but I need to take more time to think about it.
It is not yet clear for me how the error-signal from higher areas is adapted to be understandable by lower areas, did someone get it?

If you want further reading about the paper:

Online book from the authors: https://grey.colorado.edu/CompCogNeuro/index.php/CCNBook/Main
Unofficial Python implementation of their Leabra algorithm (on which DeepLeabra from the paper is built): https://github.com/benureau/leabra

Still thinking about it!

Bitking · September 18, 2019, 12:27pm

I agree - the paper is the culmination of many other papers where they pulled many of the projects they have been working on into a coherent whole. Many papers talk about a single point in great depth - the 3VS paper has a bunch of ideas that are all inter-related. They demonstrate a model to test the ideas that is one of the most ambitious that I have seen - it encompasses most of the visual processing stream.

If I had to pick out the most important idea it would be the use of top down and bottom up streams as a mechanism to radically enhance learning. This fits well with my understanding of the relationship between the lizard brain and the cortex. The lizard brain has a small amount of hardwired learning that is fed into the system from one end, and the real world from the other end. The lizard brain contribution acts as a lighthouse to constrain the search space. I see this, combined with the ground truth from the senses, as a biologically plausible replacement for back propagation.

As @mthiboust pointed out, the model is based on the Leabra algorithm, but the ideas about structural organization should be generally usable with other predictive implementations. It describes what is necessary to incorporate the organizing role of the thalamus into the counter-flowing streams to gain the advantages normally thought to be the realm of back-propagation.

The model presented does take thousands of hours of computer time to train. I think that this is due to the use of the Leabra neurons; the training is using essentially straight Hebbian learning rules. As I see it - if it had been built using HTM neurons it would have trained up much faster.

One of the ideas I keep coming back to is the possibility of two predictive mechanisms at work at the same time - the Deep Leabra in the feedback direction in L6 and HTM in the feedforward direction in L5.

Replacing the simple “top level” drive with a simple lizard brain would be a logical extension.

mthiboust · September 18, 2019, 3:37pm

When you say feedback vs feedforward, do you mean error-driven vs self-organizing learning as mentioned in my post ?

Because I think that the Deep Leabra is doing both at the same time.

Bitking · September 18, 2019, 3:46pm

I strongly agree that Deep Leabra does learning in both directions.

I am saying that the proposed Deep Leabra model may work much better with the HTM model substituted in the relevant part of the model.

And of course, as you mentioned, the hex-grid organization dropped into place where it belongs.

The brain has many learning systems.

The Deep Leabra model is essentially based on Hebbian spike timing learning. This is a very slow method; the one-shot feature of HTM is a huge improvement.

Likewise, the attention/activation gating from the RAC that I have mentioned several times before would serve to speed up the learning in other ways. An example:

Lastly - one more concept to work in as you think about what is going on with the 3VS paper: the concept of the global workspace. If the stream from the lizard brain could be thought of as a need state then a match to the sensed stream should be considered as an important event.

Peter_Rigole · September 18, 2019, 7:50pm

I think feedforward means from lower brain levels to higher levels and feedback the other way around, which is meant to speed up learning in the different brain areas (e.g. a higher-level cue via language input can enhance lower-level features such as recognizing a figure in a cluttered image). In Leabra, error-driven learning is indeed balanced with Hebbian learning and both happen at the same time.
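To show what "both happen at the same time" could look like for a single synapse, here is a minimal Python sketch that mixes a Hebbian term with an error-driven term. The specific forms (a weight moving toward sender activity, and a contrastive plus/minus difference) and the parameter names `lr` and `k_hebb` are assumptions for illustration, not the exact DeepLeabra equations.

```python
def hebbian_dw(x, y, w):
    """Hebbian term: pull the weight toward sender activity x, scaled by receiver activity y."""
    return y * (x - w)


def error_dw(x_minus, y_minus, x_plus, y_plus):
    """Error-driven term: plus-phase co-activity minus minus-phase co-activity."""
    return x_plus * y_plus - x_minus * y_minus


def mixed_dw(x_minus, y_minus, x_plus, y_plus, w, lr=0.01, k_hebb=0.01):
    """Weighted mix of both terms; k_hebb sets the share given to Hebbian learning."""
    hebb = hebbian_dw(x_plus, y_plus, w)
    err = error_dw(x_minus, y_minus, x_plus, y_plus)
    return lr * (k_hebb * hebb + (1.0 - k_hebb) * err)


if __name__ == "__main__":
    # Prediction (minus phase) undershoots the outcome (plus phase), so the weight grows.
    print(mixed_dw(x_minus=0.5, y_minus=0.5, x_plus=1.0, y_plus=1.0, w=0.25))
```

When the prediction already matches the outcome, the error term vanishes and only the small Hebbian share keeps shaping the weight.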

mthiboust · October 28, 2019, 8:48pm

In a private discussion, Randall O’Reilly (the first author of the 3VS paper) responded in detail to some of the doubts I expressed in a previous post in this thread:

I guess that some of you will be interested (he allowed me to post his explanations here).

In brief:

His team will soon post a new preprint on arXiv about a version that works directly on 3D object images
Awake vs anesthetized states matter a lot when interpreting experimental results
Transmission of the error-signal to higher areas only

Randall O'Reilly:

We have just submitted a much shorter and hopefully more comprehensible version that works directly on 3D object images, and compared the model against current DCNN backprop nets, and also human data. We’ll be posting a copy of that on arxiv soon once we hear about its fate…

Regarding your doubts:

I think the most convincing data about Pulvinar that contradicts the S&G view is actual electrophysiological recordings from awake behaving monkeys, e.g., “Pulvinar nuclei of the behaving rhesus monkey: visual responses and their modulation” and “Functional identification of a pulvinar path from superior colliculus to cortical area MT” (both on PubMed).

The key point being that these papers consistently show sustained, stimulus-evoked responses that overall look very similar to those from corresponding visual areas. Given that we know that the 5IB driver inputs should only be bursting briefly every 100msec, the only way this is possible is if the 6CT neurons are providing an above-threshold input driving this firing, consistent with the idea that they are capable of driving a top-down prediction, and not just “modulating”. I’m pretty sure most / all of S&G’s work was done in anesthetized animals or slices, and in almost every case that I know of, neurons behave very differently in such states compared to “normal” waking, particularly with respect to things like “up / down” states and bursting etc.

"It is not yet clear for me how the error-signal from higher areas is adapted to be understandable by lower areas, did someone get it?”

There are a few things to say about this. First, and most importantly, the entire model is fundamentally one big recurrent backprop network, from a computational perspective, with error signals reverberating around as temporal differences in all directions. To the extent that these temporal differences “resonate” over time with other such signals, at a given layer, then the resulting synaptic weight changes will accumulate, and so it goes… So it does have a bit of the “black box” “it just works” flavor of backprop networks, but on the other hand, that does seem to work and nothing in the model violates any principles of biology, so… maybe that’s how it actually works!?

The other thing is that the direct prediction error signals coming from the pulvinar layers obey a critical constraint: they only receive from and project to higher layers in the network. So, in contrast to your question, it is only lower areas that are directly sending error signals to higher areas, not the reverse (i.e., higher pulvinar areas do not directly project back to lower areas). Thus, the main question is how higher layers directly use such low-level error signals: e.g., V1-level pulvinar is directly shaping IT-level representations. We think this is important for keeping the “toes to the fire” so the higher areas don’t just drift off and start free-associating… And the above-mentioned backpropagation happening all over the place allows plenty of opportunity for learning more abstract representations via indirect backpropagated signals; the main superficial layer network is really very much like a recurrent backprop vision model (e.g., https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00124/full)

Also, regarding the computational complexity / compute time of the model: it is actually very efficient given how many recurrent connections are being simulated. The “point neurons” don’t take much to compute, and we optimize a lot to take advantage of sparse connections. Also, we have a new simulation framework that is usable from Python and is written in Go: emer/emergent on GitHub (the new version of the emergent neural network simulation software). There is an as-yet-not-well-documented example of the basic “deep leabra” learning here: https://github.com/emer/leabra/tree/master/examples/deep_fsa, which should be much easier to understand than the full visual model.

I hope it will help to clarify some points and to foster ideas.
On my side, I need a couple of days to fully digest this answer before commenting on it.

gmirey · November 6, 2019, 2:01pm

I was under the impression that the diversity of the inhibitory cells could be sufficient, in itself, to explain theoretically (short of a proof) any of those distinctive frequencies… (combined influences of networks of resonators vs integrators, and so on). I know you studied some of the wavy stuff… do you have an opinion on this?

mthiboust · November 6, 2019, 2:18pm

I was also surprised by his explanation that doesn’t involve inhibitory cells. But nope, I don’t really have an opinion on this currently.

In fact, I am still trying to fully understand his answer.
