Preliminary details about new theory work on sensory-motor inference



There are several papers that discuss the What and Where pathways. The distinction was first observed in vision. Region MT is the first region in the Where pathway of vision whereas V1 is the first What region. The What/Where distinction was later found to exist in other modalities. I don’t recall the names of the regions but a quick search in Google Scholar will likely turn up papers that detail them. We recently read a paper (at Numenta) about the evidence for Where paths in audition and touch. I don’t recall the name or authors; maybe someone else can provide a link?

Regarding hierarchy, there is a lot that we don’t understand; please keep that in mind.

  • Going up the hierarchy the representations are driven by broader areas of input. That is generally accepted.
  • A cell or column can only process the information from the part of the input it receives. This makes sense, and most scientists accept the idea that a small area of V1 just processes its limited input and passes the processed version up the hierarchy to be “integrated” by a higher region.
  • I used to think that was the end of the story, but I now believe it is more interesting than that. Although a small area of V1 can only receive input from a small area of the retina, that small area of V1 can actually learn and model entire visual objects. Each small area, a CC, uses its model to predict what it will see based on knowledge of where on the object it is fixated. If a CC could talk it might say “I know what a pen is. I can only see a small part of a pen at a time, but as long as I know where on the pen I am currently fixated (location) I can predict what I should be sensing.”
  • Most scientists think that a CC in V1 processes its current input and passes it up. What we are saying is that a CC in V1 integrates inputs over time to understand objects that are much larger than it can sense at any moment.
  • There are limits to what any CC can learn (both memory capacity and learning time). So it is likely that a CC in V1 might only be able to understand objects and structures that span a subset of the entire visual field, but it would still be modeling structure much bigger than expected based on the corresponding part of the retina.
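To make the "integrates inputs over time" idea concrete, here is a minimal sketch in Python. It is not Numenta's implementation. It assumes a column's learned models can be caricatured as lookup tables from a location on the object to the feature sensed there; the object names, coordinates, and the `infer` and `predict` helpers are all made up for illustration.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Location = Tuple[int, int]        # hypothetical object-centered coordinate
Sensation = Tuple[Location, str]  # (location on object, feature sensed there)

# Toy stand-in for what one column has learned: object -> {location: feature}.
MODELS: Dict[str, Dict[Location, str]] = {
    "pen":    {(0, 0): "tip", (0, 1): "barrel", (0, 2): "clip"},
    "pencil": {(0, 0): "tip", (0, 1): "barrel", (0, 2): "eraser"},
    "cup":    {(0, 0): "rim", (0, 1): "wall", (1, 1): "handle"},
}


def infer(sensations: Iterable[Sensation]) -> Set[str]:
    """Narrow the set of candidate objects one sensation at a time."""
    candidates = set(MODELS)
    for location, feature in sensations:
        # Keep only objects whose model expects this feature at this location.
        candidates = {
            name for name in candidates
            if MODELS[name].get(location) == feature
        }
    return candidates


def predict(obj: str, location: Location) -> Optional[str]:
    """What the column expects to sense at a location on a known object."""
    return MODELS[obj].get(location)


# A single small sensation is ambiguous; a second one resolves it.
print(infer([((0, 0), "tip")]))
print(infer([((0, 0), "tip"), ((0, 2), "clip")]))
print(predict("pen", (0, 1)))
```

The point of the sketch is only that no single sensation covers the whole object, yet a sequence of (location, feature) pairs is enough to settle on one object and to predict the next input.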


Of course, one need not be an Einstein to realize that, in perception, at least, movement is relative. Let’s suppose that your cup is balanced on a model train-car at a Silly-Club banquet; whether you move or the train moves, relative to the room, your purely visual HTM learns from movement.

Ray Kurzweil may have missed one of the model boats when he wrote that the ‘Temporal’ in HTM reflects an over-emphasis on movement as a source of sensory evidence revealing the invariant structure of our visual worlds. In How to Create a Mind he points out that in George and Hawkins’s implementation of HTM, simulated scanning eye-movements are required to handle static forms like letter shapes. [Why data fire-hoses are NuPIC-friendly].

However, we should not let success in the sensory-motor layers fool us into ignoring a wealth of natural constraints that an HTM working from an immobile monocular eye can use. In The Senses Considered as Perceptual Systems, J.J. Gibson explores a host of sources that would give ample evidence that the cup and the train are not patterns of color glued to the background. And if our monocular eye is moving, all the visible objects stand out from the background.

Binocular eyes get still more evidence, even without movement. But that raises the issue of perceptual fusion. We humans, and presumably real intelligences made of silicon or quanta, don’t normally hear sounds or see patterns of light: We perceive objects and events, and spaces – using all our senses along with memory and other constraints. Even meaningless sounds and patterns of light are, by the time we are aware of them, already fused images. Direct awareness of raw sensory input, unfused, is difficult or impossible. We can’t hear both auditory signals as separate sounds. We hear one sound coming from a particular location. If we are listening through earphones, that location may be in the middle of our head!

Perceptual fusion is also the norm in multi-modality perception. The sound of a drum being played by a visible drummer comes in both our ears – but the event of the sound being created comes also through our eyes, if we have a functional visual system. If the drum is on a big movie screen and the sound comes from a loudspeaker below the screen, the perceived source of the sound is displaced away from the location of the speaker toward the location of the image on the screen.

I suspect that a fleshed-out HTM model will reveal perceptual fusion to be a non-problem. Any HTM capable of multimodal perception will have perceptual fusion dropping into the out basket as an unintended but welcome side effect.


This is a very fascinating view of how hierarchies in the cortex may work. Once more, Jeff, thanks for your very structured and visual descriptions of the generally accepted facts and the new insights you are drawing and postulating from recent research. I will admit that I am one of your followers obsessed with the potential residing in the hierarchical paradigm of the cortex. And I am also very conscious that we first must complete our homework understanding the nature of all the layers in a column within a region. But we seem to have a subset of functions that must involve the interactions of both scopes (the intra-column layers and the inter-regional-hierarchy). The challenge seems to reside in recognizing these interactions from both scales (scopes/dimensions) of activity and how they complement each other to produce the intelligent cognitive and predictive behavior of humans/mammals.

I am really delighted with your explanation of your newer views about how the hierarchy (HTM large scale) might be working, with a small area, CC, learning and modelling entire visual objects. This is basically like having a specialization of certain small patches (segments) within the same hierarchical level taking place, by exploiting the time series memory. It makes wonderful sense to me, because this also creates much more parallelization of the processing. It also adds substantial complexity to the localization of where within the neocortex specific objects are being processed. It is a brilliant insight that would explain a lot of things if it holds up to evidence. If we assume that this is indeed the way the hierarchical regions operate, then it will also be crucial to understand how these subregions get allocated to specific objects. In other words, how the self-organization and optimization of these regions takes place and is controlled. Another question I would have is whether the size of such regions is fixed or variable and can expand over time, to include more objects. There must be some master process, which is universal and distributed, that “coordinates” such partitioning and allocation. As you have stated numerous times, the plasticity of our brain is one of its greatest properties. Thanks a lot for this discussion. It certainly inspires me very profoundly.



Thanks Joe.
A quick answer on your last question. The general belief is that the areal size of the neocortex does not change over your life. You don’t grow new columns, the hierarchy is fixed, and the size of regions doesn’t change substantially. Neurogenesis is known to occur in some parts of the brain such as the hippocampus, but as far as I know it hasn’t been shown in most of neocortex.

However, how the fixed resources of the neocortex are used does vary over your life. For example, it has been shown that the body map that exists in S1 is easily modified. If you use a hand tool often then the body map adjusts to incorporate a representation of the tool! Also, if part of the cortex loses its input or part of the cortex is damaged, then the cortex adjusts to the new reality. In our work we assume each region has a fixed size, that the size of each region is not a critical variable, and that each region continuously learns and therefore the representations formed are based on experience.


Very interesting … Would this mean that the models of objects that form in every CC will be more like “part of a whole object”, rather than “full object” models?
Why do I say that? Because the CC alone cannot form representations of the many objects it comes in contact with; it will overgeneralize so much that it will become useless (limited memory).
A “part of object” analogy would also allow the CC to share representations across similar objects (e.g. parts of different pens, even including similar objects like pencils).


Thanks for your last reply about the fixed areal size of the regions in the neocortex, Jeff. You also mention that after damage (or loss of input) to a given region of the neocortex, the cortex adjusts to this new reality. I assume this comment also includes the cases in which trauma causing a loss of sight, for example, has shown that the visual cortex is capable of gradually reassigning itself to auditory and speech functions. I also remember watching a film back in my high school days (1980s) in biology class about an experiment in which a subject wore a set of goggles with lenses that produced a perfectly inverted image of the world, uninterruptedly, for several days. For the first 24 hours the subject was not able to walk, or drive or even use silverware at a dinner table, without assistance. However, after 36 hours had passed, the subject started to see his surroundings through the same goggles normally. He was soon able to drive a car and even fly a small airplane while still wearing the goggles. Removing the goggles then initiated the same process of re-adjustment all over again. Both of these examples seem to demonstrate that our neocortex must have a “master” allocation function that is always active and exploits the high plasticity of our neocortex to optimize its usage and re-adjust to our new realities, including the constant use of new tools or prostheses.

I understand that we do not grow new cortical columns over our adult lives and that the size of our neocortex does not increase over our adult lifespan. However, my intuition tells me that while all cortical regions have universal properties at all levels of the hierarchy, the representations (SDRs) that emerge at much higher levels of the hierarchy are for very different semantic objects and concepts than SDRs at the primary levels. In other words, all cortical regions at all levels of the hierarchy in the neocortex have universal functions. But the semantic content being represented in those regions differs very greatly, depending on the level in the hierarchy. And considering that we also know that the synapses of distal dendrites have varying degrees of permanence, then it does not seem all that unlikely that every time we learn something new from our new experiences, we may be adding entire new levels to certain hierarchical regions, perhaps by reassigning (rededicating) certain less used regions. Do you think these ideas are plausible?



I did a rough back-of-envelope calculation a few weeks ago. It suggested that a CC L4 could learn 100,000 Location/Feature pairs. This is limited but not teeny. It could, for example, represent 1,000 objects with knowledge of 100 locations on each object. In a test system we should be able to show the learning of complete objects in a single CC; in a biological system it would almost certainly require some sort of feature decomposition in a hierarchy.
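The trade-off in that estimate is just a division, but a quick sketch shows how fast the object count falls as the per-object detail grows. The 100,000 figure is the rough estimate from the post; the function name and the other sample values are illustrative assumptions only.

```python
PAIR_CAPACITY = 100_000  # rough Location/Feature pair estimate from the post


def objects_supported(pair_capacity: int, locations_per_object: int) -> int:
    """How many objects fit if each one uses a fixed number of pairs."""
    return pair_capacity // locations_per_object


# Hypothetical levels of detail per object (only 100 comes from the post).
for locations in (10, 100, 1_000):
    count = objects_supported(PAIR_CAPACITY, locations)
    print(f"{locations} locations per object: {count} objects")
```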

You can get these goggles; we borrowed a pair for a few days at the Redwood Neuroscience Institute. You started noticing improvements in your performance within minutes. I have always assumed that your ability to learn the new reality was just a function of continuous learning, nothing special.

[quote=“BrainConstellation, post:46, topic:697”]
my intuition tells me that while all cortical regions have universal properties at all levels of the hierarchy, the representations (SDRs) that emerge at much higher levels of the hierarchy are for very different semantic objects and concepts than SDRs at the primary levels. In other words, all cortical regions at all levels of the hierarchy in the neocortex have universal functions. But the semantic content being represented in those regions differs very greatly, depending on the level in the hierarchy. And considering that we also know that the synapses of distal dendrites have varying degrees of permanence, then it does not seem all that unlikely that every time we learn something new from our new experiences, we may be adding entire new levels to certain hierarchical regions, perhaps by reassigning (rededicating) certain less used regions. Do you think these ideas are plausible?
[/quote]

I also assume that the SDRs at different levels of the hierarchy represent different things. And neurons are always forming new synapses and deleting others. But the empirical evidence strongly says that the arrangement of regions in the hierarchy is not changing. All regions are being used all the time, but what the regions represent surely changes with every experience. The one exception I know of: I once read a paper about people born blind, with no input to the cortex from the eyes. What would normally be V1, V2, and V4 learned to represent touch and audio. However, the surprising thing was that the hierarchical connections between these regions became reversed such that V1 was hierarchically higher than V2, etc. My guess is that this sort of wholesale rewiring can only occur during critical development periods right after birth and not in an adult. There are examples of adults whose lifelong blindness was reversed via surgery. They never learn to see properly. A fascinating book on this is called “Crashing Through” by Robert Kurson.


Hi Jeff,
Thank you very dearly for your very helpful response. In order to understand the neocortex it is indeed very important to understand the empirical evidence on which structures have higher permanence or even life-long permanence. Based on your response, it seems that the large-scale hierarchies are all formed in very early childhood, some perhaps already during embryonic development. This would also help explain why learning a second or third language never quite reaches the level of performance of a native language learned in the first 7 years of life. I myself speak English and Spanish as native languages, and I learned German in college. I use German the most on a daily basis, because I live and work in Germany. While my abilities and fluency in German are almost as good as native speakers, there are always little, hard-wired quirks, sometimes just in a given transition of phonemes or in the usage of a particular noun. Interestingly, if I focus all my attention on the one flaw, I can train myself to fully eliminate it. We humans are great imitators, and since I also have two native languages as a reference base, that enriches my pool of phonemes and grammars. I am fascinated by the plasticity (formability) we are endowed with, and yet the fixed structures we also have. Thank you very much for recommending the book “Crashing Through” by Robert Kurson, I will make sure to read this. Regarding my language use, I find it interesting that focussed attention is enough to overcome any single flaw. (Getting them all right is another thing. Life is just not long enough, or the motivation also has its limits. :slight_smile: With my current German, it is interesting that Germans ask me whether I am from Holland. So I take this as a great compliment, since Dutch is very close to German).
I am also very positively surprised to learn that your researchers at Redwood Institute also tested those inverse-sight goggles. That was one experiment that never left my mind since High School, and I am now 50. On a more personal level, I wanted to share an interesting anecdote. Back in the 1970’s and 1980’s I attended American Schools in Spain, which belonged to the Department of Defense (DODDS). My high school was inside a US Air Force base (Torrejon, near Madrid) and my Mother taught elementary school on the same base. One day, while waiting to get a ride with her after sports, I went into the Newspaper and Magazine store called “Stars and Stripes” and saw the issue of Scientific American with the special on the brain and neuroscience. I dug in my pocket for all the cents and dollars I had and bought it, which was a rare exception since I was subscribed to Discover Magazine. This issue, which you mentioned in several talks, also inspired me and was probably what turned this subject into a lifelong pursuit. Even when I studied Computer Science at the University of Connecticut, I had a special meeting with the Dean of Engineering and discussed getting an individualized major in AI, because in CS they only had expert systems approaches and I insisted on a more biological and linguistic approach. But she succeeded in scaring me away from that path, with threats of having to get only straight A’s and never stopping till I reach a PhD. I regret every day that I did not take that path. Now in Germany, after working for an automotive giant in normal IT for years, I have been offered a chance to join a new lab for research in machine learning. I have opened a discussion on the paths toward machine intelligence and am advocating the HTM concept, while educating on the pitfalls of the other mathematical approaches. My transition to the lab has not yet taken place. It is still pending some approvals because I need someone to replace me.
I dearly hope that I will get a chance to dedicate my work toward this goal. Joe


I believe that this thread has primarily concentrated on sensory input and the interpretation of that input into the neocortex but little on how to “transform” that into an action. I find through my own work (Connectomic AI) that sensory input to cortex is a much simpler problem than motor output. When Numenta discussed transforms in the Office Meeting video, it took me to the same linear algebra I use in programming quadcopters. This could be very useful but from my work, the problem of managing cortex output to a motor action is not trivial. In brief, and for what it’s worth, I have found using nervous system emulations and robotics that there are a number of things that need to be considered. To name a few:

  1. There is a right/left/dorsal/ventral component in sensory input and motor output that is built into animal nervous systems. There are “rules” that seem to govern the interconnections between these regional aspects. The symmetrical regions play a significant role in how we process our world and how we react.
  2. Just as sensory input is a reduction (limited number of peripheral/sensory nerves) that expands into the neocortex (a much more vast number), the neocortex reduces into motor neuron/muscle output. There are billions of neurons in our brain but only 640 skeletal muscles.
  3. There is a temporal output (muscle) sequence for given types of input. In the connectome, this is pretty much wired so the network temporal dynamics can determine how we move our muscles. Deciphering this from a connectomic point of view is a true nightmare. There are several muscle movements that are the same at some point in a given action but we use those same muscle movements for different results. As an example, picking up a cup of coffee to take a sip could be similar to the same movement to pick up a marker to draw on the board.

For my neurorobotics/neuroapplications work, I have found that there are patterns of motor neuron to muscle output that I can capture in order to “decipher” and control the actions of what I want to accomplish. Even with a small set of muscles, I see distinct temporal patterns given specific sensory input (usually sensory input is in pairs, left/right or like SPL0001/SPR0001). I capture the resulting temporal sequences and then use those patterns to determine motor control. For example, one can get a pattern of MUSL001, MUSL002, MUSR001, MUSL002, … that has a completely different meaning than MUSL001, MUSL002, MUSL003, MUSL002, …; i.e. just subtle sequence differences can be the difference between lifting a cup of coffee or lifting a pen to write on the board. This is a tedious process but allows me to control a robot or app by keeping “built-in” control features in place and be able to override those features to obtain primary goals. As an example, I can put a prospecting robot in the desert to wander around with a metal detector attached to it and the connectomic structure can allow the robot to roam without getting stuck or falling off a cliff but I can override the roaming behavior when the detector finds a precious metal and I want to mark the GPS coordinates.
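As a rough illustration of how captured temporal patterns could be used to tell actions apart, here is a minimal Python sketch. It is not the poster's actual system. The muscle labels follow the naming in the post, but which sequence maps to which action, and the `classify` and `candidates_for_prefix` helpers, are assumptions made up for this example.

```python
from typing import Dict, Optional, Sequence, Set, Tuple

# Hypothetical captured patterns: muscle activation sequence -> action label.
# The assignment of each sequence to an action is invented for illustration.
ACTIONS: Dict[Tuple[str, ...], str] = {
    ("MUSL001", "MUSL002", "MUSR001", "MUSL002"): "lift_cup",
    ("MUSL001", "MUSL002", "MUSL003", "MUSL002"): "lift_pen",
}


def classify(sequence: Sequence[str]) -> Optional[str]:
    """Return the action for a complete sequence, or None if unknown."""
    return ACTIONS.get(tuple(sequence))


def candidates_for_prefix(prefix: Sequence[str]) -> Set[str]:
    """Actions still consistent with a partially observed sequence."""
    observed = tuple(prefix)
    return {
        action for pattern, action in ACTIONS.items()
        if pattern[:len(observed)] == observed
    }


# The two actions share their first two steps and diverge at the third.
print(candidates_for_prefix(["MUSL001", "MUSL002"]))
print(candidates_for_prefix(["MUSL001", "MUSL002", "MUSR001"]))
print(classify(["MUSL001", "MUSL002", "MUSL003", "MUSL002"]))
```

An exact-match table like this is the simplest possible decoder; real recordings would presumably need tolerance for timing jitter and partial matches.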

I feel that using the cortex alone as a means to determine output will not yield anything very interesting and you need other structures in place to create a true sensory input to motor output system; i.e., you have encoders and now you need decoders, but unlike sensory input, you either need one big decoder to handle all actions or a series of decoders that can handle simultaneous actions and, most likely, a hierarchical array of decoders. I am following your efforts with great interest because it is a real problem and one I have been struggling with for the last few months. Although I have found a solution, I’m not that pleased with it from a biological POV.




I have a few questions:

  1. Could the “what” pathway include information about movement relative to the sensors, or should frame of reference have no effect on the representation?
  2. What sort of problems were there before the new ideas? I’m wondering if the conversion to object space needs to be before SP, TM, or TP.
  3. If I’m correct, LGN core-type cells only project to V1. Does that mean matrix cells are involved in the “where” pathway, V1 is responsible for both pathways, one pathway is higher in the hierarchy, or something else?




I just wanted to add the comment, that I have now just finished reading “Crashing Through” from Robert Kurson. Thank you very much for the recommendation, the book is very informative as well as emotional. Kurson does a great job explaining this fascinating case study of long-term vision deprivation as well as delivering a very touching real-life account of a very exceptional person. I enjoyed it both for its science as well as for the very emotional biographical account.


Regarding neurobiological understanding of vision, I am currently reading a paper called: The Visual Neuroscience of Robotic Grasping, by E. Chinellato and A.P. del Pobil, Cognitive Systems Monographs 28, DOI 10.1007/978-3-319-20303-4_2, Springer International Publishing, Switzerland, © 2016. This is very highly recommended reading for NuPIC collaborators, especially Chapter 2: The Neuroscience of Action and Perception. In this paper you get all the theoretical background on the two main visual information pathways, the ventral stream and the dorsal stream. It turns out that Ungerleider and Mishkin (1982) and Goodale and Milner (1992) detailed the structures and proposed labelling the ventral stream the “what” stream and the dorsal stream the “where/how” visual pathway. This corresponds very closely to this latest discussion thread. Check this great document out if you are into the neuroscience. Joe


@BrainConstellation Thank you for that pointer - it does appear to be relevant to our current work. Looks like it was a PhD thesis and the PDF is available on the web. The problem of grasping objects reliably given visual cues requires a number of reference frame transformations that are analogous to the ones Jeff has discussed. The thesis contains a nice review of relevant neuroscience literature.


Hi Subutai, Thanks for your feedback. I can indeed confirm that the document with the PhD work you linked above does contain the exact same material as the publication I had referenced above. It is very interesting in-depth reading, in my opinion. Not being a neuroscientist, this reading takes me much longer to grasp (no pun intended). I would be very, very interested to read your opinion as to how the newest HTM research fits in and possibly complements this publication with its collection of sources on sensory-motor research, like Rizzolatti and Luppino. Here is a direct link: It is my impression that, while this research thesis does provide a very comprehensive coverage of the regional specialization in the cortex, mostly based on fMRI and TAM explorations, it does not go as deep into the columnar activity patterns in focus in HTM research. But it is my impression that it may provide some good tips on selecting specific cortical targets, allowing HTM researchers to focus more precisely on key regions for specific sensory-motor transformations. I certainly understand that this type of analysis would take a good deal of time. I would be very interested if, some day, we can read some comments on this. Joe
PS: Interesting that the PhD student is from Castellón, Spain. I am now living in Germany, but I have lived in Spain in the early years of my life and speak the language as a native speaker as well.


Just one question: Would you reject (or discard) the two streams hypothesis (Goodale & Milner, 1992), postulating that visual information in the brain is processed along two parallel pathways, based on your current empirical evidence? Just curious to know what your stand on this is. Thanks in advance.



@BrainConstellation Thanks for that reference. I do think it is pretty relevant. Another, more recent paper I read by Rizzolatti et al is [1]. Graziano’s work [2] is also very appropriate. One of the properties these studies show is that reference frame transformations must be going on somewhere between levels of the motor hierarchy.

For example, cells in M1/F1 code for movements that are very local, such as motion for a particular finger joint in a particular direction. But in area F4, cell firing is related to reaching a target location in global body coordinates. For example, regardless of where your hand happens to be, cell firing in F4 might correspond to moving your hand to the front of your mouth. F4 might be specific to a body part, such as your right hand. Cells in F5, however, can be independent of body part and might correspond to either hand going up to your mouth, or even lowering your head until your mouth is in front of your hand. In order to actually reach a goal such as putting food in your mouth, you have to be able to transform high level intentions in body centered reference frames down to individual joint movements. If you want to understand a movement, you have to go the other way. This is all very consistent with our recent work where a cortical region represents multiple coordinate frames, and must contain mechanisms for transforming between such coordinate frames.
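A toy Python sketch of the first step in that chain: the same body-centered goal ("hand to mouth") turns into a different movement depending on where the hand currently is. The 2D coordinates, the `MOUTH` position, and the `movement_to_goal` helper are purely illustrative assumptions, not a model of F4 or M1.

```python
from typing import Tuple

Vec = Tuple[float, float]

# Hypothetical body-centered coordinates (arbitrary units).
MOUTH: Vec = (0.0, 1.5)


def movement_to_goal(hand: Vec, goal: Vec = MOUTH) -> Vec:
    """Displacement that takes the hand from its current position to the goal."""
    return (goal[0] - hand[0], goal[1] - hand[1])


# One goal, two starting positions, two different movements.
for hand in [(0.5, 1.0), (-0.25, 0.5)]:
    dx, dy = movement_to_goal(hand)
    arrived = (hand[0] + dx, hand[1] + dy)
    print(f"hand at {hand}: move by ({dx}, {dy}) -> arrives at {arrived}")
```

Getting from this displacement down to individual joint commands would need further transforms (inverse kinematics, in robotics terms), which the sketch leaves out.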

The general work on what/where (ventral/dorsal) pathways is also very relevant. As I see it, the what pathway represents objects and object centered information, whereas the where pathway represents ego-motion and body centered information. Separating out the two solves a bunch of scaling problems (see Jeff’s video at the top of this thread) but it also requires that you have mechanisms for transforming back and forth between the various frames. The latter is a difficult problem that we are currently working on.


[1] G. Rizzolatti, L. Cattaneo, M. Fabbri-Destro, S. Rozzi, Cortical mechanisms underlying the organization of goal-directed actions and mirror neuron-based action understanding, Physiol. Rev. 94 (2014) 655–706. doi:10.1152/physrev.00009.2013.

[2] M.S.A. Graziano, Ethological Action Maps: A Paradigm Shift for the Motor Cortex, Trends Cogn. Sci. (2015). doi:10.1016/j.tics.2015.10.008.




Thank you so much for that explanation! It makes much more sense, after reading your comments on how the two pathways tie in with the HTM-laminar focus on layers 4 and 5. The only part that is difficult to reconcile between these two levels of abstraction is the fact that the two-stream theory is proposing a regional specialization, while the HTM-Laminar neuro-computational premise is that sensory-motor inference and motor commands are universal in all parts of the cortex. However, based on your answer, I can see that despite the universality of the laminar functions of the cortex, there can still be a split in the input paths (afferent nerve-bundles of axons) which then force the different regions to specialize on the different tasks (the What task and the Where task). The confusing, or just challenging, part to reconcile here is that in Jeff’s explanations above, at least with my first interpretation, most of the “what” and the “where” transformations are being accounted for within the laminar structure (layers 4 and 5) of a single CC mini-region. This suggested to me that every single CC mini-region is already doing both the “what” and “where” processing, and then we get this two-stream theory with the dorsal and ventral streams, each specializing at a macro-level of the cortex, which seems to suggest a hierarchical level of specialization in the separation of the “what” and the “where” processing. I probably missed some very important cues in Jeff’s video, which I will re-watch very soon. I learn a lot from each pass, because it is so full of information. But I take your words, “Separating the two (streams) solves a bunch of scaling problems”, to also imply that this separation of the streams does fit in with the laminar functions of layers 4 and 5, which are universal in all parts of the cortex, but are probably processing different “content” (the what and where inputs) depending on the macro-region they are in.
As you mention, the challenge then is finding the mechanisms for transforming back and forth between the various frames. Perhaps this is solved with some specific hierarchical organization. Then we should find some confluent regions in which the two are merged.

I will read up on your suggestions in the links and re-watch Jeff’s video to close my gaps in understanding this paradigm. Thanks for your pointers and comments.



I see your confusion. I think there are two separate issues and it might be worth spelling it out. The first is that each stream has its own hierarchy. The what pathway has a well documented hierarchy of increasing levels of abstractions for objects. The examples I gave for the motor cortex could be seen as examples of increasing levels of abstractions for body motions.

The second issue has to do with interactions between the two streams and the operations performed within every region and every level of the hierarchy. In the what pathway, converting body centered information into object-centered information allows a region (independent of level) to make accurate predictions with a lot less training. Imagine recognizing a coffee cup by touching it with your fingers. If you do this conversion, your object representation can be independent of whether your finger is pointed down, up, or sideways. You don’t need to touch the cup in every finger configuration to form an accurate model, even though your sensations are actually completely different in different finger configurations. This is a huge win from a scaling standpoint.
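Here is a small Python sketch of why that conversion helps. It assumes a deliberately simplified 2D world in which the finger's orientation relative to the object is a whole number of quarter turns and the sensed feature is an edge direction; the function names are hypothetical and this is not the mechanism proposed in the research, only the geometric idea behind it.

```python
from typing import Tuple

Vec = Tuple[int, int]


def rotate_quarter_turns(v: Vec, k: int) -> Vec:
    """Rotate a 2D vector counter-clockwise by k quarter turns (k may be negative)."""
    x, y = v
    for _ in range(k % 4):
        x, y = -y, x
    return (x, y)


def sense_in_finger_frame(edge_on_object: Vec, finger_turns: int) -> Vec:
    """What the finger feels: the object's edge seen from the rotated finger."""
    return rotate_quarter_turns(edge_on_object, -finger_turns)


def to_object_frame(sensed: Vec, finger_turns: int) -> Vec:
    """Undo the finger's rotation to recover the edge in object coordinates."""
    return rotate_quarter_turns(sensed, finger_turns)


RIM_EDGE: Vec = (1, 0)  # hypothetical edge direction stored in the object model

for turns in range(4):
    sensed = sense_in_finger_frame(RIM_EDGE, turns)
    recovered = to_object_frame(sensed, turns)
    print(f"finger rotated {turns * 90} deg: senses {sensed}, object frame {recovered}")
```

The raw sensation differs for every finger orientation, but after the conversion a single stored entry matches all of them, so the model only has to be learned once rather than once per orientation.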

This second issue is more fine grained. It’s the focus of Jeff’s video and our current research. The hypothesis is that this conversion has to occur in every region, regardless of location within the hierarchy. The conversion must be happening in both what and where pathways.

Hopefully this helps clear up the confusion. I am still struggling with how to explain these concepts more clearly. :smile: