Preliminary details about new theory work on sensory-motor inference

In the following videos, SPAUN looks like it is doing sensory-motor inference, though not exactly as described in the original post.

Regarding the simplification about sensors on the tips of the fingers …
From a neuroscience course I took on Coursera (Duke University), I learned that there are actually not many touch-sensitive neurons on some parts of the hand (the back of the hand, for example). In fact, you cannot tell whether you are being touched with one needle or two if they are close to each other, because the neurons there are spaced a noticeable distance apart. You can test it yourself. Of course, there are more touch-sensitive neurons on the fingertips.

Here’s the video of yesterday’s Office Hour:

(Yes, it was midnight for me. I wake up around 4am most days, so it really was the middle of my night).

Thanks all for another great Office Hour. And congrats to @mrcslws on being hired full-time.

Jeff referred to Alex Thomson’s review of work on Layer 6. Here’s the PDF. A broader survey of all the layers by Thomson and Lamy is here.


@fergalbyrne Yep we missed you buddy! Thanks for the specific links by the way. And congratulations @mrcslws!

Once again, a big thank you for the discussions and office hour! From my newbie perspective I think I understood most of the ideas and definitely learned a lot. I do have a few questions, though:

  • Yuwei mentioned a paper about the evidence that strongly suggests sensor location to object location transformations are happening. I believe it had something to do with grid cells in a mouse? May I get the paper’s name or a link to it?

  • Am I correct to say that each part of the retina feeds a different cortical module of V1, much like each fingertip gets its own cortical module? So when we say each cortical module uses the exact same algorithms to learn a model of the world, it’s like saying each fingertip learns a model of the object, and when all fingertips are touching different parts of the object we have multiple reference points, which give us a better inference of what the object is.

  • How can I think of “object location” for a single cortical module? For example, when I stick my hand in a bag to feel some unknown object with my index finger I only have one source of information about the object. Knowing where my fingertip is relative to my hand can be considered “sensory location”. Would “object location” be relative to something on the object itself? If so, just the finger tip on an object usually doesn’t give us enough information to infer what the object is.

  • Along the lines of the previous question, it was noted many times that in a region every cortical module performs the same type of computation in parallel with the others and “votes”, or that the result is a union of each cortical module’s object location SDR. I understand the desire to get away from hierarchies as much as possible, but wouldn’t this union SDR imply a pattern across the entire region that would need to be recognized in a higher region?

  • What leads you to hypothesize that most, or perhaps all, processing could be done in a single region as opposed to a hierarchy of regions? Is it based on the extreme generalization of sequence and sensory-motor inference along with the computational power of many cortical modules with 6 layers of mini-columns?

  • An unrelated question to your new research, but why the term “cells” and not “neurons”? Is this just a naming convention that stuck or is there any significance to using “cells” over “neurons”?

Dave D.

I believe @ycui was giving this as an example proof that transformation is happening in the brain. In the case of mouse grid cells, the object is the mouse itself, and the location is its position in a familiar environment. The Mosers’ Scholarpedia article is a good place to start.

Correct. Each CC will have partial evidence of what it is touching, so it’ll have some union of guesses. By communicating among participating CCs, the region (or a higher region) can more confidently identify the common cause.
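As a toy illustration of this kind of voting, here is a minimal Python sketch. It assumes each CC holds a plain set of candidate objects and that the region simply intersects them; the object names and the `vote` helper are hypothetical, and real cortex would presumably operate on SDRs rather than Python sets.

```python
from functools import reduce

# Hypothetical candidate sets: the objects each CC considers consistent
# with what its own fingertip currently feels.
candidates_per_cc = {
    "index_finger": {"coffee cup", "pen", "bowl"},   # feels a curved surface
    "thumb": {"coffee cup", "mug lid", "bowl"},      # feels a rim
    "middle_finger": {"coffee cup", "stapler"},      # feels a handle
}

def vote(candidates):
    """Return the objects that every participating CC still considers possible."""
    sets = list(candidates.values())
    if not sets:
        return set()
    # Intersect all candidate sets: only the common cause survives.
    return reduce(set.intersection, sets)

consensus = vote(candidates_per_cc)
print(consensus)  # {'coffee cup'}
```

No single CC can identify the object on its own here, but the intersection of their partial guesses leaves only one common cause.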

True, and @jhawkins mentions this when talking about his coffee cup. This is why we spread out our fingers when we explore an unknown or invisible object, and mice do the same with whisker exploration. Conversely, when we see the object, we’ll move our fingers to conform with its predicted shape in order to get as fine-detailed confirming information as possible.

Real cortex has plenty of inter-CC lateral connections (this has been known of rodent barrel cortex for decades), and it’s likely that some significant portion of feedback to L1 extends across CC boundaries, so both intra-region and inter-region (hierarchy) processing is used in cortex.

Every region always tries to solve as much of a problem as it can, as using mid-range axons is much cheaper than relying on expensive long-range axons. This is not just a matter of “matter”; perhaps more important is the information cost of long-range links. You can connect CCs just by programming the growth of some axons to be of a certain length (say 2.5 mm) in a random direction, but the genetic program to connect separate regions is dramatically more complex.

I think Jeff and Numenta are therefore wise to concentrate on single-region processing. The bandwidth in cortex falls off very strongly with distance as a power law, so implementing intra-region processing will account for 80-95% of that bandwidth.

This is another example of taste in this field. Just calling something a neuron does not mean that it has any real resemblance to a neuron as found in cortex. Some sensible researchers in Deep Learning acknowledge this by using the word “unit” to talk about cells/neurons. HTM uses “cell” (I’m guessing because it’s single syllable, @jhawkins and @subutai can confirm) which is in between the two.

You might be interested in these articles by the Mosers and others on this topic


Fergal, thank you for those excellent answers. Just a little more on the hierarchy question.

Our goal is to understand BOTH what goes on in each cortical region and how cortical regions interact in a hierarchy. From a pure numbers point of view the vast majority of neurons and synapses are in the cortical regions themselves. One mm² of cortex has about 100K neurons and 1B synapses. They are distributed in layers (there are actually more like nine layers: 1, 2, 3a, 3b, 4, 5a, 5b, 6a, 6b). This is where the vast majority of the work is being done and we need to understand what all these neurons and layers are doing. For basic functions I start with the assumption that it is being implemented in a cortical region.

The hierarchy is defined by direct “cortical-cortical” connections and “cortical-thalamic-cortical” connections and we need to understand them as well. We are not ignoring them, but these connections are very small in number compared to the intra-cortical connections.

The way I think about it is we seek to understand what a region does, and by region I include its connections with other regions. For example, I consider thalamic relay cells as part of a cortical region, and therefore perhaps the thalamus is involved in coordinate transforms. But we can’t rely on region B to do something that region A does not do. I hope that is clear.

I want to point out a big exception to this that I talked about a bit in the video and in the office hour. This is the “what” vs. “where” pathways. There are parallel cortical hierarchies in vision, touch, and audition. The regions look similar, so we assume they are doing basically the same function, but the where regions encode body-centric behaviors and representations whereas the what regions encode object-centric behaviors and representations. Our hypothesis is you get a what region by feeding sensory data to the region (into the spatial pooler) and you get a where region by feeding proprioceptive data to the region (into the spatial pooler). Sensory-motor inference requires both types of regions. A movement in object space (e.g. move finger from pen tip to pen barrel) has to be converted into the equivalent movement in body space (e.g. extend thumb 1 cm). The equivalent movements vary based on the current location of the object and your body/thumb. The same conversion has to occur in the opposite direction: a movement of my hand has to be converted into the equivalent movement in object space so that the cortex can predict what it will feel after the movement. This back-and-forth conversion happens every time you move. There must be major connections and lots of neural machinery dedicated to this conversion. It also has to be fast and local to each CC.

I strongly suspect that Layers 6a and 6b are doing this conversion. These layers are heavily interconnected with their equivalent layers in the what and where regions. For example in a what region, 6a gets input from layer 6 in the equivalent where region. The cells in L6a then project to L4 (they represent 65% of the input to L4). L5 in the what region projects to L6b which projects to L6 in the where region. This is exactly the kind of connections we need to go back and forth between locations in object space and locations in body space. I am optimistic that we can figure out exactly how L6 is doing these conversions and also how L4 and L5 interact with L6.


@fergalbyrne Woa… grid cells are pretty neat. I learn something new every day… Ok, cool I think I have a good rough understanding of how CCs operate with sensors for sensory-motor inference. Additionally, I now see that it makes sense to define the algorithms for single region processing first because a great majority of the connections are inside a region. Thanks Fergal.

@jhawkins Jeff, I really appreciate the extremely detailed explanation and I think I understand now. So a cortical region (CR) is considered a general processing unit full of smaller general processing units, or cortical columns (CCs). Every CR does the same type of processing throughout the cortex and every CC has the same underlying algorithms. It’s the underlying algorithms of both CC and CR, using the model of cells, that we are trying to understand and develop.

However, the inputs to these regions are not always similar. Object-centric inputs (vision color, touch pressure, auditory frequencies, etc.) feed into a “what” CR and body-centric inputs (eye angles, finger angles, head angles, etc.) feed into a “where” CR. Sensory-motor inference is just a dance between two regions: one converts from an input “what” to a predicted “where” and the other converts from an input “where” to a predicted “what”.

I think I need to draw a diagram to get a better intuition for what’s going on connection-wise, but I feel a lot more clear on the basics. Of course, please correct me if I am mistaken.

This is really cool!

@fergalbyrne, is this similar to C-Space in robotics? I also have a bit of difficulty believing that the cortex can handle 7-DOF mathematical calculations for converting from world-space to object-space so rapidly… C-Spaces handled as multiple 2D maps, on the other hand (using something similar to geospatial SDRs), could be a bit more feasible…

I’m still trying to wrap my head around the XFORM idea. It’s quite intriguing, although I’m not sure that I’ve grasped the core of it. At first I couldn’t quite connect the dots about what problem was being solved and the need to introduce the concept of “object space” into the model.

If I understand correctly, you’re trying to find a bridge between the “what” and “where” pathways/regions, and by introducing the concept of “normalized object space” in the “what pathway” it would be possible to create object representations that are more or less invariant to distance or orientation.

Given this… I could imagine a scenario where you have an “object” representation within the “what pathway” for a circle that is somewhat invariant to distance. Proprioceptive inputs into the “where pathway” (maybe in the form of eye convergence for depth perception) would drive an XFORM function that would affect (scale) the effective “motor inputs” to the “what pathway” allowing for the predicted sensory inputs associated with the motor inputs to be more stably predicted even when taking distance into account. If the circle is far away the relative angle of eye movements to trace the circle with your eyes would be smaller and if the circle was closer the angles would be greater.

Am I partially on track with the concept?
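The eye-movement geometry in this circle example can be checked with a few lines of Python. This is only a sketch of the trigonometry, assuming the circle faces the viewer head-on; the function name and the example sizes are hypothetical, and nothing here is meant as a model of what the cortex computes.

```python
import math

def trace_angle_deg(radius_m, distance_m):
    """Visual angle (in degrees) spanned by a circle's diameter at a given distance."""
    return math.degrees(2 * math.atan(radius_m / distance_m))

# The same 5 cm radius circle seen from 0.5 m and from 5 m.
near = trace_angle_deg(0.05, 0.5)
far = trace_angle_deg(0.05, 5.0)

print(f"near: {near:.1f} deg, far: {far:.1f} deg")  # near: 11.4 deg, far: 1.1 deg

# Scale factor that would map eye movements at one distance onto the other.
scale = near / far
```

The eyes sweep roughly ten times the angle to trace the near circle, so a distance-dependent scale factor of this kind is what an XFORM would have to supply for the object representation to stay the same.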



Hi John,
Yes, you are on the right track, but the problem is deeper than scale or translation invariance.

Our brains learn the structure of thousands of objects in the world. My coffee cup is one such object. When I touch my coffee cup my cortex is constantly predicting what I will feel on my fingers as I move my fingers and grasp the cup in different locations. It is easy to experience this. Close your eyes while touching a familiar object and then imagine a movement of one finger and you can anticipate what that finger will feel after the movement is made. This tells you that you have a model of the object that includes what features exist at different locations on the object. When you move your finger the brain knows what feature will be in the new location.

You can touch an object with different fingers, different hands, the back of your hands, your nose, etc. and still make predictions about what you will feel. This tells us that the model of the object is not specific to any particular part of your sensory space, nor is it particular to any specific location or orientation of the object relative to the body.

So the problem is: we learn the structure of objects in one location using one set of sensors but can apply that knowledge to different sensors and different locations/orientations. This basic problem has been known to roboticists for many years. It is fairly easy to define an object’s features in a Cartesian coordinate frame that is relative to the object, and define a sensory organ’s location in another Cartesian coordinate frame that is relative to the body. If you know the location and orientation of the object in body coordinates, you can do the math to know the location of the sensory organ in the object’s coordinate frame.
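For readers who have not seen the roboticists’ version of this math, here is a minimal 2D sketch in Python. It assumes the object’s pose in body coordinates is known as an origin plus a rotation angle; the function names and the numbers are hypothetical, and this is the textbook calculation, not a claim about how the cortex does it.

```python
import math

def body_to_object(sensor_xy, object_origin_xy, object_angle_rad):
    """Convert a sensor location from body coordinates to the object's frame.

    object_origin_xy and object_angle_rad describe the object's pose
    (position and rotation) expressed in body coordinates.
    """
    dx = sensor_xy[0] - object_origin_xy[0]
    dy = sensor_xy[1] - object_origin_xy[1]
    c, s = math.cos(object_angle_rad), math.sin(object_angle_rad)
    # Undo the object's rotation (rotate the offset by -angle).
    return (c * dx + s * dy, -s * dx + c * dy)

def object_to_body(point_xy, object_origin_xy, object_angle_rad):
    """Convert a location on the object back into body coordinates."""
    c, s = math.cos(object_angle_rad), math.sin(object_angle_rad)
    x = c * point_xy[0] - s * point_xy[1] + object_origin_xy[0]
    y = s * point_xy[0] + c * point_xy[1] + object_origin_xy[1]
    return (x, y)

# A pen whose origin sits at (0.30, 0.20) m in body space, rotated 90 degrees.
pen_origin, pen_angle = (0.30, 0.20), math.pi / 2
fingertip_body = (0.30, 0.25)

fingertip_on_pen = body_to_object(fingertip_body, pen_origin, pen_angle)
round_trip = object_to_body(fingertip_on_pen, pen_origin, pen_angle)
```

In this example the fingertip ends up about 5 cm along the pen’s own x axis, and converting back recovers the original body-space location. If the pen rotates or moves, only the pose changes; the location on the pen stays the same.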

The cortex has to be doing some version of this, and it has to do it constantly, every time you move any part of your body. It also has to be fast. In our current thinking, each small section of each cortical area (a “cortical column”) has to make this transformation, somewhat independently of the other areas. For example, each small part of your hand has to calculate where it is relative to an object somewhat independently of the other parts of your hand. This is what we were referring to as the XFORM. The XFORM converts a location in body space into a location in object space, and vice versa. It is extremely unlikely that the cortex is doing this XFORM using mathematical techniques.

We didn’t start by trying to understand the what and where pathways, but in hindsight they are almost certainly part of the brain’s solution to the XFORM problem. Where pathways form representations in body space and what pathways form representations in object space. What and where regions are connected via long range connections in Layer 6. Layer 6 projects to Layer 4 and these connections comprise 65-75% of the synapses in L4. Our current hypothesis is that the XFORM is occurring in L6.

There is a lot of data in the neuroscience literature that gives clues to how this is happening and some of the data is contradictory. We are working our way through various hypotheses.


This is true, but could it simply be that moving 1 cm to the right with your finger, your nose or your lips is the same thing for your cortex, and not necessarily that it has an object model? I wonder, if you inverted the direction (like when artists invert photographs to draw them), whether the brain would be confused about what it is dealing with.

When palpating an object like a pen or coffee cup, the orientation of the object relative to the body is constantly changing, so moving 1 cm to the right does not have the same effect.

Also with sensory-motor inference you can move in any direction for any amount. E.g. I can move my finger from the barrel of a pen to the tip or from the bottom of a pen to the tip, etc. If you try to learn all possible movements you run into a combinatorial problem in both learning and memory.
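To get a feel for the size of that combinatorial problem, here is a back-of-the-envelope count in Python. The counting scheme and all of the numbers are made up for illustration; the only point is how quickly memorizing every movement grows compared with storing features at locations on the object.

```python
def memorized_transitions(n_locations, n_orientations, n_sensors):
    """Count entries if every movement between two distinct locations is
    learned separately for each object orientation and each sensor patch."""
    return n_locations * (n_locations - 1) * n_orientations * n_sensors

def location_model_size(n_locations):
    """Count entries if the model only stores which feature is at each
    location in the object's own coordinate frame."""
    return n_locations

# Hypothetical numbers: 100 locations on a pen, 1000 distinguishable
# orientations relative to the body, 50 sensor patches.
brute_force = memorized_transitions(100, 1000, 50)
object_model = location_model_size(100)

print(brute_force)   # 495000000
print(object_model)  # 100
```

With an object-centric model, the orientation and sensor dependence is pushed into the body-to-object conversion instead of being multiplied into what has to be learned per object.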

Hi Jeff,
I have been following this fascinating new thread, and I have understood most of the underlying theory of CLA, HTM and SDRs since I joined the community back in 2013. (I am now also an avid fan and supporter of your ideas within our automotive company in Germany.) I have a short question which I feel needs clarification, at least for me to feel sure I am following your thoughts correctly. (This may be repeating some of @fergalbyrne’s points, but I believe it is very important.) You speak of the XFORM taking place in L6 and L6 projecting mostly to L4. This process of object mapping to sensory mapping (object model == body map) in my view probably involves multiple layers of regions in the HTM and does not occur just within one cortical region. In your post from 13 days ago you nevertheless conclude that the “What” pathway “has to be fast and local to each CC”. I am sure nothing is being ignored in your current hypothesis, but I still find the possibility compelling that the “What” process is taking place higher in the hierarchy than the “Where” process. That would explain why we can vary the orientation of the object and the relative body surfaces used for contact and still have a stable, invariant object model, which only exists in the higher cortical regions within this hierarchy. Is this consistent with your current thoughts? Or are you actually looking only at the predictions taking place exclusively within one cortical region? Thanks in advance for your clarification. I would like to add that I certainly agree that, independently of the level in the hierarchy, there is a phenomenal process taking place in L6 which is universal to all regions and is playing a key role. But do we have data on how fast the information flow is within the hierarchical levels? In vision this seems to be occurring very fast. Thanks again.

Joe Perez

Hi Joe,
One of the core principles we adhere to is that all cortical regions perform the same basic functions. If you accept that principle then whatever happens “higher in the hierarchy” also has to occur in every region, including primary input regions. Of course the hierarchy is important. As you ascend the hierarchy the higher regions receive input from a broader area of sensory input. So complete stability during visual saccades might only be observed at a higher region such as IT, but stability will still be observed in V1, albeit maybe over small saccadic movements. That every region is doing the same thing does not mean that every region builds equivalent representations or that hierarchy isn’t important.

As far as we know What and Where regions exist at all levels of the hierarchy for all modalities. The L6 connections between What and Where regions also exist at all levels of the hierarchy. I am not certain, but I believe there isn’t a 1 to 1 correspondence of what and where regions. So the What hierarchy might have 4 hierarchical regions whereas the Where hierarchy might have 3.

I don’t quite follow your suggestion that What processing should occur higher than Where processing. My current assumption is that the What and Where pathways are interconnected at all levels and they are mutually independent, but we don’t understand this process.

My comment on speed was mostly to remind ourselves that the body-to-object and object-to-body coordinate transformations must be performed for each and every movement and sensation. Therefore it has to be fast.


I really appreciate your replies to this thread. They’re really helpful! Please help me refine my understanding of Numenta’s hypothesis about Cortical Column functionality for sensory-motor inference in a Cortical Region. I apologize if I’m repeating what you’ve already stated:

Let’s say we’re looking at a CC of a what Cortical Region. The input to a CC is the resultant SDR from an encoder. The entire CC spatially pools this input to select active minicolumns. Given the CC’s resultant active minicolumns from the spatial pooler, the hypothesis is that Layer 6 cells in our what region CC look at a connected where region’s output. Layer 6 of the what region CC transforms the where region’s body-centric representation (e.g. sensor angle or position) into our what region’s object-centric representation (e.g. location on a pen). Continuing the hypothesis, this L6 XFORM gets sent to L4, where it makes up 65-75% of L4’s input (what’s the other 25-35% then?). The L4 of our example what region recognizes patterns and makes predictions based on the L6 XFORM to object-centric representations.

My current assumption is that the What and Where pathways are interconnected at all levels and they are mutually independent, but we don’t understand this process.

Actually, this leads me to a somewhat related question: What examples exactly in the brain are a what or where region? As an example is V1 (or any other lower region) considered a what or where region? Or is it both?


Hi Jeff,

Good to see your new direction.

I have a hint from biology that may shed some light.

I have a rare visual defect in my right eye, where both the primary and backup muscles to look “up” do not work (double elevator palsy). When I look up and to the right, I see double. Streams from both eyes are XFORMed based on the view location. My left eye views that location, and the right eye view is pegged at horizontal for that quadrant. Its view is superimposed at the anticipated viewing location.

(I have learned to tilt my head back when I look up, and to close my right eye when looking way up)
It’s been this way all my life, I haven’t adapted to it.

I believe this implies that the XFORM is top-down/hinted, rather than observed. Do you agree?

– Glen

Hi Jeff,
Thank you for the responses you provided above. I highly appreciate that you are sharing these very new insights with all of us in the community. Since you stated:

My best answer to this question is as follows: I can clearly understand that you now view the “what” and “where” regions as fundamental building blocks present at all levels of the HTM hierarchy, with the XFORM taking place in L6. While you have a much deeper understanding of activity within cortical column layers, backed by evidence from multiple sources, I am just looking at the bigger picture of what must be taking place and varying from one level in the hierarchy to the next higher level. While each hierarchical level (or region within it) certainly has universal features (which may indeed include the “what” and “where” in all), I think we would both agree (based on your work in “On Intelligence”) that as we reach higher levels in the hierarchy, the subjects or objects being represented by the SDRs are more abstract concepts, because the higher regions are merging inputs from multiple sub-regions, and at some point we are even merging inputs from different senses, like vision and touch, and perhaps also memory recall. So my question is whether it is conceivable to you that the “what” concept is actually emerging as a higher, more abstract concept (with a stable SDR) further up in the hierarchy. But I take from your answers that this motor-sensory paradigm is believed to contain the “what” and the “where” concept in every cortical region at all levels. So I take your answer to imply that higher hierarchical levels not only have more stable SDRs, but also representations encompassing more features from their many sub-regions.

Thanks in advance for correcting any misconceptions if you see any.

