Preliminary details about new theory work on sensory-motor inference

Thanks for saying that. We were really unsure about posting it. As you can tell there are many details that are not worked out yet, but in the end, this is all part of our open research philosophy.



BioSpaun, like its predecessor Spaun, is a neural simulation, not an AI/Machine Learning system. As such, it attempts to model neurons, networks and circuits in a much more biologically accurate way, as opposed to HTM, which attempts to describe a functionally equivalent implementation of a portion of the brain (specifically, the neocortex) algorithmically.

In other words, HTM is an algorithmic implementation of what HTM researchers believe is going on in the neocortex. It is not a direct simulation of the biology itself.

On the other hand, Spaun (and BioSpaun) are functional abstractions themselves to a great extent - not nearly as biologically realistic as, say, the Blue Brain Project (BBP), which attempts to model down to the ion channel or even molecular level in some cases. In short, all three of these systems are very different in their implementation as well as their goals. The BBP goal is to model the brain as precisely as possible, the HTM goal is to implement intelligence in a way analogous to how the brain implements intelligence, and Spaun is somewhere in between - leveraging computational neuroscience to try to understand the brain but at a higher level of abstraction than the BBP.


Don’t worry @Subutai!

My standard for “cleanliness”/“completeness” with posting is anything up to and including whatever you or Jeff might talk about in your sleep! :stuck_out_tongue:


I would second that question.
How do you extract the transform information/operation, and how do you modify the SDR coming from the Spatial Pooler so that it looks the same (transform-free) to the TM?

@mraptor I would venture to guess that the only thing the TM will know is the transformed SDR as it results from the SP processing.

That’s just my “intuition” but of course wait on a Numenta rep to verify…

I am back from traveling. I will try to answer a few questions prior to our office hours tomorrow. I will work from Fergal’s list of questions.

We haven’t yet settled on the language to use for the new theoretical ideas and in the recording I didn’t define my terms carefully. A mini-column is about 100 neurons in a very skinny column that span all layers of a region. They are about 30um wide and 2.5mm tall. Mini-columns are a physical reality and we have proposed a function for them in the spatial pooler and temporal memory algorithms. The output of the SP is a set of active mini-columns. The new theory I describe in the video does not change anything about mini-columns. We assume they are still there and performing in the same fashion. What is new is we are modeling L4 and L3 cells in each mini-column, whereas the TM models just L3 cells. L3 cells get almost all their input from other L3 cells (which is what we need for sequence memory). L4 cells get 65% of their input from L6a, some from the equivalent “where” region, some from L2, a few from other L4 cells.

The TM requires a set of mini-columns to work. We typically use 2048 mini-columns, if you go much below that then some of the properties of SDRs start to fail. A couple of thousand mini-columns is the smallest amount of cortex that can actually perform TM. This is roughly equivalent to a patch of cortex 1.5mm x 1.5mm. We didn’t call this a “column” or anything else.
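One way to see why SDR properties start to fail with fewer mini-columns is to compute the chance that two unrelated SDRs match by accident. The sketch below uses the standard hypergeometric overlap calculation. The 2% sparsity, the "half the active bits" match threshold, and the function name are illustrative assumptions, not values taken from the post.

```python
from math import comb

def false_match_probability(n, w, theta):
    """Chance that a random SDR (w of n bits active) shares at least
    theta active bits with one fixed SDR of the same size and sparsity."""
    total = comb(n, w)  # number of possible SDRs with w active bits
    matching = sum(comb(w, b) * comb(n - w, w - b) for b in range(theta, w + 1))
    return matching / total

# Assumed for illustration: 2% of mini-columns active, and a "match"
# means sharing at least half of the active bits.
for n in (64, 256, 1024, 2048):
    w = max(1, round(0.02 * n))
    theta = max(1, w // 2)
    p = false_match_probability(n, w, theta)
    print(f"n={n:5d}  w={w:3d}  theta={theta:3d}  false match probability={p:.3e}")
```

Under these assumptions the accidental-match probability falls steeply as the number of mini-columns grows, which is consistent with the remark that a couple of thousand mini-columns is about the smallest useful size.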

In the new theory we are sticking with the same basic dimensions, just adding more layers. I have been thinking of touch by imagining a small patch of sensors on the tips of each finger. Each patch would feed into a 2048 mini-column patch of cortex. I am pretending there are no other sensors on the hand. This is a simplification, but I believe it keeps the important attributes without throwing away anything essential.

So we now have multiple 2048 mini-column patches of cortex, one for each finger tip. We need a way to refer to them. In the recorded talk I just referred to them as “columns” but we may need a better term. These columns are roughly equivalent to barrel columns in rat.

The important attributes of the “column” are it receives a bundle of sensory input bits and that all these bits are run through a spatial pooler. There is no “topology” within the column, or put another way, all the mini-columns in the column are trying to inhibit all the other mini-columns. This makes it much easier to understand and model, yet allows us to build systems with multiple columns.
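"No topology" can be read as global inhibition: every mini-column competes with every other one, and only the best-matching few stay active. Here is a minimal sketch of that idea. It is not Numenta's spatial pooler (there is no learning or boosting), and the sizes, the random connections, and the function names are all made up for illustration.

```python
import random

def compute_overlaps(active_inputs, connections):
    """Overlap of each mini-column = number of its connected input bits that are active."""
    return [len(active_inputs & connected) for connected in connections]

def global_inhibition(overlaps, num_active):
    """Keep the num_active mini-columns with the highest overlap.
    Every mini-column competes with every other one (no topology)."""
    ranked = sorted(range(len(overlaps)), key=lambda i: overlaps[i], reverse=True)
    return set(ranked[:num_active])

# Hypothetical sizes: 256 sensory input bits feeding one 2048 mini-column "column".
rng = random.Random(0)
num_inputs, num_columns, num_active = 256, 2048, 41
connections = [set(rng.sample(range(num_inputs), 32)) for _ in range(num_columns)]
active_inputs = set(rng.sample(range(num_inputs), 20))

overlaps = compute_overlaps(active_inputs, connections)
active_columns = global_inhibition(overlaps, num_active)
print(len(active_columns), "mini-columns active out of", num_columns)
```

Because the winners are chosen over the whole column at once, there is no neighbourhood bookkeeping, which is part of what makes this case easier to model.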

Yes. We are trying to understand a 1.5mm x 1.5mm patch of cortex with all layers. That is the goal. The hope is all the important functions of a cortical region are represented in this small patch. It is well known that cells in some layers send axons longer distances within their layer. This occurs in layers 3a, 2, 5, 6a, and 6b. The idea we are pursuing is that the representations in these layers can be a union of possible values and that the “inter-column” projections are a way for multiple columns to reach a consensus on the correct value.
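The consensus idea can be sketched very roughly as follows. Each column holds a union of candidate values that fit its own input, and the columns narrow that union to whatever they all agree on. Plain Python sets of object names stand in for SDR unions here, and the intersection rule and the function name are assumptions for illustration, not a claim about the actual cortical mechanism.

```python
def reach_consensus(candidates_per_column):
    """Return the candidates that every column still considers possible."""
    columns = [set(candidates) for candidates in candidates_per_column]
    if not columns:
        return set()
    return set.intersection(*columns)

# Hypothetical example: three fingertip columns touching the same object.
# Each column alone is ambiguous, but together they agree on one value.
index_finger = {"cup", "can", "bottle"}
middle_finger = {"cup", "bowl"}
thumb = {"cup", "can"}

print(reach_consensus([index_finger, middle_finger, thumb]))
```

A real system would need to tolerate noise, for example by voting with a threshold instead of requiring unanimous agreement.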

I am sorry if I was confusing on this matter. The model assumes that L4 cells receive a “location” representation on their basal dendrites. Our current best guess is that this location SDR is coming from L6a (again L6a is 65% of the input to L4). L6a and L6b are massively interconnected to the equivalent layers in the equivalent region in the “where” pathway. The basic idea is cells in the “where” pathway represent a location in body or sensor space, this gets sent to L6a in the “what” region (the region we are concerned with), L6a converts this body-centric representation into an object-centric representation. Similarly, a location in object-centric representation is passed to L6b which is sent back to the where pathway. Somewhere along the way it gets converted to body-coordinates.

If you think about what it takes to move your fingers and to predict the next input, you realize that the brain has to continually convert from body-centric to object-centric coordinates and vice versa. This need is well known in the robotics community; all we are doing is bringing it to cortical theory and trying to understand the biological mechanisms.

There is a lot we don’t understand about the location coordinate transformation. In one possible implementation it requires a transformation of the sensory input as well as a transformation of the location representation. IF that is true I propose it is happening in L4 itself. Over the past few weeks I read some papers that suggest we learn objects in body-centric coordinates but then we mentally “rotate” our models to fit the sensory input. This is more than I can write here. We can leave it as: transformations have to occur continually and rapidly, yet there is a lot we don’t understand about them.

Yes, that is my understanding. This new theory that a column is actually a model of entire objects provides a simple explanation for what has been a mysterious phenomenon.

I didn’t understand this question.

I know. The specificity people report is somewhat contrary to the very notion of common cortical function. When I think about this specificity I don’t think it is wrong, but I suspect it is misleading. For example, V4 is often associated with color processing. But input to V4 had to come through V1 and V2, the data was there all along and was also processed in V1 and V2. It is not possible that V4 processes color and V1 and V2 don’t. The biggest problem with most of these studies is it is hard to find cells that reliably fire in response to a stimulus. To get around this problem the animal is often anesthetized, and/or the stimulus is made very unnatural. The simplest example is that cells in V1 behave completely differently when an animal is awake and looking at natural stimuli than when not awake and exposed to gratings.

I am looking forward to more discussions on this topic.


That is what I was saying … but this means that whatever comes from Spatial Pooler has to be transformed … what exactly is the transformation is decided outside of TM.

My question is how do you extract the Transformation-Operation, so that you can apply it before SDR enters TM.

 stream ==+------> SP ===> Apply T ==> TM
          |-> extract. T ------^

Thanks @jhawkins for answering so fully.

On nomenclature, perhaps “macrocolumn” or CC (Cortical Column) would be the right name for the 1-1.5mm squared patches? Cortical Column is the neuroscience name used by Hinton in his work on Capsules (his name for CCs) here/PDF which are functionally equivalent at least in his theory. Macrocolumn is a quite common name for the barrel-sized columns, and is what Rod Rinkus uses (he calls them MACs) in his SDR-based Sparsey system (see this post from last weekend).

On my question about coordinate systems, it seems likely that cortex uses distances and directions which are related to the motor actions needed to navigate (palpate or saccade) the object, rather than distances and directions in external units which relate to the intrinsic dimensions of the object itself. For example, a 42" TV at 10 feet would have the same saccading “size” as a 21" TV at 5 feet - the visual system treats both as the same. Similarly, experiments with reversing glasses show that we can very quickly learn to redefine up and down in terms of saccading outcomes.
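The TV example can be checked with a little trigonometry. The angle an object subtends at the eye depends only on the ratio of size to distance. Using the diagonal as the "size" and a simple pinhole-style formula are simplifying assumptions in this sketch.

```python
import math

def visual_angle_degrees(size, distance):
    """Angle subtended at the eye by an object of a given size, viewed
    head-on from a given distance (both in the same units)."""
    return math.degrees(2 * math.atan(size / (2 * distance)))

# 42" TV at 10 feet (120") versus 21" TV at 5 feet (60")
big_far = visual_angle_degrees(42, 120)
small_near = visual_angle_degrees(21, 60)

print(big_far, small_near)
print(math.isclose(big_far, small_near))  # same size-to-distance ratio, same angle
```

Both TVs would call for the same eye rotation to scan from one corner to the other, which is the sense in which they have the same saccading "size".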

These motor-defined coordinates are used to perform navigation over the object, so they are a kind of object coordinate system, which are relative to some reference point on the object (eg the centre of a TV screen, the central axis of a pen). The “where” pathway will also represent the egocentric position and orientation of the reference point, allowing us to navigate from one object to another, remember where we put an object, reach for one, and so on. [Edit: and a few years ago, I broke my leg stepping down onto a rock because my new glasses made me underestimate the depth by less than 1cm].

On specificity of regions in cortex, the idea does not contradict the generality and commonality of cortical function. What appears to happen especially in visual cortex is that V1 and V2 are huge generalists, extracting as much structure as they can from a very wide data bus. After that, genetic and developmental programs seem to differentiate the kinds of data which flows between regions. This appears (according to the talk below) to be a combination of differential projection and differential synaptic preferences. Since the same differentiation persists within and across species, it must have a genetic component, which is likely reinforced by experience and pruning during development.

It is likely that V4 is better (per neuron) at processing colour, and its output has better colour-related information, than V1 or V2. Similarly, MT is better (per neuron) at motion-related processing. Both these areas feed back to V1 and V2 and no doubt help them with their own processing of colour, motion etc. We can imagine a very early visual cortex which only had V1/V2, and then more specialised areas were added in later as the organism evolved.

Even within V1 itself, there are two predominant designs in mammals: the pinwheel style of bunches of minicolumns which are each sensitive to an orientation, a colour contrast, or a motion (as found in most higher primates), and the salt-and-pepper style where such bundles cannot be detected (as found in rodents). This may relate to the much larger hierarchies in mammals which feed back highly specialised signals to primary cortical regions.


I am sharing this image to confirm the quote, and to make the obvious point that layers of the CC are interconnected with other remote CCs (via white matter axons) and with the limbic and brain stem areas; this is all part of the feedforward and feedback to each CC. To me, THIS IS the hierarchy of the brain, which connects sensory-motor and controls attention to each of the different parts. In everything I am seeing in this thread, the H in HTM seems to be focused on the same CC. What am I missing?

“a picture is worth a 1000 words”


@lilacntulip you’re not missing anything, it’s just a bit more complex than your diagram suggests. Every layer communicates with every other layer in both directions within a CC. Every CC connects with neighbouring CCs by layer and with connections from one layer to another, and there are more long range inter-CC connections in regions. Every region is connected to dozens of other regions, primarily in some stereotypical way seen across individuals in a species, often across species, but also in small but significant amounts, with other regions based on an individual’s genetics and experience.


In the following videos, Spaun looks like it is doing sensory-motor inference, but not exactly as described in the original post.


Regarding the simplification about sensors on the tips of the fingers …
From a neuroscience course I took on Coursera (Duke University), I learned that there are actually not that many touch-sensitive neurons on some parts of the hand (the back of the hand, for example). In fact, you cannot tell whether you are being touched with one needle or with two if they are close to each other, because the neurons there are a noticeable distance apart. You can test it yourself. Of course, there are more touch-sensitive neurons on the fingertips.


Here’s the video of yesterday’s Office Hour:

(Yes, it was midnight for me. I wake up around 4am most days, so it really was the middle of my night).

Thanks all for another great Office Hour. And congrats to @mrcslws on being hired full-time.

Jeff referred to Alex Thomson’s review of work on Layer 6. Here’s the PDF. A broader survey of all the layers by Thomson and Mamy is here.


@fergalbyrne Yep we missed you buddy! Thanks for the specific links by the way. And congratulations @mrcslws!

Once again, a big thank you for the discussions and office hour! From my newbie perspective I think I understood most of the ideas and definitely learned a lot. I do have a few questions, though:

  • Yuwei mentioned a paper about the evidence that strongly suggests sensor location to object location transformations are happening. I believe it had something to do with grid cells in a mouse? May I get the paper’s name or a link to it?

  • Am I correct to say that each part of the retina inputs to different cortical module of V1 much like each finger tip gets its own cortical module? So when we say each cortical module uses the exact same algorithms to learn a model of the world, it’s like saying each finger tip learns a model of the object and when all fingertips are touching different parts of the object we have multiple reference points which gives us a better inference of what the object is.

  • How can I think of “object location” for a single cortical module? For example, when I stick my hand in a bag to feel some unknown object with my index finger I only have one source of information about the object. Knowing where my fingertip is relative to my hand can be considered “sensory location”. Would “object location” be relative to something on the object itself? If so, just the finger tip on an object usually doesn’t give us enough information to infer what the object is.

  • Along the lines of the previous question, it was noted many times that in a region every cortical module performs the same type of computation in parallel with the others and “votes”, or is a union of each cortical module’s object location SDR. I understand the desire to get away from hierarchies as much as possible, but wouldn’t this union SDR imply a pattern across the entire region that would need to be recognized in a higher region?

  • What leads you to hypothesize that most, or perhaps all, processing could be done in a single region as opposed to a hierarchy of regions? Is it based on the extreme generalization of sequence and sensory-motor inference along with the computation power of many cortical modules with 6 layers of mini-columns?

  • An unrelated question to your new research, but why the term “cells” and not “neurons”? Is this just a naming convention that stuck or is there any significance to using “cells” over “neurons”?

Dave D.

I believe @ycui was giving this as an example proof that transformation is happening in the brain. In the case of mouse grid cells, the object is the mouse itself, and the location is its position in a familiar environment. The Mosers’ Scholarpedia article is a good place to start.

Correct. Each CC will have partial evidence of what it is touching, so it’ll have some union of guesses. By communicating among participating CCs, the region (or a higher region) can more confidently identify the common cause.

True, and @jhawkins mentions this when talking about his coffee cup. This is why we spread out our fingers when we explore an unknown or invisible object, and mice do the same with whisker exploration. Conversely, when we see the object, we’ll move our fingers to conform with its predicted shape in order to get as fine-detailed confirming information as possible.

Real cortex has plenty of inter-CC lateral connections (this has been known of rodent barrel cortex for decades), and it’s likely that some significant portion of feedback to L1 extends across CC boundaries, so both intra-region and inter-region (hierarchy) processing is used in cortex.

Every region always tries to solve as much of a problem as it can, as using mid-range axons is much cheaper than relying on expensive long-range axons. This is not just a matter of “matter” but perhaps more important is the information cost of long-range links. You can connect CCs just by programming the growth of some axons to be of a certain length (say 2.5mm) in a random direction, but the genetic program to connect separate regions is dramatically more complex.

I think Jeff and Numenta are therefore wise to concentrate on single-region processing. The bandwidth in cortex falls off very strongly with distance as a power law, so implementing intra-region processing will account for 80-95%.

This is another example of taste in this field. Just calling something a neuron does not mean that it has any real resemblance to a neuron as found in cortex. Some sensible researchers in Deep Learning acknowledge this by using the word “unit” to talk about cells/neurons. HTM uses “cell” (I’m guessing because it’s single syllable, @jhawkins and @subutai can confirm) which is in between the two.


You might be interested in these articles by the Mosers and others on this topic


Fergal, thank you for those excellent answers. Just a little more on the hierarchy question.

Our goal is to understand BOTH what goes on in each cortical region and how cortical regions interact in a hierarchy. From a pure numbers point of view the vast majority of neurons and synapses are in the cortical regions themselves. One mm2 of cortex has about 100K neurons and 1B synapses. They are distributed in layers (there are actually more like nine layers 1,2,3a,3b,4,5a,5b,6a,6b). This is where the vast majority of the work is being done and we need to understand what all these neurons and layers are doing. For basic functions I start with the assumption that it is being implemented in a cortical region.

The hierarchy is defined by direct “cortical-cortical” connections and “cortical-thalamic-cortical” connections and we need to understand them as well. We are not ignoring them, but these connections are very small in number compared to the intra-cortical connections.

The way I think about it is we seek to understand what a region does, and by region I include its connections with other regions. For example, I consider thalamic relay cells as part of a cortical region, and therefore perhaps the thalamus is involved in coordinate transforms. But we can’t rely on region B to do something that region A does not do. I hope that is clear.

I want to point out a big exception to this that I talked about a bit in the video and in the office hour. This is the “what” vs. “where” pathways. There are parallel cortical hierarchies in vision, touch, and audition. The regions look similar, so we assume they are doing basically the same function, but the where regions encode body-centric behaviors and representations whereas the what regions encode object-centric behaviors and representations. Our hypothesis is you get a what region by feeding sensory data to the region (into the spatial pooler) and you get a where region by feeding proprioceptive data to the region (into the spatial pooler). Sensory-motor inference requires both types of regions. A movement in object space (e.g. move finger from pen tip to pen barrel) has to be converted into the equivalent movement in body space (e.g. extend thumb 1 cm). The equivalent movements vary based on the current location of the object and your body/thumb. The same conversion has to occur in the opposite direction, a movement of my hand has to be converted into the equivalent movement in object space so that the cortex can predict what it will feel after the movement. This back and forth conversion happens every time you move. There must be major connections and lots of neural machinery dedicated to this conversion. It also has to be fast and local to each CC.

I strongly suspect that Layers 6a and 6b are doing this conversion. These layers are heavily interconnected with their equivalent layers in the what and where regions. For example in a what region, 6a gets input from layer 6 in the equivalent where region. The cells in L6a then project to L4 (they represent 65% of the input to L4). L5 in the what region projects to L6b which projects to L6 in the where region. This is exactly the kind of connections we need to go back and forth between locations in object space and locations in body space. I am optimistic that we can figure out exactly how L6 is doing these conversions and also how L4 and L5 interact with L6.
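For anyone who wants a concrete picture of the back-and-forth conversion, here is the purely geometric version that robotics uses, reduced to 2D. It says nothing about how L6 might implement it. The function names and the idea of describing the object's pose by an origin and an angle in body coordinates are assumptions for this sketch.

```python
import math

def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by angle (radians)."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def object_to_body_movement(move_obj, object_angle):
    """A movement expressed in object space -> the same movement in body space."""
    return rotate(move_obj, object_angle)

def body_to_object_movement(move_body, object_angle):
    """A movement expressed in body space -> the same movement in object space."""
    return rotate(move_body, -object_angle)

def object_to_body_location(loc_obj, object_origin, object_angle):
    """A location on the object -> where that location is relative to the body."""
    rx, ry = rotate(loc_obj, object_angle)
    return (rx + object_origin[0], ry + object_origin[1])

def body_to_object_location(loc_body, object_origin, object_angle):
    """A location relative to the body -> the matching location on the object."""
    dx = loc_body[0] - object_origin[0]
    dy = loc_body[1] - object_origin[1]
    return rotate((dx, dy), -object_angle)

# Hypothetical pen held 30 cm in front of the body, rotated 90 degrees.
pen_origin, pen_angle = (0.0, 30.0), math.pi / 2

# "Move 1 cm along the pen" (object space) becomes a different movement in body space.
print(object_to_body_movement((1.0, 0.0), pen_angle))

# Going to body space and back recovers the original object-space location.
tip_on_pen = (5.0, 0.0)
tip_in_body = object_to_body_location(tip_on_pen, pen_origin, pen_angle)
print(body_to_object_location(tip_in_body, pen_origin, pen_angle))
```

The same object-space movement maps to different body-space movements as the pen's angle changes, which is why the conversion has to be recomputed every time the object or the body moves.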


A post was split to a new topic: The Cortex at rest

@fergalbyrne Woa… grid cells are pretty neat. I learn something new every day… Ok, cool I think I have a good rough understanding of how CCs operate with sensors for sensory-motor inference. Additionally, I now see that it makes sense to define the algorithms for single region processing first because a great majority of the connections are inside a region. Thanks Fergal.

@jhawkins Jeff, I really appreciate the extremely detailed explanation and I think I understand now. So a cortical region (CR) is considered a general processing unit full of smaller general processing units, or cortical columns (CCs). Every CR does the same type of processing throughout the cortex and every CC has the same underlying algorithms. It’s the underlying algorithms of both CC and CR, using the model of cells, that we are trying to understand and develop.

However, the inputs to these regions are not always similar. Object-centric inputs (vision color, touch pressure, auditory frequencies, etc.) feed into a “what” CR and body-centric inputs (eye angles, finger angles, head angles, etc.) feed into a “where” CR. Sensory-motor inference is just a dance between two regions: one converts from an inputted “what” to a predicted “where” and the other converts from an inputted “where” to a predicted “what”.

I think I need to draw a diagram to get a better intuition for what’s going on connection-wise, but I feel a lot more clear on the basics. Of course, please correct me if I am mistaken.

This is really cool!