Preliminary details about new theory work on sensory-motor inference


Thanks for spelling out the separation of the two issues we are discussing. I can now see how they both tie in to the object-centered transformations, while the two main pathways (ventral and dorsal) each have their own tasks and goals.

I can imagine that the object-centered transformations require some involvement of object memory. Object recognition can only take place for objects that have previously been learned.

I am very excited about the new progress being made with sensorimotor inference! I’ve watched the Office hour and Jeff’s latest whiteboard talk a few times now to better understand it. I’ve been working on my own application of HTM for a couple of years that I can’t wait to update with an implementation of ‘real’ sensorimotor theory. Recently I’ve been trying to model L4 sensorimotor temporal context by simply connecting inputs of motor command SDRs to distal dendrites of the L2/3 pyramidal cells, which I think is a naïve approach that Numenta attempted at one point (the idea came from old discussions on nupic-theory). In my application, motor commands consist of encoded representations of the function API of ELF executables. It’s still a work in progress. Trying to abstract this proprioceptive/somatic transformation to my non-biological application is mind-bending.


Or those found to be close enough by an implied category or analogy?

[quote=“cogmission, post:75, topic:697”]
Or those found to be close enough by an implied category or analogy?
[/quote]

Yes, I would fully agree with that statement, since SDRs that overlap (even slightly) need to be closely related. In my opinion it is the hierarchical level that determines whether it is the object category that matches or a more specific individual object. But I guess that in some specific regions it may be the amount of overlap that determines whether only the general category is recognized or a narrower identification takes place.
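
To make the overlap idea concrete, here is a minimal Python sketch. It assumes SDRs are plain sets of active bit indices. The sizes, the `classify` helper, and both thresholds are made up for illustration and are not taken from any Numenta code.

```python
import random

SIZE, ACTIVE = 2048, 40  # assumed SDR width and number of active bits


def random_sdr(rng):
    """Return an SDR as a set of active bit indices."""
    return set(rng.sample(range(SIZE), ACTIVE))


def overlap(a, b):
    """Count the active bits two SDRs have in common."""
    return len(a & b)


def classify(score, specific=35, category=15):
    """Map an overlap score to a match level (thresholds are hypothetical)."""
    if score >= specific:
        return "same object"
    if score >= category:
        return "same category"
    return "no match"


rng = random.Random(0)
mug = random_sdr(rng)

# Build a related object that keeps 30 of the mug's bits and swaps the other 10.
kept = set(rng.sample(sorted(mug), 30))
spare = [bit for bit in range(SIZE) if bit not in mug]
cup = kept | set(rng.sample(spare, 10))

print(classify(overlap(mug, mug)))  # same object
print(classify(overlap(mug, cup)))  # same category
```

A region that only ever sees large overlaps would behave like a specific-object matcher, while one tuned to the lower threshold would respond to the whole category.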

One of the difficulties I am seeing in trying to test my own theories is the fact that when position is encoded with feature, you tend to see a lot of overlap due to the cells representing the same position on different objects.

Now of course “position” can be encoded in the active columns, and “context” unique to different objects can be encoded in the active cells within those columns, but the difficulty comes when the position columns burst or have multiple predictive cells, and you end up encoding significant numbers of the same cells into multiple different object representations.

This is actually a similar problem I am seeing with sequence memory (which I mentioned on another thread), so guessing there is probably a common solution for both cases.
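
The overlap problem described above can be reproduced with a toy model. This is only a sketch: cells are `(column, index)` pairs, `CELLS_PER_COLUMN` and the example contexts are invented, and `active_cells` is a simplified stand-in for the usual "predicted cells win, otherwise the column bursts" rule.

```python
CELLS_PER_COLUMN = 4  # assumed, kept small for readability


def active_cells(columns, predicted):
    """Activate predicted cells per column; burst columns with no prediction."""
    cells = set()
    for col in columns:
        col_cells = {(col, i) for i in range(CELLS_PER_COLUMN)}
        winners = col_cells & predicted
        # With no predicted cell in the column, every cell fires (a burst).
        cells |= winners if winners else col_cells
    return cells


position_columns = [3, 7, 11]          # the same position sensed on two objects
context_a = {(3, 0), (7, 1), (11, 2)}  # cells predicted in the context of object A
context_b = {(3, 2), (7, 3), (11, 0)}  # cells predicted in the context of object B

a = active_cells(position_columns, context_a)
b = active_cells(position_columns, context_b)
burst = active_cells(position_columns, set())

print(len(a & b))      # 0: with clean predictions the two contexts stay separate
print(len(burst & a))  # 3: a burst contains every cell of A's representation
print(len(burst & b))  # 3: and every cell of B's representation
```

Anything learned from the bursting pattern ends up sharing cells with both objects, which is the ambiguity in question.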

I ran some tests with my children.
I sat on the couch with my eyes closed and music in my headphones, and they brought random objects from the room to my feet so that I could feel them (with my feet). I tried to identify each object by handling it for about 10 seconds. If I failed, I reached for it with my hands (and then identified it in about 1-2 seconds most of the time).

The only objects I could identify nearly as well were those that deform (that can be squeezed or squashed): a sock, a soft tube of cream, a toy ball. It was like discerning patterns of pressing and yielding, but that was something different from touch sensing.
I failed to perceive most of the rigid, hard-cased objects.
It was easier when I was able to roll the object or tip it on its side, but I feel that this is also something similar to detecting pressing-yielding patterns.

Also, the first experiments were without the music, so I easily identified the objects by the sound they made when I shifted or rolled them.

Gonna repeat these experiments with my nose and cheek area. :slight_smile:


Jeff and I talked a lot about this last week, and I recorded the session. Have a look:


The slides that Jeff goes through in the second video, around 2:00, give excellent detail on the inputs and outputs for layers L2/L3A, L4, L5, and L6A.

I gather from this that in a hierarchy of multiple regions, the FF output from L2/3A in the lower region becomes input to L4 in the higher region, and output from L5 in the lower region becomes input to L6A in the higher region. If that is correct, then it would imply that output from L2/L3A in the lower region also becomes input to L3B in the higher region.

This would imply that L3B in the higher region gains some location information (given that L2/L3A represents a set of features plus their locations). Just wanted to confirm whether that is a correct conclusion (or if there is additional output that is not depicted in these slides which does not contain location information).

EDIT – slight revision to my above conclusion: combining this with previous diagrams, I gather the FF output from R1:L5 passes through thalamus while FF output from R1:L2/L3A does not, but both are input to R2:L6A, and output of R2:L6A becomes input to R2:L5 (R2 being the higher region in a hierarchy). But my question is still the same – is it accurate to assume that R2 gains some location information from output of R1?

Thanks for the videos! I’ve only begun to digest the first 3 videos so far and I have two questions:

  1. What exactly is an “allocentric location” in terms of neuron activations? A core hypothesis presented is that every sensory region receives both sensory data and allocentric location data. My best guess is it’s like the grid cells briefly mentioned during the discussion. When you see an object through a straw you see a specific blue pattern of colors your brain previously learned is location “(0, 0)”. You then move your straw-eye right and see a red pattern your brain learned is location “(1, 0)”. You can then predict certain patterns will be in certain locations and therefore predict a learned object. If this is true, then the egocentric location for the straw-eye example would be the motor feature of the eye at certain angles. However, I thought grid cells were used to help an intelligence know where it is in an environment. Perhaps grid cells are both used for allocentric and egocentric locations, then?

  2. Does the specific order of patterns in a sequence matter for Temporal Pooling? I used to think yes, but now I’m leaning towards no. I used to think the Temporal Pooler is a set of neurons that remain active when observing a sequence of patterns. The sequence had to be specific because past context matters for accurate predictions. After all, you can’t jumble around the notes in a song and still make predictions of upcoming notes. However, now the Temporal Pooler observes a sequence of feature-locations, and these locations provide the “order of patterns”. Before, I’d have to run a finger along the object in a specific trajectory to identify the object. Now it seems I can poke the object in random locations as well as move my finger around the object. This leads me to believe “sequences of feature-locations” is a more fundamental concept than “sequences of patterns” and is a very powerful idea.
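
The straw-eye example and the "poke it in any order" intuition can be sketched in a few lines of Python. Everything here is hypothetical: the `OBJECTS` table, the object names, and the idea of using literal `(x, y)` tuples as locations are simplifications chosen only to show that feature-at-location evidence narrows the candidates regardless of order.

```python
# Hypothetical learned objects: each maps an object-relative location to a feature.
OBJECTS = {
    "flag_a": {(0, 0): "blue", (1, 0): "red"},
    "flag_b": {(0, 0): "blue", (1, 0): "green"},
}


def infer(sensations):
    """Keep only the objects consistent with every (location, feature) pair seen."""
    candidates = set(OBJECTS)
    for location, feature in sensations:
        candidates = {
            name for name in candidates
            if OBJECTS[name].get(location) == feature
        }
    return candidates


glimpses = [((0, 0), "blue"), ((1, 0), "red")]

print(infer(glimpses[:1]))  # both flags still match after one glimpse
print(infer(glimpses))                  # {'flag_a'}
print(infer(list(reversed(glimpses))))  # {'flag_a'}: visiting order does not matter
```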

I moved your post here because @jhawkins is more likely to see it in this forum.


I will attempt to answer some of these questions. In no particular order.

-The term “allocentric location” means a location in a reference frame relative to something else. It contrasts with “egocentric location”, which means a location relative to your body. I am looking at a chair in my living room right now. The chair has a location relative to the room; that is an allocentric location. The chair also has a location relative to me; that is an egocentric location. If I move, the egocentric location will change but the allocentric location will remain the same.

When the brain builds a model of all the things in the world, those models must be expressed/stored in allocentric coordinates. A coffee cup is defined by a set of features relative to each other (allocentric). If I want to grab the coffee cup then I need to know where the cup is relative to me (egocentric). Our big insight is allocentric locations are being used in all sensory regions.
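
As a purely geometric illustration of the chair example (not a claim about how neurons do it), the sketch below converts a room-frame point into an observer-centered frame. The function name `to_egocentric`, the 2D coordinates, and the convention that the observer's x axis points forward are all assumptions made for this example.

```python
import math


def to_egocentric(point, observer, heading):
    """Express a room-frame (allocentric) point in the observer's own frame.

    heading is the observer's facing direction in radians, measured in the
    room frame. In the result, x is distance ahead and y is distance to the left.
    """
    dx, dy = point[0] - observer[0], point[1] - observer[1]
    c, s = math.cos(-heading), math.sin(-heading)
    return (c * dx - s * dy, s * dx + c * dy)


chair = (2.0, 3.0)  # allocentric: fixed relative to the room

# Standing at the room origin, facing along the room's x axis.
print(to_egocentric(chair, (0.0, 0.0), 0.0))

# After walking to (2, 0) and turning to face along the room's y axis,
# the chair is about 3 units straight ahead.
print(to_egocentric(chair, (2.0, 0.0), math.pi / 2))
```

The `chair` tuple never changes; only the egocentric result depends on where the observer stands and which way they face.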

Grid cells encode where an animal is relative to some space, typically a cage or room in a maze. Grid cells allow an animal to know where it is in that space and therefore they play an essential role in navigating. Because grid cells give the same location regardless of what direction you are facing they could be considered allocentric. (There are separate cells that encode your direction.) Grid cells are similar to what we need for sensorimotor inference. For example, when recognizing an object via touch, the brain needs to know where your finger is relative to a touched object. Or the brain needs to know where on an object part of your retina is sensing. We are currently studying the literature on grid cells to get clues as to the exact neural mechanisms the brain might use to create allocentric locations for sensorimotor inference.

-“Temporal pooling” refers to an operation performed by a layer of neurons. “Pooling” means the cells in this layer will fire the same pattern for a set of input patterns. Each input pattern represents part of the same thing, so as the input patterns change, the pooling layer stays constant. For example, when moving my fingers over a coffee cup, the input patterns from my fingers will change. However, the temporal pooling layer will maintain a constant activation pattern which corresponds to the cup.

The “temporal” part of temporal pooling refers to how the temporal pooling layer learns. The TP layer assumes that patterns that follow each other in time are likely part of the same thing, so the TP layer learns by forming synapses to subsequent input patterns over time.

Temporal pooling can be applied to both sensorimotor sequences and high-order sequences. Sensorimotor sequences will typically not be in any strict or repeatable order. High-order sequences, like a melody, by definition follow a defined order.
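
A very reduced sketch of the pooling behaviour described above follows. It is not Numenta's temporal pooler: the `PoolingLayer` class, the tiny hand-written input patterns, and the all-or-nothing threshold are assumptions chosen to show one thing, namely that a stable set of cells can learn to respond to each of several input patterns that occurred while it was active.

```python
class PoolingLayer:
    """Toy pooling layer: each pooling cell connects to input bits it co-occurred with."""

    def __init__(self):
        self.synapses = {}  # pooling cell id -> set of connected input bits

    def learn(self, object_cells, input_pattern):
        """While object_cells stay active, connect them to the current input."""
        for cell in object_cells:
            self.synapses.setdefault(cell, set()).update(input_pattern)

    def infer(self, input_pattern, threshold=3):
        """Return the pooling cells with enough connected bits in the input."""
        return {
            cell for cell, bits in self.synapses.items()
            if len(bits & input_pattern) >= threshold
        }


layer = PoolingLayer()
cup_cells, ball_cells = {0, 1, 2}, {3, 4, 5}  # assumed stable object representations
rim, handle, base = {10, 11, 12}, {20, 21, 22}, {30, 31, 32}
surface = {40, 41, 42}

# Successive inputs that arrive while the same object cells are active get pooled.
for pattern in (rim, handle, base):
    layer.learn(cup_cells, pattern)
layer.learn(ball_cells, surface)

print(layer.infer(handle))   # {0, 1, 2}
print(layer.infer(rim))      # {0, 1, 2}: same output for a different input
print(layer.infer(surface))  # {3, 4, 5}
```

Because the learned connections carry no record of the order in which the patterns arrived, this toy version fits the sensorimotor case; a high-order sequence such as a melody would still need sequence memory in the input layer.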

Now onto Paul Lamb’s questions.

As far as we know, L5 projects through the thalamus and terminates in L6a, lower L5, L4 and lower L3. L2/3 projects directly to other regions and terminates in L4 and lower L3. These probably also connect to L6a and lower L5, but I don’t know whether that has been established.

[quote=“Paul_Lamb, post:80, topic:697”]
But my question is still the same – is it accurate to assume that R2 gains some location information from output of R1?
[/quote]

I wouldn’t say so. L2/3 represents objects. The location information is no longer available. Say L4 represents features at different locations on a coffee cup. L2/3 pools over L4 patterns and represents “coffee cup”. There is no longer any location information in L2/3.


Good point – in a sense there is still location information present in the encoding (since the “object” is in the simplest sense a sub-sampling of a union SDR from individual feature/location SDRs), but that information has become encoded as part of the semantics of an object (allowing various different “coffee cups”, “mugs”, and “thermoses”, for example, to become semantically similar to various degrees depending on their feature/location overlaps).

I am attempting to build out an implementation of the 2-layer circuit, and have made some assumptions on a few points. I thought I would ask for some critical feedback to improve my understanding of the theory. A couple of my assumptions are probably wrong due to misinterpretation of elements of the theory.

For reference, I drew up a quick diagram of current implementation. This differs from the slides that Jeff outlined, due to some gaps in my understanding which I’ll explain in a bit.

The first assumption I made is that temporal pooling in Layer 2/3A is done by forming proximal connections with outputs from Layer 4, rather than distal connections (this aligns with the slides). The cells in Layer 2/3A become active when their counterparts in Layer 4 become active, and they stay active and continue to grow more proximal connections over multiple time steps. This seems to be a safe assumption, since cells representing the whole object in Layer 2/3A need to become active when a subset of feature/location inputs occur. This “whole object” activation would seem to be necessary for the feedback to Layer 4 to work, since only putting cells in Layer 2/3A into predictive state could not be used to transmit information back to Layer 4.

My second assumption is about the feedback from Layer 2/3A back to Layer 4. I have currently implemented this as cells in Layer 4 forming distal connections with active cells in Layer 2/3A. This puts all feature/location cells for the object in Layer 4 into a predictive state. This is where my diagram above differs from the slides that Jeff outlined (he indicated that these are apical connections). However, that doesn’t align with my understanding of the purpose for this feedback (indicating a likely gap in my understanding). The purpose of this feedback as I understand it is to put cells representing the object in Layer 4 into predictive state in order to bias them. However, from conversations on other threads, my understanding of apical dendrites is that apical input causes cells to become active if they are in a predictive state from distal input. If that were to happen in this system, predictive cells caused by allocentric location distal input in Layer 4 would become active by apical input from Layer 2/3A, rather than the desired behavior of cells only becoming predictive as needed for biasing. I think I am missing an important point of the theory here.

My third assumption is that there is a slight change in how synaptic permanence decay happens. In the previous implementation, if a cell was in a predictive state in one time step but then not active in the next time step, the permanences of the synapses which led to the prediction would be decremented. I haven’t tested this, but intuitively it seems like it would lead to Layer 4 forgetting features of an object if they are not all sensed within a close time-frame. Maybe this is desirable (perhaps using a smaller permanence decrement rate). To address this I modified the logic so that as long as a cell that was predictive in one time step is still predictive in the next time step, the permanences of the synapses which led to the first prediction are maintained. It is possible that the gap in my understanding about apical feedback is what led me to this problem, though, so I’m thinking there may not actually be any issue here once that point has been clarified.
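
The modified decay rule from the third assumption can be written as a small function. This is a sketch of the rule as described in the paragraph above, not code from NuPIC. The function name `punish`, its arguments, and the default decrement are hypothetical.

```python
def punish(permanence, was_predictive, now_active, now_predictive,
           decrement=0.01, keep_while_predictive=True):
    """Return the updated permanence of a synapse that contributed to a prediction.

    Original rule: a cell that was predictive but did not become active has
    the synapses behind that prediction decremented.
    Modified rule (keep_while_predictive=True): skip the decrement while the
    cell is still predictive in the current time step.
    """
    if not was_predictive or now_active:
        return permanence  # nothing to punish
    if keep_while_predictive and now_predictive:
        return permanence  # prediction is still pending, so leave it alone
    return max(0.0, permanence - decrement)


# A feature that is still predicted but has not been touched yet is left alone.
still_pending = punish(0.5, was_predictive=True, now_active=False, now_predictive=True)

# Once the prediction lapses without the cell firing, the synapse decays.
lapsed = punish(0.5, was_predictive=True, now_active=False, now_predictive=False)

print(still_pending, lapsed)
```

Setting `keep_while_predictive=False` gives the earlier behaviour back, which makes it easy to compare the two rules on the same input.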


@jhawkins That clears up my confusion. Thank you.

@Paul_Lamb I am also unsure about the function of apical dendrite segments. My understanding is the apical dendrite affects the pyramidal neuron differently in different layers, so that may be where confusion arises. I think the apical dendrites in L4 would have to affect the predictive state of L4 cells because an intelligence that knows it’s observing an object would be biased towards a narrower set of possible future feature-locations.

As a related aside, here’s how I think a L4 cell responds to proximal, distal, and apical dendrites. For now I put a question mark for how the apical dendrite affects the cell states because of the confusion:


After reading and watching a number of explanations of your sensorimotor inference ideas, I think I have a fairly good grasp of the concepts you are trying to express. In many ways, these ideas fit comfortably within my own hypothetical mental models that I have been developing over the past few years while following your progress. There is one thing, however, that I keep stumbling over. Whenever you refer to allocentric coordinates as being in the frame of reference of the object, I find myself doing a mental substitution for a concept of feature-proximate locations. From my math/physics/CS background, object coordinates connote a preferred origin and orientation for the reference frame that is unique to the object, and every point (feature) on that object has a unique location and orientation with respect to that frame. To me, the feature-proximate reference frame concept carries with it a notion of relative distance and orientation of one feature to another.

Take your coffee mug example: imagine that I touch the middle of the handle with an index finger. From just the sensation of that one feature, I can probably recognize that I am touching something that feels like a coffee mug handle and also that there is one direction that is noticeably shorter than the other. However, I cannot tell which direction is towards the top of the mug. As I put down more fingers, more of the shape of the handle becomes available until my pinky either hits the bottom of the mug or the rim around the top. I guess my point is that until you actually touch something besides the handle, you wouldn’t be able to tell if you are touching a coffee mug or something else, like, say, a bracelet. In either case, with each finger that falls on the object, you update your internal representation with respect to those observed features (and the relative positions of your fingers due to proprioceptive inputs), and not with respect to some preferred coordinate system of a specific object that you may or may not be able to determine uniquely from the available inputs.

TL/DR: The use of the phrase ‘object reference frame’ when discussing allocentric coordinates has connotations that may be confusing to some. Would you consider using phrasing that connotes feature-centric reference frames instead? Of course, feel free to disregard this suggestion if your mental model differs from that which I have outlined above. If that is the case, I would appreciate if someone could elaborate on the differences so that I may update my own models.


EDIT: I just had another thought. Is it plausible that there might not be any internal coordinate system representation at all? Rather, relative coordinates might simply be encoded as some combination of the current state of the proprioceptive system and the set of recognized features.


The function of apical dendrites is not completely known. In L5 cells it has been proposed that apical input combined with distal input can cause the cell to fire. This makes sense to me as I consider the L5 to L6b to L5 path to be the primary path downward in the hierarchy. If a region up in the hierarchy wants to create a high level motor command, for example, it needs this command to propagate down the hierarchy without any feedforward sensory input.

Whether any other pyramidal cells can be activated via apical input is not known. Some scientists have told me they haven’t seen it, but maybe someone else has knowledge of this. My current assumption is that L4, L3, and L2 cells require FF input to fire.

BTW, there is confusion as to whether L4 cells get feedback from L2/3. Some studies show L4 cells lacking an apical dendrite altogether. The cells are then called “spiny stellate” cells. That doesn’t mean they don’t get feedback from L2/3, just that there isn’t an apical dendrite. The Human Brain Project has a very cool website.
This site shows ALL the connections between layers and cell types (for a rat I think). They show substantial connections from L2/3 to L4.


I agree that “allocentric” is confusing to many people. I am open to other terms. I didn’t follow your explanation of “feature proximate”.

We are currently using grid cells as a model. The brain solved this “location” problem a long time ago for navigation. Grid cells are part of that solution. It seems likely that the mechanism was preserved to use throughout the neocortex, so we are starting with that. This kind of representation of location (which is very similar to the GPS encoder we created a few years ago) doesn’t let you determine the distance between two points, and it is dimensionless.

Yuwei and Marcus at Numenta have been working on this, maybe they would like to chime in.


Thanks, it seems like I have the correct understanding for this part of the theory then. Feedback from L2/3A is meant to put cells in L4 into predictive state, not to activate them. From an implementation perspective, there wouldn’t really be a functional difference then whether apical or distal dendrites were used, except that it would provide some additional flexibility to use different sets of parameters between the two if needed (for example if you wanted to have different limits for number of synapses, permanence increment/decrement rates, etc. depending on if the input is coming from the location stream versus feedback from L2/3A).
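
The flexibility mentioned above (separate parameter sets per input stream) could look something like the following. This is only a sketch: `SegmentParams`, the field names, and every numeric value are placeholders invented for illustration, and treating either stream as sufficient to make a cell predictive is an assumption of this example rather than something the theory states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentParams:
    """Learning parameters for one kind of dendritic segment (values are placeholders)."""
    activation_threshold: int
    permanence_increment: float
    permanence_decrement: float
    max_synapses: int


# Hypothetical settings: one set for the location stream, one for L2/3A feedback.
DISTAL = SegmentParams(activation_threshold=13, permanence_increment=0.10,
                       permanence_decrement=0.02, max_synapses=255)
APICAL = SegmentParams(activation_threshold=8, permanence_increment=0.05,
                       permanence_decrement=0.01, max_synapses=128)


def is_predictive(distal_overlap, apical_overlap):
    """A cell is predictive if either kind of segment crosses its own threshold."""
    return (distal_overlap >= DISTAL.activation_threshold
            or apical_overlap >= APICAL.activation_threshold)


print(is_predictive(distal_overlap=14, apical_overlap=0))  # True
print(is_predictive(distal_overlap=5, apical_overlap=9))   # True
print(is_predictive(distal_overlap=5, apical_overlap=5))   # False
```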

Gotta love Blue Brain.

I agree grid cell activations are inherently dimensionless but I think they could also give an intelligence an intuition for distance, kind of like “dead reckoning” where I can move in the dark and still have a general sense of how far I’ve moved. I have not yet searched for actual papers, but my surfing brought up this image and paragraph found on this webpage which explains the concept further.

Modular Organization of Grid Cells. Within a local region of entorhinal cortex all grid cells have the same bump spacing (scale) and orientation (angle of the rows and columns), but differ in bump placement. The local regions are called ‘modules’. One can think of all of the cells in a module sharing a single bump pattern that can be moved horizontally or vertically, with a different x- y- offset for each cell. This means that as the rat moves, different grid cells in the module are excited. The distance and direction of movement determines the specific sequence of grid cells that will fire. Conversely, the sequence of activated grid cells signals speed, distance and direction of movement. This is thought to be the grid cell code.
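
The module description above can be turned into a toy model. This sketch makes several simplifying assumptions that are not in the quoted text: the grid is square rather than hexagonal, each module is just a 2D phase that wraps around, and the `GridModule` class, the scales, and the cell counts are all invented for illustration.

```python
class GridModule:
    """Toy grid cell module: a 2D phase that wraps around every `scale` units."""

    def __init__(self, scale, cells_per_axis=4):
        self.scale = scale
        self.n = cells_per_axis
        self.phase = (0.0, 0.0)

    def move(self, dx, dy):
        """Shift the shared bump pattern by the movement, wrapping at the scale."""
        px, py = self.phase
        self.phase = ((px + dx / self.scale) % 1.0,
                      (py + dy / self.scale) % 1.0)

    def active_cell(self):
        """Return the (x, y) offset of the cell whose bump covers the current phase."""
        px, py = self.phase
        return (int(px * self.n) % self.n, int(py * self.n) % self.n)


small, large = GridModule(scale=40), GridModule(scale=80)
sequence = []
for _ in range(4):  # four steps of 10 units along x
    small.move(10, 0)
    large.move(10, 0)
    sequence.append(small.active_cell())

print(sequence)  # the order of cells fired reflects distance and direction
print(small.active_cell(), large.active_cell())
```

After 40 units the small module has wrapped back to its starting cell while the large one has not, so the two modules together still distinguish the new location from the origin. In this toy version a single module on its own cannot report absolute distance, but the sequence of cells it passes through does track how far and in which direction the animal moved.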