Follow-up question to Podcast 1 with Jeff (location, orientation, and attention)

I very much enjoy reading the posts, but agree that building it in software is a better use of your time. Once it’s modelled in code, each component’s behaviour can be more easily communicated and it can even be used to drive visualisations.


On the contrary. There are separate areas of the brain dealing with riding a bike and with playing music, and according to the podcast each of those areas maintains its own location relative to the task that area is doing. The ears have a location relative to the start of the song, and the legs have a location relative to the road. It seems that each area of the brain is just doing its own thing.

I think all of your thought experiments require attention and possibly memory. My understanding of grid cells is that they are an entirely unsupervised system and thus should be capable of functioning independently of other brain areas. I’m sure that attention can manipulate and interfere with cortical grid cells, but I also think that the grid cells can function just fine without any attention or memory.


Yes, agree that is what the podcast is describing and part of current HTM theory. My confusion comes from the fact that this view seems to be contradicted by the “wormhole” thought experiment in the podcast (which, if thought about more deeply, reveals that object space is global across all sensors in relation to what you are attending to)

Yes, I think this is a key point. The confusion will likely only be cleared by better understanding how attention works. I’m comfortable with saying that the HTM theory for location/orientation encoding across the neocortex could still be accurate, and that the mechanism of (conscious) attention happens at a higher level of abstraction than that.

Good conversation. I see two related, but different, problems being discussed. One is about attention, whether there is a global position and orientation, and the other is about how learning occurs across sensory areas, such as learning a cup with my left hand and recognizing it with my right hand. These are both topics we have discussed at length at Numenta so I can share some of our thinking, perhaps that will be useful.

First, attention. I like to distinguish between what “you” are aware of and what you are not aware of. I believe that most of what is happening in the neocortex is not available to introspection. There is a simple non-dualistic explanation for this. If we assume that only the representations at the top of the hierarchy are available for episodic memory and for verbal expression, then most of what happens in the cortex will be invisible to these mechanisms. When I drive a car, part of the cortex is attending to varied items on the road, making decisions, and taking actions. If this can be handled lower down in the cortex then I will not be aware of this activity. At any moment I can direct my attention to this activity, and this causes it to rise to the top and be available for introspection and verbalization. Also, if the lower regions experience something that they can’t handle, that will force my attention (bottom up) to these items. I will stop talking while I attend to the anomaly.

The thought experiment of being pushed in a chair is perhaps not the best to illustrate these principles. Instead imagine that you are walking around the room and touching the cup. You can attend to the cup and then attend to the room and never get confused. We do this kind of thing all day long. By having someone push the chair, the normal mechanisms for keeping track of location (path integration based on motor) are lost. The only way to keep track of your location in the room is to constantly attend to what you are seeing. When you walk yourself, you don’t need to attend and you won’t get confused. So, I believe attention is occurring everywhere in the cortex. However, only one attended thing can make it to the top. That is what you are aware of; the rest is not available to introspection.

The second topic is how can one finger learn an object but then another finger recognize the object? Or how can part of an object be learned with one finger and other parts of the object be learned by another? Or how can I learn an object by looking at it and then recognize the object by touch alone? We call this the “disjoint pooling” problem (not a great name). I believe I wrote about this in another forum post. Briefly, we don’t know how it occurs, but we have several ideas. 1) Information spreads laterally from a column that is getting input to adjacent columns that are not. This is documented and the spread occurs in L3 and L5. I believe this is part of the solution. When column A learns something it trains its neighbors if they are idle. 2) The other involves the hierarchy and attention. Imagine I have two hierarchies, one for my hand and one for my foot. The hierarchies are separate (disjoint) except the topmost region is shared by both the foot and hand hierarchies. I then learn a cup with my hand. This means all the regions in the hand hierarchy learn the cup, including the topmost region. The regions in the foot hierarchy have learned nothing. What happens if I try to infer the object by touching it with my foot? Unlike the hand, this cannot be done low down in the foot hierarchy. What we do is attend to one part of the foot, say the big toe, and with attention move the toe, sense a feature, move the toe and sense a feature. These features will be something the toe can recognize like an edge or a rounded surface. These basic features are passed all the way up the hierarchy to the shared region where recognition occurs. You can’t do this without top down attention to the toe. As I said, these are just ideas. The problem has a solution.
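To make idea 2) a bit more concrete, here is a minimal Python sketch of a top region shared by two otherwise disjoint hierarchies. Everything in it is an assumption made for illustration: the class name `SharedTopRegion`, the feature labels, and above all the simplification that an object is just a set of basic features with no locations attached. It only shows the flow of the argument (learn via the hand, infer via the toe one attended feature at a time), not how the cortex does it.

```python
class SharedTopRegion:
    """Hypothetical region at the top of both the hand and foot hierarchies."""

    def __init__(self):
        # object name -> set of basic features learned for that object
        self.objects = {}

    def learn(self, name, features):
        """Store basic features passed up from whichever hierarchy is sensing."""
        self.objects.setdefault(name, set()).update(features)

    def infer(self, sensed_features):
        """Return the objects consistent with every feature sensed so far."""
        sensed = set(sensed_features)
        return sorted(
            name for name, feats in self.objects.items() if sensed <= feats
        )


top = SharedTopRegion()

# Learned with the hand; the foot hierarchy itself has learned nothing.
top.learn("cup", ["rounded surface", "edge", "handle"])
top.learn("box", ["flat surface", "edge", "corner"])

# Attend to the big toe: move, sense a feature, move, sense a feature.
toe_features = ["edge", "rounded surface"]
candidates_per_touch = [
    top.infer(toe_features[: i + 1]) for i in range(len(toe_features))
]
```

After the first touch both objects are still candidates; the second touch narrows it to the cup. The sketch assumes the toe and the hand pass up comparable basic features, which is exactly the part that is still an open question.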


I particularly like this concept of “rising to the top” (it would explain a lot of things). I’ll have to do some thinking how to actually implement a mechanism like that (it would require something different than a traditional view of hierarchy I think). Seems like more of a “tangle” than a “blockchain” (reference to IOTA)

I’ve given some thought to how the concept of something lower “rising to the top” of a hierarchy might be implemented. This is what I’ve come up with so far (please feel free to tear this apart if I am way off the mark).

The first thing that becomes clear is that there isn’t any obvious mechanism for literally pushing something lower in a hierarchy up through each level to the top (at least I couldn’t imagine one that seemed plausible).
Instead, the top of the hierarchy must have direct connections to each of the lower levels. This is of course a deviation from the normal view of hierarchy, so open to criticism here.

Borrowing some ideas from the Global Workspace paper that @Bitking referenced in his Grids to Maps thread, you start with feed forward input traversing a hierarchy level by level in the traditional sense:


Next, you add direct connections from each level to the top of the hierarchy. The top node will be receiving anomalies from each level, and sending stimulation:


The signals will compete at the top node, and the most interesting/anomalous signal will be selected. The originating node will be stimulated. This stimulation will combine with the feed forward signal, and excite the node:


Each node will have lateral connections to other hierarchical branches across other modalities. When a node is excited, it will send a stronger signal from its lateral connections. In this example, let’s imagine this node in the hierarchy is related to sensory input from your hand, and the anomaly was an unexpected bump on your favorite coffee cup:


The lateral signal will recruit nodes from other hierarchies and modalities. In this case, let’s assume there is a connection with a hierarchy related to sensory input from your eyes. Whatever the eyes were attending to before subconsciously will be overruled, and they will now be recruited to help resolve the anomaly with the coffee cup:


At this point, the global (conscious) attention has shifted to the coffee cup, and now coordinate spaces across the various sensors involved are all in relation to the cup.
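The steps above can be sketched in a few lines of Python. This is only a toy under my own assumptions, not an HTM implementation: the `Node` class, the scalar `anomaly` score, and the `shift_attention` function are all hypothetical names, and "competition at the top node" is reduced to picking the largest anomaly score.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Hypothetical node in one sensory hierarchy."""
    name: str
    anomaly: float = 0.0              # how unexpected the current input is
    excited: bool = False             # set when the top node stimulates it
    attending_to: str = "background"
    lateral: list = field(default_factory=list)  # nodes in other hierarchies


def shift_attention(nodes, subject):
    """Top node picks the most anomalous signal and stimulates its origin."""
    winner = max(nodes, key=lambda n: n.anomaly)
    winner.excited = True
    winner.attending_to = subject
    # The excited node sends a stronger lateral signal and recruits others.
    for other in winner.lateral:
        other.attending_to = subject
    return winner


hand = Node("hand-level-2", anomaly=0.9)       # unexpected bump on the cup
eye = Node("eye-level-2", anomaly=0.1, attending_to="window")
ear = Node("ear-level-2", anomaly=0.2, attending_to="radio")
hand.lateral.append(eye)                       # assumed hand-to-eye connection

winner = shift_attention([hand, eye, ear], "coffee cup")
```

In this toy run the hand node wins, the eyes are pulled away from the window to the cup, and the ear node (no lateral connection assumed) keeps attending to what it was before.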


Quick question, when you say “foot and hand hierarchies”, are you talking about a direct touch sensation only, or does it include anything motor related? For example does it include holding a toothpick in your hand and using it to examine the shape of the cup?

If I may, I’d like to offer a modification to the thought experiment of @Paul_Lamb that may get at one (or more) of the points that @jhawkins has made. Imagine that you are in a gym and are holding a basketball, and that somewhere in the gym is a hoop. Initially you know where you are with respect to the hoop, and could probably easily shoot the ball towards the hoop and come pretty close. Now, imagine that you are blindfolded and told to wander around the gym for a while (10, 20, 30 seconds), and then take a shot (or throw the ball towards the hoop). Chances are, the longer you wander, the worse your final aim will be, but odds are that you would still have some sense of the general direction of the hoop. It would be interesting to see if there is a significant difference whether you were attending to the gym or the ball during the wandering, but I suspect that in both cases your aim would be much better than if someone were to wheel you around the gym in an office chair before taking the shot.

The point of the exercise is that there are many ways in which your brain can maintain spatial awareness, and it’s remarkably clever at picking up on subtle cues for maintaining relative orientations and positions. I often find that when I close my eyes and wander in an environment, that I attend to audible cues as reference points. Sometimes, I imagine a visual representation of the audible sources embedded in my surroundings shifting as I move, essentially binding the sounds to locations. I do much the same thing when I’m wandering around my house in the dark at night. I find that I’m usually able to put out my hand to find the door frame within a few inches of where I expect it to be.


Now, I’d like to get to the crux of the problem that I have been trying to grapple with ever since the topic of grid cells was introduced: What is the mechanism that drives these cells to fire in the grid pattern? With only raw sensory input to go on (i.e. no explicit position information), how do the cells know they have arrived at a given location/orientation? I could understand it if a particular body pose (proprioceptive inputs) generated a certain SDR that happened to select for a given set of grid cells. However, if my understanding is correct, these same grid cells would still be active regardless of the body pose when the body returns to the previous location. I could also understand it if a given combination of environmental cues gave rise to an SDR that would also select for a location/orientation representation. But what causes the same cells to fire in such a regular pattern w.r.t. location (hexagons!) even if the environmental cues do not have a corresponding regularity?

I suppose it all comes down to path integration. It’s not the position, it’s the motion. Is there a network of cells that recognize temporal sequences corresponding to spatial translations and rotations? Are these transitions somehow translated into recurring patterns of grid cell activation?


I’m not clear on this either, but in some sense I believe this has to be learned (since it applies to abstract dimensionless concepts as well as the more obviously spatial ones, and can be applied to weird physics like Portal or Paper Mario). Perhaps a clue can be found in another phenomenon that occurs if you reach into a black box to feel an object (or in your example when you take off your blindfold after wandering a bit around the gym). You started with one spatial sense (perhaps random in the case of the black box, or perhaps drifting from reality in the case of the gym). As you get more sensory clues from your movements (touching with your fingers, saccades from your eyes, etc), you then recognize (based on the sequences of actions/inputs, or voting between sensors) a location on the object (or in the room) that you remember, and the spatial sense suddenly “snaps” to the one that is remembered.

This same “snapping” strategy could potentially be used to learn different paths to the same location on a novel object, room, or concept as well. The system might continue to build out a new random set of spatial information as it performs actions (likely relying on semantic similarities with other previously learned objects, rooms, or concepts, so probably not usually completely random). When it recognizes a location it has been to previously, it can snap back to those representations, and associate them with the motor action performed at the previous location.

This is an area of active research. My favourite theory is (Kropff and Treves 2008). They propose a spatial pooler (it’s not called that in their paper though) with two additional mechanisms. One of the mechanisms causes grid cells to respond to large contiguous areas of their input, the other mechanism (named fatigue in their paper) shapes those large receptive fields into spheres. Then the spatial pooler’s competition packs these spheres in as tightly as possible, which on a two dimensional plane yields a hexagonal grid.
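As a quick check on the last step only (tight packing of equal circles in a plane favours a hexagonal arrangement), here is a small Python calculation. It is plain geometry, not the Kropff and Treves model: the function name `packing_density` is mine, and it simply compares how much of the plane touching circles cover when their centres sit on a square lattice versus a hexagonal one.

```python
import math


def packing_density(lattice_angle_deg, radius=1.0):
    """Fraction of the plane covered by touching circles on a lattice.

    The two lattice vectors both have length 2 * radius (neighbouring
    circles just touch) and are separated by lattice_angle_deg.
    Each parallelogram unit cell contains exactly one circle.
    Valid for angles between 60 and 90 degrees (no overlap).
    """
    side = 2 * radius
    cell_area = side * side * math.sin(math.radians(lattice_angle_deg))
    return math.pi * radius ** 2 / cell_area


square_density = packing_density(90)  # pi / 4, roughly 0.785
hex_density = packing_density(60)     # pi / (2 * sqrt(3)), roughly 0.907
```

The hexagonal lattice covers about 91% of the plane against about 79% for the square one, which is why a competition that packs round fields as tightly as possible would be expected to end up hexagonal.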

Kropff and Treves, 2008:

My guess is a lot of the input to entorhinal cortex about walking is pre-processed so it doesn’t need to handle a huge sequence of exact muscle positions and movements. For example, it might get a subcortically produced movement direction or movement direction change signal.

The purpose of path integration is sort of to get rid of the sequence aspect. I don’t think it’s feasible to learn every single sequence between locations on an object. Those sequences don’t apply to every object because the solid object is in the way of many of them, so if it does path integration only by pooling sequences, it would take forever to learn.

Some theories use oscillations (with or without phase offsets that produce a travelling peak along the cortical sheet) to do path integration and form grid cells. Oscillations are periodic like grid cell response fields and can form hexagonal grids with multiple oscillations interfering.
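Here is a minimal sketch of the interference idea, under a strong simplification of my own: instead of oscillations in time, it sums three static cosine gratings whose directions are 60 degrees apart. The function name `grid_rate` and the `spacing` parameter are hypothetical. The point is only that the sum peaks on a hexagonal lattice.

```python
import math


def grid_rate(x, y, spacing=1.0):
    """Sum of three plane-wave cosines oriented 60 degrees apart.

    The wave number is chosen so that neighbouring peaks of the
    summed pattern are `spacing` apart.
    """
    k = 4 * math.pi / (math.sqrt(3) * spacing)
    total = 0.0
    for angle_deg in (0, 60, 120):
        a = math.radians(angle_deg)
        total += math.cos(k * (x * math.cos(a) + y * math.sin(a)))
    return total


# Six nearest peaks around the origin, at 30, 90, ..., 330 degrees.
neighbour_peaks = [
    (math.cos(math.radians(30 + 60 * i)), math.sin(math.radians(30 + 60 * i)))
    for i in range(6)
]
peak_values = [grid_rate(px, py) for px, py in neighbour_peaks]
```

All six neighbours reach the same maximum as the origin (a value of 3), and points in between, such as `grid_rate(0.5, 0.0)`, are lower, so the firing fields form hexagons.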

I’m extremely biased but I think the hypothesis I described in a post is close to the truth.

I think there needs to be some form of pure path integration. Otherwise, it would have to experience every path and pool them, like the old temporal pooler or union pooler was meant to pool sequences. It’s physically possible to do path integration without that, so I don’t think it does path integration by pooling sequences.
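To illustrate what I mean by pure path integration, here is a toy Python sketch. The assumptions are mine: a single module with a square (not hexagonal) tiling, integer movement steps, and a made-up `integrate` function. It just accumulates each movement into the current phase, wrapping at the module's period, so the result depends only on the net displacement and not on the sequence that produced it.

```python
def integrate(start, movements, period=10):
    """Update a 2-D grid phase by adding each movement, wrapping at `period`."""
    x, y = start
    for dx, dy in movements:
        x = (x + dx) % period
        y = (y + dy) % period
    return (x, y)


start = (0, 0)
path_a = [(3, 0), (0, 4)]             # right, then up
path_b = [(0, 4), (3, 0)]             # up, then right
path_c = [(5, 5), (-2, -1)]           # overshoot, then come back
path_d = [(13, 4)]                    # one big step that wraps around

end_points = [integrate(start, p) for p in (path_a, path_b, path_c, path_d)]
```

All four paths land on the same phase, `(3, 4)`, without any of them having been experienced or pooled beforehand. The last one also shows the ambiguity of a single module: a displacement of 13 and a displacement of 3 look the same.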

Or at least, that’s probably not the only way. There are probably multiple complementary methods of path integration at play. Some sort of automatic, sensory-insensitive version of path integration, and then some sort of flexible, sequence learning path integration (perhaps related to object behavior).

The automatic path integration brings you from point A to point B on a fairly routine journey. But then from point B to point C, you take an elevator, which the automatic system can’t handle. So you learn sequences from point B to point C. The automatic system gets you to point B and point C other ways, so you already know what those places look like, and you just need to learn that one sequence. As a more grounded example, moving your fingertip through the air leads to consistent transitions between locations, but with object behavior and not being able to phase through objects, you also need flexible, learned path integration.

There are like ten functional layers, twenty if you count what and where pathways separately, and thalamus, basal ganglia, and so on, so there’s plenty of room for multiple forms of path integration.


There is sure to be a great deal of pre-wiring involved, given that eons of generations have been born onto a planet with some stable physical parameters (I like to use the example of wildebeest infants, which are able to run from predators within hours of birth). I’m sure you are right that there are a lot of potential mechanisms that the brain can leverage (many of which have been around a lot longer than the neocortex).

There is a mechanism for this. It is widely believed that the thalamus is integral to attention. The most important input to every region goes through relay cells in the thalamus. These cells have two modes of operation (burst and tonic), plus there is an inhibitory network in the thalamus. The relay cells can be switched between tonic and burst modes by either a top-down feedback signal from the receiving region or a very strong signal from the lower sending region. The idea is that an unexpected input causes the relay cells to attend to the unexpected input, and the higher region can direct attention as well. For example, I can tell you to attend to some area of your visual field (top down). Or, if something unexpected happens, your attention will automatically go there (bottom up); you can’t prevent it.

This is an interesting question that we are still trying to understand. Switch to vision. Do the grid cells in a column in V1 represent the location of the eye in the space of the viewed object, or do they represent the location of the actual feature on the object? With touch it is easy to imagine that the location represented by a column in S1 is both the location of the skin and the sensed feature, but as you point out you can touch something with a tool such as a toothpick. Do the grid cells represent the location of the finger or the location of the tip of the toothpick? I believe the cortex represents the location of the sensed feature and not the sense organ. This is cleaner and more powerful; however, it then raises the question, how does a column know where the sensed feature is? How does it know the location of the tip of the toothpick? We don’t know. We have some ideas but no answers yet.


Thanks, I just want to make sure I understand the mechanism you are describing here for the bottom-up route.

It sounds like you are describing the nodes in a hierarchy routing their most important input through the thalamus between levels of the hierarchy. Something like this:


When something sufficiently anomalous occurs (I’m assuming some competition here), the thalamus gates input from other regions:


And global attention shifts due to feedback from the top of the hierarchy cascading down (essentially the anomalous node gets an overwhelming vote, due to other input being blocked from traveling up the hierarchy):


The mechanism is simpler than you are describing. Take two regions, R1 projects to R2. The feed forward connections from R1 to R2 are routed through the thalamus. It appears that the thalamus plays a role in what part of the output of R1 is attended to by R2. What exactly attention is and what exactly the thalamus does when it passes on the signal is not known. Much of the anatomy and cellular mechanisms are known, but the function is not clear. If attention is related to the “burst vs tonic” modes of the relay cells, as some believe, then both top down and bottom up input to the thalamus can direct attention. I was only letting you know that bottom up control of attention is possible and that some of the mechanisms are known. I would recommend reading Murray Sherman’s book about the thalamus if you want more detail.
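For anyone who wants something to experiment with, the two routes can be written down as a very small Python sketch. To be clear, this is a made-up toy: the function `attended_inputs`, the scalar signal strengths, and the `strong_signal` threshold are all assumptions, and it says nothing about what the relay cells actually compute. It only captures that R1's output passes through a relay on its way to R2, and that an input can become attended either because R2 asks for it (top down) or because it is very strong (bottom up).

```python
def attended_inputs(r1_output, top_down_request=(), strong_signal=0.8):
    """Return which parts of R1's output the relay marks as attended.

    r1_output: dict mapping an input label to a signal strength in [0, 1].
    top_down_request: labels the receiving region R2 asks to attend to.
    strong_signal: hypothetical strength at which bottom-up input
                   switches the relay cells on its own.
    """
    attended = set()
    for label, strength in r1_output.items():
        if label in top_down_request:    # top-down route
            attended.add(label)
        elif strength >= strong_signal:  # bottom-up route
            attended.add(label)
    return attended


r1 = {"left visual field": 0.3, "right visual field": 0.2, "sudden flash": 0.95}

bottom_up_only = attended_inputs(r1)
with_request = attended_inputs(r1, top_down_request={"left visual field"})
```

With no request, only the sudden flash gets through as attended; with a request from R2, the requested area is attended as well, and the flash still can't be ignored.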


Ok, so the thalamus is essentially responsible for establishing and/or enforcing the context (the thing that should be attended to). How specifically it does so is not entirely known (but in theory could involve something like the global workspace, or something else entirely).

I’ll play around with some of these ideas then and see what works (probably will deviate from the biology for now). I can always go back to the drawing board later when more is understood about the mechanism.

Thanks again for taking the time to reply to my queries!

So I thought I understood what you were describing, until I found mention of “Displacement Modules”, which I don’t understand. But I thought I’d share what I thought anyways since I think it’s interesting:

Layer 6 contains grid cells, which are organized into mini-columns. The mini-columns accept distal input from other layer 6 grid mini-columns, so that layer 6 forms a temporal memory, representing the current location in the context of the previous locations. Layer 6 would represent the current location in a trajectory of motion, if that makes sense? Then layer 5 would be doing temporal pooling over these layer 6 cells, which would cause layer 5 to represent the overall trajectory of motion. Layer 5 then projects to the muscles which drive that motion.

When the animal wants to go somewhere, the thalamus simply activates the location in layer 6 where it wants to go. The layer 6 mini-columns burst which represents every trajectory passing through the destination, which in turn activates the layer 5 actions which pass through both the current location and the destination.

A separate set of layer 6 grid mini-columns is doing something different, by accepting distal input from layers 2/3 which represent the current object being sensed. These grid cells represent the location on the object, without any memory of how it got to this location. These cells are specific to both the object and the location on the object. These grid cells project to layer 4 where they’re used to predict sensory features given the sensor’s current location and the object being sensed.

I am interested in reading that paper which you mentioned about animals finding their way home in the woods, if you would be willing to share the citation.

Paul Lamb’s point about recognizing with one hand an object that that hand never felt before is not just a problem for later. I think we can draw some conclusions immediately, or at least raise some questions.
Let’s suppose that your finger traced out an object with a peculiar shape. Let’s also suppose that there are no lateral connections between the finger on your hand and your big toe. Later you are blindfolded, and trace an object with your big toe, and recognize it as having the same shape as the earlier object you traced with your finger.
The implication, if we believe Numenta’s theory, is that there is a higher level representation somewhere, maybe more abstract, that has bidirectional connectivity with both the finger and the toe. It has to be bidirectional, because the toe has to send up some pattern that is in some way similar to the pattern that was sent up by the finger, and the abstract pattern has to tell the toe what it is feeling.
But that raises another question. What is similar about the two patterns - the one sent up by the toe, and the one sent up by the finger?
Is it the SDRs for the features being sensed? I would think not, no two SDRs are alike.
Is it the location vectors for the toe and the finger? Again, Numenta theory says no, in fact even adjacent fingers that connect to adjacent columns produce different location vectors for the same location on the same object. A big toe would also have different location vectors.

So this raises a problem.

The big toe pattern has to be similar in some way to the finger pattern. The big toe pattern is a paired location vector with a sensory vector. But if the features are not similar, or the location grid cell patterns are not similar, or both together are not similar, then how can that abstract pattern at a higher level know that the finger and the toe felt the same object?

Here is one possibility: perhaps the set of displacement vectors ARE the same for both the finger and the toe. But according to Numenta theory, they are not - displacement vectors are dependent on the location vectors that they are computed from, and so are unique to each sensory patch!

I don’t really see this as a problem. I don’t talk much about the “what” and “where” pathways and how they converge, because I don’t know much about that yet, but if you consider the egocentric locations of your sensors, they should represent space in a comparable way.
