A group of questions based on reading about the two grid-cell models

I’ve been reading the preprints of the newer papers on the site, and I have a superficial understanding, but not a sufficient understanding.
So here are some questions, for anyone who wants to answer them:

  1. It seems that in the paper “A Theory of How Columns in the Neocortex Enable Learning the Structure of the World”, the assumption is that as you explore an object (say with one fingertip and your eyes closed), you have a location vector and a sensory vector. The location vector synapses on the distal dendrites of cells in a minicolumn. The way the location vector is created is not a topic of that paper. In addition, there is an ‘output’ layer that has an arbitrary representation for a particular object, and which stays constant as you explore the object with your finger.
    But then there is another paper titled “Locations in the Neocortex” which came out later, which dispenses with the output layer, and has a sensory layer that modulates the distal synapses of a location layer, and vice versa. Furthermore, the location layer in this new paper is made up of several grid cell modules, each of which has cells that connect to the entire sensory layer.
    Is my understanding correct?

  2. The second article goes over standard grid-cell theory a bit, and says that the very first time that a rodent is released into an artificial environment (let’s say a walled space with various features at different points such as a tree and a stream), each grid cell module will start off with one bump (and only one bump) at a random point. If that is true, then if the mouse were released for the first time via a different door into the same environment, would it start off with the exact same bumps? If not, how do you have constancy in learning locations of features of the environment?

  3. Suppose that in a model based on time (the old sequence model where the current vector makes predictions via modulatory synapses on cells), you have learned a sequence of features A,B,C. Another learned sequence is E,C. Just for now, suppose that ‘C’ is not represented by a vector of many minicolumns, but just by one minicolumn. In that case, the firing of one particular cell in that minicolumn (when it is not bursting) could correspond to the previous history ‘E’, and the firing of another cell in that minicolumn would correspond to the history A,B. Is that correct?

  4. Why grid cells? According to the various papers, the representations of location are ‘dimensionless’. Not only that, but grid cells mean that the ‘origin’ of the object being looked at can be translated in space; in other words, recognition is invariant to origin. Grid cells don’t explain orientation invariance and scale invariance, though there is a suggestion in one paper that if the location vector also gets inputs from head-direction cells it would represent both location and direction, and orientation invariance (but not scale invariance?) would be achieved. So here is where I am confused: by ‘origin’, do the authors mean the ‘origin’ relative to the environment (room) that the object is in? Or do they mean you can start feeling the object starting at any feature, and that feature would be the origin? In other words, that the order of the sequence of touches would not matter? I still don’t understand why grid cells make the ‘origin’ unimportant, but it would help to know what is meant by ‘origin’.

  5. In the first paper, it says that lower levels might not sense enough of the environment to form a model of a big object. So would they learn ‘parts’ of a big object - like the leg of an elephant? What happens if the sensory patch that feeds the lower level strays to the ear of the elephant?

  6. In the later model: If your first encounter of an object activates random bumps in the grid cell modules in one column in your neocortex, will that happen 5 days from now if you encounter the same object? It would seem that you have to start off with the same representation.

  7. In the later model, there are two layers that interact. As they interact, they narrow down unions of possibilities in both layers. Let us suppose that the location vector is made up of 3 minicolumns. Two of them have one cell firing, but the third is bursting. Might that mean that the first two minicolumns have narrowed down the area to a small contiguous area in space but the third is required to narrow down to an even smaller area in space? (see picture)

    That’s all for now, thanks in advance.

  1. yes
  2. You need place cells for this, which can identify a room based on a landmark in the room; the grid cell modules’ “bumps” anchor to the place field. This is ongoing theory, so we don’t have all the answers yet.
  3. Yes, there would be another cell in that “C” minicolumn that represents ABC, but it would not be firing for the EC sequence.
  4. Why grid cells? Because they are observed in experimental neuroscience, and our mission is to understand how intelligence works in our brains, so understanding grid cells seems really important.
  5. It would be modeling an ear of an elephant.
  6. Once you’ve learned an object, you compare incoming sensory input to all the objects you’ve learned, narrowing down until you match the object (5 days from now). You don’t start with the same representation, but you end with it by narrowing down SDR unions through sensory movement in object space until you are left with the object. (related: How does short term memory work?)
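
To make the answer to question 3 concrete, here is a toy sketch. It is not the NuPIC or htm.core API; the `Minicolumn` class and its methods are made up for illustration. It assumes each learned history simply gets its own dedicated cell, which glosses over the real mechanism of distal segments recognizing previously active cells.

```python
class Minicolumn:
    """Hypothetical toy minicolumn: one cell per learned history."""

    def __init__(self, num_cells):
        self.num_cells = num_cells
        self.context_to_cell = {}  # learned history -> index of the cell it predicts

    def learn(self, history):
        """Dedicate the next free cell to this history (a tuple of prior features)."""
        if history not in self.context_to_cell:
            if len(self.context_to_cell) >= self.num_cells:
                raise ValueError("no free cells left in this minicolumn")
            self.context_to_cell[history] = len(self.context_to_cell)
        return self.context_to_cell[history]

    def activate(self, history):
        """Return the set of active cells when this column's feature arrives."""
        if history in self.context_to_cell:
            return {self.context_to_cell[history]}  # predicted: a single cell fires
        return set(range(self.num_cells))  # unpredicted: the whole column bursts


# One minicolumn standing in for feature 'C', with two learned sequences.
column_c = Minicolumn(num_cells=4)
column_c.learn(("A", "B"))  # sequence A,B,C
column_c.learn(("E",))      # sequence E,C

cells_after_ab = column_c.activate(("A", "B"))
cells_after_e = column_c.activate(("E",))
cells_unknown = column_c.activate(("X",))
```

In this sketch `cells_after_ab` and `cells_after_e` are two different single cells of the same minicolumn, and an unlearned history makes every cell in the column fire.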

I think there is something wrong with this assumption. The layer of cells getting location input can’t make any assumptions about the structure of the layer providing the input. This is a core principle of the Thousand Brains Theory that we talk about. As long as the location layer output is stable (meaning locations are unique given consistent input), that’s all that matters.



Thanks for the answers, I see I was off base in some places. Let me do some speculating here, and if I’m still going off base, let me know:

  1. A group of encoders has been developed for the Spatial Pooler. A simple one is for consecutive numbers. So an encoding for a number like 5 overlaps a lot with 6, but not so much with 8. So in that case, we deliberately design SDRs to reflect a similarity in the meaning of the input.
    Presumably, in the pooler layer, the SDR representation for 5 will also be more similar to that of 6 than of 8, even though the connections to the pooler layer are random (though you can have a radius beyond which there are no connections to a minicolumn).
    Now let’s take a 2-D location input into a 2-D spatial pooler. If this input represents location, then location is automatically topographically organized so that similar locations have a similar representation. I would think that again, even though the connections to the pooler layer are random, the upper layer SDR patterns for the ‘top left’ location would be more similar to ‘near top left’ than to ‘center’ or ‘bottom right’, for instance.
    You say that a goal of the theory is about keeping representations unique.
    And you also point out that, in the later theory, a location layer is interacting with the sensory layer above it, and is modified by that sensory layer and also modifies that sensory layer. So the location is not just location, it’s location for a particular object and may even be modified by the particular feature at that point.
    Likewise, at the sensory layer, the sensory SDR is not just a feature, it’s a feature at a location.
    I got confused because I thought a representation in the location layer of (let’s say) the center of the object is more similar to a location slightly to the left of that location (also on the object) than to a representation further away.
    If you have a feature ‘f1’ that activates a union of location representations then these representations have very little in common because each one is in a different object-space… So as you narrow down the union, you are narrowing between disjoint areas, areas that in the real world ‘out there’ might not be close to each other at all.
    On the other hand, if you have an SDR where a few minicolumns are bursting instead of having one cell predicted, and you manage to eventually predict the correct cell in each, you are also reducing ambiguity and getting closer to the right location representation, but does that mean you are getting closer the way a person gets closer to his car in a parking lot as he walks toward it? Or are you jumping about randomly in the world ‘out there’?
    I don’t think that the above is so important in understanding the theory, but there usually is an assumption in neural nets that similar input in the outside world gives rise to similar representations. In fact, this has to be true due to ‘noise’.

  2. As far as the question of what happens when you enter a new environment for the first time, I understand that your grid cells ‘anchor’ to that environment, as if a grid of invisible lines is laid down so that the top left corner (let’s say) of the grid is at a particular distance from the top left corner (let’s say) of the environment. I understand that Numenta tries to be as true to the biology as possible, but I was wondering why the brain itself uses grid cells. Maybe it’s a natural outcome of solving the problem of navigation, but are there particular advantages? One advantage the Numenta paper does mention is that it doesn’t matter whether you take a direct route or a circuitous route from place A to place B: the same grid cell will fire at ‘B’ in either case. But the paper also says an advantage is ‘origin independence’, and I did not understand what that meant - maybe it is another way of saying the same thing.

  3. Finally, could you look at the theory as if it is about sets of pairs?
    For instance, you have a set of locations that are paired with a corresponding set of features. { (location1, feature at 1), (location 2, feature at 2)… }
    In the earlier theory you had prediction pairs as well, but the pair was {(past history, current item) , (past history incorporating current item, next item)… }
    Maybe other pairs could exist, such as (velocity1, feature1) or (distance, feature1).
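
As a rough sketch of the sets-of-pairs view in item 3, each object below is a set of (location, feature) pairs, and recognition narrows the set of candidate objects as more pairs are sensed. The object names, features, and the `narrow` helper are all invented for illustration. The sketch also assumes the location is already known in a shared frame, which is a big simplification: in the “Locations in the Neocortex” paper the location is object-specific and is itself ambiguous at first.

```python
# Hypothetical objects, each a set of (location, feature) pairs.
OBJECTS = {
    "mug": {((0, 0), "curved"), ((1, 0), "handle"), ((0, 1), "rim")},
    "can": {((0, 0), "curved"), ((1, 0), "curved"), ((0, 1), "rim")},
    "box": {((0, 0), "flat"), ((1, 0), "edge"), ((0, 1), "flat")},
}


def narrow(candidates, location, feature):
    """Keep only the candidate objects that contain this (location, feature) pair."""
    return {name for name in candidates if (location, feature) in OBJECTS[name]}


candidates = set(OBJECTS)                              # start with every known object
after_first = narrow(candidates, (0, 0), "curved")     # first touch is ambiguous
after_second = narrow(after_first, (1, 0), "handle")   # second touch disambiguates
```

Here the first touch leaves both the mug and the can as candidates, and the second touch leaves only the mug.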

Thanks again for your answers. This forum in general is really interesting, and every question gets a reply.


I need to clarify that this is not how it works in HTM theory. We use an entire layer to represent location, a layer using SP as described in Locations in the Neocortex: A Theory of Sensorimotor Object Recognition Using Cortical Grid Cells. We are not encoding location using an encoder like you describe, it’s more complicated than that. I talked about it more in the HTM Hackers' Hangout - Mar 1, 2019 near the end of the stream. I suggest you try to read and understand the paper cited above.

Path integration is a better term than “origin independence”. It means once you learn how space works wrt your body, you can use that representation in all situations.
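
A minimal sketch of path integration, under loud assumptions: each grid cell module is reduced to a 2D bump position that wraps around at the module's period, on a square tiling with integer units (real modules tile space with a rhombus/hexagonal pattern). The periods, starting bumps, and the `integrate` helper are hypothetical. The point is only that a direct route and a roundabout route with the same net movement end on the same bump in every module.

```python
def integrate(start_phase, movements, period):
    """Shift a module's 2D bump by each movement in turn, wrapping at the period."""
    x, y = start_phase
    for dx, dy in movements:
        x = (x + dx) % period
        y = (y + dy) % period
    return (x, y)


PERIODS = [5, 7, 11]              # hypothetical module scales, arbitrary integer units
START = [(2, 3), (0, 6), (9, 1)]  # arbitrary initial bump position in each module

direct = [(4, 3)]                          # straight from A to B
roundabout = [(10, 0), (-3, 8), (-3, -5)]  # wanders, but the net movement is (4, 3)

end_direct = [integrate(s, direct, p) for s, p in zip(START, PERIODS)]
end_roundabout = [integrate(s, roundabout, p) for s, p in zip(START, PERIODS)]
```

Both routes end at the same bump positions in all three modules, because only the net movement (modulo each period) matters.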

Re #3 above, I’ve talked to Jeff about this, using the term “map” instead of “sets of pairs”. Maybe listen to this podcast with Jeff: part 1 / part 2.


I looked at the two podcasts you referred to on the representation of ‘location’. They were good! Do you have any others?
I was inspired to have a few more questions. I should look at the latest HTM-school videos and so forth, but let me start with these:

  1. If you were to take one column’s layer 6, and take all the grid cells in it, sort them by the module they correspond to, and then concatenate their activities into one big vector, would you have an SDR?

  2. You say that ‘uniqueness’ is what is important, specifically: that “locations are unique given consistent input”. You also talk about a ‘space’ consisting of these vectors. So let me see if I get this correctly.
    Suppose I am looking at a wall lined with bookshelves in my room. I have an intuition that I’m using Cartesian coordinates - or at least a 2D map where points are organized in 2D space. Now are you (actually Numenta) claiming that each point really corresponds to an arbitrary SDR (as in the SDR you can create by unrolling grid-cell modules)?

  3. If so, do these unrolled SDRs have any particular relation to each other? The reason I ask is that back in the days I was experimenting with the Spatial Pooler, if you fed an encoder into a spatial pooler, even though the resulting SDRs in the spatial pooler looked random, there was a measure of similarity. Similar objects had greater overlap of bits (active neurons) than dissimilar objects. This overlap is closely related to Hamming distance, which counts the bits that differ: the greater the overlap, the smaller the Hamming distance.
    But with grid cells, I don’t think “Hamming Distance” is the measure of similarity. If two locations are close to each other (in object space), does that mean they are similar in any way? I did read that Marcus (on your team) came up with a way of finding differences between two locations of an object. So does this mean that these unrolled grid-cell vectors are not completely arbitrary, and that there is an algorithm for finding which vectors are closer (in distance) to a particular vector than others?

  4. You talk about Object space. So if I put an apple in the middle of my table, and then compare it with an equal size orange in the middle of my table, would the location vector that represents the center of the apple be different than the location vector that represents the center of the orange? If they are different, then a location vector represents not only a location, but an object. Also, if too many objects were learned, the same exact location vector could represent one location on one object, and a totally different location on a different object.

  5. Generally, predictive signals represent some kind of context. In the case of the sensory layer, you are saying that the context is ‘location’. In the case of the location layer, I’m not sure what the context is, though I know it comes from the sensory layer. But thinking about the sensory layer, suppose we look at one minicolumn with N cells. Suppose a particular set of minicolumns stand for one feature. Would the bottom cell firing by itself in all those minicolumns, versus the top cell firing by itself in all those minicolumns, represent the exact same feature, with just a different context?

  6. The idea of a moving rhombus through each module confused me because of tiling. If cell ‘A’ fires at regular intervals as you go through space, then it would seem to me that any moving rhombus would often have to return to the region around cell ‘A’. Maybe you can point me to some grid-cell basics article that would clarify that point.

  7. On the topic of object compositionality: Once you have a set of displacement vectors that characterize a mug and the logo on it, and maybe the handle of the mug, you have a representation of the mug purely as displacement vectors. These displacement vectors are also unique to the mug: the same distance on the mug and on a vase would not have the same displacement vector. Is there some kind of mental ‘stack’ that keeps track of the fact that you zoomed in (maybe recursively) into a subobject of a subobject of an object, so that you can backtrack to the original object? Does this compositionality idea possibly apply to any hierarchical structure, even abstract ones?
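
To make questions 1 and 3 above concrete, here is a toy sketch of “unrolling” grid cell modules into one binary vector and comparing locations by overlap and by Hamming distance. Everything in it is an assumption for illustration: positions are 1D integers, each module has exactly one active cell (its phase, i.e. position modulo a made-up period), and `unroll`, `overlap`, and `hamming` are hypothetical helpers, not Numenta code.

```python
def unroll(position, periods):
    """Concatenate one one-hot vector per module into a single binary vector."""
    bits = []
    for period in periods:
        phase = position % period  # which cell in this module holds the bump
        bits.extend(1 if i == phase else 0 for i in range(period))
    return bits


def overlap(a, b):
    """Number of bits that are active in both vectors."""
    return sum(x & y for x, y in zip(a, b))


def hamming(a, b):
    """Number of bit positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


PERIODS = [5, 7, 11]  # hypothetical module periods

v0 = unroll(0, PERIODS)
v1 = unroll(1, PERIODS)    # the adjacent location
v35 = unroll(35, PERIODS)  # a distant location
```

Each vector has 23 bits with 3 active, so it is sparse. Under the one-cell-per-module assumption, the adjacent locations 0 and 1 share no active bits, while the distant locations 0 and 35 share two of three (35 is a multiple of both 5 and 7). So in this toy, bit overlap does not track physical closeness, although a real bump that activates several neighboring cells could behave differently. With the same number of active bits `w` in both vectors, Hamming distance is `2 * (w - overlap)`, so the two measures carry the same information.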


I will leave it to @rhyolight to answer questions about how Numenta thinks grid cells and layers work.

As far as your nibbling at the edges of object representation (questions 4 & 7) - we were toying with that same question in another thread. When you are thinking about apple vs orange on the table you have to sort out where the properties are being represented. I see this as more a question of hierarchy and not so much a layer thing; the H of HTM. The WHAT stream is processing the cluster of features that make an object; the WHERE stream is dealing with the location. These fragments are combined at about the level of the temporal lobe into an experience.

I refer you to this post on the WHAT and WHERE streams, perhaps more detailed than the current question warrants:

This is how I visualize the various maps and streams working together - a bit cheesy but very visual:


Please do this before I spend time answering your detailed questions. It looks like these videos will inform many of them.