Since 2017, a series of Numenta papers have explored location and object recognition, the most recent being 2019’s “A Framework for Intelligence and Cortical Function Based on Grid Cells in the Neocortex” (A Thousand Brains) and “Locations in the Neocortex: A Theory of Sensorimotor Object Recognition”. Recurring features are HTM cells, grid cells, and a location signal, amongst other elements.
At the research meeting on the 27th of July, Jeff Hawkins described some of his current thinking, which revises some of the previously held hypotheses. Key ideas mentioned included (my transcriptions): “the input is just a flow pattern, can’t tell if it arises from moving in a direction or changing orientation”, “there are just movements, and they don’t have to correspond to a dimension in physical space”, “the cortex models everything in these movement vectors”, and “all important path integration happens in 1D”.
In this post, I set out to rethink the way in which location works based on Jeff’s latest thinking.
To keep things grounded, I will begin in the world of the parietal cortex, at the end of the dorsal pathway. This is an area concerned with change and action. The research gives us some known functions:
- the LIP (lateral intraparietal area) holds a map of the saliency of spatial locations, and the attention paid to them.
- the MIP (medial intraparietal area) encodes the location of reach targets in nose-centred coordinates. There is evidence that these are remapped as the eyes move.
- the AIP (anterior intraparietal area) is responsible for grasping and manipulating objects through visual and sensory inputs and connections to the ventral premotor area.
Idea: I propose moving away from representing location as 2-D coordinates via the grid cell, and towards a representation of location based entirely on movement and the potential for movement. Notice that a grid cell is also a good solution for representing a vector, with the magnitude and direction expressed as the location of the bump within its 2-D space, relative to some origin.
Think first about the egocentric point of view. Suppose we allow reach targets to be extended to give us vectors to any point of interest in our saliency map. Then we can build a representation of a location by sampling the current active set of vectors, themselves represented using one or more grid cells (a single grid cell can of course encode multiple vectors).
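To make the idea concrete, here is a minimal sketch of a location built from a set of vectors, each stored as a bump position. It is only an illustration under my own assumptions: the function names, the square (rather than hexagonal) tiling, and the three module scales are all invented for the example and are not taken from the Numenta papers.

```python
def vector_to_phase(dx, dy, scale=1.0):
    """Map an egocentric vector (dx, dy) to a bump position in [0, 1) x [0, 1).

    Assumption: the module tiles space as a square lattice with period `scale`.
    Real grid cell modules tile space hexagonally.
    """
    return ((dx / scale) % 1.0, (dy / scale) % 1.0)


def encode_location(vectors, scales=(1.0, 1.4, 2.0)):
    """Represent a location as the bump positions of every active vector,
    one bump per (vector, module scale) pair."""
    return {
        name: [vector_to_phase(dx, dy, s) for s in scales]
        for name, (dx, dy) in vectors.items()
    }


# Hypothetical points of interest taken from the saliency map.
salient = {"chair": (0.5, 2.0), "door": (-3.0, 4.5)}
code = encode_location(salient)
```

A single module is ambiguous, because the bump wraps around every `scale` units. Using several scales together is one way the ambiguity could be reduced, which is why the sketch keeps one bump per scale.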
A thought experiment. If I look around, then close my eyes, I can probably stride forward pretty confidently to touch a piece of furniture that I had seen ahead of me. However, if I wait a minute first, I will be much less certain. Confidence weakens with time. I think the location model is a dynamic picture that needs to be constantly refreshed, typically from visual input. With our eyes closed, we often rebuild our confidence by reaching out and touching something.
The various vectors are active in various strengths, the strongest, presumably, being in front of us and closest to us. Now consider moving towards an object in front of us. The vector to it becomes the most active, and drives the motor connections. This vector and all the others are updated as we move (1-D path integration?). We can use secondary vectors to help determine the motion. For instance, if I am a rat moving down a tunnel, I can monitor the sequence of vectors to the tunnel wall on my left for anomalous changes, to detect when I veer off course.
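The updating, fading and monitoring described above can be sketched roughly as follows. This is a toy under my own assumptions: the coordinate convention (x along the nose, y to the left), the exponential decay with time constant `tau`, and the fixed `tolerance` for the tunnel wall are all hypothetical choices, not claims about the biology.

```python
import math


def update_vectors(vectors, forward, turn=0.0):
    """Path-integrate one step: move `forward` along the nose, then turn left
    by `turn` radians. Vectors are egocentric: x ahead, y to the left."""
    c, s = math.cos(turn), math.sin(turn)
    updated = {}
    for name, (x, y) in vectors.items():
        x -= forward  # everything ahead gets closer by the distance moved
        # Turning the body left rotates every target to the right in body coordinates.
        updated[name] = (x * c + y * s, -x * s + y * c)
    return updated


def decay(strengths, dt, tau=30.0):
    """Fade each vector's strength with the time since the last sensory refresh."""
    k = math.exp(-dt / tau)
    return {name: w * k for name, w in strengths.items()}


def veering(wall_distances, tolerance=0.1):
    """Flag an anomalous change in the sideways distance to a tunnel wall."""
    return any(abs(d - wall_distances[0]) > tolerance for d in wall_distances)


vectors = {"table": (3.0, 0.0)}
vectors = update_vectors(vectors, forward=1.0)  # the table is now 2.0 ahead
```

In this picture the motor system only ever needs the strongest vector, and a secondary stream such as `veering([1.0, 0.9, 0.7])` acts as a cheap error signal.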
Now, think about exploring an object using touch. Suppose the input is not a series of location/feature pairs, but instead a series of vector/feature pairs, with vectors in local nose coordinates, i.e. assuming a continuation of previous direction. As with the existing theory, this requires being able to express movement information in allocentric form.
A realistic stream from a blindfolded exploration would probably start with a series of random movements, many leading to no feature. Then, as features were discovered, movements would become purposive, trying to get from one known point to another. The stream from a visually guided exploration would presumably move from one visible feature to another in a more directed fashion, as with the streams used in the papers.
The learning process is much as before, except that now learning is captured as sequences of vector/feature pairs. Just as with learning location/feature sequences, this also captures a route, but in a form that is easier to express as motor output. But this switch does not make any difference that I can see to the use of location data in the “Thousand Brains” theory. If order only slowly emerges, or more generally, if features are not always encountered in the same order, then multiple routes will be learnt, but that should not cause any problems.
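As a rough illustration of the difference between the two kinds of stream, the sketch below converts a location/feature sequence into vector/feature pairs in nose coordinates. The function name, the `(turn, distance, feature)` layout and the example features are my own inventions for the purpose.

```python
import math


def to_vector_feature_pairs(visits, heading=0.0):
    """Convert [(x, y, feature), ...] into [(turn, distance, feature), ...].

    `turn` is measured relative to the direction of the previous movement
    (the "nose"), so no absolute coordinates survive in the output.
    """
    pairs = []
    px, py, _ = visits[0]
    for x, y, feature in visits[1:]:
        dx, dy = x - px, y - py
        direction = math.atan2(dy, dx)
        # Wrap the change of direction into [-pi, pi).
        turn = (direction - heading + math.pi) % (2 * math.pi) - math.pi
        pairs.append((turn, math.hypot(dx, dy), feature))
        heading, px, py = direction, x, y
    return pairs


# Hypothetical exploration of a mug: three touches at known locations.
route = [(0, 0, "handle"), (1, 0, "rim"), (1, 1, "logo")]
pairs = to_vector_feature_pairs(route)
```

Each pair reads directly as a motor instruction: turn this much, move this far, expect this feature. If the whole route is rotated and the starting heading is rotated with it, the same pairs come out, which is the sense in which the sequence is independent of any fixed coordinate frame.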
Now to the classic logo on a mug scenario.
To break the problem down a little more, I will first consider a variation that avoids the need for complex mapping between frames of reference. Suppose I am walking in a field, and encounter a crop circle consisting of a flattened circular area containing a regular pentagon of depressions where ‘landing legs’ made contact.
I postulate that a second worldview corresponding to the crop circle will be constructed, with its own saliency map and stream of allocentric movement vectors. By focusing my attention, I can shift between the two worldviews. Note that if I’ve been walking around inside the crop circle, exploring it, and I suddenly wonder where I am in the field, I will typically look up, which is consistent with the idea that I need to refresh my broader sense of location, since it has faded while I was paying no attention to it. There is no transform between the two views: instead, I move in each of them in parallel. There is no re-anchoring, and no need for displacement cells. Each of them has its own set of locations, which, within the crop circle at least, can be associated with each other in pairs.
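A toy version of moving in two worldviews in parallel might look like the sketch below. Everything in it is an assumption made for illustration: the `Worldview` class, the `fade` factor, the starting positions, and the simplification that both views receive the same movement in the same axes.

```python
class Worldview:
    """A frame that tracks position purely by accumulating movements."""

    def __init__(self, name, start=(0.0, 0.0)):
        self.name = name
        self.position = start
        self.confidence = 1.0

    def integrate(self, dx, dy, attended, fade=0.9):
        """Apply one movement; confidence fades unless this view is attended."""
        x, y = self.position
        self.position = (x + dx, y + dy)
        self.confidence = 1.0 if attended else self.confidence * fade


field = Worldview("field", start=(40.0, 25.0))
circle = Worldview("crop circle")

associations = []  # pairs of locations, one from each worldview
for dx, dy in [(1.0, 0.0), (0.0, 2.0), (-0.5, 0.0)]:
    for view in (field, circle):
        view.integrate(dx, dy, attended=(view is circle))
    associations.append((field.position, circle.position))
```

Note that nothing in the sketch computes one position from the other. The two views are linked only by the recorded pairs, and the unattended view simply loses confidence until something, such as looking up, refreshes it.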
However, I think that the notion of a join between the two worldviews is best thought of at this level as restricted to their shared boundary. In this case, I can consider the worldviews to be locally coextensive, because I can switch from one worldview to the other at any point. However, if I had encountered a six-foot closed dome in my field, this would not be the case. If I make my dome house-sized, then climb on top of it, I still have two worldviews. However, if I look up from my lofty perch, the broader one probably just places me at the dome, leaving the dome-shaped view to locate me more precisely.
Connection is a richer idea still. For instance, if I see a rectangular bin standing on the floor next to a wall, I will assume that it is not permanently fixed there. However, if I see the same bin halfway up the wall, I will assume that it is fixed, because otherwise it would fall. Clearly, there is more involved than visual and movement input.
So, coming back to the mug, I again postulate that I build a second worldview for the logo. This time, as well as the allocentric movement requirement, I need to be able to build a saliency map for the logo, which also seems to demand some sort of transformation of visual input. I don’t know how this works, but observe that, granted this extension, the case should present no more difficulties than the crop circle.
Thinking about compound objects in general, vectors seem at least anecdotally natural. Personally, I find it a lot easier to send a little imaginary ant scurrying around the surface of a compound object directed by nose-coordinate vectors than I do to envision how the same object is put together using local coordinates.
A vector approach seems to make a lot of sense on the dorsal path, which is all about movement. It is less clear if the same applies on the ventral path, with its more static viewpoint. On the one hand, it is known that both dorsal and ventral paths are affected by both magno and parvo input, and why multiply mechanisms? And as Jeff Hawkins points out in the Thousand Brains paper, the neurobiology is very similar. On the other hand, could it be that the ventral view of the world just models object decomposition in terms of visually observable boundaries? Or it could be that the same biology is being used differently, i.e. grid cells in their entorhinal 2-D mode. In a world of voting, there’s nothing wrong with having two substantially different mechanisms at work in the same area.
The same question arises again when considering the frontal lobe. However, vectors seem to make a lot of sense when thinking about higher-level concepts. When I learn a subject, I become more comfortable with it as I learn more touch points within it, their features and how they can be manipulated, and the valid routes that compose them. An actionable framework of understanding. When a child is learning how to solve simple mathematical word problems, they are learning how to translate words into a route of vectors connecting the successive mathematical operations that need to be composed, and simultaneously exercising and reinforcing the underlying framework of mathematical concepts.
Finally, speaking as an HTM learner, I am still very much in the process of building my own HTM conceptual network, a.k.a. climbing a learning curve. I’ve tried hard not to make any really egregious mistakes, or waste anybody’s time, but please grant me your forbearance wherever I have failed.