In the recent live stream, Jeff introduced a big idea: the initial function performed on input coming into a cortical column is to pool those inputs with an orientation signal. In the discussion, he concluded that the output of this function would be equivalent to place cells.
This theory makes a lot of sense to me, and fills some gaps in the earlier models like sensor orientation and scaling. However, where I became a little stuck in understanding the theory is with what happens after that initial function. In the whiteboard diagram, Jeff depicts this output of place cells being used as the input to a layer which gets its context from grid cells representing location. This seems redundant to me.
Grid cells represent a specific point on a specific object, but isn’t that the same information as the output of the first function (which is also a representation of a specific point on a specific object)? It seems like the output of the first function is the location, and thus another layer of grid cells to depict location should not be necessary. Additionally, I really wasn’t able to identify on the whiteboard which of the layers would represent a stable representation of the object.
Just from the lines on the whiteboard (not being a neuroscientist, this could be a stupid conclusion), it would make sense to me if L5 were performing a pooling function, and its output depicted stable object representations. This would explain the lateral connections with L5 in other columns (for voting) and the projections through the thalamus to another region (supporting composite objects and hierarchy). From this perspective, the output of L2/3 could be interpreted as a stable representation of the object from one position, and the output of L5 as a stable representation of that object from all positions.
But this still doesn’t explain what is happening in L6B, which is that layer labeled as “location” on the whiteboard. Why would L5 need input from L2/3 representing a specific location on a specific object, and from L6B also representing a specific location on a specific object? Why is this layer needed (it seems redundant)? Is this related to displacement and composite objects?
Anyway, I’m still mulling over the ideas. Figured I would post my initial thoughts and have some discussions with the smart folks in the community here to try and smooth out the rough areas.
I’m pretty sure I understood that one system (layers 6b & 5?) is for placing in lateral space and another system (layers 6a & 4?) in radial space. When you move in a straight line the first system changes. When you turn, the second system changes. When you move and turn at the same time, both systems are impacted. And they feed off each other when necessary (through layers 3/4).
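As a rough illustration of the two-system reading above (not a model of any cortical layer), here is a minimal Python sketch in which a lateral state and a radial state are updated by different kinds of movement. The `Pose` class and its `move` and `turn` methods are hypothetical names invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Hypothetical pose: a lateral position plus a radial heading."""
    x: float = 0.0
    y: float = 0.0
    heading: float = 0.0  # radians, 0 means facing along +x

    def move(self, distance: float) -> "Pose":
        # Straight-line movement changes only the lateral (x, y) state.
        return Pose(
            self.x + distance * math.cos(self.heading),
            self.y + distance * math.sin(self.heading),
            self.heading,
        )

    def turn(self, angle: float) -> "Pose":
        # Turning in place changes only the radial (heading) state.
        return Pose(self.x, self.y, (self.heading + angle) % (2 * math.pi))


start = Pose()
moved = start.move(2.0)                    # lateral state changes
turned = start.turn(math.pi / 2)           # radial state changes
both = start.turn(math.pi / 2).move(1.0)   # both states change
```

Note that this keeps the two states fully independent, which is exactly the assumption questioned later in the thread: if the orientation representation is unique to each location, a straight-line move would change both.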
I’ve been thinking about this the whole morning, and I would like to speculate a bit. (Please take this with a grain of salt):
When you move in a straight line, essentially you’re considering features in 2 dimensions. Features come closer or go further. Every feature on that straight line moves with an equal delta.
Only when you add a second system, one that somehow works with an angular input, can your brain build a model in 3 dimensions.
So it is conceivable that there are animals with a more primitive cortex (fish? snakes?) that only see the world in 2 dimensions.
And to go completely nuts: what if we added a third system to that stack? Would we be able to intuitively consider the world in 4 dimensions? 8-D.
As I understand it, when you move in a straight line both systems would change, because the representation for orientation is unique to a specific point on a specific object (i.e. the same orientation at a different point on the same object would have a different representation). When all orientations for a point are pooled, the result is a representation that depicts a specific location on a specific object.
Anyway, this is definitely the area that is causing my confusion. It seems redundant that you would need two layers both representing the same information (a specific point on a specific object), one providing proximal input to a layer, and another providing distal (context) to that same layer. Like you, I’m still mulling this around in my brain, though, so there may be a perfectly obvious reason for this configuration.
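To make the pooling idea concrete, here is a small sketch that treats each representation as a set of active bit indices and pools by taking the union over orientations. Everything in it is an assumption for illustration: the 2048-bit size, 40 active bits, the random codes, and the use of a plain union as the pooling function.

```python
import random


def make_sdr(rng, size=2048, active=40):
    """Return a random sparse code as a frozenset of active bit indices."""
    return frozenset(rng.sample(range(size), active))


rng = random.Random(0)
orientations = ["north", "east", "south", "west"]
points = ["A", "B"]

# Each (point, orientation) pair gets its own code, so the same
# orientation at a different point has a different representation.
views = {(p, o): make_sdr(rng) for p in points for o in orientations}


def pool(point):
    """Union of all orientation codes at one point (assumed pooling rule)."""
    pooled = set()
    for o in orientations:
        pooled |= views[(point, o)]
    return frozenset(pooled)


def overlap(a, b):
    """Number of active bits shared by two codes."""
    return len(a & b)


place_a = pool("A")
place_b = pool("B")
```

Any single view at point A is fully contained in `place_a` and barely overlaps `place_b`, so the pooled code identifies the point regardless of orientation. What it does not contain is any notion of how A and B are arranged relative to each other.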
Again, I’m not certain of this. But don’t forget that a feature is a single point. It’s a very small part of an object, captured with a small part of your sensory input array. I could imagine that your fingertips each project dozens if not hundreds of such points. And that for each one somewhere in your neocortex this entire double system is at work.
When you, Paul, move through the world in a straight line, it’s almost impossible for your radial system not to be affected, since you have stereoscopic vision, stereophonic hearing, many fingers, etc. But in your brain, each independent input signal is treated separately at first, and combined later after going through both displacement systems.
I agree the explanation makes sense (integrating linear and angular movement), but I’m having trouble relating that to the layer connections. Specifically, this is the area that is confusing me:
That taken with the fact that part of L5 appears to be involved in voting, and part involved in hierarchy, it seems L5 should be representing a stable object representation. I can’t visualize how you could get that from forming representations of “place cells” in the context of “location” (seems redundant).
Paul, remember much of what Matt and I discussed yesterday is speculative.
I propose there are two different metric spaces, one is radial and one is linear. You are correct that the output of the radial space defines a location in the world, like place cells. But that system only tells me what will happen if I change my orientation while staying in that location. It doesn’t tell me how to move from one location to another. It is like standing at an intersection in a town. I can recognize where I am by looking in different directions, but I need a 2D map to know what will happen if I walk east for a block. Knowing where you are is not sufficient to build a model of the town. Grid cells can predict your new location when walking east, but head direction cells are needed to predict what you will see. You need both.
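The town analogy can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the `TOWN` layout, the landmark names, and the `walk` and `predict_view` helpers are all hypothetical. The point is that one lookup handles movement between locations and a separate lookup handles what you see when facing a direction at a location.

```python
# Hypothetical town: what you see from each intersection in each direction.
TOWN = {
    (0, 0): {"N": "bakery", "E": "park", "S": "river", "W": "school"},
    (1, 0): {"N": "library", "E": "station", "S": "river", "W": "park"},
}

# Displacement on the 2D map for a one-block walk in each direction.
STEP = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}


def walk(location, direction, blocks=1):
    """Grid-cell-like step: predict the new location after a linear move."""
    dx, dy = STEP[direction]
    return (location[0] + dx * blocks, location[1] + dy * blocks)


def predict_view(location, heading):
    """Head-direction-like step: predict what is seen at a location."""
    return TOWN[location][heading]


here = (0, 0)
there = walk(here, "E")                  # where will I be?
seen = predict_view(there, "N")          # what will I see when I face north?
```

Knowing every view at `(0, 0)` says nothing about `(1, 0)`. Predicting `seen` needs both `walk` (the map) and `predict_view` (the orientation-dependent lookup).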
It isn’t clear that the system needs a temporal pooling layer for the object. You need one for place (L3) because this is the input to the grid cell/L5-L6b network, but the L6b-L3 sensory-motor mechanism works fine without temporal pooling. You only need temporal pooling if you require a stable representation of the object. I am not saying we don’t have a stable object layer, only that it isn’t needed. The other L5 cell type might be doing something else.
Does this address your questions?
Thanks for the reply, @jhawkins !
Yes, I have since been thinking about the idea that some of the output from L5 goes through the thalamus to some other column (maybe higher in a hierarchy or part of a composite). As part of that other column’s input, a stable representation from that input would form in that other column’s L2/3 (performing the same cortical algorithm). So from that perspective, it might not be necessary to form a stable object representation in the first column (assuming the hierarchy or composite includes a path for feedback). The representation of a “Place” in that other column could serve the purpose of a stable “Object” representation for the first column.
So putting aside hierarchy for a moment, and thinking about what is happening with the two signals in L5: its input is a representation encoding the semantics of a “Place”, without the context of “where” that place is in relation to anything else on the object it is a part of. The “where it is” context could then come from L6B.
I honestly don’t think special place, grid, orientation, etc. cells need to be programmed/invented. I’ve thought for a while now that they are just naturally arising firing patterns that receive input from multiple sources. It makes sense that they are found furthest from the input if that’s accurate. If the brain is nothing but an experience-capturing device that captures many different features and their variants within the context of others, I think it’s likely that grid cells and place cells and such are just deep SDRs of a sort.
I think for that to naturally arise you need a bit more structured SDRs, though. I think the Kanerva-style SDCs might not get good results because of the ability of firing cells to drift and skip around all over the layer. But if you did have enough stable structure, then given the temporal context of the features a system is seeing/feeling, the highly multimodal nature of the input, the feedback from the physical actions, and all of it culminating right before the hippocampus (plus information looping around the hippocampus as well, I suppose), it seems to me that place cells and the like are probably just deep, multimodal, feedback-driven SDRs.