Marcus Lewis presents an unsupervised learning technique that represents inputs using magnitudes and phases in relation to grid cells. He proposes an alternate view of grid cells that enables the creation of maps of novel environments and objects using a predictive basis.
Marcus first gives an overview of the core pieces of the algorithm and the assumptions behind them, and provides examples to support his viewpoint on grid cells in minicolumns. He then presents a simulation and shows how the technique can be implemented in artificial neural networks and in biological tissue.
References in the presentation:
Benjamin Dunn - Toroidal topology of grid cell ensemble activity
There’s so much to unpack here. I really need to sit down with @mrcslws for a serious discussion about the ideas he’s expressing here. My own thoughts on grid cell modules have been running pretty much in parallel to what he’s been presenting over the past year.
At about T+8:00, @subutai asks if the input needs to be continuous in time. @mrcslws replies that there are no specific assumptions about discontinuities in the input, but that in this work he focused on continuous inputs. @jhawkins then comments that some seemingly discontinuous events, such as saccades, could be interpreted as really fast motions, since path integration is occurring during saccades.
I’ve given some thought to continuous input (e.g. persistent sensory input) vs. discrete input (e.g. sudden shift in a significant portion of the sensory input). It occurs to me that while continuous path integration of physical (or conceptual) space is likely to be enabled by the grid cell modules described here, the discrete jumps in input (like tapping an icon on your phone, or walking into a different room) would require another mechanism to reanchor the grid cell modules to the new context. Whether these are place cells, or a specific population of cells which respond to the input (e.g. by filter matching), would probably depend on the specific context.
The 2D polar coordinate frame is a good place to start. By training the input filters to recognize shifted periodic features, it would seem like you are training a sort of Fourier Transform filter. This filter would essentially extract the real and imaginary components of spatial (or temporal) frequency responses to the input.
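To make that concrete, here is a minimal sketch (my own, not anything from the presentation) of a cos/sin filter pair recovering the magnitude and phase of a shifted periodic feature; the frequency, sampling, and offset are made-up values for illustration:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200, endpoint=False)   # 1D "environment" coordinate
freq = 3.0                                        # spatial frequency of the filter pair (made up)

cos_filter = np.cos(2 * np.pi * freq * x)         # the "real" filter
sin_filter = np.sin(2 * np.pi * freq * x)         # the "imaginary" filter, 90 degrees shifted

def respond(signal):
    """Project the input onto the filter pair, Fourier-coefficient style."""
    a = signal @ cos_filter / len(x)              # real component
    b = signal @ sin_filter / len(x)              # imaginary component
    return a, b

# A periodic feature sitting at some spatial offset
offset = 0.13
feature = np.cos(2 * np.pi * freq * (x - offset))

a, b = respond(feature)
r, phi = np.hypot(a, b), np.arctan2(b, a)
print(r, phi / (2 * np.pi * freq))                # r ~ 0.5, and the phase recovers the 0.13 offset
```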
Now, as @mrcslws points out, restricting these filters to two cells anchored to two specific phases (90 degrees apart) is a problem for biological plausibility, since it would require both positive and negative response values in order to represent a complete ring. @subutai is correct in proposing that adding more cells to the module and forming an over-complete basis set would be the way to address this issue. In the diagram below, you can see that a minimum of three cells would be needed to represent an entire ring while avoiding negative response amplitudes from the grid cells.
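To complement the diagram, here is a quick numerical sketch of the three-cell idea (my own construction, assuming raised-cosine tuning curves rather than whatever tuning the actual model uses): any point (r, phi) on the ring can be carried by three nonnegative activations and recovered exactly:

```python
import numpy as np

prefs = 2 * np.pi * np.arange(3) / 3              # three preferred phases, 120 degrees apart

def encode(r, phi):
    """Nonnegative 3-cell code for the point (r, phi) on the ring."""
    return r * (1 + np.cos(phi - prefs)) / 2      # raised cosine, never negative

def decode(v):
    """Recover (r, phi); the constant (DC) parts cancel across the three cells."""
    z = np.sum(v * np.exp(1j * prefs))
    return (4 / 3) * np.abs(z), np.angle(z)

v = encode(1.0, 2.3)
print(np.round(v, 3))                             # all activations are >= 0
print(decode(v))                                  # recovers (1.0, 2.3)
```

With only two cells (pure cos/sin), the same trick forces one of the activations to go negative for half the ring, which is exactly the biological-plausibility problem above.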
With more grid cells in a module, each representing a finer-grained (higher resolution) phase shift for a particular feature, it may be necessary to include inhibition to reduce the response of the filters that are close to, but not the best match to, the phase orientation of the input.
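One generic way to model that kind of inhibition (again my own sketch, not part of the presented model) is broad phase tuning followed by a divisive, winner-sharpening normalization across the cells of the module:

```python
import numpy as np

n_cells = 10                                      # the resolution I recall being mentioned
prefs = 2 * np.pi * np.arange(n_cells) / n_cells

def sharpened_response(phi, exponent=4.0):
    raw = np.maximum(0.0, np.cos(phi - prefs))    # broad, rectified cosine tuning
    sharp = raw ** exponent                       # near-but-not-best matches get suppressed
    return sharp / sharp.sum()                    # divisive normalization

print(np.round(sharpened_response(1.0), 3))
```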
When you mention the 2D polar coordinate frame, I'm not sure if you are seeing that as an implementation detail. Perhaps you are confusing the pair of scalar activations with neurons/cells. The intention was to model a ring (representing a 1D grid cell module) with a pair of scalar activations. A single pair of scalar activations could model any number of cells in the ring (i.e. more or less phase resolution). I'm not sure if Marcus states the actual phase resolution he modelled, but I remember the number of 10 cells being mentioned several times.
I am fully aware of what the A/B cells are representing. They are representing a Cartesian coordinate pair (a,b) in a complex plane (a + ib). These coordinates can be mapped to a pair of polar coordinates (r,phi), where r can be interpreted as the strength of the match to the filter, and phi is the phase shift of the filter. With the filters shown, it appears to be effectively equivalent to extracting something like Fourier coefficients.
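To put that mapping in code (a toy sketch of my own, using the 10-cell ring resolution mentioned above only as an assumption): the same (a, b) pair can be spread over a ring of any number of phase-tuned cells:

```python
import numpy as np

def pair_to_ring(a, b, n_cells=10):
    """Read (a, b) as the complex number a + ib and spread it over a ring of
    n_cells phase-tuned cells; magnitude = match strength, angle = phase shift."""
    r, phi = np.hypot(a, b), np.arctan2(b, a)
    prefs = 2 * np.pi * np.arange(n_cells) / n_cells
    return r * (1 + np.cos(phi - prefs)) / 2      # nonnegative bump centred on phi

print(np.round(pair_to_ring(0.6, 0.3), 3))
```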
As Marcus points out on a couple of occasions, this interpretation is likely only going to be relevant to this specific example. These filters arose as a response to his specific choice of objective/loss function. The system was trained to find the filters that would not only respond to features in the environment, but also allow the strength of those activations to be used again to reconstruct the environment as a linear combination of those filters. The reconstruction error is part of the cost function. If one wanted to take this approach of learned filters to the next level, one should take a look at the work of Laurent Perrinet, in particular this.
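For what it's worth, my reading of that objective, sketched very roughly (the filter bank here is random, whereas in the real model it is learned, and the actual loss almost certainly has more terms than this):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_filters = 64, 8
filters = rng.normal(size=(n_filters, n_pixels))  # random here; learned in the real model

def reconstruction_loss(x, filters):
    responses = filters @ x                       # how strongly each filter matches the input
    x_hat = filters.T @ responses                 # rebuild the input as a linear combination
    return np.mean((x - x_hat) ** 2)              # the reconstruction-error term

x = rng.normal(size=n_pixels)
print(reconstruction_loss(x, filters))
```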
Probably the most interesting thing about this presentation is that it shows a plausible mechanism for how one might be able to generate a prediction of future inputs using a linear combination of basis filters that are each shifted in space by an independent phase parameter, and that these shifts in amplitude and phase can be driven by grid-cell-like activity. No filter by itself represents an object, but the overlap of several filters in a common location does.
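Here is a toy version of that prediction step (my own construction, using an FFT purely for convenience - the filters in the presentation are learned, not Fourier bases): each basis component's phase gets rotated independently by an amount proportional to the intended movement, and the components are then recombined:

```python
import numpy as np

n = 128
x = np.linspace(0, 1, n, endpoint=False)
signal = np.exp(-((x - 0.3) ** 2) / 0.002)        # a "feature" located at 0.3

coeffs = np.fft.rfft(signal)                      # one magnitude/phase per basis component
freqs = np.arange(len(coeffs))                    # cycles per unit length

dx = 0.2                                          # the intended movement
shifted = coeffs * np.exp(-2j * np.pi * freqs * dx)   # rotate each phase independently
predicted = np.fft.irfft(shifted, n=n)

print(x[np.argmax(predicted)])                    # the feature is predicted near 0.5
```

The independent per-component phase rotations are the part that maps naturally onto separate grid cell modules, each path-integrating the same movement at its own scale.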
Now, here’s the crux of the matter for me: How does one form a persistent representation of objects/features in an environment even when those features are no longer being observed in the sensory input? It’s all well and good if you have a perfectly clear overhead view of the environment as shown in this example. But how does one still manage to navigate towards hidden or obscured objectives with only a portion of the environment available for observation at any particular moment?
I think the work done here by Marcus is one part of the puzzle. The other part has to do with some feature of recall in the short term memory that I’m only vaguely remembering from one of the earlier presentations this year (perhaps one by Florian). If I can recall the details, I will start up another thread (or revive a previous one).
OK, but I don't think the intention was to model two cells. They form a model of a ring - it is not intended to be a biological model - just a clever way of modelling a ring in a machine learning environment. I guess Marcus knows how to implement a ring with neurons/cells.
The model he presented is allocentric and assumes there are other cells that can maintain the relative positions of objects, boundaries, etc. (they form the input). It is not based on visual input - it is like a view from above (not literally a view from above). So line of sight is not an issue in what he presented. As long as the relative positions of objects are stable, the agent knows their position; for example, the object might be behind you and you still know where it is. Or another example: you might look around a room and construct a model with the object locations, then shut your eyes or walk backwards, and you maintain that map.