I’ve noticed lately in the research streams that Numenta seems to be having a hard time reasoning about the properties of the models that the HTM algorithm generates. I came up with a fairly straightforward model of it a few months ago, but figured it was simple enough that Numenta probably had come across it. After the recent research streams however, and how much @jhawkins has been stressing about not understanding the origin of grid cells, I’m starting to think this may not be the case.
I’d like to first point out that my reasoning here is based mostly on some basic properties of spatial pooling, and relatively few other assumptions. As for hard evidence, I don’t have any quite yet, though I plan to push some NuPIC models through UMAP, which I’m pretty sure should allow for most of the phenomena I describe here to be directly visualized. I’ve been a bit busy as of late however, and so it’ll be a little while before I have the time. If someone else wants to set up the visualization for me, I’d be happy to discuss what that would take. It should be fairly easy; I’m just short on time and am not great with Python.
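For anyone who wants to take a crack at the visualization, here’s roughly what I have in mind. This is just a minimal sketch: `collect_sp_outputs` is a placeholder for however you pull active-column SDRs out of a NuPIC model (it returns random SDRs for now so the script runs), and the UMAP settings are guesses that would need tuning.

```python
# Sketch: embed spatial pooler output SDRs with UMAP so that any
# manifold / grid-like structure in the model subspace can be eyeballed.
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

def collect_sp_outputs(num_steps, num_columns=2048, sparsity=0.02):
    """Placeholder: swap in real NuPIC spatial pooler output here
    (one binary active-column vector per timestep). For now this just
    returns random binary SDRs so the script runs end to end."""
    rng = np.random.default_rng(0)
    return rng.random((num_steps, num_columns)) < sparsity

sdrs = collect_sp_outputs(5000)

# Jaccard distance is a reasonable choice for binary SDRs.
embedding = umap.UMAP(metric="jaccard", n_neighbors=30,
                      min_dist=0.1).fit_transform(sdrs)

plt.scatter(embedding[:, 0], embedding[:, 1], s=2, c=np.arange(len(sdrs)))
plt.title("SP output SDRs, colored by timestep")
plt.show()
```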
I hope this solves some problems Numenta is facing.
The Inner Workings of Spatial Pooling
Imagine the space of possible input SDRs to a macrocolumn. The input to the column is some point within this space, which moves each timestep. This, over many timesteps, traces out a complex path through the input space. The space of observed inputs is, in practice, a subspace of the total input space.
The spatial pooling algorithm is fundamentally an algorithm for distributing Receptive Fields (abbreviated RFs) across this space. Each timestep, it performs a k-nearest-neighbors search across all RFs in the macrocolumn (ignoring ambiguity and unions for now), and activates the corresponding neurons/minicolumns. Hebbian learning is applied to active neurons, moving their receptive fields toward the input point. Boosting can attract RFs that lie far outside the input subspace until they lie within it.
The subspace mapped by the RFs can be thought of as the subspace of the model.
This has several effects:
- Eventually, all RFs overlap with the input subspace.
- Eventually, the model subspace overlaps the explored input subspace completely.
- The density of RFs in a given portion of the input subspace is correlated with how much time the input point spends near it (more frequently visited regions are modelled in more detail).
- RFs can be thought of as “landmarks”, keeping track of common reference points in the input subspace.
- The selected RFs in any given timestep form an SDR that serves as a coordinate, representing a location in the model subspace by providing a set of nearby landmarks.
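To make the “distributing RFs” picture concrete, here’s a toy caricature of the loop described above: k-nearest-neighbor selection, a Hebbian pull on the winners, and boosting for chronically unused RFs. This is my simplified model of spatial pooling, not NuPIC’s actual implementation, and all the parameters are made up.

```python
# Toy caricature of spatial pooling as RF distribution: the k most
# overlapping RFs win, winners are pulled toward the input (Hebbian),
# and chronically unused RFs get boosted so they eventually migrate
# into the visited subspace.
import numpy as np

rng = np.random.default_rng(0)
NUM_RFS, INPUT_DIM, K = 128, 64, 5
LEARN_RATE, BOOST_STRENGTH = 0.05, 2.0

rfs = rng.random((NUM_RFS, INPUT_DIM))   # RF "centers" in input space
duty_cycle = np.zeros(NUM_RFS)           # how often each RF has won lately

def step(input_vec):
    """One timestep: select winners, apply Hebbian pull, update boosting."""
    global rfs, duty_cycle
    overlap = rfs @ input_vec            # similarity of each RF to the input
    boost = np.exp(BOOST_STRENGTH * (duty_cycle.mean() - duty_cycle))
    winners = np.argsort(overlap * boost)[-K:]   # k-nearest-neighbor selection
    rfs[winners] += LEARN_RATE * (input_vec - rfs[winners])  # pull toward input
    duty_cycle *= 0.99                   # decaying activity average
    duty_cycle[winners] += 0.01
    return winners                       # the output "SDR"
```

Feed this a stream of inputs confined to some subspace and the RFs should end up covering just that subspace, with their density tracking how often each region is visited, which is all the bullet points above are claiming.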
A Brief Aside on Temporal Memory (not directly relevant, just related and interesting)
The role of Temporal Memory in this interpretation is not to model the coordinate space, but rather to model paths commonly taken between landmarks. Within a TM layer, each minicolumn represents a landmark within the input subspace, while each neuron in that minicolumn represents a set of possible paths that may have been taken to that point. A TM layer encodes temporal sequences as “path spaces”.
When active minicolumns fire normally rather than bursting (bursting encodes no temporal context), the extra information in the resulting SDR encodes the temporal context as a fuzzy intersection of path spaces. Each neuron has its own path space that it responds to. These spaces may be somewhat large and ambiguous, but the effect of adding more neurons/minicolumns to the representation is to generate a “true path space”: the space of possible paths that lie at the intersection of all the path spaces provided by active neurons.
As more path spaces are contributed to the temporal context, they constrain the space of possible paths toward the current input even further. In the case of a minicolumn bursting, it contributes “all possible paths”. The intersection between a path space P and the “space of all possible paths” is of course P.
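As a toy illustration of what I mean by intersecting path spaces (this is pure set arithmetic, nothing HTM-specific; the path IDs are made up):

```python
# Each active neuron contributes the set of paths it responds to; the
# "true path space" is the intersection. A bursting minicolumn
# contributes "all possible paths", which leaves the intersection unchanged.
ALL_PATHS = frozenset(range(1000))      # stand-in for "every possible path"

neuron_a = frozenset({3, 17, 42, 99})   # paths neuron A responds to
neuron_b = frozenset({17, 42, 511})     # paths neuron B responds to
bursting = ALL_PATHS                    # a bursting minicolumn

true_path_space = neuron_a & neuron_b & bursting
print(true_path_space)                  # frozenset({17, 42}): only the paths
                                        # consistent with every piece of context
```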
The introduction of non-temporal information into prediction (apical dendrites, etc.) simply shifts the model to including other information into these paths. In this case, the temporal context may not consist purely of path spaces, but also perhaps “action spaces”, or “column voting spaces”.
For example, a minicolumn with the majority of its distal connections coming from motor commands rather than local connections would base its predictions on motor spaces; the predicted neurons are those whose “motor paths” overlap with recent motor activity.
GCM Dimensionality and SDRs
Suppose we want a spatial pooler to map out receptive fields across an N-dimensional manifold. The total number of landmarks needed to exhaustively map the space scales with the volume of the manifold (adjusted for density). As volume scales exponentially with the number of dimensions, this very quickly demands far more RFs than we actually have!
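Just to put rough numbers on how quickly this blows up (the resolution of 10 landmarks per dimension is an arbitrary assumption):

```python
# Landmarks needed to exhaustively tile an N-D manifold at a fixed
# resolution: volume scales as resolution**N.
RESOLUTION = 10   # arbitrary: 10 distinguishable positions per dimension
for dims in (1, 2, 3, 5, 7, 10):
    print(dims, RESOLUTION ** dims)
# 1 -> 10, 2 -> 100, 3 -> 1,000, ... 10 -> 10,000,000,000
# far beyond the couple thousand minicolumns in a macrocolumn
```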
Given an input space of, say, 1000 bits, an input that traces out the full 1000-bit space with an even distribution, and an RF for every point in that space, we’d expect HTM (in theory) to eventually distribute those RFs evenly, forming a fully saturated map of the space, with 1 bit in the output SDR for each of the 2^1000 possible inputs. (This is of course just a theoretical statement about the general approach to distribution; I am ignoring sparsity here.)
However, SDRs have some interesting properties that come into play here. Any given landmark has a corresponding coordinate in the space of input SDRs (the center of its RF). Nearby landmarks will have similar SDRs, or else they wouldn’t be nearby (nearby meaning they are similar enough that an input point near one will often activate both).
Suppose there is a collection of grid cells forming a 1D basis vector. We can expect the neurons along this basis vector to have significant overlap with the points in the manifold that project to them. In this case, we can expect that, lacking sufficient neurons to properly fill the manifold, a more achievable local minimum for the spatial pooler would be to fill points along just the basis vectors. The grid cells along these projections would likely find themselves centered on an SDR at roughly the average of all the points that project to them.
Of course, exactly how many RFs can be devoted to this mapping depends on how many RFs the macrocolumn has in total, and how many are being used to map other portions of the input subspace. If there is just one manifold to map, the macrocolumn may have hundreds of RFs to throw at it, and perhaps may even be able to form a 3- or 4-dimensional grid cell module if necessary (this would likely be somewhat rare). If the manifold has a larger number of dimensions, a more achievable local minimum would be to create grid cell modules of slices or projections of the manifold. A 7-dimensional manifold, for example, may be more easily modelled with a combination of two 2D GCMs and a 3D GCM.
The input subspace may also contain several disconnected manifolds, which would need to be modelled separately. For example, suppose we had two manifolds, A and B, and the input spent 70% of its time in A, and 30% in B. We’d expect to see HTM model A with ~70% of its available RFs, and B with ~30%. We’d likely expect to see a larger number of 2D GCMs in A than in B, assuming that A and B have a similar number of dimensions. This may not be the case however, as A could perhaps be very high-dimensional while B is very low-dimensional (2 or 3), meaning that while A has more than twice as many landmarks available for mapping, it would also have an exponentially larger space to map.
It’s also worth noting that higher-dimensional GCMs may not be regular. Manifolds with more than 1 dimension may frequently contain holes that cannot be mapped (for example, while a collection of GCMs may be capable of representing a location inside a wall, you may find yourself physically unable to occupy that space). In this case, higher-dimensional GCMs will likely not allocate RFs to such locations. High-dimensional GCMs may, as a result, look less like grids, and more like a collection of landmarks.
Place Cells and Phase Transitions
While a high-dimensional manifold may not be fully mappable without resorting to lower-dimensional projections, there may be regions of the manifold that are visited frequently enough that the spatial pooler allocates an entire RF to them. In this case, the resulting RF isn’t found along a basis vector or projection, but rather at a specific location. Calling this a grid cell doesn’t make much sense; it’s better described as a place cell.
While mapping a particular space, the spatial pooler may find itself with a spare RF to reallocate. Perhaps it decides to boost an RF serving as a landmark in a region of the input subspace that hasn’t been visited lately. Slowly, such RFs would be reallocated to mark common locations inside other manifolds. A 2D manifold represented by a pair of 1D GCMs may slowly be filled with place cells if the input point begins to explore it more heavily.
As the pair of 1D GCMs transitions into a 2D GCM, the original 1D GCMs would become active less frequently, since the new 2D grid cells would be closer to the input point. Eventually, boosting would cause these now-redundant 1D GCMs to be repurposed toward mapping other parts of the input space.
This process can be thought of as a phase transition from one dimensionality of GCM to another.
Phase transitions may also work in the opposite direction, reducing the dimensionality of GCMs. If a manifold mapped by a 2D GCM is explored less, boosting would repurpose its RFs to map out other portions of the input space. As the 2D GCM thins out, the remaining RFs would find themselves settling into a different local minimum, drifting toward the 1D basis vectors.
Sparsity, Unions, and Manifolds
When inputs become more ambiguous, SDRs become less sparse, and begin modelling a variety of different potential objects as a union. Properties of SDRs allow for these different objects to still be differentiated. However, each object in the union can also be seen as a union of yet simpler objects. Even in cases where an SDR might not be thought of as being in a union state, it can’t really be said for certain that it is a single object.
Arguably, in much the same way that the context component of an SDR can be thought of as a path space, the spatial component of an SDR (which RFs are active) can be thought of not as a specific object, but rather as a space of possible objects (the space of all possible input SDRs similar enough to each other to generate this specific output SDR).
However, in much the same way that different path spaces intersect, these object spaces intersect too. A given active RF can be thought of as representing the space of all possible objects close enough to activate it. Multiple active RFs together represent the space of possible objects that would activate all of them at once. As more RFs are added, the intersection shrinks exponentially. If the sparsity of the input SDR is limited, the sparsity of the output SDR will be limited as well, as adding too many neurons would quickly shrink this space down to nothing.
More activity in the input creates some complex behavior. Because a k-nearest-neighbors selection breaks down as an accurate model of RF selection during unions, the interpretation of intersecting object spaces changes (this applies to path spaces too).
In this case, rather than the object space being the intersection of the object space of each contributing RF, it is instead the space that lies at the intersection of at least k contributing object spaces. If the input SDR is a union of 2 SDRs, we can expect 2k contributing object spaces, and roughly 2 disjoint object spaces (though sometimes they will overlap). This starts to become a significant factor around the same time that the SDRs in the union begin overlapping too much to be differentiated.
As more objects are added to the union, the object space begins to expand. While this allows one to represent more points in parallel, this also increases error rates. If too many objects are added to the union, the object space becomes saturated and nearly the entire space lies within it.
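Here’s a quick Monte Carlo sketch of that saturation effect; the SDR sizes and the overlap threshold are arbitrary, just enough to show the trend.

```python
# As more SDRs are OR'd into a union, the chance that a random,
# unrelated SDR "matches" the union (overlap >= threshold) climbs
# toward 1: the object space saturates.
import numpy as np

rng = np.random.default_rng(0)
N, ACTIVE, THRESHOLD, TRIALS = 2048, 40, 20, 2000

def random_sdr():
    """Indices of the active bits of a random sparse SDR."""
    return rng.choice(N, size=ACTIVE, replace=False)

for union_size in (1, 5, 10, 20, 40, 80):
    union = np.zeros(N, dtype=bool)
    for _ in range(union_size):
        union[random_sdr()] = True          # OR the SDR into the union
    false_matches = sum(union[random_sdr()].sum() >= THRESHOLD
                        for _ in range(TRIALS))
    print(union_size, false_matches / TRIALS)
# the false-match rate stays near 0 for small unions, then shoots up
# once the union covers most of the space
```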
We can also go in the opposite direction. Individual objects can be thought of as a union of simpler object spaces. The constituents are close enough to be considered more or less the same object (roughly k active RFs, perhaps somewhat close together), but otherwise they behave much like the unions described above.
In this case, an object is a collection of its features, and its features can change independently of each other. A feature may lie at the intersection of several projected GCMs. As the input changes, the activity of the projected GCMs changes. Temporal memory would be able to predict each independently, based on context and overlapping context spaces.
However in other cases, features may interact. Applying temporal memory to these manifolds can be thought of as turning them from simple coordinate spaces to attractors, or as projections or holograms of a phase space. While in some cases, the previous state of just that one GCM may be sufficient for accurate predictions within it, it may also be just a projection of a larger manifold, and thus be closely connected to several other projections. In this case, you’d expect to see different projections/basis vectors of a single manifold strongly connected to each other in their temporal connections.
If you’re tracking a moving ball, its x coordinate at the next time step won’t just depend on its previous x coordinate; it’ll also depend on its previous y and z coordinates, as well as perhaps other factors such as velocity vectors and other objects in the environment it could collide with.
Navigation and Landmarks
Now for something completely different. A higher-level perspective.
Think about how you navigate a familiar space. The most important objectives in navigating a space are, of course, to figure out your location and to determine which direction to travel in.
The way this typically operates in humans is to base your location on the relative locations of familiar landmarks. If you sit at your desk, you may see your computer, your coffee mug, and the other items around your desk. If you turn your head, you may see the window, and perhaps a very recognizable building down the street.
Since it’s a location you find yourself in often, you can expect your brain to have place cells dedicated specifically to your office and to common points of interest nearby - your computer, your desk, the window, the building down the street that can be seen through the window.
If you suddenly found yourself transported from home to your office, you’d be able to quickly determine from your surroundings where exactly you are. Presumably, there should be some pretty strong associations between the place cells for your office and cells representing these familiar experiences.
It’s likely that the way humans perform path integration is different from how computers perform it. While computers may find optimal paths, they often do so with pathfinding algorithms that amount to something close to a brute-force search. It’s clear that this isn’t the approach that humans take; we don’t find ourselves thinking through every possible path (far from it), and we often prioritize paths that are familiar or easy to think of, rather than paths that are absolutely optimal.
Path Integration
Think about how you make it to your office. You don’t plan out every possible step; you remember familiar landmarks, and you remember which direction to take from each landmark to get to the next. The circuitry required for this doesn’t demand any expensive search algorithms; it just needs a sequence of landmarks and the associated actions/directions to take from one to get to the next.
If you find yourself asking for directions to an unfamiliar location, you don’t ask for a precise step-wise path. You ask for a sequence of recognizable landmarks, and the direction to take to get from one to the next.
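A toy sketch of the kind of circuitry I have in mind: no search at all, just a learned mapping from each landmark to the action that gets you to the next one. The landmarks and actions here are obviously made up.

```python
# Path integration as a chain of (landmark -> action, next landmark)
# lookups rather than a search over all possible paths. Following the
# chain is just replaying a learned sequence.
route_home_to_office = {
    "front_door":   ("turn left onto Main St",   "coffee_shop"),
    "coffee_shop":  ("continue straight",        "big_oak_tree"),
    "big_oak_tree": ("turn right at the light",  "parking_lot"),
    "parking_lot":  ("walk up to the 3rd floor", "office"),
}

def navigate(start, goal, route):
    landmark = start
    while landmark != goal:
        action, landmark = route[landmark]   # one cheap lookup per step
        print(action)

navigate("front_door", "office", route_home_to_office)
```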
Place cells, marking landmarks, can be expected to be fairly robust. The place cells associated with your office don’t exist in isolation; they’re strongly associated with nearby items. If you find yourself in your office but find your desk missing, you’ll still know exactly where you are when you look out the window and see the same building down the street.
It can be expected that more abstract spaces can be mapped in similar ways. If you are trying to solve a problem, you have a set of familiar sub-problems you expect to face. In these cases, place cells may not be locations in space, but rather specific situations and problems. In much the same way that you map out your commute based on familiar landmarks and where you expect familiar actions to take you from each one (essentially breaking path integration into a collection of smaller, simpler path integration problems), you solve general problems by breaking large problems into smaller problems that you are more familiar with.
When you come across a common problem, you can quickly recognize a collection of familiar properties it has, and you have a library of known actions and strategies for breaking it up into smaller pieces, along with an expectation of how each one will affect it. You can run through common sequences of events to imagine how each strategy will break it up, and look for familiar consequences that suggest ways to break it down further.
Wow, that’s a bit dense. I’ll try to add some visual aids if I can find the time. Otherwise, a wall of text is at least better than nothing.
I’m open to discussion on this. I hope it leads research in a useful direction.
Edit:
I figure a simple and less technical explanation (TL;DR?) might be worth putting here to more clearly explain the point of this thread.
Numenta has lately been trying to understand how the cortex models abstract concepts; we can tell it clearly uses grid cells to model locations in physical spaces, but what about abstract spaces like language and math? When it does use these grid cell modules, does it use 1D GCMs? 2D GCMs? If it uses 1D GCMs, why does it appear that it uses 2D GCMs to model physical locations?
This thread is my attempt to show how a little exploration of the way the spatial pooler and the math of SDRs interact leads to an answer. It’s a complex answer, and it suggests that the brain can build arbitrary N-D GCMs, but only if it has enough receptive fields (the number of RFs needed for a detailed, “optimal map” would scale with the volume of the space, and thus exponentially with the number of dimensions).
If it doesn’t, it opts for a combination of place cells and lower-dimensional projections. If the dimensionality of the projections gets low enough, they start to look like 1D and 2D GCMs. The mechanisms of the spatial pooler (really Hebbian learning in general, since this kind of applies to the temporal pooler too) should even create phase transitions between GCM dimensionalities as the network explores certain parts of the input space.
I also made a point that if you look at how humans navigate these spaces, and the role that grid cells and place cells seem to play in this, the strategies humans use for navigating physical spaces seem to match well with strategies used for abstract problem solving (navigating abstract spaces). In fact, it seems to me like the mechanisms to perform navigation of complex abstract spaces would actually be fairly simple based on these observations (mostly just a matter of replaying learned sequences, with a little reinforcement learning to find good sequences), though that’s a conversation for a different time.