It seems clear that objects are being stored in cortical layers. In the TM layer(s), these objects are sequences. In the sensorimotor layer(s), these objects are spatial. I have been brainstorming about ways to retrieve these objects. Here are my thoughts about it. Please correct me if I’m wrong.
Object Representation
In Temporal Memory (TM), objects are ordered lists of active columns (sequences of spatial patterns).
In Sensorimotor Inference (SMI), objects are sets of active columns (allocentric information).
In either case, distal context is not a part of the object representation, although it plays a crucial role in learning the objects. In the TM, distal connections are what arrange spatial patterns into objects by recognizing their temporal order. In SMI, distal connections restrict object expression by enforcing a motor context for spatial patterns.
Each spatial input could represent many objects, and I would like to identify and extract potential objects given a set of active columns representing one spatial input. This seems like it must be possible, but I’m not sure how to do it.
Open Questions
Would I also need distal context to properly identify potential objects?
Is a pooling layer required to identify an object?
I believe you will need the distal context (i.e. where the features are located on the object) in many cases, but it could depend on how similar the objects are or how specific you want your identification to be. The distal context becomes important when you need to distinguish objects which have different arrangements and/or counts of the same set of features.
Consider for example some 3D objects like “cube”, “pyramid”, and “octahedron”. All of these contain common features like corners, edges, and sides – all would probably be represented by the same columns. To distinguish between them you need to know either the positions of the features or their counts. You can get both of these elements of information from the distal context. On the other hand, if you just needed to identify that it was a “3D shape” (i.e. a class of objects) or if returning that “it is either a cube, pyramid, or octahedron” is sufficient, then identification by columns alone should work.
I believe you will also need this in most cases, but it depends on how unique the features and/or their positions are. The pooling layer will also potentially have a more complete concept of the object (in case not every column below has sensed every feature on the object), so it is more likely to make an accurate prediction. If you look at only a single input and that feature+location is common to many objects, you can identify a list of potential objects but not any specific one. Another way to approach this would be to look at both the active cells and the predictive cells (since predictive cells will be driven in part by the pooling layer); this would also get you to a specific object. At first it could still identify multiple potential objects, but once a few features had been sensed, it would lock onto a specific one, as sketched below. Active cells in the pooling layer are still probably a better option, though.
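To make that narrowing idea concrete, here is a minimal Python sketch. Everything in it is made up for illustration (the object library, the feature+location pairs); it is not NuPIC code, just the set logic of candidates shrinking as more sensations arrive:

```python
# Hypothetical sketch: narrow a candidate-object list over successive
# sensations by intersecting the objects consistent with each input.
# `object_library` maps object names to the (feature, location) pairs
# learned for that object -- all names and values are illustrative.

object_library = {
    "cube":       {("corner", 0), ("edge", 1), ("side", 2)},
    "pyramid":    {("corner", 0), ("edge", 3), ("side", 4)},
    "octahedron": {("corner", 5), ("edge", 1), ("side", 2)},
}

def candidates_for(sensation, library):
    """Return every object that contains this (feature, location) pair."""
    return {name for name, pairs in library.items() if sensation in pairs}

def narrow(sensations, library):
    """Intersect candidate sets across a sequence of sensations."""
    remaining = set(library)
    for s in sensations:
        remaining &= candidates_for(s, library)
    return remaining

# One ambiguous sensation leaves several candidates...
print(narrow([("edge", 1)], object_library))                  # {'cube', 'octahedron'}
# ...a second sensation locks onto a specific object.
print(narrow([("edge", 1), ("corner", 0)], object_library))   # {'cube'}
```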
Reading your post again, I think I might have misinterpreted what you were saying. I think you might be saying, “given a set of columns, how can I return a list of potential objects?” If so, then one way to approach this would be with SDR comparisons.
In my 3D objects example, given columns for corners, edges, and sides, one strategy would be to create an SDR with a one bit set for all cells in those columns, and a zero bit for all cells in other columns. Then compare that to the SDRs for the various shapes to get an overlap score.
In this case, you would need the distal context to create the input SDRs for the potential objects that you are checking for.
Also, this should work for either the lower layer or the higher pooling layer (since both have representations of the objects). The representations in the pooling layer may be more complete.
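Here is a rough sketch of that overlap comparison. The layer dimensions are made-up, and the stored object SDRs are stubbed with random bits where a real system would use learned cell-level representations (e.g. from the pooling layer):

```python
import numpy as np

# Illustrative layer dimensions, not taken from any real network config.
NUM_COLUMNS = 2048
CELLS_PER_COLUMN = 32
NUM_CELLS = NUM_COLUMNS * CELLS_PER_COLUMN

def columns_to_sdr(active_columns):
    """Set a one bit for every cell in the given columns, zero elsewhere."""
    sdr = np.zeros(NUM_CELLS, dtype=np.uint8)
    for col in active_columns:
        start = col * CELLS_PER_COLUMN
        sdr[start:start + CELLS_PER_COLUMN] = 1
    return sdr

def overlap(sdr_a, sdr_b):
    """Count the bits active in both SDRs."""
    return int(np.sum(sdr_a & sdr_b))

# Stand-ins for learned object SDRs; in practice these would be the
# stored cell-level representations of each object.
rng = np.random.default_rng(42)
object_sdrs = {
    name: (rng.random(NUM_CELLS) < 0.02).astype(np.uint8)
    for name in ("cube", "pyramid", "octahedron")
}

input_sdr = columns_to_sdr(active_columns=[3, 17, 42, 100])
scores = {name: overlap(input_sdr, sdr) for name, sdr in object_sdrs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # best match first
```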
I’d like to clarify my understanding just to be precise. As I understood it from your recent videos with Jeff, SMI and TM have the same functionality: sequences of patterns. The difference is that L3b TM encodes sequences of features (aka spatial patterns), while L4 SMI is a TM which encodes sequences of feature-locations (spatial patterns with locations).
For your questions I may be able to help brainstorm:
For SMI, a set of specifically active cells (an SDR) represents feature inputs with feature-location context. Touching an object for the first time, you get a set of active columns saying “hey, I feel a bump!”. The active columns burst because you have no prior context of bumps in a certain area. This yields a specific SDR that says “hey, I feel a bump, but no prior context!” Certainly from this SDR you should be able to identify a list of potential objects that have this same feature. The list would be quite large for an SMI that has learned a lot of objects with this feature!
As you move the sensor over the object, you get a sequence of SDRs. In this case you need the distal context because of how TM operates: previous distal context activates current specific neurons in a column. The sequence of SDRs says “I feel a bump, then a divot, then an edge when I move my finger along a specific trajectory”. A much narrower set of potential objects has this quality, but you need a pooling layer to recognize the sequence of SDRs. I think it’s this pooling layer that’s essential for more powerful object recognition.
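One crude stand-in for what the pooling layer would do is to take the union of the active-cell SDRs over a sequence of sensations and match it against stored object unions. This is only an approximation of a real pooling layer, and all names and numbers below are made up:

```python
# Rough pooling-layer stand-in, assuming each object was learned as the
# union of the cell SDRs produced while sensing it. Purely illustrative.

def pool(sdr_sequence):
    """Union of active-cell sets across a sequence of sensations."""
    pooled = set()
    for sdr in sdr_sequence:
        pooled |= sdr
    return pooled

def best_match(pooled, learned_objects):
    """Pick the learned object whose pooled SDR overlaps the input most."""
    return max(learned_objects, key=lambda name: len(pooled & learned_objects[name]))

# Each sensation is a set of active cell indices (bump, divot, edge, ...).
learned_objects = {
    "mug":  {1, 2, 3, 10, 11, 20},
    "bowl": {1, 2, 4, 12, 13, 21},
}
sensed = pool([{1, 2}, {10, 11}])            # bump, then divot, along a trajectory
print(best_match(sensed, learned_objects))   # mug
```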
An interesting feature would be an object classifier akin to the classifier used for sequence memory to predict the likelihood of future inputs. This is one feature of NuPIC that I haven’t studied a whole lot yet, but as I understand it, it works by holding lookup tables for active cells which contain a running average of inputs a specified number of time steps in the future. It requires tagging the inputs into a finite set of buckets.
This approach wouldn’t quite work out of the box for object recognition since the inputs don’t happen in any defined order, and the inputs themselves are only features (not objects). One possible modification would be to create buckets for possible objects, and do the tagging during a training phase.
This would build lookup tables for active cells (either the lower layer or pooling layer should work) which reference likelihoods of different objects. Once training is complete, a random object from the set (or a novel object with semantics similar to a trained object) can start being sensed and the lookup tables used to predict what it is.
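As a sketch of that modification (the class and method names below are hypothetical, not NuPIC’s classifier API), the lookup tables could map active cells to object counts accumulated during a training phase, then normalize the summed votes into likelihoods at inference time:

```python
from collections import defaultdict

# Hypothetical object classifier: per-cell lookup tables that accumulate
# how often each cell was active while each tagged object was sensed.

class ObjectClassifier:
    def __init__(self):
        # cell index -> object name -> co-occurrence count
        self.tables = defaultdict(lambda: defaultdict(int))

    def learn(self, active_cells, object_name):
        """Training phase: tag the current active cells with an object."""
        for cell in active_cells:
            self.tables[cell][object_name] += 1

    def infer(self, active_cells):
        """Sum per-cell votes and normalize into a likelihood per object."""
        votes = defaultdict(float)
        for cell in active_cells:
            for name, count in self.tables[cell].items():
                votes[name] += count
        total = sum(votes.values()) or 1.0
        return {name: v / total for name, v in votes.items()}

clf = ObjectClassifier()
clf.learn({1, 2, 3}, "cube")
clf.learn({2, 3, 4}, "pyramid")
print(clf.infer({2, 3}))  # cube and pyramid roughly tied until more is sensed
```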