Correct me if I’m wrong: there was an old temporal pooler that is now called the temporal memory, so I may actually be speaking about something else. So here it goes:
As far as I understand, if we take two regions in the hierarchy, the bottom region receives a sequence of inputs. As long as the sequence is correctly predicted, the active cells of the bottom region are pooled over time and become the input to the higher region. In other words, if the sequence A->B->C->D is correctly predicted by the bottom region, the higher region gets a union of all the active cells for A, B, C and D. When the prediction breaks (predicted D->X, but got D->Y), pooling stops. This way we learn A->B->C->D as an object that commonly occurs in the world. As far as I understand, the higher region can also bias the bottom region into perceiving A->B->C->D via top-down feedback once it recognizes the sequence. (I sketch this mechanism in code at the end of this post.) If I haven’t messed up anywhere, and please correct me if I did, then my question is:
What is the neuroscience behind this, considering that the bottom and top regions operate on very different time scales? If we say that the ability of the bottom region to learn A->B->C->D (the temporal memory algorithm) is largely explained by spike timing and it all just works beautifully, then the upper region would have to perform the same thing on a scale that is four times slower in order to learn the transition from one object (A->B->C->D) to another object (another sequence that gets pooled together).
If we take a hierarchy of four regions, then the top region should operate on an even slower time scale. How does that work, if it works the way I described? Could there perhaps be a limit on how long a sequence can be when its cells are pooled over time?
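For concreteness, here is the sketch I promised above: a toy version of the pooling mechanism I have in mind (my own reading of the idea, not Numenta code). Cells are just sets of indices, and the reset-on-misprediction rule is my assumption:

```python
# A toy pooling layer: it unions the lower region's active cells while
# predictions hold, and starts a new union when a prediction fails.
# (My mental model only; not an actual HTM implementation.)

def pool_sequence(inputs, predictions):
    """inputs: list of sets of active cell indices, one per time step.
    predictions: list of sets of cells predicted for each step.
    Returns one pooled union per correctly predicted run."""
    pooled = []
    current_union = set()
    for active, predicted in zip(inputs, predictions):
        if active <= predicted:          # step was correctly predicted
            current_union |= active      # keep pooling over time
        else:                            # prediction broke: stop pooling
            if current_union:
                pooled.append(current_union)
            current_union = set(active)  # start a new union
    if current_union:
        pooled.append(current_union)
    return pooled

# Example: A,B,C,D are all predicted, then an unexpected Y arrives.
A, B, C, D, Y = {1}, {2}, {3}, {4}, {5}
print(pool_sequence([A, B, C, D, Y],
                    [A, B, C, D, {6}]))  # [{1, 2, 3, 4}, {5}]
```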
I am also very curious about the details of how layers and regions form stable representations using TP. You may have already read this, but Jeff outlined some ideas that link to the biology.
My guess regarding the time-scale issue is that subsampling will activate the segment that represents the temporal pattern before the pattern completes. A,B might be enough to trigger the whole sequence in the layer/region above; top-down inference can then occur while the pattern is still unfolding below. This can also account for invariances: the subsampling of the temporal union of A,C,D can still activate the representation of A,B,C,D above and infer the full pattern back down, provided there isn’t already a representation of A,C,D.
This is just speculation, but it makes sense given how unions and subsampling work (see the toy illustration below). Hopefully we’ll come across some more detail soon.
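To make the subsampling idea concrete, here is a toy illustration (the cell indices and threshold are made up):

```python
# A cell in the higher region stores a segment that subsamples the full
# union A|B|C|D; a threshold below the full overlap lets A,B alone (or
# A,C,D) trigger the stored representation before the sequence completes.

A, B, C, D = {1, 2}, {3, 4}, {5, 6}, {7, 8}
segment = {1, 3, 5, 7}   # subsample: one synapse per element of the union
THRESHOLD = 2            # fire on partial overlap, not the full pattern

def recognizes(active_cells):
    return len(segment & active_cells) >= THRESHOLD

print(recognizes(A | B))        # True: fires after only A,B
print(recognizes(A | C | D))    # True: B is missing, still recognized
print(recognizes({9, 10}))      # False: unrelated input
```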
A little writing about neuroscience might soothe my mind today…
We use the term “temporal pooling” to refer to a process whereby a set of neurons stays active even though the input patterns to that set of neurons are changing. Temporal pooling is a many-to-one mapping; many input patterns map to a single output pattern.
This type of process is well documented in neuroscience. For example, as your eyes move, the input pattern to the neocortex is constantly changing, yet several levels up the hierarchy we find neurons that are stable and specific to the object being viewed. Think of it this way: the ability to classify a set of input patterns as “something” requires temporal pooling, so we know it is occurring wherever classification is occurring.
Via feedback projections in the opposite direction, a single “higher level” pattern will invoke a set of input patterns. Typically the backward projection does not activate the input neurons but only depolarizes them, to bias them.
In HTM today we don’t model individual spikes; neurons are either active or not. Temporal pooling only requires individual output neurons to recognize multiple input patterns. It is just a many-to-one mapping. As long as one or more of the input patterns is active, the output neurons will stay active. I don’t see any timing issues with this. Let me know if you are still concerned about timing issues.
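If it helps, here is a minimal sketch of that many-to-one mapping (the patterns and threshold are illustrative, not our actual implementation):

```python
# One output neuron learns several input patterns and stays active
# while any of them is present, even as the input keeps changing.

learned_patterns = [frozenset({1, 2, 3}), frozenset({4, 5, 6}),
                    frozenset({7, 8, 9})]  # e.g. A, B, C in a sequence

def output_active(input_pattern, overlap_threshold=2):
    # Active if enough of any learned pattern matches the current input.
    return any(len(p & input_pattern) >= overlap_threshold
               for p in learned_patterns)

# The input changes A -> B -> C, yet the output stays constant:
for step in [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]:
    assert output_active(step)
```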
We believe temporal pooling is being performed both on high-order sequences, such as those learned by temporal memory, and on sensory-motor sequences, which do not occur in a predictable order but instead depend on what movements you make. As you might know, we are currently working on a paper and simulations on how the basic mechanism used in temporal memory can be applied to sensory-motor inference. As part of this work we have started to model, in detail, temporal pooling over sequences of sensory-motor inputs. The results so far are encouraging in terms of capacity and performance. In short, it seems to work pretty well.
Where in the neocortex does temporal pooling occur? We believe it occurs in several places. At the moment we are modeling L2 pooling over patterns in L4. L4 changes with each movement of a sensor, such as a finger or the retina, and L2 temporally pools to create a stable and unique representation of the object being sensed. I also believe L5a is temporally pooling input patterns from L6a. I used to believe temporal pooling occurred between regions in the hierarchy. It still might. But I am now very confident that temporal pooling is occurring in several places within each region. Therefore temporal pooling does not HAVE to occur between regions. As long as it is occurring somewhere in the feedforward path up the neocortex we are good.
Jeff, thank you very much for the reply. It’s the first positive thing that has happened this morning. It is clearer now, and I’m very excited to read the paper on sensorimotor inference!
Do I understand correctly that the condition for pooling is predictability, and perhaps some capacity limit? I.e., a layer keeps pooling its inputs for as long as those inputs are correctly predicted. Is there a theoretical limit on how long the process can be? Could the hierarchy itself be partially a consequence of a theoretical limit on the pooling process and the need to remember very long sequences?
Yes, the condition for temporal pooling is predictability. As long as the inputs are correctly predicted, the pooling layer keeps the same activation state and learns to pool the changing input.
(It gets a bit more complicated when considering multiple columns. L2 cells connect to other L2 cells in nearby columns. We believe these inter-column connections allow columns to vote, or reach a consensus, on what object is being observed. In the multiple-column scenario, if several columns think they are observing object “A” and another nearby column isn’t sure, the columns that think it is “A” will bias the pooling layer in the unsure column. Feedback from L2 to L4 in the unsure column biases the input to be interpreted as part of “A”. In summary, a column that can’t predict its input on its own may be able to predict its input based on the beliefs of neighboring columns. E.g. if you look at an object through a straw you might have to move the straw multiple times before recognizing the object, but with a full retina you can recognize it in one glance. Or, if you touch an unknown object with one finger (no vision) you probably can’t tell what the object is without multiple touches, but if you grab the object with multiple fingers you can often tell what it is in one touch.)
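A hedged sketch of the voting intuition (the scoring scheme here is just for illustration, not our actual model):

```python
# Each column scores candidate objects from its own input; lateral input
# from neighbors is added so a confident majority can pull an unsure
# column toward consensus.

from collections import Counter

def vote(column_scores):
    """column_scores: list of Counters mapping object -> local evidence.
    Returns each column's belief after one round of lateral voting."""
    lateral = Counter()
    for scores in column_scores:
        lateral.update(scores)           # pooled evidence from all columns
    beliefs = []
    for scores in column_scores:
        combined = scores + lateral      # own evidence biased by neighbors
        beliefs.append(combined.most_common(1)[0][0])
    return beliefs

# Two columns are confident it's "A"; the third is unsure but gets pulled in.
cols = [Counter(A=3), Counter(A=3), Counter(A=1, B=1)]
print(vote(cols))  # ['A', 'A', 'A']
```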
Capacity. I was worried about the capacity of temporal pooling. E.g. L2 cells have a finite number of synapses. These synapses have to recognize multiple patterns in L4, and they have to recognize multiple patterns in L2 in their own column and in nearby columns. This presents a limit on the number of objects that can be learned, the number of features per object, and the number of columns that can vote together. We have done some analysis and simulations of this. So far it looks like L2 can learn quite a bit under a reasonable set of assumptions. The capacity goes up the sparser the L2 activation is. Empirical observations are that L2 is indeed quite sparse. This is one of the reasons experimentalists have difficulty finding active cells in L2.
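To give a feel for why sparsity helps, here is a back-of-envelope calculation using standard overlap-set combinatorics for sparse representations (the specific numbers are illustrative, not measured cortical values):

```python
# Probability that a random sparse pattern falsely matches a subsampled
# segment, via the hypergeometric distribution of overlaps.

from math import comb

def false_match_prob(n, w, s, theta):
    """Probability that a random pattern of w active cells out of n
    overlaps a segment of s synapses in at least theta places."""
    total = comb(n, w)
    hits = sum(comb(s, b) * comb(n - s, w - b)
               for b in range(theta, min(s, w) + 1))
    return hits / total

# 4096 cells, 2% active, segment subsamples 20 cells, threshold 13:
print(false_match_prob(4096, 82, 20, 13))   # vanishingly small
# Denser activity (10%) makes false matches far more likely:
print(false_match_prob(4096, 410, 20, 13))
```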
Even though temporal pooling capacity is large, it still has a limit. This is why a single region such as V1 or S1 can’t possibly recognize complete objects that span the full region. We still need a hierarchy. We don’t yet have a comprehensive theory of hierarchy. That is something I am working on.
Thank you! This was very insightful. When you speak of L2 capacity, do you mean the number of different sequences L2 can potentially learn, or the length of a single sequence? My concern is the saturation of inputs within one sequence.
As I understand it, pooling represents the union of all active cells within a sequence. Depending on the length of the sequence, in the limit all L4 cells end up in the representation of a very long sequence. That would make it impossible to tell one very long sequence from another. Is there an SP between L2 and the union of all active L4 cells in the given sequence?
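To illustrate the saturation worry with toy numbers (the sizes are made up):

```python
# As a sequence grows, the union of its active cells fills the layer,
# and two different long sequences become indistinguishable.

import random

N_CELLS, ACTIVE_PER_STEP = 2048, 40
random.seed(0)

def union_after(steps):
    union = set()
    for _ in range(steps):
        union |= set(random.sample(range(N_CELLS), ACTIVE_PER_STEP))
    return union

for steps in (5, 20, 100, 500):
    u = union_after(steps)
    print(steps, f"{len(u) / N_CELLS:.0%} of the layer in the union")
# By a few hundred steps the union covers nearly every cell, so any two
# long sequences overlap almost completely.
```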
Good question. Earlier I said there were two types of sequences: one is a high-order sequence (like a song), and the other is caused by movement. The latter type is bounded. If I am touching a pen with a finger, there are only so many places on the pen I can touch. The number of unique locations that L4 might see is determined by the coarseness of how location is encoded, but it has an upper bound. That is what we are working on now.
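Rough arithmetic behind the bound (the numbers are made up for illustration):

```python
# A pen's surface discretized at the location encoder's resolution
# yields a fixed, fairly small set of distinct L4 location inputs.
PEN_SURFACE_CM2 = 35.0          # ~14 cm long, ~2.5 cm circumference
LOCATION_RESOLUTION_CM2 = 0.25  # coarseness of the location encoding
print(int(PEN_SURFACE_CM2 / LOCATION_RESOLUTION_CM2), "distinct locations")  # 140
```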
Using temporal pooling for high-order sequences is more problematic. Sequences, such as a song, can be very long. It isn’t possible to form a stable representation via pooling for long sequences. There are several ways this might play out. Is it possible that temporal pooling doesn’t occur for high-order sequences? Does temporal pooling occur over highly repeated sub-sequences? We don’t know the answer to this yet. It is on my list of things we need to work on.
Thanks! That’s the answer I was looking for. I hadn’t thought in terms of some of the inputs having an upper bound on diversity just by the nature of those inputs. That definitely helps me think differently about the problem of storage in L2.
My intuition on high-order sequences is that the representation should be multi-level / hierarchical. I remember reading that some CNNs, like HMAX, were roughly based on the visual pathways, where the diversity of inputs is quite large, so you do need a hierarchy of features to represent complex objects. That makes it much easier on pooling as well. I don’t know how close HMAX is to the visual system in terms of information flow, but intuitively it feels like the visual modality can solve the pooling-capacity problem by arranging regions in chains / hierarchies.