TM’s ability to generalize


#1

I’ll mention that yes, the representational capacity of HTM is astronomical, thanks to the large, sparse, distributed patterns inherent in SDR encodings and to local learning rules. Learning one thing doesn’t significantly alter previously learned information because the model adjustments are completely local. This is a fundamentally different approach from globally training a DNN with a gradient-based procedure, where adjustments occur over the entire network. Catastrophic forgetting is an obvious consequence of global learning strategies.

My 2 cents from messing around with temporal memory the past year is that the representational capacity in even a modestly sized HTM network (2048 columns, 32 cells per column) is so astronomically large that it’s near impossible to discern between semantically similar temporal sequences in its current form. Assuming 2% sparsity of active cells, the number of different possible sets of active cells is 65536 choose 1311, which is an unimaginably large number. Should each configuration represent a distinct context? No way… there’s no need for that much discernment. However, in the current form of TM, each configuration essentially does represent a distinct context with respect to which lateral dendritic segments will be activated, leading to predicted cells for the next timestep. There’s a serious lack in TM’s ability to generalize among semantically similar contexts.
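To get a feel for the size of that number, here is a minimal sketch that evaluates the binomial coefficient exactly with Python's standard library. The constant names are just labels for the figures quoted above (2048 columns, 32 cells per column, 2% of cells active); nothing here comes from an HTM codebase.

```python
import math

COLUMNS = 2048
CELLS_PER_COLUMN = 32
SPARSITY = 0.02  # fraction of cells assumed active, as in the post above

total_cells = COLUMNS * CELLS_PER_COLUMN      # 65536
active_cells = round(total_cells * SPARSITY)  # 1311

# Exact number of distinct sets of active cells, as a Python big integer.
configurations = math.comb(total_cells, active_cells)

# Report the size in bits instead of printing thousands of digits.
bits = configurations.bit_length()
print(f"{total_cells} choose {active_cells} takes {bits} bits to write down")
```

The result takes thousands of bits to write down, so a network of this size could never visit more than a vanishing fraction of these configurations.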


Why is HTM Ignored by Google DeepMind?
#2

I thought that semantic similarity is handled by the Spatial Pooler (the minicolumns). Same column but different neurons would mean similar contexts. If something is represented by the exact same neurons, it would qualify as the same context, not a similar one. Firing the same or overlapping neurons would not be generalization. In the current theory, minicolumns generalize, not individual neurons. Either it is not the TM’s job to do that, or we should change our approach, as you suggest, to incorporate that into TM. It is also important to note that we have not yet mastered our hypothesized best solution to generalization: Temporal Pooling.


#3

I’m not sure that is correct, though I may be wrong. The SP handles semantically similar static input states but not semantically similar temporal contexts. Sets of columns are meant to distinguish static input states, right? Encoders as well as the spatial pooler are static algorithms without any prior information contributing to new column activations (ignoring permanence updates in SP). Same column but different neurons means distinct contexts for the same input state, but it does not necessarily convey any knowledge or representation at the level of semantically similar contexts.

Additionally, different within-column active neuron states, each representing a distinct context, can in general lead to entirely different next-timestep predictions. Is there any mechanism in TM that ensures semantically similar contexts (sets of active cells in a column) lead to similar predicted cell patterns? Each new context, no matter how similar to other known ones, is treated as an entirely new, distinct context in its cellular-level encoding, which leads to bursting in situations where it generally shouldn’t happen.

Regardless, the combinatorics are still massive. In my previous example, with 32 cells per column and each set of active cells a new context, there are (32 choose 1) + (32 choose 2) + … + (32 choose 32) = 2^32 − 1 (about 4.3 billion) different contexts possible for a single column. Assuming each static input state is composed of 2% of the columns (about 41 of 2048), you would multiply that number by 41. This many possible contexts (billions of them for each static input state) would be fine if there were a way to discern between semantically similar contexts, but it seems to me that each one is considered completely distinct from the rest. The combinatorics blow up even more when you start to consider chains of activation and arbitrary lengths of context.
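The per-column count can be checked directly. The sketch below is only arithmetic on the figures in this post. The variable names are mine, and the last quantity (treating the active columns as one joint configuration instead of adding them up) is an alternative reading I am assuming for comparison, not something claimed above.

```python
import math

CELLS_PER_COLUMN = 32
ACTIVE_COLUMNS = 41  # about 2% of 2048 columns

# Non-empty sets of active cells within one column.
per_column_contexts = sum(
    math.comb(CELLS_PER_COLUMN, k) for k in range(1, CELLS_PER_COLUMN + 1)
)

# Reading used in the post: add up the per-column contexts over the active columns.
summed_contexts = per_column_contexts * ACTIVE_COLUMNS

# Alternative reading (an assumption): every active column chooses independently,
# so the joint number of configurations is a product rather than a sum.
joint_contexts = per_column_contexts ** ACTIVE_COLUMNS

print(per_column_contexts)          # 4294967295
print(summed_contexts)              # 176093659095
print(joint_contexts.bit_length())  # 1312
```

Either way the conclusion is the same: far more cell-level contexts are available than could ever be told apart in a useful way.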

I’d be interested in researching how the brain manages to discern temporal similarities and how that might influence new generations of HTM.


#4

I just remembered something I saw a while back that is related to what I’m saying. See

For an illustrated related consequence of my observation/argument.


#5

I think I understand what you mean, but I do not think temporal similarity is possible without somehow reducing or sampling multiple transitions into a single activation. TM only samples from the previous activation, so it cannot create that semantically similar temporal context. It can only represent a transition. I would speculate that Temporal Pooling is what you want.


#6

I definitely agree that temporal semantics will be one of the products of Temporal Pooling. That said, there is actually an interesting case where TM can represent semantically similar temporal contexts: a fuzzy prediction in TM resulting from a novel input that has other (non-temporal) semantic similarities to previously encountered inputs.

For example, let’s say the sequence learned is:

My dog ate my homework

Then you input:

My cat ate …

At this point (because dog and cat have some spatial overlap), you can get a fuzzy prediction for “my”. This new context for “my” should share temporal semantics with the representation “my after ate after dog after my”.
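A toy sketch of why the overlap matters. Everything here is hypothetical: the SDRs are made-up sets of column indices, the segment is reduced to a set of presynaptic columns (real TM segments connect to individual cells), and the threshold is an arbitrary illustrative value, not a parameter taken from any HTM implementation.

```python
# Hypothetical column-level SDRs: sets of active column indices.
DOG = {1, 2, 3, 4, 5, 6, 7, 8}
CAT = {5, 6, 7, 8, 9, 10, 11, 12}        # shares four columns with DOG
CAR = {20, 21, 22, 23, 24, 25, 26, 27}   # shares nothing with DOG

# A toy dendritic segment learned while "dog" was active. It predicts "ate"
# and samples a subset of the columns that were active for DOG.
SEGMENT_SYNAPSES = {2, 4, 5, 6, 7, 8}
ACTIVATION_THRESHOLD = 4  # illustrative value only


def segment_overlap(active_columns: set, synapses: set) -> int:
    """Count how many of the segment's synapses see an active input."""
    return len(active_columns & synapses)


def is_predictive(active_columns: set, synapses: set) -> bool:
    """The segment fires if enough of its synapses are active."""
    return segment_overlap(active_columns, synapses) >= ACTIVATION_THRESHOLD


for name, sdr in [("dog", DOG), ("cat", CAT), ("car", CAR)]:
    overlap = segment_overlap(sdr, SEGMENT_SYNAPSES)
    print(name, overlap, is_predictive(sdr, SEGMENT_SYNAPSES))
```

With these made-up values "cat" activates the segment more weakly than "dog" but still reaches the threshold, while "car" does not, which is the fuzzy prediction described above.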


#7

This assumes that you do all the parsing locally.

I would much prefer a companion grammar parser that cooperatively pushes semantic slots at the word store that also pushes against a script/frames store.

Systems level programming for the win!


#8

Agreed. I am working on something similar for my job. Sentences are actually more useful to deal with in a feature/location type of scenario (where the “location” signal depicts grammatical concepts), rather than as high-order sequences.


#9

Bingo! For the purposes of accurate prediction at least, that would be invaluable. I think it’d replace the need for any backtracking. We know any form of context exists in some hierarchical organization, with known subsequences within subsequences and so on. Do you think temporal pooling is achieved within a single region of cortex, or does it only arise within a hierarchy of regions?


#10

We think temporal pooling happens in a single region of cortex. It does not require hierarchy. See the Columns Paper and the two-layer circuit. The output layer is doing a form of temporal pooling, but not over sequences. The concepts are similar, though.


#11

I am intensely curious where you think this function is localized.
Does this correspond to a particular neural substrate?
If cortex function is uniform across the dinner napkin why is it localized to a certain region?

Do you have any theories on the sequences question?


#12

There are no papers with details yet, but that’s something that’s coming. The closest public talk I’ve seen is this one:

I’m going to do another HTM Chat with Jeff soon, so I’ll be sure to address this.


#13

I was worried about much the same thing when I first came to the HTM theory.
Then I followed Jeff’s advice that neural network theory should be inspired by the biology, so I went back to the brain to see what it was doing with this.

Imagine a single pyramidal neuron’s dendrites, branching through space around the cell body. Note that as the dendrite branches and radiates, it tends to fill space at an almost constant density, forming a circular receptive field.

The cells are packed much closer together than the diameter of this field, resulting in overlapping receptive fields. Note that the scale in this drawing is way off: the cell bodies in columns are spaced somewhere around 0.03 mm apart, with a receptive field of around 0.5 mm (so about 15 columns in any direction have some degree of overlapping fields).

Note the red dot of an incoming axon, bringing some activation from far away. Note also that this impinges on more than one dendrite. If the cells are distributed topographically in relation to the applied semantic content, the shared bits are also related in meaning. I do think topology matters, and this would be a good example of why.

In SDR parlance, this results in shared bits in the topographically related SDRs.

0000001110011000000000000000000000000000000000000111000000000000000000000000000

0000001110011000000000000000000000000000000000000000000011100000000000000000000

Other dots of activation would work much the same way…
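As a small illustration, the shared bits in the two example SDRs above can be found mechanically. This is a minimal sketch using plain Python strings and sets. The helper name `active_bits` is my own and is not part of any HTM library.

```python
sdr_a = "0000001110011000000000000000000000000000000000000111000000000000000000000000000"
sdr_b = "0000001110011000000000000000000000000000000000000000000011100000000000000000000"


def active_bits(sdr: str) -> set:
    """Return the indices of the '1' bits in a bit-string SDR."""
    return {i for i, bit in enumerate(sdr) if bit == "1"}


bits_a = active_bits(sdr_a)
bits_b = active_bits(sdr_b)

shared = bits_a & bits_b   # bits driven by the common input
overlap = len(shared)      # overlap score between the two SDRs

print("shared bit positions:", sorted(shared))
print("overlap:", overlap, "of", len(bits_a), "active bits")
```

The shared positions are the leading cluster of bits, which in the picture above would be the cells whose dendrites all pass near the same incoming axon. The remaining bits differ between the two SDRs.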