How does the distribution of input bits across input values affect the result of spatial pooling?

Imagine I have a sequence like ABCDEF (or any arbitrarily complex sequence and number of input bits), and another sequence which is ABCDEX (or any sequence with the only difference being the last value) - if I feed these values into an HTM, adding one single input bit to represent whether the sequence is of the first or second type, is it impossible to predict the final letter?

My thinking is that each individual input neuron can contribute at most one vote for the presence of a pattern in the input, and at minimum zero. Does that imply that, depending on how many neurons are used to represent specific inputs, some patterns are simply unreachable or unpredictable? Or even that some will be oversampled, giving them more weight when voting for a specific pattern? Because no matter what that one bit is, it will be overshadowed by the combined votes for all of the other elements in the sequence.
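
To make the worry concrete, here is a toy numeric sketch (all names and sizes are invented for illustration, not taken from any HTM implementation): two stored patterns that differ only in one "sequence type" bit, scored by simple overlap voting.

```python
import random

random.seed(0)

N = 1000          # input width in bits (illustrative)
ON = 40           # active bits per letter SDR (illustrative)

def random_sdr(n=N, on=ON):
    return set(random.sample(range(n), on))

shared = random_sdr()                 # bits encoding the common prefix ABCDE
type_bit = N                          # one extra bit flagging sequence type

pattern_1 = shared | {type_bit}       # "ends in F" template
pattern_2 = set(shared)               # "ends in X" template

# An input of type 1: the shared bits plus the flag bit.
input_bits = shared | {type_bit}

votes_1 = len(input_bits & pattern_1)  # 41 votes
votes_2 = len(input_bits & pattern_2)  # 40 votes
print(votes_1, votes_2)  # the single flag bit barely changes the score
```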

Just wanted to clear up my understanding, thanks in advance.

HTM theory, and presumably biology, avoids encoding two different things - X and F in your case - with a single bit of difference.

I assume you are not familiar with the SDR concept and various encodings. https://www.numenta.com/assets/pdf/biological-and-machine-intelligence/BaMI-SDR.pdf

PS: in HTM you can’t feed in the sequence as a whole, only one input at a time. You could imagine feeding it a whole bunch of past “tokens” together, as in transformers, but that just isn’t what happens at the HTM level.

Beware also that each token in transformers is quite a long fixed-length vector, not a single bit - making for a very wide input.

I am vaguely familiar; I understand that they are meant to encode semantic representations which vary together, but I couldn’t find anything on this specific case and wanted to check my understanding. Because of the winner-takes-all algorithm, does that also imply that the relative size of different patterns in the input space can unevenly affect how they are detected? How is that not a major concern with using the algorithm - are all the values often just perfectly balanced in practice?

Ah, actually I don’t think I was clear about my problem. I’m not saying that X and F are encoded with one bit of difference; I’m saying that given the sequence ABCDE (fed one letter at a time, of course) and one bit determining whether the string is of type 1 or type 2, it’s impossible to predict the last letter given the other letters.

Each input letter gets assigned an SDR - which is a vector of sparse bits, e.g. 1000 bits long.
HTM sees one vector at a time: first
“A”, then
“B”,
…
“E”

Now if it expects that “F” follows, it will light up all the bits representing “F”; if it expects “X”, it lights up a different SDR with (almost) entirely different bits; and if it predicts both, it lights up the bits for both X and F.
Which is a bit confusing, but it never comes down to a single-bit decision.
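
A toy sketch of that last point, with invented sizes: predicting “F or X” means lighting the union of two nearly disjoint SDRs, so the ambiguity is spread across many bits rather than hinging on one.

```python
import random

random.seed(1)
N, ON = 1000, 20   # illustrative width and sparsity

def random_sdr():
    return frozenset(random.sample(range(N), ON))

sdr_F = random_sdr()
sdr_X = random_sdr()

# Predicting "F or X" is the union of the two SDRs...
prediction = sdr_F | sdr_X

# ...and each candidate can still be matched against the prediction
# by overlap, so the ambiguity never collapses to one bit.
overlap_F = len(prediction & sdr_F)   # 20: F fully contained
overlap_X = len(prediction & sdr_X)   # 20: X fully contained
shared = len(sdr_F & sdr_X)           # random SDRs barely overlap
print(overlap_F, overlap_X, shared)
```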

PS
“Recognizing a sequence” only means it predicts the next letter (or SDR)
It doesn’t make any cognitive jump like “ABCDEX” means apples and “ABCDEF” means bicycle.

Yep, I’m familiar with SDRs. My question is, can it ever unambiguously predict X or F? Technically it has all the information to make an unambiguous prediction with one single bit, but because the input values (in this case, the letter encoding, and the string-type encoding) are scaled based on their relative size in the input SDR, the one bit of information will never be enough to overcome the voting of the other bits. And because all HTM connections are either 0 or 1, it can’t suppress invalid connections from the other bits which predict X and F simultaneously. Also, to be clear, when I say predict the next letter, I’m making an implicit assumption that the spatial pooling bits can be mapped back to the input space to get a letter back out, instead of predicting some abstract representation of patterns which describe the letter, where multiple simultaneous predictions are possible.

And I understand the part about recognising sequences - are you confused about what I’m implying? I’m not making any claims about how those sequences map to real things, just that, mathematically, the input size heavily skews the identification of spatial patterns towards parts of the input space which take up more bits. Could you address that point in particular, as that seems to be what I’m stuck on - or perhaps point out where I’m misunderstanding the implementation?

SDRs are quite noisy, so it can’t depend on a single bit of difference to distinguish between the two. What it can pick up, however, is differences in deeper past context - i.e. what else was there before A the times X followed E, versus what was preceding A the last times F followed E.

No. At best, it predicts both.
What follows on presentation is either recognition - or bursting, if it is a novel SDR.

Spatial pooling does no learning across time, so it isn’t actually part of sequence prediction. It is more like a pre-processor that transforms a not-so-nice input into a nice SDR - with fixed length and fixed sparsity.
TM picks up the SDRs generated by SP and makes predictions across time.
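
A minimal sketch of that pre-processor role, assuming the usual overlap-then-winner-take-all scheme (all parameters here are invented for illustration): whatever the input looks like, the output always has the same size and sparsity.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes, not from any real SP configuration.
n_inputs, n_columns, k_winners = 200, 100, 5

# Each column has a random binary "potential pool" of input connections.
connections = rng.random((n_columns, n_inputs)) < 0.3

def spatial_pool(input_bits):
    """Return the indices of the k columns with the highest overlap."""
    overlaps = connections @ input_bits          # votes per column
    return set(np.argsort(overlaps)[-k_winners:])

x = rng.random(n_inputs) < 0.1                   # some messy input
active = spatial_pool(x.astype(int))
print(len(active))   # always exactly k_winners: fixed sparsity
```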

What you seem to suggest is that the SP should encode the whole sequence “ABCDEX” in a single SDR, and “ABCDEF” in another? And because there’s a single letter of difference between them, there’s a chance the SDRs encoded by the SP in these two cases will differ by only one bit?
That would be subpar performance; it should do much better than that.

And the mapping-it-back assumption… well… maybe approximately.
Actually there is no such decoder; you’ll have to build one yourself. And expect an approximation - the reconstruction will be something that resembles (i.e. has a pretty high overlap with) the original.
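
A hypothetical decoder along those lines (nothing like this ships with HTM; the codebook, sizes, and noise levels are invented): keep the letter-to-SDR mapping around and return the letter with the highest overlap, which tolerates a fair amount of noise.

```python
import random

random.seed(2)
N, ON = 1000, 40   # illustrative width and sparsity

def random_sdr():
    return frozenset(random.sample(range(N), ON))

# Hypothetical codebook mapping each known letter to its SDR.
codebook = {letter: random_sdr() for letter in "ABCDEFX"}

def decode(sdr):
    """Return the letter whose SDR overlaps the input the most."""
    return max(codebook, key=lambda letter: len(codebook[letter] & sdr))

# A noisy copy of "F" (drop 5 bits, add 5 random ones) should still
# decode to "F" because the overlap stays high.
noisy = set(list(codebook["F"])[:35]) | set(random.sample(range(N), 5))
print(decode(noisy))
```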

I think it’s interference between signals you’re concerned with.
That’s exactly what SDRs are for.

They are sparse and distributed.
Their sparse nature by itself makes it unlikely that unrelated signals overlap.
Plus, their distributed nature goes one step further by rendering the few coincidental overlaps insignificant.

I feel like your question is more related to TM(how prediction works) rather than SP(how the input is encoded without the context).
And the properties mentioned enable even a single SDR to reliably represent multiple possibilities(contexts), implemented by the union operation.
So for most cases, ambiguity is fine and the TM predicts all the possible inputs.

Similarly, these properties of SDR not only improve the robustness on the “working memory” but also on the long-term memory residing in the synapses as well.
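
A quick sketch of the union operation with invented sizes: store ten patterns as a single union SDR, then test membership by overlap; sparsity keeps the false-positive rate negligible.

```python
import random

random.seed(3)
N, ON = 2048, 40   # illustrative width and sparsity

def random_sdr():
    return frozenset(random.sample(range(N), ON))

# Represent ten possibilities at once as one union SDR.
patterns = [random_sdr() for _ in range(10)]
union = frozenset().union(*patterns)

def is_member(sdr, threshold=30):
    """Membership test: does most of the SDR fall inside the union?"""
    return len(sdr & union) >= threshold

member = patterns[0]       # a stored possibility
stranger = random_sdr()    # an unrelated pattern
print(is_member(member), is_member(stranger))
```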

No, you’ve misunderstood me. I mean turning each letter (“A”, “B”, etc.) into a spatially pooled output [0101001…], using that to predict the next spatial pooling with temporal memory [1010111…], then turning that back into a letter in reverse. I know it’s not common, but I mean that sort of prediction; I know that those systems are independent. I want to ignore the one-bit difference for now. More generally, is it true that larger input features (subsets of the full input SDR - like time, or some other feature of the input) dominate pattern recognition in proportion to how large they are?

For example, if I had some SDR of combined auditory and visual input, and they contradicted each other, the larger of the two features in terms of bits would most likely “overpower” the spatial pooling of the other, because they represent more votes for that pattern, and WTA means that patterns which don’t pass a threshold will not be activated.
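
To put a number on this concern, here is a toy simulation (field sizes and connection probability are invented): with uniform random connectivity over a concatenated input, each field contributes overlap votes roughly in proportion to its width.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative: a large "auditory" field and a small "visual" field.
n_audio, n_visual = 900, 100
n_cols = 50

# Each column connects to ~20% of the whole input, sampled uniformly.
connections = rng.random((n_cols, n_audio + n_visual)) < 0.2

x = np.ones(n_audio + n_visual, dtype=int)  # both fields fully active
audio_votes = connections[:, :n_audio] @ x[:n_audio]
visual_votes = connections[:, n_audio:] @ x[n_audio:]

# With uniform sampling, the larger field contributes votes roughly in
# proportion to its width, so it dominates any overlap-based WTA.
print(audio_votes.mean(), visual_votes.mean())
```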

Going from SDR to back original input is not easy.

Even if you can, the prediction will be ALL the possible learned inputs. So that will be both F and X, and depending on what you have been using as training sets, even all sequences that had ???CDE?, ???DE?, and ???E? as priors.

All HTM can “predict” is that it has seen that SDR transition before. Other connections may have been formed in prior training set that give it more combinations that are “not surprising.”

Cortical IO does some of what you are asking for by forming a SOM map outside of HTM to give “neighborhoods” of similarity. That is NOT a 1:1 match, just a similarity based on distance between hits.

Oh, I think I misunderstood your question.
I guess it was more about how you decode predictions from TM?
I can see how naively collapsing the TM predictive cells into SP columns then simply back-projecting them to the input cells might cause unreliable decoding.

If it’s a case where you have only a finite number of input patterns - as with characters, rather than, say, images, where it quickly gets out of hand - I think I’ve got an idea.
I don’t know if something like this has been tried before, but…

(Assuming TM works without problems as in only few columns burst when a “correct” input is introduced,)
You enumerate every input pattern and feed it to HTM.
Then you check how many columns burst.
If only a handful of columns burst, you add the input pattern to the list of possible future inputs.
Revert the system back to the previous state and repeat.

If you feel like it would be too costly to iterate through all inputs, you could use the naive collapse-and-back-project method in advance to rule out the obvious non-contenders.
Maybe some modifications would allow this trick to work on images? I don’t know. :man_shrugging:
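
The enumeration idea might look something like this sketch, with a stand-in for the real TM state (all names here are hypothetical): `predicted` is the set of columns the TM put into a predictive state, and a candidate input “bursts” in every active column that was not predicted.

```python
def burst_count(candidate_columns, predicted_columns):
    """Columns the candidate would activate without being predicted."""
    return len(candidate_columns - predicted_columns)

# Hypothetical column sets for each known input pattern.
alphabet = {
    "F": {1, 4, 9, 12},
    "X": {2, 5, 8, 13},
    "Q": {3, 6, 7, 14},
}
predicted = {1, 2, 4, 5, 8, 9, 12, 13}   # stand-in: TM expects F or X

# Enumerate every known input; keep those with few bursting columns.
possible = [
    letter for letter, cols in alphabet.items()
    if burst_count(cols, predicted) <= 1
]
print(sorted(possible))
```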

If you just want to get predictions out, the classifier can do that. There are videos about it if you search “numenta classifier” on YouTube. I recall it doesn’t actually use the predictive states; instead it does something like associating the sequence info represented by the TM with the next raw element in the sequence. I could be remembering wrong though.
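
A rough sketch of that association idea (a simplification: Numenta’s SDR classifier learns weights and applies a softmax, rather than the raw counts used here): tie the cells active in the TM at step t to the raw input seen at step t+1, then predict by tallying.

```python
from collections import Counter, defaultdict

votes = defaultdict(Counter)   # cell index -> counts of next raw inputs

def learn(active_cells, next_input):
    """Associate the current TM cell activity with the next raw input."""
    for cell in active_cells:
        votes[cell][next_input] += 1

def predict(active_cells):
    """Tally the associations of the currently active cells."""
    tally = Counter()
    for cell in active_cells:
        tally.update(votes[cell])
    return tally.most_common(1)[0][0]

# Two contexts that share letters but end differently (hypothetical
# cell sets standing in for TM states after ...ABCDE of each type):
learn({1, 2, 3}, "F")
learn({4, 5, 6}, "X")
print(predict({1, 2, 3}), predict({4, 5, 6}))
```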

For example, if I had some SDR of combined auditory and visual input, and they contradicted each other, the larger of the two features in terms of bits would most likely “overpower” the spatial pooling of the other, because they represent more votes for that pattern, and WTA means that patterns which don’t pass a threshold will not be activated.

Could someone please answer my main question? Ignoring the whole reconstruction thing for now, is it true that if an input pattern is small enough, it just won’t be able to “win” enough votes to pass the winner takes all filter?

It seems like the kind of thing you’d need to test out. If there are independent sub-patterns, some spatial pooler cells (with the right random connectivity) might end up learning the patterns with fewer bits. Those cells might end up having a high gain value, to force them to fire often enough.
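
A deterministic toy example of that gain idea (the exponential formula and all numbers are illustrative, loosely modelled on the SP’s boost factor, not the exact implementation): a column tuned to a small sub-pattern has a low raw overlap, but a rarely-firing column’s boost can lift it past a busier one.

```python
import numpy as np

raw_overlap = np.array([10.0, 2.0])    # big-feature col vs small-feature col
duty_cycle  = np.array([0.50, 0.02])   # how often each has fired recently
target_duty = 0.2                       # desired activation frequency

# Columns below the target duty cycle get boosted, above it suppressed.
boost = np.exp(6.0 * (target_duty - duty_cycle))
boosted = raw_overlap * boost

winner_without_boost = int(np.argmax(raw_overlap))   # the big column
winner_with_boost = int(np.argmax(boosted))          # the small column
print(winner_without_boost, winner_with_boost)
```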
