The coding of longer sequences in HTM SDRs

The issue of coding longer sequences in an SDR has come up in my Chaos and sequential AI models thread with @complyue . That we need to represent letters with SDR, so we can capture code for paths back along a sequence:

I recall some work done by Felix Andrews @floybix on this, and discussed on the predecessor to this forum, back, initially together with me, starting in 2014. As I recall, basically, making sequence predicting links from only a random selection of nodes in an SDR for a state, could distinguish a path, back, before that state, and onwards, so that the path through that state would be distinguishable in subsequent states too.

This should be important, because the immediate task we are looking at is to represent words as sequences of letters. So we want to have the code for a letter like ā€œeā€, in a word like ā€œplaceā€, also be distinct according to the path which arrived at it, through ā€œpā€, ā€œlā€, ā€œaā€, ā€œcā€. So, in a way, the SDR for ā€œeā€ in this context can be distinct, although sharing enough commonality with ā€œeā€ arrived at in paths through other words, that it can still be identified as a state for ā€œeā€.

Firstly, is this something still in HTM? Or if not is there an archive online anywhere for the pre-HTM Formum, NuPIC discussion list where it was discussed?

Iā€™ve been able to find a few links to Felixā€™s work online. It mostly centered around his port of HTM to Clojure, called Comportex. I see there is still a github repository:

Thereā€™s the document on Felixā€™s continuation of our sequence experiments:
http://viewer.gorilla-repl.org/view.html?source=gist&id=95da4401dc7293e02df3&filename=seq-replay.clj

I think this short demo is of the same (in 2016):
HTM higher-order sequence learning - SDRs

Thereā€™s also this presentation by Felix at the 2014 fall hackathon. Though it may just be for the Clojure port generally:
HTM in Clojure [DEMO #6] (2014 Fall NuPIC Hackathon)

I recall Felix was able to demonstrate the recall of arbitrarily long sequences. It would be good to recover some of those old insights. Or perhaps the current theory of HTM has superseded it??

I donā€™t suppose @floybix still monitors this forum!

1 Like

I think that the temporal memory algorithm should be able to do this.

3 Likes

Excellent. Thanks. My memory is coming back. I recall now that patterns of connectivity on dendrites play an important part in locking the state at each time step to its prediction. Each cell in the succeeding state needs a map of the preceding state on its dendrite. That map locks that cell to a prediction, not just by any cell which can predict it, but by only cells of a particular predicting state.

(This came up in some earlier work on an implementation of my whole network generalization thing, where we didnā€™t notice the dendrite ā€œmapā€, and every cell activated not only the cells of a state succeeding its state, but the cells of a state succeeding any state it participated in! So it basically predicted everything! Which quickly lit up the entire network!)

Having been taught the importance of this dendrite ā€œmapā€ mechanism, though, I recall I actually came to the conclusion this might be overkill. If there were a mechanism to pool a succession of states, each state wouldnā€™t need this full map of the preceding state on each of its cells. The pooling method would group the whole sequence together without it.

I wonder if all that is needed is enough distinctness between paths through states. Without the dendrite ā€œmapā€ locking them together.

But from another point of view, this ā€œmapā€ on the dendrites of each cell somewhat IS the pooling method currently implemented in HTM.

Anyway, it drives home the point that connecting successive states with a subset of the SDR for a given state can represent, not only the connection of a state to its preceding state, but also a path to that state at greater distance (the relevant subset in HTM being a subset of the cells in the active columns of a representation.)

Either we will want the full HTM, dendrite ā€œmapā€, version of that subset connectivity. Or we will find that simply connecting successive states using subsets of a given SDR representation, will give us the diversity for our separate pooling mechanism (synchronized oscillations, Iā€™m hypothesizing) to work.

But this is good. It refreshes my memory how this works in HTM.

That isā€¦ Iā€™m understanding this to be saying itā€™s the entire preceding sequence which encodes each successive prediction. I guess that is implicit in the active state. If there were not a preceding sequence, the columns of the current state would be bursting. So the code captures the entire preceding sequence up until the last ā€œburstā€?

There will be a limit to the length of preceding sequence which can be encoded. As stated it reminds me of an RNN. Except it is not trying to generalize each successive state. The point about an RNN is that it is not only recording an entire sequence, but it is trying to generalize over it. To contrast with transformers, this is not selecting which context to pay attention too, or learning combinations of that.

Iā€™m hypothesizing that failure to select or generalize over preceding states to pay attention to, might be fixed in the same way as the need to ā€œlockā€ the successive states together (by providing a ā€œmapā€ of the entire preceding state for each cell of the entire succeeding state.) The same solution might apply to a mechanism to ā€œlockā€ the states together, and to identify prior states to pay ā€œattentionā€ to. They would both be aspects of a separate ā€œpoolingā€ mechanism (oscillations in a network?)

4 Likes

Iā€™m still figuring out the ideas toward computer simulation, but feel like to mention local inhibition at the mini-column scope, which is leveraged by HTM sequence prediction. Are we imposing this part or not? HTM inhibition is by design and less inclined to get tuned by hand, I wonder possible conflicts in later steps when weā€™ll tune inhibition for desirable oscillations.

1 Like

I donā€™t recall why inhibition is done at the mini-column level in HTM.

Is it to train the sequence pattern?

If itā€™s to train the sequence pattern Iā€™m guessing we can skip it.

I think we can use inhibition only globally, to tune the global oscillation.

Iā€™m guessing we can skip training for sequences. Iā€™m thinking training is only necessary for sequences in HTM, because in HTM both time-steps and state SDRs are basically arbitrary (even if state SDRs have some ā€œspatialā€ meaning, in sequence terms they are arbitrary.) That means that in HTM we need some way to tie successive states together, and that is done by ā€œtrainingā€ the SDR of the prior state on the dendrites of each cell in the successor state (correct?)

In our case however the synchronization process is supposed be the thing which ties successive states together. So the ā€œtying togetherā€ (and actually the definition of what a state is, and what a time step isā€¦!) should happen as part of the same synchronization process (though possibly requiring different paths through sub-sets of SDRs for each time a sequence is encountered, to make the internal clustering tighter and ensure synchronization.)

1 Like

As I understand it, the inhibition speaks for a successfully predicated element (predicated then encountered), and further provide more accurate predication for next element in the sequence. An un-expected element once encountered, fires all neurons in a mini-column (i.e. with no inhibition), speaks for maximal uncertainty in prediction of elements to come next, as consuming way too much biological energy in that sense.

I think if without mini-column scoped local inhibition, we can replace a whole mini-column with a single neuron, yet still get a theoretically equivalent model.

2 Likes

Yes, that would change. We wouldnā€™t be ā€œlearningā€ sequences over time steps anymore. We would be expecting the synchronization process to ā€œselectā€ paths through states, as an equivalent for what ā€œtrainingā€ does for HTM currently (AND pull them together in time to select the ā€œstateā€, AND actually, by pulling them together into a ā€œstateā€, select the sense of what it MEANS to have a ā€œtime-stepā€ between states!)

The sense of a column representing a state, and cells within the column representing paths between states can stay.

We still want a sub-set of cells in a state to represent different paths associated with different sequences of states.

I donā€™t know what the exact split would be between columns for states and cells for paths between states. But we would want paths to be represented as some kind of subset of cells for a single state. So the distinction between columns and cells might carry over.

2 Likes

Then we need either some ā€œdesignā€ for it, or some ā€œevolutional-strategyā€, if supervised-learning (like that in HTM) is dropped off.

2 Likes

Iā€™m seeing the ā€œdesignā€ as the synchronization process.

As stated earlier:

If a sequence of letters has multiple paths through the cells of its constituent columns, then that sequence will tend to synchronize under oscillation. Wonā€™t it? That seems an intuitively clear consequence of having more closely clustered internal paths to me.

Why would a sequence of letters with multiple observed paths through multiple cell sub-sets, not tend to result in that sequence synchronizing under any oscillation?

To have a computer simulation, ā€œsynchronizationā€ is better considered a ā€œmeta designā€, i.e. more physical designs (e.g. SDR schema, tunable parameters etc.) have to be derived from it.

1 Like

There are several types of inhibition cells in the cortex.
As to the type pertinent to your question - Chandelier cells.

These cells ā€œlistenā€ exclusively to the boutons (output from the pyramidal cells soma), and they fire really fast. When the chandelier cell fires it inhibits other pyramidal cells in the immediate neighborhood.
The net effect is the cell that is most sure of its input fires first and the nearby cells are stifled. This is the primary mechanism to enforce sparsity.

Numenta does a rough approximation of this with the k-means portion of the spatial pooler. The primary difference between this and k-means is that the inhibitory cell form a clear local spatial exclusion zone that is a key co-factor in hex-grid formation that is missing in the the Numenta implementation.

Other types of Inhibitory cells work to keep the pyramidal cells balanced on a razors edge, ready to fire.

The inhibitory cells are also involved in learning. These may be a component of the ā€œnegative weightā€ function that @roboto asked about in the Negative Weight thread.
https://www.cell.com/neuron/pdf/S0896-6273(11)00879-8.pdf
and
https://www.cell.com/neuron/pdfExtended/S0896-6273(19)30848-7

6 Likes

Can you envisage what I mean by paths through sub-sets of cells between columns of letter representations?

Basically what exists for HTM now in terms of the raw SDR. But not ā€œtrainedā€ on a particular set of cells for each path, only set oscillating by some regular driving sequence, and inhibited by trial and error to the point where the spread of activation oscillates.

So, letā€™s have the SDR ā€œschemaā€ the same. Just not trained. And likely a different set of cells for each time the same sequence is encountered (resulting in strongly reified sequence paths if a sequence is encountered often, possibly coming down to the same thing in the sense of multiple repetitions equating to ā€œtrainingā€.)

What ā€œtunable parametersā€ do you want? Sure, the number of cells, the number of columns/letter, all those sorts of things likely come down to parameters we would need to ā€œtuneā€. Just as your observation that 26 columns is far too few immediately ā€œtunedā€ my expectation to be much higher.

2 Likes

Nice. If we can find cell types to fit the functional distinctions weā€™re making, all the better.

Have at it!

This to me means resonance (in a dynamic network). Is this what you mean?
How will you know when you have achieved ā€˜oscillationā€™?

Assuming that was what you meant, then you need to ā€˜tuneā€™ any network even to achieve oscillation, let alone resonance (when a stable/repeating pattern locks in). This can be achieved in a very large number of ways, and may not always even be possible for any given network - especially if chaotic .

This sounds like an ā€˜input frequencyā€™ parameter, and a ā€˜timeoutā€™ parameter for stabilization. This will also drive the definition of a step (update rate?) or state.

2 Likes

I think so. Iā€™ve been using the words interchangeably. But resonance will be a maximum on any oscillation. So, another tunable parameter, I guess. Presumably you might vary the inhibition until you got a maximum, and that would be the resonance (about a given driving sequence?)

Looks like this:

Spontaneous oscillation from a random spike input to a network formed from a small language sample.

Yes. I thought it would be tricky to get oscillation. I thought some clever feedback connectivity might be necessary. And there may indeed be some tricks remaining to get something meaningful around a specific driving signal from a prompt, such as using subsets of cells to code for sequence. But it turned out that to get a basic network of language sequences oscillating was not hard at all. This is where ā€œsuck it and seeā€ can be better than endlessly overthinking it. Playing around with connectivities suggested a solution where I did not expect it.

I envisage just a sequence of impulses to the words (letters) of a prompt, in sequence.

Perhaps it will need to drive through many cycles to force the network near a given resonance solution, and maybe some inhibition adjustment together with that.

Then the output should be: the way any network resonance about the driving sequence, pushes the spike time of constituent elements together. Potentially a hierarchy of degrees of being ā€œpushed togetherā€.

3 Likes

Thanks. That explains a lot.
This looks like a stepped network rather than dynamic to me (all firing states are synchronous). This is a much simpler environment than I was thinking of.

1 Like

Youā€™re right. Stepped (if that means all states are updated simultaneously?)

Itā€™s done on Charles Simonā€™s Brain Sim II. Yeah, everything updates simultaneously. And, right again, that will likely make a difference. I hadnā€™t thought of that. There could still be differences of spike times, but it would be very coarse.

That update schedule is just an artifact of the simulation. Ideally it would all be done on parallel hardware with genuine asynchronous processing. But we should address that on a simulation, yes. Good point.

This was just a basic proof of concept for oscillation. Word connections are single step. No coding of sequences. Lots of refinements needed.

But the underlying principles remain that:

a) oscillation is easy to achieve for language sequences and,
b) tightly interconnected sequences should synchronize

Itā€™s just a matter of figuring out how to squeeze the signal we want out of that (basically that tightness of interconnectivity = equivalent to sharing predictions = finding structure to maximize predictions.)

3 Likes

Details-wise, Iā€™m stuck at ā€œcolumns of letter representationsā€, lacking a specification enough concrete.


Iā€™m well aware that you are going to tune a lot of things for experimental purpose, but to write a simulation program from scratch, with bare minimum technical debt, I need very concrete specification about what youā€™d like to start with, and further variations in your mind.

1 Like

Itā€™s great that youā€™re working on it. Many eyes on the same problems finds solutions at multiples of speed.

As you may have guessed, Iā€™m also in a position of ā€œlacking a specification enough concreteā€! Iā€™m thinking as I go. Only holding on to the basic insight: that tightness of clustered sequential links should cause synchronization. And at the same time be ā€œmeaningfulā€ as (borderline chaotic?) structure which will maximize prediction.

So any more detailed specification I give you will just be my best next guess too.

But, taking that next best guessā€¦

How would HTM currently code for letters? Letters in isolation. If there exists an example of how letters were coded for HTM in the past, we might take that as an initial guess how we want to code them now.

I donā€™t mean a word representation as was done by cortical.io. I recall those were ā€œspatialā€ representations (not sequential anyway, as far as I know.) As I recall ā€œspatialā€ in the sense of a kind of ā€œmeaningā€ distance, so ā€œdistanceā€/space from (non sequential) co-occurrence in texts (Wikipedia, I recall.)

But maybe someone did an experiment using SDR representations for letters some time in the HTM past. Then we could use that.

If not, we neednā€™t sweat that. Really I think we can try just about any SDR letter representation as a first approximation. Then when it doesnā€™t work, we figure out why, and try something better. Soā€¦ Your initial guess of 2000 columns results in a network likely too big to simulate. So why not 200? 200 columns, each with 100 cells. Assign some well distributed random code from those 200 columns to each letter (as a tweak we might cluster them on similarity of sound value, but I donā€™t think thatā€™s necessary for a start.)

I think that, randomly assigned over 200 columns, code might be enough to start (will 200 columns be enough, do you think?) But as a biological digression your question got me interested in the actual code coming from the cochlea. So Iā€™ve turned up the paper below. Iā€™ll take a look at that. But that may be closer to actual biology than we need to go. I donā€™t know. Iā€™ll take a look at it. Interesting tangent:

Sound Coding in the Auditory Nerve: From Single Fiber Activity to Cochlear Mass Potentials in Gerbils

2 Likes