Chaos/reservoir computing and sequential cognitive models like HTM

Overlap/association is now based on a cluster. A cluster is based on what synchronizes in context. What synchronizes in context is based on how many other shared contexts a set of sub-sequences has.

But overlap/association, in contrast to what I remember of HTM (where it was based on the current state), is now based on a cluster, which might extend back along the sequence. So it is “pooled”.

“Noticing the match” can be as simple as resonating with closely connected elements in other areas. Maybe sensory-motor, whatever. “Noticing the match” can also just be suggesting a next element (or pooled clustering of elements), which is the sole current output of transformers.

Did I address this? You can get “new” structure as new groupings based on shared contexts. A new “prompt”, or submitted/activation-driving sequence, might select a new one. In particular it might generate a new “parse”, like the parse in the ((Place (wood blocks)) (under (the wheel))) example, above.

Got time to listen to some of that Julia talk you linked to. Interesting. They are training networks to reproduce mouse brain activity plots?

So they want to learn a network connectivity.

For what I want to do we don’t need to do this. We are hypothesizing a network connectivity. So we just read in the connectivity we want. We don’t have to try and learn any connectivity. Our task is much easier.

I do wonder about the networks they learn. If they managed to learn actual mouse brain connectivity, that would be great! Then all they would need to do would be to connect up some legs, hide the cheese, and they would have an AI mouse!

But I doubt they are learning the actual mouse brain connectivity.

I think the networks they learn (using a least squares algorithm?) may reproduce close analogs of the spike plots of recorded mouse brain activity. But I would be surprised if they reproduced the mouse connectivity, or functionality, beyond the coarse external measure of average activity. I suspect they may be modeling some gross parameter of mouse brain activity, without the function.

I think what they get might be like training a network to produce the heat of a computer, but not actually doing any calculations!

You may find this informative!

https://cic.ini.usc.edu/

Thanks.

Useful in itself, I’m sure. But they are not learning that detail of connectivity by least squares regression on average activity either.

I’m interested in these connectome projects. But the really interesting problem to me is to figure out the functionality of these networks.

With a hint at functionality, the most interesting result for me was a paper from Markram’s work a few years back which talked about constantly forming and disintegrating cliques:

“It is as if the brain reacts to a stimulus by building then razing a tower of multi-dimensional blocks, starting with rods (1D), then planks (2D), then cubes (3D), and then more complex geometries with 4D, 5D, etc. The progression of activity through the brain resembles a multi-dimensional sandcastle that materializes out of the sand and then disintegrates,”

Fuller paper here?

Cliques of Neurons Bound into Cavities Provide a Missing Link between Structure and Function

“we represented the spiking activity during a simulation as a time series of sub-graphs”

Interesting, a “time series of sub-graphs”. I wonder what the activation of my network would look like as a time series of sub-graphs.

I like these dynamically forming and reforming blocks. This to me is more evidence for the dynamism I’m talking about, responsible for the indeterminacy of natural language grammar.

Though Markram’s team also only talked about general topologies. No idea about how the topology of the functionality links to meaning.

Here’s another paper which is quite interesting for a fuller examination of those topologies (static this time?):

Cliques and cavities in the human connectome
Ann E. Sizemore, Chad Giusti, Ari Kahn, Jean M. Vettel, Richard F. Betzel & Danielle S. Bassett
https://www.nature.com/articles/s42005-021-00748-4

I think I’m making progress.


I’d like to start with a minimal implementation of a neural connection simulator, just sufficient to drive the oscillation. Some follow-up questions for confirmation/clarification:


Can I use an even more simplified LIF model? Something like:

  • membrane potential assumed linear, resting at 0, with a positive threshold as an exogenous parameter
  • synapse connection strength assumed to scale linearly
  • allow negative synapse connection strength for inhibition?
  • an “action potential” (i.e. “firing”, am I correct?) simply resets the membrane potential to 0, with 0 refractory time
  • leak rate assumed linear, as an exogenous parameter
  • discrete time steps: one step, one unit of time elapsed

– exogenous parameter: fixed per simulation run, or at most changeable via the GUI at runtime
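To make those rules concrete, here is a minimal sketch of such a simplified LIF simulator, assuming exactly the behaviour listed above (linear membrane resting at 0, linear leak, signed synapse weights, reset to 0 on firing, zero refractory time, discrete unit time steps). The class and method names (Network, add_synapse, step) are made up for illustration; they are not taken from the existing example code.

```python
from collections import defaultdict

class Network:
    def __init__(self, n_neurons, threshold=1.0, leak=0.1):
        self.threshold = threshold      # exogenous, fixed per simulation run
        self.leak = leak                # linear leak rate, exogenous
        self.v = [0.0] * n_neurons      # membrane potentials, resting at 0
        self.out = defaultdict(dict)    # pre -> {post: weight}; weights may be negative

    def add_synapse(self, pre, post, delta=1.0):
        # Increase (or, with delta < 0, decrease) connection strength by one unit.
        self.out[pre][post] = self.out[pre].get(post, 0.0) + delta

    def step(self, drive=None):
        """One discrete time step (one unit of elapsed time)."""
        drive = drive or {}
        # 1. inject external drive and apply the linear leak
        for i in range(len(self.v)):
            self.v[i] = (self.v[i] + drive.get(i, 0.0)) * (1.0 - self.leak)
        # 2. neurons at or above threshold fire (an "action potential")
        spikes = [i for i, v in enumerate(self.v) if v >= self.threshold]
        # 3. propagate spikes along weighted synapses, then reset to 0 (no refractory period)
        for i in spikes:
            for post, w in self.out[i].items():
                self.v[post] += w
            self.v[i] = 0.0
        return spikes
```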

Does each word map 1:1 to a LIF neuron? Does AddSynapse() increase the connection strength by 1 unit between the 2 neurons standing for the specified 2 words?

Your example code seems to only create connections between directly neighbouring words. I don’t see how a cluster can be formed, or even how a multi-word noun phrase gets represented.

Is inhibition achieved by tuning parameters, or by some other means? What is the “intensity”, and how do we “vary” it?

Great. Yes, it makes lots of sense to just start with whatever minimal implementation is easiest. That’s basically what I did.

Absolutely. It may matter, but I don’t know of anything that would make a difference at this stage.

An “action potential” to my understanding would be the firing potential I guess, yes.

That’s as far as I have got. It is sure to be wrong. I think we will need to represent sequential elements (words, strings, sound…) as an SDR. Representation as an SDR should enable paths to be distinguished and longer distance relationships to be encoded. This should be the equivalent of “attention” in a transformer model.

As a first approximation to show that the basic language sequence connectivity oscillates, I think we should go with a single node for each word for now.

Yes. I only implemented connections between direct neighbour words up to now. You are quite right. This is surely far too simple. Hence the need for representation of words (or any sequential element) as a cluster (SDR?).

That said, I don’t think it absolutely precludes identification of any structure. I expect the structure to be revealed by the spike times, in the way of the Brazilian paper’s fig. 2, where clusters are revealed by clear groupings of spike times in a raster plot. Distance relations would only inform the spike time groupings; they are not essential to them. Although that information might be important, to draw the analogy with transformer “attention”.
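As a rough illustration of what “clusters revealed by groupings of spike times” could mean computationally, here is a sketch that groups the spikes of a raster into clusters wherever the gap between consecutive spike times exceeds a threshold. The gap value is an arbitrary assumption, just to show the idea.

```python
def spike_time_clusters(spikes, max_gap=1.5):
    """Group (node, time) spikes into clusters separated by time gaps > max_gap.

    A crude stand-in for reading clusters off a raster plot by eye: spikes
    whose times bunch together are treated as one synchronized cluster.
    """
    if not spikes:
        return []
    ordered = sorted(spikes, key=lambda s: s[1])     # sort by spike time
    clusters, current = [], [ordered[0]]
    for node, t in ordered[1:]:
        if t - current[-1][1] > max_gap:             # large gap -> start a new cluster
            clusters.append(current)
            current = []
        current.append((node, t))
    clusters.append(current)
    return clusters

# e.g. spike_time_clusters([(3, 0.0), (7, 0.4), (1, 0.9), (5, 4.0), (2, 4.3)])
# -> [[(3, 0.0), (7, 0.4), (1, 0.9)], [(5, 4.0), (2, 4.3)]]
```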

But we do eventually want to move to multiple node representations for each element of the sequence being modeled. So that distance relations can be coded (in sub-graphs of the SDR.)

But, actually, you know, this might happen automatically. The choice of “words” as node labels, as opposed to “letters” or “phrases”, should be completely arbitrary. Words are just one level of sequential clustering. One level of “pooling” of the sequence. Eventually I would expect the implementation to be sequences of discrete samplings of sound.

Then “words” would just be a clustering of phonemes/letters (which would themselves be clusterings of sequential discrete sound samplings.)

So the way to get a distributed/SDR representation for words may be to start, as a next level of precision, by forming the initial sequence network from sequences of letters. In fact, I like that. Because letters are much less likely to have meaningful distance relationships… (though not always! Several languages have long-distance elisions and agreements of sound structure… Arabic and Turkish are two which come to mind… It blows the mind. I’m trying to deal with that at the moment!) But maybe for English a good way to code words as an SDR would be to code letter sequences in the initial network, and allow the word SDRs to form automatically… Only problem: I’m not sure how much that might increase the parallel processing burden… Basically it would mean increasing the size of the network from order of the number of words, to order of the number of words x 26?? Might mean going from a network of order the number of words (English typically 100,000 or so…) to a 2,600,000 node network :frowning: Ouch.

Well, no. Obviously not! It would mean going to a 26 node network!

But the processing burden would increase somehow… Does that mean there would be 2.6 million distinct clusterings of that 26 node network? So a “virtual” network imprinted in the 26 node letter network, with 2.6 million virtual nodes representing the sequences of words?
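To make the “virtual nodes imprinted on a 26-node letter network” idea concrete, here is a sketch that builds the letter network from a word list: at most 26 physical nodes, directed edges between consecutive letters, each edge weighted by how often that letter pair occurs. The distinct letter paths through this small graph, one per word, play the role of the virtual nodes. The word list is just a placeholder.

```python
from collections import defaultdict

def build_letter_network(words):
    """Build a letter-level sequence network: nodes are letters, edges are
    letter bigrams weighted by how many times they occur across the words.

    Returns (edge_weights, paths): the shared physical edges, plus the set of
    distinct letter paths (the "virtual" word-level structure)."""
    edge_weights = defaultdict(int)     # (letter, next_letter) -> count
    paths = set()                       # distinct letter sequences = virtual nodes
    for word in words:
        word = word.lower()
        for a, b in zip(word, word[1:]):
            edge_weights[(a, b)] += 1
        paths.add(word)
    return edge_weights, paths

edges, virtual_nodes = build_letter_network(["place", "wood", "blocks", "block"])
print(len(virtual_nodes), "virtual word paths imprinted on at most 26 physical nodes")
```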

Perhaps it is at the level of the raster plots that things would be insanely complicated. There would be all kinds of different spike time patterns.

Actually, that becomes a little interesting. Because as I mentioned, there’s a certain amount of neuroscience-inspired work which focuses on spike times for representation. I mentioned Brainchip, and Simon Thorpe. They use a kind of spike time coding. And that is supposed to be biologically motivated. But I wonder if the link they make between spike time and representational class is sophisticated enough.

It’s interesting though. Our structure might come down to what elements have spike times which are close together too.

Here’s a paper by Simon Thorpe. I’ll take another look at the approach and think about how it might relate to complex spike time patterns appearing in our raster plots:

Analyzing time-to-first-spike coding schemes: A theoretical approach

Such a network, even fully connected, can have just 26*26*2 = 1,352 synaptic connections. Can it really be this small (w.r.t. parameters)?

Or should each “node” there really be a “mini-column”, as in HTM?

How is the 2.6M number calculated? I don’t get it.

Being biologically inspired as we are, the connectome defines the pathways and interconnections of the hierarchy.

Lesion studies give important clues on the gross functioning of the areas.

Other imaging techniques (BOLD comes to mind) add further information about the time course of local and inter-area activation.

These techniques offer a bounding box on the inner workings of the areas; this affords a top-down avenue of investigation.

I think of the serial updates of HTM as roughly corresponding to the ~100 ms alpha wave cycles.
I think of the lateral voting within an area (TBT theory) as the ~10 to 25 ms gamma cycles.

The thalamus provides synchronization waves that allow various areas to fire in close harmony.
This primes the L2/3 connections between maps/areas to coordinate spike timing based learning.

For the details of how this activation spreads from one area to related areas in other maps see the blackboard description in the “three visual streams” paper.
“Anatomically, we hypothesize that the pulvinar nucleus of the thalamus plays the role of a projection screen where the predictions are represented (similar to Mumford’s (1991) blackboard conception). These predictions are generated every 100 msec (10 hz, alpha rhythm), collaboratively by the entire visual neocortex, conveyed to the pulvinar via extensive corticothalamic projections from cortical deep layers.”

That was just me thinking of each word with a 26-letter code. Also not the case.

The code depth would be more like the number of permutations and combinations of n elements. Is that more like n!, is it? I forget the formulas. So 26! things to be coded (~4E+26?). Even permutations might have code value, because the timings could vary. So you could think of the different combinations of elements as having order, as well.
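As a quick check of that ~4E+26 figure, the exact value of 26! is easy to confirm:

```python
import math

math.factorial(26)   # 403291461126605635584000000, i.e. about 4.03e26
```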

I think the code depth is quite big.

But squeezing it down to 26 letters might be a bit bare bones all the same. To code for longer distance connections you need to be able to represent variation in the code according to path at distance greater than 1.

What’s the basis for temporal pooling in HTM these days Mark?

I have not seen much activity from Numenta on the H of HTM.

My personal opinion on canon HTM is this:

Spatial pooling is used at a local map/area to create sparsity at that level.
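For concreteness, here is a minimal sketch of that spatial pooling step: an overlap score per column, followed by k-winners-take-all to enforce sparsity. It is not Numenta’s full Spatial Pooler (no boosting, no permanence learning), and all the sizes below are arbitrary assumptions.

```python
import random

def spatial_pool(input_bits, potential, k=40):
    """Pick the k columns whose potential synapses overlap the input the most.

    input_bits: set of active input indices
    potential:  one set of sampled input indices per column (its potential pool)
    Returns the indices of the k winning columns (the sparse activation)."""
    overlaps = [len(input_bits & pool) for pool in potential]
    ranked = sorted(range(len(potential)), key=lambda c: overlaps[c], reverse=True)
    return set(ranked[:k])

# Arbitrary sizes: 2048 columns, each sampling 64 of 1024 input bits.
random.seed(0)
potential = [set(random.sample(range(1024), 64)) for _ in range(2048)]
active_input = set(random.sample(range(1024), 200))
active_columns = spatial_pool(active_input, potential, k=40)   # ~2% sparsity
```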

As time progresses the sequential processing at that level will activate some population of columns. As long as that is a learned sequence that same population of columns will be active.

At a higher level, the confluence of two or more streams composes relatively static patterns signifying that some known pattern is happening. As the lower level switches to a new sequence pattern the constellation of active columns evolves. The higher level sees this as the next step in the sequence that it learns.

Lateral connections between areas - as the streams go up the hierarchy, there are cross-connections and “skips” at every level. There would not be much point in having a chain of HTM processing areas without these cross-connections. At this level the spatial pooling acts to sort out this intermingling of inputs. I see this as a big part of the long-sought sensor fusion function.

Sequences - Much of this revolves around what constitutes a sequence. Much digital ink has been spilled in this forum on repeating sequences and how a sequence starts and ends. If you add the biologically plausible modification of habituation and some condition that triggers this “getting tired” (a certain number of transitions would be a good start) you transition to sequences as short phrases.

At a higher level, this collection of phrases is the sequences seen at that level.

I don’t have the numbers in front of me but I have worked out with @CollinsEM the time frames based on alpha rate and the 4 areas between the raw senses and the hub regions of the parietal lobe. Plugging in a few starting assumptions we worked out about 3 minutes as the “here and now” time frame. This actually fits with many common human scale time frames, such as song length.

Naturally, some of the details change with the hex-grid implementation of HTM but these time details remain the same.

Quoting Buzsáki again (from a recent talk with Joscha Bach https://www.youtube.com/watch?v=NEf8LnTD0AA) he suggested that we have millions of pre-existing neural sequences which are just acquired by matching signals.
This might be a mechanism for few-shot learning and lowering the repetition requirement.

Practically, it isn’t feasible to pre-gen many (unknown) permutations on spec but perhaps this is more akin to many pre-tuned oscillators/SDRs, allowing imperfect/partial matches.
Such permutation farms could themselves be genetically tuned, I guess, and be highly modal as yet another selector.

This is my mere speculation on his comments - probably better to ask him directly.

You’re equating oscillators to SDRs. Which may be quite reasonable. And no doubt you could pre-generate many.

But why do so, if the selection is the same as the generation?

Interesting talk panel though. Also including Christoph von der Malsburg (whom I’ve had something to do with through Mindfire, the Swiss AI initiative). I’ll listen to that (2 hours?)

But just to get some ideas out.

I was thinking about going down to a letter representation level as I discussed with @complyue . It might be something to try. The logic would be exactly the same. But thinking first in terms of letters may clarify what we are looking for.

Thinking out loud here, so there may be logical errors. But in the spirit of encouraging early feedback.

If we imagine the letter raster plot in time, then we might imagine coding the input of a sentence (a prompt) like “Place wood blocks” with node spikes for the constituent letters as (letter vs. time of spike):

a       x
b                             x
c          x
d                          x
e             x
f
g
h
i
j
k                                      x
l    x                           x
m
n
o                   x   x           x
p x
q
r
s                                         x
t
u
v
w                x
x
y
z

Where I have taken time gaps of three spaces between each letter as it occurs in the phrase in time.
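A small sketch of how the prompt spikes in the raster above could be generated: each letter of the phrase (ignoring spaces and case) becomes a spike on that letter’s node, three time units after the previous one. The function name is just for illustration.

```python
def phrase_to_spikes(phrase, gap=3):
    """Return a list of (letter, time) prompt spikes, one per letter, 'gap' apart."""
    spikes = []
    t = 0
    for ch in phrase.lower():
        if ch.isalpha():            # skip spaces and punctuation
            spikes.append((ch, t))
            t += gap
    return spikes

prompt = phrase_to_spikes("Place wood blocks")
# [('p', 0), ('l', 3), ('a', 6), ('c', 9), ('e', 12), ('w', 15), ('o', 18), ...]
```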

Those are the input, or “prompt” spikes. The actual network would also be coded like this. So you can think of these spikes as having edges between them in sequence. And for the whole network, representing many phrases, each node would have many, many, such edges.

As input, the spikes for the letters making up the words are equally spaced. A bit like this:

P   l   a   c   e   w   o   o   d   b   l   o   c   k   s

The hypothesis is that sub-sequences of letters like “place”, “wood”, and “blocks” would tend to have more edges at their beginning and end, and thus under oscillation would tend to synchronize their constituent letter spikes, so the spike times might be pushed closer together, like this:

  P  l  a  c  e      w  o  o  d       b  l  o  c  k  s

Because there would be more edges to the rest of the network at the ends of words, rather than the middle of words.
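As a rough, toy check of that intuition, here is a sketch that counts, for each letter, how many distinct letters ever follow it in the continuous (space-stripped) letter sequence of a small corpus. With enough phrases, word-final letters (which connect onward to many different word-initial letters) should tend to show higher diversity than word-internal letters, which is the property the oscillation is supposed to pick up. The corpus here is a stand-in.

```python
from collections import defaultdict

def successor_diversity(corpus):
    """For each letter, count how many distinct letters ever follow it in the
    continuous (space-stripped) letter sequence of the corpus phrases."""
    successors = defaultdict(set)
    for phrase in corpus:
        letters = [c for c in phrase.lower() if c.isalpha()]
        for a, b in zip(letters, letters[1:]):
            successors[a].add(b)
    return {letter: len(nexts) for letter, nexts in successors.items()}

corpus = ["place wood blocks", "place the block", "wood is good"]   # stand-in corpus
diversity = successor_diversity(corpus)
# Letters that end words get followed by many different word-initial letters,
# so over a larger corpus their diversity should mark candidate pooling breaks.
```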

Thinking about it, as such, this might be seen to be very much like the way pooling was already done in HTM (and still is? @Bitking , @DanML ?) Which is to say, based on the idea that predictions of the next letter are more diverse at word breaks. Assuming an HTM were coded to represent letters.

The difference might be that we would be identifying diversity of prediction at word breaks using spike synchronization under oscillation. Which would be a kind of “external” network representation for each letter (external in the sense of representing each letter by that letter’s external connectivity to other letters.) Whereas HTM currently does it based on… training? Training of an internal SDR representation for each letter to predict each next letter?

So HTM currently is an internal SDR representation, and trained. Whereas this would be external SDR representation, and dynamic, by synchronization.

If that clarifies the thinking, it may identify some tweaks we need to make. For it to be useful to cluster dynamically, the clustering has to depend on the context. For words, dynamic clustering and training will come to the same thing, because words are fixed. The context of the letters in a word is always the same. So HTM training which captures the preceding letters of a word (if this is done) will be adequate, and the two methods achieve the same results. But the dynamic method needs to capture that context of other letters in a word too. It may be that even at this stage we need some more nodes for each letter, so that the path through the preceding letters can be captured, and the network can distinguish that the particular letter breaks at the ends of words are the ones with the greatest diversity of onward network connections.

If we do that, it may equate to what HTM is doing already. But it would place us in the position of having a framework to move forward, and to pool “phrases”, which is something HTM currently cannot do, because phrases are not fixed like the letters in a word, and so training, as a way to represent pooling breaks, is inadequate.

I kinda get it: 26! means mapping each 26-letter-long word to a node/neuron, is that so? Given this large number of nodes, the number of connections will explode, as in the good old situation.

Given “too small” a number of nodes, there can’t be “more” edges, only a merged number indicating the synapse/connection strength of a single edge. That’s destroying information. Then we come back to the old unsolved question again: how “many” nodes/neurons are “enough”? And obviously biological neurons don’t fully connect, so what are “the hierarchical principles” to organize them?

This is exactly the puzzle in my mind: instead of HTM training of mini-column structures, what is such an “external” representation?
