Chaos/reservoir computing and sequential cognitive models like HTM

Quoting Buzsáki again (from a recent talk with Joscha Bach, “Vectors of Cognitive AI: Self-Organization”, on YouTube): he suggested that we have millions of pre-existing neural sequences which are simply acquired by matching signals.
This might be a mechanism for few-shot learning and lowering the repetition requirement.

Practically, it isn’t feasible to pre-generate many (unknown) permutations on spec, but perhaps this is more akin to many pre-tuned oscillators/SDRs, allowing imperfect/partial matches.
Such permutation farms could themselves be genetically tuned, I guess, and be highly modal as yet another selector.

This is my mere speculation on his comments - probably better to ask him directly.

3 Likes

You’re equating oscillators to SDRs. Which may be quite reasonable. And no doubt you could pre-generate many.

But why do so, if the selection is the same as the generation?

Interesting panel talk, though. It also includes Christoph von der Malsburg (whom I’ve had something to do with through the Mindfire Swiss AI initiative). I’ll listen to that (2 hours?).

But just to get some ideas out.

I was thinking about going down to a letter representation level as I discussed with @complyue . It might be something to try. The logic would be exactly the same. But thinking first in terms of letters may clarify what we are looking for.

Thinking out loud here, so there may be logical errors. But in the spirit of encouraging early feedback.

If we imagine the letter raster plot in time, then we might imagine coding the input of a sentence (a prompt) like “Place wood blocks” with node spikes for the constituent letters as (letter vs. time of spike):

a       x
b                             x
c          x
d                          x
e             x
f
g
h
i
j
k                                      x
l    x                           x
m
n
o                   x   x           x
p x
q
r
s                                         x
t
u
v
w                x
x
y
z

Where I have taken time gaps of three spaces between each letter as it occurs in the phrase in time.

Those are the input, or “prompt” spikes. The actual network would also be coded like this. So you can think of these spikes as having edges between them in sequence. And for the whole network, representing many phrases, each node would have many, many, such edges.
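To make that concrete, here is a minimal sketch in code (the function names and the fixed gap of three are just my illustrative choices) of the same encoding: one node per letter, a spike time per occurrence, and an edge from each spike to the next in the sequence.

```python
# Minimal sketch: encode a phrase as (letter, time) spike pairs with a fixed
# gap of 3 between successive letters, and draw an edge from each spike to
# the next one in sequence. Accumulated over many phrases, each letter node
# collects many, many such edges.
from collections import defaultdict

def encode_phrase(phrase, gap=3):
    spikes = []                      # list of (letter, spike_time)
    t = 0
    for ch in phrase.lower():
        if ch.isalpha():
            spikes.append((ch, t))
            t += gap
    return spikes

def sequential_edges(spikes):
    # one edge between each pair of consecutive spikes
    return list(zip(spikes, spikes[1:]))

spikes = encode_phrase("Place wood blocks")
edges = sequential_edges(spikes)

edge_counts = defaultdict(int)       # letter-to-letter edge tallies
for (a, _), (b, _) in edges:
    edge_counts[(a, b)] += 1

print(spikes[:5])   # [('p', 0), ('l', 3), ('a', 6), ('c', 9), ('e', 12)]
print(edges[:2])    # [(('p', 0), ('l', 3)), (('l', 3), ('a', 6))]
```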

As input, the spikes for the letters making up the words are equally spaced. A bit like this:

P   l   a   c   e   w   o   o   d   b   l   o   c   k   s

The hypothesis is that sub-sequences of letters like “place”, “wood”, and “blocks” would tend to have more edges at their beginning and end, and thus under oscillation would tend to synchronize their constituent letter spikes, so the spike times might be pushed closer together, like this:

  P  l  a  c  e      w  o  o  d       b  l  o  c  k  s

Because there would be more edges to the rest of the network at the ends of words, rather than in the middle of words.

Thinking about it, this might be seen as very much like the way pooling was already done in HTM (and still is? @Bitking, @DanML?). Which is to say, based on the idea that predictions of the next letter are more diverse at word breaks. Assuming an HTM were coded to represent letters.

The difference might be that we would be identifying diversity of prediction at word breaks using spike synchronization under oscillation. Which would be a kind of “external” network representation for each letter (external in the sense of representing each letter by that letter’s external connectivity to other letters). Whereas HTM currently does it based on… training? Training of an internal SDR representation for each letter to predict each next letter?

So HTM currently is an internal SDR representation, and trained. Whereas this would be external SDR representation, and dynamic, by synchronization.

If that clarifies the thinking, it may identify some tweaks we need to make. For dynamic clustering to be useful, the clustering has to depend on the context. For words, dynamic clustering and training will come to the same thing, because words are fixed: the context of the letters in a word is always the same. So HTM training which captures the preceding letters of a word (if this is done) will be adequate, and the two methods achieve the same results. But the dynamic method needs to capture that context of the other letters in a word too. It may be that even at this stage we need some more nodes for each letter, so that the path through the preceding letters can be captured, and so that the network can distinguish that the letter breaks at the ends of words are the ones with the greatest diversity of onward network connections.

If we do that, it may equate to what HTM is doing already. But would place us in the position of having a framework to move forward, and pool “phrases”, which is something HTM currently cannot do, because phrases are not fixed like the letters in a word, and so training, as a way to represent pooling breaks, is inadequate.

1 Like

I kinda get that 26! means mapping each 26-letter-long word to a node/neuron, is that so? Given this large number of nodes, the number of connections will explode, as in the good old situation.

1 Like

Given a number of nodes that is “too small”, there can’t be “more” edges, only a single blended number indicating the synapse/connection strength of one edge. That’s destroying information. Then we come back to the old unsolved question again: how “many” nodes/neurons are “enough”? And obviously biological neurons don’t fully connect, so what are “the hierarchical principles” to organize them?

This is exactly the puzzle in my mind: instead of HTM training of mini-column structures, what is such an “external” representation?

1 Like

26! (26 “factorial”, 26 x 25 x 24 x 23 x … 3 x 2 x 1) is just the number of permutations of 26 elements, isn’t it? That’s what I recall. Permutations, so the order matters. So if you have 26 letters, there are 26 ways to choose the first, then 25 ways of choosing the second, and 24 the third … etc.

I think it’s just the raw factorial of the total number of unique elements when the order matters. It becomes a bit trickier when you ignore the order. Here’s the formula I always forget, for combinations of k objects taken from n:

\frac{n!}{k!(n-k)!}

Ripped from Wikipedia.

It’s a bunch, anyway. Obviously 26 letters is what we use to represent most languages now.
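Just to put numbers on it (a quick check with Python’s standard library; nothing here beyond the two standard formulas):

```python
# 26! orderings of the alphabet, and C(26, k) subsets when order is ignored.
import math

print(math.factorial(26))   # 403291461126605635584000000
print(math.comb(26, 5))     # 65780  ==  26! / (5! * 21!)
```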

As far as “mapping each 26-letter-long word to a node/neuron”… I think the reverse of that. More like mapping each word to (a subset of) a 26-letter (distributed) node! Or maybe I’m just understanding your sense wrong.

It would be easy to code text into a network with letters as the nodes, anyway. 26 letters in the alphabet. So any word, and in theory any sequence of words, can be plotted across a representation consisting of those letters.

It would be just like the graphical “raster plot” I sketched in my last message above.

So the phrase “Place wood blocks” could be plotted as spikes in a raster plot as below (and drawing edges between the spikes would give you a kind of representation of the network, only spread out over time):

a       x
b                             x
c          x
d                          x
e             x
f
g
h
i
j
k                                      x
l    x                           x
m
n
o                   x   x           x
p x
q
r
s                                         x
t
u
v
w                x
x
y
z
1 Like

You’re right. With just letters it will be fully connected almost immediately.

We need some way to represent that the ends of words will have a greater variety of predictive connections.

Perhaps that is just more nodes in the letter representation again.

With more nodes in the letter representation we can have paths over different subsets of those nodes representing the context back along the word.

And we can have a greater variety of those subsets going out to different first letters of subsequent words (or different subsets of those first letter representations, representing the different whole words it is a member of), representing that it is the end of a word, and the next word is less predictable.

2 Likes

I think it’s about “identity” of some concept, what sort of “a selection of items” deserves an identity?

In Functional-Programming, the best/simplest thing is an immutable value. E.g. the number 3 is well identified by its own numeric value; all 3s are identical regardless of where they come from – 1+2 or 7-4 or 39/13. Immutable strings likewise, e.g. concat("wo", "rld") == substr("hello world", 6). That extends to immutable sets/sequences of immutable values.

And mutable "variable"s (caution: not the mathematical concept of a variable) are particularly difficult to deal with semantically. To read/write a variable, you essentially manipulate a box that holds the value of the variable. Such "box"es have their own identities; there needs to be a “designer” to decide how the "box"es are shared in a context/scope, and every change of content inside a “box” affects subsequent reads from it. It is very burdensome to reason about what is going to happen when the "box"es can give different things out at different times.

Each neuron has its own identity, but what’s a word’s identity as encoded within our network? I’m afraid transformer-based LLMs don’t explicitly provide identities for concepts (or words). Are we doing that, or not?

The identity (within the network, i.e. the representation) of each letter in this raster plot is rather unclear to me now.

1 Like

I’m waiting for more input from you w.r.t. the simulation at this stage; I’m kinda stuck here trying to make progress on writing the computer program code.

1 Like

A word’s “identity”, as encoded within this network, is a substring which has a greater variety of connections to possible contexts.

It’s good we’re talking about words, because the identity of a word has a deeper body of practice going back a few years now. You can break a text into words by maximizing the entropy of the string. Which is the same thing as saying you find the bits which have the least predictable relationship with what can come next.
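As a rough sketch of that older practice (a toy branching-entropy segmenter; the context length, threshold, and tiny synthetic corpus here are arbitrary illustrative choices, not the established method itself):

```python
# Toy sketch of entropy-based word finding: learn, from an unspaced corpus,
# the entropy of the next letter after each short context, then cut the
# string wherever that entropy is high (i.e. the continuation is least
# predictable).
import math
from collections import defaultdict

def next_letter_entropy(corpus, context_len=2):
    follow = defaultdict(lambda: defaultdict(int))
    for i in range(len(corpus) - context_len):
        ctx = corpus[i:i + context_len]
        follow[ctx][corpus[i + context_len]] += 1
    entropy = {}
    for ctx, counts in follow.items():
        total = sum(counts.values())
        entropy[ctx] = -sum((c / total) * math.log2(c / total)
                            for c in counts.values())
    return entropy

def segment(text, entropy, context_len=2, threshold=0.5):
    pieces, start = [], 0
    for p in range(context_len, len(text)):
        ctx = text[p - context_len:p]          # letters just before position p
        if entropy.get(ctx, 0.0) > threshold:  # unpredictable continuation,
            pieces.append(text[start:p])       # so cut here
            start = p
    pieces.append(text[start:])
    return pieces

corpus = "placewoodblocks" * 50 + "placeblockswood" * 50
ent = next_letter_entropy(corpus)
print(segment("placewoodblocks", ent))   # ['place', 'wood', 'blocks'] on this toy corpus
```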

We want to do it dynamically though. We want to find these maximum entropy = minimally constrained, maximally connected, elements.

I think that’s right. You can frame the argument either way: maximum energy or minimal energy.

That’s what “learning” algorithms use too. They also seek these maximum/minimum energy relationships between elements and their surroundings. It is just they assume the forms are static, and can be found by gradient descent on static energy surfaces.

For words, they are right. Words are static. So you can find words by any number of such “learning” algorithms.

I think that while words are static, the structures above words, phrases, do not have static abstractions. So I want to use a dynamic method of identifying a minimal/maximal energy relationship of an element with its surroundings. And seeing which elements synchronize their oscillations in an oscillating network should be a good way to do that.

Other people might do it too if they thought the forms they are seeking were not static. But they assume they are static. That’s the big difference. I do not assume the forms are static. I think they are chaotic. So must be found dynamically.

But the simple answer to your question: the “identity” of a word is an element which has a minimal/maximal energy relationship with its surroundings. In this case, found by seeing which sub-sequences synchronize their oscillations in relative isolation from their surroundings (maximal relative self-connection).
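A toy sketch of that, using Kuramoto-style phase oscillators rather than real spiking dynamics (the coupling constant, frequencies, and the two-cluster toy graph are all just illustrative assumptions): oscillators inside a densely self-connected cluster pull each other into phase, while a second cluster joined by only one weak link, and with a different natural rhythm, stays out of step.

```python
# Toy sketch, not the actual proposal: Kuramoto-style phase oscillators
# coupled along a graph's edges. Each densely self-connected cluster locks
# its phases internally; the single bridging edge is too weak to lock the
# two clusters to each other.
import math
import random

def run(n, edges, freqs, steps=2000, k=0.5, dt=0.05, seed=1):
    rng = random.Random(seed)
    phase = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    nbrs = [[] for _ in range(n)]
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for _ in range(steps):
        pull = [sum(math.sin(phase[j] - phase[i]) for j in nbrs[i])
                for i in range(n)]
        phase = [(phase[i] + dt * (freqs[i] + k * pull[i])) % (2 * math.pi)
                 for i in range(n)]
    return phase

# Two fully connected 5-node clusters, bridged by a single edge.
cluster_a = [(i, j) for i in range(5) for j in range(i + 1, 5)]
cluster_b = [(i, j) for i in range(5, 10) for j in range(i + 1, 10)]
edges = cluster_a + cluster_b + [(4, 5)]
freqs = [1.0] * 5 + [2.0] * 5          # the two clusters "want" different rhythms

print([round(p, 2) for p in run(10, edges, freqs)])
# Nodes 0-4 settle to (nearly) one common phase and nodes 5-9 to another,
# while the two groups keep drifting relative to each other.
```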

Our sense of “identity” for concepts will be the same as LLMs. LLMs work by maximizing prediction. It’s the same thing. They identify the way of structuring a text which maximizes prediction. The structural elements they find within that will be these same maximally self connected entities, like words.

The only difference will be that I think these structures (above words) are dynamic, not static.

Otherwise I think everything I’m proposing matches what LLMs are doing. It’s all about maximizing/minimizing prediction within a sequence.

2 Likes

From the computer’s (or programmer’s) perspective, all I get is neurons, synapses/connections, and spikes; I need an algorithm to identify such “elements” for “words” out of the simulated data and phenomena. That’s how the “naming” is done, it seems.

I don’t think you mean to enumerate all possible sub-sequences, but we should generate them somehow, right? I need the generative algorithm then, and I have no clue how such a generation process can obtain the “maximal” property.

2 Likes

Maybe we should do a call, and I’ll talk you through the diagrams I posted earlier.

For some things text just doesn’t suffice!

If you know how to PM me on this platform, send me a PM and let’s try to increase the bandwidth.

1 Like

Hi, I am curious as to why you recommend this book. Jerry (G. Edelman) changed the way I think about things, but I don’t really see the connection. To be fair, I have only made it through the first part of the book so far. I need to read the intermission notes as well as part 2.

1 Like

Which Edelman book are you recommending? I looked back in this thread and didn’t see a specific book.

2 Likes

I don’t know who you asked, but basically he has three books on the topic:

  • Neural Darwinism: The Theory of Neuronal Group Selection
  • Topobiology: An Introduction to Molecular Embryology
  • The Remembered Present: A Biological Theory of Consciousness

All his later books are just variations on the same subject, trying to appeal to the non-technical layman.

He also has a great intro to the theory in a book with Mountcastle:

  • The Mindful Brain: Cortical Organization and the Group-Selective Theory of Higher Brain Function
3 Likes

This book is the basis for my hex-grid post.

1 Like

Edelman’s Neural Darwinism came up in this thread because @DanML responded to my arguments - that a failure to recognize contradictory, even chaotic, structure is what has been holding us back in AI - by mentioning similar themes in György Buzsáki’s work:

I looked up György Buzsáki and said his work reminded me of Gerald Edelman’s Neural Darwinism.

@Bitking then recommended Calvin’s book on another variant of Neural Darwinism, with many of the same themes of expanding, growing, changing, structure, even chaos, that had been the basis of things he had been working on for HTM.

Here’s the post in this thread where @Bitking recommended the book @DrMittlemilk is asking about. It wasn’t initially a recommendation of a book by Edelman, but Calvin:

So basically Neural Darwinism came up in this thread because I am arguing that contradictory structure may be what we are missing in our attempts to code “meaning” in cognition. And that led to some links to similar ideas, notably growing or contradictory, even chaotic, representation, in Neural Darwinism.

I agree with the themes of growing structure, and especially chaos, in Neural Darwinism. But I disagree with the specific mechanism it proposes for finding new structure.

I’m arguing in this thread that we don’t need the random variation followed by evolutionary selection of Neural Darwinism. Rather, I think existing “learning” methods, specifically the cause-and-effect generalization of transformers and Large Language Models, are already telling us the structuring mechanism we need. The only thing holding us, and transformers, back is that we still assume the structure to be found is static and can be learned.

I’m arguing that the blow-out in the number of parameters “learned” by transformers is actually an indication that they are already generating growing, contradictory, and even chaotic, structure. So the only mistake is that we are trying to “learn” all of it at once. Rather, we should be focusing on its chaotic generator character, and generating the structure which is appropriate to each new prompt/context, at run time.

And I’m suggesting we can do that by seeking cause-effect, prediction-maximizing structure in “prompt”-specific resonances in a network of sequences, initially language sequences. This is in contrast to transformers, which look for these cause-effect, prediction-maximizing structures by trying to follow prediction entropy gradients to static minima, and then only select between the potentially contradictory structures learned, using a prompt, at run time.

1 Like

Maybe it is ironic (pity me at least): my verbal English (both hearing and speaking) is terribly poor :frowning:, so it seems we’ll have to test the limits of text-based communication. Anyway, we are finding ways to tell a machine to do such things effectively, so doing it ourselves (i.e. dogfooding) may help somehow.

2 Likes

Well, we could try in my Chinese. Though I would be very surprised if that provided a clearer channel!

To attempt in text… Your first question:

I’m hypothesizing “words” will be collections of letter spikes which tend to synchronize in a raster plot. As sketched in this post:

You can imagine the "x"s in the sketched raster plot above being pushed together, synchronized, in the same way.

So, to answer your question, I’m saying that might be how we could ‘identify such “elements” for “words”, out of simulated data and phenomena.’

Why would they synchronize? You made the good point that with a single node representation the letter network would immediately become completely saturated with links. It’s brought me back to the idea that we will need a distributed representation, an SDR, even for letters. I think an SDR representation for letters can mean that an entire path leading to a letter can be coded as a subset of the whole letter SDR. The same path can even be encoded in multiple such subsets, effectively coding for repetition. And the same variety of subsets can differentiate letter connections generally, so our network is not immediately saturated, and we can differentially find resonances among more tightly connected clusters associated with words.
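One possible sketch of that (the pool size, subset size, and the hashing trick are all just assumptions for illustration): give each letter a pool of SDR bits, and pick which small subset of that pool fires from a hash of the preceding letters, so the same letter reached by different paths activates different bits, and the shared pool is not immediately saturated.

```python
# Sketch only: each letter owns a pool of SDR bits; which small subset fires
# is chosen deterministically from the preceding letters (the "path"), so
# 'o' in "wood" and 'o' in "blocks" light up different bits of the same pool.
import hashlib
import random

POOL_BITS = 256        # bits reserved per letter
ACTIVE_BITS = 8        # bits active for one occurrence

def letter_sdr(letter, context):
    """Active bit indices for `letter` given the preceding-letter context."""
    seed = hashlib.sha256((letter + "|" + context).encode()).hexdigest()
    rng = random.Random(seed)
    return frozenset(rng.sample(range(POOL_BITS), ACTIVE_BITS))

def encode_word(word, context_len=3):
    sdrs = []
    for i, ch in enumerate(word):
        context = word[max(0, i - context_len):i]   # the path leading here
        sdrs.append((ch, letter_sdr(ch, context)))
    return sdrs

wood = encode_word("wood")
blocks = encode_word("blocks")

# 'o' appears in both words, but with different preceding paths, so its
# active subsets barely overlap.
o_in_wood = wood[1][1]
o_in_blocks = blocks[2][1]
print(len(o_in_wood & o_in_blocks), "bits shared out of", ACTIVE_BITS)
```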

In this I’m reminded of earlier work coding longer sequences in an SDR explored together with @floybix back in 2014. I’ve posted that in a separate thread. Perhaps Felix’s code can be a basis for our letter SDR representation now:

Your second question:

To find which sub-sequences synchronize their oscillations we don’t need to enumerate them all. We just need to set the network oscillating, perhaps with a specific driving “prompt” activation, and see which elements synchronize. The network performs the search, in parallel, for us, as it is seeking its own minimum energy configuration.
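And reading off which elements synchronized could itself be mechanical: just cluster the spike times in the resulting raster by their gaps. A sketch (the gap threshold and the hand-made post-synchronization spike times are illustrative only):

```python
# Sketch: given (letter, spike_time) pairs read off a raster plot, group
# spikes whose times have been pulled within `max_gap` of each other into
# candidate "word" clusters.
def spike_clusters(spikes, max_gap=1.5):
    ordered = sorted(spikes, key=lambda s: s[1])
    clusters, current = [], [ordered[0]]
    for letter, t in ordered[1:]:
        if t - current[-1][1] <= max_gap:
            current.append((letter, t))
        else:
            clusters.append(current)
            current = [(letter, t)]
    clusters.append(current)
    return clusters

# Spike times after the hypothesised synchronization has squeezed the letters
# of each word together and widened the gaps between words.
spikes = [('p', 0.0), ('l', 1.0), ('a', 2.0), ('c', 3.0), ('e', 4.0),
          ('w', 8.0), ('o', 9.0), ('o', 10.0), ('d', 11.0),
          ('b', 15.0), ('l', 16.0), ('o', 17.0), ('c', 18.0),
          ('k', 19.0), ('s', 20.0)]
for cluster in spike_clusters(spikes):
    print("".join(letter for letter, _ in cluster))   # place / wood / blocks
```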

2 Likes

I think this is still the bit that I find the most confusing.
Let me read back a high-level outline of what I think you are proposing:

  1. Build an SSN (LIF-like) network for some input data (phrase/segment)
  2. Tune the net so it is just ‘sub-critical’ (from a feedback perspective?)
  3. Do some WTA/thresholding to decide the most active nodes
  4. Go to 1 with the next segment

Does the input to 1 require enough nodes for the whole vocab (all classes) or is it adaptive somehow?
What happens on the 2nd pass? Are we using a new or modified net for 1? Same weights for 2? What gets fed back from output 3?

Please let me know if this is close to your outline. Or change it to fit.

2 Likes

SSN = Spiking Sequence Network??

A network of sequences of some kind, yes. Maybe good to start with SDRs for letters in text, without any spaces, and try to find “words”.

“Sub-critical”?

So that it oscillates, yes. Sub-critical in the sense of not quite runaway activation or inhibition.

WTA?

I don’t think we need to examine the activity of nodes. I think we can just look at clusters of spike times in a raster plot.

Not sure what you mean by “segment”. Segment of text?

The network would be for a large sample of text. A corpus. As big as possible. Ideally the whole language. Then processing would submit phrases or sentences, rather like “prompts” in transformer terms, to be structured. Or to generate a prediction like transformers. So if by “segment” you mean go on to the next “prompt”, then yes.

2 Likes