Chaos/reservoir computing and sequential cognitive models like HTM

26! (26 “factorial”, 26 x 25 x 24 x 23 x … 3 x 2 x 1) is just the number of permutations of 26 elements, isn’t it? That’s what I recall. Permutations, so the order matters. So if you have 26 letters, there are 26 ways to choose the first, then 25 ways of choosing the second, 24 the third … etc.

I think it’s just the raw factorial of the total number of unique elements for permutations. It becomes a bit trickier when you ignore order. Here’s the formula I always forget for combinations of k objects taken from n:

Ripped from Wikipedia:

\binom{n}{k} = \frac{n!}{k!\,(n-k)!}
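
To make that concrete (my own quick check of the numbers): choosing k = 3 letters out of n = 26 without regard to order gives

\binom{26}{3} = \frac{26!}{3!\,23!} = \frac{26 \cdot 25 \cdot 24}{6} = 2600

while the full 26! is about 4.03 x 10^26 orderings.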

It’s a lot, anyway. And of course those 26 letters are what we use to write English and many other languages.

As far as “mapping each 26-letter-long word to a node/neuron” goes… I think the reverse of that. More like mapping each word to a (subset of a) 26-letter (distributed) node! Or maybe I’m just misreading your meaning.

It would be easy to code text into a network with letters as the nodes, anyway. 26 letters in the alphabet. So any word, and in theory any sequence of words, can be plotted across a representation consisting of those letters.

It would be just like the graphical “raster plot” I sketched in my last message above.

So the phrase “Place wood blocks” could be plotted as spikes in a raster plot as below (and drawing edges between the spikes would give you a kind of representation of the network, only spread out over time):

a       x
b                             x
c          x
d                          x
e             x
f
g
h
i
j
k                                      x
l    x                           x
m
n
o                   x   x           x
p x
q
r
s                                         x
t
u
v
w                x
x
y
z
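
For what it’s worth, here is a minimal Python sketch (my own toy code, just to make the encoding concrete) that turns a phrase into letter-row spikes and prints a raster like the one above:

    # Toy sketch: print a phrase as a letter-vs-time raster plot.
    # Each row is a letter; an 'x' marks the time step at which it fires.
    import string

    def raster(phrase):
        spikes = [c for c in phrase.lower() if c in string.ascii_lowercase]
        lines = []
        for letter in string.ascii_lowercase:
            row = "".join("x" if s == letter else " " for s in spikes)
            lines.append(letter + " " + row)
        return "\n".join(lines)

    print(raster("Place wood blocks"))

Drawing edges between successive "x"s would then give the time-spread network representation described above.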
1 Like

You’re right. With just letters it will be fully connected almost immediately.

We need some way to represent that the ends of words will have a greater variety of predictive connections.

Perhaps that is just more nodes in the letter representation again.

With more nodes in the letter representation we can have paths over different subsets of those nodes representing the context back along the word.

And we can have a greater variety of those subsets going out to different first letters of subsequent words (or to different subsets of those first-letter representations, representing the different whole words each is a member of), capturing the fact that it is the end of a word, and that the next word is less predictable.

2 Likes

I think it’s about “identity” of some concept, what sort of “a selection of items” deserves an identity?

In Functional Programming, the best/simplest thing is an immutable value. E.g. the number 3 is fully identified by its numeric value: all 3s are identical regardless of where they come from – 1+2 or 7-4 or 39/13. Immutable strings likewise, e.g. concat("wo", "rld") == substr("hello world", 6). That extends to immutable sets/sequences of immutable values.

And mutable "variables" (caution: not the mathematical concept of a variable) are particularly difficult to deal with semantically. To read or write a variable, you essentially manipulate a box that holds the variable's value. Such "boxes" have their own identities, there needs to be a "designer" to decide how the "boxes" are shared across context/scope, and every change of content inside a "box" affects subsequent reads from it. It is very burdensome to reason about what is going to happen when the "boxes" can give different things out at different times.
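
In Python terms (a toy illustration of the same point, my own example):

    # Immutable values are identified purely by their content...
    print(1 + 2 == 7 - 4 == 39 // 13)          # True: every "3" is the same 3
    print("wo" + "rld" == "hello world"[6:])   # True: same immutable string

    # ...whereas a mutable "box" has its own identity, and what you read
    # from it depends on the history of writes to that particular box.
    box = [3]
    alias = box       # two names, one box
    alias[0] = 4
    print(box[0])     # 4 - the earlier content is gone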

Each neuron has its own identity, but what’s a word’s identity as encoded within our network? I’m afraid transformer-based LLMs don’t explicitly provide identities for concepts (or words). Are we doing that or not?

The identity (within the network, i.e. the representation) of each letter in this raster plot is rather unclear to me now.

1 Like

I’m waiting for more input from you w.r.t. the simulation at this stage; I’m kind of stuck here and can’t make progress writing the computer program code.

1 Like

A word’s “identity”, as encoded within this network, is a substring which has a greater variety of connections to possible contexts.

It’s good we’re talking about words, because the identity of a word has a deeper body of practice behind it, going back a few years now. You can break a text into words by maximizing the entropy of the string, which is the same thing as saying you find the bits which have the least predictable relationship with what can come next.
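
As a rough illustration of that entropy idea (a sketch with a toy corpus, not the dynamic method proposed below): estimate how unpredictable the next character is after each context, and treat the peaks as likely word boundaries.

    # Sketch: entropy-based word boundary detection on a toy corpus.
    # A boundary is likely where the next character is least predictable
    # from the preceding context (high conditional entropy).
    from collections import defaultdict
    from math import log2

    def train_counts(corpus, order=2):
        counts = defaultdict(lambda: defaultdict(int))
        for i in range(len(corpus) - order):
            ctx, nxt = corpus[i:i + order], corpus[i + order]
            counts[ctx][nxt] += 1
        return counts

    def entropy(dist):
        total = sum(dist.values())
        return -sum(c / total * log2(c / total) for c in dist.values())

    def boundary_scores(text, counts, order=2):
        # Higher score = more uncertainty about what follows = likelier word end.
        scores = []
        for i in range(order, len(text)):
            ctx = counts.get(text[i - order:i], {})
            scores.append(entropy(ctx) if ctx else 0.0)
        return scores

    corpus = "placewoodblocks" * 5 + "placethewoodblocks" * 5
    counts = train_counts(corpus)
    print(boundary_scores("placewoodblocks", counts))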

We want to do it dynamically though. We want to find these maximum entropy = minimally constrained, maximally connected, elements.

I think that’s right. You can frame the argument either way: maximum energy or minimal energy.

That’s what “learning” algorithms use too. They also seek these maximum/minimum energy relationships between elements and their surroundings. It is just they assume the forms are static, and can be found by gradient descent on static energy surfaces.

For words, they are right. Words are static. So you can find words by any number of such “learning” algorithms.

I think that while words are static, the structures above words, phrases, do not have static abstractions. So I want to use a dynamic method of identifying a minimal/maximal energy relationship of an element with its surroundings. And seeing which elements synchronize their oscillations in an oscillating network should be a good way to do that.

Other people might do it too if they thought the forms they are seeking were not static. But they assume they are static. That’s the big difference. I do not assume the forms are static. I think they are chaotic. So must be found dynamically.

But the simple answer to your question: the “identity” of a word is an element which has a minimal/maximal energy relationship with its surroundings. In this case it is found by seeing which sub-sequences synchronize their oscillations in relative isolation from (maximal relative self-connection compared to) their surroundings.

Our sense of “identity” for concepts will be the same as LLMs’. LLMs work by maximizing prediction. It’s the same thing. They identify the way of structuring a text which maximizes prediction. The structural elements they find within that will be these same maximally self-connected entities, like words.

The only difference will be that I think these structures (above words) are dynamic, not static.

Otherwise I think everything I’m proposing matches what LLMs are doing. It’s all about maximizing/minimizing prediction within a sequence.

2 Likes

From the computer’s (or programmer’s) perspective, all I get is neurons, synapses/connections, and spikes. I need an algorithm to identify such “elements” for “words” out of the simulated data and phenomena. That seems to be how the “naming” would be done.

I don’t think you mean to enumerate all possible sub-sequences, but we do have to generate them somehow, right? I need the generative algorithm then, and I have no clue how such a generation process can obtain the “maximal” property.

2 Likes

Maybe we should do a call, and I’ll talk you through the diagrams I posted earlier.

For some things text just doesn’t suffice!

If you know how to PM me on this platform, send me a PM and let’s try to increase the bandwidth.

1 Like

Hi, I am curious as to why you recommend this book. Jerry (G. Edelman) changed the way I think about things, but I don’t really see the connection. To be fair, I have only made it through the first part of the book so far. I still need to read the intermission notes as well as part 2.

1 Like

Which Edelman book are you recommending? I looked back in this thread and didn’t see a specific book.

2 Likes

I don’t know who you asked, but basically he has three books on the topic:

  • Neural Darwinism: The Theory of Neuronal Group Selection
  • Topobiology: An Introduction to Molecular Embryology
  • The Remembered Present: A Biological Theory of Consciousness

All of his later books are just variations on the same subject, aimed at appealing to the non-technical layman.

He also has a great intro to the theory in a book with Mountcastle:

  • The Mindful Brain: Cortical Organization and the Group-Selective Theory of Higher Brain Function
3 Likes

This book is the basis for my hex-grid post.

1 Like

Edelman’s Neural Darwinism came up in this thread because @DanML responded to my arguments - that a failure to recognize contradictory, even chaotic, structure is what has been holding us back in AI - by mentioning similar themes in György Buzsáki’s work:

I looked up György Buzsáki and said his work reminded me of Gerald Edelman’s Neural Darwinism.

@Bitking then recommended Calvin’s book on another variant of Neural Darwinism, with many of the same themes of expanding, growing, changing, structure, even chaos, that had been the basis of things he had been working on for HTM.

Here’s the post in this thread where @Bitking recommended the book @DrMittlemilk is asking about. It wasn’t initially a recommendation of a book by Edelman, but Calvin:

So basically Neural Darwinism came up in this thread because I am arguing that contradictory structure may be what we are missing in our attempts to code “meaning” in cognition. And that led to some links to similar ideas, notably growing or contradictory, even chaotic, representation, in Neural Darwinism.

I agree with the themes of growing structure, and especially chaos, in Neural Darwinism. But I disagree with the specific mechanism it proposes for finding new structure.

I’m arguing in this thread that we don’t need the random variation followed by evolutionary selection of Neural Darwinism. Rather, I think existing “learning” methods, specifically the cause-and-effect generalization of transformers and Large Language Models, are already telling us the structuring mechanism we need. The only thing holding us, and transformers, back is that we still assume the structure to be found is static and can be learned.

I’m arguing that the blow-out in the number of parameters “learned” by transformers actually indicates that they are already generating growing, contradictory, even chaotic, structure. So the only mistake is that we are trying to “learn” all of it at once. Rather, we should be focusing on its character as a chaotic generator, and generating the structure appropriate to each new prompt/context at run time.

And I’m suggesting we can do that by seeking cause-effect prediction-maximizing structure in “prompt”-specific resonances in a network of sequences, initially language sequences. This is in contrast to transformers, which look for these cause-effect prediction-maximizing structures by trying to follow prediction entropy gradients to static minima, and then only select between the potentially contradictory structures learned, using a prompt, at run time.

1 Like

Maybe it is ironic (pity me, at least): my spoken English (both listening and speaking) is terribly poor :frowning: , so it seems we’ll have to test the limits of text-based communication. Anyway, we are trying to find ways to tell a machine to do such things effectively, so doing it ourselves (i.e. dogfooding) may help somehow.

2 Likes

Well, we could try in my Chinese. Though I would be very surprised if that provided a clearer channel!

To attempt in text… Your first question:

I’m hypothesizing “words” will be collections of letter spikes which tend to synchronize in a raster plot. As sketched in this post:

You can imagine the "x"s in the sketched raster plot above being pushed together, synchronized, in the same way.

So, to answer your question, I’m saying that might be how we could ‘identify such “elements” for “words”, out of simulated data and phenomena.’

Why would they synchronize? You made the good point that with a single node representation the letter network would immediately become completely saturated with links. It’s brought me back to the idea that we will need a distributed representation, an SDR, even for letters. I think an SDR representation for letters can mean that an entire path leading to a letter can be coded as a subset of the whole letter SDR. The same path can even be encoded in multiple such subsets, effectively coding for repetition. And the same variety of subsets can differentiate letter connections generally, so our network is not immediately saturated, and we can differentially find resonances among more tightly connected clusters associated with words.
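
To make that concrete, here is a toy sketch of what I mean (my own illustration, not HTM or anyone else's code): each letter owns a pool of SDR bits, and the path leading into the letter selects which small subset of that pool becomes active, so the same letter reached by different contexts lights up different, partially overlapping bits.

    # Toy sketch: each letter owns a pool of SDR bits; the path (preceding
    # context) selects which subset of that pool fires, so the same letter
    # in different contexts activates different, partially overlapping bits.
    import hashlib
    import string

    POOL_SIZE = 64    # bits reserved per letter (arbitrary for the sketch)
    ACTIVE_BITS = 8   # bits made active per occurrence

    def letter_sdr(letter, context):
        base = string.ascii_lowercase.index(letter) * POOL_SIZE
        digest = hashlib.sha256((context + letter).encode()).digest()
        # Use the hash of the path to pick positions inside this letter's pool.
        return {base + digest[i] % POOL_SIZE for i in range(ACTIVE_BITS)}

    # Same letter 'o' reached along different paths -> different subsets of
    # the same 64-bit pool, so the links do not all collapse onto one node.
    print(sorted(letter_sdr("o", "placewo")))
    print(sorted(letter_sdr("o", "blo")))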

In this I’m reminded of earlier work coding longer sequences in an SDR explored together with @floybix back in 2014. I’ve posted that in a separate thread. Perhaps Felix’s code can be a basis for our letter SDR representation now:

Your second question:

To find which sub-sequences synchronize their oscillations we don’t need to enumerate them all. We just need to set the network oscillating, perhaps with a specific driving “prompt” activation, and see which elements synchronize. The network performs the search, in parallel, for us, as it is seeking its own minimum energy configuration.
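
As a sketch of what “setting the network oscillating” might look like in code (with made-up coupling weights, not our actual letter network): Kuramoto-style phase oscillators coupled by connection strength tend to phase-lock within tightly connected clusters, and those locked clusters would be the candidates for “words”.

    # Sketch: Kuramoto-style phase oscillators coupled by connection weights.
    # Tightly interconnected clusters phase-lock; weakly coupled ones do not.
    # The weights here are made up purely to show the effect.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 6
    W = np.full((N, N), 0.05)          # weak background coupling
    W[:3, :3] = W[3:, 3:] = 1.0        # two tightly connected clusters
    np.fill_diagonal(W, 0.0)

    theta = rng.uniform(0, 2 * np.pi, N)   # initial phases
    omega = rng.normal(1.0, 0.05, N)       # natural frequencies
    dt, K = 0.01, 0.5

    for _ in range(5000):
        # d(theta_i)/dt = omega_i + K * sum_j W_ij * sin(theta_j - theta_i)
        coupling = (W * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
        theta = theta + dt * (omega + K * coupling)

    # Nodes whose final phases have converged are the "synchronized" clusters.
    print(np.round(np.mod(theta, 2 * np.pi), 2))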

2 Likes

I think this is still the bit that I find the most confusing.
Let me read back a high-level outline of what I think you are proposing:

  1. Build an SSN (LIF-like) network for some input data (phrase/segment)
  2. Tune the net so it is just ‘sub-critical’ (from a feedback perspective?)
  3. Do some WTA/thresholding to decide the most active nodes
  4. Go to 1 with the next segment

Does the input to 1 require enough nodes for the whole vocab (all classes) or is it adaptive somehow?
What happens on the 2nd pass? Are we using a new or modified net for 1? Same weights for 2? What gets fed back from output 3?

Please let me know if this is close to your outline. Or change it to fit.

2 Likes

SSN = Spiking Sequence Network??

A network of sequences of some kind, yes. Maybe good to start with SDRs for letters in text, without any spaces, and try to find “words”.

“Sub-critical”?

So that it oscillates, yes. Sub-critical in the sense of not quite runaway activation or inhibition.

WTA?

I don’t think we need to examine the activity of nodes. I think we can just look at clusters of spike times in a raster plot.

Not sure what you mean by “segment”. Segment of text?

The network would be for a large sample of text. A corpus. As big as possible. Ideally the whole language. Then processing would submit phrases or sentences, rather like “prompts” in transformer terms, to be structured. Or to generate a prediction like transformers. So if by “segment” you mean go on to the next “prompt”, then yes.

2 Likes

Thanks for responding. Let me elaborate slightly so we have the same understanding.

My typo - SNN (Spiking Neural Network).

Yes, you are spot on.

Winner Takes All (top 1) or k Winner Takes All (top k).
Who/what is looking at ‘activity’? We need an algorithm here, yes?

Correct. All LLMs use tokens which sometimes equate to words or partial words (but never letters IMHO).

Sorry for the translation errors but perhaps we can look at the main questions again.

Does this represent the totality of the ‘process’? (this is still very high level)
And some of the interplay between steps:

2 Likes

Perhaps we need to distinguish between training and application?
You start training on some elements - this cannot be the whole language. It could be a sentence/paragraph (and in theory a page/book/library, but not in practice). What is the training ‘unit’?

2 Likes

The “algorithm” is to find prediction energy… minima. In that sense it is the same as transformers/LLMs. The prediction energy minima LLMs seek are those on energy surfaces formed by collections of elements, notably back along the sequence, with “attention”. They may also seek energy minima on energy surfaces formed by collections of elements representing the current state. (I need to catch up with the balance there. RNNs/LSTMs would be ALL current state, with just a memory state for the prior sequence. I don’t know how much “Attention Is All You Need” threw out collecting elements across the current state. Maybe you know.)

These energy surfaces will represent different collections of elements in the current state (and/or “attention” state), and the minimum energy will be for the collection which best predicts the next state. That’s how they learn to predict. They calculate an energy surface for each collection and follow the slope of that energy surface, adjusting the collection of elements, until they find the minimum prediction energy collection; that minimum energy collection of elements is their network.
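
As a generic sketch of that “follow the slope of the energy surface” picture (a toy bigram model, nothing like a real transformer):

    # Generic sketch of following a prediction "energy" gradient: gradient
    # descent on next-symbol cross-entropy for a toy bigram model. This is
    # the static-minimum picture, not the dynamic search discussed below.
    import numpy as np

    rng = np.random.default_rng(0)
    V = 5                                   # toy vocabulary size
    data = rng.integers(0, V, size=200)     # toy "corpus" of symbol ids
    W = np.zeros((V, V))                    # logits: next symbol given current

    def loss_and_grad(W):
        total, grad = 0.0, np.zeros_like(W)
        for a, b in zip(data, data[1:]):
            p = np.exp(W[a]) / np.exp(W[a]).sum()   # predicted distribution
            total -= np.log(p[b])
            g = p.copy()
            g[b] -= 1.0                             # d(cross-entropy)/d(logits)
            grad[a] += g
        n = len(data) - 1
        return total / n, grad / n

    for _ in range(200):
        loss, g = loss_and_grad(W)
        W -= 1.0 * g                                # slide down the surface
    print("final mean prediction 'energy':", round(loss, 3))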

My “algorithm” is the same. It’s seeking minimum energy prediction collections across a network. The main difference is that I don’t think the minimum energy predictive state is static across the whole data set. I think there are different best minimum energy predictive states depending on what you are most interested in. In particular I believe that minimum energy predictive state will vary from prompt to prompt. (I think this is happening in transformers too. But we don’t see it. They hide it from us in their enormous context sensitive “parameter” sets.)

So I want to find the minimum energy predictive state dynamically. And a way to do that should be to set the network oscillating and allow it to find its own minimum energy states, as resonances on the network.

The state-of-the-art would do this too. But they assume the surfaces to be found are static. So it probably never occurs to them to seek them dynamically. A lot of extra work, right! Best to do it once and be done with it!

(Also, I guess this would not have been done in neural network history, firstly because they didn’t know what to train for! For supervised “learning”, first you “supervise” to a set of ideals. Just letting the network find its own energy minima doesn’t make sense because you don’t know what “meaning” is. You only have a bunch of examples you consider “meaningful”. So the only thing you can think of is to tie your system to those. Allowing the network to find its own minima only makes sense if the energy minima goal is more general. This is “unsupervised learning”, and it’s never been obvious what energy to minimize for this. What “objective function” to seek. It just so happens that for language the goal is obvious. It’s just prediction along a sequence. I would argue that this simplicity of language is revealing a deeper truth about the relevant parameter of “meaning” for cognition, that the deeper truth about “meaning” in cognition is also cause-effect prediction. And that deeper truth is the reason LLMs seem to be capturing so much of general cognition. So language helped transformers stumble onto a deeper truth of cognition. But even when looking at language, everybody assumes the “truth” will be static. Nobody imagines the “truth” of minimum energy prediction will be multiple, and flip from one structure to another, chaotically. Hence the current LLM state of the art also only seeks static energy prediction surfaces.)

Perhaps the distinction between training and application can become confusing with this change. I guess LLMs don’t have any intermediate stage where the network only represents the language as observed. From the get-go their networks are formed into collections representing the prediction energy surface of the data. “Training” incorporates both the addition of data AND the minimum prediction energy search across that data.

In a system which seeks minimum prediction energy collections dynamically, the minimum prediction energy collection search is postponed to run time. “Training” is only the addition of data.

So the sense of “training” and “application” will be different between the two.

Exactly what “training” will require for the dynamic option, I’m not sure. Naively you just add a synapse for each time step of the sequence in the data. But as has been pointed out in this thread, to capture longer-distance relationships we will need to represent each time step with an SDR. The way that’s done now in HTM, now that my memory has been refreshed, there is a “training” step just to encode the sequence: each element of the SDR at each time step needs to be given a “map” of all the elements of the SDR at the preceding time step.
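
A toy sketch of that kind of step-to-step “map” (loosely in the spirit of HTM temporal memory, not real HTM code): each active bit at time t records which bits were active at t-1, so later input can be checked against the contexts in which it was seen.

    # Toy step-to-step "map": each active SDR bit at time t records the bits
    # that were active at t-1 (loosely HTM-temporal-memory flavoured).
    from collections import defaultdict

    transition_map = defaultdict(set)   # bit active at t -> bits seen at t-1

    def train(sdr_sequence):
        for prev, curr in zip(sdr_sequence, sdr_sequence[1:]):
            for bit in curr:
                transition_map[bit].update(prev)

    def predicted(prev_sdr, candidate_bit, min_overlap=2):
        # A bit counts as "predicted" if enough of the previous SDR matches
        # the contexts it has been trained in.
        return len(transition_map[candidate_bit] & prev_sdr) >= min_overlap

    train([{1, 5, 9}, {2, 6, 10}, {3, 7, 11}])
    print(predicted({1, 5, 9}, 2), predicted({1, 5, 9}, 3))   # True False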

Such a time step to time step “map” between states may be necessary when defining “states” as dynamic minimum energy prediction collections too. I’m not sure. It’s possible the “map” will also be found dynamically as part of the whole minimum energy prediction collection thing. Even what should be considered a “state”, or even a “time step”, becomes part of the same search. A “time step” in the cognitive sense becomes just one or other partition of spikes in the raster plot. It may be sufficient to just “train” by adding synapses across random subsamples of SDRs for raw sequential sensory observations (and those sequences of sensory observations can actually be asynchronous, arriving just as the data comes from the sense organs. No central clock…)

2 Likes

Thanks. I think you’ve outlined the concept well but, without the ability to break it into any smaller chunks, there is no effective way even to do EDA (Exploratory Data Analysis), which precludes writing any algorithms/code.

I think there is a dichotomy between using HTM (synchronous queues) and a resonating network capable of ‘relaxation’ into the key states you describe (on multiple time and size scales).

If you want to look at using HTM with SNNs (Spiking Neural Nets), then there are a few references, and I’m sure other people here could point you at other/better ones.

1 Like