Chaos/reservoir computing and sequential cognitive models like HTM

After some discussion in the recent “Deeper Look at Transformers” thread, cezar_t suggested a separate thread might be appropriate.

The basic idea I was presenting was that transformers have done very well at finding hierarchical sequential (temporal) structure, initially in language. But what I believe they don’t grasp, and the remaining insight to be gained from their application to language, is that the “hierarchical temporal” structure of language appears not to be stable.

This lack of stability might be the cause of the enormous blow-out in the number of “parameters” found by transformer models, with billions of “parameters” being found.

I would argue this lack of stability is also evident in certain historical results from linguistics. Notably, contradictions in the language structure which can be “learned” were actually what fractured the entire field of linguistics in the 1950s.

So:

Current state-of-the-art: Transformers seek to find structure in language by clustering sub-sequences (clusters over “attention”). Such clusters define a kind of “energy surface” on the prediction problem. They’re found by “walking” the energy surface using gradient descent. They work remarkably well. But the number of structural “parameters” blows out into the billions, and appears to have no upper limit.

Hypothesis: The actual energy surface might be dynamic, even (borderline sub-)chaotic: with peri-stable attractor states, but subject to suddenly flipping from one stable attractor state to another, perhaps with contradictory structural “parameters”, depending on context.

The idea is that the perceptual structure the brain creates over language sequences, currently modeled by transformers as a stable energy surface, might actually be the excitation patterns of some kind of evolved reservoir computer/echo/liquid state machine.

HTM might be an ideal context to explore this idea, because HTM is not trapped by the historical focus on “learning” procedures in the ANN/deep learning tradition, such as the assumption of stable energy surfaces which can be found by gradient descent. In fact, one of the grounding motivations of HTM was exactly an explicit rejection, on the basis of biological implausibility, of the learning procedures which have always been the focus of the broader “neural network” research tradition.

By contrast, it turns out there is a very biologically plausible interpretation of the alternative dynamical-system hypothesis being proposed: that it might easily have evolved from some kind of predictive reservoir/echo/liquid state machine on an early, simple nervous system.

This might be easily explored by some simple experiments, some of which have already been implemented on a basic neurosimulator.

Comments, or continuation from the transformer thread, are welcome here.

5 Likes

A naive question about reservoir computing: I guess the modeling-reservoir device has to possess some kind of resonant (if not exactly entangling) property to successfully reflect the target problem, and then it would feel very accidental, a matter of chance, that we could discover the right device. How does this chance compare to the billions of parameters that transformers leverage at the moment? What principled ways could give better chances / advantages via the HTM approach?

2 Likes

A naive question about reservoir computing: I guess the modeling-reservoir device has to possess some kind of resonant (if not exactly entangling) property to successfully reflect the target problem

I think “some kind of resonant” property, yes.

How this happens in reservoir computing per se, I can’t tell you. I’m not an expert in that area. I think I came across reservoir computing because I was searching for other people working on chaotic-system properties underlying language structure. So I was working back from the opposite direction: I was starting from the supposition that language structure was being governed by some kind of chaotic groupings. And I found reservoir computing coming from the other side, using chaotic resonances in liquids to model sequences.

So they will be different.

I think what might be different might not so much be the fact of resonances coding sequences as such. What might be different might be the connectivity of the underlying substrate.

I would say you need a specific connectivity in your underlying network, to get the groupings I imagine for language. A basic echo state machine, I understand, just needs any kind of elastic or fluid medium. As I understand it, it’s like the ripples spreading out from a stone dropped in a pond. The ripples “encode” the stone. You can reproduce the stone, in a way, as an echo of the ripples.

But that is just ripples in a liquid. The substrate is a liquid, randomly connected. A standard echo state machine will not necessarily develop an underlying network which explicitly connects observed sequences. I’m pretty sure no permanent (Hebbian, fire-together-wire-together?) observed-sequence connectivity gets implemented in echo state machines at all at the moment.

So the important step from the echo state machine point of view might not be developing the resonant property. The important step might be evolving to connect the underlying network substrate on which the resonant echo will resonate in a maximally predictive way. And according to the way language actually seems to structure for maximum predictivity, that connectivity might just be a simple, Hebbian, fire together wire together, for sequences.

Simple, but it wouldn’t happen in a fluid. It would have to happen in a network. And the network would have to be one which evolved to make new connections between elements based on them both firing, together, before the element they connect to fires (which is just Hebbian?).
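
For concreteness, here is a minimal sketch of that kind of update rule (a toy illustration in Python, not an existing HTM or reservoir implementation): strengthen the links from every element that fired at the previous step to the element that fires now.

    import numpy as np

    def hebbian_sequence_update(W, prev_active, curr_active, lr=0.1):
        """Strengthen links from elements active at t-1 to elements active at t.

        W[i, j] is a (hypothetical) connection strength from element i to
        element j. Elements that fire together before j fires all get their
        link onto j reinforced: "fire together, wire together", for sequences.
        """
        for j in curr_active:                    # element firing now
            for i in prev_active:                # elements that fired just before
                W[i, j] += lr * (1.0 - W[i, j])  # saturating Hebbian increment
        return W

    # Toy usage: 5 elements, observe the sequence (0, 1) -> 2 repeatedly.
    W = np.zeros((5, 5))
    for _ in range(20):
        W = hebbian_sequence_update(W, prev_active=[0, 1], curr_active=[2])
    print(W[0, 2], W[1, 2])   # both links onto element 2 grow toward 1.0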

then I would feel it very accidental by chance, that we can discover the right device.

So, chance, yes. The chance to develop the right “device”: one connecting sequential nodes in a network according to how they share prediction of the firing of the next node. But maybe not such a far-fetched chance in the context of a biological echo state machine, which might already be operating over neural connectivity. Even if that neural connectivity developed for other purposes, like simple muscle control, or sensory feedback.

How is this chance compared to billions of parameters that transformers leverage atm?

The chance of a biological echo state machine occurring randomly on a primitive neural connectivity network, evolving to emphasize fire together wire together in this way, I don’t think would directly relate to the billions of transformer “parameters”. The billions of transformer parameters would just be the infinite number of resonant patterns possible in any kind of resonant prediction system, either on a fluid or on a sequence network.

What principled ways could give better chances / advantages via HTM approach?

Well, the HTM approach is already a network. And it’s a sequential network, basically implemented on Hebbian principles, just because those have been observed in the biology. So from the echo state machine point of view, it may already be the right substrate to capture the right kind of resonances, and ones which elude more primitive echo state machines, only implemented on fluids, or randomly connected networks.

All we would need to do would be to implement a resonant mechanism on it.

2 Likes

Imagine trying to implement an echo state machine which is ripples spreading in a pond from a dropped stone.

But doing so by trying to “learn” all the possible ripple patterns which might spread from it!

Imagine how many ripple patterns you would have to try and learn! You would learn billions. But it would never be enough. Especially if some of them could overlap and interfere with each other.

1 Like

I said I would post my AGI-21 presentation which has a little bit on this resonance idea at the end.

I hesitate to do so, because the main focus of the presentation was my paper on an older version of this idea. And I hesitate to confuse what is fundamentally a very simple idea, by putting too much emphasis on what I now see as a very old, and confused, attempt to implement it.

The old version of the idea attempted to abstract the network connectivity into vectors. I then attempted to imitate constantly changing resonant groupings over this network connectivity by recombining the vectors in different ways.

The abstraction into vectors was kind of necessary at the time (this was very early, 2000 and before) because the computational power available was rather low. In particular, parallel processing was not available at all. This was before even GPUs liberated ANNs.

It did work. But I only realized slowly that abstracting into vectors actually threw away a lot of necessary connectivity information, and that kept its power pretty much at a GOFAI symbolic abstraction level.

I only slowly developed the direct network implementation idea. Much of that while corresponding on the HTM forum 2014-2016.

And then the final simplification, that an implementation might be as simple as resonances on a network, didn’t come until I found the Brazilian paper mentioned in the “Transformer” thread (“A Network of Integrate and Fire Neurons for Community Detection in Complex Networks”, Marcos G. Quiles, Liang Zhao, Fabricio A. Breve, Roseli A. F. Romero, https://www.fabriciobreve.com/artigos/dincon10_69194_published.pdf).

Taking a detour through vectors may confuse. So people following this thread may find it best only to look at the short presentation of the network resonance implementation right at the end. But FWIW, here’s the presentation of that old paper:

Vector Parser - Cognition a compression or expansion of the world? - AGI-21 Contributed Talks

2 Likes

The reservoir is normally not tuned for a specific problem, although that might help.

What also helps (as mentioned in various papers) is if it is almost chaotic, but not entirely. The closer it gets to criticality, the more sensitive it becomes to inputs farther in the past; but above criticality its output does not stabilize: it becomes sensitive to all past inputs, which is not really useful, since it won’t reflect meaningful sequential patterns.
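
A rough way to see that numerically (a toy sketch, not reproducing any particular paper): scale a random reservoir’s recurrent weight matrix to different spectral radii and watch how long a single input pulse keeps influencing the state.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200

    def memory_of_pulse(spectral_radius, steps=200):
        """Echo-state style update x(t+1) = tanh(W x(t) + w_in u(t)).

        Feed a single pulse at t=0 and measure how long the difference
        from an unperturbed run persists.
        """
        W = rng.standard_normal((N, N))
        W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
        w_in = rng.standard_normal(N)
        x_a = np.zeros(N)      # run with the pulse
        x_b = np.zeros(N)      # run without it
        diffs = []
        for t in range(steps):
            u = 1.0 if t == 0 else 0.0
            x_a = np.tanh(W @ x_a + w_in * u)
            x_b = np.tanh(W @ x_b)
            diffs.append(np.linalg.norm(x_a - x_b))
        return diffs

    for rho in (0.5, 0.95, 1.5):
        d = memory_of_pulse(rho)
        # well below 1: the pulse fades fast; near 1: it fades slowly
        # (long memory); above 1: it never dies out and the state stays
        # sensitive to everything.
        print(rho, d[10], d[100])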

1 Like

But the molecular structure/parameters of the liquid substance can vary over more than billions of possibilities too, and the overlap/interference problems seem nevertheless complex, if macromolecular liquids are considered.

How can it be any easier, when used to discover the mechanism underneath language?

Sounds like an arbitrary chaotic system can be used to simulate another chaotic system, with just proper tuning? I don’t get it.

1 Like

I think you are imagining that the reservoir has to capture some properties of the dynamical system we are trying to predict.

I believe that’s the wrong way to look at it.

To me it seems reservoir computing is a clever hack to extend locality-sensitive hashing into the time domain.

If similar inputs produce chaotic but still similar oscillations, they can be mapped onto any outputs we want. It’s that same point we keep repeating about random vectors being nearly orthogonal.

3 Likes

Thanks! I kinda get a little of it. Can you please expand, ideally with some example of “extending locality-sensitive hashing into the time domain”?

I’m not sure I understand it correctly, but if some mappable domain with discrete values/options is the requirement, there are very cheap and reliable ways to get one in computer programming practice, e.g. an algebraic data type comes in very handy:

data State = State0 | State1 | State2;

Almost 0 effort to get.

1 Like

Correct, @JarvisGoBrr. My opinion is the same as yours. It is not trying to emulate the chaos of the world. It is trying to predict the world based on grouping events which tend to share behaviour. It essentially posits that events which share some predictions might share other predictions, and groups them into a class. It is just that in this case I’m suggesting the grouping relation might itself generate attractor states of a chaotic system. But it would not be trying to emulate chaos in the world directly. It would only be trying to group things based on shared predictions, which grouping might itself turn out to generate a chaotic system.

Funnily enough, the last time I talked about chaos in this forum (2014-2016, as I say), someone also jumped to the conclusion that the way to do it was to imagine the brain was trying to emulate external chaos directly. This was @fergalbyrne. I don’t know if he still monitors this list; I see his ID tag appears to be in the system. He went off and started a startup, Ogma.com, to implement this direct emulation of external-world chaos.

It would be amusing if he is still monitoring, and can tell us how that went.

I don’t know enough about reservoir computing to say. Perhaps you are right. Perhaps reservoir computing is just hashing past experience.

That would emphasize the importance of the “amplification” mechanism I’m suggesting evolved to occur on top of reservoir computing. The “amplification” would be a grouping of events across the time domain. A novel grouping. So not just a hash this time, but an active regrouping of events in a way which has the “meaning” of enhancing prediction.

I’m always reminded, when I think of this, of the way I believe Google (and now others) implement low-light photography. As I understand it, they enhance low-light photos by combining many photos taken rapidly one after the other. So they “stack” a rapid succession of photos across (a small period of) time.

Well, the amplification I am talking about would also be a stack. Though different from the Google low-light photography stack: it would be across wide ranges of time, and stacked not by matching the images, but by matching the contexts events occur in.

This appears to be talking about the expressive power of SDRs. Which is fine. And you’re surely right that the very randomness of chaos has an extra expressive effect, in the same way that the sparseness of SDRs does (if that’s what you mean).

But anyway, if you are right, and reservoir computing is essentially a hash, the crucial difference between the chaotic groupings I am talking about and the expressive power of a reservoir computing “hash” would be that grouping on shared contexts could generate completely new groupings. So it would not just be a hash. The resonances I’m talking about would be new structure, parameterized by the fact of sharing contexts/predictions.
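
To make “grouping on shared contexts” a bit more concrete, here is a very rough toy illustration (not the resonance mechanism itself, just static counting standing in for it): record the contexts each word occurs in, and words whose context sets overlap fall into a new grouping that was never stored anywhere as a pattern.

    from collections import defaultdict

    sentences = [
        "the cat chased the mouse",
        "the dog chased the ball",
        "a cat ate the fish",
        "a dog ate the bone",
    ]

    # context of a word = (previous word, next word)
    contexts = defaultdict(set)
    for s in sentences:
        toks = s.split()
        for i in range(1, len(toks) - 1):
            contexts[toks[i]].add((toks[i - 1], toks[i + 1]))

    def shared(a, b):
        return len(contexts[a] & contexts[b])

    # "cat" and "dog" share contexts like ("the", "chased") and ("a", "ate"),
    # so they fall into the same ad hoc group even though that pairing was
    # never stored anywhere as an explicit pattern.
    print(shared("cat", "dog"), shared("cat", "ball"))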

3 Likes

I don’t think there’s a rule that says it has to be discrete; LSH only requires the hashing and similarity properties.

Imagine there’s a hash function h = H(x) we are trying to implement. It needs a few requirements:

  • It needs to output random-looking vectors.
  • Similar inputs should map to similar outputs.

With this we can map any continuous input to any continuous output, with enough training and parameters on the decoding matrix.
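
A minimal sketch of the static version (illustrative only; a fixed random projection standing in for H()): similar inputs land on similar codes, and a trainable linear decoding matrix can then map those codes to whatever outputs we want.

    import numpy as np

    rng = np.random.default_rng(2)
    D_in, D_hash = 10, 256

    P = rng.standard_normal((D_hash, D_in))   # fixed random projection = H()

    def H(x):
        # random-looking output, but smooth: nearby x give nearby codes
        return np.tanh(P @ x)

    x = rng.standard_normal(D_in)
    x_near = x + 0.01 * rng.standard_normal(D_in)
    x_far = rng.standard_normal(D_in)

    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(cos(H(x), H(x_near)))   # close to 1: similar inputs, similar codes
    print(cos(H(x), H(x_far)))    # near 0: unrelated inputs, unrelated codes

    # A linear readout matrix, trained by regression on top of H(x),
    # can then map these codes to any target outputs we like.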

But if we want to extend it to the time domain, we need to modify H() so that it takes its own output from t-1, and it needs an extra property:

  • H() will output a vector similar to its previous output but slightly shifted, in a deterministic but chaotic manner. Since inputs close in time are similar, outputs have to be similar.

A network of spring-coupled oscillators just happens to fit the bill perfectly as H(): it takes its own position and momentum output from the previous timestep and generates a new, similar output, chaotically mixed with the input from the current timestep.
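
And a sketch of the time-domain extension (an echo-state style recurrent update standing in for the spring-coupled oscillators): H() now also takes its own previous output, so nearby input histories keep producing nearby states.

    import numpy as np

    rng = np.random.default_rng(3)
    N, D_in = 300, 10
    W = rng.standard_normal((N, N))
    W *= 0.95 / max(abs(np.linalg.eigvals(W)))   # near the edge of chaos
    W_in = rng.standard_normal((N, D_in))

    def H_t(h_prev, u):
        """New state: similar to h_prev, chaotically mixed with the new input."""
        return np.tanh(W @ h_prev + W_in @ u)

    def run(inputs):
        h = np.zeros(N)
        for u in inputs:
            h = H_t(h, u)
        return h

    seq = rng.standard_normal((20, D_in))
    seq_near = seq + 0.01 * rng.standard_normal(seq.shape)
    seq_far = rng.standard_normal((20, D_in))

    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(cos(run(seq), run(seq_near)))   # high: similar histories, similar states
    print(cos(run(seq), run(seq_far)))    # low: unrelated histories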


4 Likes

It’s not trying to find the mechanism under language. It’s saying the mechanism under language IS to make ripple patterns.

This is different from transformers. Transformers suggest the “mechanism” is the patterns. So in the context of transformers we are trying to find patterns. This is suggesting the mechanism generates patterns (and goes on generating new ones forever). So we are not trying to find patterns. We are assuming a mechanism to generate patterns, and hopefully demonstrating that it does generate them.

It’s the difference between trying to learn all the ripple patterns, and having the mechanism to generate the ripple patterns.

Or, to go beyond simple reservoir computing, I’m suggesting the difference between trying to learn all the ripple patterns, and having an enhancement which might have evolved on top of that ripple-generating mechanism, one a bit like Google’s low-light photo enhancement “stacking”.

3 Likes

Not just that: take the dot product of two random vectors and it will be nearly zero. It works for dense vectors just as well as for sparse ones.
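
A quick check of that for the dense case (the expected magnitude of the dot product of two random unit vectors in d dimensions is about 1/sqrt(d)):

    import numpy as np

    rng = np.random.default_rng(4)
    a = rng.standard_normal(10_000)
    b = rng.standard_normal(10_000)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    print(a @ b)   # on the order of 0.01: nearly orthogonal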

2 Likes

Informative enough! Thanks.

Now I think I get most of it. A computer’s type system (usually conveyed by a programming language) lacks a design device to model similar-ness; it’s always equal or not-equal, a result of the excluded middle of logic (which is “a” math), I guess.

2 Likes

I feel I have gained more (but not necessarily correct) understanding of your ideas. But in this case, I would think it has to be some novel predictive power, most probably diverging from the currently perceived chaotic yet physical reality, and the best we can do with it is to find its own utility, rather than predicting subsequences in our present version of the chaotic world?

– I may sound pessimistic, but not really feeling bad, just wondering.

1 Like

I think the power of chaotic systems is that, once trained properly, they can generate novel but still plausible instances of their input.

I believe what @robf is saying is more like constructive/destructive interference of ripples generating “groups”, but I think there also has to be nonlinearity and damping in the system.

2 Likes

Yes. The ripples analogy is like this: when you throw a stone in a lake, the initial impact (size, speed, direction, lake shape, etc.) determines a unique pattern of ripples that attenuates over time. If you throw several stones at different points, their patterns overlap. By measuring wave states at several random “output” points in the lake, the output pattern tends to be unique to the input sequence of stones. While you cannot easily compute the initial conditions, similarity between inputs tends to propagate to their respective outputs, so you can train a simple linear classifier/regressor to identify input temporal patterns.

In practice there are many implementation ideas, from fixed, non-trainable RNNs to physical analog circuits. Even a damped pendulum was shown to exhibit reservoir properties ([2201.13390] Machine Learning Potential of a Single Pendulum).
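
As a compact sketch along the fixed, non-trainable RNN line (a toy of my own, not taken from the pendulum paper): a frozen random reservoir plays the lake, input “stones” perturb it, and only a linear readout over the reservoir state is trained, by ridge regression, to identify which temporal input pattern was dropped in.

    import numpy as np

    rng = np.random.default_rng(5)
    N, T, n_classes = 300, 30, 3

    W = rng.standard_normal((N, N))
    W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # frozen "lake" dynamics
    W_in = rng.standard_normal(N)

    prototypes = rng.standard_normal((n_classes, T))   # 3 temporal input patterns

    def final_state(u_seq):
        x = np.zeros(N)
        for u in u_seq:
            x = np.tanh(W @ x + W_in * u)       # ripples spread and attenuate
        return x

    # Training set: reservoir states from noisy versions of each pattern.
    X, y = [], []
    for c in range(n_classes):
        for _ in range(50):
            X.append(final_state(prototypes[c] + 0.1 * rng.standard_normal(T)))
            y.append(c)
    X = np.array(X)
    Y = np.eye(n_classes)[y]                    # one-hot targets

    # Ridge-regression readout: only this part is trained.
    lam = 1e-3
    W_out = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ Y)

    # Test on fresh noisy patterns.
    correct = 0
    for c in range(n_classes):
        for _ in range(20):
            x = final_state(prototypes[c] + 0.1 * rng.standard_normal(T))
            correct += int(np.argmax(x @ W_out) == c)
    print(correct, "/", 60)   # typically close to 60/60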

1 Like

This leads me to a deeper / more philosophical wonder: “similarity” seems to be an effect imposed on the observer by perception, or some qualia, rather than any innate property of the “inputs”. That is to say, you “feel” the inputs are “similar” and so they are; so the mechanism that produced your “feeling of similarity” is what really matters, regardless of any other relationship (however physical/true/significant otherwise) between those inputs.

Then, as human feelings converge at large, so do the semantics conveyed by utterances (regardless of which surface language they are in), so it’s no wonder some non-human mechanisms can converge similarly?

It may certify the faith in back-propagation w.r.t. AI today as false, but I still feel the discovery of such non-human mechanisms is not guaranteed, whether with HTM or other biologically plausible/inspired attempts.

1 Like

Novel predictive power, yes. This novel recombination mechanism can never have the same predictive power as physical reality. But it can find “objects” in that world. And the ability to resolve the world into different “objects” has a novelty of its own.

I guess this is what we see when we appreciate great art. We see new ways of resolving the world. These new ways in which an artist resolves the world don’t exceed the power of the world to generate its own destiny, but they can be things that the world has not bothered to spell out by explicitly grouping them, until the artist, or simply an original-thinking person, comes along.

There’s nothing pessimistic in being the agent which expresses a new “object” or “truth” just because it was always latent in the world. It’s the fact it was latent in the world, but not expressed before, which makes it great art.

The ripples won’t so much have interference, because they don’t have to all occur at the same time. It only becomes interference if you try to “learn” them, like a transformer does. And I imagine a transformer just hashes them all, distinguished by different contexts, and keeps them separate that way.

So I think transformers deal with this “interference” of contradictions by hiding it from us.

That’s why transformers would be working so much better than GOFAI. GOFAI tries to learn a single underlying pattern. The interference will be what washes symbolic representations out, and makes GOFAI ineffective.

Ah, that was a twist of echo state machines I didn’t know. This combination of states to reproduce a sequence.

Anyway, to the extent they reduce to a lookup “hash”, that puts more credit on the recombination mechanism I posit evolved to work on top of them: the “Google low-light photography”-like “stacking” mechanism. In that sense, it will only be their dynamical nature which is directly significant to what I am talking about in their operation.

2 Likes