The “algorithm” is to find prediction energy… minima. In that sense it is the same as transformers/LLMs. The prediction energy minima LLMs seek are those on energy surfaces formed by collections of elements, notably back along the sequence, via “attention”. They may also seek energy minima on energy surfaces formed by collections of elements representing the current state. (I need to catch up with the balance there. RNN/LSTMs would be ALL current state, with just a memory state for the prior sequence. I don’t know how much “Attention Is All You Need” threw out collecting elements across the current state. Maybe you know.) These energy surfaces will represent different collections of elements in the current state (and/or “attention” state), and the minimum energy will be for the collection which best predicts the next state. That’s how they learn to predict. They calculate an energy surface for each collection and follow the slope of that energy surface, adjusting the collection of elements, until they find the minimum prediction energy collection. That minimum energy collection of elements is their network.
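(To make “follow the slope of the energy surface” concrete, here is a minimal sketch in Python, assuming the simplest possible setup: the “energy” is just squared next-step prediction error, and the “collection of elements” is a single weight matrix. Real transformers descend a vastly higher-dimensional surface, with attention over the whole prior sequence, but the loop is the same idea.)

```python
import numpy as np

# Toy "prediction energy": squared error when predicting the next
# element of a sequence from the current one, with a single weight
# matrix W standing in for the "collection of elements".

rng = np.random.default_rng(0)
seq = rng.standard_normal((100, 8))        # a toy sequence of 8-d states
W = rng.standard_normal((8, 8)) * 0.1      # the "collection" to be adjusted

def prediction_energy(W, seq):
    pred = seq[:-1] @ W                    # predict state t+1 from state t
    return np.mean((pred - seq[1:]) ** 2)  # height of the energy surface

lr = 0.01
for step in range(1000):
    pred = seq[:-1] @ W
    err = pred - seq[1:]
    grad = 2 * seq[:-1].T @ err / len(err) # slope of the energy surface
    W -= lr * grad                         # follow the slope downhill

# W now sits at (a local) minimum of the prediction energy:
# it is the "minimum energy collection of elements".
print(prediction_energy(W, seq))
```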
My “algorithm” is the same. It’s seeking minimum energy prediction collections across a network. The main difference is that I don’t think the minimum energy predictive state is static across the whole data set. I think there are different best minimum energy predictive states depending on what you are most interested in. In particular I believe that minimum energy predictive state will vary from prompt to prompt. (I think this is happening in transformers too. But we don’t see it. They hide it from us in their enormous context-sensitive “parameter” sets.)
So I want to find the minimum energy predictive state dynamically. And a way to do that should be to set the network oscillating and allow it to find its own minimum energy states, as resonances on the network.
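I don’t have a worked mechanism for this, but as a toy of “set the network oscillating and let resonances find the minima”, something Kuramoto-like might do: phase oscillators whose couplings stand in for data-derived synapses, settling into synchronized clusters with no externally imposed objective. All the particulars below (the coupling matrix, treating phase-locked clusters as “resonances”) are my assumptions, purely for illustration.

```python
import numpy as np

# Toy sketch (assumption-laden): a Kuramoto-style network of phase
# oscillators. Couplings K play the role of synapses added from data;
# once set oscillating, the network settles into phase-locked clusters,
# standing in here for dynamically found resonant "minimum energy" states.

rng = np.random.default_rng(1)
n = 20
K = rng.random((n, n)) * (rng.random((n, n)) < 0.3)  # sparse couplings "from data"
K = (K + K.T) / 2                                    # symmetric, for simplicity
omega = rng.standard_normal(n) * 0.1                 # natural frequencies
theta = rng.uniform(0, 2 * np.pi, n)                 # initial phases

dt = 0.05
for _ in range(2000):
    # each oscillator is pulled toward the phases of its coupled neighbours
    dtheta = omega + (K * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
    theta = (theta + dt * dtheta) % (2 * np.pi)

# Oscillators that end up phase-locked form a "resonance" -- a candidate
# dynamically found predictive collection.
print(np.round(theta, 2))
```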
The state of the art does this too. But they assume the surfaces to be found are static. So it probably never occurs to them to seek them dynamically. A lot of extra work, right? Best to do it once and be done with it!
(Also, I guess this would not have been done in neural network history, firstly because they didn’t know what to train for! For supervised “learning”, first you “supervise” to a set of ideals. Just letting the network find its own energy minima doesn’t make sense, because you don’t know what “meaning” is. You only have a bunch of examples you consider “meaningful”. So the only thing you can think of is to tie your system to those. Allowing the network to find its own minima only makes sense if the energy minima goal is more general. This is “unsupervised learning”, and it’s never been obvious what energy to minimize for this, what “objective function” to seek. It just so happens that for language the goal is obvious. It’s just prediction along a sequence. I would argue that this simplicity of language reveals a deeper truth about the relevant parameter of “meaning” for cognition: that “meaning” in cognition is also cause-effect prediction. And that deeper truth is the reason LLMs seem to be capturing so much of general cognition. So language helped transformers stumble onto a deeper truth of cognition. But even when looking at language, everybody assumes the “truth” will be static. Nobody imagines the “truth” of minimum energy prediction will be multiple, flipping from one structure to another, chaotically. Hence the current LLM state of the art also only seeks static energy prediction surfaces.)
Perhaps the distinction between training and application can become confusing with this change. I guess LLMs don’t have any intermediate stage where the network only represents the language as observed. From the get-go their networks will be formed into collections representing the prediction energy surface of the data. “Training” incorporates both the addition of data AND the minimum prediction energy search across that data.
In a system which seeks minimum prediction energy collections dynamically, the minimum prediction energy collection search is postponed to run time. “Training” is only the addition of data.
So the sense of “training” and “application” will be different between the two.
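To make the contrast concrete, here is a caricature of the two regimes, with the “energy search” shrunk down to just picking the most frequently observed continuation. All the function names are mine, purely illustrative:

```python
from collections import defaultdict

# Toy caricature of the two regimes (my sketch, not anyone's actual system).
# "Energy search" is reduced to picking the most frequent continuation.

def count_transitions(sequence):
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(sequence, sequence[1:]):
        counts[a][b] += 1               # one "synapse" per observed step
    return counts

# Static regime: training = add data AND do the search, once, up front.
def train_static(sequence):
    counts = count_transitions(sequence)
    return {a: max(f, key=f.get) for a, f in counts.items()}  # frozen answers

# Dynamic regime: training = only the addition of data...
def train_dynamic(sequence):
    return count_transitions(sequence)

# ...the minimum-prediction-energy search is postponed to run time, per prompt.
def apply_dynamic(counts, prompt):
    followers = counts.get(prompt[-1], {})
    return max(followers, key=followers.get) if followers else None

seq = list("abcabcabd")
print(train_static(seq)["b"])                   # 'c', decided at training time
print(apply_dynamic(train_dynamic(seq), "ab"))  # 'c', decided at run time
```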
Exactly what “training” will require for the dynamic option, I’m not sure. Naively you just add a synapse for each time step of the sequence in the data. But as has been pointed out in this thread, to capture longer-distance relationships, we will need to represent each time step with an SDR. The way that’s done now in HTM, if my memory has been correctly refreshed, there is a “training” step just to encode the sequence. Each element of the SDR at each time step needs to be given a “map” of all the elements of the SDR at the preceding time step.
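As I (hazily) remember it, that amounts to something like the sketch below. This is my simplified reconstruction, not Numenta’s actual code: a dense matrix stands in for HTM’s dendritic segments and permanences.

```python
import numpy as np

# Simplified reconstruction (my assumption, not actual HTM code) of
# "training" a sequence encoding: each active SDR bit at time t records
# a map of the bits active at time t-1, by strengthening synapses to them.

n_bits, sparsity, T = 256, 0.02, 50
rng = np.random.default_rng(2)
n_active = int(n_bits * sparsity)

# a toy sequence of SDRs: each time step activates a random sparse set of bits
sdrs = [rng.choice(n_bits, size=n_active, replace=False) for _ in range(T)]

synapses = np.zeros((n_bits, n_bits))   # synapses[i, j]: bit i predicted by bit j

for t in range(1, T):
    for i in sdrs[t]:                   # every element of the SDR at time t...
        synapses[i, sdrs[t - 1]] += 1.0 # ...gets a map of the preceding SDR

# Prediction: bits whose learned map overlaps the current SDR strongly enough.
def predicted_bits(current_sdr, threshold=2.0):
    overlap = synapses[:, current_sdr].sum(axis=1)
    return np.flatnonzero(overlap >= threshold)
```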
Such a time step to time step “map” between states may be necessary when defining “states” as dynamic minimum energy prediction collections too. I’m not sure. It’s possible the “map” will also be found dynamically as part of the whole minimum energy prediction collection search. Even what should be considered a “state”, or a “time step”, becomes part of the same search. A “time step” in the cognitive sense becomes just one or another partition of spikes in the raster plot. It may be sufficient to just “train” by adding synapses across random subsamples of SDRs for raw sequential sensory observations (and those sequences of sensory observations can actually be asynchronous, arriving just as the data comes from the sense organs. No central clock…)
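If that’s right, “training” for the dynamic option might be as thin as the following sketch, which is entirely my guess at a mechanism: whenever sensory data arrives, on whatever timing, add synapses between a random subsample of the currently active bits and a decaying trace of recently active ones. No clock, no fixed notion of a time step.

```python
import random
from collections import defaultdict

# Guess at a mechanism (not established HTM practice): event-driven
# "training" that only adds synapses between random subsamples of the
# bits active now and the bits active recently. No central clock --
# "recent" is just whatever the decaying trace still holds.

synapses = defaultdict(float)   # (pre_bit, post_bit) -> weight
recent = {}                     # bit -> decaying activity trace

def observe(active_bits, subsample=5, decay=0.9):
    """Called whenever a sense organ delivers data, at any time."""
    # decay the trace of previously active bits
    for b in list(recent):
        recent[b] *= decay
        if recent[b] < 0.05:
            del recent[b]
    # add synapses from a random subsample of recent bits
    # to a random subsample of currently active bits
    pre = random.sample(list(recent), min(subsample, len(recent)))
    post = random.sample(list(active_bits), min(subsample, len(active_bits)))
    for p in pre:
        for q in post:
            synapses[(p, q)] += 1.0
    # the current bits now become part of the "recent" trace
    for b in active_bits:
        recent[b] = 1.0

# events can arrive asynchronously, in any order, from any sense
observe({3, 17, 42})
observe({17, 99, 7})
observe({42, 7, 55})
```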