Can you restate exactly which bits you agree with? Because to me that sounds like you're saying, “I agree clusters won’t be stable, but I don’t get what you mean by saying they won’t be stable”!
Was it just the prediction of sequences using clusters bit that you agreed with?
It seems clear to me that because they start from a paradigm of “learning”, transformers, and likely Hopfield nets, will try to find clusters which predict better by following prediction “energy” gradients down to minima. That’s great, but once you find such a minimum, that is your cluster. It won’t change. You will have trained your network with weights which embody it.
That to me is a “stable” cluster. And seeking a maximally predictive cluster in that way seems to me to assume the cluster will be stable. Otherwise you wouldn’t try to embody it in network weights.
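To make that concrete, here’s a minimal sketch of the point, with a toy one-dimensional “prediction energy” (the loss function and constants are made up, just to illustrate convergence to a frozen minimum):

```python
# A minimal sketch: gradient descent on a toy "prediction energy"
# surface. The energy function here is hypothetical, chosen only to
# illustrate convergence to a fixed minimum.
import numpy as np

def energy(w):
    # Toy prediction "energy": a smooth bowl with its minimum at w = 2.
    return (w - 2.0) ** 2

def grad(w):
    # Analytic gradient of the toy energy.
    return 2.0 * (w - 2.0)

w = np.random.randn()          # random initial weight
for step in range(1000):
    w -= 0.1 * grad(w)         # follow the energy gradient downhill

# After training, w sits at the minimum (~2.0) and never moves again:
# the "cluster" that minimum represents is now frozen into the weight.
print(f"learned weight: {w:.4f}, energy: {energy(w):.6f}")
```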
You might say the network learns many different clusters dependent on context (context = “attention”.) And those many different clusters might be thought of as a form of “instability”. Is that what you meant by “take thousands of words into context”? But however many such clusters you learn, even if they are contradictory (distinguished only by “attention” context?), their number will be fixed after the learning process.
Having a fixed number after learning is just an artifact of a paradigm of learning which seeks to gradually adjust weights, and follow energy gradients to prediction minima.
That “learning” paradigm has the advantage that you don’t have to assume as much about the system to be learned. You can try all sorts of things, and then just follow your prediction energy gradient down to a minimum.
So it has the advantage of being dependent on fewer assumptions about what is to be learned.
But it has the disadvantage of being dependent on the idea that weights will gradually converge, and then remain static. You start from the slope, and trace it back. So the slope must be fixed (and “smooth”/“linear”?).
By contrast, I think the prediction energy “slope” flips back and forth discontinuously. That means you can’t trace back along it. (Is the ability to “trace back along it” an assumption of “linearity” in back-prop?)
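Here’s a minimal sketch of what I mean, assuming a toy setup where the gradient flips discontinuously with context (the targets, learning rate, and everything else are hypothetical, chosen only to illustrate the flip):

```python
# A minimal sketch: if the energy "slope" flips discontinuously with
# context, gradient steps taken under different contexts pull the
# weight toward incompatible minima.
import numpy as np

def grad(w, context):
    # The target flips discontinuously with context: +2 or -2.
    target = 2.0 if context else -2.0
    return 2.0 * (w - target)

rng = np.random.default_rng(0)
w = 0.0
for step in range(1000):
    context = rng.random() < 0.5      # context flips back and forth
    w -= 0.01 * grad(w, context)      # each step follows a different slope

# The weight drifts around 0.0 -- a minimum of *neither* context.
print(f"final weight: {w:.3f}")
```

The weight ends up at a compromise point that is a minimum for neither context, because there was never a single fixed surface to trace back along.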
So how do you find an energy-minimum predictive state in a network which flips from one maximally predictive cluster to another, discontinuously, as context changes?
Fortunately, the tracing of slopes smoothly down to minima need not be the only way to find clusters which maximally predict in a sequence network. If maximally predicting means having a lot of shared connections, such clusters will also tend to synchronize any oscillations. In any case, synchronized oscillations are another kind of minimized energy state of a network. And in a sequence network, a synchronized cluster will surely be a minimized energy state of sequence.
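For a concrete picture, here’s a minimal sketch using Kuramoto-style phase oscillators (the graph, coupling strength, and frequencies are all made up, just to illustrate that densely connected nodes synchronize):

```python
# A minimal sketch of the alternative: phase oscillators coupled
# through shared connections. Densely interconnected nodes pull each
# other into synchrony, so a cluster reveals itself as a synchronized
# group rather than as a set of frozen weights.
import numpy as np

rng = np.random.default_rng(1)
n = 10
# Adjacency: two dense clusters (nodes 0-4 and 5-9) with one weak bridge.
A = np.zeros((n, n))
A[:5, :5] = 1.0
A[5:, 5:] = 1.0
A[4, 5] = A[5, 4] = 0.1
np.fill_diagonal(A, 0.0)

theta = rng.uniform(0, 2 * np.pi, n)   # random initial phases
omega = rng.normal(0, 0.1, n)          # slightly different natural frequencies

dt, K = 0.05, 1.0
for _ in range(2000):
    # Kuramoto update: each node is pulled toward its neighbours' phases.
    coupling = (A * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
    theta += dt * (omega + K * coupling)

# Within each dense cluster the phases collapse to a common value;
# across the weak bridge they need not. Synchrony marks the cluster.
print(np.round(np.mod(theta, 2 * np.pi), 2))
```

The point of the sketch: the cluster shows up as a group of synchronized phases at run time, not as a set of weights fixed in advance.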
There was a paper suggesting that Transformers are Hopfield networks after some slight modifications, and if each layer of the transformer is allowed to oscillate. Not sure if this is fundamentally different from your approach, but there’s an oscillatory component to it. YouTube link: Hopfield Networks is All You Need (Paper Explained) - YouTube
I looked at the paper, but couldn’t find any mention of oscillations. Does Kilcher mention oscillations? Can you cue the spot in the video?
I’m not familiar with Hopfield nets. But the paper was just saying they can achieve equivalent forms of learning. Which I don’t doubt.
Do they not also gradually adjust weights, to find minima along smoothly varying, “linear”, energy surfaces?
Are you basing the above opinion more on the cluster-formation side of things, from the article you linked above titled “The Many Truths of Community Detection”? Or is it more about how it’s impossible to learn how to choose the best clusters based on context?
I’m basing it firstly on what I observed when trying to learn grammars when I first started working on the grammar description problem. It’s fairly simple really. You can learn grammars fine.
You just classify words based on a vector of their contexts. QED… or so it seems…
The only trick is that the vectors contradict! It’s just an element ordering problem. You order all the elements (contexts) one way, and it gives you a nice (grammatical) “class”. The trouble is, you order them another way, and it gives you another class. Both classes are correct. The word participates in both of them. But not both at the same time. And while other words will belong to those same classes (mostly), none of them will have all the same elements. Some might have some elements, others might have other elements. It all becomes a big hodgepodge, and the only way to capture it all is to order them the way you want, when you want to.
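Here’s a toy illustration of that contradiction (the words, contexts, and counts are invented for the example):

```python
# A minimal sketch of the contradiction: each word is a vector of
# counts over contexts, and which "class" it falls into depends on
# which contexts (elements) you select, i.e. how you order them.
import numpy as np

contexts = ["to _", "will _", "a _", "the _"]   # columns of each count vector
words = {
    "walk":  np.array([5, 4, 3, 2]),   # appears in both kinds of context
    "eat":   np.array([6, 5, 0, 0]),   # mostly verb-like contexts
    "chair": np.array([0, 0, 6, 5]),   # mostly noun-like contexts
}

def nearest(word, dims):
    # Classify a word by cosine similarity, restricted to chosen contexts.
    v = words[word][dims]
    def cos(u, w):
        return u @ w / (np.linalg.norm(u) * np.linalg.norm(w) + 1e-9)
    return max(("eat", "chair"), key=lambda o: cos(v, words[o][dims]))

print(nearest("walk", [0, 1]))   # verb contexts only -> groups with "eat"
print(nearest("walk", [2, 3]))   # noun contexts only -> groups with "chair"
```

Restrict attention to one subset of contexts and “walk” classifies with the verb-like word; restrict to another and it classifies with the noun-like one. Both classes are correct, but not both at the same time.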
So this is something I observed directly.
It was only afterwards that I found other people noticing similar “problems”. Having a physics background I felt there was an intriguing parallel to uncertainty principles, and even gauge theories of particle formation. And then I found other people noticing the same thing. The first one was Robert Laughlin, “Reinventing Physics from the Bottom Down”, emphasizing the irreducibility of structure at each level, even for physics. I mentioned in another thread the parallel Bob Coecke has drawn between language structure and quantum mechanics… etc. etc. Then I found there seemed to be a parallel in the diagonalization proof of Goedel’s incompleteness theorem (though that’s still conjecture; I haven’t nailed down an equivalence for that). But the same parallel exists in the applicability of category theory, which I found being applied to language structure. Category theory is itself a response to Goedel incompleteness in the maths context.
Oh, and then I found the observation of contradictions in the history of language learning…
The network clustering paper I linked is just the latest in a list of many examples.
But initially it was something I noticed myself, when trying to learn grammar.