After some discussion in the recent “Deeper Look at Transformers” thread, cezar_t suggested a separate thread might be appropriate.
The basic idea I was presenting was that transformers have done very well finding hierarchical sequential (temporal) structure, initially in language. But what I believe they don’t grasp, and a remaining insight to be gained by the application to language, is that the “hierarchical temporal” structure of language appears not to be stable.
This lack of stability might be the cause of the enormous blowout in the number of “parameters” found by transformer models, which now run into the billions.
I would argue this lack of stability is also evident in certain historical results from linguistics. Notably, contradictions in the language structure which can be “learned” were actually what fractured the entire field of linguistics in the 1950s.
Current state of the art: Transformers seek to find structure in language by clustering sub-sequences (clusters on “attention”). Such clusters define a kind of “energy surface” on the prediction problem. They’re found by “walking” the energy surface using gradient descent. They work remarkably well. But the number of structural “parameters” blows out into the billions, and appears to have no upper limit.
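For readers less familiar with what “clusters on attention” means mechanically, here is a minimal sketch of scaled dot-product attention weights in plain Python. The token vectors are hypothetical values chosen for illustration, and this leaves out everything learned (projection matrices, multiple heads), so treat it as a picture of the grouping effect rather than a transformer implementation.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    # Scaled dot-product scores: similarity of the query to each key,
    # divided by sqrt(dimension), then normalised with softmax.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Hypothetical 2-d token vectors: the first two are similar, the third is not.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = attention_weights([1.0, 0.0], keys)
print(weights)
```

Tokens whose vectors point the same way as the query receive most of the weight, which is the sense in which attention groups sub-sequences together.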
Hypothesis: The actual energy surface might be dynamic, even (borderline sub-)chaotic, with peri-stable attractor states that are subject to suddenly flipping from one stable attractor state to another, perhaps with contradictory structural “parameters”, depending on context.
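To make the flipping idea concrete, here is a toy sketch of a one-dimensional energy surface with two attractor states, where a “context” term tilts the surface. The energy function, the `context` parameter and the step sizes are all assumptions picked for illustration; nothing here is derived from language data or from any transformer.

```python
def grad(x, context):
    # Gradient of the toy energy E(x) = x**4/4 - x**2/2 - context*x.
    # With context = 0 there are two wells, near x = +1 and x = -1.
    return x**3 - x - context

def settle(x, context, lr=0.05, steps=2000):
    # Plain gradient descent: walk downhill until the state stops moving.
    for _ in range(steps):
        x -= lr * grad(x, context)
    return x

x_rest = settle(0.9, context=0.0)      # settles in the positive well
x_small = settle(x_rest, context=-0.2) # small tilt: stays in the positive well
x_flip = settle(x_rest, context=-0.6)  # larger tilt: positive well vanishes

print(x_rest, x_small, x_flip)
```

A small change in context only nudges the state, but past a threshold the well it was sitting in disappears and the state jumps to the other attractor. A model that assumes one fixed surface would need separate structure to cover each case.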
The idea is that the perceptual structure the brain creates over language sequences, which transformers currently model as a stable energy surface, might actually be the excitation patterns of some kind of evolved reservoir computer/echo/liquid state machine.
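As a reference point for what a reservoir/echo state machine looks like in its simplest form, here is a minimal sketch in plain Python. The size, weight ranges and input signal are assumptions for illustration, and the row-sum rescaling is just one simple sufficient way to get fading memory. It is not a model of the brain or of the experiments mentioned in this post.

```python
import math
import random

def make_reservoir(n, scale=0.9, seed=0):
    rng = random.Random(seed)
    # Fixed random recurrent weights; nothing here is trained.
    w = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    # Rescale so every row's absolute sum is at most `scale` (< 1).
    # Since tanh is 1-Lipschitz, this makes the update a contraction,
    # so the influence of the initial state fades over time.
    max_row = max(sum(abs(v) for v in row) for row in w)
    w = [[v * scale / max_row for v in row] for row in w]
    w_in = [rng.uniform(-1, 1) for _ in range(n)]
    return w, w_in

def step(state, u, w, w_in):
    # One update: each unit mixes the previous state with the new input.
    return [math.tanh(sum(wij * sj for wij, sj in zip(row, state)) + wi * u)
            for row, wi in zip(w, w_in)]

def run(state, inputs, w, w_in):
    for u in inputs:
        state = step(state, u, w, w_in)
    return state

n = 20
w, w_in = make_reservoir(n)
inputs = [math.sin(0.3 * t) for t in range(200)]

# Two different starting states driven by the same input sequence.
a = run([0.5] * n, inputs, w, w_in)
b = run([-0.5] * n, inputs, w, w_in)
gap = max(abs(x - y) for x, y in zip(a, b))
print(gap)
```

The two runs end up in essentially the same state, so the excitation pattern is determined by the recent input sequence rather than by where the network started. That is the “echo” property: the state is a running trace of the sequence, with no gradient descent involved in producing it.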
HTM might be an ideal context to explore this idea, because HTM is not trapped by the historical focus on “learning” procedures in the ANN/Deep Learning tradition, such as the assumption of stable energy surfaces which can be found by gradient descent. In fact, one of the grounding motivations of HTM was exactly an explicit rejection, on the basis of biological implausibility, of such learning procedures, which have always been the focus of the broader “neural network” research thread.
By contrast, it turns out there is a very biologically plausible interpretation of the alternative dynamical system hypothesis being proposed: it might easily have evolved from some kind of predictive reservoir/echo/liquid state machine on an early simple nervous system.
This might be easily explored by some simple experiments, some of which have already been implemented on a basic neurosimulator.
Comments, or continuation from the transformer thread, are welcome here.