Transformers model spatial representations in hippocampus

"In this work, we show that transformers, when equipped with recurrent position encodings, replicate the precisely tuned spatial representations of the hippocampal formation; most notably place and grid cells.

In this work we 1) show that transformers (with a little twist) recapitulate spatial representations found in the brain; 2) show a close mathematical relationship of this transformer to current hippocampal models from neuroscience (with a focus on Whittington et al. (2020) though the same is true for Uria et al. (2020)); 3) offer a novel take on the computational role of the hippocampus, and an instantiation of hippocampal indexing theory (Teyler & Rudy, 2007); 4) offer novel insights on the role of positional encodings in transformers. 5) discuss whether similar computational principles might apply to broader cognitive domains, such as language, either in the hippocampal formation or in neocortical circuits."
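For anyone wondering what "recurrent position encodings" might look like in practice, here is a minimal sketch. It is not the paper's code. It assumes the position encoding is updated by an action-dependent linear map, roughly g_next = W[action] @ g, with the nonlinearity left out so the example stays exact. The 1-D ring world, the 2x2 rotation matrices, and all function names are illustrative assumptions.

```python
import math

# Hypothetical 1-D ring world with eight positions. Each action gets its own
# transition matrix (a 2x2 rotation here), which is an illustrative assumption.
STEP = 2 * math.pi / 8

def rotation(theta):
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

W = {"right": rotation(STEP), "left": rotation(-STEP)}

def step_encoding(g, action):
    """Recurrent update g_next = W[action] @ g (nonlinearity omitted)."""
    m = W[action]
    return [m[0][0] * g[0] + m[0][1] * g[1],
            m[1][0] * g[0] + m[1][1] * g[1]]

def encode_path(actions, g0=(1.0, 0.0)):
    """Return the position encoding after each action in the sequence."""
    g, out = list(g0), []
    for a in actions:
        g = step_encoding(g, a)
        out.append(g)
    return out

def close(u, v, tol=1e-9):
    return all(abs(x - y) < tol for x, y in zip(u, v))

# Two different routes that end at the same place yield the same encoding.
route_a = encode_path(["right", "right", "left"])[-1]
route_b = encode_path(["right"])[-1]
print(close(route_a, route_b))
```

The point of the sketch is that the encoding depends on where the agent is rather than on how many steps it has taken, which is what separates it from a fixed encoding indexed by sequence position.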


“Note, we are not saying the brain is closely related to transformers because it learns the same neural representations, instead we are saying the relationship is close because we have shown a mathematical relationship between transformers and carefully formulated neuroscience models of the hippocampal formation.”

That sounds pretty convoluted to me. Do you understand the difference?

This could just be a comparison of two mathematical models and may have little to say about what the hippocampus actually does?
I don’t have the neuroscience background to comment on the validity of TEM.

That should go without saying; models are all we have. If two effective models designed through very different processes match, it’s highly unlikely to be a coincidence.

That’s just a neat way to say that transformer and brain representations correlate, and they have no idea why. There’s a similar paper that blew up, showing that transformer activations are pretty similar to brain ones.

While I won’t be so bold as to claim that scaling plus transformers is the way to AGI, it really goes to show how much better implicit learning can be, especially for something as complex as intelligence. DL methods were always mocked for being neurologically dissimilar :thinking: even when the alternatives performed poorly and the Universal Approximation Theorem had been established.

Funny how something that has diverged so far from the neuroscience models performs better and exhibits more intelligent behavior than the current ones…


“Essentially all models are wrong, but some are useful.” - George Box


Do you know a better model?

When teaching physics, I always make sure to counsel my students about relying too heavily on models to interpret the nature of reality. Reality is what it is, and it does what it does. We can only observe, measure, and attempt to form models of its behavior. It’s actually a wonder that we can make models and predictions at all. There’s nothing that says the universe has to be understandable and predictable, except that if it weren’t, then we probably would not be here; at least not in the way that we are now.

There are many types of modeling that we make use of on any given day: mental models, linear models (interpolative and extrapolative), ad-hoc/rule-of-thumb/analogy/etc., and mathematical models. Mathematical models are the most reliable for making accurate and long-term predictions; however, they are only possible because, on a macroscopic scale, the statistical behavior of the universe tends towards highly entropic states: i.e. there exists a very large number of states that the system could be in, but whose observable behaviors are nearly indistinguishable from one another. Thus we are able to create models of the behavior of these highly entropic systems and apply them to describe all similar systems.

With all of that being said, the most interesting and challenging systems to model are ones with lower entropy, or that behave in such a way as to locally reduce or maintain their entropy. Some chemical systems, and nearly all biological systems are able to take advantage of the presence of excess free energy (energy available to do work) in the local system to reduce their entropy or otherwise persist in a lower entropy state. This lower entropy state can only be maintained at the expense of increasing the global entropy (of the system plus its environment). These systems are a challenge to model precisely because they are in these relatively unique low entropy configurations that are capable of generating some highly nonlinear behaviors/dynamics.


Nice. I think that’s a pretty good picture of the kinds of systems in which science has been so spectacularly and perhaps unexpectedly successful. Write a few equations, find some solutions, make a prediction. Nice.

Not my thing at all. I much prefer the kinds of system you allude to in your third para, for which there is often a computational model. We can’t solve an equation for tomorrow’s weather, but we can run a model and see where it goes.

My view is that animal intelligence does exactly that: construct a model of some part of reality, run the model, and take action on the outcome it predicts. Simple example: many birds and mammals can catch a moving object in mid air, and also learn to get better at it with practice.
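As a toy illustration of "construct a model, run the model, act on the prediction", here is a minimal sketch of the catching example. It assumes a drag-free projectile model and a hypothetical one-dimensional catcher; the function names and numbers are made up for illustration, and the learning-with-practice part is not covered.

```python
G = 9.81  # gravitational acceleration, m/s^2

def simulate_landing(x, y, vx, vy, dt=0.001):
    """Run a simple forward model (no air drag) until the object reaches y = 0."""
    while y > 0.0:
        vy -= G * dt
        x += vx * dt
        y += vy * dt
    return x

def choose_action(catcher_x, predicted_x, tolerance=0.05):
    """Act on the prediction: move toward where the object is expected to land."""
    if abs(predicted_x - catcher_x) <= tolerance:
        return "stay"
    return "move_right" if predicted_x > catcher_x else "move_left"

# Observe the object's current state, run the model forward, then act.
predicted = simulate_landing(x=0.0, y=2.0, vx=3.0, vy=4.0)
print(choose_action(catcher_x=1.0, predicted_x=predicted))
```

Getting better with practice would correspond to adjusting the model itself (for example, adding drag) when predictions and outcomes disagree.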

I’d like us to know how to write software to do that. I think we could call it AGI.

I write numerical solvers to do computational simulations for my day job. They are very sophisticated pieces of software and capable of simulating very complex dynamics, but they are far from intelligent. They are simply brute force algorithmic implementations of the conservation laws (mass, momentum, and energy).
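To give a flavor of what "brute force implementation of a conservation law" means, here is a minimal sketch of a first-order upwind finite-volume step for 1-D advection with periodic boundaries. It is a textbook toy, not one of the solvers described above; the grid size and Courant number are arbitrary assumptions.

```python
def upwind_step(u, c):
    """One finite-volume step of du/dt + a*du/dx = 0 with a > 0.

    c = a * dt / dx is the Courant number (stable for 0 <= c <= 1).
    Whatever leaves a cell through its right face enters the next cell,
    so the total is conserved. Index -1 wraps around (periodic boundary).
    """
    n = len(u)
    flux = [c * u[i] for i in range(n)]  # amount leaving cell i to the right
    return [u[i] - flux[i] + flux[i - 1] for i in range(n)]

u = [0.0] * 20
u[5] = 1.0  # initial blob of "mass"
mass_before = sum(u)
for _ in range(50):
    u = upwind_step(u, c=0.5)
mass_after = sum(u)
print(abs(mass_after - mass_before) < 1e-12)
```

The scheme smears the blob out as it moves, but the sum over all cells stays fixed, which is the conservation property the update is built around.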

Marvin Minsky said something like “Artificial intelligence is the science of making machines do things that would require intelligence if done by men.” That was before computers beat the best human chess players. Over the years I’ve seen a lot of things labelled AI until a computer did them, largely due to the advance of Moore’s Law.

So I agree, but why? What exactly is this ‘intelligence’ thing and how will we know when we’ve got there? What hypothesis can we make and what experiment can we conduct to test the accuracy of your claim?

I might claim that AGI is about algorithms that can choose and refine algorithms. Is that even science?

I think you can see the difference in what is missing from your previous definition.

This is certainly demonstrating learning, but we have something more in mind with the concept of AGI. It would seem coherent that before we can get AGI we must have the types of learning you describe, and we largely do; e.g. reinforcement learning can improve a robot’s ability.

The advances that are currently impressing people the most are in language models, maybe because they are demonstrating the ability to manipulate symbols. This difference is widely debated in AI: how to combine the learning we see in sub-symbolic ANNs with the coherent rule-based inference and deduction we see in symbolic AI. Bridging that gap would, I think, lead to systems people would consider to be AGI; e.g. it could learn the rules of algebra and then tell a “story” about algebra, perhaps in a way similar to GPT-3 but with a coherence that could extend our understanding of algebra in creative ways.

Some fancy footwork around transformers:
"It relies on an ability unique to attention heads (vs. neurons): They can move information not only to the output, but also to other places in the context. Using this ability, a head in the first layer learns to annotate each word in the context with information about the word that preceded it."
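Here is a minimal sketch of what "annotating each word with information about the word that preceded it" could look like mechanically. It uses a hand-written hard attention pattern and one-hot value vectors; a real head would produce soft weights from learned queries and keys, so everything below is an illustrative assumption rather than the quoted authors' implementation.

```python
def previous_token_attention(n):
    """Hard attention pattern: position i attends only to position i - 1.

    Position 0 has no predecessor, so it attends to itself (an arbitrary choice).
    """
    return [[1.0 if j == max(i - 1, 0) else 0.0 for j in range(n)]
            for i in range(n)]

def apply_head(weights, values):
    """Output at position i is the attention-weighted mix of value vectors."""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]

tokens = ["the", "cat", "sat"]
vocab = {t: k for k, t in enumerate(tokens)}
# One-hot value vectors stand in for learned token embeddings.
values = [[1.0 if vocab[t] == d else 0.0 for d in range(len(vocab))]
          for t in tokens]

moved = apply_head(previous_token_attention(len(tokens)), values)
# Pair each token with the token whose vector was moved to its position.
annotated = [(tokens[i], tokens[moved[i].index(1.0)]) for i in range(len(tokens))]
print(annotated)
```

A single neuron cannot do this, because its output stays at its own position; the attention weights are what route a vector from one position to another.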

I think that’s related to my lateral vs. vertical learning, or connectivity-based vs. centroid-based clustering. In neuro-terms it’s lateral reinforcement, probably between columns through L2-L3.

The key part of my point was that brains construct models of reality, and that intelligent brains are able to improve, adapt and reapply those models as solutions to new problems. Your ANN can never do that.

A ferry on a Norwegian fiord pulls away, and as it does so a few seabirds take up a position alongside the rail of the top deck, matching speed precisely with the ferry. Eerie when you see it, birds hanging in space just a few meters away.

They have learned to do this so that tourists will throw them food, which they catch in mid air. They have intelligently solved a problem unique to that environment, which plain evolution could never do. That’s the point of intelligence, and what we should be trying to emulate.

Language may well be a reasonable target, because we find it easier to identify some of the symbols, and because we have so much raw data. I’m still pessimistic because (a) we’re still so bad at it despite the megabucks and (b) intelligence predates language by such a long time period. But I would be happy to be wrong.

The whole point of using LLMs is that they can do it, and they are the only models we have ever created that actually do. If you have any contradictory papers, you’re welcome to post them here and have them dissected.

Otherwise, take the ability to explain jokes, for instance. When you provide a so-called "prompt" to an LLM, it tries to learn things from the prompt itself, which we call meta-learning. You can even provide it with patterns, which it can learn from and replicate accurately. This is analogous to your example of the seabird flying parallel to a ferry. Its ability to learn such correlations and attribute them to base rewards (food biologically, loss mathematically for models) is, as you just described, “intelligence”.

Suffice to say, none of these models are explicitly trained to do any of this, which brings me to my final point.

Language is simply a proxy for data and patterns. The point that many in the wider scientific community miss (probably because of the huge diversity of opinion) is that transformers aren’t there to replicate language. The sole aim is to inculcate full meta-learning capabilities.

As scale goes up, 0-shot capabilities go up. We don’t know how far this trend holds, but right now things look very rosy at such large scales. This is meta-learning.

The proof of its meta-learning capabilities going outside language? Code. Any software engineer here would attest to the complexity of writing code. PaLM, the new LLM, performs on par with Codex while being trained on 50X less data.
This is where meta-learning is apparent: with so much less data about something, it still learns the task on par with other models.

In the same way, the behavior of the seagull is replicated by other animals (dogs, too). These are meta-learning abilities, and that is why language is just a proxy.

To end this already long rant, most research right now is focusing on multiple modalities (language, which includes the symbolic, plus image, audio, etc.) for the same reason. These large backpropagated optimizers learn to meta-learn because it’s the most effective way of getting lower loss across so many different modalities. CLIP and DALL-E 2 are prime examples.

What you wrote earlier and what I was replying to was:

So the key part of your new point - adaptation of previous learning to new environments - was, perhaps, in your model of reality but was not what your action of writing produced. You did manage to ignore the key point I wrote. I’m afraid it is not possible to communicate effectively if I have to read your mind while you ignore what I write. If I have to guess at what your “model of reality” is for intelligence, while discounting what you write, then my guess is you are imagining intelligence as demonstrating autonomy.

The criteria for “adaptation of previous learning” and “new environment” are unclear and likely a moving goalpost. If an AI shows adaptation of previous learning (which they do, e.g. inventing new strategies in chess) then this is not sufficient in your opinion. If an AI shows meta-learning (and they do, e.g. the ability to learn task B faster after learning task A) then this is not sufficient in your opinion.

Personally I don’t think a single scale of intelligence is appropriate. It seems to be more like a hierarchy with emergent capabilities.