An ode to biological principles

Like many people, I’ve been impressed by ChatGPT. After obsessing over how transformers work, I’ve gradually realised that the principles mirror those we learned from the old works of researchers such as Hopfield and Kanerva. I was delighted when I discovered videos and papers that confirmed this (I wasn’t searching to satisfy confirmation bias; they just popped up):

Hopfield Networks is All You Need
Attention Approximates Sparse Distributed Memory

We know that the cortex, cerebellum, etc. use Hebbian learning as a rule, and that neural activity is sparse. So why is modern deep learning research so obsessed with dense representations trained with back-propagation? This is an important question, because transformers such as GPT3 are massive, slow and super expensive. The brain as a whole consumes on the order of 0.002% of the energy that GPT3 does. These technologies are awesome, but they’re also disappointing, due to their extreme bloat and expense.

Although hardware and cloud companies enjoy people using their dedicated GPUs/TPUs to train & run dense vectors with back-propagation, there should really be more focus on sparse representations using Hebbian learning. We could potentially get something like GPT3 running fast on CPUs while extending the model dimension (d_model). Although we have companies such as NeuralMagic that can sparsify the weight matrices, I feel the paradigm needs to be pruned back to first principles - a throwback to the beautiful principles of the early researchers. Relatively recent research papers suggest this is possible. “It turns out they were developing Hopfield networks without realising it” - paraphrasing Sepp Hochreiter on transformer researchers.
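To make the parallel concrete, here’s a toy numpy sketch (my own illustration, not code from either paper) of the modern Hopfield update rule. Reading the state as a query and the stored patterns as both keys and values, this is exactly scaled dot-product attention:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))   # 8 stored patterns, 16 dimensions

def hopfield_update(state, patterns, beta=16.0):
    """One update of a modern (continuous) Hopfield network:
    new_state = patterns^T softmax(beta * patterns @ state).
    With state as the query and patterns as keys AND values,
    this is scaled dot-product attention."""
    return patterns.T @ softmax(beta * patterns @ state)

# A corrupted copy of pattern 3 is cleaned up in a single step.
noisy = X[3] + 0.1 * rng.standard_normal(16)
retrieved = hopfield_update(noisy, X)
```

One noisy cue recovers a stored pattern in one step; stacking such lookups is essentially what an attention layer does.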

This turned into a rant I think :smiley:
Anyway, hello to the folks who are still here from years ago :wave:


…because they work.


Actually because they work on GPUs. It’s a positive feedback cycle - bigger GPUs enable bigger DL models and bigger DL models demand bigger GPUs.

Attention, however, and even transformers might not be the “magic thing”.
E.g. transformers with fixed (untrained) attention weights have been tested, and they worked, with only a slight performance penalty.

There’s also a recent development of an RNN with transformer capabilities - RWKV.

RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it’s attention-free.

My guess is that any means of enabling a large number of parameters while embedding a large context has a chance of working decently well.

With a high number of parameters comes heavy computation, hence the architecture should also have parallel paths so it can be spread across GPUs.


Indeed, in essence this means that the more parameters you have, the more keys and values you can embed to approximate the pattern space. However, a generator transformer bakes the queries, keys and values into the model (which is why they’re so massive), whereas other architectures, like open-domain retrieval systems, delegate the keys and values to a vector database (which uses fast hierarchical search), allowing the number of ‘parameters’ to remain relatively small within the attention mechanisms (retriever & generator). This is slow to train because of back-propagation, but not so much of an issue for Hebbian learning. This is something I’m testing for myself.
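As a rough sketch of that split (toy numpy code of my own; a real retriever would use a hierarchical ANN index such as HNSW or FAISS rather than brute-force scoring):

```python
import numpy as np

rng = np.random.default_rng(1)

# External "vector database": keys/values live outside the model's weights.
# Brute-force scoring stands in for a hierarchical ANN index here.
keys = rng.standard_normal((1000, 32))
values = rng.standard_normal((1000, 32))

def retrieve(query, k=4):
    """Top-k key lookup, then a softmax-weighted mix of the matching values.
    The model itself only needs enough parameters to produce the query."""
    scores = keys @ query
    top = np.argpartition(scores, -k)[-k:]   # indices of the k best keys
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    return w @ values[top]

out = retrieve(rng.standard_normal(32))
```

Swapping the database contents changes what the model “knows” without retraining any weights, which is the appeal of the open-domain split.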

My argument focuses on the difference between biological principles (sparsity & feed-forward learning/Hebbian) and deep learning principles (density & back-prop). The cortex is still superior to transformers, and much cheaper. Sparse vectors are incredibly cheap to work with. Biology has proven it works, and companies such as Numenta have taken the hint. IMO we shouldn’t just accept sub-optimal performance simply “…because they work”.

Cheers for linking that paper, the premise is intriguing. Here’s something to check out too, related to performance.


I had a search for work related to sparse attention (confirmation bias, I know), and it seems researchers from Google and OpenAI have explored sparse activations in attention mechanisms: Sparse Is Enough In Scaling Transformers

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer with the same number of parameters. We also integrate with prior sparsity approaches to attention and enable fast inference on long sequences even with limited memory. This results in performance competitive to the state-of-the-art on long text summarization.

When starting to investigate sparse variants of Transformers, we assumed that there would be a price to pay for sparsity—that a sparse model would always underperform a dense one with the same number of parameters. To our surprise, this is not the case: sparse is enough! In our experiments with large models on the C4 dataset, the sparse models match the performance of their dense counterparts while being many times faster at inference. And, when scaling the models up, the benefits of sparsity become even larger. This promises to put Transformers back on a sustainable track and make large models more useful.

They hinted that this could be developed further. It feels like the need to become more performant has pushed researchers to begin exploring sparse solutions; without that need, they stuck with dense representations. Looking forward to research that shows Hebbian learning speeds up training while lowering the costs (if indeed this turns out to be true).


Because self-attention automatically converges to some structured sparsity, and backpropagation with sparse matrices is quite hard. It’s a mix of hardware and mathematical reasons here.

It doesn’t scale that well - never did, never will. I’m open to experiments that prove otherwise, but their capabilities are almost non-existent outside of toy tasks.

It’s an entire topic worth discussing in itself. Hopfield networks certainly do have some similarities, but at the end of the day, what the self-attention mechanism actually does vs. what Hopfield networks optimize for is fundamentally different. So the claim that Transformers are just a rehash of Hopfield networks is a bit misleading; there are some parallels here, but the main point is this - transformers simply don’t do what we think. We don’t know yet what self-attention does, except for recent evidence that heads somehow learn a gradient-descent (GD) step and propagate these meta-gradients.

You can create endless hypotheses about what they do (especially the later heads, which learn more abstract relationships), because it’s Turing-complete after all. In the end, it’s all about interpretability. We always come back to it :slight_smile:

Not to sound rude or anything, but that paper’s baselines are pretty… bad, to put it politely. Plus they missed the most important point - their hypotheses don’t scale at all. Simple tasks won’t have heads learning complex relationships.

Note that it’s built upon Apple’s AFT paper. Like many alternatives, it works well on simple benchmarks and at small scales. But it won’t withstand billion-parameter regimes (we’re already seeing a slowdown here with RWKV).

I do hope this resurgence in applying recurrence to self-attention helps to solve some fundamental issues, but RWKV has yet to prove itself. Still, interesting work by the author.

I haven’t read this paper, but it looks interesting. Google has been pushing the sparse direction for quite a bit of time - but no LLMs have gone fully sparse yet :thinking: Perhaps some problems are yet to be solved?


I don’t get your point. Did anyone claim otherwise?

What billion scale are you talking about? The author managed to scale it to 7 and 14 billion parameters with donated months on an A100 cluster, with promising results.
He (she?) seems to be a person without a company or other organizational affiliation.
“It doesn’t scale” in LLM-land is a convoluted way of saying “you’re poor a.f., let multibillion-dollar companies handle AI”.


Have you read the paper and/or watched the associated videos? They’re not talking about Hopfield’s original binary network; they talk about the Modern Hopfield Network (MHN), which still operates on the same fundamental principles - the same principles of which attention is a subset. In other words, MHNs are more general than attention mechanisms, and therefore a superset of other related mechanisms as well. It’s pretty interesting :slightly_smiling_face:

Yeah, explainability is tricky. Not sure LLMs are Turing-complete, however. I’d enjoy seeing GPT-3 perform cellular automaton rule 26, for example. In theory, if it were Turing-complete, it could - but it can’t; I just tried it. It basically just interpolated the answer and gave me something that looked more like a binary tree. It should compute the rule deterministically, but instead it just interpolates.
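For reference, deterministic rule application is trivial to write down, which is what makes the failure telling. A minimal sketch of my own (elementary cellular automaton, Wolfram rule numbering, periodic boundary):

```python
def ca_step(cells, rule=26):
    """One step of an elementary cellular automaton (Wolfram numbering).
    Each new cell is bit (left*4 + centre*2 + right) of the rule number."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# Rule 26 from a single live cell: 0010100 -> 0100010 -> 1010101
row = [0, 0, 0, 1, 0, 0, 0]
for _ in range(3):
    row = ca_step(row, rule=26)
```

Every step is a pure table lookup - there is nothing to interpolate, which is exactly where an interpolating model falls over.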

Can you give me a reference for this? I’d be interested in that phenomenon. But apparently activations can be quantitatively sparse, based on the Sparse Is Enough paper.

The cortex scales pretty well. When you analyse what back-prop actually achieves, it seems it’s approximating competitive Hebbian learning (auto- and/or hetero-associative). Also, if what you say is true - that transformers converge towards sparsity - then they’re approximating that too. It seems back-prop is trying to tell us something.
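A minimal sketch of what I mean by feed-forward Hebbian learning - Oja’s rule, a stabilised Hebbian update with no backward pass, pulling a weight vector onto the first principal component (a toy example of my own, not from any of the cited work):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data with one dominant direction of variance.
direction = np.array([3.0, 1.0]) / np.sqrt(10.0)
X = (np.outer(2.0 * rng.standard_normal(500), direction)
     + 0.1 * rng.standard_normal((500, 2)))

# Oja's rule: purely local, feed-forward updates - no backward pass anywhere.
w = rng.standard_normal(2)
eta = 0.01
for x in X:
    y = w @ x                    # feed-forward activation
    w += eta * y * (x - y * w)   # Hebbian term y*x, with y^2*w decay for stability

# The weight vector aligns (up to sign) with the dominant direction.
alignment = abs(w @ direction) / np.linalg.norm(w)
```

Each update uses only the local pre- and post-synaptic activity, which is the property back-prop appears to be approximating.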

There seems to be a pattern when discussing principles - most people tend to think of various models as isolated or superficial, as if they were quick experiments or tricks. They look only at the surface without looking at the deep structure. When we look at a pool of related models, we find an aggregate set of principles. Some models express these principles more broadly than others. If one model can explain multiple phenomena of multiple other models, then you have something more valuable than the sum of all the subset models. This can only be discovered by inter-comparing all similar models to derive these principles (similar to self-attention, eh?). It’s disadvantageous to dismiss models that correlate with the operating principles of the cortex when your goal is to achieve machine intelligence.


What do you think about dendritic backprop? It seems equivalent to 5-8 layers of conventional backprop. The latest on that:

BTW, I do agree that greyscale version of Hebbian learning is the core of conventional backprop too, as discussed in the quoted thread.


The architecture barely outperforms attention, and is still limited to simple benchmarks. With RWKV-14B, it should be benchmarked against the full suite of BIG-bench alongside GPT-J (a 6B compute-optimal model).

Sure, it ‘scales’, but its gradient is already slowing down. BlinkDL tries to compensate by extending the context length of the model to draw level with LLMs on perplexity, which is NOT an apples-to-apples comparison.

Therefore, I’d need harder evidence: all the different scales of the architecture run on the full eval harness, demonstrating a power scaling law, to show it scales as well as a transformer.

Yep, and those principles aren’t reflected by what self-attention actually does. If all it did was a form of fuzzy retrieval over some dataset then interpretability would be much easier. As it stands, attention heads may actually compute meta-gradients. That’s very far from what the OG paper surmised.

I posted a bunch of papers on another thread about LLMs being Turing Complete (look through my post history). It wasn’t really a point of contention that transformers are Turing Complete.

I found one which mentions it, so you can probably follow the citation trail: [2106.03764] On the Expressive Power of Self-Attention Matrices

I’m not sure of an explicit paper covering this exact area. This was long ago, during the BERTology hype, and it’s long established that attention converges to some form of structured sparsity - which is why LLMs like GPT3 now have sparse attention layers built in.

Well, if we assume we know how the cortex works (as if) and that Hebbian learning is its core principle, then we’d have replicated it by now. So either the cortex does NOT use Hebbian learning, or our understanding of it is very wrong.

True, but then I could point out that cortical activations and artificial networks’ activations converge. That more-than-correlates with the cortex. So does that mean simple CNNs are the way to AGI? No.

Just because an idea is close to a hypothesis that we think works doesn’t give it any merit until it actually does. Again, if somehow you could demonstrate complex intellectual behavior with Hebbian Learning then I’d be obliged to consider it.


This was a part of my opening argument. When networks of perceptrons were first architected into multiple layers with backprop optimisation, it seemed magical, because they were approximating functions. The research took off in that direction when it could be generalised and parallelised in hardware. However, we also know that associative memory can approximate functions. But we live in a world where culture biases towards an optimum (typically wherever money can be moved). So research ran away, with progress correlating with hardware progress.

It’s a shame. Computational neuroscientists have converged on the same principles again and again over decades of obsessive & diverse research. Biology is giving us clues on how to approach the problem, yet we seem to favour the brute-force approach instead. We could narrow down our search for the answer by biasing the space with its intersection with biology… but that’s not the ‘cool’ thing to do today. Instead, it’s cooler to sell hardware and software services (which market perfectly for deep learning), especially if you have specialised resources such as TPUs.

If more deep learning researchers/engineers were focused on principles from computational neuroscience (like our early researchers were… before monetisation stepped in), then we might have machine intelligence that mimics the processes of the cortex while using wayyyyyyy fewer computational resources.

I’m not armchairing this - I’m seeing results. When you apply dimensionality expansion and sparsity to a binary (0-1) Hopfield network, you get an enormous storage capacity compared to the primitive model. There’s so much to extend/improve. These aren’t local-optimum improvements - they’re global improvements. Experimenting with composing multiple associative attractor networks in various architectures is fascinating - connecting them via excitatory association and inhibitory diversion. Yet it feels largely unexplored.
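As a toy illustration of the kind of thing I mean (my own sketch of a Willshaw-style sparse binary associative memory, not the exact model I’m experimenting with): one-shot Hebbian storage via clipped outer products, then recall from a partial cue with a single thresholded feed-forward pass.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, m = 256, 8, 60          # n units, k active per pattern, m stored patterns

# Sparse binary patterns: only k of n units active at a time.
patterns = np.zeros((m, n), dtype=np.uint8)
for p in patterns:
    p[rng.choice(n, size=k, replace=False)] = 1

# One-shot Hebbian storage: clipped outer-product sums (Willshaw-style).
W = np.clip(patterns.T @ patterns, 0, 1)

def recall(cue):
    """Thresholded feed-forward pass: a unit switches on if it receives
    input from every active cue unit."""
    drive = W @ cue
    return (drive >= cue.sum()).astype(np.uint8)

# Retrieve pattern 0 from a partial cue (half its active units removed).
cue = patterns[0].copy()
active = np.flatnonzero(cue)
cue[active[: k // 2]] = 0
out = recall(cue)
```

Storage and recall are both single sweeps of cheap binary arithmetic - no gradients, and the matrices stay extremely sparse at these activity levels.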


Allow me to reframe it. What we see in transformers is not the mechanism of intelligence - it’s an approximation of it. Calculus has always been approximating/modelling something, but you will not get the real thing itself - just a cheap copy (or rather, a computationally expensive copy).

The big tech companies & their stakeholders have poured many millions into this approximation venture. We can give these models prompts during inference… but they have a hard limit. Fake it til you make it.

Deep down we want the real thing. When we get it, we won’t have to spend another X million to linearly improve it… it’ll improve from a prompt - real-time learning like a human… and it’ll run on a laptop.


That’s an extremely naive way to look at it, IMO. The reason neuroscientific methods fell out of favour was precisely that the biology was insanely complex and there was no unified theory. The models created were therefore quite simple and efficient, but failed to scale.

That line of thinking can technically be attached to anything - I might argue that GOFAI was the true way to achieve HLAI/AGI, and we just need to spend a few more centuries and billions of dollars to get there. That doesn’t solve anything, and we have no evidence otherwise.

Simply put, science works off results. It’s fine to have a theory, but if that theory doesn’t align with reality, then it’s of no use - otherwise we might be better off assuming the earth is flat.

You’d be surprised how many there are - most DL research revolves around making architectures and computation more efficient. If someone found a way to accelerate training throughput by an OOM, they’d become a celebrity overnight. It’s a highly researched area and we are seeing advances, just not at the pace that’s expected.

Sure, prove it then :man_shrugging: I personally have no reason to be biased towards any field or architecture. If you can show that Hopfield networks can be scaled up, preserving performance w.r.t. FLOPs and competing against LLMs on meta-learning tasks, there would be little doubt about where the field would move next.

The trouble is, what seems like it would work almost never does. It’s an interesting paradox of research that I’ve experienced firsthand.

And as the Silicon Valley adage goes, “talk is cheap”.

Sure. But as the parameter counts grow and models are scaled up, their approximations get closer and closer to the ‘real thing’ automagically. While we aren’t exactly tending towards infinitely many parameters, the trajectories laid out by scaling laws seem quite interesting indeed.
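Those trajectories have a simple functional form. A sketch of the parameter-count scaling law for autoregressive LMs - the constants below are roughly those reported by Kaplan et al. (2020) and are for illustration only:

```python
# Kaplan-style parameter scaling law: loss falls as a power of model size.
# Constants roughly as reported for autoregressive LMs; illustrative only.
N_c, alpha_N = 8.8e13, 0.076

def predicted_loss(n_params):
    """L(N) = (N_c / N)^alpha_N - cross-entropy loss predicted from size alone."""
    return (N_c / n_params) ** alpha_N

for n in (1e8, 1e10, 1e12):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```

The point is the predictability: you can forecast loss at a scale you haven’t trained yet, which is what makes the investment case legible.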

Which again, boils down to what I said above. Science isn’t built on promises. Useless conjecturing gets us nowhere and remains confined to armchair debates over the internet.


I agree with all except: what runs on a laptop is more likely about as intelligent as a mouse. The computational power of the human brain is probably in the order of 10^14 operations per second (100M MIPS) compared to a typical laptop at 10^11 (100K MIPS). And the human brain uses about 20W, while a gamer laptop running flat chat uses about 200W.
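Writing out the ratios implied by those order-of-magnitude estimates:

```python
# Order-of-magnitude ratios from the estimates above.
ops_ratio = 1_000                 # brain does ~1000x more ops/sec than a laptop
brain_watts, laptop_watts = 20.0, 200.0

# Ops-per-watt gap: the brain comes out ~10,000x more energy-efficient.
efficiency_gap = ops_ratio * (laptop_watts / brain_watts)
print(efficiency_gap)             # 10000.0
```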

Given that model, we have computers powerful enough, but they use kW or MW of power, not 20W.


So what’s the problem? We’ve already spent many centuries of human-brain effort, and many billions, on backprop-driven AI.


ah, the good ol’ deep learning vs symbolic machines wars.

I think we should accept that we’re biased towards one side and just run with it. It’s not like this specific kind of conceptual bias is harmful to society.

But throwing arguments back and forth like that usually goes nowhere; let’s not become keyboard warriors.


It’s hyperbole, but the difference is that GOFAI and other schools of thought conjecture that at some arbitrary point in time - maybe a decade, maybe a century - things would magically be solved, and AGI would emerge from their elaborate systems.

DL, however, returns consistent gains as we scale. As long as scaling keeps going, and algorithmic and data efficiency keep going up, companies are more than happy to provide investment and the field keeps happily chugging along. There is a clear-cut path towards achieving a given goal.

This reduces uncertainty to a huge extent. Nobody will invest a billion dollars in a GOFAI/neuroscientific project which may or may not pay off. But if I can predict in advance what emergent abilities a certain LLM will have at what scale - predicting even benchmark performance to a decent degree of accuracy - that is useful, and it shows that things work. That there isn’t luck involved.

Whereas with other projects it’s hit or miss; and so far they’ve all been misses.

The current limits of DL are not of scale, but of kind.
ChatGPT and the art engines are developing into amazing data mining and synthesis engines.

I have great respect for what I have seen of the current works, and expect them to evolve into powerful tools.

I can imagine a suite of tools that lets a creative person craft virtually any story, with storyline, script, characters, music and images that are totally synthetic, with great detail and accuracy. I have some trouble imagining what it will be like when any artist has these tools at his/her ready disposal. I look forward to playing with them.

I can imagine powerful research, engineering, and programming assistants that know all the minutiae of building codes, electronic design, the law, or other engineering domains. With the ability to generate schematics & board layouts, shovel-ready building plans, and correct code, most of the human mistakes and oversights that currently creep into human endeavours should be ironed out of the bulk of engineering.

This is all within the reach of current technology without really adding anything new.

So what of kind?
The core of a personality and sense of self is not currently incorporated into the tools I have played with, but I can imagine adding modules that provide drives and personal memory. At that point, we should have something that performs well enough to pass a Turing test with flying colors.

ChatGPT gets surprisingly close and, as far as I can tell, its creators have deliberately limited it in ways that make it seem like it is NOT a person. Adding the aforementioned modules could certainly go the other way.

Will that be AI? Dunno.
Will it be close enough to AGI to do whatever it is that we wanted AGI for? Probably.


Agree. Or at least ask the questions:

  1. Are we meant to disagree/debate to challenge an idea?

  2. Or are we here to learn more by brainstorming and asking questions or providing qualitative/quantitative data?

For the OP’s post I feel it is #2, and my attitude towards it should be brainstorming, not trying to prove that I’m right. Just like an adult would do.

#1 behavior is problematic, especially when the person who is here to debate has no obligation, or doesn’t feel obligated, to explain things using the right language - math - and instead resorts to narratives and then calls it science.

Maybe we can have a rule: if your intent is #1, act like a scholar, not a crusader. Otherwise we’d just ask ChatGPT.