A Deeper Look at Transformers: Famous Quote: "Attention is all you need"

Check out this implementation, with a step-by-step demonstration of a Transformer model, by Andrej Karpathy. Sparsity (one-hot encoding), tensors, and attention are all put to work in this compact and dense demo. I think we should take some lessons learned from this approach and see whether they help us within the HTM approach. I hope some of you enjoy the video and make use of its content.

Pay special attention to timestamp: 1:11:30

What are your insights regarding transformers and how they contribute to the field of semantic understanding and knowledge representation? I would be very interested in your opinions and impressions. Do you see potential for enhancements to our HTM and TBT paradigms? Thanks in advance for your feedback.


Here is a less deep, but still very informative, review of the concept of transformers. Check out the link below; it provides a very good, high-level exploration of Transformers and ChatGPT at a conceptual level.

ChatGPT - Word embeddings and semantics: Transformers & NLP 2 - YouTube

Here is the original publication that got Transformers started. A very big milestone in the universe of Machine Learning.

No, attention is not all you need. What is ‘needed’ is a suitably adapted “Actention” selection serving A(G)I system; but no emulation of emotions should be aspired to as part of this non-biological evolutionary stage of this universe (that you and Science are deeply involved in).


It’s just a name; the point was that attention gets you very far in the kinds of capabilities traditionally associated with “intellectual behavior”. The authors themselves never anticipated how influential their paper would be, so it’s kind of a meme quote.

The idea is that scaling LLMs can already bring so many capabilities that we’re quite close to AGI - relatively speaking, of course. Still, the progress has shown that LLMs are the most serious contenders for AGI right now. They still lack a few things, but hopefully we can fix those.


Unfortunately, these few things are at the base of the whole construct, as we have already discussed.

By the way, this “attention” has nothing to do with “biological” attention (which in most cases is a top-down phenomenon). Transformers (like any NN) are forward-only networks: no feedback is involved or necessary to make them work. Are they just another example of “throw things at the wall and see what sticks”?


That was what I thought, because they simply train on such colossal bodies of text. But after watching the first video at the top (it is two hours), I got the impression that there is indeed much more to the GPT model than I thought. Feedback is built in, and supervisory feedback as well. It is definitely niftier than plain brute force, though they still apply brute force. Around timestamp 1:11:30 there is a good review of all the components and nodes in action.


The Large Language Model is one important aspect of ChatGPT, but not the only one. The Transformer concept does apply self-supervision in a kind of feedback loop. They call it self-attention, and it seems to be iterated and nested a few times. I find the model interesting considering their better-than-expected results. Perhaps we can also learn from some of the self-centered supervisory loops?


The “feedback” is only present during training. Biological feedback has nothing to do with error propagation or correction (I think the predictive-coding interpretation is wrong). The main feedback contribution in the cortex is L6-to-L1 downstream. The synapses involved are metabotropic: they barely affect a few mV in the tuft of the pyramidal cells below. The postsynaptic effect can persist for seconds or minutes. It is additive and subtractive. The tuft is very far from the soma of the pyramidal cell.

Inference is only forward.

I can’t deny that the idea is useful, but it is still a lookup table (with an enormous computational cost to access the information and a non-zero probability of producing utter garbage). We all know how FSD systems have ended up. Sorry, but this won’t be any different. Google developed transformers and still let Microsoft overtake them? I don’t know; maybe they know how hype works.

AGI may come, but certainly not this way.


X.-Y. Zhang, G.-S. Xie, X. Li, T. Mei and C.-L. Liu, “A Survey on Learning to Reject,” Proceedings of the IEEE, vol. 111, no. 2, pp. 185-215, Feb. 2023, doi: 10.1109/JPROC.2023.3238024.


It is very similar, but not exactly congruent. Animals attend in order to filter out irrelevant information (without which there is sensory overload, often associated with autism). Attention does the same, hence the name. Mathematically, you’re still finding a subspace to unify three distinct subspaces and hoping that correlates with attention; but as it turns out, it can learn more complex circuits than simply regulating the information flow in a network.

In a sense, larger LSTMs and LLMs create some of their own feedback: meta-gradients, basically. So they don’t really need feedback to learn something new. They do need feedback if you want to align them with a particular task, say doing whatever the human wants or predicting the world as perfectly as possible.

There is a subtle but important difference here.

Absolutely not. The transformer itself is under no feedback loop; what you’ve probably conflated it with is the autoregressive nature of decoder-only models, which generate one token per timestep rather than sampling the whole sequence at once.
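To make that distinction concrete, here is a minimal Python sketch of autoregressive decoding with a toy stand-in model (the `toy_model` below is invented for illustration, not any real GPT): the forward pass is simply repeated, with each chosen token appended to the input. No weights change and no error signal flows back, so this is iterated inference, not a feedback loop.

```python
import numpy as np

def generate(model, prompt_ids, n_new):
    """Autoregressive decoding: repeat the forward-only pass, one token per step.
    The model is frozen; the only thing that grows is the input prefix."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = model(np.array(ids))         # forward pass over the current prefix
        next_id = int(np.argmax(logits[-1]))  # greedy pick from the last position
        ids.append(next_id)                   # feed the new token back in as input
    return ids

# Toy stand-in "model": deterministic pseudo-random logits per position.
def toy_model(ids, vocab_size=10):
    rng = np.random.default_rng(int(ids.sum()))
    return rng.normal(size=(len(ids), vocab_size))

print(generate(toy_model, [1, 2, 3], n_new=4))
```

The same loop underlies real decoder-only generation; only the model and the sampling rule (greedy here) differ.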

“Self”-attention isn’t given the prefix because it’s in a self-supervised feedback loop. It’s called self-attention because the subspaces projected by the W_q, W_k and W_v transformations are computed on top of the same input sequence [x_0, x_1, ...], where every x_i is drawn from the same vocabulary (~50,000 tokens). Because the Query, Key and Value vectors are derived from the same initial tensor, those (trainable) matrices compute a kernel for the inter-relationships between every x_i. The “self” is thus derived from computing that kernel onto itself. You “cross”-attend when the derivations of Q, K and V don’t come from the same source; this is how the latents of the encoder and decoder are cross-attended to propagate information.
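For anyone who wants those projections spelled out, here is a minimal single-head self-attention sketch in NumPy. The shapes and random weights are purely illustrative (nothing here comes from a trained model); the point is that Q, K and V all derive from the same sequence X.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: Q, K, V are all projections of the same X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise relationship of every x_i with every x_j
    return softmax(scores) @ V               # each output row is a context-weighted mix

rng = np.random.default_rng(0)
T, d = 5, 8                                  # 5 tokens, embedding width 8 (arbitrary toy sizes)
X = rng.normal(size=(T, d))                  # one input sequence
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                             # one mixed vector per input token
```

Cross-attention is the same computation with Q projected from one sequence and K, V from another.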

Just please don’t peddle those ChatGPT-driven Twitter blog posts. I’d recommend we keep opinions to a minimum and keep the discussion scientific; that would be the best policy for engaging and interesting discourse. I don’t think anyone wants the HTM forum to become another Discord or Twitter.


Thanks very much for your insights. First, I want to state that I fully agree with you that all these ML hypes are based on approaches far more artificial, and less intelligent, than our biological role models have evolved to be. The power-consumption test is always very telling, as you also noted above. I have been a participant in this forum/community since 2012 and am still going strong in my conviction that the biological paradigms are the fastest and surest path towards AGI and perhaps other kinds of emergent intelligence. The ML community tries to out-engineer nature and fails quite clearly.

I posted these links and questions about the current hype around Transformers, now going into their 4th generation, because I sense that they keep adopting, incrementally and feature by feature, elements from the neuroscience community that had previously been ignored. For example, sparsity started entering the picture of their models, and now attention, even if it is conceived on a non-biological basis; even some encoding and decoding aspects seem to creep into their models. As they adopt these features, performance grows and, with their brute-force data masses, some surprising illusions materialize.

These illusions of cognitive intelligence actually reveal more about our naivety and gullibility, exposing how bad we are at distinguishing intelligence from an extremely extensive, closed-loop knowledge base at a sociological level with a linguistic frontend. We do not find in these Transformers a self-evolving entity with a cognitive model of reality, which in my opinion is a fundamental necessity to fulfill the definition of terms like “comprehension” or “understanding”.

My curiosity, however, is open to all approaches, so I wanted to ask this community whether we can learn something from the implementation of GPT-3 and GPT-4 regarding some of their qualities. Perhaps some are adaptable to our neuroscientific approach? I am thinking about the tools, libraries, integrations, data sets, looping cycles, self-referencing, etc. If I followed the first video and the 2017 paper correctly, the original architecture pairs an encoder with a decoder, while the GPT models keep only the decoder stack. There seems to be a little more than just straight training with brute force. And semantic relational representations do emerge for tuples of a base term, a term to be added, and a term to be subtracted: a multidimensional relationship of terms seems to emerge. Analogies and metaphors are made possible with this implementation, which does deserve some recognition. What is your take on these points? Anyone?
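The tuple arithmetic described above (a base term, plus an added term, minus a subtracted term) can be sketched with toy embeddings. The vectors below are hand-made for illustration only; in a real model they would come from training.

```python
import numpy as np

# Hand-made toy embeddings, chosen only to demonstrate the analogy arithmetic.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.5, 0.0]),
}

def analogy(base, subtract, add):
    """Return the nearest word (by cosine similarity) to base - subtract + add."""
    target = emb[base] - emb[subtract] + emb[add]
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # exclude the query words themselves, as word2vec-style demos usually do
    candidates = {w: cos(target, v) for w, v in emb.items()
                  if w not in (base, subtract, add)}
    return max(candidates, key=candidates.get)

print(analogy("king", "man", "woman"))  # -> "queen" with these toy vectors
```

With trained embeddings the same nearest-neighbour query is what produces the famous king - man + woman ≈ queen result.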


I think finding/evolving/training ways to attend to a select few pairings of features (== clues) in the whole input data is the main takeaway from transformers.


I think it is very good to bring these ideas here and discuss them. But I would try not to get caught up in the hype. Most of us believe that there is a lot of it, and the problem is not the hype itself … the problem is the aftermath. I can’t say that they’re useless, because they’re not. For example, I’m writing this post with the help of DeepL Write :-)

I agree that the right way to go is with the biologically plausible ideas. Low-hanging fruit like LLMs are, in my opinion, a resource distraction; if all the people and money invested in such things were redirected to the former, those ideas could progress rapidly. But because these ideas (like HTM) don’t perform well on certain “manufactured” benchmarks, they don’t attract enough people. We want to run, and Transformers provide the illusion of running, but to learn how to run we first have to learn how to crawl.


My understanding is that attention, although a phenomenon not very well understood by current neuroscience, is fully distributed. For example, in the visual stream, attention increases the oscillatory coherence of regions higher in the hierarchy (e.g. IT->V4). Another good example of attention is the cocktail party effect. Here, in non-learning scenarios, there is a “distributed” effect, so it depends on the actual system state (i.e. feedback is used during inference to somehow “correct” the sensory input noise). Are transformers doing something similar to this?


You may call it focusing if you don’t like “attention”: picking up the most relevant details/correlations within the whole input stream.


Do they perform decently on any benchmark at all? Most researchers are more than happy to see results from an algorithm that hasn’t been scaled, if it shows strong scaling behavior and the benchmark is not a toy one.

While MNIST and friends are often used, they’re the worst candidates for demonstrating any capability. Simply put, calling benchmarks you can’t outperform (or at least perform decently on) “manufactured” is like calling scientists deluded for making up abstract theories and weird symbolic equations to prove the earth is round when I can clearly see it’s flat.

So if HTM/TBT can do something a little more complicated than, say, cartpole or MNIST, that’s a pretty decent result. MNIST can be solved by literally averaging and thresholding, so you can understand why the bar is so high for an algorithm to even start doing something.

They’re doing something quite close to the cocktail party effect indeed. As @cezar_t said:

At the end the model chooses to focus on some relevant information that it thinks might help in the task ahead


The question is who defines these benchmarks. Most of the benchmarks used by NNs seem to have been created in parallel with their own development (e.g. MNIST or ImageNet). To me, it looks like cheating yourself.

In my opinion, the perfect benchmark is biological performance, not only in terms of accuracy, but also in terms of flexibility, efficiency and resilience. For example, can an NN tolerate a high level of noise in the input? Can an NN learn continuously without falling into catastrophic forgetting? Can an NN say “I don’t know”?

Eventually, biologically inspired systems will outperform NNs on any of the current benchmarks, but you can’t demand that they do so from the start. A shrew can hardly get a good score in… MNIST, but is much more intelligent than any current NN. I agree that it’s hard to quantify how good an idea is if you don’t have an “objective” measure. I just think that “accuracy” in any identification task is not the right one for the current state of things. We can start with the “easy” things (e.g., supporting continuous learning) and move progressively to the “harder” (e.g., understanding).
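As a tiny illustration of the “I don’t know” point (in the spirit of the Learning to Reject survey cited earlier in the thread), a classifier can be given a reject option by thresholding its softmax confidence. The threshold below is an arbitrary illustrative value, not a tuned one, and this is the simplest possible rejection rule, not a full method from the survey.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_with_reject(logits, threshold=0.8):
    """Return the predicted class, or None ("I don't know") when the top
    softmax probability falls below the confidence threshold."""
    p = softmax(logits)
    top = int(np.argmax(p))
    return top if p[top] >= threshold else None

print(predict_with_reject(np.array([4.0, 0.1, 0.2])))  # confident -> class 0
print(predict_with_reject(np.array([1.0, 0.9, 1.1])))  # ambiguous -> None
```

Softmax confidence is a crude proxy for uncertainty (it is often overconfident on out-of-distribution inputs), which is exactly why rejection is a research topic of its own.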

ChatGPT regrets to disagree


Rather, they represent fundamental cognitive abilities that any actual AGI/HLAI should have.

Sure, but it can still adapt faster than vanilla NNs. So show me any architecture that:

  1. Improves with scale
  2. Maintains the same level of adaptability and flexibility which are the hallmarks of biological mechanisms

on even simple RL tasks like open-world environments. Hell, even being able to solve all the Atari games would be an excellent start, given inputs as numeric features. I’m giving a huge leeway here; any algorithm requiring those crutches would be viewed as a failure by most sane researchers.

What else did you expect?


I’ve put some effort into trying to get my head around Transformers.

I feel that I should understand their function deeply and intuitively in order to earn a strong opinion. And I’m not there yet.

I’ve been putting up learning resources that I found helpful at Transformers - The Age of Machine Intelligence

The most insightful piece of information (for me) was from the Justin Johnson lecture. He motivates QKV historically.

In ~2015 there was a buzz around RNNs. But they suffered issues with propagating information over multiple time steps, e.g. they would struggle to contextualize tokens separated by many time steps. One solution path was the LSTM.

Another was the Attention Mechanism. See the first few slides. Justin explains brilliantly, so I won’t even attempt to summarize. (Note: he examines the biology of the eye; fovea, saccades).

And this Attention Mechanism can be abstracted out from the RNN into an Attention Block. And from this unit the Transformer emerges.

If we are able to connect this architecture (which performs amazingly well) with the biology, I think that would be effort well spent.

And I’m tempted to try to do it through Hinton’s Capsule Networks.

Listening to Hinton talk about his recent Forward-Forward algorithm (which lays a basis for how a NN may train without invoking backpropagation), I remember him saying (in passing) that he had let go of his work on CapsNets (to some degree) after Transformers ‘got there first’. So I think there’s a connection between these two architectures: they’re doing something functionally similar. And CapsNets do seem halfway biologically plausible. So it may be that CapsNets offer a bridge to understanding.

I think that’s all I have at this point. This work is mentally demanding to keep up with, let alone advance.