A Deeper Look at Transformers: Famous Quote: "Attention is all you need"

I have found the following two YouTube videos helpful:

  • https://youtu.be/OyFJWRnt_AY
    or just search for: pascal poupart transformers
    This is a lecture at the University of Waterloo.
  • https://youtu.be/kCc8FmEb1nY
    or just search for: Andrej Karpathy build gpt from scratch
    Note that at the end of the video he gives several insights
    that I found valuable.

I haven’t interacted here for some years. But I still get email notifications somehow, and this thread caught my eye.

When I did interact (2014-2016) I was arguing for the relevance of language modeling to the key problem of understanding cognitive encoding.

Unfortunately at that time Jeff was dead set against any direct modeling of language. He regarded language as “high level” stuff. Not relevant at the neural level.

So I dropped off.

Given the success of these new large language models, and to understand what LLMs might be able to tell us of significance to HTM, perhaps the list might like to look at what I was saying in 2016.

To give some background, I think conceptually there is very little difference between transformers and HTM. Both predict future states based on preceding states.

The big jump for transformers was the “attention” mechanism. Attention just gives the model access to long distance context.
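For concreteness, the core of that mechanism can be sketched in a few lines. This is a hand-rolled, single-head illustration in plain Python, not any library's actual implementation:

```python
import math

def attention(queries, keys, values):
    """Minimal single-head scaled dot-product attention over plain lists.

    Every query position mixes information from every key position,
    however far away: the weight between two tokens depends only on
    their vectors, not on their distance apart.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # weighted mix of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# toy example: 3 tokens, 2-dimensional vectors
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = attention(x, x, x)
```

That all-pairs mixing is exactly the "access to long distance context" in question: position 1 sees position 1000 as directly as it sees position 2.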

But that’s mostly engineering. HTM might have stumbled on attention to long distance context too.

The main difference was perhaps that HTM was trying to work up from the neural level. So HTM was inspired to seek guidance about encoding from the biology. While transformers started from a functional level. They were inspired to seek guidance about encoding from language. And that focus on language led them to “attention” to long distance context earlier.

Perhaps focusing on language might have got HTM to find attention to long distance context earlier too. Hard to say.

But that’s history. Here’s the thing. I think the real lesson that working backwards from language structure teaches us, what I was trying to communicate in 2016, still eludes both HTM and transformers.

The key thing about these new transformer large language models, to me, is that they are LARGE. “Attention”, sure. But by “attending” to long distance context they blow out to billions of parameters. And it’s really the size which enables them to perform so well. Why are they so large? It’s an inhuman scale, an inhuman amount of training data, and an inhuman compute requirement.

What I was arguing 2014-2016 was that, yes, language structure is telling us the encoding is context dependent. But I was arguing that it is telling us cognitive meaning encoding is really, really, context dependent. Actually I was arguing that the incredible sensitivity to context we see when we try to find meaningful structure in language, seems to be telling us that meaningful cognitive structure is chaotic. Chaos is the ultimate context dependence. That’s the key feature of chaos. Massive context dependence. Sensitivity to a butterfly flapping its wings is the ultimate context sensitivity.

What LLMs stumbled on, I say, was really a form of this extreme context dependence. What transformer models are really teaching us is that attention to long distance context shows the intricacy of meaningful structure is really LARGE. The larger they get, the better they get. And there seems to be no limit (final number 2^42 parameters, Hinton jokes.)

What I believe these attention models are really showing us is the same thing I was saying in 2016. Trying to find structure from language shows that cognitively significant context blows out. Seeking the structure of language generates billions of “parameters”. Transformers are trying to enumerate chaos. And the enormous blow out in parameters is a demonstration of chaotic expansion of structure.

The solution I was trying to suggest back in 2016 is that we embrace the ultimate level of context sensitivity that observable language structure seems to be telling us about cognition.

We can do that fairly easily, with biological plausibility. Much more biological plausibility than LLMs, with their biologically implausible size and compute requirements.

A little biological, bottom up insight, might jump start the next advance over LLMs. If we do what I was suggesting in 2016.


What is useful to notice is that attention doesn’t need to be strictly sequential (although that is mostly how it is used).

E.g., for an image, positional embeddings of each “patch” can be 2-dimensional to reflect the actual relative position between patches. Or even 3-dimensional to combine patches of various sizes.

And extra dimensions in general can be applied the same way as additive linear embeddings are.
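As a sketch of how that could look, here is a hypothetical 2-D extension of the familiar sinusoidal positional encoding: half the channels encode the patch row, half the column, and the whole vector is added to the patch embedding just like the 1-D case. The function names are illustrative, not from any particular library:

```python
import math

def sincos_1d(pos, dim):
    """Standard sinusoidal encoding of one scalar position into `dim` numbers."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc[:dim]

def posemb_2d(row, col, dim):
    """2-D positional embedding for an image patch: half the channels
    encode the row, half the column."""
    return sincos_1d(row, dim // 2) + sincos_1d(col, dim // 2)

def add_position(patch_vec, row, col):
    # added element-wise, exactly as 1-D additive embeddings are
    pe = posemb_2d(row, col, len(patch_vec))
    return [p + e for p, e in zip(patch_vec, pe)]
```

Because the encoding is additive and per-dimension, a third (e.g. scale) dimension would just claim another slice of the channels.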

The problem with extra dimensions is the “volume” (instead of linear size) of the attention window increases exponentially.

A way to deal with that would be - again - sparsity and specialization.
Instead of a few dozen big, “knows-everything” transformer blocks stacked on top of each other, imagine thousands (or millions?) of micro blocks each with its own limited focus, and only 1% of them “activating” according to the current context.
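A minimal sketch of that routing idea, assuming a simple dot-product gate and hand-built experts (everything here is invented for illustration; real mixture-of-experts gates are learned):

```python
def route(context_vec, experts, k):
    """Run only the k experts whose gate score is highest for this
    context, and mix their outputs by normalized score."""
    scored = [(sum(g * c for g, c in zip(gate, context_vec)), fn)
              for gate, fn in experts]
    top = sorted(scored, key=lambda s: -s[0])[:k]
    total = sum(max(s, 1e-9) for s, _ in top)
    return sum((max(s, 1e-9) / total) * fn(context_vec) for s, fn in top)

# 200 tiny "experts"; `calls` records which ones actually run
calls = []
def make_expert(i):
    def fn(x):
        calls.append(i)
        return float(i)
    return fn

experts = [([1.0 if i % 2 else -1.0, i / 200.0], make_expert(i))
           for i in range(200)]
out = route([1.0, 0.5], experts, k=2)  # only 1% of the experts activate
```

Only the top-k experts are ever evaluated, so compute grows with k rather than with the total number of micro blocks.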


Strongly agree with your points overall, but I disagree with some

In a sense, he was right. You can learn a limited amount of information about the world if you only learn from text, and that world model would always have holes because you aren’t directly modelling the environment, but rather a bad proxy of it.

Transformers are strong in the sense that you can put in data from any source and they’ll still learn from it as well as any other modality. This makes them a strong candidate for directly modelling multi-modal inputs like video, image, audio etc. aggregated together. So text was just a more ‘convenient’ way for researchers to test out their LLM’s modelling capabilities.

I think the key point is priors. Humans and biological organisms have priors dating back millions of years, which evolved from directly modelling their world, the earth, rather than the 1D world of tokens which LLMs currently model.

With better priors, you achieve extreme data and compute efficiency. Look at AdA which adapts as fast as humans (if not better) to unseen test scenarios. This came about because due to its training, it had priors for the world that help it - close to the priors held by humans about how the environment functions.

Ultimately, LLMs are large because text is complicated. You can get smaller, more efficient LLMs if we find the correct priors. But the entire point of the transformer revolution is that it has NO inductive biases. Every time you train a transformer, it learns brand new priors from the task, which is why it’s so efficient compared to other models whose inductive biases hinder rather than help.

We’ve nearly reached the limit of what we can achieve through text alone. There’s still some progress, but the future is going to be multimodal. Hopefully, if multi-modal priors turn out to be helpful for language, then we can achieve the capabilities of much larger models in smaller packages. But we’ll have to look and see :wink:


Ha, yeah. OK. You say you agreed with my points overall, but it’s not clear how you agree.

You listed some things you didn’t agree with. But said nothing about what you agree with.

What were the points you agreed with?

I said:

  1. I think the fact that LLMs turn out enormously LARGE may be telling us that meaningful cognitive structure is intensely context sensitive. Actually chaotic.

  2. That this is something which HTM might have discovered by trying to directly model language structure earlier.

Were those things you agreed with?

The chaos would be generated by a prior, yes. That is the biological simplification I was hinting at.

But frankly, I think this chaos generating prior is MORE evident from looking narrowly at the language problem. So there is MORE insight to be extracted from language. We haven’t exhausted the insights language structure can tell us about cognition at all. We are still ignoring the insights language is telling us. We are seeing that language requires a vast amount of context sensitivity. But then we are just pushing this insight aside and thinking the solution must be in some other data, somehow. It’s as if we are saying, “Yeah, language structure is complex, but somehow all that complexity produces the best AI models yet… Hmm. I don’t know why, so I’ll look at something else…”. Like the old proverb about looking for your keys where it’s convenient to look for them, not where you dropped them! We are ignoring an actual prior, one language is teaching us, which could generate all this (chaotic) LARGE context sensitivity. That prior is something which language structure teaches us. It is actually LESS evident in multi-modal data. Though multi-modal data should prove amenable to it too.

So I disagree the next step is to drop language and move on to multi-modal. Instead of racing on to look for unknown solutions somewhere else, do a U-turn and come back and look at language again for a moment.

What were the bits you agreed with?


The sequentiality can be a feature, not a bug.

This might be another case of moving on to look for the solution somewhere else instead of looking at the solution which is staring us in the face.

Yeah, attention could be multi-dimensional. But sequence is working better than anything else for some reason at the moment. Is this something else language is teaching us?

Instead of trying to improve on ENORMOUS sequential models by trying EXPONENTIALLY MORE ENORMOUS(!!) multi-dimensional models, what I’m suggesting is that we consider that the complexity we already see with language might be chaotic. And consider a simplification which generates that chaos.


It’s not that language alone won’t get us towards AGI/HLAI. But rather, the current path for DL is steep with heavy compute and data requirements.

Alleviating those requirements through priors is the obvious solution, but we recognize why transformers have been so successful from a theoretical point of view. Any priors that should be embedded should be learnt by the model itself, not hardcoded through inductive biases.

So the only way to learn priors as generic and useful as possible is to leverage the exact modalities biology faced and to learn across them. Of course, us being in the 21st century, we’ll leverage other novel (and arguably more informative) data sources such as RL trajectories and use IDA to enhance our models, culminating in ASI.

But the core idea remains - to solve those problems, we need to prove multi-modal scaling laws transfer across all domains in a compute efficient manner.


With the best will in the world neel_g, you are not addressing my ideas at all. You are just restating your own ideas:

Your ideas:

  1. Why transformers have been successful - learning?

“Any priors that should be embedded should be learnt by the model itself, not hardcoded through inductive biases.”

  2. The way to improve on transformers - go multi-modal

“the only way to learn priors as generic and useful as possible are to leverage the exact modalities biology faced and to learn across them”

Is that fair? It seems to me to be the key points you are trying to make. You are ignoring the points I am trying to make. But if we both ignore each other we will make no progress, so I’m at least trying to grasp your points.

If this is indeed a fair summation of the points you want to make, they seem not only to ignore the ideas I’m stating, but to completely capitulate on the ideas which made HTM good in the first place, and which I think could make it a better substrate on which to implement a better solution. What distinguished HTM from the mainstream of neural networks was that it made a conscious decision to avoid the “learning” paradigms of the ANN thread of research. ANNs depended on back-propagation. This was biologically implausible. So HTM avoided it.

Now you’re saying that “learning” is what has made transformer ANN’s successful. And the way forward is to “learn” more priors from more data modes.

So, not only do you want to embrace transformers fully. You want to abandon the rejection of “learning” which made HTM distinct from them.

I disagree. I want to keep the rejection of “learning” which made HTM distinct. I think that part of HTM was correct. The “learning” paradigm of ANN’s is biologically implausible, and ultimately wrong.

The wrongness reveals itself in the explosion of size generated by transformers. That is the key lesson of transformers, that they are LARGE. They are not “learning” a prior at all. They are learning a… for want of a better word, a “post”. Not a generative principle, but the infinitely unenumerable products of a generative principle.

The true prior remains trapped in the data.

Transformers can’t learn that prior because they are trapped within a body of techniques based around gradient descent over energy surfaces, back-propagation, which was exactly what HTM saw was not biologically plausible, and sought to avoid.

I think the solution is to recognize that this enormous size means the “learning” paradigm is indeed back to front. HTM was right. Gradient descent is not a biologically plausible mechanism. The enormous size of transformer models is a demonstration that transformers are not learning the fundamental “prior”. And that we need to go back and seek a generative prior which can generate all the billions of transformer “parameters”. Which generative prior I think will actually be chaotic.

Language is the best data set to learn this, because it is actually the simplest. It is the simplest, and the closest to a fundamental cognitive prior, because it is the data set which is most purely and simply generated by the brain itself.

With language the brain is telling us that sequence is fundamental.

Indeed, since you don’t ask, and display zero curiosity to know, let me say that I think the prior we are seeking is cause and effect. But transformers don’t learn this. Instead they learn an eternally finite subset of actual causes and actual effects. They learn actual causes and actual effects, examples, enormous numbers of examples expressed by humans previously, instead of the active generating principle of cause and effect itself, and the infinitely expanding patterns generated by it.

Sure, apply this cause and effect prior to multi-modal data eventually. But let’s implement it properly for language first. That’s where it reveals itself most concisely. If we have eyes to see it.

But I don’t think you will see this. The entire industry is being steam rolled by the size of transformers. To judge by your comments even HTM is now crushed by it into accepting back-propagation.

That there is talk of priors may lead us eventually to the right solution. But for now the drift seems to be in the direction you indicated, moving away from the simplicity of language, away from the simplest and best hint that cause and effect is the fundamental prior. In practice AI is embracing size, dominated by enormous companies, which are the only ones big enough to try and model a generative infinity, by generating all examples of it.


Here’s the problem: if attention scans multiple dimensions, there is no implicit or “good” order in which to scan them. Left to right, top to down, or otherwise?
If the positional embedding is one dimensional, then on every new run the relative position of sun to tree could be encoded differently, despite the fact that spatially they did not move.
If the positional encoding is modally and spatially “correct”, then the same relationship between parts would be transcribed regardless of their sequential position. I mean, with an encoding that acknowledges the existence of and tracks multiple dimensions, different modalities won’t need to sync with each other in order to provide the same encoding when relationships between tokens from different modalities/dimensions do not change.

Language is a particular case: it forces all parallel streams into a single one, for various reasons. Conscious experience itself seems to be single threaded, but that is only the visible, “unifying” tip of the mind’s processes, which underneath are massively parallel.

I haven’t proposed bigger sequential models, but:
The multimodality will challenge ANY model with exponentially more data, regardless of how it is fed in - sequential or parallel.

Well, I don’t really understand what you mean by that nor why/how would it help?

PS And I don’t think size in itself is the implicit culprit here. After all, brains are above LLMs, at least in a strict number-of-parameters metric.
The actual “sin” of deep learning/backpropagated models is that this size needs to be “monolithic”, otherwise it can’t leverage a GPU’s teraflop/sec capacity.
One of the unfortunate consequences is the need for massive amounts of learning data. Another is computing/energy cost - every parameter needs to be accounted for every timeframe.

And I think that @neel_g suggests transformers will actually require less data once they become multimodal, which to some extent might be true.


cezar_t, March 5:

What I’m suggesting is that we consider the complexity we already see with language, might be chaotic. And consider a simplification which generates that chaos.

Well, I don’t really understand what you mean by that nor why/how would it help?

Thanks for saying so cezar_t! It’s not surprising that you don’t. People have different backgrounds. It’s hard to know what prior knowledge to build on. But if you ask, I can give more depth.

First thing, mathematical chaos. I don’t know if you’re familiar with that generally. It’s worth exploring, because it’s pretty clear that patterns of neural firing in the brain are chaotic. If you do a Web search on that, you’ll find lots of references. Here’s one that came up in my Twitter feed recently:

Could One Physics Theory Unlock the Mysteries of the Brain?

Chaos is kind of weird, and only discovered quite recently. ~60 years ago?

It’s weird, but in another way of thinking, there is nothing strange about it. It just describes a state of extreme context sensitivity it turns out is possessed by some dynamical systems. It’s not inherently less meaningful than any other dynamical system. The use of the word “chaos” just comes from the fact it can’t be predicted. It’s not itself “chaotic” in the traditional sense of disorder, it just seems to defy order to outside observers. So they call it chaotic because the order is more than they as observers can know, not because the system itself does not have order. Actually it is the opposite of disorder. It is extreme order. So much order, that the order can’t be described more compactly than the thing itself! There’s a parallel to free will. The only thing which really knows what a chaotic system will do, is the chaotic system itself. Even the creator of the chaotic system cannot know fully what it will do.

A good example of chaos, and actually the first place it was observed, is the weather. The chaotic character of the weather is the reason you can’t make useful predictions more than a few days forward. The only way to really predict the weather with full accuracy, is to wait and see what it does.

But for all the complexity of structure chaos can generate, the actual generating function can be extremely simple. So, for example, a double pendulum is a chaotic system. Just two degrees of freedom, but it generates chaos!
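The logistic map makes the same point in one line of code: a trivially simple generating function whose trajectories diverge from near-identical starting points (the "butterfly" sensitivity mentioned earlier). A small sketch:

```python
def logistic(x, r=4.0):
    """One step of the logistic map, a one-line chaotic generator."""
    return r * x * (1 - x)

def trajectory(x0, steps):
    """Iterate the map from x0 and collect the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic(xs[-1]))
    return xs

# two starting points differing by one part in a billion
a = trajectory(0.2, 50)
b = trajectory(0.2 + 1e-9, 50)
divergence = max(abs(x - y) for x, y in zip(a, b))
```

Within roughly 30 iterations the two trajectories are completely decorrelated, yet the generator itself is a single multiplication. That is the gap between a generating function and the patterns it generates.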

Here’s a nice example of “dancing” robots, with motions expressed as chaotic attractors (attractors being stable states of a chaotic system.) A good example of a very simple “robot” model, which generates behaviour which is quite complex:

How the body shapes the way we think: Rolf Pfeifer at TEDxZurich

There’s lots more examples I can give. I can give more if you like. Or you can just google up heaps yourself.

The upshot is it seems clear cognitive patterns generated by the brain are actually chaotic. Some people may dispute that. But I think their objection would be on the level that the chaos is just at some irrelevant substrate level. Like people who argue whether it’s necessary to model neural spikes.

But at the level of modeling the actual chaos which is observed, the problem comes because nobody knows exactly what the generating function is.

Given such a function, though, some way of relating elements which is inherently meaningful to the system, any “prior”, really, there is no reason to imagine the forms generated by that “prior”, should not be chaotic.

The only thing about it is that you would not be able to “learn” it. Where by “learn” I mean abstract it in a form more compact than the system itself. A chaotic system cannot be fully abstracted more compactly than itself.

It would be like the weather in that sense. No way to abstract it more compactly than the actual weather. If you tried to do so, if you tried to “learn” all the patterns, you would get one hell of a lot of particular storms that happened at particular times, but they would all be just that little bit idiosyncratic to themselves. You might be able to analogize between them a bit, saying, “Today was a particularly strong SW’er”, or "Red sky at night shepherd’s this that and the other… " etc. But two storms would never be exactly the same. And you would never know with absolute certainty where a tornado might touch down. A bit like the old Greek proverb of the same river never being quite the same.

So a model of the weather based on “learning” actual patterns developed by the weather, would be extremely LARGE. Would become more accurate the larger it became. But could never be completely accurate.

LARGEness of attempts to model the weather based on “learning” actual patterns expressed by the weather, would be an indicator that the weather is actually a chaotic system.

Maybe I’ll stop there and see if you are with me so far. Does what I’ve written above make sense?


Compared to a biological neuron, the parameter count of an LLM is absolutely nothing. Current LLMs are barely equivalent to a bee or something even simpler. It’s hard to admit, but our Von Neumann architecture simply isn’t the most efficient. We stick with it because it’s flexible and useful everywhere, but that comes at a cost in FLOPs/$

Transformers do learn priors; they mostly learn them in the form of circuits and algorithms embedded in their weights, some of which are quite ingenious really.
They also learn some complex mechanisms which are beyond the explanation of current interpretability work. For instance, they learn to meta-GD, which is quite insane to think about - Gradient Descent converges to a solution which implements itself. Talk about meta.

This is the crux of at least 3 of your points. I have a simple position about this: backpropagation could be better, more efficient and much faster while still being scalable.
However, I see no promising alternatives yet. All initial ideas never scale and never work outside toy domains.

It also doesn’t discount the fact that whatever algorithm the brain uses would be resource constrained, thus locality would be enforced implicitly to preserve energy. This would imply that it would be inferior to true full backpropagation, and backprop could be the way to potentially ASI rather than focusing solely on AGI.

Disagree. Being sequential has nothing to do with biology or the brain, which is rather recursive in nature.

That’s a lot of assertions. Any evidence to back up:

  1. cause and effect is the prior we need to create a generalist agent, and can be quantitatively embedded in algorithms
  2. proving transformers don’t learn that

I couldn’t care less what HTM uses to “learn”. It could be a bunch of slaves playing bingo with the weights for all I care. I simply wish to see results. If HTM can’t deliver it, then perhaps its entire foundation is wrong, the core model is wrong or the learning update rule is flawed.

So I would look for scientific literature ablating and modifying those components to see how they work better, if I were you.

Also, it doesn’t help your case that Numenta has pretty much given up on HTM and its paradigm :man_shrugging:

Well, because it simply isn’t enough. You aren’t raised through millennia by just language. In fact, visual stimuli were the main reason why the brain became complex in the first place. Darwinism selected organisms with more neurons who could process the complex 3D world and its spatio-temporal connections properly.

It’s simply a great way to increase unimodal data and compute efficiency by transferring knowledge across domains. Look at visual transformers like BLIP-2 and their amazing capabilities across domains (despite never being trained on multimodal data directly; it’s an NLP model with a frozen adapter)

Just a small correction - the models being monolithic has nothing to do with the lacking sample efficiency. You could train smaller models on the same dataset and still get away with outperforming large ones (Chinchilla; in a few weeks, you’ll see LLaMa being a bigger rage than ChatGPT)

The training data requirements come with priors, the same which I’ve been harping on about for months now :wink:. LLMs start out with a blank slate - they make no assumptions about the world. We on the other hand have plenty of priors biologically to aid sample efficiency on earth and being generally very flexible/adaptable.

Larger models learn more complex priors, hence the more complex capabilities and increase transfer learning across all domains. As for being monolithic, they just work the best empirically. The brain (more accurately, the neocortex) is also quite uniform and monolithic. Perhaps there’s some link here, but empirical results are hard to argue against :slight_smile:


That’s a lot of assertions. Any evidence to back up:

  1. cause and effect is the prior we need to create a generalist agent, and can be quantitatively embedded in algorithms
  2. proving transformers don’t learn that

How many assertions was that by me?

“2” would be proving a negative. I’ll concede no proof for that.

Aside from proving a negative, that leaves “1”, which is one assertion. Is one assertion a lot?

You can’t prove a negative. On the other hand, proving transformers do learn cause and effect as a generative principle would only require one existence proof. Should be easy for you. Just point me to the paper demonstrating it.

Evidence for “1”? Well, it works for language. That goes back a long way. It was the basis for American Structuralism in the '30s. You can look up Harris’s Principle, distributional analysis, Latent Semantic Analysis, Grammatical Induction…, for instance:

You shall know an object by the company it keeps: An investigation of semantic representations derived from object co-occurrence in visual scenes
Zahra Sadeghi, James L McClelland, Paul Hoffman

Bootstrapping Structure into Language: Alignment-Based Learning, Menno van Zaanen
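The distributional idea those works build on ("you shall know a word by the company it keeps") fits in a few lines: represent each word by the counts of the words appearing near it, and words used in similar contexts come out similar. A toy sketch, with an invented corpus purely for illustration:

```python
from collections import Counter
import math

def context_vectors(sentences, window=1):
    """Harris-style distributional analysis: each word is represented
    by counts of the words occurring within `window` positions of it."""
    vecs = {}
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            ctx = vecs.setdefault(w, Counter())
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    ctx[words[j]] += 1
    return vecs

def cosine(a, b):
    # cosine similarity between two sparse count vectors
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = ["the cat sat on the mat",
          "the dog sat on the mat",
          "the cat chased the mouse",
          "the dog chased the mouse"]
v = context_vectors(corpus)
```

In this toy corpus “cat” and “dog” occur in identical contexts, so their vectors coincide exactly; that is the distributional principle in miniature.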

The only problem is that it generates contradictions, which prevent “learning” as such. That’s what destroyed American Structuralist linguistics, and ushered in Chomsky’s Generativism (Chomsky is enormously dismissive of transformers.) Here’s a bunch of papers which characterize the inability of linguistic structure to be learned (starting at phonemes) as non-linearity:

Lamb, Review of Chomsky. American Anthropologist 69.411-415 (1967).

Lamb, Prolegomena to a theory of phonology. Language 42.536-573 (1966) (includes analysis of the Russian obstruents question, as well as a more reasonable critique of the criteria of classical phonemics).

Lamb, Linguistics to the beat of a different drummer. First Person Singular III. Benjamins, 1998 (reprinted in Language and Reality, Continuum, 2004).

Lamb and Vanderslice, On thrashing classical phonemics. LACUS Forum 2.154-163 (1976).

Or for a more mainstream analysis of that learning problem here:

Generative Linguistics a historical perspective, Routledge 1996, Frederick J. Newmeyer:

“Part of the discussion of phonology in ’LBLT’ is directed towards showing that the conditions that were supposed to define a phonemic representation (including complementary distribution, locally determined biuniqueness, linearity, etc.) were inconsistent or incoherent in some cases and led to (or at least allowed) absurd analyses in others.”

That this then can be extended to something which generates meaningful structure for a generalist agent might be an assertion. More of a hypothesis. But first we should apply it fully to language.

What other assertions are you attributing to me?

By contrast, your thesis to the best of my ability to understand it, seems to be that transformers are fine. They are the full solution. We only need to make them even bigger, and give them even more data. Back-propagation is perfectly biologically plausible. HTM has been abandoned (quite possible) so the insights which motivated it are not worthy of consideration (less justifiable.)

And reiterated, that you are sure the way to move forward is to do more learning, over more multi-modal data.

Well, that has the advantage of closely aligning with what maybe 90% of people in the industry currently believe. Maybe you’re right. Maybe HTM was completely wrong. Maybe that Google chatbot really did just suddenly become conscious, and all we need to do to achieve final AGI is to get yet bigger, feed ANN back-propagation gradient descent learning algorithms yet more data, build a speech recognition engine with 100 years of training data instead of 77 (Whisper?) Learn it 2^42 parameters as Hinton jokes…

Maybe all Elon Musk needs to do to finally achieve fully human level object recognition for driving really is to perfect his auto-labelling system, in his automated driving example generating system, so that he really can finally label every possible corner case of driving that could ever occur in the world, and then train his networks to recognize everything that’s been labeled:



Hmm, I doubt it:

In my book that means 4x lower sample efficiency. LLaMa I think used > 1T tokens for training.
Which, I think, again, is significantly more than what GPT-3 was trained with.

PS sorry the above quote is from here

Sorry again, I have some idea of what a chaotic system means; what I asked is what you mean by an AI being chaotic. Any particular algorithm/mechanism you can describe, or just a general feeling that, since natural intelligence exhibits chaotic properties, then (as with backprop) if an artificial system doesn’t replicate “chaotism” it won’t be able to be intelligent?
Actually, your assertion is even more confusing: you say it has to generate chaos. What I learned is that some systems behave chaotically, but I haven’t heard of chaos generators. Or do you mean they are the same thing?

@neel_g deepmind’s paper puts it even more clearly:

we find that for compute-optimal training, the model size and the training dataset size should be scaled equally: for every doubling of model size the training dataset size should also be doubled

which indeed means that, as the model increases, sample efficiency gets lower.

The fact that Chinchilla reached the same performance with fewer parameters but more training data is used to explain that the larger model could have made use of (much) more data. Yet that could be a problem.
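For a back-of-envelope view of what that scaling rule implies, here is a sketch using the common approximation C ≈ 6·N·D (FLOPs ≈ 6 × parameters × tokens) and the rough figure of ~20 training tokens per parameter associated with the Chinchilla result. Both numbers are approximations, not exact fits from the paper:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Back-of-envelope compute-optimal sizing.

    From C = 6 * N * D and D = tokens_per_param * N it follows that
    N = sqrt(C / (6 * tokens_per_param)) and D scales with N.
    """
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# doubling compute scales params and tokens by sqrt(2) each, i.e. equally,
# which matches the quoted "scale model and data equally" rule
n1, d1 = chinchilla_optimal(1e21)
n2, d2 = chinchilla_optimal(2e21)
```

Equivalently: doubling the model size calls for doubling the dataset, and therefore quadrupling the compute.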


Natural language, with or without extra sensory grounding, is too complex for estimating how sample efficient a model is, because we have little context.

Here’s a much simpler RL problem: an I-maze (which is actually an H turned 90 degrees; see page 12 for what that means) is a conceptually simple problem. A transformer model learned to solve it in 2 million time steps which, since the game can be won in 20 steps, means it needed to play it 100k times to “figure it out”.

Which, I doubt, equates with “understanding” the problem. It should be solvable in a few dozen trials by humans, and in fewer by animals (dogs, cuttlefish, mice probably).

That’s what I call sample inefficient: orders of magnitude more trials before “figuring it out”.
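The episode arithmetic above is worth spelling out; a trivial back-of-envelope check, assuming 2M environment steps and 20 steps per winning episode:

```python
# Back-of-envelope for the I-maze sample-efficiency claim above.
# Assumptions: 2M environment steps in total, a winning episode takes 20 steps.
total_steps = 2_000_000
steps_per_episode = 20

episodes = total_steps // steps_per_episode
print(episodes)  # 100000 episodes, vs. a few dozen trials for a human

human_trials = 50  # a generous "few dozen" estimate (my assumption)
print(episodes / human_trials)  # ~2000x more trials: orders of magnitude
```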

My statement is simpler. Transformers have quite some way to go to AGI. But they’re still closer to being a generalist agent than HTM. Insights gained from HTM may be useful, but their application as of right now leads to no breakthroughs, so I can’t really comment on their impact.

Self-driving won’t be solved anytime soon simply because the scale of the models which are deployed is pitiful. It’s like asking why we couldn’t stream 4K content in the 1990s: they just didn’t have the hardware back then to pull it off.

Models deployed on the edge are highly resource-constrained. If Tesla ever wants to solve FSD, it will have to scale its models up by orders of magnitude to start making a dent in the problem. They do recognize the problem, hence DOJO, but again it’s yet to be fully rolled out.

As I said,

Chinchilla scaling laws are simply a compute-optimal version of Kaplan et al. You could keep the dataset size constant and scale up parameters alone, still leading to improvements. The gradient would just be slower than what Chinchilla-optimal models can do.

The models being monolithic, however, has nothing at all to do with sample efficiency. It’s a question of the priors and inductive biases we bake into our networks.

Again, priors. I feel I’ve repeated myself hundreds of times now. Animals have biological priors carried down by evolution for adaptability and flexibility. Every time you start a transformer’s training, it’s learning from a blank slate. It makes no assumptions about its environment or data distribution.

When you do find optimal priors, like AdA, you quickly find transformers learning as fast as humans (if not faster) in real time, on tasks the humans hand-crafted. This is extreme sample efficiency, rivaling humans with a fraction of the compute and parameters. XLand 2.0 is also pretty complex; arguably the humans have more knowledge about how the 10^40 possible tasks work than the agents themselves, yet AdA still performs on par.

All of this, just with ICL/few-shot learning and a frozen LM to boot: simply learning, through multiple trials, how the environment and its dynamics work, and optimizing its strategy to complete the task at hand better and faster. (I cannot stress this enough: watch the results reel!)

Transfer learning also helps; MGDT shows how a DT pre-trained on expert trajectories outperforms baseline, randomly initialized agents, as well as improving on the expert trajectories themselves. It’s much more efficient despite being just 1B parameters. There is a huge opportunity in just grabbing the low-hanging fruit here and scaling it up to improve performance by magnitudes.



Maybe I’ll stop there and see if you are with me so far. Does what I’ve written above make sense?

Sorry again, I have some idea of what a chaotic system means; what I asked is what you mean by an AI being chaotic. Is there any particular algorithm/mechanism you can describe, or just a general feeling that natural intelligence exhibits chaotic properties, hence (as with backprop) if an artificial system doesn’t replicate “chaotism” it won’t be able to be intelligent?
Actually, your assertion is even more confusing: you say it has to generate chaos. What I learned is that some systems behave chaotically, but I haven’t heard of chaos generators. Or do you mean they are the same thing?

“Generate” chaos or “behave” chaotically: I’m not distinguishing here; you can take them to be the same thing in my expression. They’re dynamical systems, which generate behaviour, and that behaviour happens to have the extreme context sensitivity and resistance to abstraction of chaos.

As for a concrete model, I don’t think the leap is too far. Actually I think the process will be very similar to what is happening now in transformers. Transformers also “generate” structure. They will be learning to do it in the same way, which is to say they will be learning structure which predicts effectively: cause and effect.

This is something which the language problem constrains you to nicely. It constrains you to think about a cognitive problem as cause and effect.

So what’s the difference?

I’m saying the difference is that grouping elements according to how effectively they predict cause and effect will actually find structure which is not static, not “learnable”. Instead it will find (generate) structure which changes dynamically from one moment to another. The groupings of the network, the hierarchies it finds/generates, will change from moment to moment and from problem to problem.

Transformers don’t look for such dynamically changing network groupings/structure. They assume the network structures they find will be static. They must, because the mechanism they use is gradient descent to find energy minima (where an energy minimum means the groupings predict the next element maximally).

It’s a different assumption about the nature of the system you might find. That is all. If you assume one kind of structure, that is the structure you will find. If you assume what you find will be static, you will only find the static bit.

I’m saying we use the same principle of grouping according to effective prediction, but drop the assumption that structure will be static.

Dropping the assumption the structure you find will be static means you can’t use gradient descent. It means we need some other way of finding groupings of elements in a network which are predictive energy minima.

So the problem becomes, how do we find predictive energy minima in a network of observed (language) sequences, when we can’t simply track a fixed energy surface using gradient descent?

The solution to this problem for me came when I found this paper:

A Network of Integrate and Fire Neurons for Community Detection in Complex Networks, Marcos G. Quiles, Liang Zhao, Fabricio A. Breve, Roseli A. F. Romero
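For a flavour of what community detection does, here is a much simpler stand-in; this is label propagation, not the integrate-and-fire algorithm from the Quiles et al. paper, but it likewise finds densely connected groupings without gradient descent on a fixed energy surface. The toy graph and the tie-breaking rule are my own assumptions:

```python
# Minimal stand-in for graph community detection: simple label propagation.
# Each node repeatedly adopts the most common label among its neighbours;
# densely connected groups converge onto a shared label.
from collections import Counter

def label_propagation(adj, sweeps=5):
    """Propagate labels over adjacency dict `adj` (node -> neighbour list).
    Ties are broken by the largest label, purely for determinism."""
    labels = {n: n for n in adj}
    for _ in range(sweeps):
        for n in adj:
            counts = Counter(labels[m] for m in adj[n])
            best = max(counts.values())
            labels[n] = max(l for l, c in counts.items() if c == best)
    return labels

# Toy "two community" network: two 4-cliques joined by one bridge edge (3-4).
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
print(label_propagation(adj))  # nodes 0-3 end up sharing one label, 4-7 another
```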



Nope, it’s not. The I-maze problem has a 50% chance of being solved correctly at random, and once you learn the “trick” there’s no way to fail it, unless you’re a total idiot. At least that’s how humans do it. And it has little to do with priors, since, well, it is a very artificial problem.

The DL models they test do not exhibit this ramp to 50% success followed by a sudden jump to 100% within a few (like a dozen) trials, when they “get” the clue. For ~100,000 games (at least 2M steps) they show only a gradual improvement. Which means they never get any clue; they solve it statistically. Which you glorify here as the success of priors.

AdA sure is interesting but it doesn’t settle the case. The sample inefficiency is about how long it takes until it finds those optimal priors. If you want the same model to play decent chess too, it needs to be trained on another few million chess moves. The priors in one game do not apply in another, not as fast as they do for us. Yes, we have some important instinctive priors, but much fewer and more “general” ones: the part of the genome encoding brain structure isn’t so large that it can count as pretraining. Whatever priors we have (if you claim it’s only a matter of priors), then those priors are very good at sample efficiency across a wide domain.

My bet is it’s about a different learning mechanism, not just priors, and large DL models simply ignore these mechanisms.


Yes, there’s something to causality, but it might be deeper. I don’t think it is our language skills that shape our minds to conceive causality.
I think the assumption of causality is one of those important priors which shapes everything else about our minds, language included.

That assumption is so strong that we gladly assert any correlation is in fact a causation (logically wrong, but when a logical fallacy saves our ass, it IS evolutionarily correct).
Look at how prone we are to superstitions and believing dangerously silly stuff like witchcraft.

The big question is how one could engineer a speculative causality prior in a machine learning framework. I call it speculative because that’s what it is: we make assumptions, and when one is confirmed we take it as “correct” for as long as it fits the data.
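As a toy illustration of what such a speculative prior might look like (entirely my own construction, not an established method), one could strengthen an “A causes B” hypothesis each time B follows A, and keep it only while it keeps fitting the data:

```python
# Toy sketch of a "speculative causality" prior (my own construction):
# every time event B follows event A we strengthen the hypothesis "A causes B",
# and keep it only while P(B follows | A) stays high enough.
from collections import defaultdict

class SpeculativeCausality:
    def __init__(self, threshold=0.8, min_trials=5):
        self.seen = defaultdict(int)      # times A occurred (as a potential cause)
        self.followed = defaultdict(int)  # times A was immediately followed by B
        self.threshold = threshold
        self.min_trials = min_trials

    def observe(self, stream):
        for a, b in zip(stream, stream[1:]):
            self.seen[a] += 1
            self.followed[(a, b)] += 1

    def hypotheses(self):
        """Cause->effect pairs currently taken as 'correct'."""
        return {
            (a, b): n / self.seen[a]
            for (a, b), n in self.followed.items()
            if self.seen[a] >= self.min_trials
            and n / self.seen[a] >= self.threshold
        }

model = SpeculativeCausality()
model.observe(list("ABABABABABABAC"))  # 'A' is almost always followed by 'B'
# Both ('A','B') and ('B','A') pass: the model, like us, happily reads
# correlation in either direction as causation.
print(model.hypotheses())
```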

Regarding the chaos discussion: that could be a fertile one, but we’re better off moving it to another topic. Where I encountered it was in reservoir computing: the closer to the chaotic threshold a reservoir is, the better it gets at modelling its input signals, but not if it passes that threshold. They need to be almost chaotic to be useful, at least in that particular case.


I remain convinced that the key difference is that the animal brain is continually creating models of reality. You (or a rat) look at a problem, try a few things, pick a known model that might work, tweak it a bit, solve the problem and move on. Next time that model gets used earlier and the solution is found sooner.

Until our AI builds explicit models of reality, it will continue to fail on many things we do so well. And we have absolutely no idea how to do that, do we?


Yes, actually we have one grand model of reality and update it here and there, wherever it fails. I won’t say we have a complete idea of how that works, but even the fact that we know only small incongruent parts get updated might be a useful clue in an attempt to replicate it.

try a few things, pick a known model that might work, tweak it a bit, solve the problem and move on.

Yes, I think this is the case, yet on a massive scale. My assumption is that out of the ~100M columns, 1% are stabilized, shaping the grand “world model”, while the remaining 99M are speculating. It’s another kind of search engine, but it searches not only within the stable model; in a massively parallel manner it searches for correlations between the model and details within the current experience through raw speculation. Maybe not raw (as in random) but loosely informed by prior knowledge.

And I think the above assumption isn’t too hard to test, in code.
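For instance, here is a deliberately tiny toy version of that test; all the numbers and mechanisms are my own assumptions, scaled way down from ~100M columns to 1000 units:

```python
# Toy version of the "1% stable / 99% speculating" idea (all assumptions mine).
# Speculators guess random input->output associations; a guess that is
# confirmed by observation gets consolidated into the small stable model,
# a failed guess re-speculates.
import random

random.seed(0)
N_UNITS = 1000
truth = {i: (i * 7) % 10 for i in range(10)}  # hidden world: input -> label

stable = {}  # the consolidated "world model": input -> label
speculators = [{i: random.randrange(10) for i in range(10)}
               for _ in range(N_UNITS)]  # the speculative pool

for step in range(2000):
    x = random.randrange(10)
    y = truth[x]                                  # observed outcome
    for guess in random.sample(speculators, 50):  # a parallel subset speculates
        if guess[x] == y:
            stable[x] = y                         # confirmed guess consolidates
        else:
            guess[x] = random.randrange(10)       # failed guess re-speculates

print(stable == truth)  # the stable model converges onto the hidden world: True
```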