A Deeper Look at Transformers: Famous Quote: "Attention is all you need"

It’s not that language alone won’t get us towards AGI/HLAI. But rather, the current path for DL is steep, with heavy compute and data requirements.

Alleviating those requirements through priors is the obvious solution, but we should recognize why transformers have been so successful from a theoretical point of view: any priors that should be embedded should be learnt by the model itself, not hardcoded through inductive biases.

So the only way to learn priors as generic and useful as possible is to leverage the exact modalities biology faced and to learn across them. Of course, being in the 21st century, we’ll also leverage other novel (and arguably more informative) data sources such as RL trajectories, and use IDA to enhance our models, culminating in ASI.

But the core idea remains - to solve those problems, we need to prove that multi-modal scaling laws transfer across all domains in a compute-efficient manner.

1 Like

With the best will in the world, neel_g, you are not addressing my ideas at all. You are just restating your own ideas:

Your ideas:

  1. Why transformers have been successful - learning?

“Any priors that should be embedded should be learnt by the model itself, not hardcoded through inductive biases.”

  2. The way to improve on transformers - go multi-modal

“the only way to learn priors as generic and useful as possible are to leverage the exact modalities biology faced and to learn across them”

Is that fair? It seems to me to be the key points you are trying to make. You are ignoring the points I am trying to make. But if we both ignore each other we will make no progress, so I’m at least trying to grasp your points.

If this is indeed a fair summation of the points you want to make, they seem to not only ignore the ideas I’m stating, but to completely surrender the ideas which made HTM good in the first place, and which I think could make it a better substrate on which to implement a better solution. What distinguished HTM from the mainstream of neural networks was that it made a conscious decision to avoid the “learning” paradigms of the ANN thread of research. ANNs depended on back-propagation. This was biologically implausible. So HTM avoided it.

Now you’re saying that “learning” is what has made transformer ANNs successful. And the way forward is to “learn” more priors from more data modes.

So, not only do you want to embrace transformers fully. You want to abandon the rejection of “learning” which made HTM distinct from them.

I disagree. I want to keep the rejection of “learning” which made HTM distinct. I think that part of HTM was correct. The “learning” paradigm of ANNs is biologically implausible, and ultimately wrong.

The wrongness reveals itself in the explosion of size generated by transformers. That is the key lesson of transformers, that they are LARGE. They are not “learning” a prior at all. They are learning a… for want of a better word, a “post”. Not a generative principle, but the infinitely unenumerable products of a generative principle.

The true prior remains trapped in the data.

Transformers can’t learn that prior because they are trapped within a body of techniques based around gradient descent over energy surfaces, back-propagation, which was exactly what HTM saw was not biologically plausible, and sought to avoid.

I think the solution is to recognize that this enormous size means the “learning” paradigm is indeed back to front. HTM was right. Gradient descent is not a biologically plausible mechanism. The enormous size of transformer models is a demonstration that transformers are not learning the fundamental “prior”. And that we need to go back and seek a generative prior which can generate all the billions of transformer “parameters”. Which generative prior I think will actually be chaotic.

Language is the best data set to learn this, because it is actually the simplest. It is the simplest, and the closest to a fundamental cognitive prior, because it is the data set which is most purely and simply generated by the brain itself.

With language the brain is telling us that sequence is fundamental.

Indeed, since you don’t ask, and display zero curiosity to know, let me say that I think the prior we are seeking is cause and effect. But transformers don’t learn this. Instead they learn an eternally finite subset of actual causes and actual effects. They learn actual causes and actual effects, examples, enormous numbers of examples expressed by humans previously, instead of the active generating principle of cause and effect itself, and the infinitely expanding patterns generated by it.

Sure, apply this cause and effect prior to multi-modal data eventually. But let’s implement it properly for language first. That’s where it reveals itself most concisely. If we have eyes to see it.

But I don’t think you will see this. The entire industry is being steamrolled by the size of transformers. To judge by your comments even HTM is now crushed by it into accepting back-propagation.

That there is talk of priors may lead us eventually to the right solution. But for now the drift seems to be in the direction you indicated, moving away from the simplicity of language, away from the simplest and best hint that cause and effect is the fundamental prior. In practice AI is embracing size, dominated by enormous companies, which are the only ones big enough to try and model a generative infinity, by generating all examples of it.

1 Like

Here’s the problem: if attention scans multiple dimensions there is no implicit or “good” order in which to scan them. Left to right, top to bottom, or otherwise?
If the positional embedding is one-dimensional, then on every new run the relative position of sun to tree could be encoded differently, despite the fact that spatially they did not move.
If the positional encoding is modally and spatially “correct”, then the same relationship between parts would be transcribed regardless of their sequential position. I mean, with an encoding that acknowledges the existence of, and tracks, multiple dimensions, different modalities won’t need to sync with each other in order to provide the same encoding when relationships between tokens from different modalities/dimensions do not change.
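Here’s a rough sketch of what I mean (my own toy illustration, just standard sinusoidal encodings applied once per axis; the positions and dimensions are made up). The useful property is that the similarity between two positions’ encodings depends only on their offset, so the sun/tree relationship is encoded the same way wherever the pair sits in the frame:

```python
import numpy as np

def sinusoidal_1d(pos, dim):
    """Standard sinusoidal encoding of a scalar position into `dim` values."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (10000 ** (2 * i / dim))
    angles = pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def sinusoidal_2d(x, y, dim):
    """Encode a 2-D position by concatenating one encoding per axis."""
    return np.concatenate([sinusoidal_1d(x, dim // 2), sinusoidal_1d(y, dim // 2)])

dim = 64
# "sun" and "tree" in one frame, then the same pair translated in another frame
sun1, tree1 = sinusoidal_2d(3, 7, dim), sinusoidal_2d(5, 2, dim)
sun2, tree2 = sinusoidal_2d(13, 27, dim), sinusoidal_2d(15, 22, dim)  # same offset (+2, -5)

# The dot product between two encodings depends only on the positional offset,
# so the sun/tree relationship looks identical in both frames.
print(np.allclose(np.dot(sun1, tree1), np.dot(sun2, tree2)))  # True
```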

Language is a particular case: it forces all parallel streams into a single one, for various reasons. Conscious experience itself seems to be single-threaded, but that is only the visible, “unifying” tip of the mind’s processes, which underneath are massively parallel.

I haven’t proposed bigger sequential models, but:
Multimodality will challenge ANY model with exponentially more data, regardless of how it is fed in - sequentially or in parallel.

Well, I don’t really understand what you mean by that, nor why/how it would help?

PS And I don’t think size in itself is the implicit culprit here. After all, brains are above LLMs, at least by a strict parameter-count metric.
The actual “sin” of deep learning/backpropagated models is that this size needs to be “monolithic”, otherwise it can’t leverage a GPU’s teraflop/sec capacity.
One of the unfortunate consequences is the need for massive amounts of training data. Another is computing/energy cost - every parameter needs to be touched at every time step.

And I think that @neel_g suggests transformers will actually require less data once they become multimodal, which to some extent might be true.

1 Like

cezar_t, March 5:

What I’m suggesting is that we consider that the complexity we already see with language might be chaotic. And consider a simplification which generates that chaos.

Well, I don’t really understand what you mean by that, nor why/how it would help?

Thanks for saying so cezar_t! It’s not surprising that you don’t. People have different backgrounds. It’s hard to know what prior knowledge to build on. But if you ask, I can give more depth.

First thing, mathematical chaos. I don’t know if you’re familiar with that generally. It’s worth exploring, because it’s pretty clear that patterns of neural firing in the brain are chaotic. If you do a Web search on that, you’ll find lots of references. Here’s one that came up in my Twitter feed recently:

Could One Physics Theory Unlock the Mysteries of the Brain?
https://youtu.be/hjGFp7lMi9A

Chaos is kind of weird, and only discovered quite recently. ~60 years ago?

It’s weird, but in another way of thinking, there is nothing strange about it. It just describes a state of extreme context sensitivity which, it turns out, is possessed by some dynamical systems. It’s not inherently less meaningful than any other dynamical system. The use of the word “chaos” just comes from the fact it can’t be predicted. It’s not itself “chaotic” in the traditional sense of disorder; it just seems to defy order to outside observers. So they call it chaotic because the order is more than they as observers can know, not because the system itself does not have order. Actually it is the opposite of disorder. It is extreme order. So much order, that the order can’t be described more compactly than the thing itself! There’s a parallel to free will. The only thing which really knows what a chaotic system will do, is the chaotic system itself. Even the creator of the chaotic system cannot know fully what it will do.

A good example of chaos, and actually the first place it was observed, is the weather. The chaotic character of the weather is the reason you can’t make useful predictions more than a few days forward. The only way to really predict the weather with full accuracy, is to wait and see what it does.

But for all the complexity of structure chaos can generate, the actual generating function can be extremely simple. So, for example, a double pendulum is a chaotic system. Just two degrees of freedom, but it generates chaos!
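If you want to see the sensitivity for yourself, here’s a tiny toy illustration (my own example, nothing to do with the brain specifically): the logistic map, one of the simplest systems known to behave chaotically. Two starting points that differ by one part in a million end up completely different within a few dozen steps:

```python
def logistic_map(x0, r=4.0, steps=50):
    """Iterate x_{n+1} = r * x_n * (1 - x_n); chaotic for r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_map(0.200000)
b = logistic_map(0.200001)  # initial condition differs by 1e-6

for n in (0, 10, 25, 50):
    print(n, abs(a[n] - b[n]))
# The gap grows from 1e-6 to order 1: no compact summary of the start state
# lets you predict the far future; you have to run the system itself.
```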

Here’s a nice example of “dancing” robots, with motions expressed as chaotic attractors (attractors being stable states of a chaotic system.) A good example of a very simple “robot” model, which generates behaviour which is quite complex:

How the body shapes the way we think: Rolf Pfeifer at TEDxZurich
http://www.youtube.com/watch?v=mhWwoaoxIyc&t=8m23s

There’s lots more examples I can give. I can give more if you like. Or you can just google up heaps yourself.

The upshot is it seems clear cognitive patterns generated by the brain are actually chaotic. Some people may dispute that. But I think their objection would be on the level that the chaos is just at some irrelevant substrate level. Like people who argue whether it’s necessary to model neural spikes.

But at the level of modeling the actual chaos which is observed, the problem comes because nobody knows exactly what the generating function is.

Given such a function, though, some way of relating elements which is inherently meaningful to the system, any “prior”, really, there is no reason to imagine the forms generated by that “prior”, should not be chaotic.

The only thing about it is that you would not be able to “learn” it. Where by “learn” I mean abstract it in a form more compact than the system itself. A chaotic system cannot be fully abstracted more compactly than itself.

It would be like the weather in that sense. No way to abstract it more compactly than the actual weather. If you tried to do so, if you tried to “learn” all the patterns, you would get one hell of a lot of particular storms that happened at particular times, but they would all be just that little bit idiosyncratic to themselves. You might be able to analogize between them a bit, saying, “Today was a particularly strong SW’er”, or "Red sky at night shepherd’s this that and the other… " etc. But two storms would never be exactly the same. And you would never know with absolute certainty where a tornado might touch down. A bit like the old Greek proverb of the same river never quite the same.

So a model of the weather based on “learning” actual patterns developed by the weather, would be extremely LARGE. Would become more accurate the larger it became. But could never be completely accurate.

LARGEness of attempts to model the weather based on “learning” actual patterns expressed by the weather, would be an indicator that the weather is actually a chaotic system.

Maybe I’ll stop there and see if you are with me so far. Does what I’ve written above make sense?

3 Likes

Compared to a biological neuron, the parameter count of an LLM is absolutely nothing. Current LLMs are barely equivalent to a bee, or something even simpler. It’s hard to admit, but our von Neumann architecture simply isn’t the most efficient. We stick with it because it’s flexible and useful everywhere, but that comes at a cost in FLOPs/$

Transformers do learn priors; they mostly learn them in the form of circuits and algorithms embedded in their weights, some of which are quite ingenious really.
They also learn some complex mechanisms which are beyond the explanation of current interpretability work. For instance, they learn to meta-GD, which is quite insane to think about - gradient descent converges to a solution which implements itself. Talk about meta.

This is the crux of at least 3 of your points. I have a simple position about this: backpropagation could be better, more efficient and much faster while still being scalable.
However, I see no promising alternatives yet. All initial ideas never scale and never work outside toy domains.

It also doesn’t discount the fact that whatever algorithm the brain uses would be resource constrained, thus locality would be enforced implicitly to preserve energy. This would imply that it would be inferior to true full backpropagation, and backprop could be the way to potentially ASI rather than focusing solely on AGI.

Disagree. Being sequential has nothing to do with biology or the brain, which is rather recursive in nature.

That’s a lot of assertions. Any evidence to back up:

  1. cause and effect is the prior we need to create a generalist agent, and can be quantitatively embedded in algorithms
  2. proving transformers don’t learn that

I couldn’t care less what HTM uses to “learn”. It could be a bunch of slaves playing bingo with the weights for all I care. I simply wish to see results. If HTM can’t deliver it, then perhaps its entire foundation is wrong, the core model is wrong or the learning update rule is flawed.

So I would look for scientific literature ablating and modifying those components to see how they work better, if I were you.

Also, it doesn’t help your case that Numenta has pretty much given up on HTM and its paradigm :man_shrugging:

Well, because it simply isn’t enough. You aren’t raised through millennia by just language. In fact, visual stimuli were the main reason why the brain became complex in the first place. Darwinism selected organisms with more neurons which could process the complex 3D world and its spatio-temporal connections properly.

It’s simply a great way to increase unimodal data and compute efficiency by transferring knowledge across domains. Look at visual transformers like BLIP 2 and their amazing capabilities across domains (despite never being trained on multimodal data directly; it’s an NLP model with a frozen adapter)

Just a small correction - the models being monolithic has nothing to do with the lacking sample efficiency. You could train smaller models on the same dataset and still get away with outperforming large ones (Chinchilla; in a few weeks, you’ll see LLaMa being a bigger rage than ChatGPT)

The training data requirements come with priors, the same ones I’ve been harping on about for months now :wink:. LLMs start out with a blank slate - they make no assumptions about the world. We, on the other hand, have plenty of biological priors to aid sample efficiency on Earth and to be generally very flexible/adaptable.

Larger models learn more complex priors, hence the more complex capabilities and increased transfer learning across all domains. As for being monolithic, they just work the best empirically. The brain (more accurately, the neocortex) is also quite uniform and monolithic. Perhaps there’s some link here, but empirical results are hard to argue against :slight_smile:

1 Like

That’s a lot of assertions. Any evidence to back up:

  1. cause and effect is the prior we need to create a generalist agent, and can be quantitatively embedded in algorithms
  2. proving transformers don’t learn that

How many assertions was that by me?

“2” would be proving a negative. I’ll concede no proof for that.

Aside from proving a negative, that leaves “1”, which is one assertion. Is one assertion a lot?

You can’t prove a negative. On the other hand, proving transformers do learn cause and effect as a generative principle would only require one existence proof. Should be easy for you. Just point me to the paper demonstrating it.

Evidence for “1”? Well, it works for language. That goes back a long way. It was the basis for American Structuralism in the '30s. You can look up Harris’s Principle, distributional analysis, Latent Semantic Analysis, Grammatical Induction…, for instance:

You shall know an object by the company it keeps: An investigation of semantic representations derived from object co-occurrence in visual scenes
Zahra Sadeghi, James L McClelland, Paul Hoffman
https://pubmed.ncbi.nlm.nih.gov/25196838/

Bootstrapping Structure into Language: Alignment-Based Learning, Menno van Zaanen
https://www.researchgate.net/publication/1955893_Bootstrapping_Structure_into_Language_Alignment-Based_Learning

The only problem is that it generates contradictions. Which prevent “learning” as such. That’s what destroyed American Structuralist linguistics, and ushered in Chomsky’s Generativism (Chomsky being enormously dismissive of transformers.) Here’s a bunch of papers which characterize the inability of linguistic structure to be learned (starting at phonemes) as non-linearity:

Lamb, review of Chomsky … American Anthropologist 69.411-415 (1967).

Lamb, Prolegomena to a theory of phonology. Language 42.536-573 (1966) (includes analysis of the Russian obstruents question, as well as a more reasonable critique of the criteria of classical phonemics).

Lamb, Linguistics to the beat of a different drummer. First Person Singular III. Benjamins, 1998 (reprinted in Language and Reality, Continuum, 2004).

Lamb and Vanderslice, On thrashing classical phonemics. LACUS Forum 2.154-163 (1976).

Or for a more mainstream analysis of that learning problem here:

Generative Linguistics a historical perspective, Routledge 1996, Frederick J. Newmeyer:

“Part of the discussion of phonology in ’LBLT’ is directed towards showing that the conditions that were supposed to define a phonemic representation (including complementary distribution, locally determined biuniqueness, linearity, etc.) were inconsistent or incoherent in some cases and led to (or at least allowed) absurd analyses in others.”

That this then can be extended to something which generates meaningful structure for a generalist agent might be an assertion. More of a hypothesis. But first we should apply it fully to language.

What other assertions are you attributing to me?

By contrast, your thesis to the best of my ability to understand it, seems to be that transformers are fine. They are the full solution. We only need to make them even bigger, and give them even more data. Back-propagation is perfectly biologically plausible. HTM has been abandoned (quite possible) so the insights which motivated it are not worthy of consideration (less justifiable.)

And reiterated, that you are sure the way to move forward is to do more learning, over more multi-modal data.

Well, that has the advantage of closely aligning with what maybe 90% of people in the industry currently believe. Maybe you’re right. Maybe HTM was completely wrong. Maybe that Google chatbot really did just suddenly become conscious, and all we need to do to achieve final AGI is to get yet bigger, feed ANN back-propagation gradient descent learning algorithms yet more data, build a speech recognition engine with 100 years of training data instead of 77 (Whisper?) Learn it 2^42 parameters as Hinton jokes…

Maybe all Elon Musk needs to do to finally achieve fully human level object recognition for driving really is to perfect his auto-labelling system, in his automated driving example generating system, so that he really can finally label every possible corner case of driving that could ever occur in the world, and then train his networks to recognize everything that’s been labeled:

https://www.youtube.com/watch?v=2cNLh1gfQIk&t=1241s

1 Like

Hmm, I doubt it:

In my book that means 4x lower sample efficiency. LLaMa I think used > 1T tokens for training.
Which, I think, again, is significantly more than what GPT-3 was trained with.

PS sorry the above quote is from here


Sorry again, I have some idea of what a chaotic system means; what I asked is what you mean by an AI being chaotic. Any particular algorithm/mechanism you can describe, or just a general feeling that natural intelligence exhibits chaotic properties, hence (as with backprop) if an artificial system doesn’t replicate “chaotism” it won’t be able to be intelligent?
Actually, your assertion is even more confusing: you say it has to generate chaos. What I learned is that some systems behave chaotically, but I haven’t heard of chaos generators. Or do you mean they are the same thing?


@neel_g DeepMind’s paper puts it even more clearly:

we find that for compute-optimal training, the model size and the training dataset size should be scaled equally: for every doubling of model size the training dataset size should also be doubled

which indeed means that, as the model size increases, sample efficiency gets lower.

The fact Chinchilla reached the same performance with fewer parameters but more training data is used to argue that the larger model could have made use of (much) more data. Yet that could be a problem.
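To put rough numbers on that rule (my own back-of-the-envelope sketch, using the ~20 tokens-per-parameter figure commonly quoted for Chinchilla; the constant is only an approximation):

```python
def chinchilla_optimal_tokens(params, tokens_per_param=20):
    """Approximate compute-optimal training tokens for a parameter count,
    under the 'scale model size and data together' rule of thumb."""
    return params * tokens_per_param

for params in (70e9, 140e9, 280e9):
    tokens = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params -> ~{tokens / 1e12:.1f}T tokens")
# Doubling the parameter count doubles the token budget under this rule,
# which is exactly the "data appetite grows with model size" point above.
```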

2 Likes

Natural language, with or without extra sensory grounding, is too complex for estimating how sample-efficient a model is, because we have little context.

Here’s a much simpler RL problem: an I-maze (which is actually an H turned 90 degrees; see page 12 for what that means) is a conceptually simple problem. A transformer model learned to solve it in 2 million time steps, which, since the game can be won in 20 steps, means it needed to play it 100k times to “figure it out”.

Which I doubt equates with “understanding” the problem. It should be solvable in a few dozen trials by humans, and probably by simpler animals too (dogs, cuttlefish, mice).

That’s what I call sample-inefficient: orders of magnitude more trials before “figuring it out”.

My statement is simpler. Transformers have quite some way to go before AGI. But they’re still closer to being a generalist agent than HTM. Insights gained from HTM may be useful, but their application as of right now leads to no breakthroughs, so I can’t really comment on their impact.

Self-driving won’t be solved anytime soon simply because the scale of the models which are deployed is pitiful. It’s like asking why we couldn’t stream 4K content in the 1990s. They just didn’t have the hardware back then to pull it off.

Models deployed on the edge are highly resource constrained. If Tesla ever wants to solve FSD, it would have to scale its models up by OOMs to start making a dent in the problem. They do recognize the problem, hence Dojo, but again it’s yet to be fully rolled out.

As I said,

Chinchilla scaling laws are simply a compute-optimal version of Kaplan et al. You could keep the dataset size constant and scale up parameters alone, still leading to improvements. The improvement would just be slower than what Chinchilla-optimal models can achieve.

The models being monolithic, however, has nothing at all to do with sample efficiency. It’s a question of the priors and inductive biases we bake into our networks.

Again, priors. I feel I’ve repeated myself hundreds of times now. Animals have biological priors carried down by evolution for adaptability and flexibility. Every time you start a transformer’s training, it’s learning from a blank slate. It makes no assumptions about its environment or data distribution.

When you do find optimal priors, like AdA, you quickly find transformers learning as fast as humans (if not faster) in real time, on tasks that humans hand-crafted. This is extreme sample efficiency, rivaling humans with a fraction of the compute and parameters. XLand 2.0 is also pretty complex; arguably the humans have more knowledge about how the 10^{40} possible tasks work than the agents themselves, yet AdA still performs on par.

All of this, just with ICL/few-shot and a frozen LM to boot. Simple learning through multiple trials about how the environment and its dynamics work, and optimizing its strategy to complete the task at hand better and faster. (I cannot stress this enough - watch the results reel!)

Transfer learning also helps; MGDT shows how the DT pre-trained on expert trajectories outperforms baseline, randomly initialized agents, as well as improving on the expert trajectories themselves. It’s much more efficient despite being just 1B parameters. There is a huge opportunity just in grabbing the low-hanging fruit here, and scaling it up to improve performance by magnitudes easily.

2 Likes

robf:

Maybe I’ll stop there and see if you are with me so far. Does what I’ve written above make sense?

Sorry again, I have some idea of what a chaotic system means; what I asked is what you mean by an AI being chaotic. Any particular algorithm/mechanism you can describe, or just a general feeling that natural intelligence exhibits chaotic properties, hence (as with backprop) if an artificial system doesn’t replicate “chaotism” it won’t be able to be intelligent?
Actually, your assertion is even more confusing: you say it has to generate chaos. What I learned is that some systems behave chaotically, but I haven’t heard of chaos generators. Or do you mean they are the same thing?

“Generate” chaos or “behave” chaotically, I’m not distinguishing here. You can take them to be the same thing in my expression. They’re dynamical systems, which generate behaviour, and that behaviour happens to have the extreme context sensitivity and resistance to abstraction of chaos.

For a concrete model, I don’t think the leap is too far. Actually I think the process will be very similar to what is happening now in transformers. Transformers also “generate” structure. They will be learning to do it in the same way. Which is to say they will be learning structure which predicts effectively: cause and effect.

This is something which the language problem constrains you to nicely. It constrains you to think about a cognitive problem as cause and effect.

So what’s the difference?

I’m saying the difference is that grouping elements according to how they effectively predict cause and effect will actually find structure which is not static. Not “learnable”. Instead it will find (generate) structure which changes dynamically from one moment to another. The groupings of the network, the hierarchies it finds/generates, will change from moment to moment and problem to problem.

Transformers don’t look for such dynamically changing network groupings/structure. They assume the network structures they find will be static. They must. Because the mechanism they use is gradient descent to find energy minima (where energy minima means the groupings predict the next element maximally.)

It’s a different assumption about the nature of the system you might find. That is all. If you assume one kind of structure, that is the structure you will find. If you assume what you find will be static, you will only find the static bit.

I’m saying we use the same principle of grouping according to effective prediction, but drop the assumption that structure will be static.

Dropping the assumption the structure you find will be static means you can’t use gradient descent. It means we need some other way of finding groupings of elements in a network which are predictive energy minima.

So the problem becomes, how do we find predictive energy minima in a network of observed (language) sequences, when we can’t simply track a fixed energy surface using gradient descent?

The solution to this problem for me came when I found this paper:

A Network of Integrate and Fire Neurons for Community Detection in Complex Networks, Marcos G. Quiles, Liang Zhao, Fabricio A. Breve, Roseli A. F. Romero

https://www.fabriciobreve.com/artigos/dincon10_69194_published.pdf
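To give a flavour of the general idea, here’s a toy sketch (my own illustration only, not the paper’s actual algorithm, and all the parameters are made up): leaky integrate-and-fire units coupled along the edges of a graph tend to fall into step within densely connected groups, so the “communities” show up in which potentials co-vary, with no gradient descent over a fixed energy surface anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per, n_comm = 12, 2
n = n_per * n_comm
community = np.repeat(np.arange(n_comm), n_per)

# Random graph: dense within communities, sparse between them.
p = np.where(community[:, None] == community[None, :], 0.9, 0.05)
adj = (rng.random((n, n)) < p).astype(float)
np.fill_diagonal(adj, 0)

drive = rng.uniform(0.08, 0.12, size=n)        # heterogeneous intrinsic drive
v = rng.random(n)                              # membrane potentials
coupling, threshold, leak = 0.15, 1.0, 0.98
steps = 3000
trace = np.zeros((steps, n))
fired = np.zeros(n)

for t in range(steps):
    # Leak, intrinsic drive, plus a kick from neighbours that just fired.
    v = leak * v + drive + coupling * (adj @ fired)
    fired = (v >= threshold).astype(float)
    v = np.where(fired == 1, 0.0, v)           # fire and reset
    trace[t] = v

corr = np.corrcoef(trace.T)
same = community[:, None] == community[None, :]
off_diag = ~np.eye(n, dtype=bool)
print("mean correlation within communities :", corr[same & off_diag].mean())
print("mean correlation between communities:", corr[~same].mean())
# Within-community correlation is typically much higher: the groupings emerge
# from the firing dynamics themselves, not from a learned, fixed energy surface.
```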

3 Likes

Nope, it’s not. The I-maze problem has a 50% chance of being solved correctly at random, and once you learn the “trick” there’s no way to fail it, unless you’re a total idiot. At least that’s how humans do it. And it has little to do with priors, since, well, it is a very artificial problem.

The DL models they test do not exhibit this ramp - ~50% success followed by a sudden jump to 100% within a few (like a dozen) trials, when they “get” the clue. For ~100,000 games (at least 2M steps) they show a gradual improvement. Which means they never get any clue; they solve it statistically. Which you glorify here as the success of priors.

AdA sure is interesting but does not settle the case. The sample inefficiency is about how long it takes until it finds those optimal priors. If you want the same model to play decent chess too, it needs to be trained on another few million chess moves. The priors in one game do not apply in another, not as fast as for us. Yes, we have some important instinctive priors, but much fewer and more “general” - the part of the genome encoding brain structure isn’t too large, so it cannot count as pretraining. Whatever priors we have (if you claim it’s only a matter of priors), those priors are very good at sample efficiency across a wide domain.

My bet is it’s about a different learning mechanism, not just priors, and large DL models simply ignore these mechanisms.

2 Likes

Yes, there’s something to causality, but it might be deeper. I don’t think it is our language skills that shape our minds to conceive causality.
I think the assumption of causality is one of those important priors which shapes everything else about our minds, language included.

That assumption is so strong we gladly assert any correlation is in fact a causation (logically wrong, but when a logical fallacy saves our ass, it IS evolutionarily correct).
Look at how prone we are to superstitions and believing dangerously silly stuff like witchcraft.

The big question is how one could engineer a speculative causality prior in a machine learning framework. I call it speculative because that’s what it is: we make assumptions, and when one is confirmed we take it as “correct” for as long as it fits the data.

Regarding the chaos discussion - that could be a fertile one … we’re better off moving it to another topic. Where I encountered it was in reservoir computing: the closer to the chaotic threshold a reservoir is, the better it gets at modelling its input signals, but not if it passes that threshold. Reservoirs need to be almost chaotic to be useful, at least in that particular case.
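For anyone who hasn’t met reservoir computing, here’s a bare-bones sketch of what I mean (my own minimal illustration, not any particular paper’s code; the task and sizes are made up). The recurrent “reservoir” weights are random and never trained, only rescaled so their spectral radius sits just below 1, i.e. just under the chaotic threshold; only a linear readout is fitted:

```python
import numpy as np

rng = np.random.default_rng(1)
n_res, spectral_radius = 200, 0.95

W = rng.standard_normal((n_res, n_res))
W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # tune toward the edge of chaos
W_in = rng.uniform(-0.5, 0.5, size=n_res)

def run_reservoir(u):
    """Drive the reservoir with a 1-D input sequence and collect its states."""
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        x = np.tanh(W @ x + W_in * u_t)
        states.append(x.copy())
    return np.array(states)

# Toy task: predict the next value of a noisy sine wave.
t = np.arange(2000)
u = np.sin(0.1 * t) + 0.05 * rng.standard_normal(len(t))
X = run_reservoir(u[:-1])
y = u[1:]

# Linear readout fitted by least squares - the only trained part.
W_out, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ W_out
print("readout MSE:", np.mean((pred - y) ** 2))
```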

2 Likes

I remain convinced that the key difference is that the animal brain is continually creating models of reality. You (or a rat) look at a problem, try a few things, pick a known model that might work, tweak it a bit, solve the problem and move on. Next time that model gets used earlier and the solution is found sooner.

Until our AI builds explicit models of reality, it will continue to fail on many things we do so well. And we have absolutely no idea how to do that, do we?

2 Likes

Yes, actually we have one grand model of reality and update it here and there - wherever it fails. I won’t say we have a whole idea of how that works, but even the fact that we know that only small incongruent parts are updated might be a useful clue in an attempt to replicate it.

try a few things, pick a known model that might work, tweak it a bit, solve the problem and move on.

Yes, I think this is the case, yet on a massive scale. My assumption is that out of the ~100M columns, 1% are stabilized, shaping the grand “world model”, while the remaining 99M are speculating. It’s another kind of search engine, but it searches not only within the stable model: in a massively parallel manner it searches for correlations between the model and details within current experience through raw speculation. Maybe not raw (as in random) but loosely informed by prior knowledge.

And I think the above assumption isn’t too hard to test, in code.

2 Likes

Think of it this way: when you see a maze (or are in one), the grid cells fire and you have a spatial understanding. You know things which are extremely basic, but the network does not. For instance, a single look at the maze and you can deduce the correct path. But why should you go towards the vertical stem of the I-maze? Why not just keep going horizontally? Well, the answer is obvious - you would hit the wall.

Now why would I not progress on hitting the wall? If I press the right arrow, then I move. But near that wall, this rule breaks down. Why? Because walls are immovable and there’s no way you can pass through them. I.e., at the edge of the maze you simply cannot traverse, let’s say, the blacked-out area.

These are assumptions. You make millions of them every time, every day. There are certain priors in this world - such as that things repeat. If I drew a pattern here:
1,2,1,0, -1, -2, _, _

Most would answer that the blanks are -1, 0. But that’s a prior. This is like a sine wave, and we find that in this world there’s lots of periodic motion. The sun rises and sets. Day and night follow, etc.

But Imagine an Alien who lived in a mirror world. As in, their world was covered by mirrors and they survived there. In that world, things aren’t periodic but are symmetrical. Mirror images of each other. So they might look at it, and say the solution is obviously:

1,2,1,0, -1, -2, ||| -2, -1, where ||| is a mirror.

From their POV, the solution is obvious too. Why should it repeat, when it should clearly continue to form a symmetry?

The exact same happens here. You, as a human, can tell instantly what the correct way to solve the maze is, because of those same priors. Here’s an idea - what if I give the agent +1000 reward if it hits the wall 3 times in a row? A human would never figure that out - why would you intentionally make the same mistake thrice? But the agent would, due to the simple \epsilon-greedy strategy which explores the environment. The agent doesn’t think it’s stupid, because it doesn’t know how things work. That’s what it’s finding out. For it, hitting the wall makes perfect sense if it gets the +1000 reward.
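For reference, \epsilon-greedy is nothing more than this (a generic sketch, not any particular paper’s code; the action values are hypothetical):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore: try anything, even the wall
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploit: best known action

# Hypothetical action values for [up, down, left, right] in one maze state.
q = [0.1, -0.2, 0.0, 0.4]
actions = [epsilon_greedy(q, epsilon=0.2) for _ in range(1000)]
print("fraction of 'right' picks:", actions.count(3) / 1000)
# With epsilon = 0.2 the agent still takes a random action 20% of the time,
# so "stupid-looking" moves are a built-in part of how it discovers rewards.
```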

So every time you let the agent loose, it’s spending most of its iterations learning that. “OK, pressing right moves me in this vague direction. Black areas are some weird things which I can’t escape from, but which waste time when I collide with them - avoid them. I travel this area of pixels to reach the green one. Oh! I just got a reward. Maybe reaching those pixels gets me the reward?”

Which is why transfer learning helps so much. The agent already has some spatial and visuo-temporal priors, maybe even a map-like representation (even simple DQNs learn that, according to the literature). It’s why sample efficiency is directly dependent on priors. Because assumptions help - a lot more than you think.

Oh, but that’s the whole point of these DTs, my friend. Priors do apply to other tasks; pre-trained networks a) learn faster, b) are more accurate, and c) often outperform the very demonstrators they were trained on. See the MGDT I cited above.

This behavior doesn’t even require huge scale to study!

2 Likes

You should look more carefully at the I (or H) labyrinth problem. Having spatial priors can’t increase an agent’s likelihood of winning above 50%.

And I don’t think a decision transformer pre-trained on 1B (!!!) Atari gameplay experiences would need many fewer samples to play decent chess than the same millions of games it takes to reach the same level of performance training on chess from scratch.

PS Chess is hard, just let a DT (pretrained or not) play the H labyrinth.

1 Like

A little busy these last two days so haven’t been able to follow this thread fully.

cezar_t, March 6:

robf:

This is something which the language problem constrains you to nicely. It constrains you to think about a cognitive problem as cause and effect.

Yes, there’s something to causality, but it might be deeper. I don’t think it is our language skills that shape our minds to conceive causality.
I think the assumption of causality is one of those important priors which shapes everything else about our minds, language included.

Agreed.

I don’t think it is that our language skills shape our minds to think causally. I just think language might be the simplest place to observe this more fundamental causal bias in our cognition. I believe it is most transparently visible in language, because language is the data set most directly generated by the brain itself. It makes sense that the brain would pare down unnecessary information as much as possible in the signal which it produces itself, to be consumed by another perceptual system of the same kind.

So I think a more fundamental bias is just more easily seen in language.

But yes, I don’t think language is fundamental at all. (I’ve actually spent a lifetime learning different languages to explore that idea for myself, subjectively!)

The big question is how one could engineer a speculative causality prior in a machine learning framework. I call it speculative because that’s what it is: we make assumptions, and when one is confirmed we take it as “correct” for as long as it fits the data.

I think it might be easy to “engineer a speculative causality prior in a machine learning framework”. We can do it with the path shown by the simple example of language in the first instance: just build a network of perceptual stimuli. Observations. Observations in sequence. In the first instance build a network of language sequences. Because it is a very simple example. Not because it is anything fundamental.

Then the “speculative causality” might just be groupings in that network of perceptual observations (energy maxima/minima) which tend to share predictions - i.e. appear to have shared causality.
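A toy sketch of that first step (purely illustrative: tiny made-up sentences, and “shared prediction” reduced to shared neighbouring words):

```python
from collections import defaultdict

# Tiny toy corpus (illustrative only).
sentences = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the cat ate the fish",
    "the dog ate the bone",
    "a cat chased a bird",
    "a dog chased a ball",
]

# Build a network of observed sequences: for each word, the set of
# (previous word, next word) contexts it appears in.
contexts = defaultdict(set)
for s in sentences:
    words = s.split()
    for i, w in enumerate(words):
        prev_w = words[i - 1] if i > 0 else "<s>"
        next_w = words[i + 1] if i < len(words) - 1 else "</s>"
        contexts[w].add((prev_w, next_w))

def shared_prediction(a, b):
    """Overlap (Jaccard) of the contexts two words appear in."""
    ca, cb = contexts[a], contexts[b]
    return len(ca & cb) / len(ca | cb)

# Words that predict (and are predicted by) the same things group together:
print("cat ~ dog   :", shared_prediction("cat", "dog"))
print("cat ~ chased:", shared_prediction("cat", "chased"))
```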

Regarding the chaos discussion - that could be a fertile one … we’re better off moving it to another topic. Where I encountered it was in reservoir computing: the closer to the chaotic threshold a reservoir is, the better it gets at modelling its input signals, but not if it passes that threshold. Reservoirs need to be almost chaotic to be useful, at least in that particular case.

Great! Yes. Reservoir computing. Total agreement. That’s my current guess at an origin story for “cognition” (along the energy minima/maxima, prediction-enhancing grouping lines I’m describing). I imagine cognition started as a kind of “echo state machine” over even very simple early nervous systems, imprinting events and consequences, causes and effects, as a kind of “echo mechanism”. And then evolution simply enhanced that cause and effect echo state network, by evolving to enhance the “echo” mechanism. Some kind of simple amplification of the “echo”. Amplification by “stacking” events which shared causes and effects. And that “enhancement” is what we now call “concepts”. Proto-“concepts” might just be groupings of things which tend to share causes and effects. And found by seeking energy minima in networks of sequences of observations.

For language, those “stacks” would be hierarchies of groupings of words which share causes and effects within a text. That’s what linguists found useful for building “grammars” for unknown languages way back in the '20s/'30s. And it will be what is implicitly found in transformers. But the general case might apply to any kind of perceptual sequence. Tactile, whatever. (I recall it was one of Jeff Hawkins’ insights for HTM that touch also requires the skin to move back and forth over a rough surface. Perceptual sequence, but perceptual sequence for touch, this time. You don’t perceive texture unless you move your finger… So HTM was on to this importance of sequence thing, early… Also, we don’t have any general explanation yet for why our visual system resolves images using saccades…) It would just be more easily identified for language. Only olfaction might be an exception; I don’t think olfaction uses sequence. (Though my namesake Walter Freeman did a lot of interesting work on chaotic behaviour in olfaction networks… Just not sequential??)

These groupings (starting with language) will be what are “learned” by transformers. But transformers assume the energy surfaces described by such groupings are static and can (must!) be found using gradient descent. In the context of reservoir computer/echo/liquid state machines, the energy surfaces described by these prediction enhancing “stacks” of shared cause and effect observations, would not be static. They might correspond more to dynamic clusters such as those described in the Brazilian paper.

Anyway, good, yes. Some interesting similarities of thinking. Would be happy to explore it further.

Happy to contribute to a separate thread on any of these themes if you want to create one.

3 Likes

I’m also very interested to hear more of your thoughts regarding language, causes & effects, and etc., already felt a great deal of inspiration from your posts in this thread!

Maybe the semantics part can be fundamental? As the syntax on the surface and even rule-based grammar are not apparently fundamental.

I have “rational thinking” compared to “daily language usage” in my mind, e.g. one usually says “there’s a straight line sharply rising”, but with formal math language, one can give y = 5x + 3 to state it more precisely. I suppose math, as a language, is one (or even more) orders/levels higher than the usual “natural languages”? But semantically one can always answer that y is 48 when asked “what if x is 9”? Maybe less accurately and with more vague phrasing, but as well as the in-situ pragmatics demands.

What if “the language skill” is “causality thinking” itself? Language is but the vehicle for communication, and the mere goal of communication is to change your counterpart’s mind somehow. Then what else can it be, other than your communication be the “cause”, and the mind-change of your counterpart be the “effect”?

Further, for a successful communication, i.e. to convince your counterpart of any idea, can causal relations be avoided entirely? I suspect not. Obviously not when telling someone rules/laws, and even when telling some fact, there is an implication that “since you live in this world, you must be interested in that the world has …”. Even when telling some fictional fact, the implication can be: “in case you enjoy thinking about subject X, it appears …”.

“Causality” makes people “believe”, if not with causality semantics, I doubt one’s mind can be changed any other way.

Along this line of thought, I would suggest our (humans’) modeling of the world is not a compact/precise description of the underlying (possibly chaotic) function at all, but a (good enough) approximation. We can apply Newtonian physics perfectly okay for daily living; even though we know it’s incorrect (compared to general relativity, and quantum mechanics in turn, and …), we can still believe it’s universally true where it applies.

We humans are capable of “learning” where the chaotic boundaries are, and enjoy smooth (trivially approximatable) continuums between them. I suggest that there are “deep knowledges” which are about where/when/how (including at what scale) some “shallow knowledge” applies. Contemporary AIs seem to have learnt the shallow knowledges well, but not at all the deeper ones.

1 Like

The 1B refers to the parameter size, not dataset :slight_smile: For comparison, 1B models are so small that you can train them on your own on gaming hardware. A free Colab can get you to ~2-3B easily.

And yes, if you do that pre-training and fine-tune on some other game you do require less demonstrations. That is “positive-transfer” and there are overwhelming results for it.

Coincidentally, there was a work published today which happened to prove all those claims for embodied multi-modal LMs: https://twitter.com/DannyDriess/status/1632904675124035585

  1. PaLM-E exhibits positive transfer, outperforming SOTA on a VQA (Visual Q/A) task by ~10% more.
  2. Scaling leads to a drastic drop in catastrophic forgetting; despite being fine-tuned on a totally different domain, it retains NLU/NLG performance proportional to the scale of the model.
  3. LLMs can handle a variety of domains and cross-transfer between them, again with scale. PaLM-E for instance handles language, images, an embodied robot, and a combination of either very well.

and the kicker is that PaLM isn’t even the best model - we’ve yet to Chinchilla scale it, or apply MoD training which further transcends scaling laws with a mere fraction of the compute (often ~0.1% of what’s employed in the base LLM)

2 Likes

I am agreeing with @complyue on this:
I’m also very interested to hear more of your thoughts regarding language, causes & effects, and etc., already felt a great deal of inspiration from your posts in this thread!

I find your proposal that HTM’s predictive properties may be a good base on which to construct a transformer-like architecture very interesting. I see the higher-level connection structures of the brain as being a little like what I understand about multi-head attention, so this fires my imagination.

1 Like