"How Your Brain Organizes Information" video

A first approximation would be to compare them to the data they are “learned” from. Does anyone have numbers for that: the number of parameters “learned” by a contemporary system, and the size of the data set used to train it?

Here, there’s an analysis from Wikipedia:

“Sixty percent of the weighted pre-training dataset for GPT-3 comes from a filtered version of Common Crawl consisting of 410 billion byte-pair-encoded tokens.[1]: 9 Other sources are 19 billion tokens from WebText2 representing 22% of the weighted total, 12 billion tokens from Books1 representing 8%, 55 billion tokens from Books2 representing 8%, and 3 billion tokens from Wikipedia representing 3%.”

So that’s 500 billion or so tokens abstracted into 175 billion parameters? I guess you could say that’s significantly smaller: about a third. For comparison, gzip apparently achieves “90% for larger text-based assets”.
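That back-of-envelope arithmetic can be sketched quickly. The figures below are the approximate public GPT-3 numbers quoted above, treated here as assumptions:

```python
# Back-of-envelope ratio of parameters "learned" to tokens trained on,
# using the approximate public GPT-3 figures quoted above.
params = 175e9   # reported GPT-3 parameter count
tokens = 499e9   # reported training tokens (Common Crawl + WebText2 + Books + Wikipedia)

ratio = params / tokens
print(f"parameters / tokens = {ratio:.2f}")  # roughly a third
```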

It seems “Open” AI has not released details of the training data for GPT-4. But I find reports of 170 trillion trained parameters? That’s 170,000,000,000,000?

This article about Bard says the data set was 1000 times bigger than before. So if OpenAI’s increase going to GPT-4 was comparable, the ~500 billion tokens trained on would now be ~500 trillion? That would be about the same ratio: trained parameters about a third of the data set size?

Maybe you could justify an argument that it is finding a smaller system, based on those numbers.

If anyone has better data I would be interested to see it.

My hypothesis would be that at some point the data set would hit a hard ceiling of all the information on the planet! But that with a sufficient training budget, the “parameter” size would continue to increase.

I’m also comparing the size of the data set used to train these things and the amount of data typically absorbed by an infant. I sketched an argument around that in this earlier thread:


That is a bit confusing. Here is what is there:

  • trainable parameters: the fixed number of weights (matrix entries) the model has. These are changed during the training process.
  • training data: the amount of text used to train the parameters above, measured in tokens.

Some empirical evidence suggests an “optimal” ratio between the two: about 20 data tokens per parameter according to Chinchilla, or even larger (70:1 or more) according to LLaMA.

The latter is more expensive to train but aims to produce smaller (= cheaper to run) pre-trained models that are more accessible to mere mortals.
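The two rules of thumb above can be sketched in a few lines. The helper function is hypothetical; the 20:1 and 70:1 ratios are the figures mentioned above:

```python
def optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb training budget: ~20 tokens per parameter (Chinchilla)."""
    return n_params * tokens_per_param

# Chinchilla itself: 70B parameters -> about 1.4T tokens.
print(f"{optimal_tokens(70e9):.2e}")
# A LLaMA-style over-trained budget at 70 tokens per parameter:
print(f"{optimal_tokens(70e9, 70):.2e}")
```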

PS here you can find a table listing both the number of parameters and the training dataset size for various models. Notice the “explosion” of models in March 2023.


Some very important correlates of consciousness in the paper’s highlights:

Semantic processing requires distinct neural mechanisms with different brain bases.

This is just NCC, Neural Correlates of Consciousness.

These include referential, combinatorial, emotional-affective and abstract-symbolic semantics.

The basis of feelings is the way the mind interprets affects.

All semantic processes are grounded; only some are embodied in action and perception.

Embodiment is the source of metaphorical relationships interpreting action and perception (Lakoff).

Neural processes of disembodiment are explained by neurobiological principles.

Again, NCC.


Thanks for that. That’s good data. Here’s my “back of envelope” calculations on that:

Model             Date            Parameters          Training data                       Params/tokens
BERT              2018            340 million[19]     3.3 billion words                   ~0.1
GPT-2             2019            1.5 billion[22]     40GB[23] (~10 billion tokens)       ~0.15
GPT-3             2020            175 billion[11]     499 billion tokens                  ~0.35
GPT-Neo           March 2021      2.7 billion[27]     825 GiB                             ??
GPT-J             June 2021       6 billion[30]       825 GiB                             ??
Megatron-Turing   October 2021    530 billion[32]     338.6 billion tokens                ~1.6
Ernie 3.0 Titan   December 2021   260 billion[33]     4 TB                                ??
Claude[34]        December 2021   52 billion[35]      400 billion tokens                  ~0.13
GLaM              December 2021   1.2 trillion[37]    1.6 trillion tokens                 ~0.75
Gopher            December 2021   280 billion[38]     300 billion tokens                  ~0.9
LaMDA             January 2022    137 billion[40]     1.56T words[40] (168 billion tokens) ~0.8
GPT-NeoX          February 2022   20 billion[41]      825 GiB                             ??
Chinchilla        March 2022      70 billion[42]      1.4 trillion tokens                 ~0.05
PaLM              April 2022      540 billion[43]     768 billion tokens                  ~0.7
OPT               May 2022        175 billion[44]     180 billion tokens                  ~0.95
YaLM 100B         June 2022       100 billion[46]     1.7TB                               ??
Minerva           June 2022       540 billion[47]     38.5 billion tokens                 ~14
BLOOM             July 2022       175 billion[14]     350 billion tokens (1.6TB)          ~0.5
AlexaTM           November 2022   20 billion[51]      1.3 trillion tokens                 ~0.015
LLaMA             February 2023   65 billion[54]      1.4 trillion tokens                 ~0.05
GPT-4             March 2023      Unknown[f]          Unknown                             ??
Cerebras-GPT      March 2023      13 billion                                              ??
Falcon            March 2023      40 billion                                              ??
BloombergGPT      March 2023      50 billion          363 billion tokens                  ~0.14
PanGu-Σ           March 2023      1.085 trillion      329 billion tokens                  ~3.3

Seems mixed. Going from ~90% compression for BERT, to later models with some very large expansions, like a ~1400% expansion for Minerva, but also ~95% compression for LLaMA.

My rough figures might be off, though. Corrections to the arithmetic welcome.
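The ratio column can be spot-checked with a quick script, using the public parameter and token counts quoted in the table:

```python
# Spot-check of the params/tokens ratio for a few rows of the table above.
models = {
    "GPT-3":      (175e9, 499e9),
    "Chinchilla": (70e9, 1.4e12),
    "Minerva":    (540e9, 38.5e9),
    "LLaMA":      (65e9, 1.4e12),
}
for name, (params, tokens) in models.items():
    print(f"{name:12s} {params / tokens:6.2f}")
```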


The term “compression” you use here is not quite right, since it compares apples to oranges: a token represents a text word (or part of one), while a parameter is a number, usually a 16- or 32-bit float during training, which can be quantized to 8 or even 4 bits for inference.

PS beware that each input token is expanded to a vector of 1-10k values before being “crunched” by the network. So it is enlarged to a couple of dozen kilobytes in large models.
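As a rough illustration of that expansion (the 12,288-dimensional embedding is GPT-3’s reported width, used here as an assumption, with 16-bit floats):

```python
# Approximate in-memory size of one embedded token.
embedding_dim = 12288     # GPT-3's reported embedding width (assumption)
bytes_per_value = 2       # 16-bit floats

bytes_per_token = embedding_dim * bytes_per_value
print(bytes_per_token)    # 24576 bytes, i.e. about 24 KiB per token
```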

PS2 Minerva starts from the pre-trained PaLM and trains it further on math/science data, i.e. domain-specific fine-tuning. Look at the comments in the last column.

The “PT” in GPT stands for “Pre-trained Transformer”, which means anyone may continue training it with far fewer tokens (than were used originally) to specialize it in a specific domain. Well, assuming you can get the model and afford the extra 1-5% computing cost.


Yeah, sure. A very general sense of “compression” used there.

I’m hypothesizing a general trend anyway. And that seems mixed. But with some blow outs, which might indicate parameters increasing even as data size remains constant.


The measuring unit used is the tokens/parameter ratio. The Chinchilla experiments found an “optimal” value of about 20:1 for it. They showed that the 70B-parameter Chinchilla slightly outperformed a 280B model trained on ~4.7x fewer tokens (300B vs 1.4T), with both using the same compute budget for training.
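The “same compute budget” point can be sketched with the common approximation C ≈ 6·N·D FLOPs for training compute (N = parameters, D = tokens); the model figures below are the rough public ones:

```python
# Training compute is often approximated as C ~= 6 * N * D FLOPs.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

gopher     = train_flops(280e9, 300e9)    # 280B params, 300B tokens
chinchilla = train_flops(70e9, 1.4e12)    # 70B params, 1.4T tokens
print(f"{gopher:.2e} vs {chinchilla:.2e}")  # same order of magnitude, ~5e23 FLOPs
```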

“Optimal” means that if you reduce the parameters further, and increase the number of tokens proportionally, performance starts to drop.

However, the LLaMA experiment said “so what? Let’s make the best small models; let’s train them at around a 70:1 token-to-parameter ratio.” The models still improved, sub-optimally, but the goal was to obtain a transformer so small it can run on “common” hardware.
So they obtained 7B to 65B models that are the best performers in their lightweight category.

PS The smaller LLaMA models were trained at ratios well above 20:1.

A model trained on 4 times the tokens only slightly outperformed one trained on 4 times the parameters? The larger data set is still better, but only slightly?

The question was, as @roboto hypothesized, whether the parameters represent any kind of simplification of the data.

Or, as I hypothesize, the parameters actually represent a kind of expansion of latent structure in the data, and there’s no limit to how much increasing them might improve the model.

If parameters were a simplification of the data, you might expect the number of meaningful parameters to decrease as the data size decreased. If you’re simplifying something, it seems reasonable to assume the result is smaller. But this seems to indicate that “simplifying” works much the same as just increasing the data size. With this result you might argue that calculating more parameters “expanded” structure to a degree roughly comparable (4x) to the degree to which the data size was reduced (1/4). Calculating more parameters wasn’t simplifying more; it was expanding in roughly the same way that adding data would.

I’m hypothesizing the models will continue to get better as the number of parameters are increased, even if data size is limited. This seems to me to be consistent with that.

It strikes me that the sense of “optimal” in this 20:1 ratio is mostly about compute overhead. It’s a floor on the number of parameters they can get away with, not a ceiling on the number of parameters which would be useful. They always want fewer parameters because fewer is cheaper and smaller, and 20:1 is as far as you can push that before your model really starts to decline (though LLaMA pushed it?). But what about just increasing the parameters?

Has there been any limit noted to improvement with increase in the number of calculated parameters? Other than that it is not “optimal” because your model gets even more expensive and large.

From what I’m seeing here it would seem the evidence is consistent with the idea that just increasing the number of parameters infinitely would increase model performance in a way proportionate to that which infinitely increasing data size would. To the extent it’s been observed, it’s been proportional. Just that there’s a 20:1 floor on the number of parameters.

Interesting. So a “simplification” (if that’s what parameters are) of a larger number of tokens always needs to be larger… That’s already a suspect sense of “simplification”. And it’s roughly in proportion.

So that’s numbers for decreasing parameters in proportion to an increase in tokens.

What about increasing parameters in proportion to a drop in tokens?

For Chinchilla you said that resulted in a model which was only “slightly” worse for proportionately fewer tokens.

But they didn’t push this. They didn’t try just increasing the number of parameters?

I guess they couldn’t get numbers for that, because it would mean collecting more data to compare.

I’m guessing they haven’t been so interested in doing it, either. That probably seems like a dumb idea. Because loading up on the token side is going to result in cheaper and smaller models. If it’s all proportional, why would you load up on the side which results in larger and more expensive models?

But it does seem from these numbers you have given, that just endlessly churning out parameters might result in models which continue to get better, and in a way proportional to what you would achieve by endlessly increasing the data size.

Seems that way to me, anyway. Is there something I’m missing?


It’s an expansion of alternative simplifications. Actual simplification happens when a small subset of these params is activated in specific use cases.
Think of it as multiple compression algorithms working in parallel. Each will compress an input, but the sum of these compressions may be even longer than the input.
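That “parallel compressions” picture can be shown with a toy sketch: several alternative “views” of the same input (here, byte rotations, purely for illustration) each compress individually, yet storing all the compressed alternatives takes more space than the input itself.

```python
import zlib

# Toy illustration: each alternative view of the input compresses on its
# own, but the sum of the compressed views exceeds the original input.
text = (b"A language model learns many overlapping regularities from its "
        b"training data. Any single regularity is a simplification of the "
        b"data, and on its own it describes the data more compactly. Stored "
        b"together, however, the full set of alternative simplifications can "
        b"take more space than the data they were learned from, because the "
        b"alternatives overlap and partly contradict one another.")

# Eight-ish alternative "views": rotations of the same byte string.
step = len(text) // 8
views = [text[i:] + text[:i] for i in range(0, len(text), step)]
sizes = [len(zlib.compress(v)) for v in views]

assert all(s < len(text) for s in sizes)  # each view compresses individually
assert sum(sizes) > len(text)             # but together they exceed the input
print(len(text), sum(sizes))
```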


Well, increasing # of parameters was already known to improve results for a given training dataset.

The gorilla in the room with more parameters is that deploying the resulting model is prohibitively expensive, in both memory and compute.

So the research goal with Chinchilla, LLaMA and derivatives was to deliver smaller models that are as expensive (or more) to train than a large model but much cheaper to use.

This isn’t the end of it by far; there’s a whole species explosion happening right now, all attempting to address the limitations of previous (= last month’s) generations.


Right. Yes. Thanks. “Alternative simplifications”. Maybe that’s the way to say what I’ve been trying to say. That the system of language, and of cognition and meaning, resolves into an almost unlimited number of “alternative simplifications” of the world.

“Alternative” only making sense if the “simplifications” contradict in some way (otherwise they wouldn’t be alternative?)

And that instead of trying to laboriously and expensively expand all these “alternative simplifications” during a single “learning” phase, we should change our understanding: we need to concentrate on finding only the successive, single “alternative simplification” of raw sequences of language which is relevant to a given context, at the time that context is encountered.


Right. So the question becomes whether the improvement with number of parameters tapers off at some point, or if it continues indefinitely.

If it continues indefinitely, that’s what I’m saying.

Maybe that’s a way to put what I’m saying. That the useful parameterizations of language (and all cognitive data) grow indefinitely.

That fits my argument that what we need to do is to calculate the “parameters” relevant to a given situation, at the time that situation appears.

That’s the only way to tame such a gorilla.

And it’s actually a good thing, because you can get that gorilla to do an infinite variety of things. You just train it as you go, rather than trying to teach it everything you might need, before you know you might need it.


I guess it’s more interesting to take a shot at a different type of AI than to try to overcome the limitations of, and improve, transformers. Simply because the latter is a very crowded train. Though understanding as much as possible about them might be useful.


It’s basically exploring multiple scenarios, which may occur in different times/places. There is no one way of defining “context”, possibilities expand with the system’s knowledge. So it will always be continuous generation and pruning of alternative simplifications / patterns, the only way to manage it is cost-benefit analysis.

That’s fine-tuning. You don’t know what you might need, else there is no need for learning at all. But it should be far more efficient with segmented / clustered models, basically mixture of experts. And I personally hate learning by backprop.


Glad that we agree cognitive structure will “expand” into “continuous generation … of alternative simplifications/patterns”.

Currently managed by a kind of cost-benefit analysis, I agree. They go for as many parameters as they can afford. Though without much awareness. Most of the field surely believe with @roboto that they are finding simplifications. And that prevents them seeking expansions more efficiently.

You probably missed that I sketched (again) my alternative to backprop, earlier in this thread:


No idea what that means, sorry. You are talking about STM in SNN? This doesn’t seem to be related to either fitting or indirect generalization performed by backprop. Minimized energy means that activity in the network should die down, and you have it maximized instead.

My alternative is connectivity clustering, basically lateral graph composition. That should happen in all layers in parallel, feedback will only adjust hyperparameters. But it’s quite complex and not at all neuromorphic.

STM… State transition matrix??

I’ve minimized the disorder. Minimized the prediction “energy”, or entropy. Minimize the entropy, maximize the order. Minimize randomness. Maximize the shared connectivity. Maximize the shared prediction.

The minimum of one thing is always the maximum of another.

The point is that states which oscillate together will have the same connectivity. It will give you the clustering of states which share the same connectivity. And the clustering of states which share the same connectivity is the clustering of states which best helps you predict the next state.
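A toy sketch of that claim, using a tiny made-up corpus: tokens are grouped by the set of tokens that can follow them, so tokens with the same “connectivity” (the same successors) land in the same cluster.

```python
from collections import defaultdict

# Group tokens by their successor sets in a tiny toy corpus: tokens that
# share connectivity (the same set of next-states) form one cluster.
corpus = "the cat sat . the dog sat . a cat ran . a dog ran .".split()

successors = defaultdict(set)
for a, b in zip(corpus, corpus[1:]):
    successors[a].add(b)

clusters = defaultdict(list)
for token, nxt in successors.items():
    clusters[frozenset(nxt)].append(token)

for nxt, toks in clusters.items():
    print(sorted(toks), "->", sorted(nxt))
```

With this corpus, “cat” and “dog” cluster together (both can be followed by “sat” or “ran”), as do “the” and “a”: the clustering by shared connectivity is exactly the clustering that predicts the next state.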

Minimizing the prediction entropy doesn’t seem related to generalization performed by backprop in transformers?

Backprop will optimize (max or min) anything. It’s just finding where the slope flattens.
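That “finding where the slope flattens” can be shown in miniature with plain gradient descent on a one-dimensional function (a toy, not anything specific to transformers):

```python
# Minimal gradient-descent sketch: step against the numerical slope of
# f(x) = (x - 3)^2 until the slope flattens at the minimum.
def slope(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: (x - 3) ** 2
x, lr = 0.0, 0.1
for _ in range(200):
    x -= lr * slope(f, x)

print(round(x, 3))   # converges to 3.0, where the slope is flat
```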

If backprop is seeking to maximize prediction (or minimize the entropy of prediction), as in transformers, then the generalization performed will be exactly that of states which share predictions.

That’s what they’re getting. They’re getting groups of elements, also reaching back in the sequence using “attention”, which share predictions.

They use backprop because that’s the only tool they’ve ever had. It’s more about the tool than the task. It kind of started producing something special by accident when they applied it to language sequences. Because language forced them to apply it to sequences? No-one knows.

They never imagine the maximally predictive clusters might contradict. They don’t really know what kind of network they’re generating. Any contradiction is resolved by “attention” without them thinking about it. Big surprise when “attention” suddenly seemed important.

But the things just keep getting bigger and bigger. Also no-one knows why.

That’s what we do in this business these days, just blindly follow the direction which is working, for now, hoping the slope eventually flattens.



Short-term memory, reverberation in neuronal ensemble resulting from recognition / classification.
This is different from clustering, where clusters are iteratively reformed through competitive learning to maximize the mutual similarity of their elements. Similarity is a far more descriptive term than all that obscurantist physics envy: energy, entropy, disorder, randomness, etc.
This iterative re-clustering dies down as the gain in mutual similarity gradually shrinks, same as convergence in backprop.
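That “dying down” can be sketched with a toy 1-D k-means run (a generic competitive-clustering stand-in, not the algorithm described above): each pass re-assigns points to the nearest center and moves the centers, and the improvement shrinks each pass until it stops.

```python
import random

# Toy 1-D competitive re-clustering: iterate until the gain dies down.
random.seed(0)
points = ([random.gauss(0, 1) for _ in range(50)]
          + [random.gauss(5, 1) for _ in range(50)])
centers = [min(points), max(points)]

prev_cost = float("inf")
for step in range(10):
    # competitive step: the nearest center wins each point
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
        clusters[i].append(p)
    # update step: move each center to the mean of its cluster
    centers = [sum(c) / len(c) for c in clusters if c]
    cost = sum(min(abs(p - c) for c in centers) for p in points)
    print(step, round(cost, 2))
    if prev_cost - cost < 1e-9:   # gain has died down; stop
        break
    prev_cost = cost
```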

Everyone likes simple and natural. I thought my algorithm would be very simple too, until I started coding. But function über alles: things always get more complex as you add functionality. That’s the nature of any progress, be it evolution in biology or technology, domain-specific or pure math/informatics. Simplicity doesn’t scale; otherwise the world would be ruled by bacteria.


Fascinating word definition festival, where “clustering” is distinguished from “short-term memory”, which is somehow uniquely identified with vibrations. All of energy, entropy, disorder, randomness, are dismissed as obscurantist. And “similarity” is enthroned again as king of words, encompassing all meaning.


That’s the difference between learning and recognition, or training and inference in DL. Centroid clustering is a fairly intensive process; you can’t do it online.

Surely you don’t expect all possible “predictions” to “oscillate” all the time? This oscillation must be triggered by some specific input, for a short time. Hence STM.
