Well, you wouldn’t be the first. As I say, linguistics was shattered. Chomsky went off searching for a “layer below” in our biology. Four or five generations of theory never found it. A school split from him (acrimoniously), centered around George Lakoff and some others, who thought a “layer below” could be found in “meaning” primitives (cognitivism). And they joined an earlier school which had split off, which thought a “layer below” could be found in “function”. They all failed to find any kind of stable “layer below”, and the cognitive and functional schools drifted to the humanities and became preoccupied with social abuses of subjective meaning and function (broadly speaking; e.g. Lakoff: Women, Fire, and Dangerous Things).
You do need some kind of anchor. But I think an anchor in predictive value is enough. Group things according to the way they predict the next element in the string, and the groupings don’t need to be knowable. That “unknowable” groupings are effective at predicting the next element in the string is enough (actually you can know them, it’s just that those knowable groupings can contradict…).
Anyway, the idea that language is unlearnable is a bit confusing. If I’m not mistaken, you were exploring the idea of creating a graph of words from a corpus and using that to invoke some sort of oscillatory property. Unless there is learning of some sort involved, the edges connecting the nodes must be generic and stay unchanged over time (time here meaning the learning period, not intergenerational time where word meanings can drift); otherwise that’s a form of learning. I think you meant that language has some learnable properties and some unlearnable ones, like how you said there must be anchors of some sort?
Could you elaborate more or share some simple examples to illustrate the above point?
Also, by novelty did you mean it like how one creates new metaphors or jokes? I’m not sure I have the same idea of “novelty” as you, but what I think you meant by “novelty” could be argued against with the following idea: “There is a set of made-up words and a simple set of combination rules such that an infinite number of unique sentences can be created.”?
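Just to make that concrete, here’s a throwaway sketch of what I have in mind (the vocabulary and the single combination rule are entirely made up):

```python
import itertools

# A made-up vocabulary plus one combination rule ("X and Y") is already
# enough to generate an unbounded number of distinct "sentences".
words = ["blorp", "znag", "flim"]

def sentences():
    """Yield ever-longer conjunctions: 'blorp', 'znag', ..., 'blorp and znag', ..."""
    for n in itertools.count(1):
        for combo in itertools.product(words, repeat=n):
            yield " and ".join(combo)

gen = sentences()
for _ in range(5):
    print(next(gen))
```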
And by contradictory, did you mean like how the word forest “refers to” / “should be grouped with” / “attends to” the man or the dragon in the following sentence: “Joe is hunting the dragon in the forest.” In my mind the forest can refer to Joe, the dragon, or both. It implies both of them are in the forest, but less certainly for the dragon, as it may be off kidnapping princesses in the capital city. Also, “the dragon in the forest” may be a nickname given to the dragon, in which case the sentence gives no information about where the hunt takes place. Is this what you meant by contradictory groupings, or is it something different? IMO transformers are really great at these kinds of things, as in taking into account multiple possibilities at once with their trillions of parameters.
Ha. Sorry if I seemed to come down hard on you with that. I mostly wanted to draw a sharp contrast with what @DanML was saying about the book he was recommending. Your feeling that there must be some system is not unusual. Dan said he had it too. As I said, the entire history of linguistics has been a search for that system. The same search, only distinguished since Chomsky, by different reactions to Chomsky’s strong assertion that it can’t be learned.
So you’re by no means alone.
All of ML is looking for it too. The ML community ignores formal linguistics. With good reason. Because if they asked, formal linguistics would say it can’t be learned. So the ML community too ignores the history and works on pure conviction that something learnable must be there.
Since linguistics had its big shock of contradictory learned structure, though, we’ve become more familiar with complex systems, chaos, and the idea that some things are only completely described by themselves. So we should be more open to that explanation for the history of linguistics. But chaos is not a big thing in science even now. I guess because science generally has to limit itself to systems which can be abstracted. Otherwise what’s the point? It’s no use saying we can’t know! So even now that we appreciate systems of this sort exist, that understanding has not proven especially useful in practice.
Stephen Wolfram was promoting a “New Kind of Science” on these lines for a bit. With the idea that some things are unknowable, so we must get used to the idea of unknowability. In practice I don’t think it’s gone far.
That’s actually where AI has an advantage. Because with cognition the idea that a system might be unknowable doesn’t matter so much. With the weather it’s a problem. With the weather you want to know the actual weather. Chaos puts a hard limit on the accuracy of weather forecasts. But with AI we’re a bit better off. With AI it’s enough to make another system of the same type.
Yes, the word “learn” is a problem here. That’s one reason I usually put it in quotes.
Obviously we all “learn” a language. We’re using one now. It’s more a question of degrees of abstraction.
I’m proposing that language can’t be abstracted more than any set of examples of its use. The examples can be “learned”. In the sense they can be recorded, and even coded into the connections of a graph. But you can’t abstract it further than that.
It’s the same problem as that other example I pulled up:
It doesn’t mean you can’t have the graph. It just means you can only have the graph. Nothing smaller will do.
The “anchors” become the kind of clustering you decide is “the best” for your problem.
I think that network example might illustrate the point. Let me know if it doesn’t.
Beyond the idea that different clusters might be relevant to different situations, you are only left with the problem of how exactly you want to cluster them.
For language I just think the relevant way to cluster networks of language sequences is according to clusters which best predict the next element in a sequence.
And to do that, I conjecture it may be enough to set the network oscillating by driving it on that sequence, and see how it clusters about it.
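To be concrete about what I mean by “the network”, here’s a minimal sketch of just the data structure (nothing about the oscillation itself, and the toy corpus is invented): nodes are words, and weighted edges count observed transitions.

```python
from collections import defaultdict

def sequence_network(corpus):
    """Word-transition network: edge weights count how often one word
    immediately follows another in the corpus."""
    edges = defaultdict(int)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            edges[(a, b)] += 1
    return edges

corpus = [
    "I teach the boy to read the paper",
    "I teach the boy to eat his supper",
    "I buy the paper to read the news",
]
net = sequence_network(corpus)
print(net[("the", "boy")], net[("to", "read")])   # 2 2
```

The conjecture is then that driving a (much larger) network like this with an incoming sequence, and watching which sub-networks fall into synchrony, would pick out the predictive clusters on the fly.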
It would be similar to the way transformers work. They also work by clustering sequences according to how well they predict the next element in a sequence. With particular sequences selected by “prompts”.
The main difference would be that transformers assume those clusters will be stable in some way, and that it will therefore make sense to learn them once, and use those learned clusters repeatedly.
I just suggest that the fact doing so generates… trillions (is it now?) of parameters, might actually be suggesting the clusters are better seen as subjective, generatively novel, and contradictory, similar to the clusters in the quoted article.
By novelty I mean new clusters all the time. It can be new metaphors and jokes. Or it could just be a new sentence. Or more specifically, clusters about those new sentences, “across” them in a sense, which predict them.
The refutation of that would indeed be, as you say, “a simple set of combination rules such that an infinite number of unique sentences can be created”. But no such set has ever been found. And I say the history of contradiction found in theoretical linguistics when they attempted to find such rules, indicates that it cannot be found.
If it could be found, you’d be right. I say the evidence indicates it can’t be found. The history of linguistics, and trillions of transformer “parameters”, back me up.
I’m not sure. There is relevance in the multiple groupings which are possible about the different interpretations of such a sentence.
Classic examples I struggled with when I was working on machine translation systems addressing these problems were:
“He saw the girl with the telescope”
or “He hit the nail with the hammer”.
A favourite was:
“I buy the paper to read the news”
“I teach the boy to read the news”
That one wasn’t just an ambiguous association, the entire interpretation of the sentence changes from “reading the news” being the goal of an action, to being the action taught.
Or examples like “garden path” sentences, where the grouping changes in mid-phrase:
“The horse raced past the barn fell”
But such ambiguities are as you say, resolved in context.
I’m more interested in the groupings which form among the network of examples to resolve these ambiguities.
I guess it might be the same. These examples illustrate the ambiguities. And they are resolved in context. But they don’t illustrate how they are resolved. So I’m interested in the same examples. But I’m interested in the different groupings within a network of examples which flip between one interpretation and another as the interpretation of the elements in the sentence changes, but which are not visible as different groupings of the surface forms of the examples themselves.
So imagine sets/clusters of similar words projecting up and down around the words in these ambiguous examples. Not along the strings of the examples themselves, but out, up and down, on top of and below the strings of the examples.
Something like:
I teach the boy to read the paper
                   eat his supper
                   make his bed
                   do his homework
                   ...
I’m interested in how these clusters might form and reform, as different ambiguous groupings of the surface sentence are resolved.
It is these clusters which resolve the surface ambiguity.
I agree. They do it well. All this particularity is necessary. And transformers capture more particularity than anything else up to now.
But it raises the question: what are trillions of parameters telling us about our conception of the problem?
To treat all this particularity as an abstraction seems back-to-front to me. It’s using the techniques of finding abstraction, when actually it seems there is no abstraction. Everything is particularity. They are just enumerating detail. The idea of “learning” comes from the time when we thought there would be a small number of abstractions which would suffice. But there’s not. These things are just learning trillions of particular examples.
I’m arguing with the idea, held over from the time we were “sure” there were complete abstractions, that it is both more efficient, and can actually be complete, to try to “learn” (in the sense of abstract) all these particular clusterings, before we meet them.
So, I’m arguing for a solution much like transformers. But dynamic. Finding the predictive groupings from the dynamics of the network. So it is possible to find all of an infinite number of often contradictory groupings. Instead of trying to “learn” them all, and inevitably falling short.
I agree very much with the above but I don’t get what you mean by “transformers assume those clusters will be stable”. Transformers take thousands of words into context when “forming best clusters”. I even think that transformers are replicating deep/slow thinking, limited only by their context window size.
There was research suggesting that transformers are Hopfield networks, after some slight modifications, and if each layer of the transformer is allowed to iterate (oscillate). Not sure if this is fundamentally different from your approach, but there’s an oscillatory component to it. YouTube link: Hopfield Networks is All You Need (Paper Explained) - YouTube
Darn, I was so sure there could be one, even if it meant an abstract language of sorts, like a programming language. I mean, in a cheaty kind of way I can make an infinite number of unique sentences just by incrementing the number in this sentence: “I want to have 1 piece of apple.” But that’s beside the point and irrelevant now that I know more about what you meant.
Are you basing the above opinion more on the cluster formation side of things based on the article you linked above titled “The Many Truths of Community Detection”? Or is it more about how it’s impossible to learn how to choose best clusters based on context?
Can you restate exactly which bits you agree with. Because to me that sounds like you saying, “I agree clusters won’t be stable, but don’t get what you mean by saying they won’t be stable”!!
Was it just the prediction of sequences using clusters bit that you agreed with?
It seems clear to me that because they start from a paradigm of “learning”, transformers, and likely Hopfield nets, will try to find clusters which predict better by following prediction “energy” gradients to try and find minima. That’s great, but once you find such a minimum, that is your cluster. It won’t change. You will have trained your network with weights which embody it.
That to me is a “stable” cluster. And seeking a maximally predicting cluster in that way, seems to me to assume the cluster will be stable. Otherwise you wouldn’t try to embody it in network weights.
You might say the network learns many different clusters dependent on context (context = “attention”.) And those many different clusters might be thought of as a form of “instability”. Is that what you meant by “take thousands of words into context”? But however many such clusters you learn, even if they are contradictory (distinguished only by “attention” context?), their number will be fixed after the learning process.
Having a fixed number after learning is just an artifact of a paradigm of learning which seeks to gradually adjust weights, and follow energy gradients to prediction minima.
That “learning” paradigm has the advantage that you don’t have to assume as much about the system to be learned. You can try all sorts of things, and then just follow your prediction energy gradient down to a minimum.
So it has the advantage of being dependent on fewer assumptions about what is to be learned.
But it has the disadvantage of being dependent on the idea that weights will gradually converge, and then remain static. You start from the slope, and trace it back. So the slope must be fixed (and “smooth”/“linear”?)
By contrast, I think the prediction energy “slope” flips back and forth discontinuously. That means you can’t trace back along it. (Is the ability to “trace back along it” an assumption of “linearity” in back-prop?)
So how do you find an energy-minimum predictive state in a network which flips from one maximally predictive cluster to another, discontinuously, as context changes?
Fortunately, the tracing of slopes smoothly down to minima need not be the only way to find clusters which maximally predict in a sequence network. If maximally predicting means having a lot of shared connections, such clusters will also tend to synchronize any oscillations. In any case, synchronized oscillations are another kind of minimized energy state of a network. And in a sequence network, a synchronized cluster will surely be a minimized energy state of sequence.
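To illustrate the tendency I mean, here’s a rough, hedged sketch using a standard Kuramoto-style phase-oscillator model (not my proposal itself; the two-cluster graph and all the constants are made up): nodes that share many connections pull each other’s phases together, so densely connected clusters show up as pockets of synchrony.

```python
import numpy as np

# Two densely connected 5-node clusters joined by one weak link.
rng = np.random.default_rng(0)
n = 10
A = np.zeros((n, n))
A[:5, :5] = 1.0            # cluster 1: fully connected
A[5:, 5:] = 1.0            # cluster 2: fully connected
np.fill_diagonal(A, 0)
A[4, 5] = A[5, 4] = 0.1    # weak bridge between the clusters

theta = rng.uniform(0, 2 * np.pi, n)   # initial phases
omega = rng.normal(0, 0.1, n)          # natural frequencies
K, dt = 1.0, 0.05
for _ in range(2000):
    # Kuramoto update: d(theta_i)/dt = omega_i + K * sum_j A_ij sin(theta_j - theta_i)
    coupling = (A * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
    theta = theta + dt * (omega + K * coupling)

def coherence(idx):
    """Order parameter |<e^{i*theta}>| for a set of nodes (1 = fully in phase)."""
    return np.abs(np.exp(1j * theta[idx]).mean())

print("cluster 1:", coherence(np.arange(5)))    # close to 1: internally synchronized
print("cluster 2:", coherence(np.arange(5, 10)))
print("all nodes:", coherence(np.arange(n)))
```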
I looked at the paper, but couldn’t find any mention of oscillations. Does Kilcher mention oscillations? Can you cue the spot in the video?
I’m not familiar with Hopfield nets. But the paper was just saying they can achieve equivalent forms of learning. Which I don’t doubt.
Do they not also gradually adjust weights, to find minima along smoothly varying, “linear”, energy surfaces?
I’m basing it firstly on what I observed when trying to learn grammars when I first started working on the grammar description problem. It’s fairly simple really. You can learn grammars fine.
You just classify words based on a vector of their contexts. QED… or so it seems…
The only trick is that the vectors contradict! It’s just an element ordering problem. You order all the elements (contexts) one way, and it gives you a nice (grammatical) “class”. The trouble is, you order them another way, and it gives you another class. Both classes are correct. The word participates in both of them. But not both at the same time. And while other words will belong to those same classes… mostly. None of them will have all the same elements. Some might have some elements, others might have other elements. It all becomes a big hodgepodge, and the only way to capture it all is to order them the way you want, when you want to.
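Here’s a toy reconstruction of that experience (the corpus, the choice of “context = preceding word”, and the words are all invented, just to show the shape of the problem):

```python
from collections import defaultdict

# Classify words by the contexts they occur in, then notice the classes
# overlap but don't merge into one consistent classification.
corpus = ["i read papers", "i read news", "i buy papers", "i buy apples"]

contexts = defaultdict(set)          # word -> set of preceding words
for sentence in corpus:
    toks = sentence.split()
    for left, word in zip(toks, toks[1:]):
        contexts[word].add(left)

def word_class(context):
    """The 'class' of words sharing a given context."""
    return {w for w, cs in contexts.items() if context in cs}

print(word_class("read"))   # {'papers', 'news'}   -- a perfectly good class
print(word_class("buy"))    # {'papers', 'apples'} -- another, equally good
# 'papers' belongs to both classes, but the classes never merge:
# 'news' and 'apples' share no context. Order/weight the contexts one way
# and you get one grouping, another way and you get a different one.
```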
So this is something I observed directly.
It was only afterwards that I found other people noticing similar “problems”. Having a physics background, I felt there was an intriguing parallel to uncertainty principles, and even gauge theories of particle formation. And then I found other people noticing the same thing. The first was Robert Laughlin, “Reinventing Physics from the Bottom Down”, emphasizing the irreducibility of structure at each level even for physics. I mentioned in another thread the parallel Bob Coecke has drawn between language structure and quantum mechanics… etc. etc. Then I found there seemed to be a parallel in the diagonalization proof of Goedel’s incompleteness theorem (though that’s still conjecture, I haven’t nailed down an equivalence for that). And the same parallel appears in the applicability of category theory to language structure, category theory being itself a response to Goedel incompleteness in the maths context.
Oh, and then I found the observation of contradictions in the history of language learning…
The network clustering paper I linked is just the latest in a list of many examples.
But initially it was something I noticed myself, when trying to learn grammar.
Have you considered that, in the search for “underlying rules”, part of the problem is that much of the human network’s training arises from embodiment? Since an AI program does not have this grounding, it is flailing around trying to recreate the portion that is part of the human’s learned experience, without this grounding.
Look at this paper to see that much of semantics is grounded in the parts of the brain associated with how we sense and control the lived experience.
The functions are distributed over many processing centers, with significant portions split between the grammar store (motor portion) and object store (sensory portion). The filling-in process of speech generation seems to me to be somewhat related to the Hopfield model mentioned above, with various processing areas providing cues and constraints as the utterance is being assembled. I see the cues as a combination of external perception, internal stored percepts, and the current contents of consciousness. The constraints being stored percepts.
If you consider the frontal lobe as a motor-program generator driven by subcortical commands, speech is just another goal-oriented motor command sequence, guided by connectionist feedback from the sensory regions.
As is “thinking.”
Don’t forget that since this is a motor program, the cerebellum will be providing coordination and sequencing between the various components of the motor program.
If you look at this the right way, you may notice some similarity to the major components of the transformer networks and how the brain is organized. The large number of processing areas (about 100 in the human brain) corresponds to multiple attention heads. The “loop of consciousness” roughly corresponds to the transformer deep in/out buffer.
Oops, yes, that sounds like what you said. But yes, everything feels familiar other than the parts related to the clusters being stable.
Absolutely spot on! That’s what I was trying to say about taking thousands of words into context. My opinion is that it shouldn’t be taking thousands of words into context. A bit of an exaggeration, but the number of different contexts made of thousands of words could be a combinatorial nightmare. It could instead be trying to find general rules of combination consisting of much smaller parts.
I don’t know enough to answer or understand what you’re claiming. It would be helpful if other people who know more about it could discuss this together. The parts I quoted above could be a point that you can elaborate and emphasize more when discussing your views with a deep learning proponent.
At 20:01 is one of the parts where he starts to explain the Hopfield network’s multiple-update mechanism of iteratively trying to find energy minima. I took this as being somewhat similar to the process of achieving a state of stable oscillation, stable meaning finding the correct clusters and those clusters then not changing. And yes, you’re right, this is similar to a standard transformer. I was trying to point out some similarities, but it seems it’s fundamentally different from your idea.
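Roughly, as I understood the update rule from the video/paper, it looks something like this sketch (the dimensions, beta, and stored patterns are made up; this is just the iterated softmax retrieval, not the paper’s actual code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 16))              # 4 stored patterns (rows), 16-dimensional
xi = X[0] + 0.5 * rng.normal(size=16)     # noisy query near pattern 0
beta = 4.0

for _ in range(3):                        # a few iterations usually suffice
    p = softmax(beta * X @ xi)            # attention weights over stored patterns
    xi = X.T @ p                          # new state: weighted mix of the patterns

print(softmax(beta * X @ xi).round(3))    # weights should concentrate on pattern 0
```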
My thinking is that there’s only so many different clusters that can be formed in a sentence. Learning them all shouldn’t be a problem and that could be separate from learning to resolve the many different potential clusterings based on context. But I don’t think this observation is enough to refute your idea.
I think it is. That’s what it is trying to do. And they find many such “general rules”. Many, many… And as generalizations they are useful. The system generalizes. And it generalizes in specific ways according to the prompts. Everything becomes about supplying the right prompts to isolate the generalization you want.
But to say they are successful because they have found generalizations “consisting of much smaller parts” I think we can agree is questionable. If the generalizations are in terms of “much smaller parts”, why trillions of parameters?
No-one asks these questions, because just making them bigger is working wonderfully well for now. A bit like climbing the tallest tree to get to the moon, you seem to be making marvelous progress, until you’re not.
Well, the first step to understanding is to identify a question. At least you are thinking about the problem. I think most people going into “AI” just now don’t think about what’s fundamentally happening at all. They are just wow’ed by the trillions of generalizations these things suddenly make accessible automatically, and by their superior memory. They dismiss the already evident limits of those trillions of ossified generalizations in “hallucination”, perhaps propose some kind of past-tech bolt-on to filter the “hallucination” out, and just use all these wonderful new APIs to make pretty pictures, or try to get funding for engineering prompts, etc.
I see what you mean by a kind of oscillation. It’s in an iterative procedure of magnifying a particular key, to retrieve a particular value.
Also, what he says about not aggregating the values, but aggregate the keys, ~minute 20.40, sounds a little like the distinction I was trying to draw between “similarity” of a pixel greyscale value, and “similarity” of context defining a diverse set of pixel greyscale values, in my other thread with @bkaz.
But even if a Hopfield net is iteratively magnifying its lookup, it is still performing a lookup…
I don’t know. Maybe if the network were completely untrained, it might equate to what I am suggesting. If you equated “keys” to sequential network connections… And then iteratively fed back the “lookup” of that key, to be a “lookup” of some structure latent in the network. It might start to come to the same thing.
Then the insight I’m promoting would be that you would not want to train the network! Any “training” would just be throwing away structure which is latent in it. In that case a Hopfield network might be an overly complex way of simply setting the network oscillating and seeking a maximum in the oscillation.
Having thrown away the training, I’m not sure how much of what was left would make sense as a Hopfield network.
But yeah, there might be a parallel there, you’re right.
I haven’t been deeply into the maths of this. And Wolfram is a self-absorbed PITA in many ways (In particular I think Wolfram has disappeared up his own immaculate conception in his ideas about language, which he seems to reduce to a list of labels, so he can equate it to his “Wolfram Language” computation library, which has made his millions, completely ignoring his own ideas about system!) But Wolfram has looked at the power of simple combination rules to generate complexity. If you google him, as an example of someone who has looked at the power of combination unconstrained by ideas of limitation by abstraction, I think you’ll find sufficient evidence that simple combination rules can grow without bound.
The relationship to embodiment is tricky. There’s two ways to interpret embodiment in the context of intelligence, or language. One is that embodiment is the solution to the puzzle that we can’t find an abstract system. This theory says that intelligence is attached to a particular body, our human body, and therefore we need to recreate that body to recreate intelligence.
This was the direction George Lakoff took the search for a “layer below” after Cognitivism split from Chomsky in the '70s. Here’s a nice presentation of that (I think this is the one I saw):
George Lakoff: How Brains Think: The Embodiment Hypothesis
(Oh wow, look at this. Just discovered. This is more recent. Dipping into the beginning, he appears to be further reifying the drift to focus on “social justice” abuses of language, which I said has become the direction of the Cognitivist and Functional schools of theoretical linguistics in recent years, and to have aligned his idea of embodiment with current identity politics! Body = identity. He’s using it to add fire to the rejection of Enlightenment reason and to say abstract truth does not exist, only the truth of identity, a particular body, exists. Wow. That would be terribly dangerous. It would further justify the current rejection of objective truth in the humanities, and the fracturing of politics into identitarian conflict. I want to do the opposite. I want to emphasize the idea that there is a sense of objectivity to truth, and that it is to be found in the (chaotic) process of generating truth from example. If we can accept a unifying process by which different “truths” are formed, we can escape conflict based on the identity of the person forming them. He’s going the opposite way. Wow, the push to equate everything with identity in the humanities seems strong!)
I don’t see how you can have AI with this, of course. I don’t think Lakoff’s ideas have been useful for technology anywhere, only for creating more social division! Instead it is abstractions on language which have proven useful, the other direction, going up above language, instead of trying to go down below it…
So that’s the direction Lakoff has taken the search for an objective basis for language, and thought, after he split from Chomsky’s search. Like the postmodernists which have dominated the humanities, Lakoff is equating meaning with identity.
I hold with a weaker form of that (weaker for identity dependence, stronger for an independent system of meaning.) I hold that intelligence is embodied. But that many bodies can be intelligent. So you can’t copy one or other example of intelligence without copying the exact body. But you can have intelligence.
To me it is exactly the case of chaos. Like the weather. You can’t copy the weather without copying the exact planet it is embodied in. Actually impossible. But you can create another weather system. It will of course be different to the weather of our planet, and so imperfect for weather prediction. But it will still be weather.
This is chaos. Chaos is embodied. Down to the butterfly. That’s its canonical property. But the idea of chaos is not limited to any particular body.
Identifying its embodied property with chaos means you can lift intelligence off the exact embodied substrate. You can actually model it from any level. Because the property, chaos, can be the same. The chaos it generates won’t be the same. But it will still be chaos. So it will still be a kind of intelligence.
This explains why transformers might appear to demonstrate a kind of intelligence, even without grounding in data more embodied than the “body” (corpus) they are trained on. (It is just that being trained, they abstract away that body of text, and so abstract away the ability of that text to be interpreted in novel ways. That’s where “intelligence” escapes them, not the fact of being based on text as such.)
The brain has many more parameters (synapses), not counting the glial cells and other things, including the chemicals that govern its functions, like how synapses move and find their way around in physical space, etc. There are also many reasons why increasing parameter size is useful. For one thing, increasing the context window size requires more parameters, and a longer window could equate to more working memory if I were to draw an analogy to the brain. Regarding the hallucination part, yeah, I think there are still some fundamental limitations in the design, like its lack of working memory and of the ability to determine the confidence level of its output based on past memories (like maybe how certain combinations are possible but not registered in memory).
I can’t judge if a trillion parameters is big or not. What should that number be compared to? On the question why trillions or huge amounts of parameters is needed if it is learning smaller parts, I think there are huge amounts of them but finite and the AI must also encode how they interact with each other.
I saw one of his videos posted on this forum recently. Funny that I somehow had similar dark thoughts about the guy and I thought I was being judgmental. But yeah, he had awesome things to say and opened my eyes to the greater universe, but I’m not sure if it’s too helpful towards AGI.
A first approximation would be to compare them to the data they are “learned” from. Does anyone have a number for that? A number to compare the number of parameters “learned” by a contemporary system, and the size of the data set used to train it?
Here, there’s an analysis from Wikipedia:
“Sixty percent of the weighted pre-training dataset for GPT-3 comes from a filtered version of Common Crawl consisting of 410 billion byte-pair-encoded tokens.[1]: 9 Other sources are 19 billion tokens from WebText2 representing 22% of the weighted total, 12 billion tokens from Books1 representing 8%, 55 billion tokens from Books2 representing 8%, and 3 billion tokens from Wikipedia representing 3%.”
So that’s 500 billion or so tokens abstracted to 175 billion parameters? I guess you could say that’s significantly smaller. About a third? Apparently gzip achieves “90% for larger text-based assets”.
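For anyone checking my arithmetic, that ratio is just:

```python
# Back-of-envelope ratio for GPT-3, using the figures quoted above.
tokens = 499e9           # ~500 billion training tokens
params = 175e9           # 175 billion parameters
print(params / tokens)   # ~0.35, i.e. roughly a third
```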
It seems “Open” AI has not released details of the training data for GPT-4. But I find record of 170 trillion trained parameters? What’s that 170,000,000,000,000?
This article about BARD says the data set was 1000 times bigger than previously. So if the increase for OpenAI going to GPT-4 was comparable, ~500 billion tokens trained on would now be 500 trillion? So about the same ratio, trained parameters 1/3 the data set size?
Maybe you could justify an argument it is finding a smaller system based on those numbers.
If anyone has better data I would be interested to see it.
My hypothesis would be that at some point the data set would hit a hard ceiling of all the information on the planet! But that with a sufficient training budget, the “parameter” size would continue to increase.
I’m also comparing the size of the data set used to train these things and the amount of data typically absorbed by an infant. I sketched an argument around that in this earlier thread:
trainable parameters: fixed number of weights the model (matrices) has. These are changed during the training process.
training data: amount of text used to train the parameters above, measured in tokens.
Some empirical evidence suggests an “optimal” ratio between the two: about 20 data tokens per parameter according to Chinchilla, or even larger (70:1 or more) according to LLaMA.
The latter is more expensive to train but aims to obtain smaller (= cheaper to run) pre-trained models that are more accessible to mere mortals.
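As a quick worked example of what those ratios imply (the 70B figure is just for illustration):

```python
# Tokens implied by the two rules of thumb, for a hypothetical 70B-parameter model.
params = 70e9
chinchilla_tokens = 20 * params    # "compute-optimal" recipe  -> 1.4e12 tokens
llama_style_tokens = 70 * params   # over-trained small-model recipe -> 4.9e12 tokens
print(f"{chinchilla_tokens:.1e} {llama_style_tokens:.1e}")
```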
PS here you find a table mentioning both # of parameters and training dataset size for various models. Notice the “explosion” of models in March 2023
Thanks for that. That’s good data. Here’s my “back of envelope” calculations on that:
| Model | Date | Parameters | Training data | ~params/tokens |
|---|---|---|---|---|
| BERT | 2018 | 340 million[19] | 3.3 billion words | ~0.1 |
| GPT-2 | 2019 | 1.5 billion[22] | 40GB[23] (~10 billion tokens) | ~0.15 |
| GPT-3 | 2020 | 175 billion[11] | 499 billion tokens | ~0.33 |
| GPT-Neo | March 2021 | 2.7 billion[27] | 825 GiB | ?? |
| GPT-J | June 2021 | 6 billion[30] | 825 GiB | ?? |
| Megatron-Turing | October 2021 | 530 billion[32] | 338.6 billion tokens | ~1.5 |
| Ernie 3.0 Titan | December 2021 | 260 billion[33] | 4 TB | ~0.065 |
| Claude[34] | December 2021 | 52 billion[35] | 400 billion tokens | ~0.15 |
| GLaM | December 2021 | 1.2 trillion[37] | 1.6 trillion tokens | ~0.75 |
| Gopher | December 2021 | 280 billion[38] | 300 billion tokens | ~0.9 |
| LaMDA | January 2022 | 137 billion[40] | 1.56T words,[40] 168 billion tokens | ~0.8 |
| GPT-NeoX | February 2022 | 20 billion[41] | 825 GiB | ?? |
| Chinchilla | March 2022 | 70 billion[42] | 1.4 trillion tokens | ~0.05 |
| PaLM | April 2022 | 540 billion[43] | 768 billion tokens | ~0.6 |
| OPT | May 2022 | 175 billion[44] | 180 billion tokens | ~0.95 |
| YaLM 100B | June 2022 | 100 billion[46] | 1.7TB | ~0.05 |
| Minerva | June 2022 | 540 billion[47] | 38.5B tokens | ~13 |
| BLOOM | July 2022 | 175 billion[14] | 350 billion tokens (1.6TB) | ~0.5 |
| AlexaTM | November 2022 | 20 billion[51] | 1.3 trillion tokens | ~0.015 |
| LLaMA | February 2023 | 65 billion[54] | 1.4 trillion tokens | ~0.05 |
| GPT-4 | March 2023 | Unknown[f] | Unknown | |
| Cerebras-GPT | March 2023 | 13 billion | | ?? |
| Falcon | March 2023 | 40 billion | | ?? |
| BloombergGPT | March 2023 | 50 billion | 363 billion token dataset | ~0.15 |
| PanGu-Σ | March 2023 | 1.085 trillion | 329 billion tokens | ~3 |
Seems mixed. Going from 90% compression for BERT, to later models with both some very large numbers like 1300% expansion for Minerva! But also a 95% compression for LLaMA.
My rough figures might be cooked though. Point out any errors in arithmetic, please.
The term “compression” you use here is incorrect since it compares apples to oranges - a token represents a text word (or part of it), while a parameter is a number, usually a 16/32-bit float for training, which can be compressed to 8-bit or even 4-bit for inference.
PS beware: each input token is expanded to a vector of 1-10k values before being “crunched” by the network. So it is enlarged to a couple of dozen kbytes in large models.
PS2 Minerva starts from pre-trained PaLM by training it further with data from math/science. A domain-specific fine-tuning. Look at the comments in the last column.
“PT” from GPT stands for “Pre-trained Transformer” which means anyone may continue training it with much fewer tokens (than used originally) for specialization in a specific domain. Well, assuming you can get it and afford the extra 1-5% computing cost.
Yeah, sure. A very general sense of “compression” used there.
I’m hypothesizing a general trend anyway. And that seems mixed. But with some blow outs, which might indicate parameters increasing even as data size remains constant.
The measuring unit used is the tokens/parameter ratio. The Chinchilla experiments found an “optimal” value of 20:1 for it. They showed that Chinchilla, at 70B parameters, slightly outperformed a 280B model trained on ~4.5x fewer tokens (300B vs 1.4T), both using the same compute budget for training.
Optimal means that if you reduce parameters further, and increase the number of tokens proportionally, performance starts to drop.
However, the LLaMA experiment said “so what? Let’s make the best small models; let’s train them at a 70:1 token-to-parameter ratio”. The models still improved, sub-optimally, but their goal was to obtain a transformer so small it can be run on “common” hardware.
So they obtained 7B to 65B models that are best performers in their lightweight category.
PS The LLaMA smaller models have higher than 20:1 ratio.
A model trained on 4 times the tokens only slightly outperformed one trained on 4 times the parameters? The larger data set is still better, but only slightly?
The question was, as @roboto hypothesized, whether the parameters represent any kind of simplification of the data.
Or, as I hypothesize, the parameters actually represent a kind of expansion of latent structure in the data, and there’s no limit to how much increasing them might improve the model.
If parameters were a simplification of the data you might expect that the number of meaningful parameters would decrease as the data size decreased. If you’re simplifying something it seems reasonable to assume it results in something smaller. But this seems to be indicating “simplifying” is much the same as just increasing the data size. With this result you might argue that calculating more parameters “expanded” structure to a degree roughly comparable (4x) to the degree to which the data size was reduced (1/4.) Calculating more parameters wasn’t simplifying more, it was expanding in roughly the same way as adding data.
I’m hypothesizing the models will continue to get better as the number of parameters are increased, even if data size is limited. This seems to me to be consistent with that.
It strikes me that the sense of “optimal” in this 20:1 ratio is mostly talking about compute overhead. It’s a floor on the number of parameters they can get away with. It’s not a ceiling on the number of parameters which would be useful. They always want fewer parameters because it’s cheaper and smaller. And 20:1 is as much as you can get away with before your model really starts to decline (though LLaMa pushed it?) But what about just increasing the parameters?
Has there been any limit noted to improvement with increase in the number of calculated parameters? Other than that it is not “optimal” because your model gets even more expensive and large.
From what I’m seeing here it would seem the evidence is consistent with the idea that just increasing the number of parameters infinitely would increase model performance in a way proportionate to that which infinitely increasing data size would. To the extent it’s been observed, it’s been proportional. Just that there’s a 20:1 floor on the number of parameters.
Interesting. So a “simplification” (if that’s what parameters are) of a larger number of tokens always needs to be larger… That’s already a suspect sense of “simplification”. And it’s roughly in proportion.
So that’s numbers for decreasing parameters in proportion to an increase in tokens.
What about increasing parameters in proportion to a drop in tokens?
For Chinchilla you said that resulted in a model which was only “slightly” worse for proportionately fewer tokens.
But they didn’t push this. They didn’t try just increasing the number of parameters?
I guess they couldn’t get numbers for that, because it would mean collecting more data to compare.
I’m guessing they haven’t been so interested in doing it, either. That probably seems like a dumb idea. Because loading up on the token side is going to result in cheaper and smaller models. If it’s all proportional, why would you load up on the side which results in larger and more expensive models?
But it does seem from these numbers you have given, that just endlessly churning out parameters might result in models which continue to get better, and in a way proportional to what you would achieve by endlessly increasing the data size.
Seems that way to me, anyway. Is there something I’m missing?
It’s an expansion of alternative simplifications. Actual simplification happens when a small subset of these params is activated in specific use cases.
Think of it as multiple compression algorithms working in parallel. Each will compress an input, but the sum of these compressions may be even longer than the input.
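A toy way to see that analogy (standard-library compressors standing in for the “alternative simplifications”):

```python
import bz2, lzma, zlib

# Several compressors run in parallel over the same input. Each output is a
# valid "compression", but keeping all of them can easily exceed the size of
# the original, especially for short inputs.
text = b"I teach the boy to read the paper."
outputs = [zlib.compress(text), bz2.compress(text), lzma.compress(text)]
print(len(text), [len(o) for o in outputs], sum(len(o) for o in outputs))
```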