A first approximation would be to compare them to the data they are “learned” from. Does anyone have numbers for that: the number of parameters “learned” by a contemporary system versus the size of the data set used to train it?
Here’s a breakdown from Wikipedia:
“Sixty percent of the weighted pre-training dataset for GPT-3 comes from a filtered version of Common Crawl consisting of 410 billion byte-pair-encoded tokens.[1]: 9 Other sources are 19 billion tokens from WebText2 representing 22% of the weighted total, 12 billion tokens from Books1 representing 8%, 55 billion tokens from Books2 representing 8%, and 3 billion tokens from Wikipedia representing 3%.”
So that’s roughly 500 billion tokens abstracted into 175 billion parameters? I guess you could say that’s significantly smaller, roughly 1/3rd. For comparison, gzip apparently achieves “90% for larger text-based assets”.
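As a sanity check on that ratio, here’s a quick back-of-the-envelope in Python using the token counts quoted above. Note it compares counts (tokens vs. parameters), not bytes, so it’s only a rough sketch:

```python
# Back-of-the-envelope: GPT-3 training tokens vs. parameter count.
# Token counts are the ones quoted from Wikipedia above.
tokens = {
    "Common Crawl (filtered)": 410e9,
    "WebText2": 19e9,
    "Books1": 12e9,
    "Books2": 55e9,
    "Wikipedia": 3e9,
}
total_tokens = sum(tokens.values())  # ~499 billion
parameters = 175e9                   # GPT-3 parameter count

print(f"total training tokens: {total_tokens:.3g}")               # ~4.99e+11
print(f"parameters / tokens:   {parameters / total_tokens:.2f}")  # ~0.35, i.e. ~1/3
```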
It seems “Open” AI has not released details of the training data for GPT-4. But I find claims of 170 trillion trained parameters? That would be 170,000,000,000,000.
This article about BARD says its data set was 1000 times bigger than before. If the increase for OpenAI going to GPT-4 was comparable, the ~500 billion tokens trained on would now be ~500 trillion? That would be about the same ratio: 170 trillion trained parameters is roughly 1/3 of a 500 trillion token data set.
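Extending the same back-of-the-envelope sketch, and treating both the 1000x figure and the 170 trillion parameter figure as unconfirmed assumptions:

```python
# Speculative extrapolation, NOT released figures: assume the GPT-4 data set
# is ~1000x GPT-3's (the scale-up the article about BARD describes) and take
# the rumoured 170-trillion-parameter figure at face value.
gpt3_tokens = 499e9                      # from the Wikipedia breakdown above
assumed_scaleup = 1000                   # assumption borrowed from the BARD article
gpt4_tokens_guess = gpt3_tokens * assumed_scaleup   # ~5e14, i.e. ~500 trillion
gpt4_params_rumour = 170e12              # rumoured, unconfirmed by OpenAI

print(f"guessed GPT-4 training tokens: {gpt4_tokens_guess:.3g}")
print(f"parameters / tokens:           {gpt4_params_rumour / gpt4_tokens_guess:.2f}")  # ~0.34
```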
Maybe you could justify an argument that it is finding a smaller system, based on those numbers.
If anyone has better data I would be interested to see it.
My hypothesis would be that at some point the data set will hit a hard ceiling: all the information on the planet! But with a sufficient training budget, the “parameter” count would continue to increase.
I’m also comparing the size of the data set used to train these things with the amount of data typically absorbed by an infant. I sketched an argument around that in this earlier thread: