A model trained on 4 times the tokens only slightly outperformed one trained on 4 times the parameters? The larger data set is still better, but only slightly?
The question was whether the parameters represent some kind of simplification of the data, as @roboto hypothesized.
Or whether, as I hypothesize, the parameters actually represent a kind of expansion of latent structure in the data, with no limit to how much increasing them might improve the model.
If parameters were a simplification of the data, you might expect the number of meaningful parameters to decrease as the data size decreased. If you’re simplifying something, it seems reasonable to assume the result is something smaller. But this result seems to indicate that “simplifying” has much the same effect as just increasing the data size. With this result you might argue that calculating more parameters “expanded” structure to a degree roughly comparable (4x) to the degree to which the data size was reduced (1/4). Calculating more parameters wasn’t simplifying more; it was expanding in roughly the same way as adding data.
I’m hypothesizing the models will continue to get better as the number of parameters is increased, even if data size is limited. This result seems to me to be consistent with that.
It strikes me that the sense of “optimal” in this 20:1 ratio is mostly about compute overhead. It’s a floor on the number of parameters they can get away with. It’s not a ceiling on the number of parameters which would be useful. They always want fewer parameters because it’s cheaper and smaller. And 20:1 is as much as you can get away with before your model really starts to decline (though LLaMA pushed it?). But what about just increasing the parameters?
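To put rough numbers on the “cheaper and smaller” point, here is a minimal sketch. It assumes the common rules of thumb that training costs about 6 FLOPs per parameter per token and inference about 2 FLOPs per parameter per generated token. Those constants, the 1B-parameter baseline, and the function names are illustrative assumptions, not figures from this thread.

```python
import math

# Rule-of-thumb assumptions (hypothetical, not taken from this thread):
#   training compute  ~ 6 * N * D FLOPs  (N = parameters, D = training tokens)
#   inference compute ~ 2 * N FLOPs per generated token
#   "optimal" data    ~ 20 tokens per parameter
TOKENS_PER_PARAM = 20.0
TRAIN_FLOPS_PER_PARAM_TOKEN = 6.0
INFER_FLOPS_PER_PARAM = 2.0


def train_flops(n_params, n_tokens):
    """Approximate total training compute in FLOPs."""
    return TRAIN_FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def infer_flops_per_token(n_params):
    """Approximate inference compute per generated token in FLOPs."""
    return INFER_FLOPS_PER_PARAM * n_params


def split_at_ratio(flops_budget):
    """Solve C = 6 * N * (20 * N) for N, then return (N, D) at the 20:1 ratio."""
    n_params = math.sqrt(
        flops_budget / (TRAIN_FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    return n_params, TOKENS_PER_PARAM * n_params


# Hypothetical baseline: 1B parameters trained at 20 tokens per parameter.
base_n = 1e9
base_d = TOKENS_PER_PARAM * base_n

# Two ways to spend 4x the baseline training compute.
more_params = (4 * base_n, base_d)   # 4x parameters, same tokens
more_tokens = (base_n, 4 * base_d)   # same parameters, 4x tokens

for label, (n, d) in [("4x params", more_params), ("4x tokens", more_tokens)]:
    print(label, "train FLOPs:", train_flops(n, d),
          "inference FLOPs/token:", infer_flops_per_token(n))
```

Under these assumptions the two 4x options cost the same to train. The difference shows up only at inference, where the model with four times the parameters is four times as expensive per token.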
Has any limit been noted to the improvement you get from increasing the number of calculated parameters? Other than that it is not “optimal”, because your model gets even larger and more expensive.
From what I’m seeing here, the evidence seems consistent with the idea that increasing the number of parameters without limit would improve model performance roughly in proportion to what increasing the data size without limit would. To the extent it’s been observed, it’s been proportional. It’s just that there’s a 20:1 floor on the number of parameters.
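One way to make this question concrete is a minimal sketch using the parametric loss form fitted in the Chinchilla paper, L(N, D) = E + A / N^alpha + B / D^beta. The constants below are the fitted values as reported there, quoted from memory and best treated as assumptions. The 20B-token data size and the function names are hypothetical, and nothing here says the form still holds far outside the range where it was fitted.

```python
# Assumed constants for L(N, D) = E + A / N**ALPHA + B / D**BETA.
# N = parameters, D = training tokens. Treat these values as assumptions.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n_params, n_tokens):
    """Predicted loss under the assumed parametric form."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def data_limited_floor(n_tokens):
    """Value the assumed form approaches as parameters grow with tokens fixed."""
    return E + B / n_tokens**BETA


# Hypothetical fixed data size: 20B tokens.
fixed_tokens = 20e9

# Keep multiplying parameters by 4 while the data stays the same.
param_counts = [1e9 * 4**k for k in range(8)]
losses = [loss(n, fixed_tokens) for n in param_counts]

for n, value in zip(param_counts, losses):
    print(f"params={n:.3g} predicted loss={value:.4f}")

print("floor with tokens fixed:", round(data_limited_floor(fixed_tokens), 4))
```

If that form held, the predicted loss would keep falling as parameters are added with data fixed, but it would approach the data-limited term rather than match what adding data does indefinitely. Whether real models follow the fitted curve that far out is exactly the open question.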
Interesting. So a “simplification” (if that’s what parameters are) of a larger number of tokens always needs to be larger… That’s already a suspect sense of “simplification”. And it’s roughly in proportion.
So that’s numbers for decreasing parameters in proportion to an increase in tokens.
What about increasing parameters in proportion to a drop in tokens?
For Chinchilla you said that resulted in a model which was only “slightly” worse for proportionately fewer tokens.
But they didn’t push this. They didn’t try just increasing the number of parameters?
I guess they couldn’t get numbers for that, because it would mean collecting more data to compare.
I’m guessing they haven’t been so interested in doing it, either. That probably seems like a dumb idea, because loading up on the token side is going to result in cheaper and smaller models. If it’s all proportional, why would you load up on the side which results in larger and more expensive models?
But it does seem, from the numbers you have given, that just endlessly churning out parameters might result in models which continue to get better, and in a way proportional to what you would achieve by endlessly increasing the data size.
Seems that way to me, anyway. Is there something I’m missing?