Yup. I don’t usually link papers here because I don’t think this is the right forum, but I’ll be happy to oblige if you want me to back up my most assertive statements:
- https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- [2009.07118] It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
- Andrea Madotto | Language Model as Few-Shot Learners for Task-Oriented Dialogue Systems is a brief and interesting blog post
- and of course, [2005.14165] Language Models are Few-Shot Learners
No, that’s not a concern at all. In practice, even for the most complex tasks you don’t need more than 2k dimensions in a layer at most. It goes back to why GloVe and other word/sentence embeddings were in the [256, 512] range. We don’t know why, but empirically they work just as well as ones in the thousands.
I don’t get what you mean. Gradient descent doesn’t brute-force the entire possible parameter space - in fact, that’s the entire problem it was supposed to solve.
Again, if you actually attempt to brute-force the parameters - let’s say each parameter is bounded in [0, 1] for simplicity, and you bucket it to 0.00001 precision (though in practice, precision is much lower with fp16, and you can push fp32 to about 7 decimal digits).
This discretization gives us 100000 possible values per parameter, and we’ll ignore the biases for simplicity. Thus, with 175 billion parameters each taking one of 100000 values, your entire space of possible parameter configurations for GPT-3 is: 100000^{175000000000}
So yea, just a tiny bit away from brute-forcing it
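To put a rough number on that, here’s a quick back-of-the-envelope sketch (my own, just to illustrate the scale; the [0, 1] bound, 1e-5 buckets, and ~175B parameter count are the same simplifying assumptions as above):

```python
import math

# Assumptions from the argument above (illustrative only):
# each parameter bounded in [0, 1], discretized to 1e-5 precision,
# ~175B parameters, biases ignored.
values_per_param = 100_000
num_params = 175_000_000_000

# Number of distinct configurations = values_per_param ** num_params.
# Far too large to materialize, so only compute its order of magnitude:
# log10(values_per_param ** num_params) = num_params * log10(values_per_param)
log10_configs = num_params * math.log10(values_per_param)
print(f"roughly 10^{log10_configs:.2e} possible configurations")
# -> roughly 10^(8.75e+11), i.e. a 1 followed by ~875 billion zeros.

# For contrast, gradient descent never enumerates any of this: it takes on
# the order of 1e5-1e6 update steps, each just following the local gradient.
```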
Hm? The energy… is spent on GPUs. What?
I believe there were some recent papers arguing that meta-learning is basically the language model learning gradient descent encoded in its weights. I haven’t read them, nor does there seem to be any particular consensus in the community, but I guess that’s always on the table.
I feel it just learns a more complex and nuanced algorithm that’s kinda near full meta-learning, but not quite. Eventually, scale would get us there, but we’d have to do it smartly to be compute- and data-efficient.