Yup. I don’t usually link papers here because I don’t think this is the correct forum, but I’ll be happy to oblige if you want me to back up my most assertive statements.
- Language Models are Unsupervised Multitask Learners: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- [2009.07118] It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
- Andrea Madotto's "Language Model as Few-Shot Learners for Task-Oriented Dialogue Systems" is a brief and interesting blog post
- and of course, [2005.14165] Language Models are Few-Shot Learners
No, that’s not a concern at all. In practice, even for the most complex tasks you don’t need more than 2k dimensions in a layer. It goes back to why GloVe and other word/sentence embeddings were in the [256, 512] range. We don’t know exactly why, but empirically they work just as well as embeddings in the thousands of dimensions.
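To see why the dimension budget matters, here’s a back-of-envelope sketch (my own arithmetic, not from any of the papers above; it counts only the big weight matrices of a standard transformer block with the usual 4x feed-forward expansion, ignoring biases and layer norms). Parameter count grows quadratically in the hidden dimension, which is why pushing past a couple of thousand dimensions gets expensive fast:

```python
# Rough weight count for one transformer block as a function of the
# hidden dimension d: attention projections ~4*d^2, feed-forward ~8*d^2.
# Illustrative only; real blocks also carry biases, layer norms, etc.
def block_params(d: int) -> int:
    attention = 4 * d * d            # Q, K, V, and output projections
    feed_forward = 2 * d * (4 * d)   # up- and down-projections, 4x expansion
    return attention + feed_forward

for d in (256, 512, 2048, 12288):    # 12288 is GPT-3's hidden size
    print(f"d={d:>5}: ~{block_params(d):,} weights per block")
```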
I don’t get what you mean. Gradient descent doesn’t brute-force the entire possible parameter space; in fact, avoiding that is the entire problem it was designed to solve.
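A toy example of what I mean (a 1-D sketch of my own, not from any linked paper): gradient descent only ever evaluates points along the slope, so it converges in a handful of steps rather than enumerating candidate values.

```python
# Toy gradient descent on f(w) = (w - 3)^2: follow the slope downhill
# instead of enumerating every possible value of w.
def grad(w: float) -> float:
    return 2 * (w - 3)   # derivative of (w - 3)^2

w, lr = 0.0, 0.1
for _ in range(50):
    w -= lr * grad(w)
print(w)   # ~3.0 after just 50 gradient evaluations, no enumeration needed
```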
Again, if you actually attempt to brute-force the parameters, let’s say each parameter is bounded in [0, 1] for simplicity, and you bucket it at 0.00001 precision (though in practice precision is lower: fp16 gives roughly 3 decimal digits, and you can push fp32 to about 7).
This discretization gives us 100,000 possible values per parameter, and ignoring the biases for simplicity, the entire space of possible parameter settings for GPT-3 is 100000^{175000000000}, i.e. about 10^{8.75×10^{11}} configurations.
So yeah, just a tiny bit away from brute-forcing it.
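If you want to check that number yourself, the arithmetic is a one-liner (my own sanity-check script, using the same assumptions as above):

```python
import math

# Discretized brute-force search space for GPT-3:
# 100,000 values per parameter ([0, 1] at 0.00001 precision),
# 175 billion parameters, biases ignored.
values_per_param = 100_000
num_params = 175_000_000_000

# The space is values_per_param ** num_params -- far too large to compute
# directly, so report its size as a power of ten instead.
exponent = num_params * math.log10(values_per_param)
print(f"search space ~ 10^{exponent:.2e} configurations")
# ~10^(8.75e+11), versus roughly 10^80 atoms in the observable universe
```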
Hm? The energy… is spent on GPUs. What?
I believe there were some recent papers arguing meta-learning is basically the language model learning gradient descent encoded in its weights. I haven’t read them, and there doesn’t seem to be any particular consensus in the community, but I guess that’s always on the table.
I feel it just learns a more complex and nuanced algorithm that’s kinda near full meta-learning, but not quite. Eventually, scale would get us there, but we’d have to do it smartly to be compute- and data-efficient.