The whole point of using LLMs is that they can do it, and they are the only models we have ever created that actually do. If you have any contradictory papers, you're welcome to post them here and have them dissected.
Otherwise, take the ability to explain jokes, for instance. When you provide a so-called "prompt" to an LLM, it tries to learn things from the prompt itself, which we call meta-learning. You can even give it patterns, which it can learn from and replicate accurately. This is analogous to your example of a seabird flying parallel to a ferry. Its ability to learn such correlations and attribute them to base rewards (food biologically, loss mathematically for models) is, as you just described, "intelligence".
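To make the "provide it patterns" part concrete, here is a minimal sketch of how such a prompt is usually put together. Everything in it is my own illustration: the helper name `build_few_shot_prompt`, the "Input"/"Output" labels, and the word-reversal pattern are assumptions, not any particular model's API. It only builds the text; you would still have to send it to a model of your choice.

```python
def build_few_shot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Format (input, output) demonstration pairs plus a new query as one prompt string.

    Hypothetical helper for illustration only; the labels are arbitrary choices.
    """
    # One block per demonstration pair.
    blocks = [f"{input_label}: {x}\n{output_label}: {y}" for x, y in examples]
    # The final block leaves the output empty for the model to complete.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


# Assumed toy pattern: reverse the word. No weights are updated anywhere;
# the pattern exists only inside the prompt text.
examples = [("cat", "tac"), ("ferry", "yrref"), ("gull", "llug")]
prompt = build_few_shot_prompt(examples, "bird")
print(prompt)
```

Whether a given model actually continues the pattern correctly depends on the model and its scale; the sketch only shows where the "learning material" sits.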
Suffice it to say, none of these models are explicitly trained to do any of this, which brings me to my final point.
Language is simply a proxy for data and patterns. The point that many in the wider scientific community miss (probably because of the huge diversity of opinion) is that transformers aren't there to replicate language. The sole aim is to inculcate full meta-learning capabilities.
As scale goes up, zero-shot capabilities go up. We don't know how far this trend holds, but right now things look very rosy at such large scales. This, too, is meta-learning.
The proof that its meta-learning capabilities extend beyond language? Code. Any software engineers here would attest to the complexity of writing code. PaLM, the new LLM, performs on par with Codex despite being trained on 50x less data.
This is where meta-learning is apparent: with so much less data about something, it still learns the task on par with other models.
It is the same reason the seagull's behavior is replicated by other animals (dogs too). This is meta-learning ability, and hence why language is just a proxy.
To end this already long rant, most research right now is focusing on multiple modalities (language, including symbolic, plus image, audio, etc.) for the same reason. These large backpropagated optimizers learn to meta-learn because it's the most effective way of getting lower loss across so many different modalities. CLIP and DALL-E 2 are prime examples.