In what context? Codex can write code and functions well outside its training dataset (which includes a good chunk of GitHub) - in that sense, for complex prompts it can create novel functions to aid in generating other parts of the code. Would that count? Does VPT, pretty much the best Minecraft model, which can craft and use virtual tools (like an iron sword), count?
Putting aside the fact that it's a pretty bad statement, if DL models come up with intrinsic algorithms to aid them in their task (something we have only recently started finding evidence for), does that count as “inventing” their own tools?
Yes, I am.
Even traditional (and old - there are newer, more modern examples of this) DL simulations show agents learning complex new behaviors to “cheat” their environment: novel strategies, exploiting physics engines, etc. That raven is just another example - it doesn't understand how a car works (you can close the door and that's it), but it knows how to locate food and consume it. If you think that's the peak of the abilities required for AGI, then I don't suppose any of our conversations are going to be productive.
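To make the “agents cheat their environment” point concrete, here's a toy sketch of reward hacking. Everything here is made up for illustration: a buggy little “physics” step where excessive torque glitches the body forward, and an “optimizer” that just picks whatever maximizes reward. The point is that maximizing a reward signal will happily land on the exploit, not the intended gait.

```python
def step_env(torque):
    """Buggy toy 'physics': intended forward speed saturates at 1.0,
    but torques above 5 trigger a glitch that launches the body."""
    if torque > 5.0:
        return 100.0          # the exploit: clip through the floor
    return min(torque, 1.0)   # the intended behaviour

def find_best_torque(candidates):
    # 'Training' here is just: pick whatever maximizes reward
    # (forward velocity), with no notion of the designer's intent.
    return max(candidates, key=step_env)

best = find_best_torque([0.5, 1.0, 3.0, 7.5])
print(best)  # → 7.5: the optimizer picks the glitch, not the intended gait
```

Real examples (agents pinning themselves into physics-engine seams, vibrating to farm velocity rewards) are the same failure mode with more dimensions.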
That's a huge misconception, so let me put it in bold. **Benchmarks are a way to judge whether a model has gained new abilities without putting it through a complex environment and a battery of challenges.** If PaLM suddenly shows spikes in understanding nuanced humour and complex conversation, then you can bet it may be able to understand jokes.
It's simply an exhaustive way to quickly assess the capabilities of a model and see what to expect. The goal isn't to grok all of them (there are plenty of baselines) but to quickly and accurately judge a model's capabilities.
Grokking all benchmarks wouldn't give you AGI - but if you have a model which can pass any number of benchmarks thrown at it, well, that may give you a tiny hint that you have a general model on your hands.
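The “spike on a benchmark hints at a new capability” idea is mechanically trivial - a benchmark is just scored prompts. A minimal sketch (the model, suites, and data below are all made up for illustration; real harnesses obviously use real suites and fuzzier scoring):

```python
def evaluate(model, benchmark):
    """Score a model on a benchmark: fraction of exact-match answers."""
    correct = sum(model(item["prompt"]) == item["answer"] for item in benchmark)
    return correct / len(benchmark)

# Toy stand-ins for suites like humour understanding or arithmetic.
benchmarks = {
    "humour": [{"prompt": "Why did the chicken cross the road?",
                "answer": "to get to the other side"}],
    "arithmetic": [{"prompt": "2+2", "answer": "4"}],
}

def dummy_model(prompt):
    # Stand-in model that only "knows" arithmetic.
    return "4" if prompt == "2+2" else "i don't know"

scores = {name: evaluate(dummy_model, b) for name, b in benchmarks.items()}
print(scores)  # a spike on one suite, flatline on another
```

You run the whole battery cheaply and read the per-suite profile, rather than building a bespoke environment for every ability you want to probe.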
Without going into embodiment, I would just offer a friendly reminder here: Boston Dynamics relies on conventional optimization theory, a mature, decades-old discipline of building mathematical models and optimizing them to perform certain maneuvers.
To put it simply, all Boston Dynamics robots are as hardcoded as symbolic systems. You specify what actions you want, input the details of an environment, and their mathematical frameworks calculate a trajectory to perform it. That's all. You can watch videos way older (like 10-15 years old; the TED demos are great - look at this 10-year-old video)
Hopefully, you'll notice some similarities between the capabilities of Boston Dynamics robots and the quadrupeds in the demo above. The only difference is that BD uses onboard perception algorithms (think SLAM) instead of relying on external cameras tracking shiny balls (motion-capture markers).
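The “specify a goal, solve for a trajectory” recipe is worth seeing in miniature, because there's no learning anywhere in it. Below is a toy version: a 1-D point must move from x=0 to x=1 over 20 timesteps, and we ask a generic optimizer for the trajectory minimizing total squared acceleration. (Real legged-robot controllers solve far richer constrained problems with dynamics models and contact forces; this only shows the shape of the approach.)

```python
import numpy as np
from scipy.optimize import minimize

N = 20                    # number of waypoints
start, goal = 0.0, 1.0    # boundary conditions

def cost(interior):
    x = np.concatenate(([start], interior, [goal]))
    accel = np.diff(x, n=2)        # discrete second derivative
    return np.sum(accel ** 2)      # smoothness objective

x0 = np.zeros(N - 2)               # initial guess: sit at the start
res = minimize(cost, x0)
trajectory = np.concatenate(([start], res.x, [goal]))
# The optimizer recovers the smoothest feasible path:
# evenly spaced waypoints, i.e. a straight line in time.
```

Everything the robot “knows” lives in the cost function and constraints the engineers wrote down - change the environment model and you must re-specify the problem, which is exactly the hardcoded-ness being pointed at.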
That's the technique farthest from AGI - but I agree. If a multi-modal DL model actually bootstrapped on the entire internet were built, then the best test would definitely be the real world.