An interesting benchmark for intelligent agents

The suite of intelligence tests provided by this benchmark is pretty interesting: Animal-AI Evaluation. The tests are easy for humans but hard for machines.

There is a competition going on here.


I recently watched Jeff on the AI podcast with Lex. Very good; he talked about needing new benchmarks too.

A little while ago I gave this topic some thought, and perhaps my thoughts might be of service to someone. (For more details, see the MaestroAI readme.)

I realized you can probably categorize environments along a spectrum from simple to complex. For instance, a static, simple environment would sit at one end of the spectrum, while a complex, dynamic environment would sit at the other.

It seemed natural to me that, with an example of each kind of environment, one would have a plethora of benchmarks to test an AI against, in order to understand how advanced (how generally intelligent) it is.

It seems to me that when categorizing environments, you can have differences in degree (such as a large state-space vs. a small one) and differences in kind (such as a static state-space where the agent is the only actor, vs. a dynamic one where the environment includes other actors or mutates on its own).
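As a rough sketch of that categorization (all names here are mine and purely illustrative, not from any existing library), the two axes might look like:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical axes for categorizing environments.
# "Degree" axes vary quantitatively; "kind" axes are qualitative.

class StateSpaceSize(Enum):      # a difference in degree
    SMALL = "small"              # fully enumerable / memorizable
    LARGE = "large"              # too big to exhaustively explore

class Dynamics(Enum):            # a difference in kind
    STATIC = "static"            # the agent is the only actor
    DYNAMIC = "dynamic"          # other actors, or self-mutating

@dataclass(frozen=True)
class EnvironmentType:
    size: StateSpaceSize
    dynamics: Dynamics

# Two of the examples discussed in this post:
tic_tac_toe = EnvironmentType(StateSpaceSize.SMALL, Dynamics.DYNAMIC)
go = EnvironmentType(StateSpaceSize.LARGE, Dynamics.DYNAMIC)
```

Each new "difference in kind" would simply become another field on `EnvironmentType`, doubling (or more) the number of cells in the grid.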

If a simulation could be created for each possible environment type, then you could incrementally learn intelligence principles that satisfy the requirements of all environments up to a certain level of complexity.

Consider tic-tac-toe. It’s a dynamic environment because there’s another player: your actions do not always lead to the same outcome. But even though it’s a dynamic system, the environment is so small that the state-space of all possible mutations can be memorized. Thus an AI that plays tic-tac-toe needs no intelligence or inference at all: its memory structure can be a database lookup table.
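To make the lookup-table point concrete, here is a minimal sketch (my own illustration, not from the post) that memoizes every reachable tic-tac-toe position via minimax; the entire table fits trivially in memory:

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)  # the "database lookup table"
def value(board, to_move):
    """Game value from X's perspective: +1 X wins, -1 O wins, 0 draw."""
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    nxt = "O" if to_move == "X" else "X"
    results = [value(board[:i] + to_move + board[i + 1:], nxt)
               for i, cell in enumerate(board) if cell == " "]
    return max(results) if to_move == "X" else min(results)

v = value(" " * 9, "X")  # -> 0: perfect play is a draw
table_size = value.cache_info().currsize  # a few thousand positions
```

The cached table covers every legal position, so play reduces to pure lookup with no inference at run time, which is exactly the point about small dynamic environments.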

Chess and Go are in the same environment type as tic-tac-toe as far as static vs. dynamic state-spaces go, but their state-space is so large that a database lookup table is insufficient. A bot could never explore the whole environment, let alone memorize and look up raw representations of all possible transitions.

Thus, with this example, we have a 2x2 grid of possible environment types:

|       | Static | Dynamic |
|-------|--------|---------|
| Small | a number line 000–999, traversable with +1 or −1 | tic-tac-toe |
| Large | a fully entropic n-dimensional grid (where a movement transports the agent to a determined but essentially random location on the grid) | Go |
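The "small, static" cell of the grid can itself be sketched in a few lines (interface names are mine, purely illustrative):

```python
class NumberLine:
    """A small, static environment: states 0..999, actions +1 / -1.
    The agent is the only actor, and every transition is deterministic."""

    def __init__(self, start=0):
        self.state = start

    def act(self, action):
        assert action in (+1, -1)
        # Clamp at the ends of the number line.
        self.state = max(0, min(999, self.state + action))
        return self.state

env = NumberLine(start=500)
env.act(+1)  # -> 501
env.act(-1)  # -> 500
```

With only 1,000 states and deterministic transitions, an agent could map this environment exhaustively, which is what puts it at the simplest corner of the grid.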

There are additional vectors (differences in kind) one can use to categorize environments as well. For example, the way an agent traverses the environment can range from two symmetrical movement options to millions of tiny actuators allowing high-dimensional asymmetrical movement; that alone gives two new vectors, corresponding to two new environment types to be added to the above grid.

Anyway, my main idea is this: if a simulation could be defined for every combination of environment categories, you could create an AGI Benchmark Suite that any agent could start interacting with (as a sensorimotor engine), so as to rate AI algorithms on their generalization abilities.
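A suite like that might expose one uniform sensorimotor interface to every environment; here is a hypothetical sketch (the names `SensorimotorEnv`, `score`, etc. are my own inventions, not an existing API):

```python
from abc import ABC, abstractmethod

class SensorimotorEnv(ABC):
    """Uniform interface every benchmark environment would implement,
    so one agent can be scored across all environment types unchanged."""

    @abstractmethod
    def observe(self): ...          # current sensory input

    @abstractmethod
    def act(self, action): ...      # apply a motor action, return a reward

    @abstractmethod
    def action_space(self): ...     # legal actions in the current state

def score(agent, envs, steps=1000):
    """Average per-step reward across the suite: a crude generality rating."""
    totals = []
    for env in envs:
        total = 0.0
        for _ in range(steps):
            action = agent.choose(env.observe(), env.action_space())
            total += env.act(action)
        totals.append(total / steps)
    return sum(totals) / len(totals)
```

The point of the abstraction is that the agent only ever sees observations, legal actions, and rewards; nothing about which cell of the environment grid it is currently in leaks through, so the final score reflects generalization rather than environment-specific engineering.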

Of course, perhaps that’s what the Animal-AI Olympics essentially is, I don’t know.


This is a real AI competition, but the reward is poor. Maybe people are not eagerly looking forward to AI (I mean real intelligence). It’s sad.