The Measure of Intelligence & ARC dataset

François Chollet has published an important essay on the measure of intelligence:

We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks, such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to “buy” arbitrary levels of skills for a system, in a way that masks the system’s own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience, as critical pieces to be accounted for in characterizing intelligent systems.

It may seem obvious to measure "skill-acquisition efficiency" rather than raw skill when assessing intelligence, but that has not been the case in current AI benchmarks.

The new dataset dedicated to measuring machine intelligence is called the Abstraction and Reasoning Corpus (ARC):
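To make the dataset concrete: each ARC task is a JSON object with "train" demonstration pairs and "test" pairs to solve, where grids are lists of rows and each cell is a color from 0 to 9. Below is a minimal sketch with an invented toy task (a "mirror each row" rule, not an actual ARC task) showing the induce-then-verify workflow the benchmark implies.

```python
import json

# A toy task in the ARC JSON schema: "train" demonstration pairs and
# "test" pairs to solve. Grids are lists of rows; cells are colors 0-9.
# (This tiny "flip the grid horizontally" task is invented for illustration.)
task_json = """
{
  "train": [
    {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
    {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]}
  ],
  "test": [
    {"input": [[5, 0, 0], [0, 6, 0]], "output": [[0, 0, 5], [0, 6, 0]]}
  ]
}
"""
task = json.loads(task_json)

def mirror(grid):
    """Candidate rule inferred from the demonstrations: mirror each row."""
    return [list(reversed(row)) for row in grid]

# Verify the rule on the demonstrations before applying it to the test pair.
assert all(mirror(p["input"]) == p["output"] for p in task["train"])
print(mirror(task["test"][0]["input"]))  # -> [[0, 0, 5], [0, 6, 0]]
```

The point of the format is that only a few demonstrations are given per task, so a solver must acquire the rule from very little experience, which is exactly the skill-acquisition efficiency the paper wants to measure.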

At first, I thought that this benchmark was lacking temporal data. Brains learn from continuous sensory data streams, so we need that kind of data to compare human and artificial intelligence. This was the reason why Numenta created the NAB benchmark.

But on further consideration, the kind of intelligence measured by this benchmark is a high-level abstract ability, like the one measured by IQ tests. This is the kind of intelligence we target when we talk about machine intelligence.

I consider the prediction of temporal data streams during sensorimotor interaction to be a necessary intermediate step towards machine intelligence, but not intelligence itself. This is the current focus of Numenta with HTM.

Once this intermediate step is reached, the next one will be to detach the symbols from the sensorimotor interactions they were grounded in, enabling more abstract reasoning by manipulating the symbols directly. The following paper was enlightening for me:

Extract from The symbol detachment problem, by Giovanni Pezzulo & Cristiano Castelfranchi, 2007:
Intelligence in strict sense (not in a trivially broad sense where just it means efficiency, adaptiveness of the behavior, like in insects) is […] the capacity to build a mental representation of the problem, and to work on it (e.g. reasoning), solving the problem ‘mentally’, that is working on the internal representation, which is necessarily at least in part detached since the agent has to modify something, to simulate, to imagine something which is not already there. Perhaps, on the mental ‘map’ the agent will act just by trials and errors, but it will not do so in its external behavior

Still a long road before us!


Wow, that is a long paper! I want to read it, but it is going to take me a while.


Chollet suggests relating system priors to the human priors of "Core Knowledge", i.e., what basically amounts to the Gestalt-psychology-like principles generally captured by IQ tests. Around these principles, he has developed a machine-friendly IQ test, the Abstraction and Reasoning Corpus. Alongside the notion of priors, he introduces some terminology such as skill acquisition and curriculum, and formally grounds this in algorithmic information theory in order to arrive at a (perhaps tentative) definition and measure of intelligence. This is all to serve as a guiding lantern, not a final word.

This paper is, if nothing else, a good survey on the history of artificial intelligence.


Just finished reading this, I am really intrigued by it. My main takeaways are that:

  1. The machine learning community overwhelmingly favors optimizing 'the wrong thing': optimizing for maximal skill at a particular task (e.g. chess, Go, DOTA 2) is fundamentally orthogonal to producing intelligent learning systems.
  2. Building human-level intelligence likely involves building systems with the same four priors that humans possess.

I’m curious if others have thoughts on the paper – it seems to be a rather powerful statement about the state of AI research in 2019 and a gentle call towards course correction.


While the paper's author sets a very ambitious task, to measure and compare intelligence, it is very constrained in what it actually does. The tasks proposed are a small sampling of the types of things the brain does, with a spotlight on what may well end up being "second order" effects of the brain.
Much of what humans do is expressed, in part or in whole, in dogs, cats, cetaceans, and corvids, yet they would not be able to do any of these tasks. Non-human primates may be able to do some of them, but I think the resulting score would end up being far lower than it should be.

I am not sure that what is measured is all there is to intelligence.


I agree with Mark that the suggested benchmark is a small sampling of the types of things the brain does, and that it lacks much of what humans and animals both do.

But considering this sampling as "second order" effects of the brain depends on the point of view. On my side, I consider that these "second order" effects are more mysterious, more complex to generate, and differentiate us from other animals. As such, I think they are good proxies for evaluating human-like cognition (I try not to use the term intelligence because there is no shared definition, and I know that it is the same for the term cognition :wink: ).

I think that we cannot build the “second order” machinery without having the “first order” machinery if we follow a biological approach. Numenta and others are focusing on the “first order” and this is already a big challenge!


The reason I consider that to be a "second order" effect is that without learning a language, humans are little better than the other primates. A large portion of what we consider intelligent behavior is tricks we learn as we acquire language. The ARC dataset makes the claim that the test must assume equivalent backgrounds and priors. Without the priors that humans pick up while learning language, they would be vastly less than what we normally consider to be human-level intelligence.


Mark, I really appreciate you posting this research paper. This is some of the most useful information I have come across so far. The paper spells out exactly my thoughts regarding the importance of defining intelligence. It took me a while to finish reading it, but here are some of the useful excerpts I got from it:

For instance, common-sense dictionary definitions of intelligence may be useful to make sure we are talking about the same concepts, but they are not useful for our purpose, as they are not actionable, explanatory, or measurable.
I totally agree with this. It is very important for a definition to be actionable.

What’s worse, very little attention has been devoted to rigorously defining it or benchmarking our progress towards it
As I have learned so far with my first post.

When it comes to creating artificial human-like intelligence, low-level sensorimotor priors are too specific to be of interest
I think the author used better terminology than I did. But as I indicated before, I don't think the sensorimotor side is necessary.

the purpose of our definition is to be actionable. quantitative foundation for new general intelligence benchmarks
And this is exactly what I had in mind. Without an actionable definition of intelligence, it is impossible to quantitatively verify whether an implementation has achieved what it set out to do.

Intelligence must involve learning and adaptation
The author gives an impressive definition, including a mathematical equation as a measure. But I think this phrase sums up, at a very high level, what the essential elements are, and we can further break down what learning and adaptation are.

Time efficiency
Energy efficiency
I share the same view as in my original comment.

This paper is very, very important. I really appreciate you sharing it with me. I will definitely read it again and follow the author's work closely. I highly recommend everyone read this paper. Very insightful, and a very important first step.


It is funny to see how "time" or the movement of objects could help answer those tests (e.g. extensions and horizontal movements). Some objects also have hierarchy (e.g. a red frame with blue objects inside). I think object recognition (whether two objects are the same in picture 1 and picture 2) and tracking their change or movement could be the first step. Then analogies of movements and objects come next, if we want to solve the problem in a humanish way.

Ah yes, the human priors of moving things and colors and object recognition. How would an “intelligent agent” be able to follow the instructions without a vast suite of these priors?

Octopus critters are very clever, but I wonder how they would fare on these tests.


The concept of movement is fascinating. If we cut time into many segments, and an object moves, we would say it is the same object at t1 and t2. Even if the color or shape changes, we would say it is a change to the object rather than considering them two distinct objects.

Thus, I am thinking that movement is a memory-efficient way to represent a sequence of information. We could store the same object once instead of repeating the information.

We also happen to have the same objects (same shape and color) in some examples. It might be possible (in the future) to come up with a representation of movement as the transformation of similar objects across time segments, so that an agent has source material to develop an understanding of movement.
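This "same object, new position" idea can be sketched in a few lines. The helper names and the toy grids below are my own invention, not part of ARC's tooling: the sketch treats two time segments as grids, and asks whether the second is the first with every non-background cell shifted by one common offset, i.e. one object that moved rather than two distinct objects.

```python
def cells(grid):
    """Set of (row, col, color) for non-background cells (background = 0)."""
    return {(r, c, v) for r, row in enumerate(grid)
            for c, v in enumerate(row) if v != 0}

def find_translation(grid_t1, grid_t2):
    """If grid_t2 is grid_t1 with everything shifted by one common
    offset, return that (dr, dc); otherwise return None."""
    a, b = cells(grid_t1), cells(grid_t2)
    if len(a) != len(b) or not a:
        return None
    # Try the offset implied by one reference cell, then verify globally.
    r0, c0, v0 = min(a)
    for r1, c1, v1 in b:
        if v1 != v0:
            continue
        dr, dc = r1 - r0, c1 - c0
        if {(r + dr, c + dc, v) for r, c, v in a} == b:
            return (dr, dc)
    return None

t1 = [[0, 7, 7],
      [0, 7, 0],
      [0, 0, 0]]
t2 = [[0, 0, 0],
      [0, 7, 7],   # the same L-shaped object, moved down by one row
      [0, 7, 0]]
print(find_translation(t1, t2))  # -> (1, 0)
```

If a single offset explains the whole frame, the two segments can be stored as one object plus a (dr, dc) delta, which is the memory saving described above.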

The ability to accumulate information is necessary.
