The Measure of Intelligence & ARC dataset

François Chollet has published an important essay on the measure of intelligence:

We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks, such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to “buy” arbitrary levels of skills for a system, in a way that masks the system’s own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience, as critical pieces to be accounted for in characterizing intelligent systems.

It may seem obvious to measure “skill-acquisition efficiency” instead of raw skill to assess intelligence, but that has not been the case in current AI benchmarks.

The new dataset dedicated to measuring machine intelligence is called the Abstraction and Reasoning Corpus (ARC).
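For reference, each ARC task is distributed as a JSON file with a handful of demonstration (“train”) input/output grid pairs and one or more “test” pairs to solve; grids are 2-D lists of integers 0–9, each integer standing for a color. Here is a minimal sketch of that format in Python; the grids below are made-up toy examples, not real ARC data:

```python
import json

# A hypothetical ARC-style task in the corpus's JSON format:
# "train" holds demonstration input/output grid pairs, "test" holds
# the pairs a solver must complete. Grids are 2-D lists of ints 0-9.
task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
"""

task = json.loads(task_json)

def grid_size(grid):
    """Return (rows, cols) of a grid."""
    return (len(grid), len(grid[0]))

# A solver sees only a few demonstrations per task, so the skill
# has to be acquired from very little experience -- which is the point.
for pair in task["train"]:
    print(grid_size(pair["input"]), "->", grid_size(pair["output"]))
```

The tiny number of demonstrations per task is what makes ARC a test of skill-acquisition efficiency rather than of skill bought with unlimited training data.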

At first, I thought that this benchmark was lacking temporal data. Brains learn from continuous sensory data streams, so we need that kind of data to compare human and artificial intelligence. This was the reason why Numenta created the NAB benchmark.

But after considering it further, the kind of intelligence measured in this benchmark is a high-level abstract ability, like the one measured in IQ tests. This is the kind of intelligence we target when we talk about machine intelligence.

I consider the prediction of temporal data streams during sensorimotor interactions a necessary intermediate step towards machine intelligence, but not intelligence itself. This is the current focus of Numenta with HTM.

Once this intermediate step is reached, the next one will be to detach the symbols from the sensorimotor interactions they were grounded in, in order to perform more abstract reasoning by playing directly with the symbols. The following paper was enlightening for me:

Extract from The symbol detachment problem, by Giovanni Pezzulo & Cristiano Castelfranchi, 2007:
Intelligence in strict sense (not in a trivially broad sense where just it means efficiency, adaptiveness of the behavior, like in insects) is […] the capacity to build a mental representation of the problem, and to work on it (e.g. reasoning), solving the problem ‘mentally’, that is working on the internal representation, which is necessarily at least in part detached since the agent has to modify something, to simulate, to imagine something which is not already there. Perhaps, on the mental ‘map’ the agent will act just by trials and errors, but it will not do so in its external behavior

Still a long road before us!


Wow, that is a long paper! I want to read it, but it is going to take me a while.


Chollet suggests relating system priors to the human priors of “Core Knowledge”, i.e., what basically amounts to the Gestalt-psychology-like principles generally captured by IQ tests. Around these principles, he has developed a machine-friendly IQ test, the Abstraction and Reasoning Corpus. Alongside the notion of priors he introduces some terminology, such as skill acquisition and curriculum, and formally grounds this in Algorithmic Information Theory in order to arrive at a (perhaps tentative) definition and measure of intelligence. This is all meant to serve as a guiding lantern, not a final word.

This paper is, if nothing else, a good survey on the history of artificial intelligence.


Just finished reading this, I am really intrigued by it. My main takeaways are that:

  1. The machine learning community overwhelmingly favors optimizing ‘the wrong thing’: optimizing for maximal skill in a particular task (e.g. chess, Go, DOTA 2, etc.) is fundamentally orthogonal to producing intelligent learning systems.
  2. Building human-level intelligence likely involves building systems with the same Core Knowledge priors that humans possess (objectness and elementary physics, agentness and goal-directedness, numbers and counting, and basic geometry and topology).

I’m curious if others have thoughts on the paper – it seems to be a rather powerful statement about the state of AI research in 2019 and a gentle call towards course correction.

While the paper sets itself a very ambitious task, to measure and compare intelligence, it is very constrained in what it actually does. The tasks proposed are a small sampling of the types of things the brain does, with a spotlight on what may well end up being “second order” effects of the brain.
Much of what humans do is expressed, in part or in whole, in dogs, cats, cetaceans, and corvids, yet they would not be able to do any of these tasks. Non-human primates may be able to do some of them, but I think the resulting score would end up far lower than it should be.

I am not sure that what is measured is all there is to intelligence.


I agree with Mark that the suggested benchmark is a small sampling of the types of things the brain does, and that it lacks much of what humans and animals both do.

But considering this sampling as “second order” effects of the brain depends on the point of view. On my side, I consider that these “second order” effects are more mysterious, more complex to generate, and are what differentiate us from other animals. As such, I think they are good proxies for evaluating human-like cognition (I try not to use the term intelligence because there is no shared definition, and I know that it is the same for the term cognition :wink: ).

I think that, if we follow a biological approach, we cannot build the “second order” machinery without first having the “first order” machinery. Numenta and others are focusing on the “first order”, and this is already a big challenge!


The reason I consider that to be a “second order” effect is that without learning a language, humans are little better than the other primates. A large portion of what we consider intelligent behavior consists of tricks we learn as we acquire language. The ARC dataset makes the claim that the test must assume equivalent backgrounds and priors. Without the priors that humans pick up while learning language, humans would fall vastly short of what we normally consider human-level intelligence.