The Thousand Brains Theory and hierarchical reinforcement learning

If I understand the Thousand Brains Theory correctly, one of the basic ideas is:

  1. The human brain learns compositional, hierarchical representations of objects, e.g. a car is learned as an object, a car seat is learned as its own object, and the head of a car seat is learned as another object still. We can zoom in or out as much as we want, go up or down the stack of levels as much as we want, and we are always at Level 1 — whatever part we are thinking about is, to us, as much of an object as the whole. We wouldn’t get confused if we saw a car seat sitting on the floor in a car factory. It’s easy for us to cognitively separate an object from a bigger object it is a part of. In other words, we can abstract the object from its usual context.

  2. This compositional, hierarchical representation extends beyond physical objects to abstract concepts — anything we can think about. For example, we can think of The Thousand Brains Theory as being composed of (1) and (2), among other components.

I was just reading about hierarchical reinforcement learning, and the parallel struck me. Hierarchical reinforcement learning is about learning actions in a hierarchical, compositional way. For example, the action of getting a cup of coffee is composed of smaller actions:

  • grabbing a mug from the cupboard
  • pouring the coffee
  • adding sugar

These smaller actions are also composed of still smaller actions. The action of grabbing a mug from the cupboard is composed of:

  • opening the cupboard
  • picking up the closest mug by its handle
  • moving the mug out of the cupboard
  • closing the cupboard

Each of these actions is composed of smaller sensorimotor actions most of us probably aren’t even aware of. The stuff that toddlers and robots struggle with.
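
To make the compositional structure concrete, here is a minimal sketch of how that action hierarchy could be written down. All the names are invented for illustration; they aren’t taken from any particular HRL framework.

```python
# A minimal sketch of the compositional action hierarchy described above.
# Each entry is either a primitive sensorimotor action (marked None) or a
# dictionary of smaller actions. All names are invented for illustration.

GET_CUP_OF_COFFEE = {
    "grab_mug_from_cupboard": {
        "open_cupboard": None,
        "pick_up_closest_mug_by_handle": None,
        "move_mug_out_of_cupboard": None,
        "close_cupboard": None,
    },
    "pour_coffee": None,
    "add_sugar": None,
}

def print_hierarchy(action, depth=0):
    """Walk the hierarchy, printing sub-actions indented under their parents."""
    for name, children in action.items():
        print("  " * depth + name)
        if children is not None:
            print_hierarchy(children, depth + 1)

print_hierarchy(GET_CUP_OF_COFFEE)
```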

Hierarchical reinforcement learning is potentially game-changing because, if it can be implemented successfully, it would in theory eliminate much of the combinatorial explosion that happens with actions that involve many steps. Brute force cracking a 5-character password takes less than 1 second; cracking a 50-character password takes something like 10^77 years. For comparison, all galaxies will cease to exist (except for black holes) in 10^40 years (but don’t worry, civilization can still survive). As OpenAI puts it, reinforcement learning is “brute force search” over possible actions. By building bigger actions out of smaller ones, hierarchical reinforcement learning reduces the number of action combinations an agent needs to try before it finds the right one.
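
To put rough numbers on that reduction, here is a back-of-envelope calculation. All the figures (20 primitive movements, 60 steps per episode, options that bundle 6 primitives each) are invented purely to illustrate the scaling, not taken from any real system.

```python
# Invented numbers, purely to illustrate the scaling argument above.
primitives = 20        # low-level movements the robot can make
steps = 60             # primitive steps a coffee-making episode needs

flat_search_space = primitives ** steps              # ~1.15e78 sequences

# If the agent already knows 10 reusable high-level actions ("options"),
# each bundling roughly 6 primitives, the same episode is only 10 decisions.
options = 10
high_level_steps = steps // 6
hierarchical_search_space = options ** high_level_steps   # 1e10 sequences

print(f"flat:         {flat_search_space:.2e}")
print(f"hierarchical: {hierarchical_search_space:.2e}")
```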

For example, imagine you have an agent — a prototype household robot, in development — that is already trained on picking up a variety of objects, and has no problem picking up mugs. Suppose it’s also been trained on pouring liquids, opening doors and cupboards, and measuring out quantities with spoons. Since this is a general-purpose household robot, suppose it’s trained on a few dozen or a few hundred actions like this. Now you can train a virtual version of your robot in a kitchen simulator. It can try a vast number of combinations of its known actions, perhaps doing centuries of simulated exploration in a single day.

The reward function could be set up like this: +1 point for producing a mug full of coffee with two spoons of sugar, 0 for anything else. Or, to make training easier, maybe: +1 for a mug, +2 for a mug filled with coffee, and +3 for a mug filled with coffee and two spoons of sugar. (You could also add -1 for leaving cupboards open, -2 for spilling liquids or powders, and -5 for dropping any objects.) The agent randomly explores combinations of actions, trying to find combinations that increase its score. A time limit can be imposed for each round of exploration to avoid needlessly long sequences of actions.
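
Here is one way the shaped version of that reward could look in code. The state fields (has_mug, mug_has_coffee, and so on) are hypothetical stand-ins for whatever a real kitchen simulator would expose, and I’m reading the +1/+2/+3 shaping as the value of the best milestone reached.

```python
# A sketch of the shaped reward described above. State fields are hypothetical.

def reward(state):
    # Milestone component: reward the best milestone reached so far.
    if state["has_mug"] and state["mug_has_coffee"] and state["sugar_spoons"] == 2:
        r = 3.0
    elif state["has_mug"] and state["mug_has_coffee"]:
        r = 2.0
    elif state["has_mug"]:
        r = 1.0
    else:
        r = 0.0
    # Optional penalties mentioned in the post.
    r -= 1.0 * state["cupboards_left_open"]
    r -= 2.0 * state["spill_events"]
    r -= 5.0 * state["dropped_objects"]
    return r
```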

When the agent randomly explores combinations of actions like pick up object or pour liquid, training can happen much faster than if the actions were individual movements of its arms, fingers (or pincers), and legs (or wheels). The space of possible action combinations is much smaller.

This is related to the problem of credit assignment in reinforcement learning. In non-hierarchical reinforcement learning, the agent can’t distinguish between failing on the overall action and failing on any of the smaller actions. If it doesn’t produce a mug filled with coffee and sugar, it doesn’t know which of its perhaps hundreds or thousands of movements are to blame. If it succeeds, it doesn’t know which movements deserve the credit.
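
A toy illustration of the difference: with a hierarchy, each reward or penalty can be attributed to the high-level action that was active when it arrived, instead of being smeared across thousands of motor commands. The trajectory below is invented.

```python
# Invented trajectory showing option-level credit assignment: each reward or
# penalty is booked against the high-level action that was running at the time.

trajectory = [
    # (active high-level action, reward received while it was running)
    ("grab_mug_from_cupboard", +1.0),
    ("pour_coffee",            -2.0),   # spill penalty
    ("add_sugar",               0.0),
]

credit = {}
for option, r in trajectory:
    credit[option] = credit.get(option, 0.0) + r

print(credit)
# {'grab_mug_from_cupboard': 1.0, 'pour_coffee': -2.0, 'add_sugar': 0.0}
```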

I thought it was striking how what Numenta has theorized about physical objects and abstract concepts also extends to actions, and that AI researchers are trying to get reinforcement learning systems to think about actions the same way humans do.


I think this is very important for advancing the Thousand Brains Theory. Many subcortical structures will likely turn out to be necessary for achieving AGI.

HTM is consistent with phenomenological aspects of perception in addition to neuroscience. One thing it lacks (AFAIK) is an account of the sequential aspects of thinking and behavior. HRL would be one approach to this, but it doesn’t seem founded in neuroscience. There’s no reason to disqualify it on that basis, but there’s probably much more to learn about how to do this from biological brains.

A good starting point from a neuroscience perspective may be recent research, for example the work of Bush et al. looking at signs of how the brain prepares for action. There seems to be a sequential queue that is populated in parallel from parahippocampal and cerebellar sources.

Here is their own summary:

Our results demonstrate the “competitive queuing” (CQ) of upcoming action representations, extending previous computational and non-human primate recording studies to non-invasive measures in humans. In addition, we show that CQ reflects an ordinal template that generalizes across specific motor actions at each position. Finally, we demonstrate that CQ predicts participants’ production accuracy and originates from parahippocampal and cerebellar sources. These results suggest that the brain learns and controls multiple sequences by flexibly combining representations of specific actions and interval timing with high-level, parallel representations of sequence position.

I have a feeling that Jeff and some of the other HTM theorists could make good use of this research in addition to the excellent post by strangecosmos.

Jack
