If I understand the Thousand Brains Theory correctly, two of its basic ideas are:
1. The human brain learns compositional, hierarchical representations of objects. For example, a car is learned as an object, a car seat is learned as its own object, and the headrest of a car seat is learned as another object still. We can zoom in or out as much as we want, go up or down the stack of levels as much as we want, and we are always at Level 1 — whatever part we are thinking about is, to us, as much of an object as the whole. We wouldn’t get confused if we saw a car seat sitting on the floor in a car factory. It’s easy for us to cognitively separate an object from a bigger object it is a part of. In other words, we can abstract the object from its usual context.
2. This compositional, hierarchical representation extends beyond physical objects to abstract concepts — anything we can think about. For example, we can think of The Thousand Brains Theory as being composed of (1) and (2), among other components.
I was just reading about hierarchical reinforcement learning, and the parallel struck me. Hierarchical reinforcement learning is about learning actions in a hierarchical, compositional way. For example, the action of getting a cup of coffee is composed of smaller actions:
- grabbing a mug from the cupboard
- pouring the coffee
- adding sugar
These smaller actions are also composed of still smaller actions. The action of grabbing a mug from the cupboard is composed of:
- opening the cupboard
- picking up the closest mug by its handle
- moving the mug out of the cupboard
- closing the cupboard
Each of these actions is composed of smaller sensorimotor actions most of us probably aren’t even aware of. The stuff that toddlers and robots struggle with.
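This kind of decomposition can be sketched as a lookup table of named skills, where each skill either expands into sub-skills or bottoms out in a primitive action. This is only a toy illustration — the action names are made up for this example:

```python
# A minimal sketch of hierarchical action decomposition.
# Each named action is either primitive (no entry in the table)
# or a sequence of sub-actions. All names are hypothetical.

ACTIONS = {
    "get_cup_of_coffee": [
        "grab_mug_from_cupboard",
        "pour_coffee",
        "add_sugar",
    ],
    "grab_mug_from_cupboard": [
        "open_cupboard",
        "pick_up_closest_mug_by_handle",
        "move_mug_out_of_cupboard",
        "close_cupboard",
    ],
}

def flatten(action):
    """Recursively expand an action into its primitive steps."""
    if action not in ACTIONS:  # primitive: no further decomposition
        return [action]
    steps = []
    for sub in ACTIONS[action]:
        steps.extend(flatten(sub))
    return steps

print(flatten("get_cup_of_coffee"))
```

Flattening `get_cup_of_coffee` walks the hierarchy down to six primitive steps, starting with `open_cupboard` and ending with `add_sugar`.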
Hierarchical reinforcement learning is potentially game-changing because, if it can be implemented successfully, it would in theory eliminate much of the combinatorial explosion that arises with actions that involve many steps. Brute-force cracking a 5-character password takes less than 1 second; cracking a 50-character password takes something like 10^77 years. For comparison, all galaxies will cease to exist — except for black holes — in 10^40 years (but don’t worry, civilization can still survive). As OpenAI puts it, reinforcement learning is “brute force search” over possible actions. By building up bigger actions out of smaller actions, hierarchical reinforcement learning reduces the number of possible action combinations that an agent needs to try to find the right combination.
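The arithmetic behind this point is easy to sketch: the number of candidate action sequences grows exponentially with sequence length, so composing bigger actions out of smaller ones, which shortens the sequences, shrinks the search space enormously. The numbers below are illustrative, not taken from any real system:

```python
# Illustrative (made-up) numbers for the combinatorial-explosion point.
primitive_actions = 50    # low-level motor commands available
primitive_steps = 1000    # motor steps to make coffee from scratch
high_level_actions = 50   # learned skills like "pick up mug"
high_level_steps = 10     # skills needed to make coffee

# Candidate sequences grow as (choices per step) ** (number of steps).
flat_space = primitive_actions ** primitive_steps
hierarchical_space = high_level_actions ** high_level_steps

print(f"flat search space:         10^{len(str(flat_space)) - 1}")          # 10^1698
print(f"hierarchical search space: 10^{len(str(hierarchical_space)) - 1}")  # 10^16
```

With the same 50 choices per step, cutting the sequence length from 1000 to 10 takes the search space from roughly 10^1698 candidate sequences down to roughly 10^16.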
For example, imagine you have an agent — a prototype household robot, in development — that is already trained on picking up a variety of objects, and has no problem picking up mugs. Suppose it’s also been trained on pouring liquids, opening doors and cupboards, and measuring out quantities with spoons. Since this is a general-purpose household robot, suppose it’s trained on a few dozen or a few hundred actions like this. Now you can train a virtual version of your robot in a kitchen simulator. It can try a vast number of combinations of its known actions, perhaps doing centuries of simulated exploration in a single day.
The reward function could be set up like this: +1 point for producing a mug full of coffee with two spoons of sugar, 0 for anything else. Or, to make training easier, maybe: +1 for producing a mug, +2 for a mug filled with coffee, and +3 for a mug filled with coffee and two spoons of sugar. (You could also add -1 for leaving cupboards open, -2 for spilling liquids or powders, and -5 for dropping any objects.) The agent randomly explores combinations of actions, trying to find combinations that increase its score. A time limit can be imposed on each round of exploration to avoid needlessly long sequences of actions.
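Here is what that reward function might look like in code, reading the three tiers as mutually exclusive milestones (the agent earns the highest tier it reaches). The `KitchenState` fields are hypothetical stand-ins for whatever state a real simulator would expose:

```python
from dataclasses import dataclass

@dataclass
class KitchenState:
    # Hypothetical simulator state; a real simulator would expose more.
    has_mug: bool = False
    mug_filled_with_coffee: bool = False
    spoons_of_sugar: int = 0
    cupboards_left_open: int = 0
    spills: int = 0
    dropped_objects: int = 0

def reward(state: KitchenState) -> int:
    # Milestone reward: the highest tier the agent reached.
    if state.mug_filled_with_coffee and state.spoons_of_sugar == 2:
        r = 3
    elif state.mug_filled_with_coffee:
        r = 2
    elif state.has_mug:
        r = 1
    else:
        r = 0
    # Optional penalties from the parenthetical above.
    r -= state.cupboards_left_open   # -1 per open cupboard
    r -= 2 * state.spills            # -2 per spill
    r -= 5 * state.dropped_objects   # -5 per dropped object
    return r

print(reward(KitchenState(has_mug=True,
                          mug_filled_with_coffee=True,
                          spoons_of_sugar=2)))  # → 3
```

A full success with no penalties scores 3; an empty kitchen scores 0; a mug on the counter next to a spill scores 1 - 2 = -1.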
When the agent is randomly exploring combinations of actions like pick up object or pour liquid, imagine how much faster training can happen than if the actions were individual movements of its arms, fingers (or pincers), and legs (or wheels). The space of possible action combinations is much smaller.
This is related to the problem of credit assignment in reinforcement learning. In non-hierarchical reinforcement learning, the agent can’t distinguish between failing on the overall action and failing on any of the smaller actions. If it doesn’t produce a mug filled with coffee and sugar, it doesn’t know which of its perhaps hundreds or thousands of movements are to blame. If it succeeds, it doesn’t know which movements deserve the credit.
I found it striking that what Numenta has theorized about physical objects and abstract concepts also extends to actions, and that AI researchers are trying to get reinforcement learning systems to think about actions the same way humans do.