Proposing a Model for the Basal Ganglia and Reinforcement Learning in HTM

Extensions to Incorporate Feedback, Hierarchy, and Curiosity

Thanks to my discussions with Casey and Paul_Lamb, I’ve found some details I’d like to add to the model.

Curiosity

A quick, small extension here: as Paul suggested, provided that the Striatum is performing some kind of temporal memory, bursting there could serve as a mechanism for a form of curiosity. If a suggested action appears that has not been seen before (at least in the current context), Striatal “minicolumns” would be expected to burst. This could throw off the population coding, possibly giving a cortical column an extra bias that pushes for actions to be taken without a clear prediction of whether they will lead to “good” or “bad” results.
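To make the bursting idea concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions: the function names, the `cells_per_column` value, and the idea of turning the fraction of bursting minicolumns into a scalar `curiosity_bias` are all hypothetical, not something taken from NuPIC or from the biology.

```python
def minicolumn_activity(active_columns, predicted_cells, cells_per_column=4):
    """Temporal-memory style activation for one time step.

    active_columns: set of striatal minicolumn indices driven by the suggested action.
    predicted_cells: set of (column, cell) pairs that were predicted from context.
    Returns (active_cells, bursting_columns).
    """
    active_cells = set()
    bursting = set()
    for col in sorted(active_columns):
        predicted_here = {(c, i) for (c, i) in predicted_cells if c == col}
        if predicted_here:
            # The input was expected in this context: only predicted cells fire.
            active_cells |= predicted_here
        else:
            # Unexpected input: every cell in the minicolumn fires (bursting).
            bursting.add(col)
            active_cells |= {(col, i) for i in range(cells_per_column)}
    return active_cells, bursting


def curiosity_bias(active_columns, bursting_columns, gain=1.0):
    """Assumed mapping: the fraction of bursting minicolumns becomes an extra bias."""
    if not active_columns:
        return 0.0
    return gain * len(bursting_columns) / len(active_columns)


columns = {1, 2, 3}
# Familiar action: every active minicolumn had a predicted cell, so nothing bursts.
familiar_cells, familiar_burst = minicolumn_activity(columns, {(1, 0), (2, 1), (3, 2)})
# Novel action in this context: two of the three minicolumns burst.
novel_cells, novel_burst = minicolumn_activity(columns, {(1, 0)})

print(curiosity_bias(columns, familiar_burst))
print(curiosity_bias(columns, novel_burst))
```

The familiar action yields no bias, while the novel one yields a bias proportional to how much of the pattern was unpredicted. How that bias would actually reach the cortical column is left open here.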

Action Selection

The core model I’ve suggested seems to explain most of how actions are selected; columns provide a union of SDRs for actions, the Striatum recognizes patterns in that SDR that it associates with positive/negative reinforcement, and then manipulates the inhibition in each column based on the patterns it found (more positive patterns mean less inhibition, more negative patterns mean more inhibition). The amount of inhibition has a major impact on the sparsity, and too much could easily stop all activity in a column, giving columns a large bias against suggesting bad actions.
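As a minimal sketch of that loop (assuming SDRs are plain Python sets of active bit indices, and with every function name, the `baseline` and `gain` parameters, and the linear mapping from value to inhibition being my own placeholders rather than anything established), it might look like this:

```python
def striatal_value(action_sdr, good_patterns, bad_patterns):
    """Score a union of action SDRs in [-1, 1] by overlap with learned patterns.

    good_patterns / bad_patterns: lists of sets the Striatum has associated
    with positive / negative reinforcement.
    """
    good = sum(len(action_sdr & p) for p in good_patterns)
    bad = sum(len(action_sdr & p) for p in bad_patterns)
    total = good + bad
    return 0.0 if total == 0 else (good - bad) / total


def inhibition_level(value, baseline=0.5, gain=0.5):
    """More positive value means less inhibition, more negative means more."""
    return min(1.0, max(0.0, baseline - gain * value))


def apply_inhibition(overlaps, inhibition, max_active=10):
    """k-winners-take-all where inhibition shrinks k (it controls sparsity).

    overlaps: dict mapping cell id -> feed-forward overlap score.
    At inhibition 1.0 no cell survives, silencing the column.
    """
    k = int(round(max_active * (1.0 - inhibition)))
    ranked = sorted(overlaps, key=lambda cell: (-overlaps[cell], cell))
    return set(ranked[:k])


action_sdr = {1, 2, 3, 4}
mixed = striatal_value(action_sdr, good_patterns=[{1, 2}], bad_patterns=[{3, 4}])
overlaps = {cell: cell for cell in range(10)}
print(apply_inhibition(overlaps, inhibition_level(mixed)))   # half the cells survive
print(apply_inhibition(overlaps, inhibition_level(-1.0)))    # column is silenced
```

The design choice worth noting is that inhibition acts on the whole column at once, which is exactly the limitation discussed next.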

The problem is that an entire column seems too coarse a unit to bias; there may be many very different actions represented in a column simultaneously. How does the brain separate them?

[Image: bgtopology]

Let’s say that we have a bit of cortex, similar to the one in the picture. Each square in the grid is a column. White columns are inactive, red columns are suggesting “bad” actions, blue are suggesting “good” actions, and columns that are both red and blue are suggesting a union of “good” and “bad” actions. Let’s call those ones ambiguous.

The brain is very topologically structured; as a result, columns that represent similar patterns will tend to be close together. The brain also has a motor map, mapping regions of the motor cortex to muscles. High-level regions of the Executive Hierarchy in the frontal lobe will likely be fairly different from this map, but lower-level regions should tend to be organized closer to the motor map. The entire Executive Hierarchy (frontal lobe) likely will only be sparsely populated by active columns; after all, the PFC might contain a large number of strategies for solving problems, but only a small number will be used at any point in time.

This also means that different actions should result in different columns becoming active. If one particular column suggests two different actions, the BG will be unable to separate them. However, when passed to the next region the two suggested actions should result in many different columns becoming active, provided that the two actions are sufficiently different.

In the example in the above image, a single active column in a higher-level region (left) may result in many columns becoming active in a lower-level region (right). A “good” (blue) action in the left region will likely trigger many other “good” patterns in columns in the next region. A “bad” (red) action will likewise trigger many other “bad” patterns in lower-level columns. There may also be some overlap.

Due to topology, these “good/bad” patterns should now be spread out across many columns. The only case where this would not occur is if the two suggested actions are almost identical in terms of their lower-level sub-actions, but such a situation should be rare (as they would then not really be different actions).

In this example, the BG would very quickly produce high inhibition in the red columns, low inhibition in the blue columns, and medium inhibition in the ambiguous columns. The red columns would be quickly blocked out from the inhibition, the blue columns would become very active, and ambiguous columns would become more selective about which neurons fire.

[Image: bgconnect]

This other diagram shows connections between regions in a hierarchy (left), and between a region and the BG (right; Paul asked for that one). L5 of a cortical column is the primary motor layer in the cortex, and functions as the main interface with the BG (right diagram). In the left diagram, the blue arrows show feed-forward connections and the red arrows show feedback, specifically between regions at different levels of a hierarchy.

Note the feedback connection from L5/L6 to L1/L6. The L5-to-L1 connection is what matters here. L1, while normally omitted in HTM theory due to its lack of neurons, is actually full of apical dendrites (feedback dendrites) from neurons in the other layers, including L5.

What this means is that, in our example, blue columns in the right region will provide feedback to the same neurons that caused them to fire. This means that the neurons associated with the “good” action in the column in the left region will become further depolarized, and neurons associated with the “bad” action in the same column, lacking such feedback from the inhibited red columns in the right region, would not be depolarized to the same extent. As the column is ambiguous due to the striatum recognizing both “good” and “bad” patterns in it, the BG slightly raises local inhibition and the column becomes more selective about which neurons will fire. With the extra depolarization from the feedback, the neurons associated with the “good” pattern will quickly win out over the “bad” pattern.
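Here is a toy version of that feedback step, just to show the mechanics. Everything in it is an assumption for illustration: the neuron names, the `projections` map from higher-region neurons to the lower-region columns they activate, the `apical_weight`, and the choice to simply add apical depolarization to the feed-forward score.

```python
def surviving_columns(labels):
    """Lower-region columns still active after BG inhibition.

    labels: dict mapping column id -> "good", "bad", or "ambiguous".
    Purely "bad" columns are assumed to be blocked outright.
    """
    return {col for col, label in labels.items() if label != "bad"}


def apical_depolarization(projections, survivors):
    """Feedback each higher-region neuron gets from the columns it activated."""
    return {n: len(targets & survivors) for n, targets in projections.items()}


def select_neurons(feedforward, depolarization, k, apical_weight=0.5):
    """Pick the k winners in an ambiguous column under raised inhibition."""
    score = {n: feedforward[n] + apical_weight * depolarization.get(n, 0)
             for n in feedforward}
    ranked = sorted(score, key=lambda n: (-score[n], n))
    return set(ranked[:k])


# Lower region: columns 0-3 are "good", 5 is "ambiguous", 7-9 are "bad".
labels = {0: "good", 1: "good", 2: "good", 3: "good",
          5: "ambiguous", 7: "bad", 8: "bad", 9: "bad"}

# One ambiguous higher-region column: g* neurons encode the "good" action,
# b* neurons encode the "bad" action, with equal feed-forward drive.
projections = {"g1": {0, 1, 2}, "g2": {1, 2, 3}, "b1": {5, 7, 8}, "b2": {8, 9}}
feedforward = {"g1": 1.0, "g2": 1.0, "b1": 1.0, "b2": 1.0}

survivors = surviving_columns(labels)
depol = apical_depolarization(projections, survivors)
winners = select_neurons(feedforward, depol, k=2)
print(depol)
print(winners)
```

With equal feed-forward drive, the only thing separating the two groups is the feedback from the columns that were not blocked, and that is enough for the “good” neurons to win once inhibition limits the column to two active cells.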

To Summarize: Biasing entire columns is going to lead to some “bad” actions getting through. A hierarchy of reinforcement learners would provide the BG with additional chances to further filter out these bad actions. When topology is taken into account, different actions suggested by the same column should lead to activity in different columns in the next region, making them easier for the BG to filter out. Adding apical feedback to this model allows these corrections to propagate backwards through the hierarchy, creating biases against the neurons that contributed to the blocked actions in the next region (the neurons that are part of the bad actions).

Clearly some experimentation is needed to test this model and these extensions. I don’t think the extensions are mandatory for the model to work, but I think they would allow faster learning and more accurate behavior.

Also, if anyone from Numenta could join the discussion, that would be quite interesting. I’m sure Numenta has some ideas and info that could help develop this model.