Proposing a Model for the Basal Ganglia and Reinforcement Learning in HTM

I thought maybe I’d add a few details about how this could be implemented with HTM.

[Diagram: bgmodel]

So in the diagram we have some cortical columns (C), and each has two striatal columns (S+ and S-). Distal connections (green arrows) connect all of these columns together. Everything shown here is from the same region of the cortex.

The blue arrows are direct output from L5 (or whatever layer you happen to use for motor output). Any feedback from the basal ganglia (purple arrows) would also only affect that layer (L5). However, the output from L5 would also be used as an output of the system anyway (the left-most branch of the blue arrows is that output). The yellow arrows are feed-forward inputs.

The red arrows are a reward signal. S+ and S- columns react differently to these. In S+, the reward signal lowers the inhibition (Inhib = Inhib - Reward), and in S- the reward signal increases the inhibition (Inhib = Inhib + Reward). Aside from this, S+ and S- would act mostly like normal columns and would use temporal memory.
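As a rough illustration of the two update rules, here is a minimal Python sketch. The function name, its signature, and the example numbers are hypothetical; only the two formulas come from the description above.

```python
def modulated_inhibition(inhibition, reward, kind):
    """Apply the reward signal to a striatal column's inhibition.

    `kind` is "S+" or "S-". The name and signature are hypothetical,
    not part of any HTM library.
    """
    if kind == "S+":
        return inhibition - reward  # Inhib = Inhib - Reward
    if kind == "S-":
        return inhibition + reward  # Inhib = Inhib + Reward
    raise ValueError(f"unknown striatal column kind: {kind!r}")


# Same starting inhibition, same reward, opposite effect.
s_plus = modulated_inhibition(10.0, 2.0, "S+")
s_minus = modulated_inhibition(10.0, 2.0, "S-")
print(s_plus, s_minus)
```

If the reward signal is allowed to be negative (an assumption, the post does not say), the same formulas flip the effect: S+ becomes more inhibited and S- less.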

The purple arrows in the diagram represent scalar values; rather than S+ and S- outputting an SDR, they each output a single scalar value. This value is just the number of neurons that fired in the most recent time step. Let’s just call this the population. The population of S- is subtracted from the population of S+, and the result is adjusted so that it does not drop below 0 (either add a positive value, or cut it off at 0). The resulting value (X) is then fed back to the associated cortical column (C) and lowers the local inhibition (Inhib = Inhib - X).
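The feedback value could be computed roughly like this. It is only a sketch: the function names and the `offset` parameter are made up, and it uses the S+ minus S- direction from the step list further down, so that well-rewarded patterns lower the inhibition of C.

```python
def feedback_signal(pop_s_plus, pop_s_minus, offset=0):
    """Scalar X sent back to C, from the two population counts.

    `offset` is a hypothetical positive shift; the result is also cut
    off at 0 so the feedback never raises the inhibition of C.
    """
    return max(0, pop_s_plus - pop_s_minus + offset)


def cortical_inhibition(inhibition, x):
    """Inhib = Inhib - X for the associated cortical column."""
    return inhibition - x


x_good = feedback_signal(12, 5)  # S+ recognizes more than S-
x_bad = feedback_signal(3, 9)    # S- dominates, so X is cut off at 0
print(x_good, x_bad, cortical_inhibition(20, x_good))
```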

Now, because it is very important that sparsity and inhibition can vary over time and location, it would likely not be a good idea to select which neurons fire with a “take the N with the highest values” method, as this always produces the same sparsity. Rather, a neuron should fire whenever its value in a given time step exceeds a threshold. This threshold is calculated by adding the local inhibition to a fixed value, so lowering the inhibition lowers the threshold and lets more neurons fire.
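To make the contrast concrete, here is a minimal Python sketch of threshold-based firing next to a top-N selection. The function names, the scores, and the `base` value are all made up for illustration. The sketch also assumes the threshold rises with inhibition (threshold = base + inhibition), since the rest of the post treats lower inhibition as more activity.

```python
def fire_by_threshold(scores, base, inhibition):
    """Indices of neurons whose score exceeds the threshold.

    Assumed sign convention: threshold = base + inhibition, so less
    inhibition means more neurons fire.
    """
    threshold = base + inhibition
    return [i for i, s in enumerate(scores) if s > threshold]


def fire_top_n(scores, n):
    """Indices of the n highest-scoring neurons (fixed sparsity)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


scores = [2, 9, 4, 7, 6, 1]
low_inhib = fire_by_threshold(scores, base=3, inhibition=0)
high_inhib = fire_by_threshold(scores, base=3, inhibition=4)
top_two = fire_top_n(scores, 2)
print(len(low_inhib), len(high_inhib), len(top_two))
```

With the threshold rule the number of active neurons changes with the inhibition, whereas the top-N rule always returns exactly N neurons no matter how strong or weak the input is.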

Overall, this is how the system should work:

  • A column C receives a feed-forward input (yellow arrow). All green arrows represent inputs to distal dendrites. Spatial and temporal memory occur.
  • The output of each column C is sent to three places: an S+ column, an S- column, and the next region in the hierarchy / motor output.
  • S+ and S- take the output of C as feed-forward input, and perform spatial and temporal memory.
  • A reward signal is given to all S+ and S- columns. The signal lowers inhibition in S+ columns and increases it in S- columns.
  • S+ and S-, as they have a threshold-based firing method, contain more active neurons when they recognize more patterns in the output of C. Due to the reward-driven inhibition, S+ more frequently learns “good” patterns and S- more frequently learns “bad” patterns.
  • The number of neurons firing in S- is subtracted from the number firing in S+. The result (X) is sent back to C and is subtracted from the default inhibition to give the new inhibition (Inhib = Default - X).
  • This causes C to become more active when it contains patterns associated with positive reinforcement, and less when it contains patterns associated with negative reinforcement.
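The steps above can be strung together in a toy loop like the one below. It is a sketch under heavy assumptions: spatial and temporal memory are not modeled at all (the per-neuron scores are simply passed in), learning is left out, the threshold is taken to be `BASE_THRESHOLD + inhibition`, and every name and number is hypothetical.

```python
BASE_THRESHOLD = 5.0  # hypothetical fixed value


def fire(scores, inhibition):
    """Threshold-based firing; assumes threshold = base + inhibition."""
    threshold = BASE_THRESHOLD + inhibition
    return [i for i, s in enumerate(scores) if s > threshold]


def bg_step(c_scores, s_plus_scores, s_minus_scores, reward,
            default_inhib, x_prev):
    """One time step for a column C and its S+ / S- pair.

    The score lists stand in for whatever spatial and temporal memory
    would compute. Returns the active neurons of C and the new X.
    """
    # C fires with its inhibition lowered by the previous feedback.
    c_active = fire(c_scores, default_inhib - x_prev)
    # The reward shifts striatal inhibition in opposite directions.
    s_plus_active = fire(s_plus_scores, default_inhib - reward)
    s_minus_active = fire(s_minus_scores, default_inhib + reward)
    # Population difference, cut off at 0, is fed back to C.
    x = max(0, len(s_plus_active) - len(s_minus_active))
    return c_active, x


c_scores = [6, 7, 9, 10]
striatal_scores = [7, 8, 9, 10]
c1, x1 = bg_step(c_scores, striatal_scores, striatal_scores,
                 reward=2, default_inhib=3, x_prev=0)
c2, x2 = bg_step(c_scores, striatal_scores, striatal_scores,
                 reward=2, default_inhib=3, x_prev=x1)
print(len(c1), x1, len(c2))
```

In this toy run the rewarded S+ column fires more than S-, so X is positive and C has more active neurons on the second step than on the first.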

A few things to note:

  • Inhibition calculations may have to be tweaked a bit to prevent them from ever getting too low. You don’t want every neuron in an entire column becoming active because the threshold dropped to 0.
  • The output of a column can’t be controlled on the level of individual neurons unless each neuron is given its own S+ and S- column. This is fine if the output isn’t motor output (i.e., a non-motor region that still interacts with the BG, such as the PFC or anything else not at the top of the frontal hierarchy). However, this does create some issues for direct motor output, as you can’t easily control individual neurons.
  • The standard form of motor output for this model will be a scalar value (population count) for each column. However, if many outputs are needed, these columns could be scaled down fairly small so long as they each still have their own S+ and S- columns.
  • The precise SDR of a region could still be controlled if the reward is given based on how closely the output SDR of a column matches the expected SDR. However, a population-coding-based approach will probably be much faster.
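Two of these notes are easy to sketch in Python: a floor on the inhibition, and a reward based on how well an output SDR matches an expected one. Both functions are hypothetical. In particular, measuring the match as the fraction of expected active bits that actually fired is just one possible choice, and it ignores extra active bits.

```python
def clamp_inhibition(inhibition, floor):
    """Keep the inhibition from ever dropping below `floor`."""
    return max(floor, inhibition)


def sdr_match_reward(output_bits, expected_bits):
    """Reward in [0, 1]: fraction of expected active bits that fired.

    SDRs are given as collections of active bit indices.
    """
    expected = set(expected_bits)
    if not expected:
        return 0.0
    return len(expected & set(output_bits)) / len(expected)


floored = clamp_inhibition(-3.0, floor=1.0)
reward = sdr_match_reward([1, 4, 7, 9], [1, 4, 8, 9])
print(floored, reward)
```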