Thought I would post an update on where I'm at with these experiments. I've gone through several iterations, through which I have simplified the original idea, eliminated some of the biologically infeasible elements, and aligned more closely with traditional HTM concepts. I've definitely gone past the 2 months that @sunguralikaan mentioned, but I am learning as I go.
The biggest epiphany for me came from realizing that the concepts of “imagination” and “curiosity” (which were the most biologically implausible elements of my original design) can be simulated by existing functions of a spatial pooler.
Spatial poolers currently simulate inhibition by selecting a percentage of columns that best connect to the current input space, and only those columns activate. A slight modification of this function allows it to replace my earlier concept of “imagination” – selecting a percentage of columns that best connect to the most positive reinforcement input space, and only those activate. The columns in the motor layer map to the motor commands, so the winning columns drive what actions are taken.
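To make that modification concrete, here is a minimal sketch in Python (the sizes and thresholds are placeholders, not my actual code): the standard pooler scores columns against the sensory input, while the "imagination" variant scores the motor columns against the cells representing the most positive reinforcement.

```python
import numpy as np

rng = np.random.default_rng(0)

def overlap(connections, active_bits):
    """Count of each column's connected synapses that line up with active input bits."""
    return (connections & active_bits).sum(axis=1)

def top_columns(overlaps, pct_active=0.02):
    """Inhibition: only the top pct_active fraction of columns win."""
    k = max(1, int(len(overlaps) * pct_active))
    return np.argsort(overlaps)[-k:]

# Toy connectivity: 1024 columns, each connected to ~10% of 512 input bits.
connections = rng.random((1024, 512)) < 0.1

# Standard spatial pooling: overlap is scored against the current sensory input.
sensory_input = rng.random(512) < 0.05
active_columns = top_columns(overlap(connections, sensory_input))

# "Imagination" variant: overlap is scored against the cells representing the
# most positive reinforcement, so the winning motor columns correspond to the
# actions expected to lead toward reward.
positive_reinforcement_cells = rng.random(512) < 0.05
winning_motor_columns = top_columns(overlap(connections, positive_reinforcement_cells))
```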
Spatial poolers also have a function for “boosting”, which allows columns that haven't been used in a while to slowly accumulate a higher score and eventually win out over columns that have been used more frequently. This can be used to replace my earlier concept of “curiosity”. Actions the system hasn't tried in a while, such as new actions or those that previously resulted in negative reinforcement, will eventually be tried again, allowing the system to explore and re-attempt actions that could lead to new outcomes.
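For reference, this is roughly the boosting rule I have in mind, following the exponential form used in newer spatial pooler implementations (the constant is a placeholder):

```python
import numpy as np

def update_boost_factors(active_duty_cycles, boost_strength=2.0):
    """SP-style boosting: columns active less often than the mean duty cycle
    get a factor > 1, columns active more often get a factor < 1."""
    target = active_duty_cycles.mean()
    return np.exp(boost_strength * (target - active_duty_cycles))

def boosted_overlaps(overlaps, boost_factors):
    """Score used for column selection; rarely-used columns slowly catch up
    and eventually win, which is what drives the "curiosity" behavior."""
    return overlaps * boost_factors
```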
I drew up a diagram to help visualize what the current design looks like:
The sequence and feature/location layers are complementary – both use the same spatial pooler (the same columns activate in both layers), i.e. both receive proximal input from the sensors. The sequence layer receives distal input from other cells in its own layer, while the feature/location layer receives distal input from an array of cells representing an allocentric location.
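A structural sketch of that wiring (the class and method names are mine, purely illustrative, not actual NuPIC APIs): the two layers share one proximal step, but each gets a different distal source.

```python
class Region:
    def __init__(self, spatial_pooler, sequence_layer, feature_location_layer):
        self.sp = spatial_pooler
        self.sequence_layer = sequence_layer                   # distal input: its own cells
        self.feature_location_layer = feature_location_layer   # distal input: allocentric location cells

    def compute(self, sensory_input, location_cells):
        # Shared proximal step: the same columns activate in both layers.
        active_columns = self.sp.compute(sensory_input)
        self.sequence_layer.compute(
            active_columns, distal_input=self.sequence_layer.active_cells)
        self.feature_location_layer.compute(
            active_columns, distal_input=location_cells)
```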
The motor layer receives proximal input from the reinforcement layer via the modified spatial pooler, which chooses a percentage of motor columns with the highest boosted reinforcement score. This layer receives distal input from active cells in both the sequence layer and the feature/location layer. Columns represent motor commands, while cells within a column represent the sensory context.
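To make the column-to-command mapping concrete, here is a toy version (the command names and column counts are made up; the actual mapping will depend on the test application):

```python
import numpy as np

# Hypothetical command set and column layout, purely to illustrate the mapping.
MOTOR_COMMANDS = ["forward", "back", "turn_left", "turn_right"]
COLUMNS_PER_COMMAND = 256   # e.g. 1024 motor columns split evenly across commands

def columns_to_command(winning_motor_columns):
    """Pick the command whose block of columns collected the most winners."""
    votes = np.bincount(
        np.asarray(winning_motor_columns) // COLUMNS_PER_COMMAND,
        minlength=len(MOTOR_COMMANDS))
    return MOTOR_COMMANDS[int(votes.argmax())]
```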
Columns in the reinforcement layer represent how positive or negative a reinforcement is, and cells in the columns represent the sensory-motor context. In my implementation, columns toward the left represent more negative reinforcement and columns toward the right represent more positive reinforcement (with columns near the center being neutral); this layout is just to make it easier to visualize. Cells in this layer receive distal input from active cells in the motor layer. All active and predictive cells (i.e. not just active cells) in the reinforcement layer are passed as inputs through the modified spatial pooler, which chooses a percentage of the motor columns that best map to the most positive reinforcement, with boosting. Note that this is probably the most biologically infeasible element of the system, since predictive cells in reality do not transmit information (and thus would not be capable of inhibiting other cells).
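Here is a rough sketch of that reinforcement-to-motor step, simplified to column granularity (the layout, sizes, and weighting are placeholders): any reinforcement column holding active or predictive cells contributes to the input of the modified pooler, scaled by how positive that column is.

```python
import numpy as np

N_REINF_COLUMNS = 128
# Positivity ramps from most negative (leftmost column) to most positive (rightmost).
column_positivity = np.linspace(-1.0, 1.0, N_REINF_COLUMNS)

def reinforcement_signal(active_columns, predictive_columns):
    """Per-column input to the modified pooler: any column with active OR
    predictive cells contributes, scaled by how positive that column is."""
    on = np.zeros(N_REINF_COLUMNS, dtype=bool)
    on[list(active_columns)] = True
    on[list(predictive_columns)] = True   # predictive cells also transmit here
    return np.where(on, column_positivity, 0.0)

def motor_scores(motor_connections, signal):
    """Each motor column's score: the signal summed over the reinforcement
    columns it connects to, so columns wired to positive reinforcement win."""
    return motor_connections @ signal
```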
Another unique property of the reinforcement layer is that it extends predictions further and further back through time as a sensory-motor context is re-encountered. This allows the system to act on rewards/punishments that might happen several timesteps into the future; for example, a series of negative actions might be necessary to receive a big reward. This is accomplished by having active cells in the reinforcement layer, each timestep, grow distal connections not only to cells that were active in the motor layer in the previous timestep, but also a percentage of new connections to cells that were active in the timestep before that, up to some maximum that is greater than the activation threshold. This allows predictions to bubble back through time each time a particular sensory-motor context is re-encountered. I described the theory behind this in more detail on another recent thread.
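The gist of that learning step in simplified code (the counts and thresholds are placeholders, and a real segment would track permanences rather than a bare set of presynaptic cells):

```python
import random

ACTIVATION_THRESHOLD = 13
MAX_SYNAPSES_PER_SEGMENT = 20   # deliberately above the activation threshold
NEW_SYNAPSES_T1 = 10            # connections to motor cells active at t-1
NEW_SYNAPSES_T2 = 3             # the extra fraction reaching back to t-2

def grow_segment(segment, motor_active_t1, motor_active_t2):
    """segment: set of presynaptic motor cells this distal segment connects to.
    Growing toward t-2 is what lets predictions creep earlier each time the
    same sensory-motor context is re-encountered."""
    for pool, count in ((motor_active_t1, NEW_SYNAPSES_T1),
                        (motor_active_t2, NEW_SYNAPSES_T2)):
        candidates = [cell for cell in pool if cell not in segment]
        random.shuffle(candidates)
        for cell in candidates[:count]:
            if len(segment) >= MAX_SYNAPSES_PER_SEGMENT:
                return
            segment.add(cell)
```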
There is still some more tweaking to do, but it is definitely starting to come together. I am still not entirely satisfied with the relationship between the reinforcement and motor layers (in particular the transmission of predictive states). I’m playing around with a system that has another layer and utilizes apical dendrites to activate cells in predictive state from distal input. Will post more on that if I can work out the details.
The most recent change, which came from watching the HTM Chat with Jeff, is the association of the sequence and feature/location layers. Location input itself, however, is currently just an array of input cells representing an allocentric location, which the feature/location layer connects to distally. Egocentric location is still missing, as is tighter feedback between the two regions.
Next steps will be to start modifying the sensory-motor elements to align more with the dual-region two-layer circuits described by Jeff. I am also applying the recent changes to my implementation, and will post a demo app when it is ready. I have an idea for a better application for testing this than my original “robot navigating a maze”.