Exploring Reinforcement Learning in HTM

I have been working on a test to implement reinforcement learning using HTM concepts. I am still debugging some problems (with the implementation, not necessarily with the theory itself) so I am not ready to release the code just yet. But I thought I would discuss the approach I am using, and maybe get some feedback and ideas. There are a few aspects which are not biologically feasible, but I am hoping some useful information might be distilled from this exercise.

When I refer to Layer X “projecting to” Layer Y below, what I mean is that cells in Layer Y grow distal connections to cells in Layer X, rather than to cells within the same layer. I may be using this terminology wrong, so please correct me if this causes confusion.

At a high level, the basic idea is to have three high-order sequence memory layers (multiple cells per column). The first layer learns patterns and context from sensory input, including positional information (see some of my other theories on the forum here for my thoughts on that). This layer projects to the second layer. The second layer receives input from motor commands, and projects to the third layer. The third layer receives reinforcement input (reward/punishment).

Each layer’s specific inputs pass through its own spatial pooler to select active columns. In other words, different columns are active in each layer at any given time. In the second layer, the columns represent the combination of all motor commands at a given point in time. In the third layer, rewards and punishments are simply encoded as scalars.

The columns and cells in the three layers are connected like so:

With this setup, the first layer can make inferences about what sensory information will come next based on the current context. Columns represent the input, and cells within the columns represent the context.

The second layer makes inferences about what motor commands will come next based on the current context from the first layer. Columns in this layer represent the motor commands, and cells within the columns represent the sensory context.

The third layer makes inferences about rewards or punishments that will come next based on the current context from the second layer. Columns represent the reinforcement, and cells within the columns represent the sensory-motor context. The columns in this layer are divided into three groups: reward, punishment, and novelty (I’ll explain novelty in a bit).

The reason for distinguishing reward cells from punishment cells is to allow for both positive and negative reinforcement (versus positive reinforcement only). When evaluating a particular action, the system must be able to predict both the positive and the negative future outcomes (otherwise the union of predicted outcomes would always be a positive representation, and the system could never learn to avoid negative experiences).

How good or bad a particular set of motor commands is in a given context depends not just on the immediate reward/punishment in that state, but also on the predicted rewards/punishments of possible next actions. This allows the system to take actions which might have an immediate negative result in order to achieve a future positive result.

Active cells depict the immediate reward or punishment, and predictive cells depict future rewards and punishments from a given context. The density of predictive cells indicates the level of rewards and punishments. This is roughly equivalent to Backward View TD(λ) (not exactly, but the concept is similar). For more info about Backward View TD(λ) from a mathematical perspective, see https://youtu.be/PnHCvfgC_ZA starting at 1:30:26 (and subsequent videos in the lecture go into it in further depth).
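For readers who have not met it before, here is a minimal sketch of backward-view TD(λ) in its standard tabular form with accumulating eligibility traces. This is the textbook algorithm, not the HTM implementation discussed in this thread; the state names and the `alpha`, `gamma` and `lam` values are illustrative assumptions.

```python
from collections import defaultdict


def td_lambda_episode(transitions, values, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of tabular backward-view TD(lambda).

    transitions: list of (state, reward, next_state); next_state is None at the end.
    values: dict mapping state -> estimated value, updated in place.
    """
    traces = defaultdict(float)  # eligibility trace per visited state
    for state, reward, next_state in transitions:
        next_value = 0.0 if next_state is None else values.get(next_state, 0.0)
        td_error = reward + gamma * next_value - values.get(state, 0.0)
        traces[state] += 1.0  # accumulating trace for the state just visited
        for s in list(traces):
            # Every recently visited state shares in the error, scaled by its trace.
            values[s] = values.get(s, 0.0) + alpha * td_error * traces[s]
            traces[s] *= gamma * lam  # traces fade with each timestep
    return values


values = {}
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, None)]
td_lambda_episode(episode, values)
print(values)
```

After a single rewarded episode, state C gets the largest value and A the smallest, which is the "credit fades with distance from the reward" behavior that the density of predictive cells is meant to approximate.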

Besides rewards and punishments, I have also introduced the concept of “novelty”. These columns represent the level of unknown outcomes a particular action might lead to (i.e. future actions down a particular path that the system has not yet tried). The purpose of this is to allow the system to explore actions it hasn’t tried yet, versus always only ever going with the very first positive action it has done in a particular context.

The system will have a curiosity level that grows over time, and is reduced any time it does something novel. The more novel a path is, the more the system’s curiosity is satisfied. A combination of novelty score and curiosity level can eventually outweigh punishments that the system has encountered in the past, and cause it to try a particular action again in order to explore subsequent actions down that negative path that it hasn’t tried yet (and which could lead to rewards).

A breakdown of the process:

  1. SP process to select active columns in the first layer
  2. TM process (Activate, Predict, Learn) for cells in the first layer
  3. Increment curiosity level
  4. Upon motor commands, SP process to select active columns in the second layer
  5. TM process for cells in the second layer (distal connections with cells in the first layer)
  6. Imagine possible actions (simulate activating input combinations without learning and compare predicted reinforcement)
  7. Novelty weight increases with curiosity level
  8. Choose action with highest score (novelty * curiosity + reward - punishment)
  9. Upon reinforcement, SP process to select active columns in the third layer
  10. TM process for cells in the third layer (distal connections with cells in the second layer)
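Steps 6 to 8 above can be sketched roughly as follows. This is only an illustration of the scoring formula (novelty * curiosity + reward - punishment); the `Prediction` class, the action names and the cell counts are hypothetical stand-ins for the counts of predictive cells that the third layer would produce.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical counts of predictive cells per column group in layer three."""
    reward: int
    punishment: int
    novelty: int


def score(pred, curiosity):
    # Score from step 8: novelty * curiosity + reward - punishment
    return pred.novelty * curiosity + pred.reward - pred.punishment


def choose_action(predictions, curiosity):
    """Pick the imagined action with the highest score."""
    return max(predictions, key=lambda action: score(predictions[action], curiosity))


predictions = {
    "left": Prediction(reward=5, punishment=1, novelty=0),    # well-known, mildly good
    "right": Prediction(reward=0, punishment=3, novelty=10),  # punished once, mostly unexplored
}

early = choose_action(predictions, curiosity=0.1)  # low curiosity: exploit
late = choose_action(predictions, curiosity=1.0)   # curiosity has built up: explore
print(early, late)
```

With these made-up numbers the known action wins while curiosity is low, and the previously punished but novel action wins once curiosity has grown enough.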

The system could be used in an online learning mode, where it initially chooses novel actions and receives reinforcement, then uses that to take better actions when it re-encounters semantically similar contexts.

The system could also be trained in a supervised mode, where it doesn’t take any actions of its own, but learns by observing (for example a human user). A combination of the two could also be used (first being trained by a human user, then enabling the ability to take actions on its own).


Well, if the reward is to be predicted by a set of “experts”, then those experts can be prior states/actions. If state/action x happened recently then you could say x=true and make a prediction of a reward based on that. Of course you end up with multiple experts like state/action/when=true or false. Then you can use Multiplicative Weight Updates (MWU) to find a near-optimal prediction by assigning weights to each expert. Unfortunately that is a bit naive in its simplicity, as the world is very complicated.
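As a rough sketch of the MWU idea, assuming each expert emits a reward prediction in [0, 1] and the loss is the absolute error (the function names and the learning rate `eta` are assumptions for illustration):

```python
def mwu_predict(weights, expert_predictions):
    """Weighted average of the experts' reward predictions."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, expert_predictions)) / total


def mwu_update(weights, expert_predictions, outcome, eta=0.5):
    """Shrink each expert's weight in proportion to its loss.

    Assumes predictions and outcome lie in [0, 1], so the loss is in [0, 1].
    """
    return [w * (1.0 - eta * abs(p - outcome))
            for w, p in zip(weights, expert_predictions)]


weights = [1.0, 1.0]
# Expert 0 ("x happened recently") is always right, expert 1 is always wrong.
for _ in range(5):
    weights = mwu_update(weights, [1.0, 0.0], outcome=1.0)

print(weights, mwu_predict(weights, [1.0, 0.0]))
```

The weight of the consistently wrong expert halves each round here, so the combined prediction quickly follows the reliable one.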

From a somewhat higher level, the basic strategy is to simply use high-order sequence memory for each component in the system. First there is a context for the current sensory input (sensing feature D in “A-B-C-D” is different from feature D in “B-C-F-D”). Then there is a context for the current motor commands (command 1 in “A-B-C-D-1” is different from 1 in “B-C-F-D-1”). And finally there is a context for reinforcement (reward x in “A-B-C-D-1-x” is different from x in “B-C-F-D-1-x”). By simply remembering the reinforcements which follow each motor command in each sensory context encountered, a union of reinforcement cells is predicted each time a semantically similar state is re-encountered and different motor commands are tried. Over time the system should in theory approach an ideal policy for behaving in its environment.

Sounds good. And more likely to find complex triggers for a reward.

Have you designed the part involving selecting an action based on these evaluation cells and executing it? In my opinion that is the hardest part if you want to do it in a plausible way using the HTM structures.

Yes, this step is written and seems to work. The strategy I am using is to activate each of the possible motor combinations and score them based on counts of predictive cells in the third layer representing reward, punishment, and novelty (weighted by curiosity level). This strategy will obviously not scale to a large number of motor commands. In my test, there are only four directional “thrusters” and three “speeds”, so only 48 motor combinations to “imagine”.

Agreed… not only will my strategy not scale well to a large number of motor commands, but it is definitely not biologically plausible. An equivalent to this “imagine” phase is still necessary, but I will need to explore more plausible ways to do it. This is one area I intend to focus on after I get the initial implementation functioning.

The main area that I am having trouble with at the moment is the propagation of reinforcement predictions further back in time. This is necessary to be able to look further than one step into the future (i.e. a future big reward might require taking several actions that themselves incur small punishments in the short term). Somehow, the predictive cells in the third layer need to gradually over time approach a representation of possible future reinforcements over many time steps, not just one step into the future.

What I am working on to solve this problem is a system that walks back through the activation history and forms distal connections with cells in the second layer that were active multiple timesteps into the past. The percentage of connections created decreases the further back in time, until it reaches a minimum threshold and stops. This burndown rate is configurable. This behavior resembles the strategy described in Backward View TD(λ). I’m having some problems with the implementation though (don’t have it working yet).
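To make the burndown idea concrete, here is a minimal sketch. It is not my actual implementation (which isn't working yet); the function names, the `max_new` cap and the seeded random sampling are assumptions for illustration.

```python
import random


def connection_fractions(burndown=0.5, min_fraction=0.1):
    """Fraction of new connections to form for 1, 2, 3, ... timesteps back."""
    if not 0.0 < burndown < 1.0:
        raise ValueError("burndown must be between 0 and 1 so the walk terminates")
    fractions = []
    fraction = 1.0
    while fraction >= min_fraction:  # stop once below the minimum threshold
        fractions.append(fraction)
        fraction *= burndown
    return fractions


def plan_new_synapses(history, max_new=20, burndown=0.5, min_fraction=0.1, seed=0):
    """Choose presynaptic cells per past timestep.

    history: list of sets of second-layer cell ids, most recent timestep first.
    """
    rng = random.Random(seed)
    plan = []
    for cells, fraction in zip(history, connection_fractions(burndown, min_fraction)):
        count = min(len(cells), int(max_new * fraction))
        plan.append(rng.sample(sorted(cells), count))
    return plan


# Six timesteps of history, 30 active cells each (made-up ids).
history = [set(range(t * 100, t * 100 + 30)) for t in range(6)]
plan = plan_new_synapses(history)
print([len(cells) for cells in plan])  # [20, 10, 5, 2]
```

With a burndown of 0.5 and a minimum of 0.1, only the four most recent timesteps receive new connections, and each older step receives half as many as the one after it.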

Another possible option might be to mimic the pseudocode from older revisions of the temporal memory whitepaper which described a similar behavior to what I am trying to do (extending predictions further and further back in time).

I may be able to speed you up a couple of months.

This was the first option I explored too. It was exactly as you described and I also used the term of imagining.:blush:

This limits you on multiple levels. First off, you don’t have the ability to produce unknown behaviors. It also implies that an intelligence with no previous knowledge knows what activities to try. Secondly, every iteration involves multiple isolated stages of stimulating the same region with different inputs. That seemed unlikely at the time and still seems unlikely. I tried it with 27 motor combinations back then, and as time passed I spent so much time patching things and working around this very handcrafted mechanism that it seemed futile after a point. Hopefully you’ll do better.

Here is a valuable conclusion from many months of futile attempts. I would advise to find a way to stimulate that region with the union of all the motor actions. That is the best theory I have about plausibility and scale that seems to work in practice. You then have to justify whether the chosen action leads to a learned state on that current activation by cross validating with the union of the predictive cells from the current state and the depolarized cells by the motor actions. This is what I believe Neocortex->Basal Ganglia->Thalamus->Neocortex does [1].

This is the holy grail of obtaining any sort of sequential behavior. First off, I think you might be on the right track with the backward view of TD Lambda, as we are on the same page on that one. We both might be wrong though.

Activation history is simply not plausible. The problem is that even the reinforcement-learning-related code in the NuPIC union pooler seems to work around this by using activation history, from what I’ve seen in their code. I might be wrong though. You can know the union of recent activities, and maybe how much time has passed since each cell was activated (the trace part of TD Lambda), by a decay mechanism. Working with discrete activation history seems flat out wrong to me.
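A decay mechanism of the kind I mean could look something like this. It is only a sketch, with made-up cell ids and assumed `decay` and `floor` parameters: each cell keeps a single trace value instead of the system storing a list of past activations.

```python
def update_traces(traces, active_cells, decay=0.8, floor=0.01):
    """Decay every existing trace, drop the negligible ones, refresh active cells.

    traces: dict mapping cell id -> trace value in (0, 1].
    """
    updated = {}
    for cell, value in traces.items():
        value *= decay
        if value >= floor:  # forget cells whose trace has faded out
            updated[cell] = value
    for cell in active_cells:
        updated[cell] = 1.0  # a cell active right now has a full trace
    return updated


traces = {}
for active in [{1, 2}, {3}, {2}]:
    traces = update_traces(traces, active)

print(traces)
```

The trace value tells you roughly how long ago a cell was last active, so the union of recent activity and its recency are both available without any discrete history.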

I hope these will save you a good amount of time or come up with a better approach that can enlighten all of us.

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4771745/
Wonderful paper by the way.

Edit: Actually there is a biologically plausible way to know the activation history: hippocampal replay. I tried implementing such a thing and it worked as intended, but it made the whole thing more complex. So I concluded that I was trying to patch up my reward circuitry by introducing another feature. The main idea is that the hippocampus stores pointers to the activations of the neocortex. Research suggests that the neocortex replays the events that happened, in their original order or some other order, while we are awake and asleep, in conjunction with the hippocampus. Multiple works suggest that this process creates a Tolman map of states in the mouse brain, even discovering new paths between states without experiencing them. I cannot dig up the exact references at the moment though. Despite this, I believe that this is a workaround for a deficiency of the reward circuitry for simple tasks. That is why I said it is not plausible, not without the hippocampus.

I addressed this particular problem with the “novelty” concept. It is true that you must be aware ahead of time of all possible combinations of motor commands, though (i.e. you couldn’t plug this into any arbitrary system with possible motor commands that are initially unknown).

Definitely this is a core problem with the strategy.

Agree with the “unlikely” observation, and why I have called out this phase of the process as the most biologically infeasible.

[quote=“sunguralikaan, post:7, topic:1545”]
I may be able to speed you up a couple of months.[/quote]

Despite the obvious limitations, at the moment the “imagining” strategy appears to work for the limited test case that I am writing it for. My development style often involves hooking up components that I know ahead of time to be deficient, and go back to replace them with better solutions. A lot can be learned from building out a whole system in a limited or implausible use case – it gives you insights into other peripheral problems that you weren’t aware of from the outset, and solidifies a higher-level understanding of the overall system. In other words, those couple of months are an important part of the process :slight_smile:

I agree that this seems to be a promising strategy. Where I’ll probably go next, when I get to redesigning this phase of the process, is exploring @Bitking's “boss and advisor” idea. The main deficiency I can see with my initial thoughts is the ability of the system to try novel actions, or retry actions that it previously learned to be negative. This will probably involve incorporating the “novelty” concept.

Thought I would post an update on where I am at with these experiments. I’ve gone through several iterations, through which I have simplified the original idea, eliminated some of the biologically infeasible elements, and aligned more with traditional HTM concepts. Definitely gone past the 2 months that @sunguralikaan mentioned, but I am definitely learning as I go :slight_smile:

The biggest epiphany for me came from realizing that the concepts of “imagination” and “curiosity” (which were the most biologically implausible elements of my original design) can be simulated by existing functions of a spatial pooler.

Spatial poolers currently simulate inhibition by selecting a percentage of columns that best connect to the current input space, and only those columns activate. A slight modification of this function allows it to replace my earlier concept of “imagination” – selecting a percentage of columns that best connect to the most positive reinforcement input space, and only those activate. The columns in the motor layer map to the motor commands, so the winning columns drive what actions are taken.

Spatial poolers also have a function for “boosting”, which allows columns that haven’t been used in a while to slowly accumulate a higher score, and eventually win out over other columns that have been used more frequently. This can be used to replace my earlier concept of “curiosity”. Actions the system hasn’t tried in a while, such as new actions or those which previously resulted in a negative reinforcement, will eventually be tried again, allowing the system to explore and re-attempt actions that could lead to new outcomes.
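A toy version of these two modifications, to show how they interact. This is a sketch, not the real spatial pooler: the per-column `overlaps` stand in for how well each motor column connects to the most positive reinforcement input, and the additive `step` boosting rule is an assumption chosen for simplicity.

```python
def select_columns(overlaps, boosts, num_active):
    """Pick the columns with the highest boosted reinforcement overlap."""
    scores = [o * b for o, b in zip(overlaps, boosts)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_active])


def update_boosts(boosts, winners, step=0.5):
    """Reset the boost of winning columns; let every other column's boost grow."""
    winner_set = set(winners)
    return [1.0 if i in winner_set else b + step for i, b in enumerate(boosts)]


overlaps = [4.0, 3.0, 1.0, 1.0]  # column 0 maps best to positive reinforcement
boosts = [1.0] * len(overlaps)
history = []
for _ in range(4):
    winners = select_columns(overlaps, boosts, num_active=1)
    boosts = update_boosts(boosts, winners)
    history.append(winners)

print(history)  # [[0], [1], [0], [1]]
```

Column 0 has the best raw score, but column 1 wins on the second step because its boost has grown while it sat unused, which is the "curiosity" behavior described above.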

I drew up a diagram to help visualize what the current design looks like:

The sequence and feature/location layers are complementary, both using the same spatial pooler (the same columns activate for both layers), i.e. both receiving proximal input from the sensors. The sequence layer receives distal input from other cells in its own layer, while the feature/location layer receives distal input from an array of cells representing an allocentric location.

The motor layer receives proximal input from the reinforcement layer, via the modified spatial pooler which chooses a percentage of motor columns which have the highest reinforcement score with boosting. This layer receives distal input from active cells in both the sequence layer and the feature/location layer. Columns represent motor commands, while cells in the column represent the sensory context.

Columns in the reinforcement layer represent how positive or negative a reinforcement is. In my implementation, I am using columns to the left to represent more negative reinforcement, while columns to the right represent more positive reinforcement (with columns near the center being neutral). This is just to make it easier to visualize. Columns represent positivity/negativity, and cells in the columns represent sensory-motor context. Cells in this layer receive distal input from active cells in the motor layer. All active and predictive cells (i.e. not just active cells) in the reinforcement layer are passed as inputs through the modified spatial pooler, which chooses a percentage of the motor columns which best map to the most positive reinforcement, with boosting. Note that this is probably the most biologically infeasible element of the system, since predictive cells in reality do not transmit information (and thus would not be capable of inhibiting other cells).

Another unique property of the reinforcement layer is that it extends predictions further and further back through time as a sensory-motor context is re-encountered. This allows the system to act on rewards/punishments that might happen several timesteps into the future. For example, a series of negative actions might be necessary to receive a big reward. To accomplish this, at each timestep active cells in the reinforcement layer grow distal connections not only to cells that were active in the motor layer in the previous timestep, but also a percentage of new connections to cells that were active in the timestep before that, up to some maximum that is greater than the activation threshold. This allows predictions to bubble back through time each time a particular sensory-motor context is re-encountered. I described the theory behind this in more detail on another recent thread.

There is still some more tweaking to do, but it is definitely starting to come together. I am still not entirely satisfied with the relationship between the reinforcement and motor layers (in particular the transmission of predictive states). I’m playing around with a system that has another layer and utilizes apical dendrites to activate cells in predictive state from distal input. Will post more on that if I can work out the details.

The most recent change which I got from watching the HTM Chat with Jeff is the association of the sequence and feature/location layers. Location input itself, however, is currently just an array of input cells representing an allocentric location, which the feature/location layer connects to distally. Egocentric location is still missing, as well as tighter feedback between the two regions.

Next steps will be to start modifying the sensory-motor elements to align more with the dual-region two-layer circuits described by Jeff. I am also applying the recent changes to my implementation, and will post a demo app when it is ready. I have an idea for a better application for testing this than my original “robot navigating a maze”.


I thought of a solution for the biologically infeasible problem of transmitting predictive states, by using a two-layer circuit to pool reinforcement input. This tweak also eliminates the need to extend reinforcement predictions backwards through time (handled now by a function of the temporal pooler), allowing the implementation to align even more closely with traditional HTM concepts.

Updated diagram for reference:

So how is this progressing? Are you able to test your theories on some test bed? I’ve been away for a couple of weeks with my head under the sand attacking similar problems.

This is what I am going with at the moment. You have the distal depolarization caused by the sensory context and the apical depolarization caused by the reward circuitry (reinforcement layer). There is a main layer in my architecture that takes apical input from the reinforcement layer and distal input from the sensory layer. The reinforcement layer is fed by this main layer (both distal and proximal dendrites), and TD lambda values are computed on the resulting neurons of the reinforcement layer. So the activation on the reinforcement layer represents the “value” of the activation in the main layer, using its context too. Unlike normal temporal memory, the apical dendrites between these two are modified with the error signal of the TD (strengthen/create if the error is positive and weaken/destroy if it is negative). The reinforcement layer depolarizes the main layer based on reward, and the sensory layer depolarizes the main layer based on sensory information for the next step. If both depolarizations overlap, the neurons that are connected to some motor outputs are activated. Additionally, I have two reinforcement layers mimicking the Go and No-Go circuits in the basal ganglia (striatum D1 and D2). Their synapse adaptation rules are the exact opposite of each other. So the resulting activation on the main layer avoids and activates motor commands simultaneously. The main idea was to have the capability to also avoid things. The experiments I conducted lead me to believe that without some sort of avoidance you cannot reduce the search space to try for better motor commands among all the unnecessary ones. As a result the agent gets stuck in behavior loops, trying to get rid of learnt unnecessary (sometimes cyclic) actions and rebuilding them some time later.
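A very reduced sketch of the synapse adaptation rule described above, assuming a one-step TD error and a single scalar permanence per apical synapse. The function names, `rate`, `gamma` and the clamping to [0, 1] are assumptions for illustration and not my actual code.

```python
def td_error(reward, value_now, value_next, gamma=0.9):
    """One-step temporal-difference error."""
    return reward + gamma * value_next - value_now


def adapt_permanence(permanence, error, rate=0.1, invert=False):
    """Strengthen an apical synapse on positive error, weaken it on negative error.

    invert=True applies the exact opposite rule, as in the "No-Go" layer.
    """
    signed_error = -error if invert else error
    return min(1.0, max(0.0, permanence + rate * signed_error))


# A better-than-expected outcome: reward of 1.0 where no value was predicted.
error = td_error(reward=1.0, value_now=0.0, value_next=0.0)
go = adapt_permanence(0.5, error)                  # Go layer strengthens
no_go = adapt_permanence(0.5, error, invert=True)  # No-Go layer weakens
print(error, go, no_go)
```

The same error signal pushes the two layers in opposite directions, so one learns what to approach and the other what to avoid.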

A potential pitfall with a “chosen” activation is trying to make it meaningful. A chosen activation would have columns from a lot of different “real” activations. As a result, you either have to converge this “imaginational” activation to the closest “real” activation or somehow resolve the conflicts on this chosen activation because there will be contradicting/implausible groups of columns on the same chosen activation.

The solution I went with is to use the reward signal (error in TD lambda) in a different way. Rather than using state values to boost columns, the error signal can be used to create and modify synapses between motor/sensory layer and reinforcement layer. It becomes more complex but gets rid of the problems introduced with chosen activations.

By the way I could not really grasp the exact architecture of yours from the diagrams. For example does the reinforcement layer’s sole input come from motor layer? If so what is the reasoning?
Thanks for sharing your progress.

Yes, it is coming along well. I’ve taken a slight detour to tighten up my forward-looking temporal pooling implementation (about to finish that off today – I’ve gotten it pretty stable at this point). I should have a proof of concept done soon (will probably be additional issues to deal with once everything is hooked up).

I’ll draw up some diagrams to explain my proposed solution to this better when I get it working. At a high level, what I am working with now is to make it so adjacent columns in the motor layer inhibit each other, and are hooked up to motor commands as actions and non-actions. One will always win out over another (with the scores being input by the SP).

What this means is that there will always be an explicit representation of each potential command. For example say the possible motor commands are Up, Down, Left, Right, A, B, and any combination of them. The representation in the motor layer will always represent an exact state of each of those commands. For example, “NOT Up, NOT Down, NOT Left, Right, A, NOT B” might be the command to make Mario jump to the right. The columns for “Not Up” and “Up” can never both be chosen simultaneously.
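As a small illustration of the paired columns, here is a sketch in which each command has an "action" and a "non-action" score and exactly one of the two wins. The score values are made up and would come from the SP in the real system, and ties going to the non-action is an arbitrary assumption.

```python
COMMANDS = ["Up", "Down", "Left", "Right", "A", "B"]


def resolve_commands(action_scores, non_action_scores):
    """For each command, let the action and non-action columns inhibit each other."""
    return {name: action_scores[name] > non_action_scores[name] for name in COMMANDS}


def describe(state):
    """Render the explicit state of every command."""
    return ", ".join(name if on else f"NOT {name}" for name, on in state.items())


# Made-up scores: "Right" and "A" beat their non-action counterparts.
action_scores = {"Up": 1, "Down": 1, "Left": 1, "Right": 5, "A": 4, "B": 1}
non_action_scores = {name: 2 for name in COMMANDS}

state = resolve_commands(action_scores, non_action_scores)
print(describe(state))  # NOT Up, NOT Down, NOT Left, Right, A, NOT B
```

Because each command resolves to exactly one of its two columns, "Up" and "NOT Up" can never be chosen together.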

It is still a theory, and may take some additional modifications to the SP once I get into the weeds, but that’s where the theory is at currently.


Hi Paul,

Understand this has almost been a year. Where are you on the RL journey?

Cheers, J

I wasn’t able to get the ideas I posted here on this thread to work well enough for what I want, and shifted my focus to a few other HTM-related projects I’m working on. I have learned a lot in other areas since I looked at this last, so I think the next step will be to go back to the drawing board and rethink the system as a whole. This will allow me to incorporate some of the newer concepts like grid cells, pooling sense + orientation to represent features, and distinguishing the functions of separator vs integrator vs attractor.

BTW, this might sound like a deviation from RL, but the more I have looked at the problem, the more it has become clear that this is an integral part of SMI, and large parts of that whole system need to be understood better for RL to be useful.


Thanks, Paul. It’s great to see the iterative approach the community is taking. The more we refine the fundamentals, the more solid the foundation will be.

There is so much to learn.


I have to say that I disagree with the idea of having both a positive and negative representation/signal. Reward and punishment are just two sides of the same coin. They are opposite ends of a single spectrum.

There’s no need to have an explicit signal for indicating things that should be avoided, you just need to reward the evasion of punishment/pain so that in the presence of a potentially painful/punishing stimulus there’s a natural aversion that is actually the pursuit of everything but the observed threat. It’s not that it’s specifically evading the threat, it’s just pursuing everything that isn’t the threat, which appears as evasion.

There is a baseline activity in the dopamine projections from the substantia nigra that’s pretty constant in the absence of any unpredicted reward stimulus (which I personally believe is simply an implicit reward from hierarchical novelty, resulting in natural pursuit and engagement in playful/explorative/fidgeting/autostimulating behavior), and this activity ceases in the face of pain/punishment exactly the same as when they stop signalling when there is an expected/predicted reward that never arrives (e.g. the ‘hurt’ of having your hopes up and being let down). This induces a sort of re-training that is accompanied by shifts in overall activity until a ‘normal’ level of dopaminergic activity is restored - causing an automatic avoidance of unpleasantness that is the exact same mechanism as the pursuit of reward.

Ditch the positive/negative signal dichotomy! It only adds unnecessary complexity. There is still the suppressive effect of local inhibition, for producing sparse representations, and there is a suppressive dopamine effect from a second set of dopaminergic projections from the SN to the striatum, but inhibiting activity is not really a ‘signal’ that is then reflected by the inhibited neurons. An inhibited neuron is a non-signalling neuron, so there’s no biological basis in having a “negative” signal. These inhibiting dopamine projections must be some kind of attentive focus in the pursuit of reward, filtering out less promising potential volition signals. There’s really only the prediction of reward and the learning of responses/actions which facilitate the approach of reward, where the ‘punishment’ of pain/suffering is simply a state which when transitioned out of produces the reward of relief.

EDIT: Case in point https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3997524/

Alert mechanisms, fight-or-flight activations, the amygdala fear response. These are all things that seem to be clustered around negative perceptions.

As far as favoring activities that are not unpleasant in sort of a random exploration: I have to wonder how many times a deer flees at top speed when it does not have a predator nipping at its heels.

Whether a punishment is simply “absence of reward” or a different class of input, it really doesn’t affect the overall complexity IMO. The system still needs to remember negative outcomes in order to avoid them in the future, and this requires “labeling” them as such.

My current thoughts on this are that emotions are the “classes of input” which underpin this “labeling” system, and as such, there are more than two of them. There is a wide spectrum of emotional flavoring that can be added to a given context/choice.


To my knowledge, pyramidal neurons are never completely silent without some explicit inhibition; they have a base firing rate. A neuron that is completely silent affects the computation (carries information, in other words) because this results in different behavior in its postsynaptic targets. The orientation-sensitive cells of V1 are an example of this.

In addition, a negative signal doesn’t have to be straightforward inhibition. You could check the direct and indirect pathways of the basal ganglia on this. There is also the research on the D1/D2 dopamine receptor cells of the striatum, which are theorized to capture positive/negative surprises. The majority of computational models of the basal ganglia in computational neuroscience seem to agree with it [1]. So I would say there is a somewhat strong biological basis.

At any point you may have 1000 actions to take. If one of them is to be avoided you would have to reinforce 999 of the others to achieve this. That seems like a waste even if it’s the same functionality.
