Reinforcement Learning and HTM Algorithm

Respected Seniors,

I am currently a master's student doing research in deep learning, mostly in reinforcement learning. A few days ago I started reading papers on HTM, and after studying its concepts I would like, if possible, to equip my reinforcement learning agent with the HTM algorithm, so that the agent does not need so many trials before performing an action. I would therefore like to know whether this is feasible and how I could design a new algorithm that incorporates HTM. I would be very grateful to anyone who can assist me.



I think HTM could be an excellent model to use for RL systems, but I see a big problem. The problem is that all the RL systems today are still set up as spatial processing problems. An HTM model needs to change at every time step. It must be an online learning model. If you can find an RL framework that allows models to be updated at every time step, I would start there. You could use the model’s anomaly scores as part of the training signal.
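
One possible reading of "anomaly scores as part of the training signal" is a novelty bonus added to the environment reward at each time step. The sketch below is only an illustration of that idea: the function name `shaped_reward`, the `bonus_weight` parameter, and the assumption that the anomaly score lies in [0, 1] are mine, not part of any HTM or RL library.

```python
def shaped_reward(env_reward, anomaly_score, bonus_weight=0.1):
    """Combine the environment reward with a novelty bonus.

    anomaly_score is ASSUMED to lie in [0, 1], as an online sequence
    model might report it at each time step.
    """
    if not 0.0 <= anomaly_score <= 1.0:
        raise ValueError("anomaly_score must be in [0, 1]")
    return env_reward + bonus_weight * anomaly_score


# Hypothetical per-step loop: the anomaly scores are made-up numbers
# standing in for whatever the online model would report.
steps = [(0.0, 0.9), (0.0, 0.2), (1.0, 0.05)]  # (env_reward, anomaly_score)
for env_reward, anomaly in steps:
    r = shaped_reward(env_reward, anomaly)
    print(f"env={env_reward:.1f} anomaly={anomaly:.2f} shaped={r:.3f}")
```

Whether surprise should be rewarded (to encourage exploration) or penalized (to favor predictable states) depends on the task, so the sign of `bonus_weight` is a design choice.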


Hey @Tresor welcome, glad to see your enthusiasm.

I’d say the first key to properly applying HTM (or any algorithm) is knowing exactly what it does. It takes in __ kind of data and gives __ kind of outputs. The current open source implementation of HTM Theory is NuPIC, which takes in sequential data of certain types and outputs anomaly scores and forecasts of the following input. This means that any analysis scenario with NuPIC must use temporal data.

With a full understanding of NuPIC’s application scope, you can formulate an RL problem that fits. I don’t know much about RL but it seems like an area where HTM could really shine if formulated well, I’m curious to hear what you come up with!


Be sure to check out @sunguralikaan’s work here: HTM Based Autonomous Agent

I have also spent a lot of time studying how to apply RL to HTM. I essentially started with this high-level view:

Given a state, predict all the future expected reward for each possible action that can be performed from that state, and choose the action which is predicted to give the maximum reward. Then use the actual reward to improve future predictions when that state is encountered again in the future.

From this, you can see one obvious place that HTM could be applied. The temporal memory algorithm captures the semantics of temporal patterns, and recalls them when they are encountered again. This could most easily be connected with a backward view TD(λ) algorithm, where each bit in the output of the TM layer would have some weight on what is the current state. Learning in a setup like this could also happen online: as the states (sequences) are learned and re-encountered, simultaneously the model could be predicting and running TD(λ) to improve its predictions over time.
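
To make the "each bit has some weight" idea concrete, here is a minimal sketch of backward view TD(λ) with a linear value estimate over binary feature vectors. The hand-written binary codes stand in for TM layer output, which is an assumption for illustration; no HTM library is involved, and the three-state episode, step size, and function name are made up.

```python
import numpy as np


def td_lambda_update(w, z, x, x_next, reward, alpha=0.1, gamma=0.9, lam=0.8):
    """One backward-view TD(lambda) step with linear values v(x) = w . x.

    x and x_next are binary vectors (stand-ins for TM active bits).
    Returns the updated weights and eligibility trace.
    """
    z = gamma * lam * z + x                          # decay trace, mark active bits
    delta = reward + gamma * (w @ x_next) - (w @ x)  # TD error
    w = w + alpha * delta * z                        # credit recently active bits
    return w, z


# Hypothetical 3-step episode: state 0 -> 1 -> 2 -> terminal, reward 1 at the end.
n_bits = 8
codes = {
    0: np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=float),
    1: np.array([0, 0, 1, 1, 0, 0, 0, 0], dtype=float),
    2: np.array([0, 0, 0, 0, 1, 1, 0, 0], dtype=float),
}
terminal = np.zeros(n_bits)  # the terminal state has value 0

w = np.zeros(n_bits)
for episode in range(200):
    z = np.zeros(n_bits)     # traces are reset at the start of each episode
    for s in range(3):
        x = codes[s]
        x_next = codes[s + 1] if s < 2 else terminal
        reward = 1.0 if s == 2 else 0.0
        w, z = td_lambda_update(w, z, x, x_next, reward)

values = {s: float(w @ codes[s]) for s in codes}
print(values)
```

With a discount of 0.9, the learned values should approach roughly 0.81, 0.9, and 1.0 for states 0, 1, and 2. Every update happens inside the episode loop, one step at a time, which is the online property discussed above.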

I personally believe one could take this even further than just marrying the TM algorithm with an RL algorithm. The TM algorithm out of the box actually does a lot of the legwork required for backward view RL, by connecting a series of states into an “eligibility trace”. What is missing is a way to pool sequences along with their reinforcement “value”.

What I am thinking about currently is combining multiple sources of information into common stable pooled representations. The activity in a TM layer tracking sensory input would be one source, a TM tracking motor actions would be another, and emotions would be another. Combined, these would produce representations that include not only sequences of states and actions, but also their emotional context. Emotional context could then be weighed against the current needs of the system and used for action selection. The pooled representations, once chosen, could then be used to temporally unfold the motor sequence, comparing predicted sensory input and predicted emotional context with reality and updating the model online.


Thank you very much for your useful response.
I think I may have to approach this from another angle, perhaps by comparing HTM-based RL with another reinforcement learning method, since HTM learns online and continuously while RL is a batch, offline learning process. The two seem incompatible from a biological point of view. However, I will still look into what you suggested.


Please, can HTM replace the Q function in a reinforcement learning model to define the policy?
I have read that HTM cannot be used to define a policy because the algorithm, inspired by the neocortex, currently does not have a comprehensive mathematical framework.


At a high level, the Q function (assuming I understand it correctly) simply quantifies the “goodness” of a particular action in a given state. It is easy to imagine how the temporal memory algorithm could be applied to perform that function. The challenge, as you have mentioned, is that it wouldn’t be a mathematical approach to the function, so you would have to devise some sort of adapter/translator to connect it with the larger system.


What HTM excels at is learning from temporal sequences, predicting the next inputs and detecting when the predictability of the system changes. So it may not replace the fitness function for the different actions, but it could predict the effects of different potential actions, which the Q function could evaluate to decide which course to take.


I am not a stats person - where do I find a nice newbie intro to using the Q function with neural networks?


This is a great topic! I would actually start not by replacing the value function. But as mentioned in the other topic, maybe the most direct way to apply TM would be to model the transition and reward functions and use a model-based algorithm. If you can model the world, you can “imagine” possible transitions, replay them in your head and learn offline from the imagined transitions instead of having to actively explore the environment.
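
As a rough sketch of learning from imagined transitions, here is a tiny Dyna-Q style loop on a made-up five-state chain. A plain dictionary stands in for the learned world model; in the setup described above a temporal memory would play that role, which this sketch does not attempt. The environment and all names and parameter values are hypothetical.

```python
import random

N_STATES, ACTIONS, GOAL = 5, (0, 1), 4   # actions: 0 = left, 1 = right


def env_step(state, action):
    """Toy deterministic chain: reward 1 on reaching the right end."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def q_update(q, s, a, r, s2, alpha=0.5, gamma=0.9):
    """One Q-learning backup; the goal state is terminal (value 0)."""
    best_next = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


def dyna_q(episodes=30, planning_steps=20, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned model: (state, action) -> (next_state, reward)
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:  # greedy action, ties broken at random
                a = max(ACTIONS, key=lambda b: (q[(s, b)], rng.random()))
            s2, r = env_step(s, a)
            q_update(q, s, a, r, s2)          # learn from real experience
            model[(s, a)] = (s2, r)           # update the world model
            for _ in range(planning_steps):   # learn from imagined experience
                ps, pa = rng.choice(list(model))
                ps2, pr = model[(ps, pa)]
                q_update(q, ps, pa, pr, ps2)
            s = s2
    return q


q = dyna_q()
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(policy)  # greedy action per non-goal state
```

The `planning_steps` loop never touches the environment. It only replays transitions the model has already recorded, which is the "imagining" part.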

Note that a Q function is not a map of state and action to a reward, but rather a map of state and action to a scalar value, which is composed of the expected reward at the next state plus all future rewards that can be obtained from the next state following a given policy. This can be a little tricky to model.
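
Written out in the usual notation, the definition above looks like this. The symbols are the standard ones rather than anything from this thread: $\pi$ is the policy being followed, $r$ the reward, and $\gamma \in [0, 1)$ a discount factor, which the post does not mention and which I am assuming here.

```latex
Q^{\pi}(s, a)
  = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right]
  = \mathbb{E}_{\pi}\!\left[ r_{t+1} + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1}) \,\middle|\, s_t = s,\ a_t = a \right]
```

The second form is the recursive one: the value of a state-action pair is the expected next reward plus the discounted value of whatever the policy does next.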


This one looks straightforward:

and if you are familiar with PyTorch and want to implement:


I’m not sure a system could learn much by imagining. This would apply better to planning I think. Learning really needs to come from the real world, and planning would involve applying what is learned to new scenarios. Learning would then follow from how well reality matches up with the plan.


It really depends on how good your model of the world is. If you have a perfect model, then there is no difference between actually experiencing the environment or just imagining the experience.

I agree with you that learning from transitions sampled from an imperfect model is hard. It introduces a lot of uncertainty, which is why model-based algorithms never really took off. What we have nowadays in RL is a hybrid model, where the world is not explicitly modeled, but implicitly modeled by keeping a buffer of past experiences and learning by repeatedly sampling from them.

But I still think it is an interesting research direction to improve model-based algorithms with better maps of the world, and see if it leads to better results compared to model-free + experience replay algorithms (such as DQN and its variants).
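
For readers who have not seen one, the buffer of past experiences mentioned above can be sketched in a few lines. This is a generic illustration, not the implementation from DQN or any particular library; the class and parameter names are my own.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop off first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Never ask for more items than are stored.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buf.sample(2)
print(len(buf), batch)
```

Because the same transition can be drawn many times before it falls out of the buffer, the agent learns from each real interaction more than once without ever building an explicit model of the world.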


Let me elaborate a little on what I mean by planning and learning, just in case someone not familiar with RL is following this.

Say you are playing a game of chess against a machine. You have a large table that shows you exactly which action the computer will take given a specific board configuration and an action you take. So at any point you can check this table and know exactly how your opponent will respond. This is what I refer to as a “perfect model of the world”.

Now I give you a task - you have to beat the computer in less than 10 moves. Even though you have a perfect model of the world, you still have to learn how to do that, learn a policy that will tell you the best action to take given each state of the world. But learning this policy doesn’t require any interaction - all you have to do is replay the game thousands of times in your head, until you have a winning strategy, before you even touch the board.

So learning in the RL framework refers to learning a policy. Having a map of the world doesn’t imply you know how to achieve goals in this world, you still have to learn. The distinction we make in RL is that if you have this perfect model, you can solve the problem analytically or by solving a set of recursive Bellman equations - that we refer to as “planning”. But if you don’t have this model, say I randomly remove 50% of all rows in the table, now you may have to actually experience the environment in order to learn your policy. Now we refer to it as “learning”. So in RL the distinction between planning and learning is only on how much you need to interact with the world to learn the policy. And the line between the two is blurry, you can both plan and learn at the same time, which is what model-based algorithms do.

(all the points above are about RL, not HTM)
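
To show what "planning" with a perfect model can look like in code, here is a small value iteration sketch that solves the Bellman equations for a made-up three-state game. The lookup table plays the role of the complete table of opponent responses in the chess example. The states, actions, rewards, and discount factor are all hypothetical.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """transitions[s][a] = (next_state, reward); terminal states map to {}."""
    values = {s: 0.0 for s in transitions}
    while True:
        biggest_change = 0.0
        for s, actions in transitions.items():
            if not actions:          # terminal state keeps value 0
                continue
            new_v = max(r + gamma * values[s2] for s2, r in actions.values())
            biggest_change = max(biggest_change, abs(new_v - values[s]))
            values[s] = new_v
        if biggest_change < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {
        s: max(acts, key=lambda a: acts[a][1] + gamma * values[acts[a][0]])
        for s, acts in transitions.items() if acts
    }
    return values, policy


# Hypothetical three-state "game": the table is the complete model.
table = {
    "start": {"safe": ("start", 0.0), "advance": ("mid", 0.0)},
    "mid": {"retreat": ("start", 0.0), "advance": ("win", 1.0)},
    "win": {},
}
values, policy = value_iteration(table)
print(policy)
```

No environment is ever stepped here. The policy comes entirely from sweeping over the table, which is the sense in which a perfect model turns the problem into planning.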


I doubt biological brains do this (at least it doesn’t feel anything like this when I play chess). It is more like predicting the opponent’s state of mind, and formulating a plan with a few backup options based on how I expect they will react. Definitely doesn’t seem to involve thousands of simulations (it is difficult to hold onto any more than a few dozen)


I agree 100% with you. Our working memory is way too small for this, especially during the game. But at idle times some planning might be happening at a much smaller scale (my own opinion).

I will give you a personal example. I enjoy rock climbing. Some days I will try to climb a route, and fall several times at the same point. When that happens, I spend hours in the following days reliving that situation, thinking of different sequences of moves and what can possibly happen with each sequence. I play it so many times in my head that after some time I know what I have to do to solve the problem (I have an improved policy). And usually the next time I try, I can complete the route in the first or second attempt.

I personally think learning by interacting with an imagined version of the world, even with all the uncertainty from our imperfect model, could help with learning a policy in RL settings.


That is NOT how I play chess. I look at the legal moves open to me and sort through the ones that improve my position and weaken my opponent’s position. This is pure pattern recognition. Sometimes I work through a complicated exchange and see what material is exchanged and what the board will look like afterwards. Sequences of patterns. I train on chess tactics to recognize common situations and outcomes so I don’t have to work them out during the game. Sequences of patterns. At my very best I can see about 10 moves ahead for each of the best candidate moves, having to work through each one serially. Mostly this all works out to pattern recognition based on prior exposure. Seeing a pattern is very different from running a simulation, and when I do work through movement sequences it is very hard. This is more like pattern matching chained together, not what I would think of as running large numbers of simulated moves. At some point in the game I see what could be a winning end position and then work to match that pattern with the one in front of me. This feels like a different type of pattern manipulation but it is still pattern matching.

In all this, the moves I do examine have already been selected by pattern matching against what I know to be “good” moves.

Grandmasters do about the same things I do but they have a larger library of patterns.

I am rated about 1850, the very best players are rated ~2000 to 2500 on this scale. Rank newbies are 100 to 800 on this scale.


From what I understand of the brain - you have to pay some attention to your senses to experience them. The degree of attention is related to the perception of relevance to you. As these experiences are registered in the temporal lobe through to the hippocampus the memory is colored as good or bad by your limbic system - the reward is remembered right along with the experience. Since you don’t know if it will be good or bad while you are experiencing it - it does make sense that the experience is held in the buffer of the hippocampus to be combined with the reward coloring at the end of the experience.
This is the basis of forming value (salience) of perceived objects in future encounters.
This method is distinctly different from RL as employed by the DL camp. It is also the basis of judgment and common sense, properties absent from most DL implementations.


I think I can support what you are saying by wording it this way: after (sparse or dense) addressing of a unique experience/location, the memory is colored as good or bad by adjusting an associated confidence level according to whether an action worked for meeting current needs. Sensory conditions like shock or food are included in the data. This way punishment or reward (or something in between, such as a shock too brief to stop a starving critter) is recalled and re-experienced through what the experience looked and felt like to the senses.

To go from there to chess level planning would be stepwise premotor recall of competing possibilities, without having to physically move the pieces to conceptualize the result of each possible motor action that moves pieces to new locations on a 2D board.


No - I don’t think the memory is adjusted at all. The good or bad is stored as a property like a color.
This goes with the concept that an object is a collection of features. In this case - the feature is some signal communicated by the amygdala.
In the chess case, some patterns are weighted as good and I look for these as I am sorting through possible moves. There is no motor involvement, I “just see” good moves.
