Reinforcement Learning and HTM Algorithm


Respected Seniors,

I am currently a master's student doing research in deep learning, mostly in reinforcement learning. Some days ago I started reading papers related to HTM, and after analyzing the concept I would like, if possible, to equip my reinforcement learning agent with the HTM algorithm, to spare my RL agent several trials before performing an action. I would therefore like to know whether this is feasible and how I could design a new algorithm that incorporates HTM. I would be very grateful to anyone who can assist me.



I think HTM could be an excellent model to use for RL systems, but I see a big problem: the RL systems of today are still set up as spatial processing problems, whereas an HTM model needs to change at every time step. It must be an online learning model. If you can find an RL framework that allows models to be updated at every time step, I would start there. You could use the model's anomaly scores as part of the training signal.
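One way to sketch that last idea, assuming a model that exposes a per-step anomaly score in `[0, 1]` (as NuPIC's models do), is to fold the score into the reward as an intrinsic "surprise" bonus. The `shaped_reward` function and the `beta` weight here are hypothetical, not part of any existing framework:

```python
def shaped_reward(extrinsic_reward, anomaly_score, beta=0.1):
    """Blend the environment's reward with an intrinsic bonus
    proportional to the model's anomaly score, where 0.0 means
    the input was fully predicted and 1.0 means fully surprising.

    The agent is thereby nudged toward states its HTM model has
    not yet learned, while the model itself keeps learning online
    at every time step.
    """
    return extrinsic_reward + beta * anomaly_score
```

With `beta = 0`, this reduces to ordinary extrinsic-reward training; raising `beta` trades exploitation for curiosity-driven exploration.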


Hey @Tresor welcome, glad to see your enthusiasm.

I'd say the first key to properly applying HTM (or any algorithm) is knowing exactly what it does: it takes in __ kind of data and gives __ kind of outputs. The current open-source implementation of HTM theory is NuPIC, which takes in sequential data of certain types and outputs anomaly scores and forecasts of the following input. This means that any analysis scenario with NuPIC must use temporal data.

With a full understanding of NuPIC’s application scope, you can formulate an RL problem that fits. I don’t know much about RL, but it seems like an area where HTM could really shine if formulated well. I’d be curious to hear what you come up with!


Be sure to check out @sunguralikaan’s work here: HTM Based Autonomous Agent

I have also spent a lot of time studying how to apply RL to HTM. I essentially started with this high-level view:

Given a state, predict the expected future reward for each action that can be performed from that state, and choose the action predicted to give the maximum reward. Then use the actual reward received to improve the predictions the next time that state is encountered.
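In standard RL terms, that high-level view is a one-step action-value update (Q-learning). A minimal tabular sketch of it, where the states and actions are opaque labels invented for illustration:

```python
import random
from collections import defaultdict

# Q[state][action] estimates the expected future reward of taking
# `action` in `state`; unseen entries default to 0.0.
Q = defaultdict(lambda: defaultdict(float))
ALPHA, GAMMA = 0.1, 0.9  # learning rate, discount factor

def choose_action(state, actions, epsilon=0.1):
    """Pick the action with the highest predicted reward,
    exploring at random with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def update(state, action, reward, next_state, next_actions):
    """Improve the prediction using the actual reward received."""
    best_next = max((Q[next_state][a] for a in next_actions), default=0.0)
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])
```

Run online, each step calls `choose_action`, observes the reward, then calls `update`, exactly the predict/act/correct loop described above.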

From this, you can see one obvious place that HTM could be applied. The temporal memory algorithm captures the semantics of temporal patterns and recalls them when they are encountered again. This could most easily be connected with a backward-view TD(λ) algorithm, where each bit in the output of the TM layer would have some weight in determining the current state. Learning in a setup like this could also happen online – as the states (sequences) are learned and re-encountered, the model could simultaneously be predicting and running TD(λ) to improve its predictions over time.

I personally believe one could take this even further than just marrying the TM algorithm with an RL algorithm. The TM algorithm out of the box actually does a lot of the legwork required for backward view RL, by connecting a series of states into an “eligibility trace”. What is missing is a way to pool sequences along with their reinforcement “value”.

What I am thinking about currently is combining multiple sources of information into common stable pooled representations. The activity in a TM layer tracking sensory input would be one source, a TM tracking motor actions would be another, and emotions would be another. Combined, these would produce representations that include not only sequences of states and actions, but also their emotional context. Emotional context could then be weighed against the current needs of the system and used for action selection. The pooled representations, once chosen, could then be used to temporally unfold the motor sequence, comparing predicted sensory input and predicted emotional context with reality and updating the model online.
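One simple way to sketch "combining multiple sources into a common pooled representation" is to concatenate each source's SDR into one shared index space before pooling. The three sources and their widths below are invented for illustration; real sensory, motor, and emotional encodings would come from the respective layers:

```python
def combine_sdrs(sources):
    """Concatenate several SDRs, each given as (active_indices, width),
    into one SDR over a shared index space by offsetting each
    source's bits past the previous source's range."""
    combined, offset = set(), 0
    for active, width in sources:
        combined.update(i + offset for i in active)
        offset += width
    return combined, offset

# Hypothetical sources: sensory TM bits, motor TM bits, emotion bits.
sensory = ({3, 17}, 2048)
motor   = ({5},     1024)
emotion = ({0, 9},  256)
pooled, total_width = combine_sdrs([sensory, motor, emotion])
```

A pooling layer sitting above this combined space could then form stable representations spanning states, actions, and their emotional context, as described above.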


Thank you very much for your useful responses.
I think I have to approach this from another side, perhaps comparing HTM-based RL with another reinforcement learning method, since HTM learns online and continuously while most RL is a batch, offline learning process; the two seem incompatible from a biological point of view. However, I will keep working on what you suggested.
