Hierarchical Temporal Memory Agent in standard Reinforcement Learning Environment

Hi all,

This is my BSc final-year project report on using HTM to perform reinforcement learning. The report mainly covers two things: how Etaler works, and an attempt to use HTM in a standard RL environment. Etaler is available at https://github.com/etaler/Etaler and the RL agent can be found at https://github.com/marty1885/HTM-CartPole. The agent architecture is loosely designed after @sunguralikaan’s HTM Based Autonomous Agent paper.

English isn’t my first language, so there will be mistakes in my report. And I should have spent more time on the RL part. Hopefully they don’t affect it too much. Also, since this is not a paper, I have taken some liberty to joke around here and there.

Hopefully you find it interesting.

Edit: Oops I forgot to post the link to the report :stuck_out_tongue: Silly me


@marty1885 thanks for your report.
Could you please give more information about your L3, L5 layer? Which function does it look like (in comparison to SP/TM or even ColumnPooler in HTM theory)?

Do you have any details about why the GC encoder outperforms the ScalarEncoder or RDSE? Better TM prediction?



L3 is simply a TM layer and L5 is a TM with apical feedback. They are just standard HTM layers. I don’t have the expertise to invent my own.

My old post Path anomoly detection using Grid Cells and Temporal Memory compares the GC and scalar encoders on their prediction and anomaly-detection capability. I can run experiments with the agent using the scalar encoder (RDSE is not implemented in Etaler) if you want the result.


Congrats @marty1885, it seems like you’ve been working on that a long time!


Thanks! I hope my work can inspire others who are also looking into RL and SMI.


I am not sure that the GC encoder is better at predicting scalar signals than other encoders.

What kind of experiment can I do to prove it?

Your D1/D2 model seems to pretty accurately reflect the structure of the striatum (though in the striatum, D1 and D2 would be different cell subpopulations within the same tissue, just with different dopamine receptors. There are also cells that appear to be neutral to dopamine in the mix). There’s also the rest of the basal ganglia which does not seem to be modelled here.

The Globus Pallidus, etc. appears to apply some simple weighting on the activity of D1 and D2, and combine them to generate the output of the basal ganglia. However, rather than directly drive the cortex, they instead drive the relay cells in the thalamus that then drive the cortex.
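To make the weighting idea concrete, here is a minimal sketch of that combine-and-gate step. The function name, weights, and the rectification are my own illustrative assumptions, not part of the original model or of the basal ganglia literature; the only thing it tries to capture is "D1 pushes the output up, D2 pushes it down, and the thalamic relay only passes the net positive drive".

```python
import numpy as np

def basal_ganglia_output(d1_activity, d2_activity, w_d1=1.0, w_d2=1.0):
    """Combine D1 ("Go") and D2 ("NoGo") population activity into a
    single gating signal, loosely mimicking the simple weighting the
    Globus Pallidus applies before driving thalamic relay cells.

    All names, weights, and the rectification are illustrative
    assumptions, not part of the original model.
    """
    # D1 excites the output pathway, D2 suppresses it.
    gate = w_d1 * np.asarray(d1_activity) - w_d2 * np.asarray(d2_activity)
    # Relay cells only pass net positive drive on to the cortex.
    return np.maximum(gate, 0.0)

# Example: four candidate actions, D1 votes for, D2 votes against.
d1 = np.array([0.9, 0.2, 0.5, 0.1])
d2 = np.array([0.3, 0.4, 0.5, 0.0])
print(basal_ganglia_output(d1, d2))  # strongest net drive wins
```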

From my own analysis, it seems like that’s enough to implement some simple search algorithms in cortical circuitry. Specifically, algorithms equivalent to the DFS + Smart Backtracking approaches that are typically used in SAT Solvers, Code Synthesis, and other domains where the goal is to solve some kind of crazy-hard problem that humans seem to be naturals at. Very interesting stuff.

This would also suggest a very different approach to problem-solving between HTM and traditional neural networks, and may suggest that some GOFAI-like algos naturally emerge from HTM, though that’s a fairly deep rabbit hole that I’m not going to get into here.

Also, on the subject of memory bandwidth requirements for HTM, it seems to me that, at least at large scales, HTM might not require quite as much bandwidth as would be expected. Brain activity is rather sparse at large scales, and it’s not uncommon to have sparsity across cortical columns, not just within them.

I’ve been thinking a bit about some things that could be done with Bloom-filter-like data structures to quickly discount minicolumns that won’t become active. Seems like an interesting approach to optimization.
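A rough sketch of what I mean, with assumed sizes and a single hash function for brevity: each minicolumn keeps a small bit signature of its potential pool, and a column whose signature shares no bits with the input signature provably has no synapse on any active bit and can be skipped before the expensive overlap count. Like a Bloom filter, it can give false positives (columns checked unnecessarily) but never false negatives.

```python
# Bloom-filter-like pre-filter for Spatial Pooler overlap computation.
# SIG_BITS and the hash are illustrative assumptions; a real
# implementation would precompute column signatures once.
SIG_BITS = 64

def signature(bits):
    """Hash a set of input-bit indices into a small bit signature."""
    sig = 0
    for b in bits:
        sig |= 1 << (hash(b) % SIG_BITS)
    return sig

def candidate_columns(active_inputs, column_pools):
    """Indices of minicolumns that might have nonzero overlap."""
    in_sig = signature(active_inputs)
    # No shared signature bits => no connected synapse on any active
    # input bit => the column cannot become active; skip it.
    return [i for i, pool in enumerate(column_pools)
            if signature(pool) & in_sig]

pools = [{0, 1, 2}, {100, 101}, {2, 50}]
print(candidate_columns({1, 2}, pools))  # → [0, 2]; column 1 is skipped
```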

Also, props for discussing HTM on VLIWs!


I believe you are right for a distributed system. There’s not much communication happening across long distances, so it’s easy to share the load across computers. But on a single computer we are bottlenecked severely by bandwidth. The act of fetching the synapses from RAM alone is enough to saturate DDR4.
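A back-of-envelope estimate shows how quickly this adds up. All the numbers below are assumed, illustrative parameters (not measurements from Etaler): a single modest TM layer, if every distal synapse is touched every step, already wants more bandwidth than one DDR4 channel delivers (roughly 25 GB/s).

```python
# Illustrative, assumed parameters for one TM layer.
columns       = 2048   # minicolumns
cells         = 32     # cells per minicolumn
segments      = 4      # distal segments per cell (assumed)
synapses      = 64     # synapses per segment (assumed)
bytes_per_syn = 8      # presynaptic index + permanence (assumed layout)
steps_per_s   = 1000   # inference steps per second

# Bytes fetched per step if every synapse is read.
bytes_per_step = columns * cells * segments * synapses * bytes_per_syn
gb_per_s = bytes_per_step * steps_per_s / 1e9
print(f"{gb_per_s:.1f} GB/s")  # far above ~25 GB/s for one DDR4 channel
```

Sparsity helps (most segments need not be fetched every step), but the naive access pattern is clearly memory-bound rather than compute-bound.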

In your Results section you mention:

…the agent is just evaluating the expected reward of the intimidate(?) next time step. Leading to the agent not capable of seeing further consequence
of it’s actions.

I interpreted the use of the word “intimidate” to mean either “intermediate” or “immediate”. In either case you discovered that HTM in its microcosmic form cannot deal with larger contexts of space/time on its own. This is where I feel that the vast majority of people (not you specifically, but I’m going to use your work as evidence of what I am talking about) fail to respect the true hierarchical nature of the brain. Our cortex isn’t processing everything as though one single cycle of the HTM algorithm can perform all higher-level temporal recognition. Large-scale temporal pattern recognition is only possible through patterns “echoing” and being recycled through the cortex.

The literal hierarchy that we have discerned in the visual cortex, v1/v2/v3/v4 are only visible to us because they have formed around spatial encoding of visual data that we can tease out - but recognizing something from patterns and motion occurring over time, not merely shape/form, requires feedback and recycling of signals, or otherwise the echoing of signals to build up a context that can only be built up over time (like reverberations themselves, in the literal sense). This is a very abstract thing to think about but it’s really the only way that any kind of network can learn temporal contexts and not be processing information completely oblivious to what was just happening two or more moments ago.

Temporal processing and recognition is not about generating a steady response, it’s about generating a response that is consistent with each recurrence and unfolding of events. Events are hierarchical unto themselves, and have their own temporal contexts, and assemble to a higher level spatio-temporal context. It’s basically just hierarchical auto-associative memory.

Reinforcement learning with HTM in its commonly understood form is possible for something like simple input/output mapping with a single timestep of temporal prediction, but for a much deeper temporal context the output of one HTM must feed into more (and even back into itself). All of the existing HTM implementations will fail to learn and predict the notes of a song unless the song only has a few notes (and the notes are all different). Without anything recognizing higher-level context, there’s nothing to predict the next note with other than the previous two notes.
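A toy illustration of the ambiguity I mean (this is deliberately not HTM itself, just a first-order predictor that only sees the previous note): in a melody where the same note appears in two places, single-step context cannot say what follows it.

```python
from collections import defaultdict

# "C" is followed by "D" in one place and "E" in another, so a
# predictor keyed only on the previous note is stuck guessing.
song = ["C", "D", "C", "E"] * 3

successors = defaultdict(set)
for prev, nxt in zip(song, song[1:]):
    successors[prev].add(nxt)

print(successors["C"])  # {"D", "E"} -- ambiguous without more context
```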


Oops. Typo. Thanks!

You’re absolutely right. I think the main reason HTM is under-performing is its inability to handle long-term contexts, and I can only hope the Thousand Brains Theory can provide a solution. With that said, I think the Ogma team is on to something: by having higher layers of the cortex run at a lower speed, they can determine the current context better than HTM.


@marty1885 I do not think it makes sense to run different HTM layers at different speeds, with the higher layers running slower than the lower ones, because the current HTM hierarchy does not support it, and grid cells with different spacings/sizes already perform this task somehow, at least for navigation.
About HTM CartPole: it is a very challenging control task. Maybe your model is suitable for self-driving cars, where D1 and D2 can be used for deciding the driving direction or calculating the steering-wheel angle.

Congrats @marty1885, you did a great job so far.
