An open-source community research project on comparing HTM-RL to conventional RL

From my perspective, the goal would be a working SMI implementation based on HTM that includes RL for learning good and bad actions for a given task. This is essentially @sunguralikaan’s project, except instead of a game agent, it would be applied to benchmarks for comparison with other agents. The project would also include defining those benchmarks.

  1. HTM can currently only learn sequences, not act on them. To create an RL algorithm we must have some way of choosing an action that maximizes future expected reward.

  2. What I expect is possibly some improvement in the generalization (the agent can learn multiple tasks), data-efficiency (learns with less exploration) and capacity (can maintain knowledge of multiple tasks at the same time) of the agent compared to current state-of-the-art RL algorithms.


We can definitely reuse and adapt a lot of @sunguralikaan’s and @kaikun’s work to our advantage. The main goal which is not addressed by those projects though is the comparison to existing methods.

I feel that until HTM shows some superior performance in some areas of ML there will not be any academic interest in this technology. i.e. we have to put our money where our mouth is.


I want to get more of your (community) perspectives on this.
Do you think this is needed/relevant?
Is HTM even comparable to other methods?
Do we need to wait for more HTM research (motor/goal directed behavior) to be done?


I suppose it depends on what you are using the sequence memory for.

If you remember prior unsuccessful trials you could build an “aversion” to repeating the failed sequence. A look-ahead probe at each choice could test the sequence and - as I just said - activate the dread that this is a failed path. The right answer at this point is the other (non-failing) choice.

This would end up making a breadth-first search pattern.
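A toy sketch of that aversion idea (all names here are invented for illustration, not from any HTM library): store the failed trials, and at each choice point run a look-ahead probe that checks whether extending the current history lands on a known failed path.

```python
# Sketch of the "aversion" idea: remember failed trials and use a
# look-ahead probe at each choice point to steer away from them.
# A deeper probe could refine this (a prefix of a failed trial is not
# necessarily doomed), but this shows the basic mechanism.

failed_trials = []  # complete action sequences that ended in failure

def leads_to_known_failure(seq):
    """Look-ahead probe: does this sequence lie on a known failed path?"""
    return any(trial[:len(seq)] == seq for trial in failed_trials)

def choose(history, candidate_actions):
    """Prefer the first candidate whose extended sequence never failed."""
    for action in candidate_actions:
        if not leads_to_known_failure(history + [action]):
            return action  # the non-failing choice
    return candidate_actions[0]  # every probed path failed; pick arbitrarily

failed_trials.append(["left", "left"])
print(choose(["left"], ["left", "right"]))  # → right
```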

The most naive idea, which I outlined in “Trying to make an HTM augmented/based RL algorithm”, is to let the model predict both the next state’s value and the next action, and to choose the action by sampling N random actions in addition to the predicted one and picking the one with the best predicted value. Supposedly the model will converge on the best policy.
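A minimal sketch of that sampling scheme. `ToyModel` and its methods are invented stand-ins for a trained HTM predictor, just to make the selection rule concrete:

```python
import random

# Illustrative only: ToyModel stands in for an HTM-based predictor that,
# after training, suggests an action and scores (state, action) pairs.
class ToyModel:
    def predict_action(self, state):
        return 0  # the model's own suggestion (untrained here)

    def predict_value(self, state, action):
        return -abs(action - 3)  # pretend action 3 is best in every state

def choose_action(model, state, action_space, n_samples=3):
    """Sample N random actions plus the model's suggestion and keep the
    one with the best predicted value."""
    candidates = [model.predict_action(state)]
    candidates += random.sample(action_space, n_samples)
    return max(candidates, key=lambda a: model.predict_value(state, a))

actions = list(range(5))
print(choose_action(ToyModel(), state=None, action_space=actions, n_samples=5))  # → 3
```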

I have not yet fully read @sunguralikaan’s work on making an HTM-TD(lambda) hybrid, which seems like a more sensible approach.


I would like to see this happening but it is quite a challenge. I thought a lot about this during my thesis and even went through some heated arguments with my advisor. As you pointed out, the academic interest requires a reference point, something to compare to. There are a couple of problems with this that I see:

1- The comparison should be between HTM-RL vs X-RL here (put some ML approach in place of X). HTM is not RL by itself, so I am not sure about comparing it directly with another pure RL approach like Q-Learning.

2- For the sake of brainstorming, let’s say we picked Deep Reinforcement Learning (for example DQL). What goal oriented setting or benchmark would allow both HTM-RL and DRL to shine? One learns online and learns continuously while the other is a batch learning offline process. I am afraid the comparison here would be just for the sake of it. There have been DL-HTM or LSTM-HTM comparisons here mainly for streaming data. One study even compared QL directly with HTM. Some were anomaly detection tasks, some did not even have temporal dynamics.

3- How about biological plausibility? Obviously we have to take that into account while comparing, so should we pick something that also has this concern? What are our options? My only contender on this point was Nengo, which the cognitive science crowd is interested in but the ML crowd is not. I am still considering this, by the way.

4- Any HTM-RL would have to be bioplausible for it to make sense in terms of the approach of HTM. If the applied RL approach is not concerned with biology, why not just throw out HTM for a more practical approach as well? So there is the difficulty of extending the current theory with a bioplausible RL module.

5- Is there really a setting that could make HTM-RL outperform others if we disregard biological plausibility, noise robustness and online learning? After all, without performing better in some sense, all we can say is that it functions like the cortex (I really wish this was enough for a lot of the researchers).

It would be great if we could come up with a plan or a roadmap considering these points.


Correct. One other question: Is the agent HTM-augmented or HTM-based? i.e. do we want to add HTM to an existing algorithm or come up with a mostly novel RL algorithm (which would be more biologically based)?

That’s a great question and should be discussed thoroughly. I think we should test on as many environments as we can, with emphasis on ones of academic interest and ones that have been the subject of established papers.

Can you share a link?

I don’t think it’s so obvious. I’m not a fan of pursuing biological intelligence just for its own sake. We should draw on biological ideas because they are useful for solving real problems, understand why they work, and if we invent some better approach based on math or otherwise, we should use it instead. I’m interested in HTM mainly because it may have some advantages in areas where current algorithms fall short.

Again, I think HTM may have something to offer beyond the capabilities of current algorithms (it has a lot of drawbacks too). The point is to test and measure those capabilities. If it turns out that you can get the same benefits with conventional approaches (which may well be the case), so be it and I would proceed to focus on those approaches instead of HTM.

I think so, the most promising approach, in my opinion, is to make a hybrid that takes advantage of both HTM and conventional RL (as you did). Noise robustness and online learning are very desirable features and we should strive to preserve them as much as possible.

I also don’t think it’s enough. The reason we are interested in the brain is that we think it hides some extra tricks that we are missing and we should try to understand those in order to create a truly general AI. I’m not interested in modeling the brain as close as possible, I’m interested in making AGI and the brain has shown pretty impressive capabilities that out-perform our current models.

I really appreciate your feedback! It’s great to hear from someone who actually did the heavy work and thought carefully about this stuff. Great comments!

Also, feel free to contribute as much as you’d like, I feel that we can use someone like you to guide the process and provide valuable references.


One thing that needs to be defined for this project is the nature of the tasks that will be used for benchmarking. Like others have pointed out, we must overcome the fact that on the surface we are trying to compare apples to oranges. The goal, then (in my mind), is to determine how/if HTM can be used to enhance existing RL algorithms to perform tasks better.

One important element is to define the possible types of interfaces. For example, if a task involves learning to play checkers, how is the agent made aware of the board? (does it get raw pixel data, a text matrix, etc?), and how does it move pieces? (control a mouse cursor, generate commands like “E2 C4”, etc) What are the basic sensory organs it must have? (image encoders, text encoders, etc) And what motor interfaces does it require? (text generation, button state generation, mouse cursor control, etc)
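As a strawman, those interfaces could be sketched as abstract classes. Every name here is hypothetical, invented just to make the discussion concrete:

```python
from abc import ABC, abstractmethod

# Hypothetical interface sketch for benchmark tasks; none of these names
# come from an existing HTM or RL library.
class SensoryInterface(ABC):
    @abstractmethod
    def observe(self):
        """Return the current observation (raw pixels, a text matrix, ...)."""

class MotorInterface(ABC):
    @abstractmethod
    def act(self, command):
        """Apply an action (a move string like 'E2 C4', cursor deltas, ...)."""

class CheckersBoard(SensoryInterface, MotorInterface):
    """Toy checkers task using a text-command motor interface."""
    def __init__(self):
        self.moves = []

    def observe(self):
        return self.moves  # here: the move history as a text list

    def act(self, command):
        self.moves.append(command)

board = CheckersBoard()
board.act("E2 C4")
print(board.observe())  # → ['E2 C4']
```

The same agent could then be wired to different `SensoryInterface`/`MotorInterface` pairs without changing its core.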

Another important element is to define the types of challenges. RL obviously shines when it comes to games, but other challenges might include sequence memory challenges like predicting words, or anomaly detection challenges (where HTM obviously shines).

I tend to agree, though one can also imagine inventing a mostly novel HTM-based RL algorithm.

I think this doesn’t matter very much as long as the input/output are the same for both HTM and the RL algorithm. Though my intuition is that HTM may be more useful for tasks which have high conceptual complexity but low perceptual complexity. E.g. the game of Go has high conceptual complexity (it’s hard to know which action to take) and low perceptual complexity (a 19x19 bit array), while the task of pointing the cursor at a box shown on screen has low conceptual complexity (the task is very simple) and high perceptual complexity (visual perception and fine motor control).

I feel like we can’t really know until we try lots of tasks. One possible direction is to use HTM’s anomaly detection as described in instead of a conventional ANN model and see if there’s a significant improvement.

This is where this being a community project can be very helpful: we can easily have multiple people working on different approaches at the same time. And we can have a common set of tools and benchmarks to make it consistent and easy to try out new things.


Is trying to beat the existing state of the art for a specific RL approach the correct focus at this point in time?

I was thinking more along the lines of very simple Unity-ML based environments and seeing what HTM/grid cell approaches can do in terms of RL - learning to achieve some goal.

Start simple, learn where HTM works/struggles, see best ways to bring in goals/rewards. The focus still purely on learning.

As more is learnt, decide to either pursue novel tasks with no non-HTM RL equivalents, or, if deemed appropriate based on knowledge gained, select an existing RL experiment which is in the HTM-RL sweet spot.

In Unity-ML it is easy to create environments where agents must sense and act upon the virtual world to achieve a goal.


Well, I’m not necessarily trying to beat state of the art algorithms, though that would be awesome. I’m trying to evaluate HTM against a baseline. We’ve already got two HTM-RL implementations (granted one is based on the other), we can already try to evaluate them against standard approaches. We also have a bunch of already implemented baseline environments and algorithms to test against.

After evaluation, we should, however, think about how to improve them or perhaps invent a new one, and test those improvements as well.


I think I’ve found the place for HTM in the RL landscape. In model-based RL, the algorithm tries to learn a model of its environment (i.e. the state transition and reward probabilities) and plans actions based on this model by sampling scenarios to approximate the value of actions. HTM can act as this model by predicting the next state and reward based on the previous (state, action, reward); this constitutes learning the dynamics of the world.

We can then compare HTM to other model-learning algorithms (Bayesian nets, ANNs, Gaussian processes) using some baseline algorithms (Dyna, Dyna+ with rewards for discovering anomalies, MC tree search, etc.).
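To make that concrete, here is a minimal tabular Dyna-Q sketch in which a plain dictionary plays the role proposed for HTM: it learns the dynamics (state, action) -> (reward, next state), and the planning loop replays samples from it to update Q. The 5-state chain environment and all parameters are invented for illustration.

```python
import random

# Minimal Dyna-Q: direct RL from real steps, plus planning from a
# learned model of the dynamics. Environment: a 5-state chain where
# moving right (+1) from the last state yields reward 1.
random.seed(0)
N_STATES, ACTIONS = 5, [-1, +1]
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
model = {}  # learned dynamics: (s, a) -> (reward, next_state)
alpha, gamma = 0.5, 0.9

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + a))
    return (1.0 if (s == N_STATES - 1 and a == +1) else 0.0), s2

for episode in range(50):
    s = 0
    for _ in range(20):
        a = random.choice(ACTIONS)  # exploratory behavior policy
        r, s2 = step(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        model[(s, a)] = (r, s2)     # learn the world's dynamics
        for _ in range(10):         # planning: replay transitions from the model
            ps, pa = random.choice(list(model))
            pr, ps2 = model[(ps, pa)]
            Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in ACTIONS) - Q[(ps, pa)])
        s = s2

# moving right should now look better than moving left in every state
print(all(Q[(s, +1)] > Q[(s, -1)] for s in range(N_STATES)))  # → True
```

Swapping the dictionary for an HTM sequence memory that predicts (reward, next state) is exactly the substitution proposed above.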



For HTM to function effectively in this type of model (pure HTM sequence memory applied to state -> action -> reward), it will need a good pooling algorithm. When predicting reward, it should not predict only the reward for the next action, but also the sum of rewards for all future actions after that one (otherwise the system would never take an action with a small predicted negative reward that could lead to a large future positive reward).
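A tiny numeric illustration of why next-step reward alone is not enough: a greedy one-step comparison and a discounted-return comparison (the quantity the pooling layer would need to represent) disagree about which path is better. The paths and discount factor are made up for the example.

```python
# A small immediate loss can precede a large later payoff.
def discounted_return(rewards, gamma=0.9):
    """Sum of future rewards, each discounted by how far away it is."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

path_a = [0.1, 0.0, 0.0]   # small immediate gain, nothing after
path_b = [-0.1, 0.0, 1.0]  # small immediate loss, big later payoff

print(path_a[0] > path_b[0])  # → True: next-step prediction prefers A
print(discounted_return(path_b) > discounted_return(path_a))  # → True: B wins overall
```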

Any temporal pooling algorithm could work here (for example there is a Union Pooler implementation in HTM research which could be applied). An ideal pooling algorithm, however, would be purely “forward looking” (where the representation only depicts future rewards, and does not include past rewards). One strategy that I have experimented with in the past is to grow proximal connections in the pooling layer using a TM-like algorithm, with previously active cells over multiple time steps (using a logarithmic decay algorithm). This results in active representations of predicted future rewards which extend further and further backwards in time the more a particular context is encountered.

Ultimately, though, I don’t think this is the best strategy, because it doesn’t scale. Over time, the representations become saturated, and reward predictions begin to suffer. It would be better to model the “go - no go” circuit of the basal ganglia.


Nope, in model-based RL the model is only concerned with understanding the dynamics of the world (i.e. what will happen if I’m in state S doing action A?), not with maximizing the reward; it’s only concerned with the next step.

Maximizing the reward is typically done by applying model-free methods (Monte Carlo, TD, TD(lambda)…) to sequences sampled from the model (using the algorithms mentioned above).


Perhaps I don’t understand model-based RL then, but if a system is unable to maximize the reward, seems like it would be pretty “dumb”. What are the advantages of a system which is only capable of knowing the reward for the next step?

I guess it should be asked: how far forward in time does the human cortex run?

I am not asking about the “thinking process” but the cortex in general. I have posted my ideas on consciousness before and I see that as a very different process.

Do we have any research that suggests that the brain predicts more than a very short time horizon?

Perhaps a naive response, but to me “it must”. Planning far into the future is an almost uniquely human trait, and the obvious place to look for that trait is our massive neocortex. Why would I work my butt off to earn overtime so I can buy the latest gadget in a few months? There must be an RL mechanism in place which allows me to accept numerous negative rewards over a long period of time with the hope of a big positive reward sometime down the road.


Learning a goal that can be recalled and used to shape decisions is distinctly different from the temporal mechanism in the cortex predicting one second into the future.

HTM predictive state predicts one alpha cycle into the future. 100 ms or so. The models I have seen here do Ok with a few steps in the future. It seems that longer sequences don’t work very well.

I think that the HTM model needs continuous feedback from the outside world to keep from wandering off into nonsense.


RL makes a prediction of the future: what the next frame or sequence will be. When the new frame comes, it is compared with the prediction. The difference is fed into a cost (loss) function, and the network is trained via backpropagation to give the right prediction. A large prediction error could be treated as a bad reward; selecting the right prediction when the anomaly detector activates is a good reward. But this only strengthens the right path, like a reflex.

The HTM method will remember both the right way and the wrong way, such as driving a car into an accident.

But machine learning methods are different from HTM.
