An open-source community research project on comparing HTM-RL to conventional RL


Learning a goal that can be recalled and used to shape decisions is distinctly different from the temporal mechanism in the cortex that predicts one second into the future.

The HTM predictive state predicts one alpha cycle into the future, 100 ms or so. The models I have seen here do OK with a few steps into the future. It seems that longer sequences don’t work very well.

I think that the HTM model needs continuous feedback from the outside world to keep from wandering off into nonsense.


RL makes a prediction of the future: what the next frame or sequence will be.
When the new frame comes, it is compared with the prediction. The difference is put into a cost or loss function, and the network is trained with backpropagation to give the right prediction.
A large prediction error could be a bad reward. Selecting the right prediction when
the anomaly detector activates is a good reward.
But this only strengthens the right path, like a reflex.

The HTM method will remember both the right and the wrong way, such as driving a car into
an accident.

But machine learning methods are different from HTM.


Yes, definitely agree. Of course a large part of the former must involve the cortex, since it is the source of deep abstractions. I should point out that I do not believe the idea of a pure TM implementation involving state -> action -> (long term) reward is at all how the brain does it. It is simply one way that HTM could be applied to these types of problems.


Absolutely. HTM is currently just about modeling the world and making predictions. Rewards and action choices are definitely something that will also need to be modeled in the future. That will of course involve modeling other structures of the brain besides the cortex, but I expect it will be a necessary part of understanding the SMI circuit (there has to be something to drive the system to choose one action over another, or to take any action at all, for that matter).

I love your haiku-like posts, BTW! :stuck_out_tongue_winking_eye:


There must be an RL mechanism in place which allows me to accept numerous negative rewards over a long period of time with the hope of a big positive reward sometime down the road.

This isn’t something that everyone with a neocortex possesses, either. Young human children don’t develop this until later, and some adults never properly “mature” to the point of accepting delayed gratification as a valid reward (MTV anyone?). For this reason I assume any RL mechanism is via an optional, higher level abstraction.

Completely riffing here so shoot me down, but if a reward were an SDR where some bits are linked to the object and its desirability (shiny gadget = dopamine?) and other bits to a representation of the temporal aspect (the idea of minutes in the future vs years in the future), then don’t you have the basis for an instant decision that applies to a reward over any length of time?


I don’t have any definitive answers, but it seems highly unlikely, to me at least, that RL is somehow embedded into the function of cortical columns in isolation…or HTM models for the same reasons. When you discuss RL in terms of biology, you’re talking about the functions of the limbic system. Reward and punishment are emotional responses to stimuli. I don’t think it’s possible for humans to perform RL without a reward circuit in place that at least includes the functions of the amygdala, nucleus accumbens, ventral tegmental area and even the cerebellum and pituitary gland. Far beyond cortex.

Sort of like what @Bitking mentioned, it’s not clear to me how a time-series modeling and prediction algorithm like HTM fits into the scope of RL. I’m all ears to ideas. It seems any adaptation of HTM to the world of RL would require brand new thinking and modeling to the point where it’s basically something else entirely.


Of course the system is able to maximize future reward. But the model component of it (HTM) is only concerned with learning the dynamics.
Take a look at for more about model-based RL.

My point is that HTM could be viewed as the model component in model-based RL (not the whole algorithm).

BTW I think the conversation may be diverging from the main topic.


From my perspective, TD-lambda may not be used only to maximize rewards. The same goes for the basal ganglia and, by extension, dopamine secretion. They are there to minimize errors via a surprise in reward. A slight difference, but it changes a lot in the bigger picture. There are studies on how dopamine secretion in the basal ganglia behaves similarly to the error produced by TD-lambda, not the reward. There was also a video shared in this forum where it is argued that error is the conductance of neural computation; I forgot where.
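To make the distinction between the reward and the error concrete, here is a minimal sketch of tabular TD(lambda) on a made-up three-step chain. The chain, the parameter values, and the function name `td_lambda_episode` are assumptions for illustration only; nothing here is HTM-specific. The reward at the end of the chain never changes, yet the TD error (the "surprise") shrinks toward zero as the values are learned.

```python
import numpy as np

def td_lambda_episode(states, rewards, values, alpha=0.1, gamma=0.9, lam=0.8):
    """Run tabular TD(lambda) over one episode and return the TD errors.

    states  : visited state indices, length T + 1 (the last one is terminal)
    rewards : reward received on each transition, length T
    values  : value table, updated in place
    """
    traces = np.zeros_like(values)
    td_errors = []
    for t, reward in enumerate(rewards):
        s, s_next = states[t], states[t + 1]
        terminal = t == len(rewards) - 1
        target = reward + (0.0 if terminal else gamma * values[s_next])
        delta = target - values[s]     # reward prediction error, not the reward
        traces *= gamma * lam          # decay eligibility of earlier states
        traces[s] += 1.0               # accumulating trace for the current state
        values += alpha * delta * traces
        td_errors.append(float(delta))
    return td_errors

# Hypothetical chain 0 -> 1 -> 2 -> 3 with a reward of 1 on the last step.
values = np.zeros(4)
states, rewards = [0, 1, 2, 3], [0.0, 0.0, 1.0]
first = td_lambda_episode(states, rewards, values)
for _ in range(500):
    last = td_lambda_episode(states, rewards, values)

print("TD errors, first episode:", first)
print("TD errors, last episode: ", last)
```

In the first episode the error on the rewarded step equals the reward itself; after training, the same reward produces almost no error because it is fully predicted.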

I am not sure about this. If HTM is there to emulate the cortex, there could be another module emulating the reward circuitry (basal ganglia or amygdala), and the two could be combined. They don’t have to be merged.


@sunguralikaan I’m interested in hearing your opinion. What do you think about this approach (using HTM as the model in model-based RL and comparing it to other types of model learning methods using some baseline model-based RL algorithm)?


This is what I meant by HTM-RL vs X-RL in my first post. I do not know whether that is the best way, but I think the architecture I use qualifies as that. As I said above, RL and HTM do not have to be merged.

From a biological perspective, our brain does not seem to merge them. The majority of reward circuitry is composed of subcortical structures. The model and reward circuitry seem to be separate in our brain. Reward circuitry works reciprocally with neocortex but still through separate structures.

From a research perspective, if RL and HTM were somehow merged to come up with some novel structure, that would pose problems for incorporating advancements into that new structure. For example, I already have problems incorporating grid cells and possibly a pooling layer into the current HTM-TD(lambda). A novel and merged structure would have a harder time keeping up with HTM advancements. This may or may not be a concern.

From an academic standpoint, proposing a merged or novel solution is more difficult as HTM itself is not a known or matured method in the eyes of researchers (this may be a local problem for me). So offering a merged structure built on the foundation of HTM is even more problematic to present. HTM-RL vs X-RL may be more doable; at least there is HTM as the academic baseline this way.

Common sense directs me towards comparing HTM-RL to X-RL where X is some other model. However, there are many different ways of applying RL to HTM. In order to add RL to HTM, you can score activations, you can score minicolumns, neurons or even dendrites, you can force activations, you can search among activations, you can feed possible actions and score their results, you can pool activations, you can try to mimic basal ganglia pathways, you can alter permanence parameters with reward or error signals. These are just some of the things I tried or considered. The approach can change wildly depending on your biological concerns. After all, what is the state of an HTM model? Predictive cells, active cells, active minicolumns, active dendrites or even synapses, or some combination of these? And this is just considering a single HTM layer.

For the comparison, I guess the simplest option is a single HTM layer combined with, for example, Q-learning or TD, where biological concerns are ignored for the sake of implementing an accessible RL that is closer to whatever you are comparing to. I have biological constraints as well, which complicate the whole process. I am not sure all of us should have them, considering we do not even have a baseline HTM-RL comparison.
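As a rough sketch of that simplest option, the snippet below runs plain tabular Q-learning where the state key is the set of active minicolumns. It assumes the HTM layer is an external component that hands over active column indices; the class `SdrQLearner`, the hard-coded column sets, and the two-state toy environment are all hypothetical stand-ins, and no HTM library is called.

```python
import random
from collections import defaultdict

def sdr_key(active_columns):
    """Turn a set of active minicolumn indices into a hashable state key."""
    return tuple(sorted(active_columns))

class SdrQLearner:
    """Tabular Q-learning keyed on SDRs (one possible answer to 'what is the state?')."""

    def __init__(self, n_actions, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, sdr):
        """Epsilon-greedy action selection."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        row = self.q[sdr_key(sdr)]
        return row.index(max(row))

    def update(self, sdr, action, reward, next_sdr, done):
        """Standard one-step Q-learning update."""
        row = self.q[sdr_key(sdr)]
        best_next = 0.0 if done else max(self.q[sdr_key(next_sdr)])
        row[action] += self.alpha * (reward + self.gamma * best_next - row[action])

# Hypothetical active-column sets that an HTM layer might emit for two states.
SDRS = {0: {1, 5, 9}, 1: {2, 6, 10}}

def step(state, action):
    """Toy world: action 1 moves right; moving right from state 1 ends with reward 1."""
    if action == 1:
        return (None, 1.0, True) if state == 1 else (1, 0.0, False)
    return (state, 0.0, False)

agent = SdrQLearner(n_actions=2)
for _ in range(200):
    state = 0
    for _ in range(20):                      # cap the episode length
        action = agent.act(SDRS[state])
        nxt, reward, done = step(state, action)
        agent.update(SDRS[state], action, reward, SDRS.get(nxt, set()), done)
        if done:
            break
        state = nxt

print({k: [round(v, 2) for v in row] for k, row in agent.q.items()})
```

Using exact column sets as keys only works when the same input always yields the same SDR. With noisy or overlapping SDRs, some overlap-based matching would be needed instead, which is one of the open design choices listed above.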

TL;DR model-based RL seems to be one of the better ways.


Ok, I got it. You are proposing that HTM essentially would be there to learn and provide only the current state of the system (and predict the next state), and the rest of the application would consist of other ML techniques unrelated to HTM.


As I think I said, I’m not really super concerned about the biological plausibility of the algorithm. I am interested in evaluating whether those biological approaches offer an advantage over current techniques. So what I’m suggesting is to just take a known model-based algorithm and plop HTM in place of the model and see if it’s better than other models.

Later on we can think of more ways to make it better. Some examples: using anomalies to provide curiosity-driven exploration, using the reward signal to scale the HTM learning rate (permanence changes), etc…
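Both ideas can be sketched in a few lines. The anomaly score below is the usual HTM raw anomaly score (the fraction of active columns that were not predicted). The bonus weight, the scaling rule, and the function names `shaped_reward` and `scaled_permanence_increment` are assumptions made up for illustration, not part of any HTM library.

```python
def anomaly_score(predicted_columns, active_columns):
    """Fraction of active columns that were not predicted (0 = expected, 1 = novel)."""
    active = set(active_columns)
    if not active:
        return 0.0
    return 1.0 - len(active & set(predicted_columns)) / len(active)

def shaped_reward(extrinsic_reward, anomaly, curiosity_weight=0.5):
    """Add a curiosity bonus so that surprising states are worth visiting."""
    return extrinsic_reward + curiosity_weight * anomaly

def scaled_permanence_increment(base_increment, reward, max_scale=3.0):
    """Learn faster on transitions with a large reward of either sign (assumed rule)."""
    scale = min(max_scale, 1.0 + abs(reward))
    return base_increment * scale

predicted = {3, 7, 11, 20}
active = {3, 7, 42, 99}
score = anomaly_score(predicted, active)
print("anomaly:", score)
print("shaped reward:", shaped_reward(0.0, score))
print("permanence increment:", scaled_permanence_increment(0.1, reward=1.0))
```

One caveat with the curiosity bonus is that inputs that can never be predicted (pure noise) would stay rewarding forever, so the weight would probably need to decay or be capped.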


Yes. HTM would learn the dynamics of the world: state transitions and rewards. This knowledge will later be used by the rest of the algorithm to make good short-term decisions, for example by sampling some scenarios from the model and measuring the return for each action.
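A minimal sketch of "sample scenarios from the model and measure the return" is a random-shooting planner. The `TabularModel` class below is only a stand-in with a `learn`/`predict` interface that an HTM-based model would have to provide; that interface, the toy line world, and all parameter values are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

class TabularModel:
    """Stand-in for the learned model; an HTM-based model would replace this."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # (state, action) -> next-state counts
        self.reward_sum = defaultdict(float)
        self.visits = Counter()

    def learn(self, state, action, reward, next_state):
        key = (state, action)
        self.transitions[key][next_state] += 1
        self.reward_sum[key] += reward
        self.visits[key] += 1

    def predict(self, state, action):
        """Return the most frequently seen next state and the mean reward."""
        key = (state, action)
        if self.visits[key] == 0:
            return state, 0.0                    # unknown: assume nothing happens
        next_state = self.transitions[key].most_common(1)[0][0]
        return next_state, self.reward_sum[key] / self.visits[key]

def plan(model, state, actions, horizon=5, n_rollouts=50, gamma=0.95, seed=0):
    """Pick the first action whose random rollouts have the best mean return."""
    rng = random.Random(seed)
    returns = {a: [] for a in actions}
    for _ in range(n_rollouts):
        first = rng.choice(actions)
        s, a, total, discount = state, first, 0.0, 1.0
        for _ in range(horizon):
            s, r = model.predict(s, a)           # imagined step, no real interaction
            total += discount * r
            discount *= gamma
            a = rng.choice(actions)
        returns[first].append(total)

    def mean_return(a):
        return sum(returns[a]) / len(returns[a]) if returns[a] else float("-inf")

    return max(actions, key=mean_return)

# Toy line world 0..3: reward 1 for arriving at state 3 from another state.
model = TabularModel()
for s in range(4):
    for a in (-1, +1):
        s_next = min(3, max(0, s + a))
        model.learn(s, a, 1.0 if (s_next == 3 and s != 3) else 0.0, s_next)

best = plan(model, state=2, actions=[-1, +1])
print("best first action from state 2:", best)
```

The planner only ever calls `predict`, so comparing HTM against another model under this scheme would come down to swapping the object passed as `model`.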


One area that will take some work to fit HTM into this type of system is the part about sampling scenarios and measuring the return. In vanilla HTM, the inputs must happen in order (if not, then as you know the minicolumns burst, which predicts the next input for all learned states which contain the current input). You’ll need to devise a smart strategy for resetting, rewinding, and toggling learning on/off.


As of right now, the state of the art in the machine learning community is
to use a deep NN to predict the next state in real time.
The deep NN can be paired with a video system. The video is then run through
the deep NN to predict the next state or frame, in video land. This can be as good as predictive thought.
For example, a bot with the NN could be going along just fine, like going around a circular race track. Then it gets an anti-reward: the battery is very low. So it goes up
into video memory and looks for an off-ramp or the finish line.
Also, the deep NN can be trained on video while it sleeps.

HTM would have SDR bits activating on features within the video, like the steering wheel,
road lines, light posts, etc., which the deep NN is doing too. But what HTM is not
doing is taking information from one or more frames and mixing it together in a way
that builds the next frame. HTM just uses hard memory saves.
HTM SDR activation bits would be best used in the early layers of an unsupervised
deep neural network.

The deep NN is acting like a sliding window algorithm. But instead of sliding along the
data, the deep NN is anchored and the data is slid through it.

HTM, semantics, and fuzzy prediction

How to sample and choose actions is well known and not a part of the HTM model. Learning is enabled just before feeding the real next state and reward, and then it is disabled again. Does HTM still change some internal state even if learning is disabled?


Yes. If you for example have learned the following sequences:


Then you are in state “D after C after B after A”, and with learning disabled you sample “W”. You have now changed the internal state to “W after D after…”. At this point if you try to sample “R” for example, the minicolumns will burst, because you are no longer in the “D after C after…” context.


Yes. The predictive current state is affected by the past states.

This is why you see all the discussions of the evolution of ABXC —> ABC and the like on these posts.

Rats. Paul beat me to it.


Yeah so we need a way to reset the state each time we sample.


This is not part of the current HTM canon but I do see that idea as a very useful tool.
(A state checkpoint/restore on an HTM system)
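To illustrate what such a checkpoint/restore could look like, here is a toy sketch. `ToySequenceMemory` is not HTM and not an existing library class; it is a tiny high-order sequence memory invented for this example, and the letter sequence is made up rather than taken from the posts above. It only shows the pattern: save the context, sample a hypothetical input with learning off, then restore the context before continuing.

```python
class ToySequenceMemory:
    """Not HTM: a tiny high-order sequence memory used only to show the idea."""

    def __init__(self, order=3):
        self.order = order
        self.context = ()      # recent inputs, the "D after C after B" part
        self.known = set()     # learned (context, next input) pairs

    def step(self, symbol, learn=True):
        """Feed one input; return True if it was unexpected (a 'burst')."""
        burst = (self.context, symbol) not in self.known
        if learn:
            self.known.add((self.context, symbol))
        # The context advances even when learning is off.
        self.context = (self.context + (symbol,))[-self.order:]
        return burst

    def checkpoint(self):
        """Save the temporal context (tuples are immutable, so no copy is needed)."""
        return self.context

    def restore(self, saved):
        self.context = saved

    def reset(self):
        self.context = ()

tm = ToySequenceMemory()
for _ in range(2):                 # learn a made-up sequence
    tm.reset()
    for sym in "ABCDE":
        tm.step(sym)

tm.reset()
for sym in "ABCD":                 # replay the real history without learning
    tm.step(sym, learn=False)

saved = tm.checkpoint()
tm.step("W", learn=False)                          # sample a hypothetical input
burst_after_sample = tm.step("E", learn=False)     # context is now "... D, W"
tm.restore(saved)
burst_after_restore = tm.step("E", learn=False)    # back in "D after C after B"

print("burst without restore:", burst_after_sample)
print("burst after restore:  ", burst_after_restore)
```

In a real HTM implementation the saved state would presumably have to cover at least the active, winner, and predictive cells of the temporal memory, which is a much larger snapshot than the tuple used here.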