Reinforcement Learning and HTM Algorithm

What HTM excels at is learning from temporal sequences, predicting the next inputs, and detecting when the predictability of the system changes. So it may not replace the fitness function for the different actions, but it could predict the effects of different potential actions, which the Q function could then evaluate to decide which course to take.


I am not a stats person - where do I find a nice newbie intro to using the Q function with neural networks?


This is a great topic! I would actually not start by replacing the value function. As mentioned in the other topic, maybe the most direct way to apply TM would be to model the transition and reward functions and use a model-based algorithm. If you can model the world, you can “imagine” possible transitions, replay them in your head, and learn offline from the imagined transitions instead of having to actively explore the environment.
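To make the “imagining” idea concrete, here is a minimal Dyna-Q-style sketch (my own toy version, not from this thread): observed transitions are stored as a learned model, and the agent then replays samples from that model to keep improving its Q estimates without touching the real environment again.

```python
import random

alpha, gamma = 0.1, 0.9
model = {}   # (state, action) -> (reward, next_state): the learned world model
Q = {}       # (state, action) -> estimated return

def backup(s, a, r, s2, actions):
    """One Q-learning backup from a transition, real or imagined."""
    best = max(Q.get((s2, b), 0.0) for b in actions)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (r + gamma * best - q)

def real_step(s, a, r, s2, actions):
    """Learn from a real transition and remember it in the model."""
    model[(s, a)] = (r, s2)
    backup(s, a, r, s2, actions)

def imagine(n, actions):
    """Offline learning: replay remembered transitions instead of acting."""
    for _ in range(n):
        (s, a), (r, s2) = random.choice(list(model.items()))
        backup(s, a, r, s2, actions)
```

After a single real experience, many rounds of `imagine()` push the Q estimate toward the true return, which is exactly the “replay it in your head” loop described above.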

Note that a Q function is not a map from state and action to a reward, but rather a map from state and action to a scalar value, composed of the expected reward at the next state plus all future rewards that can be obtained from the next state by following a given policy. This can be a little tricky to model.
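In standard RL notation (my own sketch, not from the post), that recursive definition is the Bellman equation for the action-value function:

```latex
% The value of taking action a in state s is the expected next reward
% plus the discounted value of continuing from the next state under \pi.
Q^{\pi}(s, a) = \mathbb{E}\!\left[\, R_{t+1} + \gamma \, Q^{\pi}(S_{t+1}, A_{t+1}) \;\middle|\; S_t = s,\ A_t = a \,\right]
```

The recursion (the Q on both sides) is what makes it trickier to model than a plain state-action-to-reward lookup.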


This one looks straightforward:

and if you are familiar with PyTorch and want to implement it:


I’m not sure a system could learn much by imagining. This would apply better to planning I think. Learning really needs to come from the real world, and planning would involve applying what is learned to new scenarios. Learning would then follow from how well reality matches up with the plan.


It really depends on how good your model of the world is. If you have a perfect model, then there is no difference between actually experiencing the environment or just imagining the experience.

I agree with you that learning from transitions sampled from an imperfect model is hard - it introduces a lot of uncertainty, which is why model-based algorithms never really took off. What we have nowadays in RL is a hybrid, where the world is not explicitly modeled but implicitly modeled by keeping a buffer of past experiences and learning by repeatedly sampling from them (experience replay).

But I still think it is an interesting research direction to improve model-based algorithms with better maps of the world, and see if it leads to better results compared to model-free + experience replay algorithms (such as DQN and its variants).
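For anyone following along, the “buffer of past experiences” is usually implemented as something like this minimal replay buffer (the class name and capacity are my own choices; DQN-style agents sample minibatches from it instead of learning only from the latest transition):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past (s, a, r, s', done) transitions - the 'implicit model'."""

    def __init__(self, capacity=10000):
        # deque with maxlen silently drops the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive experiences, which stabilizes learning.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

A model-based approach would generate novel transitions from a learned model; the buffer only replays transitions that actually happened, which is the trade-off being discussed here.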


To elaborate a little better on what I mean by planning and learning, just in case someone not familiar with RL is following this.

Say you are playing a game of chess against a machine. You have a large table that shows you exactly which action the computer will take given a specific board configuration and an action you take. So at any point you can check this table and know exactly how your opponent will respond. This is what I refer to as a “perfect model of the world”.

Now I give you a task - you have to beat the computer in less than 10 moves. Even though you have a perfect model of the world, you still have to learn how to do that, learn a policy that will tell you the best action to take given each state of the world. But learning this policy doesn’t require any interaction - all you have to do is replay the game thousands of times in your head, until you have a winning strategy, before you even touch the board.

So learning in the RL framework refers to learning a policy. Having a map of the world doesn’t imply you know how to achieve goals in this world - you still have to learn. The distinction we make in RL is that if you have this perfect model, you can solve the problem analytically or by solving a set of recursive Bellman equations - that we refer to as “planning”. But if you don’t have this model - say I randomly remove 50% of all rows in the table - you may have to actually experience the environment in order to learn your policy. Then we refer to it as “learning”. So in RL the distinction between planning and learning is only about how much you need to interact with the world to learn the policy. And the line between the two is blurred: you can both plan and learn at the same time, which is what model-based algorithms do.
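A tiny sketch of “planning” in this sense (the chain MDP is my own toy example, with a single action per state so the backups reduce to pure evaluation): because the transition and reward functions are fully known, the values fall out of repeated Bellman sweeps with zero interaction with the environment.

```python
gamma = 0.9
states = [0, 1, 2, 3]

def step(s):
    """The known model: next state and reward for the single action."""
    if s == 3:
        return 3, 0.0                      # absorbing end of the chain
    return s + 1, (1.0 if s + 1 == 3 else 0.0)

# Sweep Bellman backups until the values converge - planning, not acting.
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: step(s)[1] + gamma * V[step(s)[0]] for s in states}
```

With the model in hand, `V` converges to the exact discounted returns; delete half the `step` table and you are forced back into trial-and-error learning, which is the distinction drawn above.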

(all the points above are about RL, not HTM)


I doubt biological brains do this (at least it doesn’t feel anything like this when I play chess). It is more like predicting the opponent’s state of mind, and formulating a plan with a few backup options based on how I expect they will react. Definitely doesn’t seem to involve thousands of simulations (it is difficult to hold onto any more than a few dozen)


I agree 100% with you. Our working memory is way too small for this, especially during the game. But at idle times some planning might be happening at a much smaller scale (my own opinion).

I will give you a personal example. I enjoy rock climbing. Some days I will try to climb a route and fall several times at the same point. When that happens, I spend hours in the following days reliving that situation, thinking of different sequences of moves and what can possibly happen in each sequence. I play it so many times in my head that after some time I know what I have to do to solve the problem (I have an improved policy). And usually the next time I try, I can complete the route on the first or second attempt.

I personally think learning by interacting with an imagined version of the world, even with all the uncertainty from our imperfect model, could help with learning a policy in RL settings.


That is NOT how I play chess. I look at the legal moves open to me and sort through the ones that improve my position and reduce my opponent’s position. This is pure pattern recognition. Sometimes I work through a complicated exchange and see what material is exchanged and what the board will look like afterwards. Sequences of patterns. I train on chess tactics to recognize common situations and outcomes so I don’t have to work them out during the game. Sequences of patterns.

At my very best I can see about 10 moves ahead in each of the best candidate moves - having to work through each one serially. Mostly this all works out to pattern recognition based on prior exposure. Seeing a pattern is very different from running a simulation, and when I do work through movement sequences it is very hard. This is more like pattern matching chained together - not what I would think of as running large numbers of simulated moves.

At some point in the game I see what could be a winning end position and then work to match that pattern with the one in front of me. This feels like a different type of pattern manipulation, but it is still pattern matching.

In all this the moves I do examine have already been selected from pattern matched to what I know to be “good” moves.

Grandmasters do about the same things I do but they have a larger library of patterns.

I am rated about 1850, the very best players are rated ~2000 to 2500 on this scale. Rank newbies are 100 to 800 on this scale.


From what I understand of the brain - you have to pay some attention to your senses to experience them. The degree of attention is related to the perception of relevance to you. As these experiences are registered in the temporal lobe through to the hippocampus the memory is colored as good or bad by your limbic system - the reward is remembered right along with the experience. Since you don’t know if it will be good or bad while you are experiencing it - it does make sense that the experience is held in the buffer of the hippocampus to be combined with the reward coloring at the end of the experience.
This is the basis of forming value (salience) of perceived objects in future encounters.
This method is distinctly different from the RL employed by the DL camp. It is also the basis of judgment and common sense, properties absent from most DL implementations.


I think I can help support what you are saying by wording it as follows: after (sparse or dense) addressing of a unique experience/location, the memory is colored as good or bad by adjusting an associated confidence level according to whether an action worked for meeting current needs. Sensory conditions like shock or food are included in the data; this way punishment or reward (or something in between, as when a shock is too brief to stop a starving critter) is something recalled/re-experienced through what an experience looked and felt like to the senses.

To go from there to chess level planning would be stepwise premotor recall of competing possibilities, without having to physically move the pieces to conceptualize the result of each possible motor action that moves pieces to new locations on a 2D board.


No - I don’t think the memory is adjusted at all. The good or bad is stored as a property like a color.
This goes with the concept that an object is a collection of features. In this case - the feature is some signal communicated by the amygdala.
In the chess case, some patterns are weighted as good and I look for these as I am sorting through possible moves. There is no motor involvement, I “just see” good moves.


While writing I thought about getting into the amygdala-related detail, but I did not want to get wordy or make you wait while I got it just right.

The amygdala would be an emergency type fast response parallel memory subsystem where good or (especially) bad can be stored as a property.

What gets adjusted each time a given action is tried is in this case a non-amygdala confidence level that changes depending on how many times a given action worked and how many times it was tried.

Something never experienced before would have recall of nothing. In that case we know right away that we have to take at least a random guess, then save data representing whether it went well or badly after trying. After it works a few or more times in a row, the confidence level increases to the point where, if for some reason it did not work, our confidence would go down a little, but we would still try again before giving up on it.

All the rest of the associated memory data can remain exactly the same.
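The count-based confidence mechanism described above could be sketched like this (class and method names are mine, purely for illustration):

```python
class ActionConfidence:
    """Confidence in an action = successes / tries, per the description above."""

    def __init__(self):
        self.tries = 0
        self.successes = 0

    def record(self, worked):
        """Save whether the attempt went well or badly."""
        self.tries += 1
        if worked:
            self.successes += 1

    def confidence(self):
        # Never experienced -> recall of nothing; caller must take a guess.
        if self.tries == 0:
            return None
        return self.successes / self.tries
```

A long run of successes keeps confidence near maximum, and one failure only dents it slightly, matching the “try again before giving up” behavior.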


The amygdala is full of hardwired archetypes (genetically programmed) that are evaluated at a subconscious level. These programmed features are things like snakes, spiders, features of your own species, faces, secondary sexual characteristics, and the like. This is fed from a tap on the feed to the visual system and is mostly a one-way feed into the hippocampus.

This is the source of much of the good/bad signals.

More on the amygdala processing here:

Some of the other good/bad features come from other limbic system nodes such as the hypothalamus.

I think that there may be a mechanism for adding new features to the amygdala based on the utility this function would add but I am unaware of any mechanism to do so.


Yes. The way I see it: at the “subconscious level” another example would be where a hand is too close to a flame and the amygdala kicks in to quickly recoil it back with fear. The amygdala state is enough to make sure confidence in that action working gets lowered, since something went wrong from getting that close. We did not have to learn that sudden pain like that is bad, as though there were a choice over responding or not and time to think it over; the amygdala already knows it’s bad and immediately responds with a fast reflex in the other direction.

When something bad-looking jumps in our way we can get startled by fear, without deciding at the logical level whether we should be or not.

A one-way feed into the hippocampus is what the amygdala would need to have some control over what gets mapped in as the attractor to navigate towards. As a result “love is blind”, will “walk miles for” and all that.

While traveling: as long as everything that could go wrong didn’t, the confidence in an action working increases (or remains at max value).

In the ID Lab model confidence goes down from bashing into solid objects, shocks to feet and navigational errors like heading not matching the spatial map given angle and magnitude. For at least us an amygdala adds extra feeling to confidence changing experiences, but for neocortical modeling purposes a feedback bit representing an “Ouch!” or “Oh crap!” from any system that can sense one is enough to know when something just went wrong somewhere.

The HTM part would be a neocortical sensory system feeding a sparsely addressed parallel memory made of virtual neurons, instead of sensory feeding a densely addressed digital RAM made of silicon, which is not at all parallel but similarly works well enough for ~28 or fewer address bits to have been worth experimenting with over the years.


This is not a good example - fast reflex due to pain is handled in the spine.

I cannot say this enough - the nervous system is a very complicated machine with many interacting systems. Putting a given function in the right system(s?) has taken me many years and I still get it wrong from time to time.

Localizing a function can be very difficult, as different aspects of even the simplest thing are likely to be shared over several systems at several levels of processing. Stating that “x does y” is likely to miss important parts of how the system functions.

The amygdala is often described as fear processing, with little regard to the other things it can recognize and process (and tag in the hippocampus). From an evolutionary point of view, this makes good sense - being afraid of everything is a good call in an uncertain world; almost anything can kill you.

The amygdala also recognizes other things like social cues in faces and secondary sexual characteristics. As much as some people would like it to be different - you do judge people by how likely it seems that they have the right stuff to procreate with you. Your ability to read expressions is unbelievably sensitive and built in at a sub-cortical level. Your processing of social cues such as dominance and submission, likewise.

This is all baked in by evolution and it takes considerable social education to overcome it. I think that recognizing the innate workings of the brain and designing social structures to work with this rather than suppress these natural drives would result in better-adjusted people. And fewer rogue humans.


I recently came across a TED talk about the importance of sleep. One of the takeaways I got from it was that sleep seems to be a process by which we’re moving learned information from short-term into long-term memory, perhaps by dreaming(?). This could even be modified/amplified by introducing very small impulses to the outside of the skull.

Within the concept of “replay”, this at least makes sense; short-term memory is relaying what it learned to the other parts of the brain by replaying bits and pieces of it, which the magic of our brains is then able to organize and store into long-term knowledge.

Certainly encourages me to sleep more, and consider that perhaps we should have columns occasionally playing back out to each other, somehow.


These two posts are on data coding and the role of sleep, and the related transfer between the hippocampus and cortex.

And this one that talks a little bit about the actual transfer:
… brain waves are known to excite the hippocampus in dreaming - driving a recall in both the cortex and hippocampus at the same time. If the hippocampus has learned a response that makes it fire faster from new learning (using the “standard” spike timing learning), this could trigger learning in the related cortex. This excitation sweeps through again and again as long as there are significant phase differences in response between the two areas.


Thanks for catching that one.

My example needs to include the automatic spinal reflex motor action (the recoil), which in turn alerts the amygdala that something bad we need to fear just happened. There are two independent subsystems at work, instead of one.

I’m not used to having to include even spinal related detail.