Working Reinforcement Learning in HTM through Unsupervised Behavioral Learning

Hi all,

I’m very excited to share a little project I made in a night, even though I should be studying. I think I found a way to get HTM to generate actions and learn how to do stuff on its own. I stole the idea from Ogma and their Unsupervised Behavioral Learning (UBL) algorithm. It’s still cool and I want to share my progress.

How it works

It is well known that HTM is bad at supervised learning and at following any guidance from reward signals. So instead of learning which action gives the highest reward (called policy learning in ML), UBL learns the action associated with each state transition of the environment.

(Image: UBL diagram. Source: Ogma)

Let’s say the agent is a self-driving car. A normal RL agent learns that turning the steering wheel right turns the car right, which gets it closer to the destination, which in turn yields a reward. UBL instead learns that, given the current state and the end result of turning right, the action of turning the steering wheel right is what produces that result.

But why is this useful? During inference we can replace the “result” with the goal of the agent, so the agent generates actions that lead to the goal being fulfilled. (And we keep the actual result for training; the two can happen at the same time, so it’s still online learning.) In the original UBL, the goal would be generated by another UBL learner above, forming a hierarchy, but I skipped that for simplicity and to save time.
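To make that concrete, here is a minimal sketch of the train/act duality as I understand it. Everything in it (the `associator`, `encode`, and `decode_action` interfaces and their names) is my own placeholder, not the actual code; a sketch of the associator itself follows the next paragraph.

```python
def run_episode(env, associator, goal_sdr, encode, decode_action):
    """One online episode of the UBL-style loop described above (a sketch).

    Hypothetical interfaces, not names from the real source:
      associator(prev_sdr, result_sdr, learn) -> action SDR
      encode(observation)                     -> state SDR
      decode_action(action_sdr)               -> environment action
    """
    obs = env.reset()  # classic Gym-style API
    prev_sdr = encode(obs)
    total_reward = 0.0
    done = False
    while not done:
        # Act: substitute the goal for the "result" and do NOT learn on it.
        action_sdr = associator(prev_sdr, goal_sdr, learn=False)
        obs, reward, done, _ = env.step(decode_action(action_sdr))
        total_reward += reward
        new_sdr = encode(obs)
        # Learn: associate the transition that actually happened.
        associator(prev_sdr, new_sdr, learn=True)
        prev_sdr = new_sdr
    return total_reward
```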

The associator in the HTM version is a Spatial Pooler that takes the last state and the current state as input. It is hard-coded to run inference on the goal and to learn from the actual, new state. You can read the source code for the details.
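For reference, here is roughly how I would write such an associator with the htm.core Python bindings. This is only a hedged reconstruction from the description above, not the actual source; the library choice, class name, action read-out, and all parameter values are my own assumptions.

```python
import numpy as np
from htm.bindings.sdr import SDR
from htm.bindings.algorithms import SpatialPooler

class SPAssociator:
    """Rough reconstruction (not the real code): a Spatial Pooler whose input is
    the concatenation of two state SDRs (last state + new state while learning,
    current state + goal while acting), and whose active columns are used
    directly as the action SDR."""

    def __init__(self, state_size=1024, num_columns=512, density=0.1):
        self.state_size = state_size
        self.sp = SpatialPooler(
            inputDimensions=[2 * state_size],
            columnDimensions=[num_columns],
            globalInhibition=True,
            localAreaDensity=density,  # the "global density" varied between the two runs
            potentialPct=0.8,
            boostStrength=0.0,
        )
        self.active = SDR([num_columns])

    def __call__(self, prev_sdr, result_sdr, learn):
        # Concatenate the two state SDRs into one input for the pooler.
        joined = SDR([2 * self.state_size])
        joined.sparse = np.concatenate([
            np.asarray(prev_sdr.sparse),
            np.asarray(result_sdr.sparse) + self.state_size,
        ]).tolist()
        # Learn only on real transitions; only infer when the goal is substituted in.
        self.sp.compute(joined, learn, self.active)
        return self.active
```

With something like this, the `associator` in the loop sketched earlier would be an `SPAssociator()`, and a discrete action like CartPole’s left/right could be read out of the active columns, for example by overlap with two fixed action SDRs (again, my guess at the mapping, not necessarily what the code does).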

Result


(Two runs with different parameters. The first has a higher global density of 0.15, the second 0.1.)

UBL works far better than my last attempt, which tried to mimic the structure of the cortical column. It’s far, far from competitive with deep learning, and UBL still requires some hacks applied to HTM to function, but I think it is headed in the right direction.

With that said, UBL with HTM tends to collapse after a while; the plot shows a run that didn’t. Not every training attempt works, either: the agent can get caught in a loop of falling over and never recover.

(The initial spike is likely the result of CartPole-v1 having a trivial partial solution: acting randomly already earns some reward.)


Also I want to thank @markNZed and the HLC for inspiration and encouragement.


Does this perform better than OgmaNeo? They take a very similar approach to HTM, so I'm curious to know which implementation is better, Ogma's or HTM's?

I am having trouble setting up OgmaNeo to compare the two algorithms, but I assume not. And, like I said, the hierarchy isn’t working for now. Also, Ogma has decoders that decode their representation back into real values, whereas the HTM version interacts with the environment directly through SDRs.

Feel free to try to run the Ogma version and/or improve HTM. I’d like to know the result too.