I’m very excited to share a little project I made in a night, even though I should be studying. I think I found a way to get HTM to generate actions and learn how to do things on its own. I stole the idea from Ogma and their Unsupervised Behavioral Learning (UBL) algorithm. It’s still a cool result and I want to share my progress.
It is well known that HTM is bad at supervised learning and at following any guidance from reward signals. So instead of learning the action that gives the highest reward (called policy learning in ML), UBL learns the action associated with each state transition of the environment.
Let’s say the agent is a self-driving car. A policy learner would learn that turning the steering wheel right turns the car right, which gets it closer to the destination and therefore yields a reward. UBL instead learns that, given the current state and the end result of turning right, the action of turning the steering wheel right generates that effect.
But why is this useful? During inference we can replace the “result” input with the agent’s goal, so the agent generates actions that lead toward that goal being fulfilled. (For training we keep the actual result; the two can happen in the same step, so it’s still online learning.) In the original UBL, the goal would itself be generated by another UBL learner, forming a hierarchy, but I skipped that for simplicity and to save time.
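The inference/learning loop above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: `predict`, `learn`, and the dict-backed associator are made-up stand-ins for the real HTM components.

```python
class DummyAssociator:
    """Toy stand-in for the UBL associator: maps a (state, result)
    pair to the action that was taken between them."""

    def __init__(self):
        self.memory = {}  # (state, result) -> action

    def predict(self, state, goal):
        # Inference: "what action took me from `state` toward `goal`?"
        # The goal is substituted where the result would normally go.
        return self.memory.get((state, goal), 0)  # 0 = default action

    def learn(self, state, new_state, action):
        # Training: associate the transition that actually happened.
        self.memory[(state, new_state)] = action


def ubl_step(env_step, associator, state, goal):
    """One online step: inference and learning happen together."""
    action = associator.predict(state, goal)    # goal replaces "result"
    new_state = env_step(state, action)
    associator.learn(state, new_state, action)  # train on the real outcome
    return new_state
```

The key point is the asymmetry: the same second input slot holds the goal at inference time and the actual new state at learning time.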
The associator in the HTM version is a Spatial Pooler that takes the last state and the current state as input. It is hard-coded to run inference with the goal and to learn with the actual new state. You can read the source code for the details.
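To make the input wiring concrete, here is a toy sketch of that idea. The “Spatial Pooler” below is reduced to fixed random connections plus top-k winner selection, with no learning; all sizes and names are invented for illustration, so see the actual source code for the real thing.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_BITS, COLUMNS, ACTIVE = 64, 128, 8

# Each column connects randomly to the concatenated (last, second) input.
connections = rng.random((COLUMNS, 2 * STATE_BITS)) < 0.3

def pool(last_state, second_state):
    """Toy spatial-pooling step over two concatenated binary SDRs.

    `second_state` is the goal SDR at inference time, and the actual
    new state at learning time -- the asymmetry described above.
    """
    x = np.concatenate([last_state, second_state])
    overlaps = (connections & x).sum(axis=1)   # per-column overlap score
    winners = np.argsort(overlaps)[-ACTIVE:]   # top-k winning columns
    sdr = np.zeros(COLUMNS, dtype=bool)
    sdr[winners] = True
    return sdr

# Inference: pool the last state with the *goal* to get an action SDR.
last = rng.random(STATE_BITS) < 0.1
goal = rng.random(STATE_BITS) < 0.1
action_sdr = pool(last, goal)
```

A real Spatial Pooler would also adapt its permanences during the learning pass; the fixed `connections` matrix here only shows how the two inputs are combined.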
(Two runs with different parameters: the first has a higher global density of 0.15, the second 0.1.)
UBL works far better than my last attempt at mimicking the structure of the cortical column. It’s still far, far from competitive with deep learning, and UBL still requires some hacks applied to HTM to function, but I think it’s heading in the right direction.
With that said, UBL with HTM tends to collapse after a while; the plot is from a run that didn’t. Not every training attempt works, either: the agent can get caught in a loop of falling over and never recover.
(The initial spike is likely a result of CartPole-v1 having a trivial partial solution of just acting randomly.)
Also, I want to thank @markNZed and the HLC for the inspiration and encouragement.