Trying to make an HTM augmented/based RL algorithm

To test the generalization and data-efficiency properties of HTM, I decided to use OpenAI’s Gym library as a framework for experimenting with different ideas. OpenAI also maintains a collection of state-of-the-art RL algorithm implementations (baselines) that we can use to measure baseline performance.

RL is a framework consisting of an agent in an environment. The agent performs actions in the environment and receives observations and rewards. The agent’s goal is to maximize future expected reward.
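To make the loop concrete, here is a minimal sketch of an agent interacting with an environment. `EchoEnv` and `run_episode` are hypothetical names made up for illustration; this is a toy stand-in, not Gym's actual API, although the reset/step shape is loosely modeled on it.

```python
import random


class EchoEnv:
    """Hypothetical toy environment: the reward is 1.0 when the action
    equals the observation the agent was just shown, else 0.0."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0
        self.current = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.steps = 0
        self.current = self.rng.randint(0, 1)
        return self.current

    def step(self, action):
        # Score the action against the observation the agent just saw.
        reward = 1.0 if action == self.current else 0.0
        self.steps += 1
        self.current = self.rng.randint(0, 1)
        done = self.steps >= self.episode_length
        return self.current, reward, done


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
    return total_reward
```

A policy here is just a function from observation to action, e.g. `run_episode(EchoEnv(), lambda obs: obs)`.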

A known problem with current RL techniques is their lack of data efficiency, i.e., the agent has to explore its environment a lot before converging to a good policy. That’s because those algorithms sample actions randomly from a distribution that is updated after each reward so as to maximize the expected reward for an action. It takes a long time to gather enough statistics to converge on a good approximation of that expected reward.
Another problem is dealing with sparse rewards, where the agent is given an indication of progress only once in a while.

An idea for an HTM-based RL algorithm:
We give the HTM sequences of (observation, reward, nextAction),
where nextAction = argmax over a in {a0, a1, ..., aN} of predictReward(observation, a), with a0 = predictNextAction(observation) and a1, a2, ..., aN randomly sampled actions.

In other words, we tell the HTM model to learn sequences of the form (observation, reward, nextAction). At each step, we ask the model to predict the next action, evaluate this action along with N other actions, and pick the one for which the model predicts the highest reward.
Twist: the model can also predict the future reward K steps ahead.

The point is that the model will learn to accurately predict both reward and nextAction that maximizes the reward.
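The selection rule can be sketched as follows. This is only an illustration under assumptions: `predict_next_action`, `predict_reward` and `sample_action` are hypothetical callables standing in for the HTM model and the action space; none of them are nupic functions.

```python
def select_next_action(observation, predict_next_action, predict_reward,
                       sample_action, n_random=5):
    """Pick the candidate action with the highest predicted reward.

    Candidates are the model's own suggestion (a0) plus n_random
    randomly sampled actions (a1..aN). All three callables are
    hypothetical placeholders for the HTM model and the action space.
    """
    # a0: the action the model itself predicts for this observation.
    candidates = [predict_next_action(observation)]
    # a1..aN: random alternatives, so the agent keeps exploring.
    candidates += [sample_action() for _ in range(n_random)]
    # argmax over the candidates of the predicted reward.
    return max(candidates, key=lambda a: predict_reward(observation, a))
```

The chosen action, together with the observation and the reward actually received, would then be fed back to the model as the next element of the sequence.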

Technical problems:

  1. nupic does not allow predicting multiple parameters.

Another idea: Curiosity
Basic idea: actively pursue anomalies.
The point is to give the agent some reward for discovering new anomalies. Doing this could lead the agent to learn more efficiently by focusing it on exploring unknown states and actions instead of stumbling on them randomly. The anomaly signal will come from an HTM model accompanying a regular RL model.
In the beginning, anomalies will be very dense and then fall off exponentially, making the agent focus on exploring a lot at the start and on optimizing its goal later.

Potential Problems:

  • Can the agent even learn to produce anomalies? Anomalies are by definition unexpected states. Despite this, I believe it can be useful for a model to actively pursue states on the edges of its knowledge.
  • Exploration fixation: give too much reward per anomaly and the agent will become fixated on pursuing anomalies even after they become sparse, so it will not optimize its goal. Solution: measure the relative sparsity of recent anomalies and scale the reward accordingly, so that sparse anomalies receive a lower reward than dense ones.
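One possible way to scale the curiosity reward by recent anomaly density is sketched below. It assumes the HTM model yields an anomaly score in [0, 1] per step; `CuriosityBonus`, `window` and `scale` are hypothetical names, and multiplying by the mean recent score is just one choice of scaling, not a tested design.

```python
from collections import deque


class CuriosityBonus:
    """Intrinsic reward that shrinks as recent anomalies become sparse."""

    def __init__(self, window=100, scale=1.0):
        # Sliding window over the most recent anomaly scores.
        self.recent = deque(maxlen=window)
        self.scale = scale

    def __call__(self, anomaly_score):
        # anomaly_score is assumed to lie in [0, 1] (assumption).
        self.recent.append(anomaly_score)
        # Density: mean anomaly score over the window, also in [0, 1].
        density = sum(self.recent) / len(self.recent)
        # Dense anomalies keep most of the bonus; sparse ones are damped.
        return self.scale * density * anomaly_score
```

The agent would then train on something like `env_reward + bonus(anomaly_score)`, so the environment reward dominates once anomalies become rare.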

Technical problems:

  1. nupic runs on Python 2 while baselines requires Python 3.

BTW, would anybody like to help me research this?


Autonomous learning in Deep RL: Overcoming sparse rewards:


Damn, I’ve been out-invented.
As for HER, I already knew about it; in fact, they provide their algorithm as part of baselines.
The curiosity idea is exactly what I thought I had invented.



I tried roughly the same thing over the last month and bundled the complete setup and my experiences.
Check it out. It is now relatively easy to modify the agent, and it would be interesting to try out your ideas!

The first idea is approached differently there: the action is generated from the state you are in instead of being predicted. For curiosity, you could play around with the reward function or with how you use the TD error to update neurons (e.g. fewer updates for predicted neurons, so that newly learned, unpredicted stuff is updated more).

Kind regards


Another thing I wanted to test is the ability of an HTM agent to learn two different tasks and retain performance on both compared to current techniques.


I am really interested in mapping a “first-line reward system” to dedicated SDR bits. A second reward system would also have its own dedicated bits: a reward for matching temporal patterns in 3D space and in N-dimensional space.
An example of a first-line reward: when a bot is in a low-battery state, it selects a sequence that will activate a “gain in energy” reward bit. That is, provided the bot stores pattern loops built of SDR matrices, which can branch into other SDR loops at a decision point by way of motors.

How far along is your research, please?
