As part of trying to test the generalization and data-efficiency properties of HTM, I decided to use OpenAI's Gym library as a framework for experimenting with different ideas. OpenAI also maintains a collection of state-of-the-art RL algorithms (baselines) that we can use to measure baseline performance.
RL is a framework consisting of an agent in an environment. The agent performs actions in the environment and receives observations and rewards. The agent’s goal is to maximize future expected reward.
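This agent–environment loop maps directly onto Gym's reset/step interface. A minimal sketch with a random policy, using a toy stand-in environment (`ToyEnv` is hypothetical, just mimicking Gym's API so the loop is self-contained):

```python
import random

class ToyEnv:
    """Stand-in with Gym's reset/step interface: the episode ends
    after 10 steps and each step yields reward 1."""
    def reset(self):
        self.t = 0
        return 0.0  # initial observation

    def step(self, action):
        self.t += 1
        done = self.t >= 10
        return 0.0, 1.0, done, {}  # observation, reward, done, info

env = ToyEnv()
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.choice([0, 1])  # random policy, for illustration
    obs, reward, done, info = env.step(action)
    total_reward += reward
# total_reward is 10.0 here (10 steps of reward 1)
```

A real Gym environment (e.g. `gym.make("CartPole-v1")`) drops in for `ToyEnv` unchanged.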
A known problem with current RL techniques is their lack of data efficiency: the agent has to explore its environment extensively before converging to a good policy. That's because these algorithms sample actions randomly from a distribution that is updated after each received reward so as to maximize the expected reward of an action, and it takes a long time to gather enough statistics to converge on a good approximation of that expected reward.
Another problem is dealing with sparse rewards, where the agent is given an indication of progress only once in a while.
An idea for an HTM-based RL algorithm:
We give the HTM sequences of (observation, reward, nextAction), where

nextAction = argmax over a in {a0, a1, a2, ..., aN} of predictReward(observation, a),

a0 = predictNextAction(observation), and a1, a2, ..., aN are randomly sampled actions.

In other words, we tell the HTM model to learn sequences of the form (observation, reward, nextAction). At each step, we ask the model to predict the next action, then evaluate that action along with N other randomly sampled actions and pick the one the model predicts has the highest reward.
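The selection step above can be sketched as follows. `predict_next_action` and `predict_reward` are hypothetical interfaces standing in for the HTM model's predictions (nupic does not expose such calls directly); `StubModel` is a toy model included only so the sketch runs:

```python
import random

class StubModel:
    """Toy stand-in for an HTM model; predict_next_action and
    predict_reward are hypothetical interfaces, not real nupic calls."""
    def predict_next_action(self, observation):
        return 0
    def predict_reward(self, observation, action):
        return action  # toy rule: predicted reward equals the action's value

def select_action(model, observation, action_space, n_samples=5):
    # candidate a0: the model's own predicted next action
    candidates = [model.predict_next_action(observation)]
    # candidates a1..aN: randomly sampled actions
    candidates += [random.choice(action_space) for _ in range(n_samples)]
    # pick the candidate with the highest predicted reward
    return max(candidates, key=lambda a: model.predict_reward(observation, a))

best = select_action(StubModel(), observation=None, action_space=[0, 1, 2, 3])
```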
Twist: the model can also predict the future reward K steps ahead.
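One possible training target for this twist is the discounted sum of the next K rewards (the discount factor `gamma` is my assumption, not part of the original idea):

```python
def k_step_return(rewards, k, gamma=0.99):
    """Discounted sum of the next k rewards: a candidate target for
    predicting reward K steps ahead."""
    return sum(gamma ** i * r for i, r in enumerate(rewards[:k]))

# with gamma = 1.0 this is just the plain sum of the next k rewards
three_step = k_step_return([1.0, 1.0, 1.0, 1.0, 1.0], k=3, gamma=1.0)
```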
The point is that the model will learn to accurately predict both the reward and the nextAction that maximizes it.
Technical problems: nupic does not allow predicting multiple parameters.
Another idea: Curiosity
Basic idea: actively pursue anomalies.
The point is to give the agent some reward for discovering new anomalies. This could make the agent learn more efficiently by focusing it on exploring unknown states and actions instead of stumbling on them randomly. The anomaly detection would be powered by an HTM model running alongside a regular RL model.
In the beginning, anomalies will be very dense and then fall off exponentially, making the agent focus on exploration at the start and on optimizing its goal later.
Potential Problems:
 Can the agent even learn to produce anomalies? Anomalies are, by definition, unexpected states. Despite this, I believe it can be useful for a model to actively pursue states at the edges of its knowledge.
 Exploration fixation: if each anomaly yields too much reward, the agent will become fixated on pursuing anomalies even after they become sparse, and will never optimize its actual goal. A possible solution: measure the relative sparsity of recent anomalies and scale the reward accordingly, so that sparse anomalies receive a lower reward than dense ones.
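A minimal sketch of this scaling, assuming the HTM model exposes a per-step anomaly score in [0, 1] (the window size and threshold below are my assumptions):

```python
from collections import deque

class CuriosityBonus:
    """Intrinsic reward proportional to the HTM anomaly score, scaled
    by the density of recent anomalies so the bonus fades as anomalies
    become sparse (mitigating exploration fixation)."""
    def __init__(self, window=100, threshold=0.5, scale=1.0):
        self.recent = deque(maxlen=window)  # 1.0 if a step was anomalous
        self.threshold = threshold
        self.scale = scale

    def bonus(self, anomaly_score):
        self.recent.append(1.0 if anomaly_score > self.threshold else 0.0)
        density = sum(self.recent) / len(self.recent)
        # dense anomalies -> full bonus; sparse anomalies -> reduced bonus
        return self.scale * anomaly_score * density

cb = CuriosityBonus(window=4)
first = cb.bonus(1.0)   # anomalies still dense: full bonus
second = cb.bonus(0.0)  # non-anomalous step: no bonus
```

The shaped reward for the RL model would then be the environment reward plus this bonus.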
Technical problems: nupic is Python 2 while baselines is Python 3.