Be sure to check out @sunguralikaan's work here: HTM Based Autonomous Agent
I have also spent a lot of time studying how to apply RL to HTM. I essentially started with this high-level view:
Given a state, predict the expected future reward for each action that can be performed from that state, and choose the action predicted to give the maximum reward. Then use the actual reward to improve the predictions for the next time that state is encountered.
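To make that high-level view concrete, here is a minimal tabular sketch. It is not HTM-specific, and it swaps in a one-step Q-learning-style update as one common way to "use the actual reward to improve future predictions". The class name, the `learning_rate` and `discount` parameters, and the toy states are all assumptions for illustration. Selection is purely greedy, so there is no exploration.

```python
from collections import defaultdict


class GreedyValueAgent:
    """Tabular agent: predicts future reward per (state, action) and acts greedily."""

    def __init__(self, actions, learning_rate=0.1, discount=0.9):
        self.actions = list(actions)
        self.learning_rate = learning_rate  # hypothetical step size
        self.discount = discount            # hypothetical discount on future reward
        self.q = defaultdict(float)         # (state, action) -> predicted future reward

    def choose(self, state):
        # Pick the action with the highest predicted reward from this state.
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state=None):
        # Best predicted reward from the next state (0 if the episode ended).
        future = 0.0
        if next_state is not None:
            future = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.discount * future
        # Move the old prediction a small step toward what was actually observed.
        key = (state, action)
        self.q[key] += self.learning_rate * (target - self.q[key])


agent = GreedyValueAgent(actions=["left", "right"])
for _ in range(20):
    agent.update("start", "right", reward=1.0)  # going right ends with reward 1
    agent.update("start", "left", reward=0.0)   # going left ends with reward 0
print(agent.choose("start"))
```

After a few updates the prediction for "right" rises above the one for "left", so the greedy choice from "start" becomes "right".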
From this, you can see one obvious place where HTM could be applied. The temporal memory (TM) algorithm captures the semantics of temporal patterns and recalls them when they are encountered again. This could most easily be connected with a backward-view TD(λ) algorithm (https://youtu.be/PnHCvfgC_ZA), where each bit in the output of the TM layer carries some weight toward the value estimate of the current state. Learning in a setup like this could also happen online: as the states (sequences) are learned and re-encountered, the model could simultaneously be predicting and running TD(λ) to improve its predictions over time.
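Here is a rough sketch of what that connection might look like: backward-view TD(λ) with accumulating eligibility traces and a linear value function, where the features are the indices of the active bits. It assumes the TM output can be handed over as a set of active bit indices. The class, parameter values, and toy bit sets are hypothetical, and no real HTM library is used.

```python
class TDLambdaOverBits:
    """Backward-view TD(lambda) with one weight and one trace per active bit."""

    def __init__(self, alpha=0.05, gamma=0.9, lam=0.8):
        self.alpha = alpha   # step size (assumed value)
        self.gamma = gamma   # discount factor (assumed value)
        self.lam = lam       # trace decay, the lambda in TD(lambda)
        self.weights = {}    # bit index -> weight
        self.traces = {}     # bit index -> eligibility

    def value(self, active_bits):
        # State value is the sum of the weights of the currently active bits.
        return sum(self.weights.get(b, 0.0) for b in active_bits)

    def reset_traces(self):
        self.traces.clear()  # call at the start of each episode

    def step(self, active_bits, reward, next_active_bits=None):
        next_value = 0.0 if next_active_bits is None else self.value(next_active_bits)
        delta = reward + self.gamma * next_value - self.value(active_bits)
        # Decay every existing trace, then bump the bits that are active now.
        for b in self.traces:
            self.traces[b] *= self.gamma * self.lam
        for b in active_bits:
            self.traces[b] = self.traces.get(b, 0.0) + 1.0
        # Credit the TD error to every bit in proportion to its trace.
        for b, e in self.traces.items():
            self.weights[b] = self.weights.get(b, 0.0) + self.alpha * delta * e
        return delta


# Toy sequence A -> B -> C with a reward of 1 only at the end.
A, B, C = {0, 1, 2}, {3, 4, 5}, {6, 7, 8}
learner = TDLambdaOverBits()
for _ in range(50):
    learner.reset_traces()
    learner.step(A, 0.0, B)
    learner.step(B, 0.0, C)
    learner.step(C, 1.0, None)
print(learner.value(A), learner.value(B), learner.value(C))
```

Even after a single pass, the bits that were active in A pick up some value, because their traces are still non-zero when the reward arrives at C. That is the online, backward-looking credit assignment described above.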
I personally believe one could take this even further than just marrying the TM algorithm with an RL algorithm. Out of the box, the TM algorithm already does a lot of the legwork required for backward-view RL by connecting a series of states into an “eligibility trace”. What is missing is a way to pool sequences along with their reinforcement “value”.
What I am thinking about currently is combining multiple sources of information into common stable pooled representations. The activity in a TM layer tracking sensory input would be one source, a TM tracking motor actions would be another, and emotions would be another. Combined, these would produce representations that include not only sequences of states and actions, but also their emotional context. Emotional context could then be weighed against the current needs of the system and used for action selection. The pooled representations, once chosen, could then be used to temporally unfold the motor sequence, comparing predicted sensory input and predicted emotional context with reality and updating the model online.
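Purely as an illustration of the "weighed against the current needs" step, here is a toy sketch. It assumes each pooled representation carries an emotional context expressed as a few named numbers, and that current needs are weights over those same names. The scoring rule (a weighted sum), the function name, and all the example names and values are made up for the example and are not an established part of HTM.

```python
def select_pooled_representation(candidates, needs):
    """Return the candidate whose emotional context best matches current needs."""

    def score(candidate):
        context = candidate["emotional_context"]
        # Weigh each emotional dimension by how much the system needs it right now.
        return sum(needs.get(name, 0.0) * amount for name, amount in context.items())

    return max(candidates, key=score)


# Hypothetical pooled representations with their emotional context.
candidates = [
    {"name": "walk_to_food", "emotional_context": {"hunger_relief": 0.9, "safety": -0.2}},
    {"name": "stay_hidden", "emotional_context": {"hunger_relief": 0.0, "safety": 0.8}},
]

hungry = {"hunger_relief": 1.0, "safety": 0.3}
threatened = {"hunger_relief": 0.2, "safety": 1.0}

print(select_pooled_representation(candidates, hungry)["name"])
print(select_pooled_representation(candidates, threatened)["name"])
```

The same candidates give a different selection when the needs change, which is the behaviour the pooled emotional context is meant to enable. Unfolding the chosen representation into a motor sequence and comparing predictions with reality is left out of this sketch.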