The most naive idea, which I outlined in Trying to make an HTM augmented/based RL algorithm, is to have the model predict both the next state's value and the next action, then choose an action by sampling N random actions in addition to the predicted one and picking whichever has the best predicted value. Supposedly the model would converge on the best policy.
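As a rough sketch of that selection step (the model interface here is hypothetical, standing in for whatever prediction calls the HTM model would actually expose):

```python
import random

N_SAMPLES = 10  # how many random actions to evaluate alongside the prediction

def select_action(model, state, action_space):
    """Pick the best-valued action among the model's predicted action
    and N randomly sampled alternatives.

    `model.predict_action` and `model.predict_value` are assumed names
    for the model's next-action and next-state-value predictions.
    """
    candidates = [model.predict_action(state)]
    candidates += [random.choice(action_space) for _ in range(N_SAMPLES)]
    # Keep whichever candidate the model expects to lead to the best value.
    return max(candidates, key=lambda a: model.predict_value(state, a))
```

The random samples are what keep this from collapsing early: if the model's predicted action is bad, a sampled action can still win on predicted value and pull the policy elsewhere.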
I have not yet fully read @sunguralikaan's work on building an HTM-TD(lambda) hybrid, which seems like a more sensible approach.