This is what I meant by HTM-RL vs X-RL in my first post. I do not know whether that is the best way, but I think the architecture I use qualifies. As I said above, RL and HTM do not have to be merged.
From a biological perspective, our brain does not seem to merge them. The reward circuitry is mostly composed of subcortical structures; the model and the reward circuitry appear to be separate in our brain. The reward circuitry works reciprocally with the neocortex, but still through separate structures.
From a research perspective, if RL and HTM were somehow merged into some novel structure, that would pose problems for incorporating future advancements into it. For example, I already have trouble incorporating grid cells and possibly a pooling layer into the current HTM-TD(lambda). A novel, merged structure would have an even harder time keeping up with HTM advancements. This may or may not be a concern.
From an academic standpoint, proposing a merged or novel solution is more difficult because HTM itself is not a well-known or mature method in the eyes of researchers (this may be a local problem for me). So offering a merged structure built on the foundation of HTM is even harder to present. HTM-RL vs X-RL may be more doable; at least there is HTM as the academic baseline this way.
Common sense directs me towards comparing HTM-RL to X-RL, where X is some other model. However, there are many different ways of applying RL to HTM. To add RL to HTM, you can:

- score activations;
- score minicolumns, neurons, or even dendrites;
- force activations;
- search among activations;
- feed in possible actions and score their results;
- pool activations;
- try to mimic basal ganglia pathways;
- alter permanence parameters with reward or error signals.

These are just some of the approaches I have tried or considered, and the design can change wildly depending on your biological concerns. After all, what is the state of an HTM model? Predictive cells, active cells, active minicolumns, active dendrites or even synapses, or some combination of these? And this is only considering a single HTM layer.
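To make the "score activations" option concrete, here is a minimal sketch of linear TD(lambda) value estimation over an HTM-style state, where the state is just the set of active cell indices and each cell carries one learnable weight with an eligibility trace. All names and parameters here are hypothetical illustrations, not the actual HTM-TD(lambda) implementation mentioned above.

```python
# Sketch: linear TD(lambda) over a sparse set of active cells.
# The SDR "state" is a set of cell indices; value = sum of per-cell weights.
# Hypothetical names/parameters, not the author's implementation.
from collections import defaultdict

class SDRValueTD:
    def __init__(self, alpha=0.1, gamma=0.9, lam=0.8):
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.w = defaultdict(float)       # per-cell weight
        self.trace = defaultdict(float)   # per-cell eligibility trace

    def value(self, active_cells):
        # Linear value function over the sparse binary state.
        return sum(self.w[c] for c in active_cells)

    def update(self, prev_cells, reward, next_cells):
        # TD error between successive SDR states.
        delta = reward + self.gamma * self.value(next_cells) - self.value(prev_cells)
        # Decay all traces, then bump cells active in the previous state.
        for c in list(self.trace):
            self.trace[c] *= self.gamma * self.lam
        for c in prev_cells:
            self.trace[c] += 1.0
        # Credit recently active cells proportionally to their trace.
        for c, e in self.trace.items():
            self.w[c] += self.alpha * delta * e
        return delta
```

The same scheme transfers to scoring minicolumns or dendrites instead of cells; only what you treat as the feature index changes.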
For the comparison, I guess the simplest option is a single HTM layer combined with, for example, Q-learning or TD, where biological concerns are ignored for the sake of implementing an accessible RL method that is closer to whatever you are comparing against. I have biological constraints as well, which complicates the whole process, but I am not sure all of us should impose them, considering we do not even have a baseline HTM-RL comparison.
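A baseline along those lines could be as simple as this: tabular Q-learning where the state key is the frozen set of active minicolumns coming out of a single HTM layer. This is a hedged sketch under that assumption (the HTM layer itself is outside the snippet), not a reference implementation.

```python
# Sketch: tabular Q-learning with an HTM layer's active-column SDR as the state.
# The HTM layer producing `active_columns` is assumed to exist elsewhere.
import random
from collections import defaultdict

class SDRQLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # (state_key, action) -> value

    @staticmethod
    def key(active_columns):
        # Hashable state key: identical activations map to the same state.
        return frozenset(active_columns)

    def act(self, active_columns):
        s = self.key(active_columns)
        if random.random() < self.epsilon:
            return random.choice(self.actions)  # explore
        return max(self.actions, key=lambda a: self.q[(s, a)])  # exploit

    def update(self, prev_columns, action, reward, next_columns):
        s, s2 = self.key(prev_columns), self.key(next_columns)
        best_next = max(self.q[(s2, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(s, action)] += self.alpha * (td_target - self.q[(s, action)])
```

The obvious caveat is that exact SDR matching treats near-identical activations as distinct states; a real comparison would need overlap-based generalization or function approximation, but as a baseline this keeps the RL side trivially comparable to textbook Q-learning.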
TL;DR: model-based RL seems to be one of the better ways forward.