TBT does not have RL yet.
My current idea is:
TM-like structure.
Dendrite Input : State
FF : ??
Prediction: use the Action SDR (the predicted SDR bits == the actual Action SDR bits) and a REWARD to figure out which dendrites from t-1 to boost
The update rule has to keep the last step's Active neurons and dendrites /in a buffer/, so that you can do the Q-value calculation like you do in RL.
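A minimal sketch of that buffered update rule, assuming a one-step TD-style Q-value over dendrites. The class name, the per-dendrite Q array, and the lr/gamma parameters are all my own illustrative assumptions, not an existing TBT/HTM API:

```python
import numpy as np

class RewardModulatedBuffer:
    """Keep last step's active dendrites in a buffer and update
    their Q-like values on the next step, as in one-step RL."""

    def __init__(self, n_dendrites, lr=0.1, gamma=0.9):
        self.q = np.zeros(n_dendrites)  # per-dendrite Q-like value
        self.lr = lr                    # learning rate
        self.gamma = gamma              # discount factor
        self.prev_active = None         # the /buffer/: dendrites active at t-1

    def step(self, active_dendrites, reward):
        """Update the t-1 dendrites, then buffer the current ones."""
        if self.prev_active is not None:
            # one-step TD target: reward plus discounted best value
            # among the currently active dendrites
            target = reward + self.gamma * self.q[active_dendrites].max(initial=0.0)
            td_error = target - self.q[self.prev_active]
            self.q[self.prev_active] += self.lr * td_error
        # buffer this step's active set for the next update
        self.prev_active = np.array(active_dendrites, dtype=int)
```

Usage: `buf = RewardModulatedBuffer(1024)`, then call `buf.step(active, reward)` once per timestep; the reward arriving at step t credits the dendrites buffered at t-1.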
Sort of like an Ensemble RL that predicts the active bits of an SDR, which should match the Action-SDR.
Then pass this Action to the CC. The Sense is the State.
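The loop above can be sketched end to end: State SDR in on the dendrites, predicted Action SDR out (passed to the CC), and reward used to boost the t-1 dendrites whose predicted bits matched the actual Action-SDR bits. The one-dendrite-per-action-bit weight matrix, the median threshold, and the boost rule are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_state, n_action = 64, 32

# one "dendrite" per action bit: a weight vector over State bits
weights = rng.random((n_action, n_state)) * 0.1
prev = None  # buffer: (state, predicted action) from t-1

def agent_step(state_sdr, reward, actual_action_prev=None, lr=0.05):
    """One sense-predict-act step of the sketched TM-like RL layer."""
    global prev
    if prev is not None and actual_action_prev is not None:
        prev_state, prev_pred = prev
        # boost dendrites from t-1 whose predicted bit matched the
        # actual Action-SDR bit, scaled by the reward
        matched = (prev_pred == actual_action_prev).astype(float)
        weights[:] += lr * reward * matched[:, None] * prev_state[None, :]
    # Prediction: action bits whose dendrite overlap with State is high
    overlap = weights @ state_sdr
    predicted = (overlap > np.median(overlap)).astype(float)
    prev = (state_sdr, predicted)  # buffer for the next update
    return predicted  # this Action SDR is what gets passed to the CC
```

This deliberately leaves the GOAL out: reward is just a scalar fed in from outside, which is exactly the open question below.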
So what is left is how you represent and apply a GOAL.