I just corrected that paragraph. I only now realized it was stating the opposite; sorry if it caused confusion.
So you were actually talking about Q-learning. I thought you meant temporal difference (TD) reinforcement learning algorithms in general. Q-learning (sometimes called off-policy TD) is a specific variant of temporal difference learning that came out later. This is a nice, easy-to-read overview of the TD RL umbrella.
In the context of HTM, temporal difference learning with lambda, TD(lambda), and specifically its backward view, is more in line with biology (the basal ganglia's use of dopamine as a learning signal).
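To make the backward view concrete, here is a minimal tabular sketch using accumulating eligibility traces. The function name `td_lambda_step`, the dict-based tables `V` and `E`, and the parameter values are all illustrative assumptions, not anything taken from HTM or from a specific library. The point is that a single scalar error is computed once per step and then applied to every recently visited state in proportion to its trace, which is loosely how a broadcast learning signal would behave.

```python
def td_lambda_step(V, E, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One backward-view TD(lambda) update on tabular state values.

    V: dict mapping state -> estimated value (missing states count as 0.0)
    E: dict mapping state -> eligibility trace
    All names and defaults here are hypothetical, for illustration only.
    """
    # Single scalar TD error for this transition.
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)

    # Accumulating trace: mark the state we just left as eligible.
    E[s] = E.get(s, 0.0) + 1.0

    # Apply the same error to every eligible state, then decay its trace.
    for state in list(E):
        V[state] = V.get(state, 0.0) + alpha * delta * E[state]
        E[state] *= gamma * lam

    return delta


# Usage: a two-step episode A -> B -> C with reward only on the second step.
V, E = {}, {}
td_lambda_step(V, E, "A", 0.0, "B")
td_lambda_step(V, E, "B", 1.0, "C")
# Both A and B now carry some value, A less than B because its trace decayed.
```

With `lam=0.0` the trace of every state other than the current one vanishes after one step, so this reduces to plain one-step TD.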
In Q-learning, you calculate Q(state, action) values, which create a map the agent can use to pick the optimal transition from every state in the long term. In temporal difference learning, you calculate the value of the state itself, not of the action a taken in state s. This indirectly takes into account what the agent statistically does in the following steps. So a state's value is actually tied to the agent's policy, which is what happens in biology.
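The difference between the two updates can be sketched side by side. This is a minimal tabular illustration; the function names, the `defaultdict` tables, and the state and action labels are assumptions made up for the example.

```python
from collections import defaultdict


def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """State-value TD(0): the target uses V(s_next), so the estimate
    reflects whatever the agent actually tends to do afterwards."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-learning: the target uses the best action value in s_next,
    regardless of which action the agent will really take there."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    delta = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta


# Usage with hypothetical states "A", "B" and actions "left", "right".
V = defaultdict(float)
V["B"] = 1.0
td0_update(V, "A", 0.5, "B")

Q = defaultdict(float)
Q[("B", "left")] = 0.2
Q[("B", "right")] = 1.0
q_learning_update(Q, "A", "right", 0.5, "B", ["left", "right"])
```

Note that `td0_update` never looks at actions at all: the policy is baked into which transitions it happens to see, whereas `q_learning_update` needs the explicit `max` over options.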
Q-learning implies that the agent has a decision-making mechanism that weighs all the options and "chooses" one. That is not the case in biology at a low level. Learning in biology is not about optimizing the reward path; it is about reducing the difference between the expected value at timestep t and the actual value at t+1. That difference is the error/learning signal: dopamine. The error is already calculated: $R_{t+1} + \gamma G_{t+1} - Q_t$. [$G_{t+1}$ would be $V_{t+1}$ and $Q_t$ would be $V_t$ in TD(lambda).]
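For reference, the two error terms can be written out in the usual textbook notation. The symbols $S_t$ and $A_t$ (state and action at time t) are not used in the post above and are introduced here only for clarity. In the state-value case the error is

$$
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
$$

while in Q-learning the bootstrapped term is usually the maximum over the next actions:

$$
\delta_t = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)
$$

The first form needs no comparison between options, which is the property that matters for the biological argument.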
You can still treat individual cells or columns of HTM as states and calculate state values for the individual cells/columns that are active at time t+1. That's mostly what the ANN guys are doing.
We are on the same page on this, and to my understanding that is what biology does. In computational models of the basal ganglia (which may be modeled as a modified HTM with state [neuron] values), the striatum specifically is believed to function similarly to temporal difference learning through its D1 and D2 dopamine receptors, in conjunction with the cortex.