I believe RL could work in the brain in multiple simultaneous ways and at multiple levels, and I would like to propose another model that I haven’t seen considered here.
First, some assumptions about the way I understand reinforcement learning. They might be either obvious or plain wrong, but stating them up front should minimize further misunderstanding.
Basically, we have evolved such that our actions lead to favorable outcomes. The part of the reward system that decides whether or not a situation is favorable has been shaped directly by natural selection, so it’s practically static (it changes at a much slower rate than the cortex). This part probably resides in the lower brain and has strong ties to the body organs necessary for survival (and other details unimportant here), but the important assumption is that it sits “outside” the cortex. The cortex itself is “dumb” - it doesn’t know what is good or bad - and the reward system has evolved to steer the cortex toward generating more favorable outcomes. Since a favorable outcome is the result of a limited set of actions, the system has to influence the cortex in a way that maximizes the chances of those actions taking place again.
The issue is that the reward only arrives once the action is completed, while the action itself is often a complex and lengthy process that started in the past. Since reinforcement learning is a trial-and-error process, we can assume there is no feedback until the very end. The reward system therefore faces the challenge of influencing the past from the present.
What if the permanence of synapses is modulated directly by the reward system through the release of reward chemicals?
Before explaining the details, let’s review the current synapse model of HTM.
In temporal memory, the permanence values of a segment’s synapses change when that dendrite segment is activated by the activity of other cells. If a synapse on the active segment connects to an already active cell, its permanence is temporarily increased; otherwise it is temporarily decreased. If the cell that owns this dendrite segment (and which is now in a predictive state) becomes active in the next step, the synapses correctly predicted the activity, and so their temporary permanence changes become… permanent. If the prediction was incorrect, the changes are discarded. Basically, because the permanence changes are kept temporary for only one time step, cells can “look back” only one step into the past when selecting the best predictive pattern.
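To make sure we’re reading the mechanism the same way, here is a toy sketch of that one-step update. The function names and the constants `PERM_INC`/`PERM_DEC` are my own illustrative choices, not Numenta’s actual implementation:

```python
PERM_INC = 0.05   # temporary increase for synapses to active cells (illustrative)
PERM_DEC = 0.02   # temporary decrease for synapses to inactive cells (illustrative)

def update_segment(permanences, presynaptic_active):
    """Compute the temporary permanence deltas for an active dendrite segment."""
    return [PERM_INC if active else -PERM_DEC
            for active in presynaptic_active]

def commit_or_discard(permanences, deltas, prediction_correct):
    """Make the temporary deltas permanent only if the cell's prediction
    was verified in the next time step; otherwise discard them."""
    if not prediction_correct:
        return list(permanences)  # temporary changes are wiped
    # commit, clamping permanence to [0, 1]
    return [min(1.0, max(0.0, p + d)) for p, d in zip(permanences, deltas)]
```

The key point for what follows is that the temporary deltas live for exactly one step before being committed or discarded.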
What I propose is that, in addition to the current mechanism, the permanence changes of synapses also have a temporary component that spans many time steps - perhaps hundreds or thousands, or more. These changes become permanent in the presence of a reward signal, which could be a chemical such as dopamine coming into contact with the synapses. If no reward signal arrives within a certain time, the change components fade away and the learning is partially wiped. This allows the reward system to strengthen the pathways corresponding to favorable actions that happened within a certain window in the past, depending on how slowly the permanence changes fade.
The faster the changes fade, the less influence the reward signal has on actions in the distant past, and the more relevant recent actions become.
Again, I am only talking about a temporary change component, not the whole temporary change: learning would still happen the same way it currently does in HTM, but a reward signal could favor some past changes slightly more than others, depending on their present outcome.
Now, it doesn’t have to be just dopamine. In fact, we could have punishing signals that use the same mechanism in the opposite way, i.e. a signal that wipes out the temporary synapse changes associated with particularly unfavorable outcomes.
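The whole proposal, punishing signal included, resembles an eligibility trace. Here is a hedged toy sketch of it; the class, the method names, and the `DECAY` constant are all my own assumptions, not anything from HTM or neuroscience:

```python
DECAY = 0.99  # per-step fade of the trace; smaller = favors more recent actions

class Synapse:
    """Toy model: a committed permanence plus a slowly fading, reward-eligible
    'trace' of recent permanence changes (purely illustrative)."""

    def __init__(self, permanence):
        self.permanence = permanence  # committed long-term value
        self.trace = 0.0              # pending multi-step temporary component

    def step(self, delta=0.0):
        """One time step: fade the existing trace, then add any new change."""
        self.trace = self.trace * DECAY + delta

    def reward(self, strength=1.0):
        """A reward chemical (e.g. dopamine) makes the pending changes permanent."""
        self.permanence += strength * self.trace
        self.trace = 0.0

    def punish(self, strength=1.0):
        """A punishing signal wipes pending changes so they never commit."""
        self.trace *= 1.0 - strength
```

With `DECAY = 0.99`, a change made 100 steps before the reward is committed with weight 0.99**100 ≈ 0.37; a smaller `DECAY` shrinks that window, which is exactly the fade-rate trade-off described above.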
I have little knowledge of the biochemistry of neurotransmitters and such, but I believe this is a plausible mechanism which could be tested.