Reinforcement Learning at the Synapse Level


I believe RL may work in the brain in several ways at once, and at multiple levels, so I would like to propose yet another model, one that I haven’t seen considered here.

First, some assumptions on the way I understand reinforcement learning which might be either obvious or plain wrong, but they are necessary to minimize any further misunderstanding.
Basically, we have evolved such that our actions lead to favorable outcomes. The part of the reward system that decides whether or not a situation is favorable has been directly shaped by natural selection and hence it’s practically static (changing at a much slower rate than the cortex). This part probably resides in the lower brain and has strong ties with body organs necessary for survival (and other unimportant details) but the important assumption is that it’s something “outside” of the cortex. The cortex itself is “dumb” - doesn’t know what is good or bad - and the reward system has evolved to instruct the cortex into generating more favorable outcomes. Since a favorable outcome is the result of a limited set of actions, the system has to influence the cortex in a certain way that it maximizes the chances of such actions taking place again.

The issue is that the reward only comes now, once the action is completed, while the action is often a complex and lengthy process that started in the past. Since reinforcement learning is a trial and error process, we can assume there is no feedback until the very end of the process. The reward system is faced with the challenge of influencing the past while in the present. :)

What if the permanence of synapses is modulated directly by the reward system through the release of reward chemicals?

Before explaining the details, let’s review the current synapse model of HTM.
As it is in temporal memory, the permanence value of synapses is changed when the dendrite segment they belong to is activated by the activity of other cells. If a synapse of the active dendrite segment is connected to an already active cell, its permanence is temporarily increased, otherwise it’s temporarily decreased. If the cell that owns this dendrite segment (and which is now in a predictive state) becomes active in the next step, this means that the synapses correctly predicted the activity and so their temporary permanence change becomes… permanent. If the prediction was incorrect, the permanence changes are discarded. Basically, because the synapse permanence value is kept temporary during only one time step, this allows the cells to “look back” into the past for only one step to select the best predictive pattern.
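As a minimal sketch of the one-step rule just described (this follows the simplified description in this post, not the actual NuPIC API; the function name, the dictionary representation and the increment/decrement values are all assumptions made for illustration):

```python
def learn_on_segment(permanences, active_presynaptic, cell_became_active,
                     increment=0.10, decrement=0.05):
    """Return the permanences of one active dendrite segment after one step.

    permanences: dict mapping presynaptic cell id -> permanence in [0, 1].
    active_presynaptic: set of presynaptic cells that were active.
    cell_became_active: True if the predicted cell really became active.
    """
    # Tentative changes: reinforce synapses to active cells, weaken the rest.
    tentative = {}
    for cell, p in permanences.items():
        delta = increment if cell in active_presynaptic else -decrement
        tentative[cell] = min(1.0, max(0.0, p + delta))
    # Commit only if the prediction was correct, otherwise discard.
    return tentative if cell_became_active else dict(permanences)
```

The tentative changes live for exactly one time step here, which is the "look back one step" limitation mentioned above.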

What I propose is that, in addition to the current mechanism, the permanence changes of synapses also have a temporary component that spans many time steps, maybe hundreds or thousands or more. This component becomes permanent in the presence of a reward signal, which could be a chemical such as dopamine that comes into contact with the synapses. If the reward signal does not come within a certain time, the temporary component fades away, so the learning is partially wiped. This allows the reward system to enhance the pathways corresponding to favorable actions that happened within a certain time window in the past, depending on how slowly the permanence changes fade.

The faster the changes fade, the less influence the reward signal has on actions in the distant past and the more relevant are the recent ones.
Again, I am only talking about a temporary component of the change, not the whole change. Learning would happen the same way it currently does in HTM, but a reward signal could favor some past changes slightly more than others, depending on their present outcome.
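To make the fade-rate trade-off concrete, assume (purely for illustration) that the temporary component is multiplied by a constant factor $d$ at every time step. A change of size $c_0$ made $k$ steps ago then contributes

$$c_k = d^{\,k}\, c_0, \qquad 0 < d < 1,$$

and it drops to half its size after

$$k_{1/2} = \frac{\ln 2}{\ln(1/d)}$$

steps. With $d = 0.99$ that is roughly 69 steps, and with $d = 0.999$ roughly 693 steps, so the choice of $d$ sets how far back a reward can reach. The symbols $d$, $c_0$ and $k$ are not part of HTM; they are introduced here only for this example.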

Now, it doesn’t have to be just dopamine. In fact, we could have punishing signals that use the same mechanism in the opposite way, i.e. a signal that wipes out temporary synapse changes of particularly unfavorable outcomes.
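A rough sketch of how the reward and punishment signals could act on such a temporary component is given below. Everything in it is hypothetical: the `Synapse` class, the `pending` field, the split of each change into an immediate part and a pending part (`pending_fraction`), and the exponential `decay` are assumptions chosen to illustrate the idea, not an existing HTM implementation.

```python
from dataclasses import dataclass


@dataclass
class Synapse:
    permanence: float     # consolidated value, as in standard HTM
    pending: float = 0.0  # hypothetical slow component awaiting a signal


def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))


def step(synapses, changes, decay=0.99, pending_fraction=0.2):
    """Advance one time step.

    changes maps a synapse index to the ordinary HTM permanence change
    computed for this step. Most of it is applied at once; a fraction is
    held back as a pending component that fades unless a reward arrives.
    """
    for s in synapses:
        s.pending *= decay  # older pending changes fade away
    for i, delta in changes.items():
        s = synapses[i]
        s.permanence = clamp(s.permanence + (1 - pending_fraction) * delta)
        s.pending += pending_fraction * delta


def apply_signal(synapses, signal):
    """signal > 0: reward, commit what is left of the pending changes.
    signal < 0: punishment, wipe the pending changes without committing."""
    for s in synapses:
        if signal > 0:
            s.permanence = clamp(s.permanence + min(signal, 1.0) * s.pending)
        if signal != 0:
            s.pending = 0.0
```

Because the pending part decays every step, changes made shortly before the reward are consolidated almost fully, while older ones receive only a small boost.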

I have little knowledge of the biochemistry of neurotransmitters and such, but I believe this is a plausible mechanism which could be tested.


Hello, @dorinclisu, I also believe that is how reinforcement learning is implemented in biology. Some time ago, I presented a structure similar to what you proposed [1], with reward signals altering permanence values. Dopamine is known to alter corticostriatal connections, that is, the connections originating in cortical layers and targeting the basal ganglia (striatum). To my knowledge, however, the internal connections of cortical regions are not modulated by dopamine.

If you wonder about the biological plausibility of your proposal, below are some excerpts from the biological literature suggesting that the synaptic plasticity of connections between cortex and striatum is modulated by dopamine acting on D1 and D2 receptors. [2]

The underlying neural plasticity hinges upon phasic dopamine release, signalling reward or its expectation (Montague et al. 1996; Schultz 1998, 2013), and acting mainly within the striatum of the BG to enhance or depress synaptic strength (Centonze et al. 2001; Reynolds and Wickens 2002).

Apart from their opposing actions, a second key feature of the direct and indirect pathways is their differential regulation by dopamine (Albin et al. 1989; Gerfen and Surmeier 2011). The source of dopaminergic input to the striatum is the substantia nigra pars compacta (SNc), which is fed by a reciprocal input from the striatum but also by external sources, and acts as a modulatory gateway to BG circuits (Schultz 1998). In addition to mediating long term plasticity, noted above, dopamine also has a short-term influence upon striatal activity; it enhances the excitability of dSPNs and has the opposite effect upon iSPNs

This momentary regulation of SPN activity monitors the tonic level of dopamine afferent discharge, and is complemented by plastic changes of synaptic strength regulated by phasic dopamine signals (transitory peaks and troughs in the rate of dopaminergic discharge that reflect the presence and absence of reward (Schultz 2013)). Phasic activation of D1 and D2 receptors promotes LTP and LTD (long term potentiation and depression) of glutamatergic synapses upon dSPNs and iSPNs, respectively; moreover, these actions are contingent upon recent spiking history, such that dopamine gates LTP or LTD of a synapse depending on recent conjunctions of pre-and post-synaptic depolarisation (Shen et al. 2008; Paille et al. 2013).

Proposing a Model for the Basal Ganglia and Reinforcement Learning in HTM

Thanks for the pointers! I didn’t want to emphasize the role of dopamine or any other chemical in particular. I only mentioned it as a loose example, leaving the biochemistry details to people who specialize in it and are interested in it.
My proposal came out of intuition more than concrete evidence, and I am personally more interested in how well this would work in a computer implementation than in how accurate it is to a real brain. But hey, if it’s shown to be accurate, that’s an even better confirmation.

Now I am not sure if you proposed the same thing. It is not clear to me how you mean the reward to alter the permanence values, but that is perhaps because I am missing the context of sensorimotor inference and of the basal ganglia, which I haven’t studied closely yet. What is the reinforcement layer you mention? Do you have cortical cells which are only concerned with RL? Do you have synapses solely for biasing the cells as dictated by the RL layer?

Also, I believe that how the time association is made is one of the most important and tricky aspects, and I would appreciate your thoughts on that because I couldn’t get a clue from your post! For now, I am proposing a fixed time delay window, with the change component shaped like a decaying exponential or similar, but maybe a Gaussian with a variable delay would be better. There are lots of details to be worked out.
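To compare the two shapes mentioned above, here is a small sketch of how much credit a past change would receive as a function of its age when the reward arrives. The function names and the parameters `tau`, `delay` and `width` are assumptions made up for this illustration.

```python
import math


def exponential_credit(age, tau=50.0):
    """Credit is highest for the most recent change and decays with age."""
    return math.exp(-age / tau)


def gaussian_credit(age, delay=100.0, width=30.0):
    """Credit peaks for changes made about `delay` steps before the reward."""
    return math.exp(-0.5 * ((age - delay) / width) ** 2)


if __name__ == "__main__":
    for age in (0, 50, 100, 200):
        print(age, round(exponential_credit(age), 3), round(gaussian_credit(age), 3))
```

The exponential always favors the latest changes, while the Gaussian can favor changes made some time before the reward, which may matter when the decisive action is not the final one.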


Yes, one layer is concerned with RL. It basically learns from the HTM layer underneath it, taking its distal and proximal inputs directly from that layer. So the RL layer predicts based on the activation of the lower HTM layer. The lower HTM layer has distal depolarizations as usual. On top of this, the lower layer has apical segments in addition to its distal segments. The apical segments take input from the RL layer to imitate top-down feedback. The synapses of these apical segments are modulated by reward, so these segments only adapt to the important activations of the RL layer. If there is an apical depolarization on the lower layer caused by the RL layer, it means that the RL layer is predicting an important (salient) activation for the lower layer. If the distal depolarization of the lower layer (the predictions caused by the neighboring cells of the lower layer) matches this top-down depolarization, those cells are activated without proximal input, and they represent the important state among the possible ones. This is a tested mechanism, by the way.
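The matching step can be sketched roughly as follows, treating each kind of depolarization as a set of cell ids. This is only an illustration of the description above, not the tested implementation; the function name and the fallback to proximally driven cells when nothing matches are assumptions.

```python
def activate_lower_layer(distal_predicted, apical_predicted, proximal_active):
    """Pick the active cells of the lower layer for one step.

    distal_predicted: cells depolarized by neighboring cells of the lower layer.
    apical_predicted: cells depolarized by top-down input from the RL layer.
    proximal_active: cells driven by feedforward (proximal) input.
    """
    # Cells predicted both laterally and by the RL layer are the salient ones.
    salient = distal_predicted & apical_predicted
    if salient:
        return salient  # activated without needing proximal input
    # Assumed fallback: ordinary feedforward activation.
    return set(proximal_active)
```

For example, `activate_lower_layer({1, 2, 3}, {2, 3, 4}, {9})` selects cells 2 and 3, the states that are both possible and predicted to be important.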

The apical synapses are adapted according to the reward based on the TD(lambda) algorithm. I’ll present this in a more comprehensive way in the following days.

What you are talking about is very similar to what TD(lambda), temporal difference learning with eligibility traces, does in reinforcement learning. Each state has an eligibility trace which represents how recently that state occurred. For example, the trace of a state is 1.0 if it was the previous state. The one before it may have a trace value of 0.9, depending on the lambda parameter which controls the decay of the trace. The active state before that may be at about 0.8, and so on. The changes to these states are applied in proportion to their traces, so states with higher traces go through more intense changes. So it is a sliding window of states eligible for adaptation, and you adapt them based on their recency. Now, replace the states with synapses and it becomes similar to what you propose. To be exact, what I described is the backward view of TD(lambda).
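For reference, a minimal tabular sketch of the backward view with accumulating traces looks like this. The function name, the `(state, reward, next_state)` tuple format and the default parameter values are choices made for this example only.

```python
from collections import defaultdict


def td_lambda_episode(transitions, values=None, alpha=0.1, gamma=1.0, lam=0.9):
    """Run backward-view TD(lambda) over one episode.

    transitions: list of (state, reward, next_state); next_state is None
    at the end of the episode.
    """
    values = defaultdict(float) if values is None else values
    traces = defaultdict(float)
    for state, reward, next_state in transitions:
        next_value = 0.0 if next_state is None else values[next_state]
        td_error = reward + gamma * next_value - values[state]
        traces[state] += 1.0  # the state just visited becomes fully eligible
        for s in list(traces):
            # Every eligible state is updated in proportion to its trace...
            values[s] += alpha * td_error * traces[s]
            # ...and then its trace decays by gamma * lambda.
            traces[s] *= gamma * lam
    return values


episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, None)]
v = td_lambda_episode(episode)
```

With `gamma * lam = 0.9`, the traces of the three states at the moment of reward are 1.0, 0.9 and 0.81, so the state closest to the reward receives the largest update and earlier states receive progressively smaller ones.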

Also, this is a problematic approach because the model learns only the favorable behaviors, or learns them more. If you don’t model the punishing ones, you keep making the same mistakes. A better approach would be to learn the punishing ones too, but avoid them. (Indirect pathway of basal ganglia)


This is my first post here on the HTM forum. I have been following it for some time, reading as much as I can and doing some follow-up reading on the references mentioned in comments. This is not my field of expertise, but I am very interested, so I take the risk of maybe saying something (completely) wrong. I am aware that HTM theory is not complete yet and that people are looking to extend the theory beyond its current limits. Reinforcement learning is one such extension, and it is discussed here and in other places on the forum.

After this short introduction, here are my 2 cents, for what they are worth. When I think about learning, I always try to imagine what a chimpanzee or a child does when facing a problem. This discussion reminded me of the group of chimps in a cage in which a banana was hung from the ceiling on a rope. The only way to reach the banana was by stacking 2 boxes. After a lot of jumping and trial and error, one chimp found the trick. The others were watching. The next time a banana was hung from the ceiling, things went more smoothly, because the trick was remembered by some, and of course in the end every chimp knew what to do.

The point of this short story is that you might need a memory for learning. Another quality might be handy here too, namely analogy. As chimps climb, after some time they can (maybe?) make the inference that when you climb higher you can reach something you could not reach before (in their natural environment this will happen a lot, that is, climbing higher in the tree to get those juicy berries). Having a memory and being able to make analogies are biologically very plausible and, as far as I know, implemented to some extent in existing AI software.

I forgot to mention some references, although most people here are probably already familiar with them. For me they were intriguing reading, because they try to reverse engineer functions of the brain: Stan Franklin et al., LIDA and Global Workspace Theory.


OK, now I see the reason for the existence of an RL layer predicting the reward. Presenting an absolute reward signal instead of the difference between actual and expected reward would unnecessarily change the synapse configurations that are already capable of obtaining the reward. This mechanism also ensures that the brain is always striving for more and more. Looking forward to your presentation.
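A tiny sketch of why the difference signal leaves already-successful synapses alone: if the learning signal is the actual reward minus the expected reward, it shrinks toward zero as the expectation catches up. The function name and the simple running-average update of the expectation are assumptions for illustration.

```python
def prediction_errors(rewards, alpha=0.5, expected=0.0):
    """Return the error signal seen at each step and the final expectation."""
    errors = []
    for r in rewards:
        error = r - expected       # actual minus expected reward
        errors.append(error)
        expected += alpha * error  # expectation moves toward the actual reward
    return errors, expected


errors, expected = prediction_errors([1.0, 1.0, 1.0, 1.0])
# errors == [1.0, 0.5, 0.25, 0.125]
```

With a constant reward, the signal halves at every repetition here, so behavior that reliably obtains the reward soon stops being modified, and only a larger-than-expected reward produces a new positive signal.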

Indeed, what you say might be more efficient. Because I assumed that the reward signal would affect all synapses in the sequence memory without discrimination (other than by the last time they were modified), the bad behaviors would not be learned explicitly, but implicitly instead. With their pathways blocked, they would be avoided automatically.