Okay, so this post is going to be long. I’ll try to make the introduction here brief so that the majority of the post can focus on the important details.
Essentially, HTM needs reinforcement learning, as it will make HTM vastly more useful and likely drive up adoption. If I want a DNN to classify images, or predict stocks, or play GTA V, that’s reasonably easy to do. If I want to do that in HTM, it’ll be substantially harder. Sure, NuPIC’s classifier might be getting better, but it can maybe only solve one or two of the problems I listed. If you want a silver bullet that will solve all 3, you need reinforcement learning.
Okay, so how is reinforcement learning done in the brain? With the basal ganglia. How does the basal ganglia work? That’s what we need to figure out. Luckily, I’ve been reading up on the basal ganglia, so I have some ideas. Even if I’m wrong, having some place to start is better than having none.
I’ve talked about my ideas before, but here I’ll try to present them in a more organized way, as well as what it would take to implement them. Note however that my experience with HTM has mostly been with a few toy spatial poolers I wrote a few years ago, and I haven’t worked with NuPIC, HTM.Java, etc. I’m also fairly busy right now, so I might not get a chance to test these myself for a couple months. My main side project right now is an experimental compiler though, so writing something like an HTM implementation might be a good way to test that. It’ll be a while before the compiler’s ready for that though.
So while I might not get around to trying this out for a little while, I’m posting this here in case someone else wants to.
Overview for Basal Ganglia / Reinforcement Learning in the Brain
So first off, let’s talk about how the cortex interacts with the basal ganglia. The basal ganglia consists mainly of the striatum, the substantia nigra, and a few other nuclei. There are many pathways in it, but I’m not very sure about most of them yet, so I’ll be discussing a simplified model of it today. I’ll discuss a more complex model some other time.
The striatum takes excitatory inputs from the cortex (mostly from L5 of areas in the frontal lobe and some small oculomotor regions in the occipital lobe), and takes dopaminergic inputs from the substantia nigra (SN). This input from SN seems to be a reward signal.
The frontal lobe also appears to be structured in a hierarchy with emotion-related areas at the bottom, abstract thinking and planning-related areas like the PFC in the middle, and with motor areas at the top. Each one of those layers provides input to, and receives an output from, an area of the basal ganglia. Different areas do connect to different parts of the striatum and thalamus (some thalamic nuclei are the output of the basal ganglia).
The motor neurons in the cortex are the direct output to muscles, and mostly use population and rate encoding. The basal ganglia and thalamus only send excitatory signals back to the cortex; they do not inhibit the cortex, and they do not directly control motor output.
With the entire frontal lobe interacting with the basal ganglia, it seems reasonable to say that the frontal lobe is a giant hierarchy of reinforcement learners. Lower levels of that hierarchy work with more abstract, long-term goals, and going up the hierarchy results in more specific, more short-term goals. This does appear to be a common idea in neuroscience, so I’m certainly not the first to suggest this particular detail.
Inside the Basal Ganglia
The reward signal mentioned earlier actually branches into two parallel paths; one has an excitatory effect on the parts of the striatum it connects to, and the other is inhibitory. Each of these parallel paths then sends some signals into some of the other nuclei. The main pathways through these nuclei, known as the direct and indirect pathways, consist of a series of mostly inhibitory connections and don’t appear to do any learning. My idea is that they’re performing a simple bitwise operation.
More specifically, the striatum separates the input SDR from the cortex into two different SDRs (one for each pathway). One SDR, created by the part of the striatum receiving excitatory dopaminergic signals from SN, consists of a subset of the SDR associated with positive feedback. The other SDR, created by the part inhibited by the SN, consists of a subset of the SDR associated with negative feedback. Let’s call the positive SDR S+ and the negative one S-. The pathways these SDRs pass through appear to result in S- being inverted and then combined with S+ in what appears to reduce to an AND operation, i.e. S+ AND NOT S-. The output then disinhibits some thalamic neurons that excite the cortex.
[Edit: These SDRs only match up with the cortex SDR if the BG influences the cortex on a very fine-grained level. Chances are it doesn’t. Explained more in the Granularity section.]
In other words, the signal sent back to the cortex is the set of all SDR bits that are associated with positive reinforcement, minus any that are also associated with negative reinforcement (This however may be a slight misinterpretation depending on how the striatum works. I’ll get into that later).
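To make the "positive bits minus negative bits" idea concrete, here is a minimal sketch. It assumes an SDR is represented as a Python set of active bit indices; the function name `bg_output` and the example indices are made up for illustration and don't come from NuPIC or any other HTM implementation.

```python
def bg_output(s_plus, s_minus):
    """Bits tied to positive reinforcement that are not also tied to
    negative reinforcement, i.e. S+ AND NOT S-.

    Both arguments are iterables of active bit indices (an assumed
    representation, chosen only for this sketch).
    """
    return set(s_plus) - set(s_minus)


# Hypothetical example: bit 17 appears in both SDRs, so it is dropped.
s_plus = {3, 17, 42, 88}
s_minus = {17, 90}
promoted = bg_output(s_plus, s_minus)
print(sorted(promoted))  # [3, 42, 88]
```

Note that bit 90 has no effect here: a bit that is only in S- was never going to be promoted anyway.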
Where things get a little more complicated is in the striatum. I’m no expert in the properties of striatal neurons, but I know that the most common ones (Medium Spiny Neurons, or MSNs) are inhibitory, feature very large dendrites, and seem to be one of the only kinds of neurons in the basal ganglia that can learn. Some of these neurons feature excitatory dopamine receptors (thus involved in S+), others feature inhibitory dopamine receptors (S-), and then about 40% of them seem to feature both. I haven’t been able to find any info on how that last group contributes to S+ or S-, but that 40% seems pretty significant.
There are other neurons too, but the MSNs are the main ones I’ll be focusing on.
I think that the striatum likely implements some form of temporal memory in order to predict which cortical SDRs to promote. Differences between MSNs and pyramidal cells might solely be to adjust the algorithm to deal with inhibitory neurons.
There is a big problem however; a temporal pooler is going to change the SDR.
One solution to this might be for each neuron in the cortex that projects to the striatum to have a small number of associated neurons in the striatum that it directly controls. I.e., if a cortical neuron (A) fires, then some neuron from its associated neuron group (B) will fire. If A does not fire, B will remain completely inactive. Any outputs from B must then project back exactly to A. This to me seems unlikely, and will likely be too brittle; if one of those axons is damaged, you may now have a rogue cortical neuron unaffected by the basal ganglia.
Another solution is to abandon the idea that the basal ganglia sends an exact subset of the input SDR back to the cortex. I know this seems to be a popular idea at Numenta, but I’m not entirely sure that it would work exactly. Rather, I think a more coarse-grained SDR is likely received from the basal ganglia. This coarse-grained SDR acts as some kind of feedback to the cortex, but likely closer to the level of columns rather than neurons.
My reasoning, aside from the alternative being a bit brittle and requiring multiple neurons in the striatum for each in the cortex (something that doesn’t seem too likely based on their comparative volumes), comes down to neural coding and topology.
First off, motor output mostly seems to be managed by a combination of population and rate coding. In other words, if some column outputs to a particular muscle, which neurons fire doesn’t matter; all that matters to the muscle is how many neurons fire, and how fast they are firing.
In this case, you have a large number of neurons that all contribute to a single, nonbinary, scalar output. In that case, it doesn’t matter if the basal ganglia is able to specify exactly which neurons to promote or not; simply promoting the entire group of neurons is good enough.
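A rough sketch of what that means in practice. It assumes each neuron in the group is described by a firing rate in Hz (0 for silent) and that rates saturate at some `max_rate_hz`; both the function name and the 100 Hz ceiling are illustrative assumptions, not measured values.

```python
def population_rate_output(firing_rates_hz, max_rate_hz=100.0):
    """Collapse a group of neurons into one scalar in [0, 1].

    Only how many neurons fire and how fast they fire affects the
    result; which particular neurons fire does not.
    max_rate_hz is an assumed saturation rate, used only to normalise.
    """
    if not firing_rates_hz:
        return 0.0
    # Clamp each rate into [0, max_rate_hz] before summing.
    total = sum(max(0.0, min(r, max_rate_hz)) for r in firing_rates_hz)
    return total / (max_rate_hz * len(firing_rates_hz))


# Two different sets of active neurons, same count and same rates:
a = population_rate_output([50.0, 0.0, 100.0, 0.0])
b = population_rate_output([0.0, 100.0, 0.0, 50.0])
print(a, b)  # 0.375 0.375
```

Since the two activity patterns give the same output, feedback that boosts the whole group has the same effect on the muscle as feedback that picks out individual neurons.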
As for nonmotor areas, that’s where topology comes in. Due to topology, neurons will only be able to connect to nearby neurons. As a result, a particular sequence stored in the cortex isn’t going to be spread across the entire cortex, but will instead be mostly confined to a small number of nearby columns. Due to sparsity and inhibition, chances are each column can only represent a union of a small number of SDRs at a time.
Now if reinforcement learning was done on a more fine-grained level, the S- SDR in the striatum mentioned earlier might seem a bit redundant, as cases where a neuron is involved in both a positive and negative SDR simultaneously should be unlikely. However, if you have an entire column, then chances are it will be storing both good and bad SDRs, and so this makes sense.
Overall, my current model of the basal ganglia is this:
- L5 of the cortex feeds inputs to the striatum.
- The striatum receives a dopaminergic reward. If above a threshold, it excites neurons that contribute to the S+ output and inhibits those that contribute to S-. If below the threshold, S- is disinhibited.
- S+ is an SDR consisting of patterns in L5 activity recognized by the striatum that are associated with positive reinforcement. S- is similar, but represents patterns associated with negative reinforcement. These patterns are probably learned via some form of temporal memory. Each column in L5 may be represented by several S+ and S- bits.
- S+ and S- bits for each column are combined with what effectively amounts to inverting S- and combining the result with S+ via a bitwise AND operation (due to population coding though, this may be better approximated with some arithmetic). The resulting value for each column decides how much excitatory feedback the column receives from the basal ganglia. Columns never receive negative feedback.
- Columns containing more “good” patterns are provided with more excitatory feedback. Columns containing more “bad” patterns are provided with less.
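The list above could be sketched per column roughly like this, using arithmetic instead of a strict bitwise AND. Everything here is an assumption for illustration: the function name `column_feedback`, the idea of counting active S+ and S- bits, the normalisation by the number of detectors, and the simplification that S+ cells contribute regardless of the reward level (the model above doesn't say what happens to S+ below the threshold).

```python
def column_feedback(n_plus, n_minus, n_detectors, reward, threshold=0.0):
    """Excitatory feedback for one column, as a value in [0, 1].

    n_plus / n_minus: how many of the column's S+ / S- bits are active.
    n_detectors:      assumed number of S+ (and S-) bits per column.
    reward:           scalar dopamine signal; threshold is hypothetical.
    """
    if n_detectors <= 0:
        raise ValueError("n_detectors must be positive")
    # Assumption: above the threshold, dopamine inhibits the S- cells,
    # so they drop out; at or below it, S- is disinhibited and counts.
    effective_minus = 0 if reward > threshold else n_minus
    raw = (n_plus - effective_minus) / n_detectors
    # Columns never receive negative feedback, so clamp at zero.
    return max(0.0, min(1.0, raw))


print(column_feedback(6, 2, 8, reward=-1.0))  # 0.5  mostly "good" column
print(column_feedback(1, 5, 8, reward=-1.0))  # 0.0  mostly "bad" column
print(column_feedback(6, 2, 8, reward=1.0))   # 0.75 S- suppressed by reward
```

The clamp is the important part: a column full of "bad" patterns just gets no help from the basal ganglia, rather than being actively suppressed.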
What needs to be answered/tested:
- What are the more precise properties of the striatum? How does its learning algorithm compare to that of the cortex?
- What are those 40% of striatal neurons that respond to both inhibitory and excitatory dopaminergic input doing?
- What is the level of granularity that the basal ganglia controls the cortex with? Is it on the level of individual neurons (unlikely), or is it more coarse? Columns? Minicolumns? Bigger than columns? Somewhere in between?
- What do the other pathways in the basal ganglia do?
- How is the reward function calculated? Is it just hardwired, or does it learn too? (Learning would make sense with how flexible reward systems can be with humans, but maybe not necessary in all practical cases).
- This model needs to be tested experimentally. Perhaps we could get HTM to play some games?
Like I said, this is mostly my interpretation of the neuroscience. I may be wrong on some of this, but having some place to start is better than nothing. If anyone needs me to better explain something or add diagrams, let me know.
I don’t have the time to test this now, though I will when I get around to it. However, if anyone else wants to run with these ideas and start implementing and testing them, go right ahead. If anyone has any applicable neuroscience info, or any ideas to contribute, feel free to bring them up. Reinforcement learning could make HTM much more useful, and I’m sure Numenta could use all the help they can get.
Edits / Additions:
A simpler explanation: If the basal ganglia works at a granularity on the level of columns, then that would mean that each column connected to the basal ganglia has an associated column in the striatum. The striatum learns to recognize patterns in that column’s output SDR, and associates each pattern with either positive or negative feedback. Depending on how many positive/negative patterns are found, the striatum sends back a bias signal. This means that if a column contains too many “bad ideas”, the basal ganglia will tell it to quiet down.
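Here is a toy version of that simpler explanation. To keep it short, it swaps the temporal memory I suspect the striatum uses for a plain dictionary lookup on exact patterns, which is a big simplification. The class name `StriatalColumn`, its methods, and the threshold are all hypothetical.

```python
class StriatalColumn:
    """Toy striatal partner for one cortical column.

    Uses exact-match lookup instead of temporal memory (a deliberate
    simplification for this sketch).
    """

    def __init__(self, threshold=0.0):
        self.threshold = threshold
        self.valence = {}  # frozenset of bit indices -> +1 or -1

    def learn(self, pattern, reward):
        # Tag the pattern as good (+1) or bad (-1) based on the reward.
        self.valence[frozenset(pattern)] = 1 if reward > self.threshold else -1

    def bias(self, active_patterns):
        """Excitatory bias in [0, 1] for the patterns in the column."""
        if not active_patterns:
            return 0.0
        good = sum(1 for p in active_patterns
                   if self.valence.get(frozenset(p), 0) > 0)
        bad = sum(1 for p in active_patterns
                  if self.valence.get(frozenset(p), 0) < 0)
        # Never negative: a "bad" column just gets no extra excitation.
        return max(0.0, (good - bad) / len(active_patterns))


col = StriatalColumn()
col.learn({1, 2, 3}, reward=1.0)    # a "good idea"
col.learn({4, 5, 6}, reward=-1.0)   # a "bad idea"
print(col.bias([{1, 2, 3}]))             # 1.0
print(col.bias([{1, 2, 3}, {4, 5, 6}]))  # 0.0
```

A column holding one good and one bad pattern ends up with no bias at all, which is the "quiet down" behaviour described above.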
Another thing that might be important is that there seem to be a few sensory areas that provide inputs to the striatum. If the striatum is performing some kind of temporal memory, this information may serve as context to allow the striatum to better judge how many good/bad patterns are in the input SDR.
Quick Links to additional explanations and extensions: