Proposing a Model for the Basal Ganglia and Reinforcement Learning in HTM

Okay, so this post is going to be long. I’ll try to make the introduction here brief so that the majority of the post can focus on the important details.

Essentially, HTM needs reinforcement learning, as it will make HTM vastly more useful and likely drive up adoption. If I want a DNN to classify images, or predict stocks, or play GTA V, that’s reasonably easy to do. If I want to do that in HTM, it’ll be substantially harder. Sure, NuPIC’s classifier might be getting better, but it can maybe only solve one or two of the problems I listed. If you want a silver bullet that will solve all 3, you need reinforcement learning.

Okay, so how is reinforcement learning done in the brain? With the basal ganglia. How does the basal ganglia work? That’s what we need to figure out. Luckily, I’ve been reading up on the basal ganglia, so I have some ideas. Even if I’m wrong, having some place to start is better than having none.

I’ve talked about my ideas before, but here I’ll try to present them in a more organized way, as well as what it would take to implement them. Note however that my experience with HTM has mostly been with a few toy spatial poolers I wrote a few years ago, and I haven’t worked with NuPIC, HTM.Java, etc. I’m also fairly busy right now, so I might not get a chance to test these myself for a couple months. My main side project right now is an experimental compiler though, so writing something like an HTM implementation might be a good way to test that. It’ll be a while before the compiler’s ready for that though.

So while I might not get around to trying this out for a little while, I’m posting this here in case someone else wants to.

Overview for Basal Ganglia / Reinforcement Learning in the Brain

So first off, let’s talk about how the cortex interacts with the basal ganglia. The basal ganglia consists mainly of the striatum, the substantia nigra, and a few other nuclei. There are many pathways in it, but I’m not very sure about most of them yet, so I’ll be discussing a simplified model of it today. I’ll discuss a more complex model some other time.

The striatum takes excitatory inputs from the cortex (mostly from L5 of areas in the frontal lobe and some small oculomotor regions in the occipital lobe), and takes dopaminergic inputs from the substantia nigra (SN). This input from SN seems to be a reward signal.

The frontal lobe also appears to be structured in a hierarchy, with emotion-related areas at the bottom, areas related to abstract thinking and planning (like the PFC) in the middle, and motor areas at the top. Each one of those layers provides input to, and receives an output from, an area of the basal ganglia. Different areas do connect to different parts of the striatum and thalamus (some thalamic nuclei are the output of the basal ganglia).

The motor neurons in the cortex are the direct output to muscles, and mostly use population and rate encoding. The basal ganglia and thalamus only send excitatory signals back to the cortex; they do not inhibit the cortex, and they do not directly control motor output.

With the entire frontal lobe interacting with the basal ganglia, it seems reasonable to say that the frontal lobe is a giant hierarchy of reinforcement learners. Lower levels of that hierarchy work with more abstract, long-term goals, and going up the hierarchy results in more specific, more short-term goals. This does appear to be a common idea in neuroscience, so I’m certainly not the first to suggest this particular detail.

Inside the Basal Ganglia

The reward signal mentioned earlier actually branches into two parallel paths; one has an excitatory effect on the parts of the striatum it connects to, and the other is inhibitory. Each of these parallel paths then sends some signals into some of the other nuclei. The main pathways through these nuclei, known as the direct and indirect pathways, consist of a series of mostly inhibitory connections and don’t appear to do any learning. My idea is that they’re performing a simple bitwise operation.

More specifically, the striatum separates the input SDR from the cortex into two different SDRs (one for each pathway). One SDR, created by the part of the striatum receiving excitatory dopaminergic signals from SN, consists of a subset of the SDR associated with positive feedback. The other SDR, created by the part inhibited by the SN, consists of a subset of the SDR associated with negative feedback. Let’s call the positive SDR S+ and the negative one S-. The pathways these SDRs pass through appear to result in S- being inverted and then combined with S+ via something that reduces to a NAND operation. The output then disinhibits some thalamic neurons that excite the cortex.

[Edit: These SDRs only match up with the cortex SDR if the BG influences the cortex on a very fine-grained level. Chances are it doesn’t. Explained more in the Granularity section.]

In other words, the signal sent back to the cortex is the set of all SDR bits that are associated with positive reinforcement, minus any that are also associated with negative reinforcement (This however may be a slight misinterpretation depending on how the striatum works. I’ll get into that later).
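To make that concrete, here’s a minimal sketch treating SDRs as Python sets of active bit indices. The specific bits below are made up for illustration; this just shows the “positive bits minus negative bits” operation described above.

```python
# Sketch of "bits associated with positive reinforcement, minus any that
# are also associated with negative reinforcement" (S+ AND NOT S-).
# SDRs are represented as sets of active bit indices.
def combine_pathways(s_plus, s_minus):
    """Return the S+ bits that are not also in S-."""
    return s_plus - s_minus

s_plus = {2, 5, 9, 14}   # hypothetical bits tied to positive feedback
s_minus = {5, 14, 20}    # hypothetical bits tied to negative feedback
print(sorted(combine_pathways(s_plus, s_minus)))  # [2, 9]
```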


Where things get a little more complicated is in the striatum. I’m no expert in the properties of striatal neurons, but I know that the most common ones (Medium Spiny Neurons, or MSNs) are inhibitory, feature very large dendrites, and seem to be one of the only kinds of neurons in the basal ganglia that can learn. Some of these neurons feature excitatory dopamine receptors (thus involved in S+), others feature inhibitory dopamine receptors (S-), and then about 40% of them seem to feature both. I haven’t been able to find any info on what that 40% contributes, but it seems pretty significant.

There are other neurons too, but the MSNs are the main ones I’ll be focusing on.

I think that the striatum likely implements some form of temporal memory in order to predict which cortical SDRs to promote. Differences between MSNs and pyramidal cells might solely be to adjust the algorithm to deal with inhibitory neurons.

There is a big problem however; a temporal pooler is going to change the SDR.


One solution to this might be for each neuron in the cortex that projects to the striatum to have a small number of associated neurons in the striatum that it directly controls. I.e., if a cortical neuron (A) fires, then some neuron from its associated neuron group (B) will fire. If A does not fire, B will remain completely inactive. Any outputs from B must then project back exactly to A. This seems unlikely to me, and would likely be too brittle; if one of those axons is damaged, you may end up with a rogue cortical neuron unaffected by the basal ganglia.

Another solution is to abandon the idea that the basal ganglia sends an exact subset of the input SDR back to the cortex. I know this seems to be a popular idea at Numenta, but I’m not entirely sure that it would work exactly. Rather, I think a more coarse-grained SDR is likely received from the basal ganglia. This coarse-grained SDR acts as some kind of feedback to the cortex, but likely closer to the level of columns rather than neurons.

My reasoning, aside from the alternative being a bit brittle and requiring multiple neurons in the striatum for each neuron in the cortex (something that doesn’t seem too likely based on their comparative volumes), comes down to neural coding and topology.

First off, motor output mostly seems to be managed by a combination of population and rate coding. In other words, if some column outputs to a particular muscle, which neurons fire doesn’t matter; all that matters to the muscle is how many neurons fire, and how fast they are firing.

In this case, you have a large number of neurons that all contribute to a single, nonbinary, scalar output. In that case, it doesn’t matter if the basal ganglia is able to specify exactly which neurons to promote or not; simply promoting the entire group of neurons is good enough.
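As a toy illustration of that point (array sizes and sparsity here are arbitrary), two different SDRs with the same number of active neurons produce the same scalar drive, so which particular neurons fire is invisible to the muscle:

```python
import numpy as np

def muscle_drive(column_activity):
    """Scalar read-out: the muscle only sees how many neurons fired."""
    return column_activity.sum() / column_activity.size

rng = np.random.default_rng(0)
# Two different SDRs with the same number of active neurons...
a = np.zeros(100, dtype=bool)
a[rng.choice(100, 10, replace=False)] = True
b = np.zeros(100, dtype=bool)
b[rng.choice(100, 10, replace=False)] = True
# ...are indistinguishable to the muscle, so promoting the whole
# group of neurons (rather than specific ones) is good enough.
print(muscle_drive(a) == muscle_drive(b))  # True
```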

As for nonmotor areas, that’s where topology comes in. Due to topology, neurons will only be able to connect to nearby neurons. As a result, a particular sequence stored in the cortex isn’t going to be spread across the entire cortex, but will instead be mostly confined to a small number of nearby columns. Due to sparsity and inhibition, chances are each column can only represent a union of a small number of SDRs at a time.

Now if reinforcement learning was done on a more fine-grained level, the S- SDR in the striatum mentioned earlier might seem a bit redundant, as cases where a neuron is involved in both a positive and negative SDR simultaneously should be unlikely. However, if you have an entire column, then chances are it will be storing both good and bad SDRs, and so this makes sense.

Overall, my current model of the basal ganglia is this:

  • L5 of the cortex feeds inputs to the striatum.
  • The striatum receives a dopaminergic reward. If above a threshold, it excites neurons that contribute to the S+ output and inhibits those that contribute to S-. If below the threshold, S- is disinhibited instead.
  • S+ is an SDR consisting of patterns in L5 activity recognized by the striatum that are associated with positive reinforcement. S- is similar, but represents patterns associated with negative reinforcement. These patterns are probably learned via some form of temporal memory. Each column in L5 may be represented by several S+ and S- bits.
  • S+ and S- bits for each column are combined with what effectively amounts to inverting S- and combining the result with S+ via a bitwise AND operation (due to population coding though, this may be better approximated with some arithmetic). The resulting value for each column decides how much excitatory feedback the column receives from the basal ganglia. Columns never receive negative feedback.
  • Columns containing more “good” patterns are provided with more excitatory feedback. Columns containing more “bad” patterns are provided with less.
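The per-column combination in the bullets above can be sketched as a tiny feedback rule; `column_feedback` is a name I made up, and the bit sets stand in for whatever the striatum actually recognizes. This is the arithmetic approximation to “invert S- and AND with S+” mentioned above:

```python
# For each column, count the active S+ and S- bits and convert the
# difference into a non-negative excitatory bias.
def column_feedback(s_plus_bits, s_minus_bits):
    """Excitatory feedback for one column; columns never receive
    negative feedback, so the result is clipped at 0."""
    return max(0, len(s_plus_bits) - len(s_minus_bits))

print(column_feedback({1, 4, 7}, {4}))   # 2 -> more "good" than "bad" patterns
print(column_feedback({3}, {3, 5, 8}))   # 0 -> never negative
```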

What needs to be answered/tested:

  • What are the more precise properties of the striatum? How does its learning algorithm compare to that of the cortex?
  • What are those 40% of Striatal neurons that respond to both inhibitory and excitatory dopaminergic input doing?
  • What is the level of granularity that the basal ganglia controls the cortex with? Is it on the level of individual neurons (unlikely), or is it more coarse? Columns? Minicolumns? Bigger than columns? Somewhere in between?
  • What do the other pathways in the basal ganglia do?
  • How is the reward function calculated? Is it just hardwired, or does it learn too? (Learning would make sense with how flexible reward systems can be with humans, but maybe not necessary in all practical cases).
  • This model needs to be tested experimentally. Perhaps we could get HTM to play some games?

Like I said, this is mostly my interpretation of the neuroscience. I may be wrong on some of this, but having some place to start is better than nothing. If anyone needs me to better explain something or add diagrams, let me know.

I don’t have the time to test this now, though I will when I get around to it. However, if anyone else wants to run with these ideas and start implementing and testing these ideas, go right ahead. If anyone has any applicable neuroscience info, or any ideas to contribute, feel free to bring them up. Reinforcement learning could make HTM much more useful, and I’m sure Numenta could use all the help they can get.

Edits / Additions:

  • A simpler explanation: If the basal ganglia works at a granularity on the level of columns, then that would mean that each column connected to the basal ganglia has an associated column in the striatum. The striatum learns to recognize patterns in that column’s output SDR, and associates each pattern with either positive or negative feedback. Depending on how many positive/negative patterns are found, the striatum sends back a bias signal. This means that if a column contains too many “bad ideas”, the basal ganglia will tell it to quiet down.

  • Another thing that might be important is that there seem to be a few sensory areas that provide inputs to the striatum. If the striatum is performing some kind of temporal memory, this information may serve as context to allow the striatum to better judge how many good/bad patterns are in the input SDR.

I thought maybe I’d add a few details about how this could be implemented with HTM.


So in the diagram we have some columns ( C ), and each has two Striatal Columns (S+ and S-). Distal connections (Green arrows) connect all of these columns together. Everything shown here is from the same region of the cortex.

The blue arrows are direct output from L5 (or whatever layer you happen to use for motor output). Any feedback from the basal ganglia (purple arrows) would also only affect that layer (L5). However, the output from L5 would also be used as an output of the system anyway (the left-most branch of the blue arrows are outputs). The yellow arrows are feed-forward inputs.

The red arrows are a reward signal. S+ and S- columns react differently to these. In S+, the reward signal lowers the inhibition (Inhib = Inhib - Reward), and in S- the reward signal increases the inhibition (Inhib = Inhib + Reward). Aside from this, S+ and S- would act mostly like normal columns and would use temporal memory.

The purple arrows in the diagram represent scalar values; rather than S+ and S- outputting an SDR, they instead output a single scalar value. This value is just the number of neurons that have fired in the most recent time step. Let’s just call this the population. The population of S- is subtracted from the population of S+, and the result is adjusted so that it remains at or above 0 (either add a positive value, or cut it off at 0). The resulting value is then fed back to the associated cortical column ( C ) and lowers the local inhibition (Inhib = Inhib - X).

Now, as it is very important that the sparsity and inhibition are able to vary over time and location, it would likely not be a good idea to select which neurons fire by a “take the N with the highest values” method, as this will always produce the same sparsity. Rather, neurons should fire so long as their value in a given time step exceeds a threshold. This threshold is calculated by adding the local inhibition to a fixed base value, so that lowering the inhibition lowers the threshold and lets more neurons fire.
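Here’s a minimal sketch of that firing rule, assuming the threshold rises and falls with the local inhibition (so that lowering inhibition lets more neurons fire, consistent with the feedback step above). All constants are made up, and the clamp just keeps the threshold from ever reaching 0:

```python
import numpy as np

BASE_THRESHOLD = 0.5   # arbitrary fixed base value
MIN_THRESHOLD = 0.05   # clamp so the threshold never reaches 0

def fire(values, inhibition):
    """Threshold-based firing: sparsity varies with the local inhibition
    instead of being fixed by a take-the-top-N rule."""
    threshold = max(MIN_THRESHOLD, BASE_THRESHOLD + inhibition)
    return values > threshold

rng = np.random.default_rng(42)
values = rng.random(100)

low_inhib = fire(values, inhibition=-0.2).sum()   # e.g. after positive feedback
high_inhib = fire(values, inhibition=0.2).sum()   # e.g. after negative feedback
# Lower inhibition -> lower threshold -> more active neurons.
```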

Overall, this is how the system should work:

  • A column C receives a feed-forward input (yellow arrow). All green arrows represent inputs to distal dendrites. Spatial and temporal memory occur.
  • The output of each column C is sent to three places; an S+ column, an S- column, and the next region in the hierarchy / motor output.
  • S+ and S- take the output of C as feed-forward input, and perform spatial and temporal memory.
  • A reward signal is given to all S+ and S- columns. The signal lowers inhibition in S+ columns and increases it in S- columns.
  • S+ and S-, as they have a threshold-based firing method, contain more active neurons when they recognize more patterns in the output of C. Due to the reward-driven inhibition, S+ more frequently learns “good” patterns and S- more frequently learns “bad” patterns.
  • The number of neurons firing in S- is subtracted from the number firing in S+. The result (X) is sent back to C and is subtracted from the default inhibition to give the new inhibition (Inhib = Default - X).
  • This causes C to become more active when it contains patterns associated with positive reinforcement, and less when it contains patterns associated with negative reinforcement.
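The feedback step in the loop above can be sketched for a single column C; `column_step` is a name I made up, the HTM machinery inside each column is elided, and the population counts stand in for real S+/S- activity:

```python
def column_step(s_plus_population, s_minus_population, default_inhibition):
    """New inhibition for cortical column C, computed from the populations
    of its S+ and S- striatal columns (X is cut off at 0, as suggested)."""
    x = max(0, s_plus_population - s_minus_population)
    return default_inhibition - x   # Inhib = Default - X

# More "good" than "bad" patterns recognized -> lower inhibition -> C more active:
print(column_step(12, 4, 10))   # 2
# More "bad" patterns -> X clipped to 0 -> inhibition stays at default:
print(column_step(3, 9, 10))    # 10
```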

A few things to note:

  • Inhibition calculations may have to be tweaked a bit to prevent them from ever getting too low. You don’t want every neuron in an entire column becoming active because the threshold dropped to 0.
  • The output of a column can’t be controlled on the level of individual neurons unless each neuron is given its own S+ and S- column. This is fine if the output isn’t motor output (i.e., a non-motor region that still interacts with the BG, such as the PFC or anything else not at the top of the frontal hierarchy). However, this does create some issues for direct motor output, as you can’t easily control individual neurons.
  • The standard form of motor output for this model will be a scalar value (population count) for each column. However if many outputs are needed, these columns could be scaled down fairly small so long as they each still have their own S+ and S- columns.
  • The precise SDR of a region might still be controllable if the reward is given based on how much the output SDR of a column matches the expected SDR. However, a population-coding-based approach will probably be much faster.

According to “Distribution and morphology of nigral axons projecting to the thalamus in primates”, the examined projection from the basal ganglia to the thalamus involved fairly small terminal fields, targeting about 20 cells each (though there are some other details, so that isn’t necessarily the whole area targeted by a single cell). That’s a little small, but I’m not sure how many cells were in the targeted thalamic nuclei. I’m also not sure if the terminal fields overlapped and caused more or less specific control somehow.

Maybe they serve different roles in different situations. With SDRs, since you can take the union of many and still tell them apart, it’s okay if a lot of cells have multiple roles.

There probably needs to be a way to select actions, based on both perceptual input and the suggested actions. There are motor maps in some regions, but they must be more specific than single columns, since there are so many possible movements.
I haven’t read them, but maybe “Distribution and morphology of nigral axons projecting to the thalamus in primates,” “Three-dimensional morphology and distribution of pallidal axons projecting to both the lateral region of the thalamus and the central complex in primates,” or “Axonal collateralization in primate basal ganglia and related thalamic nuclei” can help determine the granularity.

Thanks for the detailed post – I am hoping to distill some useful information from this to influence my RL project. Your drawing is also very helpful. I’ll need to read through this a few more times before I’ll fully understand it, but thought I would post a couple of initial questions I had after my first pass.

Am I correct in understanding that, prior to the inversion step, S- is a sparse representation which becomes a dense representation after the inversion? If so, inversion + NAND could be further reduced to just AND (i.e. the overlap between S+ and un-inverted S- SDRs).

This also seems to imply that the specific bits in the two SDRs generated by S+ and S- are semantically similar to each other (such that overlapping bits represent the same contexts). Is that a correct understanding?

Am I correct in assuming this means information about rewards and punishments is not stored in the cortex, but isolated to the basal ganglia, which provides a biasing signal back to the cortex (probably to columns for use in an SP-like process, rather than to the individual cells within the columns)? This aligns with how I was imagining it to work in my own models as well.

There seems to be a slight contradiction here, but probably just due to a gap in my current understanding. Earlier, you had mentioned an output SDR being sent back to the cortex. Are these referring to two different things? I can see the optimization benefit from an implementation perspective of sending a scalar value rather than modeling a dendrite segment and synapses if this is only being used to influence a SP column scoring function.

I didn’t understand this point. What do you mean by changing the SDR? Do you mean the bits in the pooling matrix do not align with the bits in the S+ and S- SDRs? If so, I think the job of a pooling matrix is to produce a biasing signal. The specific bits utilized shouldn’t matter – what matters is selecting which columns (or cells) in the cortex structures are connecting with this matrix.

In your diagram, the “C” blocks contain multiple layers, I assume, correct? For example, when you mention “spatial and temporal pooling occur” in your first step of the process in reference to “column C”, you are not talking about temporal pooling in the motor layer, but rather in the “object” layer of the two-layer sensory-motor circuit?

I would find it helpful if you broke down the contents of “C” into their layers (even if that seems obvious from standard HTM theory), in order to have a more granular drawing of the connections, both internally between layers and with other external features of the diagram.

Also, I’m curious whether branch exploration is addressed in the system, whereby a negative action might be reattempted with the possibility of discovering a positive outcome? I’m thinking a possible location for this would be the boosting function in the SP’s that are taking input from the purple feedback arrows in your diagram.

Okay, that seems to suggest that the basal ganglia works on a very fine-grained map of the cortex. However, those are connections from the basal ganglia to the thalamus; there are also connections from the cortex to the striatum, between regions of the basal ganglia, and from the thalamus to the cortex. If the terminal fields are all pretty similar though, then it should only be a few hundred neurons.

Maybe; if the neurons react to both very high and very low dopamine levels? Otherwise it may be that the two types of receptors more or less cancel out, and you’re left with a neuron that is less affected by dopamine than other neurons. If the striatum is trying to learn positive and negative patterns in cortical activity, having two types of neurons that react to very different dopamine levels would effectively result in two networks of columns in the same region that become active at different times. That could result in the two effective networks not being able to share information because they are never active at the same time. Dopamine-independent neurons could help bridge that gap.

Yes, but motor output is known to be mostly based on rate and population coding. You don’t need millions of possible patterns in a region to tell a muscle whether or not to contract. With only a few hundred muscles or so in the body, chances are you have many columns controlling each muscle. Population coding is useful then.

As for other regions, like I’ve said, the frontal lobe seems to be structured in a hierarchy. The top is motor output, and the bottom is the limbic system. In between is the PFC and similar regions. Every level of that connects to the BG. The lower levels control the higher levels, and the BG uses reinforcement learning to regulate that control. In those regions, rather than use population coding, essentially what will happen is that if a column starts outputting SDRs that the striatum recognizes as bad, it will decrease disinhibition (increasing the influence of local inhibition), causing that column to quiet down. You don’t need control over individual neurons to do that.

I guess you can still call it a form of action selection. It’s just not neuron-level control, so I think saying that it’s “selecting subsets of SDRs” is a bit misleading unless you’re talking about the SDR across the entire frontal lobe.

Also, thanks for the paper recommendations. I’ll have to read them when I get the chance.

Well, about half way through writing this post I decided to fact-check a few things, and think through some of their consequences. When I started writing the post I was thinking that S+ and S- might be subsets of the input SDR, but then I realized that that would only be the case if the granularity that the striatum was mapping the cortex at was very high. With receptive fields bigger than individual neurons, that ceases to be the case. I probably should have gone through it an extra time to clear out any contradictions that this caused.

The comparison to AND still works, though it’s not perfectly accurate. Like I’ve said, I’m not sure what granularity the BG influences the cortex at. The smaller it is, the better the AND comparison works. If you made a model with a neuron-level granularity, you could just use AND.

Pretty much. The biasing signal, like I mentioned above, is only going to be a neuron-level SDR if you’re modelling it at that level of granularity. In the brain, it’s probably modeled closer to the level of columns (or at least a few minicolumns), in which case it ends up being a population-coded scalar value for each column.

As I mentioned in my response to Casey, a nicer description of what is going on is that the cortex outputs an SDR, the striatum looks for patterns in it that it has learned to associate with positive/negative responses, and if it finds more positive responses than negative, it biases the entire column to become more active. If a column contains an SDR with a lot of bad patterns, the BG essentially just tells it to quiet down.

I meant that if the BG is influencing the cortex at a coarser granularity than individual neurons, you’ll probably want a decent number of neurons in the striatum for each column in the cortex. That way, you can recognize more positive/negative patterns in it and get a better model of what is good/bad. In that case, the SDR created by the striatum won’t be a subset of the cortical output anymore.

Yes, it contains multiple layers. As for the temporal pooling part, L5 is the motor layer, and is very similar in structure to L3. As Jeff has pointed out a few times, L5 and L6a almost look like a copy-pasted L2/3 and L4. I think temporal pooling makes a lot of sense in a motor layer, as it allows a motor neuron to better tell when to fire based on where in the sequence it is.

I’ll add another diagram later to clarify the connections a little better though.

Well if some form of temporal pooling is occurring in the striatum, it should be able to tell if a pattern in a column is good or bad based on temporal context. That way, a pattern that it considered bad in one case may be tried again later if it’s associated with positive reward in a different context. Now, as for getting it to explore such cases to begin with, I’m not sure. I’ve been thinking for a while that the brain probably has some kind of boosting-based novelty reward, but I could never find a mechanism. If the Striatum is a temporal pooler though, maybe bursting there as you suggest could be such a mechanism. It would increase striatal population counts after all.

I’m not sure. VPM has ~200 cells, but barrel cortex (its main target) is estimated to have ~8500 cells, split up into a lot of columns. I don’t know how many cells most thalamic nuclei have.

I think they would probably just not express dopamine if they were meant to be independent of reward, although you might be right. The details matter here. For example, do different parts of the neuron express different dopamine receptors? Is dopamine distributed throughout the striatum (e.g. with very diffuse projections or extracellular dopamine release), or do different cells in the substantia nigra target only some cells, with activation of different cells in substantia nigra serving some role?

I think the easiest way to determine the roles of these cells is probably to find their activity patterns, if any studies have been done on that.

I didn’t mean to imply that it needs to be very fine-grained like mini-columns, just not on the level of macro-columns. Otherwise, there doesn’t seem to be much point having so many cells in each macro-column which control movement. Also, the basal ganglia needs to know the movement it is controlling, not just the part of the body being moved, and it probably needs to be able to select one of multiple possible movements suggested by a macro-column, since there are typically multiple possible ways to move. Maybe it could learn to narrow down the options to the best one, but it still needs to deal with multiple options before learning the best action.
Columns also don’t always directly control muscles. Some control pattern generators. In some regions, they even control the planned actions, which means there are a huge number of possible motor commands.

Primary motor cortex gets input from primary somatosensory cortex, so it’s close to the bottom of the cortical hierarchy. If higher regions work at higher levels of abstraction, there’s probably a certain point where it shouldn’t have a strong motor output. Sensory regions shouldn’t, because they don’t know much about what’s going on. Very high regions also shouldn’t, because they don’t know the specifics required to determine behavior with much precision, although they could control the general behavior of lower regions.

I think a single column should be able to consider multiple actions in parallel, just like L2/3 represents the set of feature-location pairs. That way, L5 could suggest all possible features to touch, then the basal ganglia could choose. To manipulate an object, L5 would just have to include the possible location-feature pairs which could be sensed as a result of an action, which is basically the same as learning what would be sensed given each location on the object. The only difference is that the object is defined by what feature-location pairs could be sensed after applying force to manipulate the object, rather than the subset of possibilities which result from gently touching the surface of the object.

You’re welcome. I’m researching whether or not the basal ganglia controls activity buildup in L5 before a movement, so I’ll want to take notes on them. If you can’t access an article you want to read, let me know.

The entire PFC probably contains at least a billion neurons, and dendrites can only spread out so far. That’s a very big and very topological SDR. SDRs like that probably differ in a few ways from the smaller ones everyone here is used to. With topology, you’ll end up with similar patterns stored near each other in the cortex. Since the PFC models a lot of different ideas and strategies, and since you’ll only be using a couple at a time, the number of significantly active columns in the PFC at any time, even under heavy mental workloads, will probably be fairly low.

I’m thinking that there are three major factors here that help select actions. I’ll mostly be talking about higher-level regions like the PFC here:

  1. Topology means that similar strategies are probably represented close together. Let’s say that we have a few columns, and each represents a different strategy. If a particular strategy is a bad idea, the BG will create a bias against it and lower its influence. If a particular column offers both good and bad actions using the same strategy, the BG will still bias against it, but to a lesser extent.

  2. The hierarchy probably plays a big role. The PFC plans strategies. The Premotor regions take those strategies and plan out specific actions based on them. The BG might let some bad strategies slip through the PFC, but they get more specific (and probably more spread out, as different high-level actions will result in different lower-level actions) in Premotor areas. If the BG recognizes them as bad there based on the current context, it’ll weed them out. Actions would have to be almost identical to not be differentiated here, and so there may not be a clear line between a good action and a bad action. I imagine an action like this would be risky, and the BG would recognize it as such and be a bit less likely to promote it.

  3. Feedback from lower-level actions to higher-level strategies probably also plays a role. If the basal ganglia lets two very different potential actions through the PFC because it couldn’t separate them, it should be able to differentiate them when they’re transformed into very different lower-level actions. The actions that it promotes in premotor areas will send feedback to the PFC. If that feedback occurs on PFC apical dendrites, then it’ll create a bias toward the specific neurons involved in the action that gets let through the premotor areas. Due to the scalar bias (some negative patterns, some positive patterns in the same column, so the column doesn’t get the maximal disinhibition), the neurons getting that apical feedback will be more likely to exceed the higher threshold from the lower bias response.

I’ve been studying a bit of microarchitecture lately, so I’ve been thinking of pipelines and hierarchies as similar things, especially since I’ve mostly been thinking about sensory information going up the hierarchy. I was talking mostly about a small portion of the full cortical hierarchy, and it just made sense in my head to say that information keeps going up the hierarchy, as sensory information is transformed more and more into the eventual motor output. Plus, DNNs typically have the top of their hierarchy as an output, so I didn’t think twice about it.

I was mostly talking about the frontal hierarchy anyway, not the entire cortical hierarchy.

Cool. I don’t get to read a ton of papers (I’m busy and neuroscience is just a hobby), but I’ll keep that in mind. Thanks.

Extensions to Incorporate Feedback, Hierarchy, and Curiosity

Thanks to my discussions with Casey and Paul_Lamb, I’ve found some details I’d like to add to the model.


A quick and small extension here; as Paul suggested, provided that the Striatum is performing some kind of temporal memory, bursting there could serve as a mechanism for a form of curiosity. If a suggested action appears that has not been seen before (at least in the current context), Striatal “minicolumns” would be expected to burst. This could throw off the population coding, possibly providing a cortical column with an extra bias, pushing for actions to be taken without a clear prediction of whether they will result in “good” or “bad” outcomes.

Action Selection

The core model I’ve suggested seems to explain most of how actions are selected; columns provide a union of SDRs for actions, the Striatum recognizes patterns in that SDR that it associates with positive/negative reinforcement, and then manipulates the inhibition in each column based on the patterns it found (more positive patterns mean less inhibition, more negative patterns mean more inhibition). The amount of inhibition has a major impact on the sparsity, and too much could easily stop all activity in a column, giving columns a large bias against suggesting bad actions.
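As an illustration, the column-level biasing described above could be sketched roughly like this. This is a toy sketch only; the `base` and `gain` parameters and the per-column match counts are assumptions for illustration, not anything taken from the biology:

```python
import numpy as np

def column_inhibition(pos_matches, neg_matches, base=1.0, gain=0.1):
    """Toy version of the proposed BG bias: per-column inhibition rises
    with recognized "bad" patterns and falls with "good" ones.
    pos_matches / neg_matches are per-column counts of striatal
    S+ / S- neurons responding to that column's output."""
    inhibition = base + gain * (np.asarray(neg_matches) - np.asarray(pos_matches))
    # Inhibition can't go negative; strongly "good" columns bottom out at 0.
    return np.maximum(inhibition, 0.0)

# Three columns: mostly "good", mostly "bad", and ambiguous
inh = column_inhibition([8, 1, 5], [1, 8, 5])
```

With enough negative matches the inhibition can exceed any neuron's drive and silence the column entirely, which is the "large bias against suggesting bad actions" described above.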

The problem is that an entire column seems too big; there may be many very different actions represented in a column simultaneously. How does the brain separate them?


Let’s say that we have a bit of cortex, similar to the one in the picture. Each square in the grid is a column. White columns are inactive, red columns are suggesting “bad” actions, blue are suggesting “good” actions, and columns that are both red and blue are suggesting a union of “good” and “bad” actions. Let’s call those ones ambiguous.

The brain is very topologically structured; as a result, columns that represent similar patterns will tend to be close together. The brain also has a motor map, mapping regions of the motor cortex to muscles. High-level regions of the Executive Hierarchy in the frontal lobe will likely be fairly different from this map, but lower-level regions should tend to be organized closer to the motor map. The entire Executive Hierarchy (frontal lobe) likely will only be sparsely populated by active columns; after all, the PFC might contain a large number of strategies for solving problems, but only a small number will be used at any point in time.

This also means that different actions should result in different columns becoming active. If one particular column suggests two different actions, the BG will be unable to separate them. However, when passed to the next region the two suggested actions should result in many different columns becoming active, provided that the two actions are sufficiently different.

In the example in the above image, a single active column in a higher-level region (left) may result in many columns becoming active in a lower-level region (right). A “good” (blue) action in the left region will likely trigger many other “good” patterns in columns in the next region. A “bad” (red) action will likewise trigger many other “bad” patterns in lower-level columns. There may also be some overlap.

Due to topology, these “good/bad” patterns should now be spread out across many columns. The only case where this would not occur is if the two suggested actions are almost identical in terms of their lower-level sub-actions, but such a situation should be rare (as they would then not really be different actions).

In this example, the BG would very quickly produce high inhibition in the red columns, low inhibition in the blue columns, and medium inhibition in the ambiguous columns. The red columns would be quickly blocked out from the inhibition, the blue columns would become very active, and ambiguous columns would become more selective about which neurons fire.


This other diagram shows connections between regions in a hierarchy, and between regions and the BG (Paul asked for the one on the right). L5 of a cortical column is the primary motor layer in the cortex, and functions as the main interface with the BG (right diagram). In the left diagram, the blue arrow shows feed-forward connections, and the red shows feedback. The red and blue arrows are specifically feed-forward/feedback connections between regions at different levels of a hierarchy.

Note the feedback connection from L5/L6 to L1/L6. The L5-to-L1 connection is what matters here. L1, while normally omitted in HTM theory due to its lack of neurons, is actually full of apical dendrites (feedback dendrites) from neurons in the other layers, including L5.

What this means is that, in our example, blue columns in the right region will provide feedback to the same neurons that caused them to fire. This means that the neurons associated with the “good” action in the column in the left region will become further depolarized, and neurons associated with the “bad” action in the same column, lacking such feedback from the inhibited red columns in the right region, would not be depolarized to the same extent. As the column is ambiguous due to the striatum recognizing both “good” and “bad” patterns in it, the BG slightly raises local inhibition and the column becomes more selective about which neurons will fire. With the extra depolarization from the feedback, the neurons associated with the “good” pattern will quickly win out over the “bad” pattern.
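A minimal sketch of that tie-breaking, with made-up numbers (the `apical_gain` value and the inhibition threshold are assumptions):

```python
import numpy as np

def select_in_column(proximal, apical, inhibition, apical_gain=0.3):
    """Neurons fire if their depolarization clears a threshold set by
    local inhibition. Apical feedback from promoted lower-level columns
    adds extra depolarization, so in an ambiguous column (slightly raised
    inhibition) only the fed-back "good" neurons clear the bar."""
    depol = np.asarray(proximal, float) + apical_gain * np.asarray(apical, float)
    return depol > inhibition

# Four neurons with equal proximal drive; only the first two (the "good"
# pattern) receive apical feedback, and inhibition is slightly raised.
proximal = np.array([1.0, 1.0, 1.0, 1.0])
apical   = np.array([1.0, 1.0, 0.0, 0.0])
active = select_in_column(proximal, apical, inhibition=1.1)
```

Without the raised inhibition both patterns would fire; without the apical feedback neither would. The combination is what makes the column selective.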

To Summarize: Biasing entire columns is going to lead to some “bad” actions getting through. A hierarchy of reinforcement learners would provide the BG with additional chances to filter out these bad actions. When topology is taken into account, different actions suggested by the same column should lead to activity in different columns in the next region, making it easier for the BG to filter them out. Adding apical feedback to this model allows these corrections to propagate backwards through the hierarchy, creating biases against the neurons that contributed to the blocked actions in the next region (the neurons that are part of the bad actions).


Clearly some experimentation is needed here to test out this model and these extensions. I don’t think that these extensions are mandatory for the model to work, but I think they would definitely allow faster learning and more accurate behavior.

Also, if anyone from Numenta could join the discussion, that would be quite interesting. I’m sure Numenta has some ideas and info that could help develop this model.

Charles: I just read your initial post too quickly, but I want to be lazy and ask you some basic questions. First of all, why should an SDR be split into a SDR+ and an SDR-? And what does splitting mean - does it mean that if you had a 20 bit SDR, the first 10 would go on one path, and the second 10 bits would go on another? What do they actually do in the basal ganglia? Why do you need both inhibition and excitation, and why do you need both from the same SDR? Where does the SDR come from? The cortex? What does it represent - a motor command?
Also, from a practical point of view at this particular time, when the only thing NuPIC has been used for is predicting sequences, how would adding a reward/punishment signal add new functions to the cortex algorithm?

So the SDR would only be split into S+ and S- (where a union of S+ and S- will return the original SDR) when the granularity is on the level of individual neurons. If the BG biases the cortex on any coarser of a granularity (most likely), then S+ and S- aren’t really subsets of the original SDR anymore. It’s also not being split 50/50 like you mentioned, but is a bit more complex.

Each cortical motor column (or more precisely, L5 of a cortical column in the frontal lobe) has an associated striatal column. Outputs from the cortical column function as a normal motor/feed-forward signal, but also as an input to a column in the Striatum.

The Striatum, I’m proposing, is mostly doing normal temporal pooling on this input. However, it also receives a second input from the Substantia Nigra, in the form of a dopaminergic “reward” signal. This is not an SDR, but rather a scalar value. In addition, not all neurons in the Striatum have the same dopamine receptors. Instead, there are two types, the excitatory D1 and the inhibitory D2. Striatal neurons with mostly D1 would fire more often when dopamine (reward) is high. Neurons with D2 would fire more often when dopamine is very low. Most of the time, dopamine levels are probably in the middle.
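The D1/D2 asymmetry could be caricatured like this (the `baseline` level and the linear response are assumptions; real receptor dynamics are far more complex):

```python
def dopamine_gain(dopamine, baseline=0.5):
    """Caricature of the D1/D2 split: D1 (direct-pathway) neurons gain
    excitability as dopamine rises above a tonic baseline, while D2
    (indirect-pathway) neurons gain excitability as it falls below.
    At baseline, neither group gets an extra push."""
    d1 = max(0.0, dopamine - baseline)
    d2 = max(0.0, baseline - dopamine)
    return d1, d2
```

The point of the sketch is just that a single scalar signal can address two populations differentially, which is what lets one dopamine level encode both reward and punishment.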

What this means is that the neurons in the Striatum are segregated into two groups, each biased to recognize patterns associated with different reward levels. There is also another difference between these two types of neurons: their axons project to different places.

S+ is essentially made up of the D1 (high dopamine) neurons, and S- is the D2 (low dopamine) neurons. These two SDRs are produced by the same region (the Striatum). The two different pathways go through some other nuclei in the BG (Globus Pallidus, Subthalamic Nucleus, etc.). These regions don’t appear to do any learning, and rather seem to just be performing some kind of calculation on the outputs.

Now an important thing to note is that for this to work, sparsity (at least in the striatum and motor regions) is going to have to be a much more flexible thing. These columns won’t have a fixed sparsity, but rather a sparsity dependent on a local inhibition value, and how much local neurons react to the proximal input (which may also vary in sparsity). Rather than pick the K-highest neurons in a region to fire, the SDR would contain all the neurons that exceeded a particular threshold; one that may change from timestep to timestep.

The first reason why flexible sparsity is important here is that the calculation I’m proposing the BG is performing on S+ and S- is based on population coding. In other words, the BG doesn’t care which neurons in S+ and S- fire, but rather how many. It is also topological, so it would be more accurate to say that it’s dependent on how many S+ and S- neurons fire in each striatal column. The calculation it appears to do is a simple collection of adds/subtracts; something like pop(S+) - pop(S-) + C. The signal sent back to the cortex is in the form of an excitatory signal dependent on this result.
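The two pieces above (the population-count calculation and the flexible-sparsity activation) might be sketched as follows; the `pop(S+) - pop(S-) + C` formula is from the text, while the threshold function is a sketch of what "flexible sparsity" could mean in code:

```python
import numpy as np

def bg_bias(s_plus, s_minus, c=0):
    """The population-coded calculation from the text: the BG cares only
    how many S+ / S- bits are active in a striatal column, not which
    ones. pop(S+) - pop(S-) + C becomes the excitatory signal returned
    to the corresponding cortical column."""
    return int(np.count_nonzero(s_plus)) - int(np.count_nonzero(s_minus)) + c

def threshold_sdr(potentials, threshold):
    """Flexible sparsity: instead of k-winners-take-all, every neuron
    whose potential exceeds a (possibly time-varying) threshold fires,
    so the SDR's sparsity can change from timestep to timestep."""
    return [i for i, p in enumerate(potentials) if p > threshold]
```

Lowering the threshold (more BG excitation) lets more neurons into the SDR; raising it (more inhibition) can silence a column entirely.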

What that means is this; a cortical column outputs some motor command (or output to the next region) in the form of an SDR. The associated Striatal column performs temporal pooling, using it as a feedforward input. S+ neurons, being tuned to recognize patterns in it that occur around the same time as a reward, respond if they find any. S- neurons respond to any patterns they find that they associate with “punishment”. The BG then counts how many S+ neurons and S- neurons show up in this striatal column. If there are a lot of S+ neurons, then a high excitatory signal is sent back to the cortex. If there are a lot of S- neurons, then very little of an excitatory signal is sent back. If there is a mix, then it is dependent on the difference between the numbers.

Simply put, the Striatum is looking for good/bad motor commands. It sends back a bias signal to the cortex that promotes columns providing mostly good motor commands.

This bias signal is also likely population coded, and, being excitatory, can modulate the threshold in different regions of the cortex on the level of individual columns. That’s also the second reason for flexible sparsity; to allow the BG to properly modulate activity in the cortex.

Now, this would probably only work well with some level of topology; the cortical columns need to map nicely to the striatal columns, and the bias signal has to return to the original column. There is also a problem in that this doesn’t quite allow the BG to select actions, but rather to promote columns suggesting unions of mostly-good actions. Promoting them with an excitatory signal could even make the problem worse by increasing the number of active neurons further. This might not matter a lot for direct motor output (which is probably population coded), but it would be a problem for regions producing more high-level plans, like the PFC. I have a post with some extensions to the model, though, suggesting how this could be solved just by adding a hierarchy and apical feedback.

As for additions to NuPIC, I haven’t worked with it myself (just some of my own toy HTM implementations, and a lot of theory), but it doesn’t sound to me like there’s a huge amount of support (at least in the public version) for hierarchies, apical feedback, dynamic sparsity, sensory-motor inference, or topology. I might be wrong though. In order to work at all, my model would likely require good support for those features.

After that, reinforcement learning would mean that you’d be able to create some type of fitness/reward function, and have HTM generate outputs (SDRs or an array of population-coded integers, the SDR probably being a bit slower to learn), attempting to optimize for that function. HTM takes an input and provides an output. You take that output, put it through the reward function, which returns a scalar value to the BG. HTM will be biased to produce outputs that produce high rewards.
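The loop described above might look something like this. Note that the `model` interface here (`compute` / `reinforce`) is invented purely for illustration; it stands in for an HTM region plus its associated BG, where `reinforce()` delivers the scalar reward that biases future output SDRs:

```python
def rl_step(model, encode, reward_fn, observation):
    """One hypothetical sense-act-reward cycle: the HTM-like model
    produces an output from the encoded observation, the reward function
    scores that output, and the scalar reward is fed back as the BG's
    "dopamine" input."""
    output = model.compute(encode(observation))
    reward = reward_fn(observation, output)
    model.reinforce(reward)  # biases future outputs toward high reward
    return output, reward
```

Everything problem-specific lives in `encode` and `reward_fn`; the model itself never needs to know what task it is solving, which is the "silver bullet" appeal of reinforcement learning here.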

For example, rather than using a classifier to try to determine what an HTM region is predicting for the future, you could instead add a motor layer or two (with an associated BG), and train the model to provide a more precise and straightforward prediction. You could just as easily have it attempt to predict arbitrary timesteps in the future. Have some kind of reverse-encoder to convert the motor output SDR to a precise and easily readable prediction, and then provide a reward based on how well the prediction matched up.

Another example would be that you could apply HTM to problems like the kind Google DeepMind goes after; we could have HTM playing Atari games or something. The motor output can be a reverse-encoder to generate inputs for the game, and the reward can be based on changes in the score, health, etc.

Essentially, anything that you see DNNs doing now (and probably more), HTM could likely do with the addition of some reinforcement learning.

@Charles_Rosenbauer, @Paul_Lamb and @Casey thanks for the interesting discussion. I’d like to pitch in with some extra ideas.

@Charles_Rosenbauer The usage of column is kind of ambiguous in your explanations for me. Usually, HTM theory refers to minicolumns as columns, but you seem to mean macrocolumns (layer scope) when you say column, if I am not mistaken. It would help if you can explicitly state which one you are referring to.

Your model seems concerned with biological functionality, so maybe you should present it in terms of the pathways inside basal ganglia: direct, indirect and hyperdirect. Most computational models seem to reference those when describing theirs. [1]

• Direct (GO) Pathway; Striatum→GPi.
• Indirect (NO-GO) Pathway; Striatum→GPe→GPi.
• Hyperdirect Pathway; Cortex→STN→GPi.

Direct pathway (Go) provides what you are referring to as learning and biasing good motor actions, and indirect pathway (No-Go) does the exact opposite for bad actions. The hyperdirect pathway provides what @Casey describes below. It inhibits the output of basal ganglia to integrate more information from other cortical columns that help resolve the correct action in case of conflicts. [2]

In the context of biology, dopamine is not a reward signal based on my understanding. It is actually an error signal representing the difference between an expected reward and the actual reward. Striatum is capable of comparing the two. [3] This might seem like a detail but it actually has drastic implications on the model.

The temporal pooling proposal on striatum is an interesting one for me. What exactly happens after you decide that the current pooled activation is good or bad? Remember, you are pooling multiple activations of layer 5 but you need to bias the next step in the sequence for layer 5. Or is there another mechanism to resolve this? If you bias all the steps of that particular sequence on layer 5, how does layer 5 pick the correct one? Will sparsity and temporal memory be enough for this? I will definitely experiment with this. Without pooling, striatum needs to know every transition of layer 5 which currently works on my model but also seems a huge waste of capacity. You might be onto something here.

So how do you avoid bad stuff, such as not going forward when punishment is inevitable? If you don’t avoid bad stuff, you essentially prevent the model from reducing its search space. You would want to search among the good options if you already know the bad ones.

I do not think it is omitted because of a lack of neurons, because the team is aware of this paper [4]; I saw them sharing it. There are in fact inhibitory neurons in L1. It might be more of a functional abstraction thing. Layer 1 is a top-down terminal for the cortical hierarchy, as you also stated, as well as being the terminal for basal ganglia output. There is no basal ganglia or cortical hierarchy in HTM yet, and that might be the reason.

@Paul_Lamb mentions a very important problem to dwell on. Especially since you propose temporal pooling on striatum which treats the layer 5 sequences as a whole. Learning the wrong bad action or an action being classified as bad prevents a lot of exploration.

Yeah, I’m primarily talking about macrocolumns here, though I’ve mentioned the granularity a bit. I’m using macrocolumns as an example, since I’m not entirely sure what the granularity is. What I mean is that neurons near each other are going to project to more or less the same area in the striatum (topology), those neurons will project to more or less the same areas in other parts of the BG, the outputs of the BG via the thalamus will project more or less back to the same set of neurons, etc. However, you do have to take receptive fields into account. The granularity is dependent on those receptive fields, and so the bigger they are, the more “blurred” the BG’s representation of the cortex gets. So a bias signal won’t just go back to the exact same neuron that triggered it, but rather a group of nearby neurons. For convenience, I’m just assuming that group is about the size of a macrocolumn, but the exact size doesn’t matter too much. It really just determines how precisely the BG is able to control the cortex.

I understand. I was also trying to make my explanation easier to understand for people with a better understanding of HTM than the neuroscience behind it, so I didn’t want to throw in all of the neuroscience jargon. S+ is the direct pathway, S- is the indirect pathway. As for the hyperdirect, it’s not something I’ve taken into account too much, though it seems to me to be there to just add some extra corrections. Probably important, but not for a high-level explanation of my model. As for it correcting conflicts, I think my extension regarding hierarchies and feedback probably could be enough to do that. Some experimentation is needed though to confirm that.

Great point! I must have overlooked that detail in my research. In fact, I think it actually improves my model; if the direct pathway for example is being biased to learn during rewards, it shouldn’t need an additional bias signal every time there’s a reward if it already does its job recognizing “good” patterns. The difference between the population count of the direct and indirect pathways would be the expected reward (Or more specifically, how much each column is expected to contribute to it. Kind of. “Good” patterns are subsets of an SDR that are expected to contribute to more reward, and “bad” are ones that are expected to contribute more to less reward), and if that differs from the actual reward, then the dopamine signal would correct that.
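Under that reading, the dopamine signal might be computed as a one-liner like the following. The tonic `baseline` offset and the unit scaling are assumptions, standing in for the normally nonzero resting dopamine level:

```python
def dopamine_signal(actual_reward, pop_s_plus, pop_s_minus, baseline=0.5):
    """Dopamine as a reward prediction error rather than a raw reward:
    the striatum's own population counts encode the expected reward
    (difference of direct- and indirect-pathway activity), and dopamine
    reports how far the actual reward deviates from that expectation,
    offset by a tonic baseline."""
    expected = pop_s_plus - pop_s_minus
    return baseline + (actual_reward - expected)
```

When prediction is perfect the signal sits at baseline; a better-than-expected reward pushes it up (exciting the direct pathway), a worse-than-expected one pushes it down (disinhibiting the indirect pathway).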

I think it’s also important to note that even when there is no difference here, dopamine levels are almost always nonzero. That makes sense though; a continuous signal to inhibit the indirect pathway, but not enough to significantly excite the direct pathway. When dopamine drops from there, it’s enough to disinhibit the indirect pathway. When it goes above normal, it’s enough to significantly excite the direct pathway.

Also, doesn’t dopamine affect the learning rate of neurons too? That would make sense here, as dopamine levels at either extreme would mean that the reward was poorly predicted in the current situation, and so increasing the learning rate in the affected pathway temporarily would make sense. Though it’s probably not absolutely necessary.

Well the temporal pooling would be more or less to provide a temporal context to the direct and indirect pathways. An action may be good in some cases, but not others. The temporal pooling would provide it the context needed to determine that. As for biasing the next step in the sequence in L5, that’s more or less what’s happening, just perhaps with a slight delay of a few steps. The output of the direct and indirect pathways do control the bias in L5. Whether it’s the next step in the sequence, I’m not certain about that. There might be too much of a delay. Luckily, taking an extra step or two to stop a bad sequence probably isn’t a bad thing; a timestep here is about 5ms, so that’s probably within the brain’s margin of error.

As for biasing the right sequence out of many, that’s what my hierarchy-and-feedback extension was about. Getting the exact right neurons probably doesn’t matter too much in the primary motor cortex as it seems to be mostly population coded. There, all that matters is how many neurons are firing, not which exact ones. As for more high-level areas, that’s where topology and the hierarchy come in. Feed-forward signals tend to “blur” the topology; for example, a column in V1 might only get inputs from one small portion of the retina, but going up to V2, you’ll find that each column can trace its inputs back to a much larger portion of the retina.

Same principle here, but with specificity in reverse. The PFC feeds forward to Premotor areas, which feed forward to Primary Motor areas. The Primary motor areas are very close to the motor map, the Premotor less so (blurring), and the PFC even less so. A small area of the PFC therefore influences a much larger area of the Premotor cortex, etc. This spread-out influence, combined with topology (similar patterns will be represented close together in the cortex), means that with each feed-forward step through the hierarchy, actions get more specific, and also more spread out. The above-mentioned granularity doesn’t matter as much here, because each step forward through the hierarchy is essentially zooming in on the specifics and filtering out the “bad” details.

If a PFC column sends a union of actions to a Premotor area, that action gets spread out across many columns, each modeling a particular aspect of that set of actions in more detail. Due to being spread out, the BG can now be more specific with filtering out the “bad” patterns. When it biases against a Premotor column due to it suggesting “bad” details, that column stops sending apical feedback to the neurons that triggered it in the PFC column. If any “bad” patterns were recognized in the PFC column (which should be the case if it creates bad patterns in Premotor columns), that will lower the excitatory bias in that PFC column, increasing the effective local inhibition. Due to the lack of apical feedback and the increased inhibition, the neurons in that PFC column associated with “bad” actions, even when they form a union with several “good” actions, are biased against. It just takes a few extra steps. This answers your next question too.

I meant pyramidal/etc. neurons to be more specific; no spatial/temporal pooling going on there.

I think that’s why the temporal pooling in the Striatum is important; it provides a temporal context. A particular action will vary in whether it’s good or bad based on what has happened recently. In temporal pooling, different neurons in the same minicolumn represent the same pattern in the feed-forward input, but just in different temporal contexts. If a Striatal minicolumn (or whatever the Striatum’s equivalent of minicolumns is) contains neurons from both the direct and indirect pathway, it would mean that minicolumn can represent the action in both good and bad contexts, and send the results down the correct pathway in either situation.
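That context-dependent routing could be sketched as follows. The data structure (a minicolumn as a list of context/pathway pairs) is invented for illustration, not taken from the biology:

```python
def route_action(minicolumn, context):
    """A striatal minicolumn as (context, pathway) cells: the same
    feed-forward action, represented in different temporal contexts,
    with each cell wired into the direct ('direct'/D1) or indirect
    ('indirect'/D2) pathway. The active context decides which pathway
    the action's vote goes down."""
    for cell_context, pathway in minicolumn:
        if cell_context == context:
            return pathway
    return None  # unseen context -> bursting, i.e. the curiosity case

# The same action is "bad" after a warning but "good" after a cue:
mc = [("after_warning", "indirect"), ("after_cue", "direct")]
```

The `None` branch ties back to the curiosity extension earlier in the thread: an action in an unrecognized context has no learned pathway, so the minicolumn bursts.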

Yes, it is discussed in this thread [1].

I really understand what you mean and I agree with the approach. Although implementation-wise, if layer 5 activates the motor action, it is important to ensure that any activation on layer 5 is a coherent one to output a coherent action. Based on my experiments, the combined bias at layer 5 caused by the mechanism you described would have bits and pieces from multiple good and bad states. There needs to be a conflict resolution here, and you are suggesting that this is the duty of the lower regions in the hierarchy through top-down feedback, which I agree with. Though in practice, it is not clear to me how. I am also suggesting that top-down influence is not the only component solving this. According to this paper [2], the conflicting activations of layer 5 are resolved through its apical dendrites sampling from layer 1. Layer 1 integrates information from basal ganglia, thalamus and other cortical regions. So the solution might not be that simple.

Is it possible that you are confusing Temporal Memory with Temporal Pooling? If so, we are arguing about different things. If you were talking about striatum having Temporal Memory, then I totally agree with that and that is how I implement it.

I’d like to take a closer look at the results of your experiments. It sounds like you only tested the effects of a bias signal and Striatum, right? “Bits and pieces of good and bad states” is what I’d expect to see after the bias without feedback.

Upon a bit of research, yes. My mistake there.

I think there’s evidence that the basal ganglia can select actions with a sub-columnar but not cellular level of specificity.

There is massive convergence from the cortex to basal ganglia outputs, on the order of 100 or more cortical neurons which project to the striatum for every neuron in the basal ganglia output nuclei. However, most cortical cells which project to the striatum use a sparse code which is sensitive to the combination of various parameters, whereas the basal ganglia seem to control more directly coded actions. The basal ganglia also mostly only control regions in the frontal lobe, but they receive input from the entire neocortex.

The superior colliculus is a subcortical structure controlled by the basal ganglia. SNr cells have visual receptive fields and project to cells in the superior colliculus which generate eye movements to similar receptive fields [1]. If a stimulus in a receptive field silences a cell in SNr, that cell contributes to movements to that receptive field [1]. Similarly, in the bird, at least some parts of the basal ganglia control just a few cells in the thalamus [2].

I don’t have direct evidence for the cortex because I’m still not sure how the basal ganglia send a signal back to cortical cells with motor outputs in order to select an action. There probably needs to be some sort of loop. In primates, only ~1% of pyramidal tract cells (cortical cells with motor outputs) project to the striatum. Most cortical inputs to the striatum come from another type of cell in layer 5 and cells in layer 2/3. However, pyramidal tract cells project to thalamic nuclei controlled by basal ganglia, and those thalamic nuclei provide ~1/3 of the inputs to striatum. The main issue is how the signal gets back to cortex. The thalamic cells targeted by L5 are matrix cells, which project to layers 1, 2, and 3a. Pyramidal tract cells often fire a burst of a few spikes, likely to generate the action after the basal ganglia has time to make a selection, and input to those layer 5 cells on their apical dendrite in layer 1 can cause them to burst. However, activity in those cells often rises gradually before they burst, especially when the animal can react to a sensory cue without a tight time constraint or when planning is involved. Simply causing cells to burst to make a decision isn’t sufficient. Also, the input to the distal apical dendrite in layer 1 of L5 pyramidal tract cells doesn’t have much impact on their firing rates, unless the input is strong enough to trigger burst firing, so there’s probably no direct signal from thalamus to cortical pyramidal tract cells.

The basal ganglia needs a way to increase firing rates in cortex gradually before making a final decision. Many parts of the brain, including multiple parts of the basal ganglia, pyramidal tract cells, subcortical motor structures, and even some cell types in the spinal cord, gradually change their firing rates before firing more rapidly to trigger the action. Based on that fact, I think gradual changes in firing rates during action selection are a fundamental part of behavior. However, I don’t know how this works.

Overall, I think the basal ganglia can select fairly specific actions, but I am biased, since that’s convenient for the ideas about layer 5 I’m working on.

[1] Disinhibition as a basic process in the expression of striatal functions (Chevalier and Deniau, 1990)
[2] A GABAergic, Strongly Inhibitory Projection to a Thalamic Nucleus in the Zebra Finch Song System (Luo and Perkel, 1999)

An alternative way to look at it involves treating BG->Thalamus->L1->L5 pathway as modulatory. We know that L1 has inhibitory effects on other cortical layers [1]. So what if that signal is actually used for hyperpolarizing the rest so that only the target activation stays depolarized? Or what if apical depolarization only activates the ones that are already depolarized by inter-regional connections? Either way this signal is not the actual cause for excitation and I am not sure if it needs to be.
