HTM Based Autonomous Agent

Hello Kaan,

Thank you for the reply. I hope everything went well at the wedding! Congratulations!

I started rebuilding the architecture including layer 2/3 in NUPIC and have some more detailed questions regarding the L5/Motor layer, which I posted in the forum thread. In my current attempt I again try to calculate the excitation from D1/D2 directly in the motor layer without layer 5 in between, as these symmetric, circular computations are not really supported in NUPIC.

However, I am currently having a hard time figuring out why the network is not behaving as intended, due to a lack of good visualization tools for NUPIC’s Network API. My layers have 2048x32 neurons, which makes them very big, and I see that the agent initially has no active apical connections at all to excite/inhibit neurons. Probably it just takes more learning or, more likely, debugging of my architecture.

I am also wondering about how much training it took your agent to learn behaviour and the environment? I use the same parameters (PermInc etc.) initially but am not sure how much time it will take the agent to learn and adapt - in case it works in the first place.

Kind regards


Thank you for the kind words. I will write a separate answer for the thread. Visualization indeed helps to track the learning progress. On the other hand, you can ensure that the system works by introducing a mechanism to give inputs directly to the agent. These would be very simple 1-, 2- or 3-step inputs. Then you could give the control back to the agent at the same point in the action sequence. This was how I made sure it was at least learning.

The required number of steps is not concrete because the layers need to settle their self-competition and stabilize. This takes time and input variety. After they settle, the number of steps should be around (connection permanence / permanence increase) for every new step of a sequence. Still, the convergence time varies greatly. You can limit the number of environmental states to reduce the time needed to stabilize.
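To make the rough estimate above concrete, here is a minimal sketch. The function name, the `initial_perm` argument and the example values are assumptions for illustration, not the thesis parameters:

```python
import math

def steps_to_connect(connected_perm, perm_inc, initial_perm=0.0):
    """Rough number of reinforcing presentations before a synapse connects."""
    if perm_inc <= 0:
        raise ValueError("perm_inc must be positive")
    remaining = max(0.0, connected_perm - initial_perm)
    # Round first so floating-point noise (e.g. 3.0000000000000004) does not
    # push the ceiling up by one.
    return math.ceil(round(remaining / perm_inc, 9))

# Placeholder values, not the thesis parameters.
print(steps_to_connect(connected_perm=0.5, perm_inc=0.1))
```

This only covers the settled case; it says nothing about how long the layers take to stabilize first.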

I meant that you could make the agent learn a very simple sequence by repeatedly giving direct motor inputs. You then give the control back at some point to see what it does. It is hard to see whether it learns with random exploration at first.
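A minimal sketch of that hand-over, assuming a hypothetical `agent_policy` callable and string action names (neither is from the thesis code):

```python
def next_action(step, scripted_actions, agent_policy, observation):
    """Use the scripted action while the script lasts, then ask the agent."""
    if step < len(scripted_actions):
        return scripted_actions[step], "scripted"
    return agent_policy(observation), "agent"

# Hypothetical usage: force "left", "left", "click", then hand control back.
script = ["left", "left", "click"]
policy = lambda obs: "noop"  # stand-in for the real agent
trace = [next_action(t, script, policy, observation=None) for t in range(5)]
print(trace)
```

Logging who chose each action makes it easy to see whether the agent continues the sequence once it is back in control.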

Thank you, I saw that approach of human control in your thesis; it is indeed a good idea for testing the network. I will try to test it that way. Yes, I believe so; I have reduced the environmental complexity greatly for now. Currently the task consists only of 4 cubes on a screen, and the agent should click the upper left corner to get a reward, with a timeout of 10 seconds.

Some more questions coming to my mind:

  • Which layer sizes did you experiment with?

  • My layers are quite large but I sample only a very small number of active neurons with the given parameters, which makes me wonder if that can be correct and working this way. Did you keep sparsity at about a ~2% rate or much lower in some layers?

  • And I currently do not include topology… so my input is just encoded as 1D-black/white array. This takes a lot of information away from the agent. Did you have any tests how important the topological information (2D) was in your framework?

Seems like the task is simple enough. I assume the agent is controlling the mouse on a 2D screen? Also, why the timeout?

At the end of the thesis there are various layer size measurements, both in terms of performance and learning behavior. My environment worked well enough with layer sizes of 512 columns and 8 neurons per column. I experimented in the range of 256 to 4096 columns. I went with 512 for performance reasons, as it did not seem to lose learning capability. If you design your encoder well, you can get away with smaller layer sizes.

The sparsity was between 2% and 4%. It did not have a drastic effect. 512 columns with activation sparser than 2% results in fewer than 10 active columns, which hurts overlap capacity.
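The arithmetic behind those numbers can be checked quickly. This is only a sketch; the helper name and the rounding choice are assumptions:

```python
def active_columns(num_columns, sparsity):
    """Approximate number of active columns for a given sparsity."""
    return int(round(num_columns * sparsity))

# Layer sizes mentioned in this thread, at 2% and 4% sparsity.
for n in (256, 512, 2048, 4096):
    print(n, active_columns(n, 0.02), active_columns(n, 0.04))
```

At 512 columns, 2% is already about 10 active columns, so going sparser leaves very few columns to overlap on.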

My input was rgb values of a 2d image converted to binary. I tried sampling without a topology (all columns can connect to all pixel locations) and sampling based on proximity (for example columns near the center of the layer sampled rgb values at the image center). I am not sure about the necessity of a topology but it certainly helped visualizations as local sampling provided activations corresponding to its image location. Local sampling also helps to disconnect distant values in the image so that a column does not learn to combine values on opposing ends of the image. However, as long as similar inputs have overlapping activations based on the strength of the similarity you shouldn’t need topology. The agent does not need to have the same ‘sense’ of similarity as we do at this point. I would start with 1d.

One downside to topology is that the layer is not utilised fully because some receptive fields have a larger variety of inputs than others. Some columns need to learn more than the others because they sense more patterns. You absolutely have to enable some sort of boosting or bumping with topology to balance the activations but then you introduce more instability.

Yes, the agent sees a 160x160 2D screen and can control the mouse in this field. The action space is discrete, as at the moment I only allow it to click, move left/right/top/bottom by 10px, or do nothing. I map the motor neurons (currently 65k) to the range of 6 actions and choose the action with the most active winner cells (the most excited neurons). The maximum number of winner cells can be modified with a parameter, but I kept it low at 4 and am planning to reduce the number of motor neurons, which is currently so large because it mirrors layer 5.
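For illustration, here is a minimal sketch of that selection step. The modulo mapping from neuron index to action, the action names and the tie-breaking rule are assumptions, since the exact mapping is not spelled out above:

```python
from collections import Counter

# Hypothetical names for the six discrete actions.
ACTIONS = ["click", "left", "right", "up", "down", "noop"]

def select_action(winner_cells, num_actions=len(ACTIONS)):
    """Return the action index backed by the most winner cells, or None."""
    votes = Counter(cell % num_actions for cell in winner_cells)
    if not votes:
        return None
    best = max(votes.values())
    # Ties are broken in favour of the lowest action index (an assumption).
    return min(a for a, v in votes.items() if v == best)

print(ACTIONS[select_action([1, 7, 13, 2])])
```

With only 4 winner cells, ties are likely, so the tie-breaking rule can have a noticeable effect on behavior.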

I use the timeout to ensure that the agent is not learning to just do nothing (as it just gets negative reward if it clicks or the task times out). However at the moment the task does not change after the timeout so it is pretty useless.

Yes I completely agree on the thoughts on topology, that was the same from what I figured out so far. I will keep it in 1D for now and try to make that work.

I experiment with the sparsity at the moment and will shrink the layer size as it should really not have troubles with representational-space. However to really be able to make systematic experiments I need to implement serialization and maybe also a better debug-printout than console^^

In your thesis you also speak about the boosting factor and in the appendix set it to 4, but also set it to false. Was it completely disabled, or somehow used as 4 and I misinterpreted that?

Thank you for the quick answers!

Also in your framework you activate the cells which have basal and apical depolarization in layer 5 and use them to drive excitation/inhibition of motor neurons (as I believe without the proximal activated layer 5 neurons). It is grounded in biology but what are your thoughts why this works/improves performance?

I think of it so far as a filter that filters the apical depolarizations with the basal depolarized neurons to let the “more realistic” prediction come through (as D1/D2 are predicting from L5, which in turn is the same information input as the basal activation).
However that part is not quite clear to me yet. I see how the D1/D2-L5 loop can work, makes predictions and utilizes the TD-Error to excite/inhibit actions/states that are favourable, but as the distributions in L5 and D1/D2 are different, how/why should there be an overlap of apical and distal depolarization besides that they are both contextualizing and implicitly driven from the same source. (In my current version they do mostly not overlap at all).

I believe reducing the number of possible motor outputs would make the whole process a lot easier for both you and the agent.

I utilized a tweaked version of boosting. Normally boosting artificially increases a column’s overlap based on the activation frequency. I did not like this, so I used the calculated boosting factor to continuously increment the permanences of all proximal synapses of all columns (referred to as bumping). In Nupic this synaptic increment is based on the average overlap of a column with the input, not the activation frequency. In my implementation, proximal synapses always get stronger and the amount depends on the activation frequency (boosting factor) of each column. It worked better for me; less used columns grow more. There is a subsection for this in the thesis. Boosting is disabled but bumping uses the boosting factor.
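A minimal sketch of the bumping idea as described here, not the thesis implementation. The linear scaling `base_inc * boost`, the parameter names and the clipping at 1.0 are assumptions:

```python
def bump_proximal(permanences, boost_factors, base_inc=0.001):
    """Return new permanences with each column bumped by base_inc * its boost factor.

    permanences: one list of proximal synapse permanences per column.
    boost_factors: one boost factor per column (higher for rarely active columns).
    """
    bumped = []
    for column_perms, boost in zip(permanences, boost_factors):
        inc = base_inc * boost
        # Permanences only ever grow here, capped at 1.0.
        bumped.append([min(1.0, p + inc) for p in column_perms])
    return bumped

# Column 1 is assumed to be rarely active, so it has the larger boost factor.
print(bump_proximal([[0.5, 0.2], [0.5, 0.2]], [1.0, 3.0], base_inc=0.01))
```

Because the overlap scores themselves are untouched, a column only starts winning once its synapses have actually grown.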

If you can accomplish this, it allows new and in depth ways of debugging and assessing learning.

The idea was that the predictions should somehow alter the subsequent actions to produce behavior. The problem is that a cell cannot affect its postsynaptic targets in a depolarized state. There is just no way of knowing that a cell is predictive from the perspective of the other cells. So this prediction cannot be directly sent to other cells and could only be used to make the self firing easier. The main assumption is that if apical and basal sources both depolarize a cell at the same time, it causes a spike, which means the predictions from multiple sources are so strong that they fire the cell before the actual input. This was the only bioplausible way that I could find to produce behavior out of depolarization among the many mechanisms I tried.

Distal connections of layer 5 predict its own activity. Among all the layer 5 activations, some are to be avoided and some are desired. D1 learns the desired activations and D2 learns the ones that are to be avoided, based on TD error. At a given time there are predictive cells in layer 5 caused by distal input (layer 5 cells). Among these predictions some are also predicted by the D1 or D2 layers. As you said, D1 and D2 filter out the salient ones among all the predictions. If there are any, this results in the respective predictive layer 5 cells becoming ‘voluntarily active’ before the proximal input of the next step. These voluntarily active cells stimulate the motor neurons, which are otherwise configured to produce random behavior, to produce action.

Layer 5, D1 and D2 all predict the next activity of Layer 5. If connected properly, the predictions of D1 and D2 on layer 5 should contain a subset of the predictions from distal input (layer 5 itself). So there is indeed expected to be an overlap, the whole behavior production is built on this overlap; if the agent predicts something (layer 5 distal input from layer 5) and if that thing is important (layer 5 apical input from D1 and D2) activate that thing. I can clarify a bit more if you could pinpoint the source of confusion. The activations of D1/D2 and layer 5 are different at a given time but they are all mapped to the next layer 5 activation.
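The overlap can be sketched with plain sets of cell indices. This is only an illustration; the function name and the split into an "excite" and an "inhibit" set are assumptions about how the result is used downstream:

```python
def voluntary_active(distal_predicted, d1_predicted, d2_predicted):
    """Cells depolarized both distally (by L5 itself) and apically (by D1 or D2).

    Returns (excite, inhibit): cells confirmed by D1 and cells confirmed by D2.
    """
    distal = set(distal_predicted)
    excite = distal & set(d1_predicted)
    inhibit = distal & set(d2_predicted)
    return excite, inhibit

excite, inhibit = voluntary_active(
    distal_predicted=[3, 8, 15, 42],
    d1_predicted=[8, 42, 99],
    d2_predicted=[15, 77],
)
print(sorted(excite), sorted(inhibit))
```

If both returned sets stay empty during a run, the apical and distal predictions are not landing on the same cells, which matches the symptom described earlier in the thread.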


Ok, yes I read about the bumping but the parameter declaration was slightly confusing so I wasn’t sure. I stay with/without NUPICs boosting option for now but keep it in mind.

Agreed. But your motor-layer where you choose the 3 winner cells from had the same size as the other layers (512). Did you keep all layer sizes the same for simplicity?
Do you believe it hurts the architecture much if the apical connections from layer 5 voluntary active neurons to motor neurons to excite/inhibit are not learned but constantly mapped instead?
As it is learning by association, I thought it might work this way too and simplifies one layer.

The main confusion is caused by the fact that we sample in D1/D2 using a Spatial Pooler from layer 5. This means we will have a completely different columnar activation than in layer 5, as the connections are randomly initialized. Then, even though the neural activation is still the prediction for layer 5, the indices cannot really be mapped to layer 5 as they are distributed completely differently. Thus it does not seem systematic to me that basal and apical depolarizations aim at the same cells.

Maybe I misunderstood this part and we use the same columnar activation as in layer 5 without Spatial Pooler learning, that would in my intuition work more as we intended. (I did not test it yet)

I also have the problem currently that the mouse movement does not have enough effect on the sensory encoding and layer 4 neural activation to make a difference.

I believe this is a general problem that also arises when the task is more complex and detailed: individual pixels become more important but get washed out by the Spatial Pooler. In general some “Attention” mechanism could be tried here to give a certain image region more importance in the encoding (or using saccades and a limited perceptual field as you had).

In my case I could encode the mouse coordinates next to the image visuals and then combine the input at layer 4 sensory input, sizing them on the importance for the feature. Or I increase the layer size and sparsity enough to encode the details in the activation. Or do it specifically for the pixels around the mouse (attention like)
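A minimal sketch of the first option, assuming a simple contiguous-block scalar encoder written from scratch (it is not NUPIC's encoder API) and placeholder sizes:

```python
def encode_scalar(value, min_val, max_val, size, width):
    """Contiguous block of `width` active bits whose position tracks `value`."""
    clamped = min(max(value, min_val), max_val)
    frac = (clamped - min_val) / (max_val - min_val)
    start = int(round(frac * (size - width)))
    return [1 if start <= i < start + width else 0 for i in range(size)]

def encode_observation(image_bits, mouse_x, mouse_y, screen=160, size=64, width=8):
    """Concatenate the image bits with encoded mouse x and y coordinates."""
    return (list(image_bits)
            + encode_scalar(mouse_x, 0, screen - 1, size, width)
            + encode_scalar(mouse_y, 0, screen - 1, size, width))

obs = encode_observation([0] * 100, mouse_x=10, mouse_y=20)
print(len(obs), sum(obs))
```

Raising `size` and `width` relative to the image part is one way to weight the mouse position more heavily in the layer 4 input.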

Do you have thoughts on this and what might work best?

Kind regards

I have been travelling the last couple of days so I couldn’t respond quickly.

The motor layer consists of only 30 neurons and no columns. It is as shown on page 28. There are 4 rows just to visualize different states of the same neurons. There is actually a single row of neurons.

So the motor layer can be treated as 30*1 (30 columns and 1 neuron per column). The functionality can be realized by a full layer but I just simplified it this way.

I thought of this myself too at some point. It works but there is a catch. What happens when the layer 5 activation changes slightly because of a new pattern or boosting? In this case, there will not be a motor command mapped to the slightly changed layer 5 activation for whatever reason. It can work if the layers are highly stable but you basically remove noise robustness and any change requires a new mapping which kind of contradicts with what HTM does.

I think I understand your confusion.

Each layer 5 activation corresponds to different activations on D1 and D2. Suppose that at time t, L5 has activation L5(t), D1 has activation D1(t) and D2 has activation D2(t). Activations D1(t) and D2(t) take activation L5(t) as their input. Therefore D1(t), D2(t) and L5(t) all have differing activations. However, there is a relation: both D1(t) and D2(t) occur when they get L5(t) as their input, so they encode activation L5(t) in their own unique way. As time goes on, D1 and D2 learn all layer 5 activations. On the temporal memory side, D1(t) and D2(t) take their distal input from L5(t-1).

This is like motor layer association with layer 5 with a single difference. For motor layer you associate L5(t) with Motor(t). In this case, you associate D1(t-1) with L5(t) through apical connections instead of D1(t). Same with D2.

So, there are also apical connections forming to L5(t) from D1(t-1) and D2(t-1). So any activation occurring in D1 or D2, depolarizes cells in L5 that are expected to be active in the next time. What you end up is at any given time there are predictive cells in L5(t) that are distally depolarized by the activation from L5(t-1) and apically depolarized by the activation from D1(t-1) and D2(t-1).

You can achieve the same thing by directly using the same columnar activation of L5 on D1 and D2; however, that does not seem to be how biology does it, as one activation is in the cortex and one is in the striatum. These should be mapped but not the same.

This is a VERY crucial problem. I had this sort of problem from the beginning. Currently, you can only get around this by redesigning your encoders. This is a research area on its own. In my case, I was interested in the parts of the image that changed, so I tried implementing an event based visual sensor here. This allowed the agent to sense what actually changed (what matters). Maybe you can come up with an encoder that ‘magnifies’ what you need until we can have a crack at the attention problem.
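As a very simplified stand-in for such an event based sensor (plain frame differencing, not the sensor from the thesis), a sketch could look like this; the function name and threshold are assumptions:

```python
def changed_pixels(prev_frame, frame, threshold=0):
    """Binary map with 1 wherever a pixel changed by more than `threshold`."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(prev_row, row)]
        for prev_row, row in zip(prev_frame, frame)
    ]

prev_frame = [[0, 0, 0], [0, 255, 0]]
frame = [[0, 0, 0], [0, 0, 255]]  # the bright pixel moved one step right
print(changed_pixels(prev_frame, frame))
```

For the mouse task, the only active bits after a move would be around the old and new cursor positions, which is exactly the part the Spatial Pooler otherwise tends to lose.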

Any of these can be a starting point. You will probably understand what is important about the input after some trials. I really spent weeks trying to come up with something kind of universal to “zoom” in on the important bits in the input. However, biology has very sophisticated tools tailored just for this task, such as the retina and thalamus. If you are interested in how the eye does it, you can read up on neuromorphic vision sensors, but then again, this is another area of research.


I edited this bit. For time t, activations D1(t) and D2(t) take their distal input from L5(t-1). It also works if you configure temporal memory of D1 and D2 as in vanilla HTM but I found that prior layer 5 activation as distal input works better as in the architecture diagram.


This was exactly the part I was not sure about.

  1. In the Implementation part it is mentioned that the distal input is from layer 4 and layer 2 so I did not include any distal depolarization from L5(t-1) yet. I will change the layer 4 to layer 5 itself that makes sense.
  2. Is the apical depolarization used for the temporal memory (choosing winner cells in proximal bursting columns)? Until now I just let it learn and use the proximal input to decide whether it was correctly predicted or not, and then use the apical depolarization to calculate the intersection with basal for voluntary active neurons.

The second point is mainly as I thought the input is from D(t) and the computational flow:

  • Calculate L5_SP(t) and L5_TM(t) with input from L2(t) and L4(t)
  • Calculate D1/D2(t) with input from SP and TM of L5(t)
  • Calculate the Apical learning of L5(t) and voluntary activation with apical input from D1/D2(t)

However now I understand it more like:

  • Calculate L5(t) with proximal input from L4(t), distal input from L2(t),L5(t-1) and apical input from D1/D2(t-1). This includes calculating the voluntary active neurons that excite/inhibit the motor layer.
  • Calculate D1/D2(t) with proximal and distal input from SP and TM of L5(t) [Alternatively distal input from L5(t-1) suggested] This includes calculating the new TD-Error(t).
  • L5(t) is learning the apical connections with the TD-Error(t). It increments/decrements connections from the L5(t) active neurons with D1/D2(t-1) activation (if correctly depolarized).

I will try to interpret it:
We basically form a representation in D1/D2(t) that in some way encodes L5(t). These representations are learned, such that they are consistent after learning but can adapt. So in some sense they are another representation of our agent state at time t. This is utilized to calculate the TD-Error: whether the state is more favorable than its predecessor or not.

Based on this TD-Error L5(t) will learn the apical connections to D1/D2(t-1). Using ApicalAdaptInc(type) to increment the permanence of connections that did depolarize active cells and decrements with ApicalAdaptDec connections from active L5(t) cells to inactive D1/D2(t-1) cells.
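A minimal sketch of that permanence update for one of D1 or D2. The dictionary layout and function name are hypothetical, and how the TD-Error selects or scales `inc` and `dec` (the `ApicalAdaptInc(type)` part) is deliberately left to the caller:

```python
def adapt_apical(synapses, active_l5_cells, prev_d_active, inc, dec):
    """Update apical permanences of active L5(t) cells against D(t-1) activity.

    synapses: {l5_cell: {d_cell: permanence}}, updated in place.
    """
    prev = set(prev_d_active)
    for cell in active_l5_cells:
        for d_cell, perm in synapses.get(cell, {}).items():
            # Reinforce synapses to cells that were active at t-1, weaken the rest.
            delta = inc if d_cell in prev else -dec
            synapses[cell][d_cell] = min(1.0, max(0.0, perm + delta))
    return synapses

syn = {0: {10: 0.5, 11: 0.5}, 1: {10: 0.5}}
adapt_apical(syn, active_l5_cells=[0], prev_d_active=[10], inc=0.1, dec=0.05)
print(syn)
```

Cells that were not active at time t keep their apical synapses unchanged in this sketch.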

The result after learning is that when we get L4(t) column input in L5, it will be apically depolarized from D1(t-1) to activate neurons that are learned to be a favorable state in the context of the prev. state activation (represented in another form by D1(t-1) instead of L5(t-1)). In contrast, apical connections from D2(t-1) depolarize cells that are learned to be a negative state.

The final step makes it possible to generate voluntary behavior using this depolarization. We have cells apically depolarized because they are favorable in the context of the prev. state. Through the overlap with the basal depolarization (that predicts its own activity), they are filtered down to the ones that are likely to occur (if basal predictions are good) and relevant to the motor layer.

I really try to grasp the concept here in detail again, as I misunderstood some part before and tried to make sense of it. Please let me know if I got it right now.

Additionally I looked more closely into my TD-Error calculation and recognized:

  1. You do not weight the prev. predicted neurons but do weight them when calculating the average state value. Doesn’t that affect the calculation, as even the same activation would not result in zero error? Also, if we weight them with factors > 1, isn’t it changing the average value drastically, and shouldn’t we also divide by the number of times we factor?
  2. My average state value (and then also the TD-Error) currently always grows larger and larger the longer we run. This seems logical, as the agent initially learns with high volatility in the D1 region and we only add to the state values of the neurons. However, at some point this leaves the reward with no real influence on the equation anymore. Did you deal with the same problems?

Also did you calculate the TD-Error for D1/D2 separately?

Since you mentioned this, I tried various combinations for the distal input of L5(t). L5(t-1) or L4(t-1) both work. L4(t-1) alone was more stable and L5(t-1) resulted in more learning time for the same patterns. So I added both of them to the diagram but mentioned L4(t-1) in the connections section (ignoring L2/3 at this point). Same thing with distal input from D1(t) and D2(t).

On your second point:

If a layer has apical dendrites, it has 2 temporal memory stages; one for distal depolarization and one for apical depolarization. These get calculated separately and voluntary activation results from the intersection afterwards:

1- Spatial Pooler
2- Temporal Memory
3- Apical Temporal Memory
4- Voluntary activate the cells that are both distally and apically depolarized; no proximal input involved in this.
5- Clear the voluntary activation for the next time step.

You should always assume that all apical and distal input comes from t-1 of the input layer because we are trying to predict the next step. This is the case with Nupic too. The only exception is motor layer apical input which comes from time t of L5 as it is just mapping to L5 not predicting it. Therefore:

L5(t) proximal input: L4(t)
L5(t) distal input: L4(t-1) and L2(t-1). (L5(t-1) works too but L4(t-1) performed better)
L5(t) apical input: D1/D2(t-1)

D1/D2(t) proximal input: L5(t)
D1/D2(t) distal input: L5(t-1) (Again D1/D2(t-1) works but L5(t-1) performed better)
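The wiring listed above can be written down as a small table and checked mechanically. The dictionary layout and the helper are assumptions for illustration; only the sources and time offsets are taken from the list:

```python
# (source_layer, time_offset): 0 means time t, -1 means time t-1.
WIRING = {
    "L5": {
        "proximal": [("L4", 0)],
        "distal": [("L4", -1), ("L2", -1)],
        "apical": [("D1", -1), ("D2", -1)],
    },
    "D1": {"proximal": [("L5", 0)], "distal": [("L5", -1)]},
    "D2": {"proximal": [("L5", 0)], "distal": [("L5", -1)]},
    "Motor": {"apical": [("L5", 0)]},  # the one exception: maps L5(t)
}

def badly_timed_inputs(wiring, same_step_apical=("Motor",)):
    """List distal/apical inputs that do not come from the expected time step."""
    problems = []
    for layer, inputs in wiring.items():
        for kind in ("distal", "apical"):
            for source, offset in inputs.get(kind, []):
                exempt = kind == "apical" and layer in same_step_apical
                allowed = 0 if exempt else -1
                if offset != allowed:
                    problems.append((layer, kind, source, offset))
    return problems

print(badly_timed_inputs(WIRING))
```

A check like this is cheap to run whenever the region order or links change.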

To reiterate, if a layer X(t) has distal input Y(t-1), it means that the cells of X(t) form distal connections to Y(t-1) so that after the learning phase of the TM iteration, Y(t) (current Y activation) depolarizes cells on X(t) that are expected to fire in the next time step t+1 (blue cells below):

Your interpretation above sounds about right.

Yes because the state values are stored in neurons of layers. In the end you get very similar TD-Errors for both D1 and D2 but I think they have to be separate.

It wasn’t a problem in practice for me but yes, I think that should work better in terms of normalizing

I think these two may be related. The average state value should be calculated by weighting state values of individual neurons. On the error calculation, I remember switching to calculating the error by average state value instead of the individual neuron’s state value because of a problem similar to the one you observed. The state values converged to similar values for all neurons if I used the average state value in the error calculation instead of the individual state value of the corresponding neuron. So yes, it affects the calculation and I think there is room for experimentation here.

In general, there is room for experimentation for a lot of the things I mentioned above. Feel free to try things that you find more sensible and share the findings with us :slight_smile:


Yes that is true for the temporal memory with internal basal connections. However in the ExtendedTemporalMemory implementation of NUPIC when we consider ExternalBasal and Apical input it will follow the normal compute order of the regions and can end up with the time t input instead of t-1 I believe. This is why I did some tuning to make it run correctly with the Network API and could also be the reason it did not work well before. I will change back to Layer4(t-1) now and test a bit around.

Thank you for all the answers! I will try to get a base version working on the simple problem and then use the remaining time I have (and the time after the thesis) to experiment further from that. I will be happy to share any findings if I get something working. :slight_smile: I don’t know how it was for you, but as it is a niche topic there are no people (e.g. a supervisor) around besides the HTM forum to share ideas with, which makes it more difficult.

I also tried the calculation with the previous average state value, but I got convinced from the newly calculated one as the same state should result in a zero error (same guess) and that would not be the case using the prev. calculated state value and then calculate a new one with updated neuron values.
I will investigate more but I need to change something with the TD-Error as it just grows continuously due to state changes and then also increases/decrements the neuron values higher and higher leading to a circular reaction and growth towards infinity.


Oh I see, if the input layer is computed after the target layer on every iteration, then the remaining activation of the input layer at time t would actually be (t-1) from the perspective of the target layer when it is being computed. It is definitely dependent on the order of computation as you said.
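A toy illustration of that ordering effect, with no NUPIC code involved; the region names and the loop are assumptions made only to show the principle:

```python
def run(order, steps=3):
    """Return what the target layer saw from the input layer at each step."""
    input_activation = None  # activation currently stored in the input layer
    seen = []
    for t in range(steps):
        for region in order:
            if region == "input":
                input_activation = t  # input layer computes its step-t activation
            elif region == "target":
                seen.append(input_activation)  # target reads whatever is stored
    return seen

print(run(["input", "target"]))  # target computed after the input layer
print(run(["target", "input"]))  # target computed before the input layer
```

In the first ordering the target always reads the time t activation; in the second it reads the leftover from t-1 (and nothing at the very first step).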

All we have is the forum and some older papers that took a shot at HTM + RL without explicit documentation unfortunately. There are also some hot gym examples shared in the forum. There are computational models of basal ganglia which are very valuable. For the past 2 years, I went through numerous architecture iterations that I lost count of without any mentor other than the forum and some models in computational neuroscience. At least this thesis may be a tangible starting point for a barely functional HTM agent :slight_smile: On the other hand, it is exciting exactly because of this.

I am not sure if this is how it should be. The error should converge to zero not instantly become zero. Each same state iteration updates values with a learning rate that is smaller than 1.

I remember having this exact problem. I am guessing your decay functions properly? As in you decay the eligibility traces properly with lambda parameter. Also try parameters that have very small rates. Using average state value also helps.


I’m thrilled by your PhD reference point!

Neuroscientific based benchmarks for AI games should certainly help connect AI game writers to neuroscience, and possibly neuroscientists to game writing. Only problem I can think of worth (with some humor) mentioning is: later wishing you didn’t after seeing the weird stuff that can come from such work like virtual reality socks and floor pads that not only nip at your feet like sand crabs they safely allow humans to compete with a virtual critter in the moving invisible shock zone arena, then soon there is a YouTube video on how to bypass the mild shock circuit with the electronic system from an electric fence made for cows.

Neuroscience can be so bewildering that most people have good reasons for not wanting to have to on their own make sense of it all. There are currently many research papers that contain good clues but putting them all together has had to be more like a life-long quest with not much time left for other things like focusing on the marketing of a game. What you propose might be what they need.


Yes, when I take very low parameters it just takes longer, but the result of infinite growth is the same. It seems it is just too volatile and does not converge. I will need to experiment more to gain clearer insights into that problem.

How much did the environment change in your experiment?
You only use Motor activation as distal input in L4, did you also experiment with having internal connections in L4 as well?
In the case of only 4 motor neurons becoming active, layer 4 is unable to make accurate predictions of the environment in the next time step without internal distal connections. However, as there are many more layer 4 neurons (~512 total), they outweigh the impact of the active motor distal connections if activated. I am testing around to get a good balance there; both should be important, as the input changes due to both behaviour and the environment itself. (As in Numenta’s intrinsic and extrinsic sequence paper)

Yes it should converge to zero but shouldn’t the same state produce the same guess (and thus zero error) as nothing changed?
I try around with both calculations though.

You can see the environmental change from the demo. The agent moved in a highly discrete way in the thesis. So the change was capped at reachable voronoi cells * their edges. The agent could stand at the center of one of them and could face the direction of its edges.

Have you dropped the matching and active thresholds of the layer 4 distal dendrites to accommodate for the low activity count of the input?

In my calculations, at some point I introduced a decay to the neural state values as well. So same guesses resulted in the amount of decay applied in each iteration at best, instead of zero. The learning was more consistent this way. This would also help with your infinitely growing state values till you figure out what is wrong.
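To experiment with these ideas, here is a minimal sketch using textbook TD(λ) with replacing eligibility traces over per-neuron state values, the average state value in the error, and an optional value decay. It is not the thesis calculation; all names and default parameters are assumptions:

```python
def td_step(values, traces, prev_active, active, reward,
            alpha=0.1, gamma=0.9, lam=0.8, value_decay=0.0):
    """One TD update; `values` and `traces` are dicts keyed by neuron, updated in place."""
    def avg(cells):
        return sum(values[c] for c in cells) / len(cells) if cells else 0.0

    # Error from average state values of the previous and current active neurons.
    td_error = reward + gamma * avg(active) - avg(prev_active)
    for c in traces:
        traces[c] *= gamma * lam      # decay all eligibility traces
    for c in prev_active:
        traces[c] = 1.0               # replacing trace for neurons just visited
    for c in values:
        values[c] = (1.0 - value_decay) * values[c] + alpha * td_error * traces[c]
    return td_error

values, traces = {0: 0.0}, {0: 0.0}
for _ in range(2000):
    err = td_step(values, traces, prev_active=[0], active=[0], reward=1.0)
print(values[0], err)
```

In this toy run the same state is repeated with a constant reward; the error shrinks gradually rather than being zero from the start, and the value settles instead of growing without bound. If values still explode in a real run, the trace decay and the learning rate are the first things to check.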


@sunguralikaan, is there any source code for this project? If so, could you please share a link to it?