HTM Based Autonomous Agent


This was exactly the part I was not sure about.

  1. In the Implementation part it is mentioned that the distal input comes from layer 4 and layer 2, so I did not include any distal depolarization from L5(t-1) yet. I will change layer 4 to layer 5 itself; that makes sense.
  2. Is the apical depolarization used for the temporal memory (choosing winner cells in proximally bursting columns)? Until now I just let it learn, used the proximal input to decide whether a cell was correctly predicted, and then used the apical depolarization to calculate the intersection with the basal depolarization for the voluntary active neurons.

The second point arises mainly because I thought the input comes from D(t) and that the computational flow is:

  • Calculate L5_SP(t) and L5_TM(t) with input from L2(t) and L4(t)
  • Calculate D1/D2(t) with input from SP and TM of L5(t)
  • Calculate the Apical learning of L5(t) and voluntary activation with apical input from D1/D2(t)

However now I understand it more like:

  • Calculate L5(t) with proximal input from L4(t), distal input from L2(t) and L5(t-1), and apical input from D1/D2(t-1). This includes calculating the voluntary active neurons that excite/inhibit the motor layer.
  • Calculate D1/D2(t) with proximal and distal input from the SP and TM of L5(t) [alternatively, distal input from L5(t-1) was suggested]. This includes calculating the new TD-Error(t).
  • L5(t) learns the apical connections with TD-Error(t). It increments/decrements connections from the active L5(t) neurons to the D1/D2(t-1) activation (if correctly depolarized).

I will try to interpret it:
We basically form a representation in D1/D2(t) that in some way encodes L5(t). These representations are learned such that they are consistent after learning but can still adapt. So in some sense they are another representation of our agent's state at time t. This is utilized to calculate the TD error, i.e. whether the state is more favorable than its prior or not.

Based on this TD error, L5(t) will learn the apical connections to D1/D2(t-1). It uses ApicalAdaptInc(type) to increment the permanence of connections that did depolarize active cells, and ApicalAdaptDec to decrement connections from active L5(t) cells to inactive D1/D2(t-1) cells.
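The increment/decrement rule above could be sketched roughly like this. This is a toy stand-alone sketch: the dict-of-permanences representation, the rate values, and the function name are my own assumptions, and the TD-error modulation is reduced to a fixed increment/decrement for brevity.

```python
# Hypothetical sketch of the apical learning rule described above:
# synapses from active L5(t) cells to active D1/D2(t-1) cells are
# strengthened (ApicalAdaptInc), those to inactive cells weakened
# (ApicalAdaptDec). In the full rule, the TD error would modulate
# the sign/magnitude of these updates; that is omitted here.

APICAL_ADAPT_INC = 0.05  # arbitrary illustrative rates
APICAL_ADAPT_DEC = 0.02

def adapt_apical(permanences, active_l5, active_d_prev):
    """permanences: dict mapping (l5_cell, d_cell) -> permanence in [0, 1]."""
    for (l5_cell, d_cell), perm in permanences.items():
        if l5_cell not in active_l5:
            continue  # only active L5(t) cells learn
        if d_cell in active_d_prev:
            perm = min(1.0, perm + APICAL_ADAPT_INC)  # correctly depolarizing
        else:
            perm = max(0.0, perm - APICAL_ADAPT_DEC)  # connection to inactive cell
        permanences[(l5_cell, d_cell)] = perm
    return permanences

perms = {(1, 10): 0.30, (1, 11): 0.30, (2, 10): 0.30}
adapt_apical(perms, active_l5={1}, active_d_prev={10})
# (1,10) strengthened, (1,11) weakened, (2,10) unchanged (L5 cell 2 inactive)
```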

The result after learning is that when L5 receives L4(t) column input, it is apically depolarized from D1(t-1), activating neurons that have learned to represent a favorable state in the context of the previous state activation (represented in another form by D1(t-1) instead of L5(t-1)). In contrast, apical connections from D2(t-1) depolarize cells that have learned to represent a negative state.

The final step generates voluntary behavior using this depolarization. We have cells that are apically depolarized because they are favorable in the context of the previous state. Through the overlap with the basal depolarization (which predicts the layer's own activity), they are filtered down to the ones that are likely to occur (if the basal predictions are good) and relevant to the motor layer.

I am really trying to grasp the concept in detail again, as I misunderstood part of it before and tried to make sense of it. Please let me know if I got it right now.

Additionally I looked more closely into my TD-Error calculation and recognized:

  1. You do not weight the previously predicted neurons in the error, but you do weight them when calculating the average state value. Doesn’t that affect the calculation, since even the same activation would not result in zero error? Also, if we weight them with factors > 1, doesn’t that change the average value drastically, and shouldn’t we also divide by the number of times we apply the factor?
  2. My average state value (and with it the TD error) currently grows larger and larger the longer we run. This seems logical, as the agent initially learns with high volatility in the D1 region and we only ever add to the state values of the neurons. However, at some point this means the reward no longer has any real influence on the equation. Did you deal with the same problem?
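On point 1, a toy numeric illustration of the weighting concern (all numbers are made up): multiplying some neurons' state values by a factor > 1 without renormalizing shifts the "average" upward even when the activation has not changed, whereas dividing by the sum of the weights keeps it stable.

```python
# Four neurons with identical state values; the first one was
# (hypothetically) correctly predicted and gets weight 2.0.
state_values = [1.0, 1.0, 1.0, 1.0]
weights      = [2.0, 1.0, 1.0, 1.0]

plain_avg    = sum(state_values) / len(state_values)
weighted_sum = sum(w * v for w, v in zip(weights, state_values))

unnormalized = weighted_sum / len(state_values)  # divide by neuron count: inflated
normalized   = weighted_sum / sum(weights)       # divide by weight sum: unchanged

print(plain_avg, unnormalized, normalized)  # → 1.0 1.25 1.0
```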

Also did you calculate the TD-Error for D1/D2 separately?


Since you mentioned this, I tried various combinations for the distal input of L5(t). L5(t-1) and L4(t-1) both work. L4(t-1) alone was more stable, while L5(t-1) required more learning time for the same patterns. So I added both of them to the diagram but mentioned only L4(t-1) in the connections section (ignoring L2/3 at this point). The same applies to the distal input from D1(t) and D2(t):

On your second point:

If a layer has apical dendrites, it has two temporal memory stages: one for distal depolarization and one for apical depolarization. These are calculated separately, and the voluntary activation results from their intersection afterwards:

1- Spatial Pooler
2- Temporal Memory
3- Apical Temporal Memory
4- Voluntarily activate the cells that are both distally and apically depolarized; no proximal input is involved in this.
5- Clear the voluntary activation for the next time step.
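The five stages above could be sketched like this. It is a minimal stand-alone sketch: activations are plain sets of cell indices, and the SP/TM internals (stages 1–3) are elided, with their hypothetical results passed in directly.

```python
# Minimal sketch of one iteration of a layer with apical dendrites.
def layer_step(proximal_active, distal_pred, apical_pred):
    # 1. Spatial Pooler  -> active columns (stand-in: given directly)
    # 2. Temporal Memory -> distally depolarized (predicted) cells
    # 3. Apical Temporal Memory -> apically depolarized cells
    # 4. Voluntary activation: the intersection of the two
    #    depolarizations; no proximal input involved.
    voluntary = distal_pred & apical_pred
    active = proximal_active | voluntary
    return active, voluntary

active, voluntary = layer_step(
    proximal_active={1, 2},
    distal_pred={2, 5, 9},
    apical_pred={5, 7, 9},
)
print(sorted(voluntary))  # → [5, 9]
# 5. `voluntary` would then be cleared before the next time step.
```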

You should always assume that all apical and distal input comes from t-1 of the input layer, because we are trying to predict the next step. This is the case with NuPIC too. The only exception is the motor layer's apical input, which comes from time t of L5, as it is just mapping to L5, not predicting it. Therefore:

L5(t) proximal input: L4(t)
L5(t) distal input: L4(t-1) and L2(t-1). (L5(t-1) works too but L4(t-1) performed better)
L5(t) apical input: D1/D2(t-1)

D1/D2(t) proximal input: L5(t)
D1/D2(t) distal input: L5(t-1) (Again D1/D2(t-1) works but L5(t-1) performed better)
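The wiring above, written out as a plain lookup table. This is just a sketch for clarity, not an actual NuPIC/Network API configuration; times are relative offsets (0 means t, -1 means t-1).

```python
# Connection table for the architecture described above.
CONNECTIONS = {
    "L5": {
        "proximal": [("L4", 0)],
        "distal":   [("L4", -1), ("L2", -1)],  # L5(t-1) works too, L4(t-1) performed better
        "apical":   [("D1", -1), ("D2", -1)],
    },
    "D1": {"proximal": [("L5", 0)], "distal": [("L5", -1)]},
    "D2": {"proximal": [("L5", 0)], "distal": [("L5", -1)]},
    # The one exception: motor apical input comes from L5 at time t,
    # since it maps to L5 rather than predicting it.
    "Motor": {"apical": [("L5", 0)]},
}

print(CONNECTIONS["L5"]["apical"])  # → [('D1', -1), ('D2', -1)]
```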

To reiterate, if a layer X(t) has distal input Y(t-1), it means that the cells of X(t) form distal connections to Y(t-1), so that after the learning phase of the TM iteration, Y(t) (the current Y activation) depolarizes cells of X(t) that are expected to fire in the next time step t+1 (blue cells below):

Your interpretation above sounds about right.

Yes, because the state values are stored in the neurons of the layers. In the end you get very similar TD errors for both D1 and D2, but I think they have to be separate.

It wasn’t a problem in practice for me, but yes, I think that should work better in terms of normalization.

I think these two may be related. The average state value should be calculated by weighting the state values of the individual neurons. In the error calculation, I remember switching to the average state value instead of the individual neuron’s state value because of a problem similar to the one you observed. The state values converged to roughly similar values for all neurons when I used the average state value in the error calculation instead of the individual state value of the corresponding neuron. So yes, it affects the calculation, and I think there is room for experimentation here.
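The average-state-value variant could be sketched as follows. This is a hedged sketch, not the thesis implementation: the error follows the standard TD(0) shape, the discount factor is arbitrary, and neuron weighting is left out (all active neurons count equally).

```python
# TD error computed from the *average* state value of the active
# population, rather than each neuron's individual value.
GAMMA = 0.9  # discount factor (arbitrary illustrative value)

def avg_state_value(state_values, active_cells):
    active = [state_values[c] for c in active_cells]
    return sum(active) / len(active) if active else 0.0

def td_error(reward, state_values, prev_active, curr_active):
    v_prev = avg_state_value(state_values, prev_active)
    v_curr = avg_state_value(state_values, curr_active)
    return reward + GAMMA * v_curr - v_prev

values = {0: 0.5, 1: 0.5, 2: 1.0}
# error = reward + GAMMA * V(curr) - V(prev) = 0.0 + 0.9*1.0 - 0.5 = 0.4
print(td_error(0.0, values, prev_active={0, 1}, curr_active={2}))
```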

In general, there is room for experimentation for a lot of the things I mentioned above. Feel free to try things that you find more sensible and share the findings with us :slight_smile:


Yes, that is true for the temporal memory with internal basal connections. However, in the ExtendedTemporalMemory implementation of NuPIC, when we consider the ExternalBasal and apical input, it follows the normal compute order of the regions and can end up with the time-t input instead of t-1, I believe. This is why I did some tuning to make it run correctly with the Network API, which could also be the reason it did not work well before. I will change back to Layer4(t-1) now and test around a bit.

Thank you for all the answers! I will try to get a base version working on the simple problem and then use the remaining time I have (and the time after the thesis) to experiment further from there. I will be happy to share any findings if I get something working. :slight_smile: I don’t know how it was for you, but since this is a niche topic there is nobody (e.g. a supervisor) around besides the HTM forum to share ideas with, which makes it more difficult.

I also tried the calculation with the previous average state value, but I was convinced by the newly calculated one, as the same state should result in a zero error (same guess), which would not be the case when using the previously calculated state value and then computing a new one with updated neuron values.
I will investigate more, but I need to change something about the TD error, as it just grows continuously due to state changes, which in turn pushes the neuron values higher and higher, leading to a circular reaction and growth towards infinity.


Oh I see, if the input layer is computed after the target layer on every iteration, then the remaining activation of the input layer at time t would actually be (t-1) from the perspective of the target layer when it is being computed. It is definitely dependent on the order of computation as you said.

All we have is the forum and some older papers that took a shot at HTM + RL without explicit documentation unfortunately. There are also some hot gym examples shared in the forum. There are computational models of basal ganglia which are very valuable. For the past 2 years, I went through numerous architecture iterations that I lost count of without any mentor other than the forum and some models in computational neuroscience. At least this thesis may be a tangible starting point for a barely functional HTM agent :slight_smile: On the other hand, it is exciting exactly because of this.

I am not sure if this is how it should be. The error should converge to zero, not instantly become zero. Each iteration over the same state updates the values with a learning rate smaller than 1.

I remember having this exact problem. Does your decay function properly? That is, do you decay the eligibility traces properly with the lambda parameter? Also try parameters with very small rates. Using the average state value also helps.
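The eligibility-trace decay mentioned above could look roughly like this. This is a TD(lambda)-style sketch under my own assumptions (accumulating traces, arbitrary rates), not the thesis code.

```python
# Eligibility traces decayed by gamma * lambda each step; state values
# are then updated by the trace-weighted TD error with a small rate.
GAMMA  = 0.9    # discount factor
LAMBDA = 0.8    # trace decay parameter
ALPHA  = 0.01   # small learning rate, as recommended above

def update(traces, state_values, active_cells, td_error):
    # Decay every existing trace...
    for cell in traces:
        traces[cell] *= GAMMA * LAMBDA
    # ...then bump the traces of currently active cells.
    for cell in active_cells:
        traces[cell] = traces.get(cell, 0.0) + 1.0
    # Credit assignment: recently active cells receive more of the error.
    for cell, trace in traces.items():
        state_values[cell] = state_values.get(cell, 0.0) + ALPHA * td_error * trace

traces, values = {}, {}
update(traces, values, active_cells={0}, td_error=1.0)
update(traces, values, active_cells={1}, td_error=1.0)
# Cell 0's trace has decayed to 0.9 * 0.8 = 0.72, so it still receives
# a (smaller) share of the second step's error.
print(round(traces[0], 3), round(values[0], 4))
```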


I’m thrilled by your PhD reference point!

Neuroscience-based benchmarks for AI games should certainly help connect AI game writers to neuroscience, and possibly neuroscientists to game writing. The only problem I can think of worth mentioning (with some humor) is later wishing you hadn’t, after seeing the weird stuff that can come from such work: virtual reality socks and floor pads that not only nip at your feet like sand crabs but safely allow humans to compete with a virtual critter in a moving invisible shock-zone arena. Then soon there is a YouTube video on how to bypass the mild shock circuit with the electronics from an electric fence made for cows.

Neuroscience can be so bewildering that most people have good reasons for not wanting to make sense of it all on their own. There are currently many research papers that contain good clues, but putting them all together has been more of a life-long quest, leaving little time for other things like focusing on the marketing of a game. What you propose might be what they need.



Yes, when I use very low parameters it just takes longer, but the result of infinite growth is the same. It seems it is just too volatile and does not converge. I will need to experiment more to get clearer insights into that problem.

How much did the environment change in your experiment?
You only use motor activation as distal input in L4; did you also experiment with internal connections in L4 as well?
With only 4 motor neurons becoming active, layer 4 is unable to make accurate predictions of the environment in the next time step without internal distal connections. However, since there are many more neurons overall (~512 total), they outweigh the impact of the active motor distal connections once activated. I am testing around to find a good balance there; both should be important, as the input changes due to both behaviour and the environment itself (as in Numenta’s intrinsic and extrinsic sequences paper).

Yes, it should converge to zero, but shouldn’t the same state produce the same guess (and thus zero error), since nothing changed?
I will try both calculations though.


You can see the environmental change in the demo. The agent moved in a highly discrete way in the thesis, so the change was capped at the number of reachable Voronoi cells times their edges. The agent could stand at the center of one of them and face the direction of one of its edges.

Have you dropped the matching and activation thresholds of the layer 4 distal dendrites to accommodate the low activity count of the input?

In my calculations, at some point I introduced a decay on the neural state values as well. So identical guesses resulted, at best, in an error equal to the amount of decay applied each iteration, instead of zero. The learning was more consistent this way. This would also help with your infinitely growing state values until you figure out what is wrong.
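That decay could be as simple as the following sketch (the rate is an arbitrary assumption of mine): every iteration, each neuron's state value shrinks by a small fraction, which bounds the otherwise unbounded growth and leaves a small residual error for repeated states.

```python
# Multiplicative decay applied to every neuron's state value each
# iteration; identical guesses then produce a small residual error
# (the decay amount) rather than exactly zero.
VALUE_DECAY = 0.001  # arbitrary illustrative rate

def decay_state_values(state_values):
    for cell in state_values:
        state_values[cell] *= (1.0 - VALUE_DECAY)
    return state_values

values = {0: 1.0, 1: 0.5}
decay_state_values(values)
print(values)  # values shrink slightly toward zero each call
```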