This was exactly the part I was not sure about.

- In the Implementation part it is mentioned that the distal input comes from layer 4 and layer 2, so I did not include any distal depolarization from L5(t-1) yet. I will change layer 4 to L5 itself; that makes sense.
- Is the apical depolarization used for the temporal memory (choosing winner cells in proximally bursting columns)? Until now I just let it learn and used the proximal input to decide whether a column was correctly predicted or not, and then used the apical depolarization to calculate the intersection with the basal depolarization for
**voluntary active neurons**.

The second point arose mainly because I thought the input comes from D(t), with the computational flow:

- Calculate L5_SP(t) and L5_TM(t) with input from L2(t) and L4(t)
- Calculate D1/D2(t) with input from SP and TM of L5(t)
- Calculate the Apical learning of L5(t) and voluntary activation with apical input from D1/D2(t)

However, now I understand it more like this:

- Calculate L5(t) with proximal input from L4(t), distal input from L2(t) and L5(t-1), and apical input from D1/D2(t-1). This includes calculating the voluntary active neurons that excite/inhibit the motor layer.
- Calculate D1/D2(t) with proximal and distal input from the SP and TM of L5(t) [alternatively, distal input from L5(t-1) was suggested]. This includes calculating the new TD-Error(t).
- L5(t) learns the apical connections with TD-Error(t): it increments/decrements connections from the active L5(t) neurons to the D1/D2(t-1) activation (if correctly depolarized).
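The three steps above can be sketched as a per-timestep loop. This is only my reading of the flow, not htm.core code; the compute functions are hypothetical placeholders whose internals are left abstract, so the sketch only pins down *which* timestep each input comes from:

```python
# Hypothetical sketch of the revised per-timestep flow. The three callables
# (compute_l5, compute_d1d2, learn_apical) are placeholders, not real API.
def step(l4_t, l2_t, l5_prev, d1d2_prev, compute_l5, compute_d1d2, learn_apical):
    # 1) L5(t): proximal from L4(t), distal from L2(t) and L5(t-1),
    #    apical from D1/D2(t-1) -- i.e. the *previous* striatal state
    l5_t = compute_l5(proximal=l4_t, distal=(l2_t, l5_prev), apical=d1d2_prev)
    # 2) D1/D2(t) from L5(t); this also yields the new TD-Error(t)
    d1d2_t, td_error = compute_d1d2(l5_t)
    # 3) apical learning: active L5(t) cells adapt their connections to the
    #    D1/D2(t-1) activity, modulated by the fresh TD-Error(t)
    learn_apical(l5_t, d1d2_prev, td_error)
    return l5_t, d1d2_t
```

The key point the sketch encodes is that D1/D2(t-1) is used twice: once as apical input when computing L5(t), and once as the presynaptic side of the apical learning step.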

I will try to interpret it:

We basically form a representation in D1/D2(t) that in some way encodes L5(t). These representations are learned, such that they are consistent after learning but can still adapt. So in some sense they are another representation of our agent state at time t. This is used to calculate the TD error, i.e. whether the state is more favorable than its predecessor or not.

Based on this TD error, L5(t) learns the apical connections to D1/D2(t-1): ApicalAdaptInc(type) increments the permanence of connections that did depolarize active cells, and ApicalAdaptDec decrements connections from active L5(t) cells to inactive D1/D2(t-1) cells.
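As a sanity check on my understanding, here is a minimal sketch of that adaptation rule. The parameter names ApicalAdaptInc/ApicalAdaptDec come from the discussion; everything else (dense matrix layout, per-cell loop, clipping to [0, 1]) is my own assumption for illustration:

```python
import numpy as np

def adapt_apical(perm, active_l5, active_d1_prev,
                 apical_adapt_inc=0.05, apical_adapt_dec=0.02):
    """Sketch of the apical learning step (my reading, not htm.core code).

    perm: (n_l5, n_d1) permanence matrix from L5 cells to D1/D2(t-1) cells.
    For each active L5(t) cell, increment permanences to D1/D2(t-1) cells
    that were active (and thus could have depolarized it) and decrement
    permanences to the inactive ones.
    """
    perm = perm.copy()
    active_mask = np.zeros(perm.shape[1], dtype=bool)
    active_mask[list(active_d1_prev)] = True
    for cell in active_l5:
        perm[cell, active_mask] += apical_adapt_inc
        perm[cell, ~active_mask] -= apical_adapt_dec
    return np.clip(perm, 0.0, 1.0)
```

A real implementation would of course adapt only existing segments/synapses rather than a dense matrix, and would gate the update on whether the cell was correctly depolarized.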

The result after learning is that when L5 receives L4(t) column input, it will be apically depolarized from D1(t-1), activating neurons that have learned to represent a favorable state in the context of the previous state activation (represented in another form by D1(t-1) instead of L5(t-1)). In contrast, apical connections from D2(t-1) depolarize cells that have learned to represent a negative state.

The final step uses this depolarization to generate voluntary behavior. We have cells that are apically depolarized because they are favorable in the context of the previous state. Through the overlap with the basal depolarization (which predicts the layer's own activity), they are filtered down to the likely-occurring ones (if the basal predictions are good) that are relevant to the motor layer.
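In set terms, the filtering I mean is just an intersection; splitting it by apical source (D1 excites, D2 inhibits the motor layer) gives this tiny sketch, where the function name and set representation are my own illustration:

```python
def voluntary_activity(apical_d1, apical_d2, basal_predicted):
    """Sketch: filter apically depolarized cells by the basal self-prediction.

    Cells favorable in the previous-state context (apical, via D1(t-1)) that
    are also likely to occur (basal) excite the motor layer; the D2-driven
    counterpart inhibits it.
    """
    basal = set(basal_predicted)
    excite = set(apical_d1) & basal
    inhibit = set(apical_d2) & basal
    return excite, inhibit
```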

I am really trying to grasp the concept in detail again, as I misunderstood some parts before and tried to make sense of them. Please let me know if I got it right now.

Additionally, I looked more closely into my TD-error calculation and noticed:

- You do not weight the previously predicted neurons, but you do weight them when calculating the average state value. Doesn't that affect the calculation, since even an identical activation would not result in zero error? Also, if we weight them with factors > 1, doesn't that change the average value drastically, and shouldn't we also divide by the number of times we apply the factor?
- My average state value (and therefore also the TD error) currently grows larger and larger the longer we run. This seems logical, as the agent initially learns with high volatility in the D1 region and we only ever add to the neurons' state values. However, at some point this makes the reward have no real influence on the equation anymore. Did you deal with the same problems?
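For comparison, here is the textbook TD(0) form I am measuring my version against (standard RL notation, alpha/gamma are not taken from the referenced implementation). Because it moves values a fraction alpha toward the discounted target instead of only accumulating, an identical activation with zero reward gives exactly zero error, and under constant reward the values converge toward reward/(1-gamma) instead of growing without bound:

```python
def td_update(values, prev_state, cur_state, reward, alpha=0.1, gamma=0.9):
    """TD(0) sketch over cell-based states (my comparison baseline).

    values: dict cell -> learned value; a state's value is the mean over
    its active cells. Returns TD-Error(t) and nudges the previous state's
    cells toward the discounted target reward + gamma * V(cur).
    """
    v_prev = sum(values.get(c, 0.0) for c in prev_state) / max(len(prev_state), 1)
    v_cur = sum(values.get(c, 0.0) for c in cur_state) / max(len(cur_state), 1)
    td_error = reward + gamma * v_cur - v_prev
    for c in prev_state:
        values[c] = values.get(c, 0.0) + alpha * td_error
    return td_error
```

With this form the error, not the raw value sum, drives learning, which is what keeps the reward term from being drowned out over time.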

Also, did you calculate the TD error for D1/D2 separately?