ANN: DEW Moving average



I’m learning reinforcement learning, and to clarify my thoughts and check what I have learned I wrote a quick introduction:

Hope you like it.
Feedback welcome …


Cool, thanks! :smiley:


Thank you for the write-up. As someone who couples HTM with temporal difference learning, TD(λ), I find it one of the easier reads on the topic. I wish more people approached these things in a similar fashion.

I have a question to jump-start a potentially valuable discussion and to bring more attention to this.

The max() in the formula is the prediction part, i.e. we pick the largest Q-value among all subsequent states.
That is the most probable direction the signal will go, according to the information we currently have.

What would be the use of, or the reasoning behind, calculating this “prediction”? It implies that there is a policy that chooses the transition leading to the maximum state value, yet it is stated at the top that the write-up ignores any policy. I believe you are trying to relate this to HTM; correct me if I am wrong on that.

To my knowledge, the premise of that value calculation is to get an error between the expectation at t and the calculated reward at t+1. You would then modify the network (HTM) according to this error. So temporal difference learning calculates those values depending on the policy. As a result, the most probable behavior at the next step may or may not be the most beneficial. If you were talking about Q-learning, it would make sense to state that, because in that algorithm the state values are independent of the policy.

Just to present an alternative perspective: while other applications may have a set of possible actions in the current state, that may or may not be the case for HTM; states and actions may be the same thing in terms of HTM. For example, Layer 5 has a state which outputs an action, and the activation of the layer may be the action itself. It doesn’t necessarily “choose” an action. Changing a state may just output the action bound/glued/conditioned to it, like a reflex, with no “choosing” or “decisions”. The glue (synapses) between Layer 5 and the action may be modified by the error calculated in your write-up. At least that is how the architecture I am experimenting with operates.

Thanks for bringing this up. :slight_smile:


If I understand your question correctly: in this case there is no choice of which action to pick next. The time series “forces” the actions on us. We just follow along and update the Q-table.
Then the prediction is simply np.argmax(Q).
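
A minimal sketch of that forced-transition case, assuming the time series has already been discretized into integer symbols, that each symbol is a state, and that the “action” is simply the next symbol. The names `train_q_table` and `predict_next`, the values of `alpha` and `gamma`, and the reward of 1 for every observed transition are illustrative assumptions, not taken from the write-up. It uses plain Python lists so it runs without NumPy.

```python
def train_q_table(series, n_states, alpha=0.1, gamma=0.9, epochs=50):
    """Build a Q-table where Q[s][a] approximates the value of going s -> a."""
    Q = [[0.0] * n_states for _ in range(n_states)]
    for _ in range(epochs):
        for t in range(len(series) - 1):
            s, a = series[t], series[t + 1]  # the series dictates the action
            reward = 1.0                     # assumption: reward each observed transition
            target = reward + gamma * max(Q[a])   # max() is the "prediction" part
            Q[s][a] += alpha * (target - Q[s][a])
    return Q


def predict_next(Q, s):
    """Index of the largest value in row s (what np.argmax(Q[s]) would return)."""
    row = Q[s]
    return max(range(len(row)), key=row.__getitem__)


# Usage: a repeating 0, 1, 2 pattern
series = [0, 1, 2] * 5
Q = train_q_table(series, n_states=3)
next_symbol = predict_next(Q, 0)
```

Because only transitions that actually occur in the series are ever updated, the argmax of a row here just picks the transition with the highest accumulated value for that state.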

In the general case:
We have to use a “prediction” (which is synonymous with max(Q)) because we don’t have the real Q-value available; we only have the current approximation in the table.
The Q-table holds all the approximate ACTION-VALUES (i.e. average discounted returns).

When we have to decide on a transition from state X to the next state, our best option is to pick the column with the max value in row X.
We don’t have to pick a different one to optimize future paths of actions, because the current values are the best long-term values we have calculated so far,
in the sense that they have already been calculated and are in the table.

max(Q) ≈ Gt

We can never get Gt itself, because we would have to know all the possible futures, so max(Q) is the closest thing we have. The action-value Q tends toward the path of actions with the highest Gt.

Minor detail: max(Q) is the VALUE of the next specific action, while Gt is the TOTAL RETURN of all the actions taken in the future.

As far as I understand it, RL is “pulling” the future into the single decision of which action to take next, rather than optimizing over “paths”, i.e. it is sort of like gradient descent, constantly approaching some optimal value rather than finding it at once.
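
To make the distinction concrete, here is a small sketch of how Gt would be computed if the whole future reward sequence were known. The function name `discounted_return` and the sample rewards are assumptions for illustration only; in practice the future is not available, which is why the table value stands in for it.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...

    `rewards` is the list [R_{t+1}, R_{t+2}, ...] along ONE particular future.
    """
    g = 0.0
    # Work backwards so each step is just: reward now + discounted rest.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Usage: one hypothetical future with three rewards
g_t = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A Q-value is an estimate of the average of such returns over the futures that can follow a given action, so it is a single number per (state, action) pair rather than a sum over one known path.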


I think integration of HTM and RL, or any other algorithm, should happen at a different level.
I’ve been looking around, and it seems all models are STATE based. The big thing that I see in HTM and not in any other model is the dynamic creation of new states and the abundance of available room, courtesy of choose(40 of 2000).
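
For a sense of how much room that is, the count of distinct patterns with 40 active bits out of 2000 can be computed directly. This is only a sketch; the function name `sdr_capacity` is made up for illustration, and the 2000/40 figures are the ones quoted above.

```python
import math


def sdr_capacity(n_cells=2000, n_active=40):
    """Number of ways to choose n_active cells out of n_cells."""
    return math.comb(n_cells, n_active)  # requires Python 3.8+


capacity = sdr_capacity()
# Report only the order of magnitude; the exact integer is very long.
print(f"roughly 10^{len(str(capacity)) - 1} distinct patterns")
```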

HTM is implemented via a dynamic variable-order Markov chain (VOMC); RL, if used, has to be a replacement for that model.

So you either replace VOMC with RL or merge them in some way.
But there is also the constraint: how can RL be made to represent connected neurons?!

The other option, which I’m more inclined toward, is to use RL when you interconnect HTMs, where HTM is the MODEL part of an RL system.


I just corrected the paragraph; I only now realized it was stating the opposite. Sorry if it caused confusion.

So you were actually talking about Q-learning. I thought you were talking about temporal difference learning RL algorithms in general. Q-learning (sometimes called off-policy TD) is a specific variation of the temporal difference learning algorithm that came out later. This is a nice and easy-to-read overview of the TD RL umbrella. [1]

In the context of HTM, temporal difference learning lambda, TD(λ) (specifically the backward view), is more in line with biology (the basal ganglia’s use of dopamine as a learning signal).

In Q-learning, you calculate the Q(state, action) values, which create a map that the agent can use to output the optimal transitions for every state in the long term. In temporal difference learning, you calculate the values for the state itself, not for the action a taken in state s. This indirectly takes into account what the agent statistically does in the subsequent steps. So a state’s value is actually tied to its policy, which is what happens in biology.

Q-learning implies that the agent has a decision-making mechanism where it weighs all the options and “chooses” one. That is not the case in biology at the low level. Learning in biology is not about optimizing the reward path; it is about reducing the difference between the expected value at timestep t and the actual value at t+1, which is the error/learning signal: dopamine. The error is already calculated: R_{t+1} + γ G_{t+1} − Q_t. [G_{t+1} would be V_{t+1} and Q_t would be V_t in TD(λ).]
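
As a rough sketch of what that looks like for state values, here is a tabular backward-view TD(λ) update with accumulating eligibility traces. The function name `td_lambda_episode`, the dictionary representation of V, and the parameter values are assumptions for illustration; it says nothing about how the values would be attached to HTM cells or columns.

```python
def td_lambda_episode(V, transitions, alpha=0.1, gamma=0.9, lam=0.8):
    """Update state values V (dict: state -> value) in place.

    `transitions` is a list of (state, reward, next_state) tuples produced by
    whatever policy the agent is following; no action is chosen here.
    Returns the list of TD errors, one per step.
    """
    trace = {s: 0.0 for s in V}  # eligibility trace per state
    errors = []
    for s, r, s_next in transitions:
        # TD error: R_{t+1} + gamma * V_{t+1} - V_t
        delta = r + gamma * V[s_next] - V[s]
        errors.append(delta)
        trace[s] += 1.0  # mark the state just visited as eligible
        for state in V:
            # Recently visited states receive a share of the error.
            V[state] += alpha * delta * trace[state]
            trace[state] *= gamma * lam  # eligibility decays every step
    return errors


# Usage: two steps, with a reward only on the second one
V = {"A": 0.0, "B": 0.0, "END": 0.0}
errors = td_lambda_episode(V, [("A", 0.0, "B"), ("B", 1.0, "END")])
```

The single error value plays the role of the learning signal: it is computed once per step and then applied to every state in proportion to how recently that state was active.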

You can still treat individual cells or columns of HTM as states and calculate state values for the individual cells/columns that are active at time t+1. That is mostly what the ANN folks are doing.

We are on the same page on this, and that is what biology does, to my understanding. In computational models of the basal ganglia (which may be modeled as a modified HTM with state [neuron] values), the striatum specifically is believed to function similarly to temporal difference learning through its dopamine receptors D1 and D2, in conjunction with the cortex. [2]


thanks for the nice words :slight_smile:


I forgot to mention the most important thing.
HTM is binary; it does not use real numbers.
My thinking is that a genuine universal algorithm has to be binary, 0|1.