Theoretical implementation for extending predictions backwards through time


#1

An important capability that I need for the HTM-based RL system I am developing is the ability to extend predictions further and further back through time the more often a particular sequence is encountered. I was initially doing this by walking through the activation history to grow distal connections from correctly predicted cells to cells which were active multiple time steps in the past, using a TD(λ)-like strategy. This strategy has a few drawbacks:

  1. It requires holding onto a significant amount of state information (for cells and, more significantly, synapses), which has a big impact on memory requirements. Memory consumption grows quickly with each time-step that you need to refer back to.

  2. Walking the history takes time. This can cause a significant performance hit depending on how far back through the history you need to go.

  3. Assuming you do not have enough memory to hold onto a complete history, or that walking the complete history would be too time consuming, you are forced to limit how far back a prediction can be extended.

I decided to see if I could theorize a strategy which doesn’t require holding onto any more than a couple of time-steps of historical state information, and which does not impose an arbitrary limit on how far back in time a prediction could be extended. I also wanted an implementation that at least appears to be biologically plausible (granted, I have zero knowledge of neuroscience, so I am sure there are holes in my logic…).

I had a thought about the distal synapses as “coincidence detectors” concept (i.e. if enough of them are activated within a particular time window, the cell becomes predictive). It occurred to me that there is a possible property of a continuous-time system that may have been overlooked in a discrete-time implementation. Namely, the active potential synapses during that time window could include not only synapses from cells active for the last element in a sequence, but also some percentage of synapses from cells active for the element before that. Those potential synapses might therefore also be strengthened each time the sequence is encountered, and eventually become connected. This would allow the cell to become predictive just a bit further back in time, shifting the time window slightly to the left. This in turn would allow even earlier potential synapses to be strengthened, eventually also becoming connected and shifting the time window further to the left, and so on. In theory this process would lead to predictions being extended further and further back in time the more often a sequence is encountered, with no theoretical limit.

In theory, this behavior could be emulated in a discrete-time implementation by not only creating/strengthening connections with cells active at T-1, but also creating a (configurable) number of new connections with cells active at T-2, up to some (configurable) maximum that is equal to or greater than the activation threshold. After enough iterations of the sequence, the cell will be predicted from T-2, and the process would repeat, extending the prediction back another time step, and so on.
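The rule above can be sketched in code. This is a minimal illustration under assumed names and parameter values (a real HTM implementation tracks segments and permanences per cell; `learn_on_segment`, `MAX_NEW_T2_SYNAPSES`, and the flat-dict synapse representation are all hypothetical):

```python
ACTIVATION_THRESHOLD = 13  # connected synapses needed for a prediction (assumed value)
MAX_NEW_T2_SYNAPSES = 4    # configurable number of new T-2 connections per step
CONNECTED_PERM = 0.5       # permanence at which a synapse counts as connected
PERM_INC = 0.1             # permanence increment for correct activity

def learn_on_segment(segment, active_t1, active_t2):
    """Strengthen synapses to cells active at T-1, and grow a few potential
    synapses to cells active at T-2 so the prediction can creep backwards."""
    for syn in segment:
        if syn["presynaptic"] in active_t1:
            syn["permanence"] = min(1.0, syn["permanence"] + PERM_INC)
    # Grow new T-2 synapses, up to the configured budget for this segment
    existing_t2 = sum(1 for s in segment if s["presynaptic"] in active_t2)
    budget = min(MAX_NEW_T2_SYNAPSES, ACTIVATION_THRESHOLD - existing_t2)
    for cell in list(active_t2)[:max(0, budget)]:
        if not any(s["presynaptic"] == cell for s in segment):
            # New synapses start just below the connected threshold
            segment.append({"presynaptic": cell,
                            "permanence": CONNECTED_PERM - PERM_INC})
    return segment
```

After enough presentations of the sequence, the T-2 synapses cross `CONNECTED_PERM` and the cell becomes predictable one step earlier.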

Where this theory breaks down, of course, is the part of the process where, if a cell is predicted but doesn’t become active in the next time step, the permanence of the active synapses which led to that prediction is degraded. This would cancel out the permanence increases made with T-2 when it predicted an activation that happens at T rather than at T-1. This can be addressed by only applying the permanence decreases if the cell is neither active nor predictive in the next time step (versus only if the cell is not active). In other words, as long as a predictive cell is still predictive in the next time step, the connections which led to the prediction will be maintained (and possibly strengthened).
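The modified punishment rule is small enough to sketch directly. Again the names and synapse representation are hypothetical, not an existing API:

```python
PERM_DEC = 0.05  # permanence decrement for a failed prediction (assumed value)

def punish_if_wrong(segments, cell_active_now, cell_predictive_now, active_prev):
    """Degrade synapses that contributed to a prediction, but only when the
    cell ends up neither active nor predictive in the next time step --
    a still-predictive cell may simply be predicting further ahead."""
    if cell_active_now or cell_predictive_now:
        return segments  # prediction maintained; leave permanences alone
    for segment in segments:
        for syn in segment:
            if syn["presynaptic"] in active_prev:
                syn["permanence"] = max(0.0, syn["permanence"] - PERM_DEC)
    return segments
```

The only change from the usual rule is the extra `cell_predictive_now` check in the early return.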

I would like to address something that might initially seem to be a problem with this theory: the case where a cell is predictive for more than one time step, but never becomes active and eventually becomes inactive and not predictive. Since we are only degrading connections with cells active in the previous time-step, what about the connections with cells active in earlier time-steps? This actually works itself out if you think it through – when the connections with T-1 become weak enough, the cell will eventually no longer be predicted by input at T-1, and thus the permanence decreases will shift back one time slot. This would continue to bubble back through time as the new sequence is re-encountered.

Consider, for example, a well-trained sequence ABCD. The predictions would start out like so:
A: predicts B,C, and D
B: predicts C and D
C: predicts D
D: (no predictions)

Then we begin training sequence ABCF, and prediction updates begin to bubble back like so:
(iteration #1)
A: predicts B, C, and D
B: predicts C and D
C: (D not predicted)
F: (no predictions)

(iteration #2)
A: predicts B, C, and D
B: predicts C and F (D not predicted)
C: predicts F (D not predicted)
F: (no predictions)

(iteration #3)
A: predicts B, C, and F
B: predicts C and F
C: predicts F
F: (no predictions)

As of now, this is just a theory. Thought I would write it up here for discussion before I run too far with it. Is this really biologically feasible? Are there other theories that are better or more feasible, which would have the desired effect of extending predictions back in time? Any obvious flaws?


Exploring Reinforcement Learning in HTM
#2

I’ll have to give this a more careful read when I have more time, but TD doesn’t strictly require looking back through time, which is why it’s so attractive to neuroscientists who want to explain biological RL. It just maintains an eligibility trace that decays exponentially, and that time constant determines your reward discount rate.
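For reference, the standard tabular TD(λ) update with an accumulating eligibility trace looks like the sketch below. It illustrates the point above: no history walk is needed, only a trace vector that decays by γλ each step (the function name and parameter values are my own choices, not from any particular library):

```python
import numpy as np

def td_lambda_step(V, e, s_prev, s_next, reward, alpha=0.1, gamma=0.9, lam=0.8):
    """One online TD(lambda) update. V: value table, e: eligibility trace."""
    e *= gamma * lam    # all traces decay exponentially
    e[s_prev] += 1.0    # mark the state just visited as eligible
    delta = reward + gamma * V[s_next] - V[s_prev]  # TD error
    V += alpha * delta * e  # credit every still-eligible state at once
    return V, e
```

A reward arriving several steps after a state still updates that state’s value, because its trace entry has merely decayed rather than vanished.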

So in the meantime, before I get a chance to really take in the idea, have you thought about maintaining eligibility traces instead and using those to learn to predict discounted reward? Then you do the standard learn-on-reward-prediction-error thing.


#3

Thanks, @jakebruce. I should probably have clarified what I meant by “TD(λ)-like”, because your observations about the algorithm are correct. TD(λ) cannot be applied “out of the box” to HTM. It is essentially a mathematical formula for assigning value to points in a decision tree. HTM is about strengthening/degrading distal connections to generate predictions. On the surface, it isn’t immediately obvious how the two concepts can be connected.

What I am trying to do is to capture the “essence” of TD(λ), not the algorithm itself. The core of TD(λ), as you have mentioned, is the concept of eligibility traces that allow earlier steps to be associated with rewards/punishments that occur later in subsequent steps. These eligibility traces are what I referred to earlier as “historic state information”.

In practice, for HTM the eligibility traces need to consist not only of which cells were active in previous time-steps, but also which synapses were active (because we need to do things like finding “best matching segment” and so-on, which require synaptic state information). In HTM, the synapses change quickly over time – the state 100 time-steps ago was likely very different than the state at the current time-step. In my mind, a “best matching segment” would be one that would have best predicted something at the time it was active (versus at the current time-step). Of course, I may be over-thinking this point – the difference is subtle and may not have any functional difference.

The eligibility trace consists of lists of historic state information, indexed by time-step. This indexing is necessary because, as you pointed out, the weight of the current reward/punishment decays exponentially as you step back through the time-steps (i.e. the level of reinforcement with the active reward/punishment cells is different for each time-step). I’m just thinking of this now, but another possible implementation would be to decay the eligibility trace itself (i.e. drop some percentage of cells/synapses from the lists) versus decaying the level of reinforcement. This would be more memory efficient (but I would have to see if it is functionally equivalent).
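That alternative could be sketched as stochastic dropping: keep each cell in the trace with some probability per step, so the *expected* credit still decays exponentially without storing per-step weights. This is a speculative sketch of the idea, not an existing HTM mechanism:

```python
import random

def decay_trace_by_dropping(trace_cells, keep_prob=0.8, rng=random):
    """Keep each traced cell with probability keep_prob. After n steps a
    cell survives with probability keep_prob**n, so the expected trace
    mimics an exponentially decaying reinforcement weight."""
    return {c for c in trace_cells if rng.random() < keep_prob}
```

Whether this is functionally equivalent to weighted decay would need to be tested empirically; the variance introduced by dropping whole cells may matter for small traces.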

The eligibility trace decay is what I referred to above as imposing a limit on how far back a prediction can be extended. This is actually a limitation imposed by my own implementation, which generates distal connections only with active reward/punishment cells. This differs from the TD(λ) algorithm, which works with the sum of the current reward plus all future expected rewards.

To more accurately simulate this element of TD(λ) in HTM, in practice you would reinforce distal connections not only between cells active in the previous time-step and cells active in the current time-step, but also with cells that are put into a predictive state by the active cells in the current time-step. This would allow predictions to bubble back through time as the sequence is re-encountered, because it essentially copies the current predictions to cells in the previous time-step. It would eliminate the need to hold on to large lists of historic state information.

The issue I have with this, however, is that it leads to self-reinforcing wrong predictions, because the reinforcement of connections with predicted cells counters the degradation of wrongly predicted cells in the next time-step. The system might keep predicting some future reward for a particular action over and over again, even though that reward never actually happens (something like a superstition). This could be countered by configuring permanence decreases to be larger than permanence increases, but that is backwards from the way we typically configure HTM (we tend to place more emphasis on learning things quickly and remembering them, rather than on forgetting things).

To make this work, I would really need to redesign how permanence reinforcement/degradation is implemented. An activation at T-1 would need to reinforce connections with active reward/punishment cells at T-1 as well as connections with both active and predictive reward/punishment cells at T, and degrade any remaining connections with reward/punishment cells. This is a fairly significant deviation from the current implementation, which I assume is designed to mimic the biology (thus my aversion to going this route if there is an alternative that doesn’t require significant changes to the current implementation).
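Written out, the redesigned rule might look something like the following. Everything here is hypothetical (the function name, the flat synapse representation, and the parameter values are mine), and it only touches connections to reward/punishment cells, as described above:

```python
PERM_INC, PERM_DEC = 0.1, 0.02  # assumed asymmetric increment/decrement

def reinforce_reward_connections(segment, reward_cells, reward_active_t1,
                                 reward_active_t, reward_predictive_t):
    """Reinforce connections to reward/punishment cells that were active at
    T-1, or are active or predictive at T; degrade any remaining
    reward/punishment connections on the segment."""
    reinforced = reward_active_t1 | reward_active_t | reward_predictive_t
    for syn in segment:
        cell = syn["presynaptic"]
        if cell not in reward_cells:
            continue  # non-reward connections are handled by the normal rules
        if cell in reinforced:
            syn["permanence"] = min(1.0, syn["permanence"] + PERM_INC)
        else:
            syn["permanence"] = max(0.0, syn["permanence"] - PERM_DEC)
    return segment
```

Reinforcing the predictive reward cells at T is what copies future expected reward backwards, analogous to the bootstrapped term in the TD(λ) target.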

Ultimately, I probably should implement three or four different strategies and do comparisons to see which demonstrates the best balance of intelligence and resource requirements.


#4

I just wanted to point out that this is a huge problem for me in embedding TD(λ) on HTM. I get that you’re proposing a different thing, but synaptic traces are required even if you just want to update HTM according to the error signal. Even then, the synapses on your eligibility trace are probably outdated. Currently I only modify the synapses from the last step, but update the state values of all the neurons in the eligibility trace. Though it is kind of wrong to update the state values of neurons without updating the synapses activating them, because it implies that you made the necessary adjustments leading to those activations.