Multistep Predictions vs Predictions from Predictions

In current HTM theory, is the mechanism for multi-step predictions based on looking a certain number of steps into the past when computing cell predictions at the current time step, or is it based on using predictions to make new predictions? That is, instead of only looking at cell activity, cells would look at cell predictions as well and form new predictions from those predictive cells. A cell could then form connections to any cell that was, in the previous time step, (a) active or (b) predicting. Then the next time the other cell predicts, this cell predicts too: a two-step prediction.

In my view, it would only make sense to make predictions based on both cell activity and cell prediction. This means you only have to store the current state and the previous state when implementing the theory, and you can still generate predictions many steps into the future. The issue with this is clear: there is no way of telling whether a given prediction is two steps out or more. But it seems more biologically plausible. Perhaps a measure of how strongly a cell expects to predict is in order, as opposed to the binary activity of prediction?

Does the community have thoughts on this topic?

Isn’t that the ultimate question :slight_smile:

I have been thinking for a long time about this, even experimented on a very basic level.
As far as I understand, the current theory of temporal memory is that it makes one-step (forward) predictions.
I think Numenta uses a classifier to predict future values (still just one virtual step forward; of course the time step itself can be anything, N seconds, N days, N months, but still one step).

Pure MP (multi-step prediction) and PP (prediction from prediction) are prone to failure, because they would rely on the TM being a sort of simple “transition matrix”.
My recent thoughts on many-step predictions have been two-fold.

  1. Use some sort of reinforcement learning algorithm, where the TM plays the role of the MODEL.
    (The problem with RL is how you “invent” a good reward function.)

  2. Understand the TM as something more than a simple “Markov transition matrix”, more along the lines of a belief network built on DCGs (directed cyclic graphs) rather than DAGs!
    (The problem with BNs is that they are DAG-based, not DCG-based.)

Combine the two.
RL is also where the hierarchy takes place, but as a stepping stone I think RL will be easier to implement and test before you commit on full HTM.

There is no literature on RL without a reward function; the irony is that you can skip every other requirement, but the RF is crucial ;(
Not much literature on DCGs either! :frowning:

I would love if somebody can point me to documents on those two !

Hi Guys,

I know I’m just a knuckle-dragging coder, but I can’t help wondering whether something is being lost sight of in this convo regarding where the real power of the TM lies; to me it is not in the depth of the number of steps ahead it can predict, but in the fact that it:

  1. Makes multiple simultaneous predictions, each of which advocates the dominance of a different semantic origin (i.e. consideration of multiple semantic expressions in parallel, even though only one is selected).
  2. Derives each single prediction (depolarized cell) from the potential aggregate dendrites of many different pre-synaptic cells (hundreds or thousands), each representing its own semantic origin, whose predisposition to a particular semantic is in turn derived from multiple competitive spatial distinctions.
  3. …and it’s all changing dynamically (unlike classical NNs)

So I think (as you guys have pointed out), that there is profoundly more going on than merely a step-predictive matrix… :wink:

You are correct to point out your 3 points, but the thing is, people want much more to PREDICT multiple steps :slight_smile: not so much to detect anomalies :wink:
In fact that is why I went into HTM; I thought it was the ultimate solution to time-series prediction and “language comprehension”. I was wrong in thinking the technology was already there, but I wasn’t wrong to explore it, because the solution lies somewhere in there … I hope.

There is something about HTM that feels right, but also some bias on what problems are more interesting to solve too, at least for me.

1 Like

Besides the SDR/CLAClassifier already making multistep predictions, it seems like you could actually do this yourself by:

  1. Looking at what is being predicted for the next time step, and getting an SDR from that.
  2. Feeding that SDR back in (so the TM is fed exactly what it is expecting).
  3. Going back to #1 (rinse and repeat for however many steps ahead you need, being careful NOT to feed that “synthetic” prediction into the classifier and ruining its statistics).
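The three steps above can be sketched with a toy stand-in for the TM (all class and method names here are made up for illustration; this is not NuPIC's API):

```python
# Toy sketch of the "prediction from prediction" loop above, using a
# hypothetical first-order transition memory in place of a real NuPIC TM.
from collections import defaultdict

class ToyTransitionMemory:
    """Learns one-step transitions between symbols (stand-in for a TM)."""
    def __init__(self):
        self.transitions = defaultdict(set)

    def learn(self, prev, curr):
        self.transitions[prev].add(curr)

    def predict(self, curr):
        # Union of everything ever seen to follow `curr`.
        return self.transitions.get(curr, set())

def rollout(tm, start, n_steps):
    """Feed each prediction back in as if it were real input (steps 1-3)."""
    preds, current = [], start
    for _ in range(n_steps):
        nxt = tm.predict(current)
        if not nxt:
            break
        current = sorted(nxt)[0]   # pick one branch deterministically
        preds.append(current)      # note: never fed back to the classifier
    return preds

tm = ToyTransitionMemory()
for prev, curr in zip("ABCAB", "BCABD"):
    tm.learn(prev, curr)

print(rollout(tm, "A", 3))  # ['B', 'C', 'A']
```

Note that once a symbol has more than one learned successor (B → {C, D} here), the rollout has to pick a branch, which is exactly where a pure feedback loop starts to drift.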


1 Like

…and if you did the above after turning off “learning”, you could do it indiscriminately as many times as you like without messing up the learning of the actual stream (i.e. without reinforcing (incrementing) permanences from the synthetic input)!

This is what I meant when I said “pure PP” … it is not “stable”; it very quickly goes out of whack or wanders chaotically, because you don’t have the original signal to “discipline” it and keep it in line :slight_smile:

That is why I said you need some sort of RL algo.
E.g. if you have a cyclic pattern, use a sinusoid signal (extracted via FFT) as the basis for a REWARD mechanism in the RL algorithm… or, if it is a trending signal, extract a regression line and do something similar… or even reward based on an external signal. This way you can probably explore “emergent properties” of the TM … but this starts to look more and more like time-series analysis.

If you can find a more universal reward mechanism that keeps the predictions in line, then maybe you can do this self-reinforced PP.

PS> Or even better, have another TM which predicts N steps ahead and use that for the reward … huh … that could be it … interesting :sunny:

I’m new to most of artificial intelligence outside of HTM, since HTM is where I first started learning machine learning. The ideas of reinforcement learning, belief networks, and directed cyclic and acyclic graphs, these are all new to me. I’ve learned a small amount about Markov chains, but the information needs refreshing before I could hope to grasp the significance of what you’re proposing.

That being said, I really like the ideas of RL and finding a reward mechanism. I’ve come across the multi-armed bandit in game theory, and that helps me grasp the difficulty of the problem. Whenever I think about how I do it (reinforcement), I get into very unanswerable questions about the difference between a good feeling and a bad feeling, the decision process and feedback system in our bodies, and the way that thoughts correspond with actions and vice versa. It’s fun, and reminds me of the later parts of On Intelligence, but eventually I reel myself back in.

With reward systems, is it usually a “good stuff bin” and “bad stuff bin” and the point is to respond to events by some combination of these? Like “Something happened here, put 0.25 points in good bin and 0.75 points in the bad bin. If bad bin gets too high, alter the way the temporal memory responds to that event”. Is the difficulty of reinforcement deciding what is “good” and “bad”? Or am I way far off? These are first thoughts on the subject for me.

I don’t know much about RL either; I started learning it recently… If you know your problem you can easily decide on the RF, but if you have no idea of the “search space”, how do you decide what counts as a good “score”?
For example, let’s take time series again. Suppose the signal is sinusoidal. Knowing that, I can implement an RF that gives a better score as the predictions approach the previous maximum, then reverse the rule to give a better score as they approach the previous minimum … in a sense guiding the RL algo to approximate up/down, up/down …
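That max/min rule could be sketched as a toy reward function (the name, signature, and normalization here are my own assumptions, not from any RL library):

```python
def sinusoid_reward(prediction, prev_max, prev_min, going_up):
    """Hypothetical reward for a known-sinusoidal signal: score rises as
    predictions track toward the previous maximum on the up-swing, then
    the rule flips to favor the previous minimum on the down-swing."""
    span = prev_max - prev_min
    if going_up:
        # Closer to the previous maximum -> higher reward.
        return 1.0 - abs(prev_max - prediction) / span
    # Rule reversed: closer to the previous minimum -> higher reward.
    return 1.0 - abs(prediction - prev_min) / span

# A prediction near the old peak scores well on the up-swing...
print(sinusoid_reward(0.95, 1.0, -1.0, going_up=True))   # 0.975
# ...and poorly on the down-swing, where the rule has reversed.
print(sinusoid_reward(0.95, 1.0, -1.0, going_up=False))  # 0.025
```

The same sketch also makes the failure mode below concrete: if the signal trends upward past `prev_max`, this RF punishes exactly the predictions that are correct.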

But what will happen if the signal is trending upwards swiftly, OR if the signal happens to be flattish with spikes? My RF will misguide the RL algo.

See, it is very hard to find a general RF, but easy to figure out a specific one.

This one is good :
I’m on the second lecture.

Also I’m reading this : Bayesian Reasoning and Machine Learning by David Barber,

Does anyone know how multistep predictions are actually done in NuPIC? I’d assumed the TM predicts one step ahead at a time, and then feeds that back into the TM to generate the next set of predictions. Is that not right?

As far as I understand, each neuron in the TM learns transitions from just the prior time step (it forms synapses to neurons that were active at time t-1). If that’s true, it would seem that each neuron, when active, generates a set of predictions for the next time step only, not numerous steps into the future. Am I wrong here? I haven’t experimented yet with multi-step prediction in NuPIC and would love to know if I don’t have this understood right. Thanks!

That makes sense, but the nupic online prediction framework does not work that way.
Instead, it uses a classifier to make an association between SDRs in the TM and the respective input scalars that caused them (in the scalar→encoder→SP→TM→SDR chain). The classes represent buckets of scalar ranges; you can have, say, 100 classes for a resolution of 0.01 over the input interval [0, 1].

The mechanism for single-step and multi-step is exactly the same:
The classifier learns the association between the SDR at time [k-n] and the scalar input at time [k], using a FIFO buffer for the SDRs. After the association is learned, given the input at time [t] and its corresponding SDR at time [t], the classifier will unfold the association with the scalar input at time [t+n], thereby getting a prediction n steps into the future.
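A minimal sketch of that FIFO association, assuming a toy classifier that just counts bucket co-occurrences (this mimics the spirit of the mechanism, not NuPIC's actual CLAClassifier implementation):

```python
# FIFO association sketch: pair the SDR from n steps ago with the
# current input bucket, then look up the majority bucket at inference.
from collections import deque, defaultdict

class ToyStepClassifier:
    def __init__(self, n_steps):
        self.n = n_steps
        self.fifo = deque(maxlen=n_steps)   # SDRs awaiting their future target
        self.assoc = defaultdict(lambda: defaultdict(int))  # SDR -> bucket counts

    def learn(self, sdr, bucket):
        if len(self.fifo) == self.n:
            old_sdr = self.fifo[0]          # the SDR from time t-n
            self.assoc[frozenset(old_sdr)][bucket] += 1
        self.fifo.append(sdr)               # maxlen auto-evicts the oldest

    def infer(self, sdr):
        counts = self.assoc.get(frozenset(sdr))
        if not counts:
            return None
        return max(counts, key=counts.get)  # most likely bucket at t+n

clf = ToyStepClassifier(n_steps=2)
stream = [({1, 2}, 10), ({3, 4}, 20), ({5, 6}, 30), ({1, 2}, 10), ({3, 4}, 20)]
for sdr, bucket in stream:
    clf.learn(sdr, bucket)
print(clf.infer({1, 2}))  # 30: two steps after {1,2}, the input was bucket 30
```

The key point the sketch shows: the TM itself still only steps forward once per input; the n-step lookahead lives entirely in the lag between the buffered SDR and the current scalar.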

The SDR here is the set of active cells from the TM, as it supposedly holds all the available information (about input history and recent input context) needed to foresee the future; however, I did some tests using the set of predictive cells instead and there was no noticeable difference in the results.

You can watch this video as it’s the only resource explaining scalar prediction:

Yes and no. If the input history is A,B,C, A,B,D, A,B, the next-step prediction will be C+D (an SDR union) because they are equally likely; however, if C was seen more often before, then the union will be biased towards C and D will be stripped of bits. In the case of a scalar input stream, this next-step prediction is also an SDR union of the SDRs that might occur multiple steps into the future; the most dominant SDR in the union is the one for the immediate next step, and the SDRs for later steps are “faded away”, so to speak. But as I explained, this is not so relevant with regards to how the OPF makes predictions. And then of course, there are better methods than nupic OPF.
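The biased union can be illustrated with a toy sketch (purely illustrative; real TM predictions arise from dendritic segments, not from counting bits like this):

```python
# Sketch of the biased union described above: C and D are both predicted,
# but if C was seen more often, the union keeps more of C's bits.
from collections import Counter

def biased_union(predictions, keep):
    """predictions: list of (sdr, seen_count); keep the `keep` strongest bits."""
    weight = Counter()
    for sdr, count in predictions:
        for bit in sdr:
            weight[bit] += count
    return {bit for bit, _ in weight.most_common(keep)}

sdr_c = {1, 2, 3, 4}
sdr_d = {10, 11, 12, 13}
# C seen three times as often as D; keep only 6 of the 8 union bits.
union = biased_union([(sdr_c, 3), (sdr_d, 1)], keep=6)
print(sdr_c <= union)        # True: all of C survives
print(len(union & sdr_d))    # 2: D is "stripped away of bits"
```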

It might not need to make multistep predictions in the temporal memory. The temporal pooler’s representation of the sequence is sort of a long term prediction because it pools the sequence before the sequence is over. If there’s no ambiguity about the rest of the sequence, the representation of the sequence indicates that the rest of the sequence will play. If the temporal pooler has a bit of instability to weakly represent the place in the sequence, its representation as a whole indicates a sort of long term prediction (based on the combination of sequence and place in the sequence).

Imagination mechanisms could also be involved, but HTM theory is probably a ways off from imagination because imagination probably requires behavior or something similar. The brain decides to imagine rather than doing so automatically.

@Casey you just helped me understand the temporal pooler! I haven’t yet dug into the temporal pooler since I wasn’t sure where to get reliable documentation on it. But the way you put it, it seems the temporal pooler pools together sequences, stores them, and compares the stored sequences to the current cell space. Am I on the right track?

That being said, while I like the idea of a temporal pooler, I’m not sure it’s the optimal method for long-term predictions. Not that a combination couldn’t be useful, but it seems a predictions-from-predictions system would save a lot of memory, a pretty valuable commodity. I can’t back that up with numbers, but from an initial perspective I’m not sure about the TP for long sequences. I’ll need more information to be at all sure.

I can imagine the TP has a unique purpose, and there are other places to talk about it in depth. Staying within the realm of predictions, how much improvement does the TP have?

Here are some resources on the TP (or ‘Union Pooler’ as they’re now calling it).

Pseudo Code:
Code Repo:
Recent Discussion Post: New TP pseudocode?

As I understand it, it basically performs spatial pooling on the active and active-predicted cells from the TM instead of on an encoding vector as usual. So there’s another SP layer operating on the TM. In this layer each column has a ‘pooling activation’, which is incremented each time the column is activated by the SP and decremented otherwise. The columns with the highest pooling activations at each time step (as accumulated over the prior x timesteps) are the ones active in the TP (/Union Pooler).

The purpose is to have a set of active columns that learn to represent entire sequences from the TM, so one TP activation can remain for many timesteps of the TM, for the duration of the known sequence. Therefore a TP working well on a TM that has learned its sequences well will change its activations much more slowly than the TM.
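The increment/decrement-and-take-the-top-columns idea might look something like this toy sketch (the update rule and parameter values are my guesses for illustration, not the Union Pooler's actual code):

```python
# Toy pooling activation: each column accumulates activation while the SP
# keeps selecting it, and the top accumulators form the pooled output.
def pool_step(activations, active_columns, top_k, inc=1.0, dec=0.2):
    for col in range(len(activations)):
        if col in active_columns:
            activations[col] += inc          # reinforced by SP activity
        else:
            activations[col] = max(0.0, activations[col] - dec)
    ranked = sorted(range(len(activations)), key=lambda c: -activations[c])
    return set(ranked[:top_k])               # pooled (TP) output

acts = [0.0] * 8
# Columns 0-2 recur across the sequence; column 7 appears only once.
for active in [{0, 1}, {1, 2}, {0, 2}, {7}]:
    pooled = pool_step(acts, active, top_k=3)
print(pooled)  # {0, 1, 2}: the recurring columns dominate the pooled set
```

Because the accumulated activations decay slowly, the pooled set changes much more slowly than the raw per-step activity, which is the "stable representation of a sequence" property described above.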

I think this is basically right from a high level, but if someone would jump in and correct it please do!


Glad I could help.
Sorry about this long rant. I don’t know the exact answers to all of your questions, so I’ll give some speculation for those questions.

You are on the right track. One detail to keep in mind is that it treats predicted inputs differently from non-predicted inputs in a few ways. I’m not completely sure, but based on the link below, it seems like the inputs to the HTM need to come in a known sequence for it to pool anything. So if the input suddenly jumps to the middle of a known sequence, it won’t understand what the sequence is, at least not right away.

I’d like to add a link to sheiser1’s list which I’ve found helpful:

I’m not sure, but I suspect the TP can store a lot of sequences, similar to how the SP can represent an enormous number of things using the properties of SDRs. I’m not sure if the TP produces representations with those same properties though, but it probably should if it doesn’t. Then again, there are way more possible sequences than inputs, so it might not be enough capacity, especially for very long sequences.

I’m not sure how much the TP could help long term predictions. It would probably filter out at least a decent portion of the instability to allow abstraction, so high regions might not be able to contribute to low level predictions very effectively, but they could still make abstracted predictions about the future. The final version of hierarchy will probably involve something more complicated than TP because there are multiple layers with many mysteries, and there are multiple pathways both up and down the hierarchy.

A predictions from predictions system might be able to work very well, but there could be technical issues. For example, you might need to distinguish cells by how many steps they are predicting into the future so that each prediction leads to the next prediction rather than predicting another step into the future based on all predictions of activity far into the future. I’m not sure if that could happen in biology.

I’m not sure the brain even makes predictions far into the future, at least not of the same type as the temporal memory. Thinking about the far future probably usually involves things like imagination, extreme abstraction, and activity modes such as oscillation frequency. Longish-term predictions could be useful for behavior in a simple reward-learning system, but beyond a certain distance into the future, prediction probably needs to get a lot more complicated, because it’s probably really hard to keep track of the long-term impact of each action, and the number of possible sequences is too large to even experience many of them, so some way to generalize experiences into concepts is required.

That said, there’s probably at least some sort of medium term prediction - not on the order of milliseconds or seconds like lower cortical levels, but also not on the order of days like people are capable of planning. The hippocampus is capable of predictively playing sequences which it will later play as the animal takes a path. It can also learn to play a sequence if it needs to keep something in memory relevant to behavior, such as red flag means take a left when you go into the maze and blue flag means take a right, although I’m not sure how long animals keep the sequence going. That’s related to the idea of long term predictions because it knows what it needs to do long before it actually acts.