I’ve seen explanations in many places that the TM is a sort of variable-order Markov chain; the problem I have is visualizing step by step how that happens.
From the basic structure (one-step prediction), what this implies is that we have a sort of simple “transition matrix”, i.e. if state-x then state-y. There is nothing variable so far, just an order-1 chain.
Now in the docs there are the ABCD, XBCY examples, but they don’t help me see the connection: how is it variable when you still have order-1 transitions “stored” in the TM?
I imagine that because HTM states are not dense states (real numbers, as in a NN) but SDRs (sparse), this creates looser “transitions”, i.e. different prior bits influence the bits of the next state differently, so in a sense you virtually create multi-step bit sub-states “below” the STATE, which is a composite of many bits (the SDR).
Abstractly (descriptively) I can appreciate it, but I can’t seem to imagine step by step (procedurally) how this variable Markov order is created and plays out.
As you point out, we have explained this many times and in many ways. But I know it is hard to understand, so I like to know when it isn’t getting across. Did you read the “Why Neurons Have Thousands of Synapses” paper? The explanation and figures in that paper underwent peer review and it might be the most concise explanation we have.
Perhaps the confusion is one of language. In some situations the state of the TM predicts one next state, which predicts one next state, etc. If you say it this way you might say these are 1st-order transitions. But the states themselves are high-order. If a “B” is input to the TM it will be represented uniquely depending on variable-length context. We often describe these high-order B states as B’ or B’’.
In the examples ABCD and XBCY, when B arrives after A it is represented by a unique B’ and when B arrives after X it is represented by a unique B’’. B’ uniquely predicts C’, which predicts D’. B’’ predicts C’’, which predicts Y’’. To phrase this in English: when you listen to a familiar melody, every note is represented by an SDR that is unique to both the particular melody and the location of the note in that melody.
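To make the “first-order transitions between high-order states” idea concrete, here is a minimal sketch. This is plain Python with named states, nothing like the TM’s actual cell/segment data structures; the state names just follow the B’/B’’ notation above:

```python
# A minimal sketch (hypothetical, not the TM's actual data structures):
# each high-order state is just a named node, and every stored transition
# is first-order, yet following them replays a variable-order sequence
# because each state already encodes its context.

transitions = {
    "A'": "B'", "B'": "C'", "C'": "D'",        # ABCD learned as A'B'C'D'
    "X''": "B''", "B''": "C''", "C''": "Y''",  # XBCY learned as X''B''C''Y''
}

def playback(state, steps):
    """Follow first-order links between high-order states."""
    path = [state]
    for _ in range(steps):
        path.append(transitions[path[-1]])
    return path

print(playback("A'", 3))   # ["A'", "B'", "C'", "D'"]
print(playback("X''", 3))  # ["X''", "B''", "C''", "Y''"]
```

Both melodies pass through a “B” input, but because B’ and B’’ are different states, the order-1 links alone are enough to end in D vs. Y.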
I can see how B’ and B’’ play out. They are selected depending on the exact sequence that came before them; that’s the theory.
But what about “finding” the correct specific pattern to initiate the sequence? E.g. if you start with A, you will get a random A pattern at the start: A’ or A’’ or A’’’ … etc.
On top of that, the next pattern in line constantly slips, as seen by the need to nudge/reinforce the prediction (otherwise we could predict multiple steps in advance).
I understand it on an abstract level, but can’t wrap my mind around how it will work procedurally, step by step.
Or if I can rephrase my question to be more answerable:
What guarantees that the next predicted pattern will be the correct sub/high pattern, e.g. B’ rather than B’’, so that when the time comes for D or Y the correct one is selected because the correct B was selected beforehand? Even if we are not sure the correct A was used in the first place.
How does the algorithm accommodate the slippage?
BTW: I wrote my own abstract implementation of the TM, approximating the papers, and I can see that it works.
Are you asking about how the high order sequences are learned by the TM? It is indeed a little tricky to go through the learning process step by step. Here’s a rough textual description:
Suppose you’ve already learned XBCY (let’s call the TM representations X’B’C’Y’) and now you want to learn ABCD (A"B"C"D"). When you present A, nothing is predicted in the B columns. When you present B it bursts and a random set of cells are chosen to win. Those cells become part of B". So far so good.
However, because the B columns burst, C’ is predicted at this point. So the TM will learn A"B"C’D. This is incorrect but temporary.
Later when you present A, B" will be predicted. When you see B, B" will become active. At this point nothing is predicted in the C columns. When you see C, a random set of cells are chosen to win. These become C". So at this point the TM will learn A"B"C"D".
Because of the temporary incorrect state, the TM needs to see the high order sequence several times in order to learn everything correctly.
Does that make sense? There are a few more details, but that should give the basic idea.
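The walkthrough above can be sketched as a toy simulation. This is a deliberately simplified, hypothetical model, not the real TM: a handful of cells per column, a plain dict of cell-to-cell links, and a deterministic “least used cell” winner rule standing in for the segment/permanence machinery and the random winner choice:

```python
# Toy model (hypothetical; not the real TM). A cell is a (symbol, index)
# pair; `forward` maps each cell to the set of cells it predicts. On a
# burst the whole column becomes active, and the least-used cell is picked
# as winner (the real TM similarly favors cells with the fewest segments).

N = 4  # cells per column (toy size)
forward = {}  # forward[(sym, i)] = set of cells that cell (sym, i) predicts

def least_used(sym):
    """Winner cell for a bursting column: the one with fewest links."""
    def usage(i):
        cell = (sym, i)
        out = len(forward.get(cell, ()))
        inc = sum(cell in targets for targets in forward.values())
        return out + inc
    return min(range(N), key=usage)

def run(sequence):
    """Present `sequence` after a reset; return the inputs that burst."""
    prev_active, prev_winners, bursts = set(), set(), []
    for sym in sequence:
        predicted = {t for src in prev_active
                     for t in forward.get(src, ()) if t[0] == sym}
        if predicted:                     # correctly predicted: no burst;
            active = winners = predicted  # existing links are kept, but
        else:                             # no NEW links are grown
            bursts.append(sym)
            active = {(sym, i) for i in range(N)}  # the whole column fires
            winners = {(sym, least_used(sym))}
            for w in prev_winners:        # grow links only after a burst
                forward.setdefault(w, set()).update(winners)
        prev_active, prev_winners = active, winners
    return bursts

# Pre-learn XBCY as X'B'C'Y' (cell 0 in each of those columns):
for a, b in [("X", "B"), ("B", "C"), ("C", "Y")]:
    forward[(a, 0)] = {(b, 0)}

print(run("ABCD"))  # ['A', 'B', 'D'] -- B's burst predicted C', so C didn't burst
print(run("ABCD"))  # ['A', 'C']      -- B" was predicted; C bursts, creating C"
print(run("ABCD"))  # ['A', 'D']      -- C" was predicted; D bursts and links from C"
print(run("ABCD"))  # ['A']           -- the whole A"B"C"D" chain is now in place
```

Each pass extends the learned high-order chain by one step (B", then C", then the final D link), which is the “needs to see the sequence several times” effect, and at the end only the genuinely unpredictable first input still bursts.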
That is true the first time around, but not during this later phase. During this phase B" was correctly predicted, so the B columns did not burst, and therefore C’ is not predicted. At this point there is no connection from B" to C, so when C is presented, the C columns will burst. In this situation the learning rule then picks a random set of cells to become winner cells.
I see, that makes sense. I hadn’t thought before about burst-predicted vs. correctly-predicted patterns.
So we have three cases:
1. A burst pattern will predict the most probable next pattern (the most often occurring one, because the permanences will favor it). This is how a high-order sequence gets back on “track”, OR not, if this happens to be a “misfire”.
2. A correctly-predicted pattern will predict a burst pattern when there is uncertainty. This is how a HOS loses “track”.
3. A correct pattern predicts a correct pattern. On “track”.
Hello, I was following your conversation and there is something I didn’t understand.
When B produces a burst and C’ is predicted the first time, all the cells in the bursting columns responsive to B link to C’ (when C appears in the input). Then, when B’’ is predicted and then becomes active (when B appears in the input after A), shouldn’t C’ be predicted and then eventually active when C appears again?
After all, every cell from the bursting columns that responded to B the first time is already linked to C’, and B’’ is not an exception, is it?
I think I understood: the permanences between B’’ and C’ are not strong enough to establish a connection yet. Ok?
Actually, when B bursts and C’ is predicted, only the existing connections from B to C’ are updated. The existing connections were from B’ to C’. (Additional synapses are added to that segment only if there is room.)
So when B" becomes active, it will not predict C’.
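Here is a hypothetical sketch of that rule. The permanence step (0.1), the initial permanence (0.21), and the synapse cap are made-up illustration parameters, not NuPIC’s: existing synapses on the matching segment are adjusted, and synapses to new presynaptic cells are grown only if the segment has room.

```python
def learn_on_segment(segment, prev_winners, max_synapses):
    """Update one matching segment after its cell was correctly predicted."""
    # adjust EXISTING synapses: reinforce those whose presynaptic cell was
    # a winner in the previous step, weaken the rest
    for syn in segment:
        delta = 0.1 if syn["pre"] in prev_winners else -0.1
        syn["perm"] = min(1.0, max(0.0, syn["perm"] + delta))
    # grow synapses to not-yet-connected winners only if there is room
    existing = {syn["pre"] for syn in segment}
    room = max(max_synapses - len(segment), 0)
    for cell in sorted(prev_winners - existing)[:room]:
        segment.append({"pre": cell, "perm": 0.21})

# A C' cell has one segment that learned to recognize B':
segment = [{"pre": "B'-cell", "perm": 0.5}]

# B bursts, so both the old B' cell and the new B'' cell are winners.
learn_on_segment(segment, {"B'-cell", "B''-cell"}, max_synapses=1)
print(segment)  # the B' synapse is reinforced; no room, so B'' is not added

learn_on_segment(segment, {"B'-cell", "B''-cell"}, max_synapses=2)
print(segment)  # now there is room: a weak synapse to the B'' cell is grown
```

The first call is the situation described above: C’ was predicted through B’, so its segment is only reinforced and B" gains no strong path to C’, which is why B" ends up predicting its own C" instead.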