Raw TM Test (no SP)

encoders
temporal-memory
category-encoding

#1

Hi All,

I just wanted to verify that I’m using the Algorithms API correctly to test the TM. I’m trying to make a 1:1 bare bones comparison with my own TM implementation, so I’m feeding in the following simple categorical sequence, repeated 3000 times:>

seq = [‘A’,‘B’,‘C’,‘D’,‘X’,‘B’,‘C’,‘Y’,‘A’,‘E’,‘F’,‘D’,‘X’,‘E’,‘F’,‘Y’]

I’m bypassing the SP to simplify the comparison, and I thought it shouldn’t really matter since the inputs are categories with no overlap. So I’m feeding the active bits from the CategoryEncoder right into the TM as the SP winning columns would normally be. Does this approach make sense?

What’s confusing me is that the NuPIC TM seems to be taking a very long time for the Anomaly Scores and the Number of Predictions Made to converge to 0 and 1 (respectively). This 16-letter sequence (with no noise), doesn’t seem to fully converge even after as many as 20,000 repeats, whereas my implementation does so after just ~400. Of course I suspect there’s something off in my code, since a 16-length sequence w/out noise should only take around 1600 repeats to converge, right?

Here is my code. I’m using only imported functions from NuPIC as shown in the Algorithms API Quickstart (except those for generating the data). If anyone would take a quick glance I’d be really grateful once again. Any mistakes you find or intuitions you’d have are most welcome!

https://pastebin.com/xp1VreZs

Here are the result plots showing NuPIC and my TM’s Anomaly Scores and Precisions (number of predictions made). Both sets of Anomaly Scores and Precisions become steadily more sparse moving toward 0 and 1, though the NuPIC TM doesn’t quite seem to get there.

NuPIC TM:


My TM:


#2

I think so, just make sure the categories are 2% sparse if you’re going to bypass the SP.

Anomaly score would be a good metric to track, but even with completely regular data I’m not sure the metric of “when it converges to zero” is valid. If an HTM converges to a 0 anomaly score, I’m not sure it’s really in a useful state.

I would ask @cogmission what metrics he used form comparisons. His Java implementation is currently even running NAB!


#3

Right now I’m using the raw output of the CategoryEncoder, with 8 categories and a ‘w’ of 21, the the ‘n’ is just 189 (8*21 + 21 for ‘other’ category). But you’re saying that the category encoder’s ‘w’ value should be just 2% of it’s ‘n’?

Yes, though in theory if a perfectly periodic pattern is repeated enough times shouldn’t the system no longer be surprised by any one input? It seems to me that in this case the anomaly score should eventually flat-line at 0, and to make sure that isn’t because of too many predictions I track the ‘Precision’ (number of predicted TM columns / number of activated encoding bits per input). Once it totally learns the sequence the Anomaly Score of 0 and Precision score of 1 mean that its making just 1 prediction and its the correct prediction.

But other than the encoder sparsity issue, does my use of the API functions make sense? I went right from the Quickstart guide and just want to be sure I’m not leaving anything out.


#4

That is typical of the SP’s output, so it would better match the type of input the TM usually gets. If there are 189 bits in the encoding, make 4 bits active.


#5

Ok great! Since there are only 8 categories a ‘w’ of 4 means that just 32 of the 189 input columns would ever activate, tho I don’t see why that’s necessarily an issue.


#6

You should talk to others who’ve built HTMs. There may be better methods of comparison. CC / @cogmission @Paul_Lamb @breznak @mraptor @zacgross


#7

@rhyolight @sheiser1,

I’m a little on the “rusty” side, but I seem to remember some strict guidelines for ‘w’ formulation. I would try the Encoder base class documentation. I’ll look it up myself later if you don’t get a chance before then. I’m also, not sure if the CategoryEncoder has a different set of criteria? I’ll get back to you if nobody else “beats me to it”.


#8

You don’t need to be concerned with sparsity of columns as long as all of your inputs do not share any semantics with each other. If for example every input is given its own 32 minicolumns, and you have 8 distinct inputs, then 256 minicolumns are enough to test TM.

Are you doing a reset between each iteration of the sequence? If not, then if you are using the traditional TM algorithm, you will get a burst at an interval which becomes longer over time, but you will effectively never reach a point where that burst stops. I can explain this in more detail if this isn’t something you are familiar with (I’ve talked about this on several other threads, so not sure if you’ve seen this before). There are variations of the traditional TM algorithm where the sequence of representations stabilizes and the bursting stops.


#9

I don’t think I remember ever seeing a non-convergence in bursting? Do you think that’s true of the Java TM?

Anyway, the only restriction i think I remember is that the w has to be “odd” so that an exact center is found? Then there’s some size limitation but that can be forced, I think. But yes, sparsity (if I remember correctly), isn’t the purview of encoders.


#10

Ok great, that’s what I figured.

Nope no resets.

The interval between bursts definitely becomes larger over time as it learn the transitions between subsequences, but why should this bursting never stop if the sequence is perfectly periodic?

What happens in my implementation (which I plucked completely from the BAMI pseudocode) is that the between-burst interval gets larger and larger until the only burst left happens when the entire sequences restarts (going from letter 16 back to letter 1). That transition occurs last and least often so it takes longest to learn, but it does learn it eventually at which point the Anomaly Score flat-lines at 0.

I’m very curious, why shouldn’t this happen? Shouldn’t the TM be able to learn a sequence of any length with enough repetitions? Does my result mean that my implementation deviates in some way? I’d certainly like to read any explanation you have on this here or in any post. Thanks @Paul_Lamb!


#11

Do you mean that HTM.java would also stop bursting eventually in this case of perfectly repeating categorical inputs?


#12

Sure, the easiest way to understand this phenomenon is to use the shortest repeating sequence possible ABABABAB… (you could do a single repeating input AAAAA… but that comes with other special considerations, such as whether a cell can be both predictive and active at the same time)

The first input causes the A minicolumns to burst, and a set of winner cells A’ is chosen. Then the next input causes the B minicolumns to burst, and a set of winner cells B’ is chosen. These connect to the A’ cells. So far, the system has learned the following sequence:

A’B’

Nothing is predicted at this point, so the next input causes the A minicolumns to burst, and a set of winner cells A’’ is chosen. These connect to the B’ cells. So far, the system has learned the following sequence:

A’B’A’’

Because the A minicolumns are bursting, this means A’ cells are active, and B’ cells are predictive. So the next input B does not cause bursting, and B’ becomes active, predicting A’’. The next input A does not cause bursting, and A’’ becomes active.

At this point, nothing is predicted, so the next input causes the B minicolumns to burst. A new set of cells B’’ are selected as winners. These connect to A’’. At this point the following sequence has been learned:

A’B’A’‘B’’

Because the B minicolumns are bursting, A’’ is now predictive. For the next inputs, A’’ becomes active, then B’’.

At this point, nothing is predicted again, so the A minicolumns burst, and a new set of winners A’’’ are chosen and connect to B’’. The following sequence has been learned:

A’B’A’‘B’‘A’’’

Because A is bursting, B’ and B’’ are predicted. Next input B’ and B’’ are active, predicting A’’ and A’’’. Next input A’’ and A’’’ are active, and predict B’’. We are back to one active cell per minicolumn again, so B’’ activates then A’’’.

At this point, nothing is predicted, so the B minicolumns burst, and winners B’’’ are selected, which connect to A’’’. The learned sequence is now:

A’B’A’‘B’‘A’’‘B’’’

Hopefully you see the pattern here. This continues until the capacity is reached. For 32 cells per minicolumn, the sequence learned is:

A(1)B(1)A(2)B(2)A(3)B(3)…A(32)B(32)

Again, nothing is predicted, so the A minicolumns burst. Because now all the A cells have one segment, each with the same number of synapses, the random tiebreaker logic kicks in and a random sampling of A cells are chosen for A(33), and connect to B(32).

At this point, every B cell has connections to an A cell, and because the A minicolumns are bursting, all the B cells are predicted. The bursting stops, because the layer is fully saturated. When all A cells are active, they predict all B cells, and when all B cells are active, they predict all A cells. So technically the bursting has stopped…

… except at this point the minicolumns may as well be bursting, because every cell in them is predicted active for every input :smiley:

(sorry for the edits… it has been some time since I ran the repeating sequence experiments, so didn’t quite explain the end state for traditional TM algorithm accurately)


#13

I think this might be the behavior if you allow only one active cell per minicolumn (except for bursting minicolumns). Is that the case, or if two cells are predicted in the same minicolumn, for example, do you activate both?


#14

Hey @Paul_Lamb, thanks a million for your fully thorough description! To answer your thoughtful questions I checked, and:

  1. None of the 168 utilized columns are using all 32 of their cells. More specifically, of the 32 cells in each column between 18-20 have been used by the end (a ‘used’ cell is one that has at least 1 segment on it). This means that by the time the Anomaly Score flat-lines 12-14 of the cells in all columns are yet unused, and no cell has more than 1 segment formed either.

  2. I also don’t enforce a policy of ‘only 1 active cell per column’. In non-bursting columns (those that have at least 1 matching cell) I activate each matching cell.

Is there any other reason that may come to mind? I wouldn’t presume to take any more of you time, but I can’t help but share a bit of my code just in case the reason were to jump out at you. In a nutshell my overall data structure is a big dictionary called ‘cells_segments’, where the keys are all 65,536 cells and the values are lists of dictionaries (each dict is a segment holding which other cells it links to and the synapse’s permanences). The algorithm logic is directly from the BAMI pseudcode, I just use this weird/clunky data structure because I was able to fully understand it as a sophomoric coder.

Here’s the main TM function (taking in the main data structures as of the prior time step, updating them and returning them again)
https://pastebin.com/W5VFFXnp

Here are the helping functions as well in case:
https://pastebin.com/bTGyaXRn

I certainly don’t want to be a bother in any way. I just feel such a hunger to have my head fully around the TM, its long term behavior along with the algorithm itself, and as someone who has built them before and knows the theory so deeply your input is invaluable. Thanks again,

– Sam


#15

No bother, I enjoy HTM :slight_smile: I’ll take a look to see if I can see anything.

For reference, here is a visualization which is probably easier to follow than my long-winded post above. Went with one minicolumn per input and three cells per minicolumn for simplicity.


#16

I don’t see anything right off the bat, but I’ll look at it some more tomorrow. If it doesn’t take too many timesteps to analyze, could you run a repeating sequence between just two inputs (rather than your long sequence) through this, and take note of what happens each timestep until the representations stabilize for you? Note how many cells are active in a minicolumn for each timestep. May be something in the pattern to indicate where your implementation differs from NuPIC.


#17

BTW, this is one area where HTM sequence memory clearly diverges from how a brain implements sequence memory. If we hear “ABABA”, we aren’t surprised when we hear … “B”. Somehow our minds have pooled the part that is repeating (AB), and we experience it as a single repeating sequence (akin to “AAAAAA”).

This is one reason why I’ve spent a lot of time experimenting with different temporal pooling strategies, as well as trying to understand how to stabilize the single repeating input onto a small number of recycled representations. If “ABABABABABAB…” can be processed as “AB AB AB AB AB AB…”, and the latter can be represented in two or three alternating representations, then I think HTM will behave a little more like a brain in this regard.


#18

Very much so.
Although I tend to agree with you here, I’m not betting for or against the ‘HTM diverges’ conclusion.
What if this sub-sequence recognition was tackled by hierarchy ?


#19

Yes, definitely. Really if you think about it, an input and output layer together are essentially a two-level hierarchy (I suspect the ultimate implementation between the levels of a deep hierarchies will closely mirror this relationship as well). For this one, though, it seems like such a simple problem (even though I haven’t fully solved it yet) that it should be achievable without having to resort to a deeper heirarchy :slight_smile:


#20

A-unexpected, then B’
has to be translated to a stable pattern S somewhere higher (output row ? higher layer ? higher area ? … deeper anyway)

then, on the deeper level, S-unexpected has to be learned as followed by S, and S, and S…
(from the point of view of the deeper level, we’re thus in your “single repeating input” configuration, that you expressed above as being tricky… and for myself I haven’t thought about it at length).

Anyway, we’d need to have a model for hierarchy, when deeper level has sufficient confidence of a known S, able to feedback-signal the initial level that it’s time to reset its sequence translation at the end of it.
And also to provide S as a parameter, so that, ‘first in the context of S’ is A-expected