Generating sequences

I want to build a universal sequence test for TM.

My idea is to use NLTK's Context-Free Grammar (CFG) support to generate the sequences.

The problem is that I need to be able to generate sequences of increasing complexity … but I can’t wrap my head around how to come up with more and more complex grammars.

I have to do this programmatically and don’t even know how to define the next complexity level. For example:

S → “a”
S → “a” “b”
S → A B
A → “a” | “z”
B → “b”

How do you decide what the next-level grammar should be, so that I can say:

grammar = cfg.generate_grammar(level=5)
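One way to approach this programmatically is sketched below: let `level` control how many non-terminals the grammar has and how many terminal alternatives each one gets. Note that `generate_grammar` is a hypothetical function (there is no such `cfg` API), and the level heuristic is only an assumption, not an established complexity measure; the returned string is in the format accepted by `nltk.CFG.fromstring`.

```python
# Hypothetical generate_grammar(level) -- one possible heuristic, where
# each level adds a non-terminal and one more terminal alternative.
def generate_grammar(level):
    """Return a grammar string in nltk.CFG.fromstring format."""
    nonterms = [chr(ord("A") + i) for i in range(level)]
    rules = ["S -> " + " ".join(nonterms)]
    for i, nt in enumerate(nonterms):
        # i+1 terminal alternatives: deeper levels get more ambiguity
        alts = " | ".join("'%s'" % chr(ord("a") + (i + j) % 26)
                          for j in range(i + 1))
        rules.append("%s -> %s" % (nt, alts))
    return "\n".join(rules)

print(generate_grammar(3))
# S -> A B C
# A -> 'a'
# B -> 'b' | 'c'
# C -> 'c' | 'd' | 'e'
```

The resulting string can be passed to `nltk.CFG.fromstring(...)` and expanded with `nltk.parse.generate.generate(...)` to produce the actual sequences.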

As I was thinking about it I came up with another idea … turning everything upside down …

  1. Pick a sequence complexity measure, e.g. temporal entropy: H

  2. Pick a sequence difference measure to compare the original with the predicted sequence, e.g. Levenshtein distance: D

    Score = H / D

So Score indirectly measures the performance of the TM no matter what the grammar is.
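To make the Score = H / D idea concrete, here is a small sketch using per-symbol Shannon entropy as a stand-in for temporal entropy and the classic Levenshtein distance. Both choices are assumptions, as is the +1 term that keeps a perfect prediction (D = 0) from dividing by zero:

```python
import math

def shannon_entropy(seq):
    """Per-symbol Shannon entropy in bits -- a simple stand-in for H."""
    counts = {}
    for s in seq:
        counts[s] = counts.get(s, 0) + 1
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (D)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def score(original, predicted):
    # +1 avoids division by zero on a perfect prediction
    # (one possible choice of coefficient)
    return shannon_entropy(original) / (levenshtein(original, predicted) + 1)
```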

Maybe it needs a coefficient?

What do you think?


To my mind there are 2 basic elements to a sequence (other than noise), which determine how hard it’ll be for TM to learn. If I were designing a function to generate sequences of varying complexities, these would be my 2 parameters:

  1. length of pattern(s) (how many time steps it takes to repeat)

  2. presence of repeated elements in pattern(s)

The easiest kind of sequence for TM to learn is a short sequence with no repeated elements, such as:

A, B, C, D,
A, B, C, D

Next is a longer sequence with no repeated elements. This isn’t really more complex for TM, but would take more time steps to learn - because it takes more time steps for the pattern to show itself:

A, B, C, D, E, F, G, H,
A, B, C, D, E, F, G, H

The complexity comes in when we introduce repeated elements, for instance

A, B, C, D, X, B, C, Y,
A, B, C, D, X, B, C, Y,

Now the TM has to learn to differentiate between different instances of ‘B, C’, one preceding ‘D’ and the other ‘Y’. In this case the anomaly score should settle down more slowly than for the first sequences, as it learns to differentiate these 2 cases. The ambiguity should also make the TM predict multiple elements at some point (both ‘D’ & ‘Y’ from input ‘C’).

If we imagine increasing the presence of repeated elements, this differentiation becomes more complex and thus requires more repetitions, for instance with:

A, B, A, G, B, C, G, A,
A, B, A, G, B, C, G, A,
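A minimal generator driven by exactly those two parameters could look like the sketch below. The function name and the copy-from-the-front scheme are my own assumptions; the point is just that pattern length and the count of repeated elements are the only knobs:

```python
def make_sequence(pattern_length, n_repeated,
                  alphabet="ABCDEFGHIJKLMNOP", repetitions=2):
    """Build one pattern of `pattern_length` elements containing
    `n_repeated` deliberately repeated elements, then repeat it."""
    if pattern_length > len(alphabet):
        raise ValueError("pattern too long for alphabet")
    pattern = list(alphabet[:pattern_length])
    # Copy the first n_repeated elements over the last positions, so e.g.
    # 'B' occurs twice with different successors -- the ambiguity the TM
    # has to learn to resolve.
    for i in range(n_repeated):
        pattern[pattern_length - 1 - i] = pattern[i]
    return pattern * repetitions

make_sequence(4, 0)  # -> ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D']
make_sequence(8, 2)  # pattern 'A B C D E F B A', repeated twice
```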

I think an objective way of measuring the complexity of noiseless sequences is to count the number of time steps until:

  1. The anomaly scores flatline at 0 AND

  2. The number of predictions flatlines at 1

That’s when the sequence is totally learned, when the TM is never surprised and fully precise.
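That measurement could be sketched as a helper that scans per-step logs from any TM implementation. The `window` parameter (how long both signals must stay flat before they count as flatlined) is an assumption, not part of the criterion above:

```python
def steps_to_learn(anomaly_scores, prediction_counts, window=10):
    """Return the first time step from which the anomaly score stays at 0
    AND the number of predicted elements stays at 1 for `window` steps.
    Works on per-step logs from any TM implementation; returns None if
    the sequence was never fully learned."""
    n = len(anomaly_scores)
    for t in range(n - window + 1):
        if all(anomaly_scores[t + k] == 0 and prediction_counts[t + k] == 1
               for k in range(window)):
            return t
    return None
```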


As far as generating simple-to-complex levels of grammar, have you looked at implementing something like an L-system to generate commands that have grammatical rules that can be learned, as well as levels of complexity? Maybe that’ll give you a much smaller domain (than, say, a human language) to tackle, but still with most of the features you’d want to learn.
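For reference, an L-system is only a few lines of Python, which makes it easy to pair with a complexity knob (the iteration depth). This is the classic algae example, not a TM-specific benchmark:

```python
def lsystem(axiom, rules, iterations):
    """Expand an L-system: rewrite every symbol in parallel each step.
    Sequence length and structure grow with the iteration count."""
    s = axiom
    for _ in range(iterations):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

# Lindenmayer's algae system: A -> AB, B -> A
rules = {"A": "AB", "B": "A"}
print(lsystem("A", rules, 4))  # -> ABAABABA
```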


I hear you … I was also wondering if you could use this to compare the performance of different TM or NN memory implementations.

This was done to an extent in the 2016 paper “Continuous Online Sequence Learning with an Unsupervised Neural Network Model”.

They generated artificial sequences composed of different sub-sequences, with noise injected in between. The idea I think was to test how long each alg took to recover to 100% prediction accuracy after the switch between sub-sequences.
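A hedged sketch of that kind of setup is below; the exact sub-sequences, noise lengths, and alphabet used in the paper are not reproduced here, only the general shape (blocks of sub-sequences with random noise symbols injected between them):

```python
import random

def noisy_stream(subsequences, n_blocks, noise_len, noise_alphabet, seed=0):
    """Concatenate randomly chosen sub-sequences with `noise_len` random
    noise symbols injected between them -- an approximation of the
    artificial-sequence setup described above."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n_blocks):
        stream.extend(rng.choice(subsequences))
        stream.extend(rng.choice(noise_alphabet) for _ in range(noise_len))
    return stream
```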

It seems TDNN was the only comparable performer to HTM.

Interesting … those resemble Constraint Handling Rules … but neither has a Python implementation ;( … CFG does.