Lexicon Induction From Temporal Structure

A stream of bits, about which one knows only that it consists of some unknown lexicon tokenized into n-bit words, where n is also unknown, may be thought of as a temporal structure. In other words, the HTM’s challenge is to learn a 1-to-n-bit demultiplexer which may, then and only then, be represented as an SDR.

Let me illustrate the problem by asking you to follow this recipe:

  1. Open a shell prompt.
  2. At the shell prompt, enter perl -de42 to get an interactive Perl prompt (the debugger REPL).
  3. Copy and paste this one line Perl program: for(0..100){print ( (('00') x 1, ('01') x 2, ('10') x 3, ('11') x 4)[rand(10)])}
  4. Now note a few things about the resulting string:
  5. One can, trivially, induce its lexicon as consisting of {0,1}.
  6. Conditioned on this, one can also induce its probability distribution as {0 x 7, 1 x 13} (per 20 bits, in expectation) – and from this we can see that lexical induction with n = 1 has a temporal structure.
  7. However, one can also induce a lexicon of {00, 01, 10, 11}, at n = 2, with probability distribution of {00 x 1, 01 x 2, 10 x 3, 11 x 4}, but only if taken on even ticks of the clock (starting at 0 prior to the first bit).
  8. Please note that this is, in essence, the Kolmogorov Complexity Program of the temporal structure since it is the heart of the program that generated the temporal bit string.
  9. Inducing a lexicon at any larger n – say {000, 001, 010, 011, 100, 101, 110, 111} – will (regardless of the clock offset) produce less structure (higher Shannon entropy) hence be less representative of the algorithmic information (hence algorithmic probability) of the string.
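The recipe above can be checked numerically. Here is a minimal Python sketch (the generator is reimplemented from the Perl one-liner; the larger sample size and fixed seed are assumptions for illustration) that chops the stream into candidate n-bit tokens and measures the empirical per-bit Shannon entropy; n = 2 at even offsets should come out lowest, matching the generator's lexicon.

```python
import random
from collections import Counter
from math import log2

def make_stream(n_tokens, seed=42):
    """Reimplementation of the Perl generator: tokens drawn i.i.d.
    with weights 00:1, 01:2, 10:3, 11:4 (out of 10)."""
    rng = random.Random(seed)
    tokens = ['00'] * 1 + ['01'] * 2 + ['10'] * 3 + ['11'] * 4
    return ''.join(rng.choice(tokens) for _ in range(n_tokens))

def bits_per_bit(stream, n, offset=0):
    """Empirical Shannon entropy, in bits per input bit, of the stream
    chopped into n-bit tokens starting at the given clock offset."""
    chunks = [stream[i:i + n] for i in range(offset, len(stream) - n + 1, n)]
    counts = Counter(chunks)
    total = sum(counts.values())
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    return h / n

stream = make_stream(200_000)
for n in (1, 2, 3):
    print(n, round(bits_per_bit(stream, n), 4))
# n = 2 should give the lowest per-bit entropy, i.e. the most structure.
```

Note that for n = 4 the aligned token distribution is just pairs of independent 2-bit tokens, so its per-bit entropy rate ties n = 2 asymptotically; the empirical estimate penalizes the larger alphabet only slightly.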

This illustrates a trivial example of lexical induction as temporal structure learning. Despite being trivial to state, and obviously relevant to virtually all learning (since it discovers “representation”), I’ve not been able to figure out whether HTM can induce such temporal structure.

Can HTM “discover” n when it is anything other than a small integer (say 2 or 3)?

If so, let me then throw one more challenge at HTM:

Let’s say the lexicon consists of tokens of 2 different bit lengths. For example, take a probability distribution of {0 x 1, 1 x 1, 00 x 1, 01 x 2, 10 x 3, 11 x 4} and learn that temporal structure.

Now, clearly, the neocortex can learn such temporal structures – at least for small n – so I think this is relevant, if not key, to HTM theory.

When I challenged the Computer Science section of StackExchange with this problem – as addressed by probability theory – I got back “TL;DR: use maximum likelihood and discrete optimization.”


I think I’m following you, although you are speaking in unfamiliar terms. I would say yes HTM will discover patterns in those streams, and I would also guess that it better predicts streams with a higher n because fewer semantic concepts are being transmitted in the complete stream. Also keep in mind that each sub-stream can use entirely different encoding mechanisms.

I don’t think I understand what your challenge is. As it exists today, NuPIC does what you are challenging (I think). At least I can say that it will organically recognize an incoming bit stream separated into arbitrary groups (sub-streams), each encoding in a different way. Yes that is what HTM does, just like the brain.

I take it, then, the answer is: “Yes – HTM can discover such seemingly-trivial temporal structures in a bit serial stream.”

So an obvious test would be to:

  1. Set up an HTM system with a bit-serial input and a bit-serial output
  2. Feed it the enwik9 corpus from the large text compression benchmark, one bit at a time
  3. See whether it can discover that octets are the most primitive temporal structure in the bit-serial stream, hence the most primitive lexicon, if not discover higher-level temporal structures like words, punctuation, etc.
  4. See whether it can produce, at its bit-serial output, a losslessly compressed representation of enwik9 based on, at least, the frequency of the characters in the enwik9 corpus.
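Step 4’s baseline, compressing based only on character frequencies, is what an order-0 entropy coder achieves. Here is a minimal sketch of that Shannon bound, run on a toy byte string standing in for enwik9 (the toy string is an assumption for illustration):

```python
from collections import Counter
from math import log2

def order0_bound_bytes(data: bytes) -> float:
    """Shannon order-0 lower bound, in bytes, for any static code
    based only on the byte (octet) frequencies of `data`."""
    counts = Counter(data)
    n = len(data)
    bits = -sum(c * log2(c / n) for c in counts.values())
    return bits / 8

# Toy stand-in for enwik9; any byte stream works.
sample = b"the quick brown fox jumps over the lazy dog " * 100
print(len(sample), round(order0_bound_bytes(sample), 1))
```

An HTM-based compressor would have to beat this bound to demonstrate that it has discovered structure beyond mere character frequency.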

If your opinion is that this can be done, and you can point me to the relevant literature, I’ll put time into getting the code up and running.

One bit at a time? I’m confused now. Let’s talk about how we are defining “stream”. When I think of a stream, I think of a high-dimensional space, like an SDR space, or a fiber optics cable, or the spinal cord, or a nerve bundle. In these spaces, many bits are transmitted in one time step. So “one bit at a time” leaves me thinking I don’t understand the problem. I would say that one input to the compute cycle would be a fixed length series of bits. Perhaps we’re just arguing semantics, and we can say “one line at a time” and move on.

What exactly is the enwik9 corpus? I assume it has these sub-streams you defined. But how is data encoded within them? What is the structure of the data over time? It must have repeating temporal patterns for HTM to detect it.

I don’t think HTM can do this. It can learn the structure, make predictions, and output a representation using an SDR classifier. But I don’t think you can get these primitive definitions like you want.

Hi Guys,

Forgive me if I’ve misinterpreted the focus of this conversation, but…

I just saw this discussion about serial transmission, and it makes me wonder why it would be the responsibility of your “inferencing node” (along the chain of software components processing a given stream) to “reassemble” serialized bits into bytes (I think James referred to them as octets?).

I would think that concrete assembly of bits into byte units would be more efficiently handled by some kind of de-multiplexing algorithm?

Unless of course your interest is merely curiosity and you’re wondering whether HTMs can “discover” salient units whose minimum number of parts impact the “understanding” granted by assembling those larger parts into significant formats? (i.e. what are the minimum number of units necessary to grant comprehensibility to the data?)

Think of “one bit at a time”, in HTM terms, as this data encoder:

A large number of cells is divided into two equal groups. When the next input bit is 1, one group is active. When the next input bit is 0, the other group is active.

Does that help?
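That two-group encoder can be sketched in a few lines of Python (the 100-cell width is an assumed parameter, not anything prescribed by NuPIC):

```python
def encode_bit(bit, width=100):
    """Encode a single input bit as a binary vector: the first half of
    the cells is active for 1, the second half for 0. The two encodings
    share no active cells, so they have zero semantic overlap."""
    half = width // 2
    if bit:
        return [1] * half + [0] * (width - half)
    return [0] * half + [1] * (width - half)

print(sum(encode_bit(1)), sum(encode_bit(0)))  # 50 active cells each
```

With this encoding, "one bit at a time" becomes an ordinary fixed-width input per time step, which is the form the Spatial Pooler expects.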

Ok sure. But you would need to input an entire row of data into the HTM at a time. So for this data set, how long would a row of data be?

The enwik9 corpus is a 10^9 byte snapshot of Wikipedia taken several years ago. It and its subset, enwik8 (at 10^8 bytes), are used in artificial general intelligence research to benchmark the inductive half – or discovering “What is,” aka science. Deciding what to do is the other half of AGI, frequently referred to as the “decision tree” based on some value system: “What ought to be,” aka engineering/technology/economics.

This benchmark originated when I suggested to Marcus Hutter, in 2005/2006, that he create an AGI prize. He then set up The Hutter Prize for Lossless Compression of Human Knowledge, awarding money for each incremental decrease in the compressed size of Wikipedia. enwik8 was chosen so that those living in poor countries could afford to compete with relatively low-capacity computers. This resulted in the discovery of a Russian genius who has held the record for almost all of the years since – including a prize entry, just last month, that decreased the compressed size by an astounding 4%! To achieve this leap, he incorporated an LSTM.

Since Cui, Ahmad and Hawkins published, just one year ago, a paper “Continuous Online Sequence Learning with an Unsupervised Neural Network Model” demonstrating impressive performance on serial data, and compared it favorably to LSTM, it’s natural to think HTM’s temporal learning might be incorporated into a winning entry in the most rigorous AGI competition out there.

I think what you are asking is whether HTM can be used to learn sequences in which there are only two possible inputs. For example, such a sequence might be something like ABBABABBABABBAB. The main problem you will run into with this is that HTM is not good at predicting a single repeating input (in my experience it can handle up to 3 instances of a repeating input before it loops back around to the first representation). So something like ABBBBA is indistinguishable from ABBA for example.

If this is the case, then I totally misunderstand the problem :thinking:

Let me challenge that requirement with your most recent HTM School video illustration:

Replace the notes of the scale with a 2-note scale, say “E” and “C”. Now, clearly, with minicolumns limited to only one cell each, an HTM can learn that an “E” is more likely to follow a “C” than vice versa, given a tune like:

EEEECEEEEEEECECCEEEEEECEEE

Given more cells per minicolumn, the HTM can learn a more complex temporal structure than one-ahead – even if limited to only 2 notes in the musical score.
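For the one-ahead case, the transition statistics an HTM would need to capture can be checked directly. A minimal sketch counting first-order transitions in the tune above:

```python
from collections import defaultdict

tune = "EEEECEEEEEEECECCEEEEEECEEE"

# counts[a][b] = number of times note b immediately followed note a.
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(tune, tune[1:]):
    counts[a][b] += 1

def p(next_note, given):
    """Empirical probability that `next_note` follows `given`."""
    total = sum(counts[given].values())
    return counts[given][next_note] / total

print("P(E|C) =", p("E", "C"))  # 0.8
print("P(C|E) =", p("C", "E"))  # 0.2
```

So in this tune an E follows a C four times out of five, while a C follows an E only one time in five: exactly the asymmetry the one-cell-per-minicolumn case can learn.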

Then, there is the question of the smarty pants showing up and doing a “name that tune” based on an even higher level of abstraction – with some intervening levels of abstraction corresponding to “hum me a bar of that tune”.

The reason I seem to be so unreasonable by insisting on the most trivial data encoder imaginable – the reason I’m so autistic about this – the reason I’m going to continue to be intransigent no matter how many people insist that I provide pre-digested representations to the data encoder – is that I’m trying to keep the bar very low for HTM so I can discover its most essential capability in creating levels of representation (hence levels of abstraction).

I think I see what you’re trying to do, but HTM will not work with such a small population of neurons representing the input (only one in this case). HTM works because of cell population dynamics. There must be an input of a certain length (not sure what that length is) for the SP to perform properly. So a 1-bit input will not work. But if you changed it to a 100-bit input with 50% on/off to transmit the 1-bit signal, that should work.

I believe that is what he was suggesting:

Yes, that’s what I meant in a prior response when I said:

Think of “one bit at a time”, in HTM terms, as this data encoder:

A large number of cells is divided into two equal groups. When the next input bit is 1, one group is active. When the next input bit is 0, the other group is active.

Does that help?

I get it! :smiley:

I recently discovered a bug in my implementation of HTM which could be affecting this scenario, so I will let @rhyolight comment on how well NuPIC handles a single repeating input.

In htm.js a single repeating input stabilizes on three sequential representations. I’m guessing that NuPIC probably bursts every other element in the sequence and eventually saturates the capacity of the columns (which wouldn’t be useful either).

While I understand the representation space a bit better, I’m still confused by the enwik8 format. I don’t think I understand the problem well enough to recommend HTM for the job. As you said a couple of times, as we feed in “one bit at a time”, it seems to me like a very inefficient use of the SP / TM algorithms. I don’t think it is going to perform very well on an encoding like this, because not enough semantics are being encoded in each time step. The full capacity of the SP / TM will never be used.


Does your algorithm attempt to (albeit perhaps with bugs) implement temporal learning, as illustrated by HTM School #12?

Yes. The bug may be a bit far into the weeds depending on how familiar you are with the algorithm, but I described it here