Lexicon Induction From Temporal Structure



The point is to exploit HTM’s ability to abstract representations in a “hierarchy”, as shown in its iconic illustration.

If HTM can abstract things like bytes from the trivial, two-“note” data encoder, it may then be able to abstract things like morphemes and punctuation from bytes at the next layer in the hierarchy, abstract words (what people normally think of as “the lexicon”) at the layer after that, and then sentences – at which point “thought”, “semantics”, etc. have become apparent in the SDR at that layer. These are the sorts of knowledge representations that can lead to better prediction of “the next bit”, going back down the hierarchy to the bottom, and hence superior compression.
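To make “inducing the byte” concrete, here is a minimal statistical sketch (plain Python, not HTM – the function names are mine and purely illustrative) of what that first abstraction step would have to accomplish: recover the 8-bit frame length and alignment from a raw serial bit stream, with no labels. It exploits the fact that ASCII’s most significant bit is always 0, so the true framing exposes a zero-entropy bit position:

```python
from collections import Counter
from math import log2

def bits_from_text(text):
    """Serialize ASCII text into a flat bit stream, MSB first."""
    return [(ord(c) >> (7 - i)) & 1 for c in text for i in range(8)]

def entropy(bits):
    """Shannon entropy (in bits) of a 0/1 sequence."""
    counts = Counter(bits)
    total = len(bits)
    return -sum(n / total * log2(n / total) for n in counts.values())

def induce_frame(stream, max_len=16):
    """Guess the frame length and alignment of a serial bit stream by
    finding the residue class (bit position) with the lowest entropy.
    For ASCII text the most significant bit of every byte is constant,
    so the true framing exposes a zero-entropy position."""
    best = None
    for length in range(2, max_len + 1):
        ents = [entropy(stream[r::length]) for r in range(length)]
        score = min(ents)
        if best is None or score < best[0]:
            best = (score, length, ents.index(score))
    return best[1], best[2]  # (frame length, offset of the constant bit)

stream = bits_from_text("the quick brown fox jumps over the lazy dog. " * 5)
frame_length, msb_offset = induce_frame(stream)  # -> (8, 0)
```

Nothing here is HTM; it only shows that the “byte” is statistically recoverable from temporal structure alone, which is what a one-level learner would have to discover on its own.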


We think there is a whole lot more going on in one level of the hierarchy than we originally thought. None of our newest theory work includes hierarchy, so I could not recommend HTM at this point if you think hierarchy is the magic sauce. It may very well be, but we still have more to figure out within a cortical column before working on hierarchy again. For more info, see “H is for Hierarchy, but where is it?”.


Thanks. That history is a good perspective. However, the link to Hawkins’s critical follow-up presentation wasn’t immediately obvious.

If I correctly understand Numenta’s current strategy from Hawkins’s talk:

Until the “function” of the macrocolumn is understood “bottom up” (in terms of the neurophysiology), revisiting the hierarchy “top down” (in terms of global function) is likely to be unsound. The apparently minor critique I would offer is that even while driving research from the neurophysiology, there is nevertheless always a range of plausible global functions under consideration. This is, after all, what Numenta has basically stated about perception: that it is biased – primed by the plausible interpretations, aka “expectation”. The reason I say “apparently minor” is that it is indeed a minor critique: I’m sure Hawkins et al. understand this is what they’re doing and, quite reasonably, expect others to as well. I say “apparently”, however, because it is always good practice in science to be explicit about the hypotheses being tested – not individually or absolutely (i.e., not in the Popperian “falsifiable” sense) but relatively: rank-ordering them by plausibility given the ground-truth observations, which in this case are the neurophysiology.

This requires a review of the literature of macrocolumn function.

My particular interest in “lexicon induction” comes from the fact that my review of that literature leads me to believe the most promising macrocolumn function involves a lexicon specific to each macrocolumn. Once this lexicon is established, global phenomena such as syntax, grammar and semantics – in natural language – emerge naturally.

The mystery is how the lexicon is established for each macrocolumn and how the lexicons are factored between macrocolumns.

Of course, I’m not suggesting that Numenta drive their work on the macrocolumn function from this hypothesis because that would be all too Popperian. No, I’m suggesting a more relativistic Plattian “strong inference” approach.

In this regard, I would also strongly suggest taking very seriously “Universal Measures of Intelligence” involving Kolmogorov complexity when establishing the relative plausibility of unsupervised learning hypotheses. I don’t believe any of the current major efforts in AGI are doing this, although it has been known to be the correct approach since the early 1960s, with the papers by Solomonoff, Kolmogorov, Chaitin et al. – which were coincident with Platt’s paper on strong inference.

PS: The top of Google’s search results page for strong inference and Popper comes up with an execrable paper titled “Fifty years of J. R. Platt’s strong inference.” I won’t dissect it but will simply quote the last, very Popperian, sentence: “It [strong inference – jab] is a message that can benefit anyone who is interested in tackling difficult problems – we must be bold enough to assume that one of our ideas is correct [emphasis – jab], and yet we must have the humility to abandon those ideas that don’t stand up to scrutiny.”


Where you say “macrocolumn” I’m going to use the term “cortical column”, which is our current terminology (in contrast to the “mini-column”, which exists within layers of cortical columns).

Yes, I think this is correct – or at least, one cortical column’s representation of a “ball” will not be the same as another’s. Every column has a unique encoding of reality.

I think they are initialized pretty much randomly and grow organically over time using established Hebbian learning rules.

Even though each cortical column represents objects with a completely different set of neurons, representations can still be communicated between columns because synaptic growth between columns links these ideas together.

Regarding your other comments, I can’t speak for Numenta’s research philosophy.


the “mini-column”, which exists within layers of cortical columns

Accepting the term “cortical column”, my impression of minicolumns is that they do not stack to form cortical columns; rather, a cortical column is composed of a large number of parallel minicolumns, each of which spans all the layers of the cortical column.

Is my impression incorrect?

(There is clearly a terminology problem in the field of neuroanatomy, as indicated by this quote from “The Neocortical Column” by DeFelipe et al.: “the term column is used freely and promiscuously to refer to multiple, distinguishable entities, such as cellular or dendritic minicolumns or afferent macrocolumns, with respective diameters of <50 and 200–500 μm.”)

one cortical column’s representation of a “ball” will not be the same as another’s

A “ball” meaning an n-dimensional sphere in an appropriately scaled space of the world’s properties, containing what I might call a “lexicon” that represents some property of the world – sensed and/or imputed, e.g. color in terms of a “lexicon” like (“red”, “green”, “yellow”, etc.) ?

I think they [the lexicon of each cortical column – jab] are pretty randomly initialized and they grow organically over time using established Hebbian learning rules.

If it is really that simple, I refer you back to my original, very simply stated but very difficult challenge here.

Hawkins is absolutely on target in placing time as the foundation. It was a search for prior work on neocortical modeling, so founded, that drew my attention to Numenta, and that elicited my challenge, based on a serial bit stream, to do a very simple lexical induction: the ‘byte’.

To quote Hawkins from the aforelinked video, at this timecode:

“The trick to making progress is to really pick the right problem to solve. I can’t say that enough.”


There are a couple of things that HTM is currently lacking that make it ill suited for this particular challenge IMO.

Firstly, it lacks the ability to “self-label” sequences in temporal memory. In practice, this means a robust temporal pooling strategy that can form stable representations of sequences capturing the proper semantics, and that can adapt quickly when switching from one sequence to another (attention will likely be a key component of the latter).

Secondly, it lacks stability on repeating sequences. Because this particular challenge involves only two unique inputs, the system will encounter many cases where the same input repeats many times in a row. The system must still be able to encode context properly, and to distinguish the same repeating pattern when it occurs within multiple higher-order sequences.

Thirdly, it lacks motor control. I suspect that even a human subject tasked with finding the concept of a “byte” from a stream of on / off inputs will not use pure sequence memory to do it – they will use motor control to manipulate the input and explore it (by physically scanning their eyes over a screen or sheet of printed paper, by replaying the input in their “mind’s eye”, etc.)

Note that these are all things that need to be addressed by HTM to properly model a single level in the hierarchy. I’m not sure what hierarchy itself would bring to the table without these more basic capabilities.


Thirdly, it lacks motor control. I suspect that even a human subject tasked with finding the concept of a “byte” from a stream of on / off inputs will not use pure sequence memory to do it – they will use motor control to manipulate the input and explore it (by physically scanning their eyes over a screen or sheet of printed paper, by replaying the input in their “mind’s eye”, etc.)

Now you’re challenging me to do something I’ve been thinking about with regard to this question:

Actually synthesize music with these patterns to see if lexical induction emerges in humans from temporal input. I strongly suspect it does.

While you may be correct that humans – as entire organisms – utilize spatio-temporal input to the maximum extent possible, that observation detracts from the central importance of the challenge to Numenta’s model of the cortical column.

PS: I’m specifically and very carefully not addressing the hierarchy in this challenge, but rather am issuing a challenge to those who claim that it is important to figure out what a cortical column does as a prerequisite for hierarchy…


I don’t know whether mini-columns extend throughout all the layers in a cortical column or if some or most of them are localized within a layer.

Our terms have evolved over time. Perhaps this video will help with our current nomenclature:

Yes exactly. But some cortical columns will represent the ball in terms of somatic sensory input because they are processing that type of input. Other cortical columns processing visual input will represent visual aspects of the ball. In both cases, the cortical columns are mapping sensory features to locations in space.


Keep in mind that our cortices did not evolve to process information at this level. Our brains process gargantuan amounts of parallel data, all inter-associated over time with our own actions and the world. It doesn’t surprise me that HTM is bad at these very low-level prediction tasks involving very small encodings. The encodings coming from our senses are much, much larger than any we are even simulating.


True, and I am calling out the fact that there is a lot of work to be done in understanding and modeling the low-level capabilities that HTM theory needs to address before it will be suited for this challenge. I think we agree that the ability to extract concepts is a prerequisite for hierarchy. I am just being specific about a few of the capabilities (from my perspective) that are missing from HTM.


Let’s be careful about the distinction between HTM theory and brain reality. We can argue forever about what evolution did; we can argue for a far shorter version of “forever” about HTM theory.

What is it in HTM theory that tells us to expect it to become worse as the number of SDRs a cortical column must deal with decreases?


HTM works because of the dynamics of large populations of neurons. Input into the system also involves large populations of neurons. If the input is sufficiently small, the SDR comparison properties we depend on to represent concepts break down. I just don’t think it will work.

Here is a good paper describing why HTM depends on these properties of SDRs requiring large populations of neurons.
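The core of that argument can be reproduced with a few lines of combinatorics. The sketch below (my own illustration, not code from the paper) computes the exact probability that a random SDR spuriously matches a fixed one, showing why large populations are essential and tiny ones are not:

```python
from math import comb

def false_match_prob(n, w, theta):
    """Exact probability that a random SDR (w active bits out of n)
    overlaps a fixed SDR of the same size in at least theta bits,
    i.e. the chance of a spurious match (a hypergeometric tail)."""
    return sum(comb(w, b) * comb(n - w, w - b)
               for b in range(theta, w + 1)) / comb(n, w)

# Large population (typical HTM scale): false matches are vanishingly rare.
p_large = false_match_prob(2048, 40, 20)
# Tiny population: the same matching rule misfires roughly 16% of the time.
p_small = false_match_prob(20, 4, 2)
```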


We should also be careful about the distinction between the capabilities of HTM now, in its current state, and the goal of HTM, which is to model intelligent systems using the biology of the cortex as the point of reference. With that in mind, I think it makes perfect sense to reference brain realities when discussing how HTM should/will eventually behave, and whether it is or ever will be suitable in a particular environment (in this case, small-scale isolated inputs with high density).


It is probably worth mentioning that Numenta’s current research is published at https://numenta.com/papers/ and latest is always at the top. We include code samples with most papers, including this last one.

I will try to make an effort to disambiguate implemented theory from conjecture. In any case, speculation or not, it doesn’t seem to me that HTM will be competitive in this task.


When I say “number of SDRs” I’m referring to neither the number of neurons nor the number of synapses; you can have a billion neurons representing two SDRs. How does failing to utilize the full capacity of a layer handicap learning in HTM theory, so long as one is using an appropriate ratio of active to inactive cells?


This is related to the problem of repeating sequences (in this case, a single input repeating several times in a row). To stabilize the inference layer, I personally believe you must form a finite set of representations to avoid saturating the capacity of the minicolumns involved. This, however, results in an inability to distinguish an input that repeats a few times from one that repeats a few more times.

This is a problem that the brain must have solved, though. If you think about a sustained tone, for example, the same set of hair cells in the cochlea sends repeating pulses over and over again. I suspect that a timing signal is the missing component here. If the system can form a representation in which a percentage of the cells are stable and represent the repeating input, while another percentage change dynamically to represent how long the pattern has been repeating, it should in theory solve the problem.
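As a toy illustration of that stable/dynamic split (entirely hypothetical – the cell groupings and parameters are made up for the sketch, not part of HTM), one can model the representation as a union of a stable subset keyed to the input and a small rotating “timing” subset keyed to the repeat count:

```python
def repeat_representation(input_sdr, repeat_count, n_timing=20, k=5):
    """Hypothetical split representation for a repeating input: cells
    tagged 's' stay stable (they code the input itself), while a small
    rotating block of cells tagged 't' codes how long it has repeated."""
    start = (repeat_count * k) % n_timing
    timing_cells = {('t', (start + i) % n_timing) for i in range(k)}
    stable_cells = {('s', c) for c in input_sdr}
    return stable_cells | timing_cells

a3 = repeat_representation({1, 5, 9}, repeat_count=3)
a5 = repeat_representation({1, 5, 9}, repeat_count=5)
# The stable subset is shared, so the input stays recognizable...
shared = {c for c in a3 & a5 if c[0] == 's'}
# ...while the timing cells differ, distinguishing 3 repeats from 5.
```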


I suspect a better test would be learning Morse code.
The sequences are relatively short and each code sequence is time-based.


Hi Paul,
If timing is of interest to you, there is a recent paper about timing in

which shows how the brain converts the activity “level” of a neuron in the thalamus into the speed of a dynamical system made of recurrently connected neurons in the caudate, which effectively translates into a controllable delta time between two states of that recurrent set of neurons.


No paywall:


Morse code certainly has the advantage of being a minimalist character encoding that humans learn to interpret as a time-sequence stream. Its disadvantages are that humans learn Morse code character boundaries in a supervised manner, and that, since the character boundaries fall at variable time intervals, it is more difficult than a comparable fixed-length character code such as ASCII would be. Outputting 6-bit character codes from a serial bit stream of the same, after unsupervised training, might be a reasonable reduction in complexity.
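For concreteness, a 6-bit serial training stream of the kind proposed is trivial to generate. The sketch below uses a SIXBIT-style mapping (ASCII 0x20–0x5F, minus 0x20) – an arbitrary choice on my part, any fixed 6-bit code would do:

```python
def to_sixbit_stream(text):
    """Serialize text into a 6-bit-per-character serial bit stream,
    using a SIXBIT-style mapping (ASCII 0x20..0x5F -> codes 0..63;
    lowercase is folded to uppercase first)."""
    out = []
    for c in text.upper():
        code = ord(c) - 0x20
        if not 0 <= code < 64:
            raise ValueError(f"not representable in 6 bits: {c!r}")
        # Emit the 6 bits of the code, MSB first.
        out.extend((code >> (5 - i)) & 1 for i in range(6))
    return out

stream = to_sixbit_stream("HELLO WORLD")  # 11 characters -> 66 bits
```

The unsupervised task would then be to recover the 6-bit frame (and ultimately the character lexicon) from this stream alone.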

There was a Morse code Kaggle competition nearly 3 years ago, although entrants were not restricted to unsupervised learning in a purely temporal manner.