I am just getting started learning HTM theory (computer science background, no biology expertise). I have watched the videos on the HTM Open Source, Numenta, and various university channels to try and gain a good initial grasp of the theory before diving into the source code (I actually got hooked after accidentally stumbling across a video from 2010 on the UBC channel).
So that is where I am now in the learning process. For the most part the theory is making a great deal of sense to me. However, there is one foundational point that I am struggling with a little. I feel like it is a key component of the theory, so I would like to comprehend it a little better before I dive into the code. This is probably a dumb question, so I apologize in advance.
In the hierarchy elements of the HTM theory, it is mentioned in numerous videos that the output of the lower levels in the hierarchy becomes the input to the higher levels. However, from my limited initial understanding, it seems that the output is simply a rearrangement of the input bits (with the loss of some percentage of them). To me, the predictive states seem more significant than the output. I feel like I am missing an important point. Maybe this is because the focus of these introductions is primarily dealing with layer 2/3, and the answer deals with the other layers?
BTW, I did read the “H is for Hierarchy…” thread, but my question is more conceptual/theoretical (definitely not asking where H is implemented in the new source). Also, sorry if there is already a thread about this (I didn’t see one, but I can be blind from time to time).
The answer to my question hit me soon after posting (probably writing the post helped). Just in case anyone else has the same question, the output actually is more than just an arrangement of the input bits (that would be the case if only information about the columns were considered). Since it is cells within the columns that constitute the output, that means the output contains information about a transition context, not just about the current input. It now makes sense how feeding that into another level would be able to make predictions about transitions, and so on.
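To make that concrete, here is a minimal sketch of why cell-level output carries more than column-level output. It is only a toy: the column indices are made up, and the rule for picking a cell from the context is a hypothetical stand-in for what learned segments would do in a real temporal memory.

```python
CELLS_PER_COLUMN = 4  # toy size, chosen only for this illustration


def active_cells(columns, context):
    """Pick one cell per active column; the context decides which cell.

    The arithmetic below is a hypothetical stand-in for learned segments.
    """
    offset = sum(ord(ch) for ch in context)
    return {(col, (offset + col) % CELLS_PER_COLUMN) for col in columns}


def columns_of(cells):
    """Drop the cell index, keeping only the column information."""
    return {col for col, _ in cells}


d_columns = {3, 17, 42, 58}  # made-up columns selected for the note D
d_after_e = active_cells(d_columns, "E")  # D heard after E
d_after_c = active_cells(d_columns, "C")  # D heard after C

# Column view: the two cases are indistinguishable.
print(columns_of(d_after_e) == columns_of(d_after_c))
# Cell view: the two cases differ, so the context is visible to the next level.
print(d_after_e == d_after_c)
```

Seen at column resolution the two D inputs look identical; seen at cell resolution they do not, which is the extra transition context described above.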
Hierarchy isn’t part of HTM yet. Numenta is focusing on completing a single region first, so it’s probably best to focus on single regions.
I think what you said is similar to an early idea about the output, but it’s going to be a lot more complicated. Even single regions are going to get more complicated, because the established HTM system corresponds to just one layer out of quite a few, depending on how you divide them. There are also dozens of neuron types with unique interactions between each type, so it might be more accurate to think of a region as dozens of layers. I’m guessing a lot of those neuron types are non-essential, but it’s still going to be complicated.
As I understand it, the current thinking is that there are two feedforward outputs to the next level. The one which determines column states is L5, which sends up both predictions and the current representation, and is also involved in causing motor outputs. I think what you said is pretty close to that, but there are a lot of unknown details. The other output is L2/3, which would help the next level in the hierarchy make predictions when the brain generates motor output and maybe in other situations.
One idea in the mix is temporal pooling (which is what temporal memory used to be called). The goal is essentially to represent sequences rather than the current input. That’s sort of a lie because it just forms a more stable representation, where neurons stay on longer than the inputs, sort of like representing little chunks of sequences. Temporal pooling might have roles both in the output to the next highest region and in the interactions between layers.
This is coming from a non-expert so trust other sources first.
Thanks very much for the reply. Yes, I definitely understand that hierarchy is not currently the focus. Really my question was to understand one of the core concepts of the theory at a very high level. Obviously hierarchy is a core component of the theory (represented in virtually every high-level introduction video about HTM). My confusion was that it seemed the output was just another representation of the input, but then I realized my mistake: I was thinking from a column (spatial pooling) perspective and not a cell (temporal pooling) perspective. In other words, the next level in the hierarchy gets more information than just the current input. I don’t need to go too much further into the specifics than that at my current level of understanding (I’ll need to dive into the details a lot more in the coming weeks before I can have a more intelligent conversation).
My understanding of this is as follows: After SP and TM the region collapses (logical OR) the active+predicted cells into a new column-SDR which constitutes the output of the region.
Yes, you could say that this SDR is just a “re-arrangement of the input SDR” because it’s the same size just with different bits switched on. But obviously that would be missing the powerful characteristics of large SDRs (especially their union function):
The new output-SDR represents a higher-level, aggregated representation of the input. One could think of it as a sum of the most prevalent (i.e. not all – to maintain sparsity) activations (=features) and predictions. It is then passed up as well as down the hierarchy. On a higher level, it is used as input for further abstraction. On the lower level it is used for interpretation and likelihood assessment of different options.
HTH. Would be glad to see other people’s understanding as well.
Thanks for another really good reply. Got me thinking about some of my assumptions. I am not sure I understand one point though, where you talk about aggregating active + predicted cells. From a programming perspective, I can certainly conceive how you could implement that, but not sure how that would relate to the real world. As I understand it, in the predictive state a neuron is not actually firing (“passed up the hierarchy” implies firing).
I think only historic/transition (not predictive) information can be aggregated with the current input. Let me give a simple example of my current level of understanding so you can poke holes in my assumptions (that’s usually how I learn best)
Say I train a single-level system on a simple musical sequence (the first 7 notes of “Mary Had a Little Lamb”) a few times, then play the first two notes and check the prediction:
With just one level, I should get
Prediction: (C,E) (i.e. next note will be either C or E)
If I send the output of this first level up to a second level in the hierarchy, I can get a more precise prediction:
In your example, when the system gets the first D, it will predict C, and it will predict E when it gets the second D in EDCDEEE. That’s because it represents each input in context of the previous representation, which was in context of the next previous representation, and so on, until the first input of the sequence, where it represents the input without context. I like to think of this as representing the sequence thus far. When it gets the first D, it predicts based on ED, and, for the second D, it predicts based on EDCD. That’s after learning, though, and it can be a lot more messy than that. For example, some of the columns might still burst.
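The difference between predicting from the last note only and predicting from the sequence so far can be sketched with plain lookup tables. This is not how temporal memory is implemented; `train` and `predict` are hypothetical helpers, and the full-history table is just a stand-in for "representing each input in the context of everything before it".

```python
from collections import defaultdict

SEQUENCE = list("EDCDEEE")  # the note sequence as written in this thread


def train(sequence, context_len):
    """Map each context (the preceding notes) to the notes seen after it.

    context_len=None means the whole history so far is the context.
    """
    table = defaultdict(set)
    for i in range(1, len(sequence)):
        start = 0 if context_len is None else max(0, i - context_len)
        table[tuple(sequence[start:i])].add(sequence[i])
    return table


def predict(table, history, context_len):
    """Look up the notes that followed this context during training."""
    start = 0 if context_len is None else max(0, len(history) - context_len)
    return table.get(tuple(history[start:]), set())


first_order = train(SEQUENCE, context_len=1)
full_context = train(SEQUENCE, context_len=None)

print(predict(first_order, list("ED"), 1))         # ambiguous: C or E
print(predict(full_context, list("ED"), None))     # first D: C
print(predict(full_context, list("EDCD"), None))   # second D: E
```

With only the previous note as context, D is followed by either C or E. With the history as context, a single level already separates the two cases, so no second level is needed for this particular example.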
As I understand it, the main purpose of hierarchy is to use more abstract representation in higher regions and to use that abstract representation to aid lower regions. If the system got visual data, the lowest region might represent lines, the next might represent simple details of objects, the next might represent full objects, and the highest might represent whole scenes. Exactly what a region represents will vary depending on how well the hierarchy has learned the current scenario.
One thing which was confusing for me is that HTM relies heavily on predictions, but that’s not the final goal. The final goal of the brain is to interact with the world. A single region already makes good predictions, so my guess is that sending predictions up the hierarchy would be involved in some other process.
There isn’t just one output, because there are multiple layers and neuron types. Most just go up or just go down, so there’s probably a significant difference between what goes up and what goes down.
The outputs probably wouldn’t be the same size as the number of columns in the lower region. One reason is that the outputs probably include the context state of the column, not just whether or not any neuron in the column is on/predicting. Also, the most powerful outputs go to the thalamus first, which has a different number of output cells than the corresponding cortical output (probably fewer). This allows it to maintain sparsity. For example, it could be like sending the output to a spatial pooler, representing everything in a more general way. It’s definitely more complicated than that, though.
I see… so the implication of what you describe is that output of a single processor is fed back as input on that same processor allowing it to nest lower-level contexts within higher-level contexts without requiring a hierarchy (well technically there is still hierarchy, but by taking advantage of sparsity, one processor is able to play the role of more than one hierarchical level). I suppose I can see how that could work (I’ll need to play around with some examples to be more confident I think).
To clarify how I interpreted what you wrote, Casey, I drew up this simple diagram. Basically, the idea is that cells within the columns can themselves provide input in the “spatial pooling” phase (i.e. selecting the column). Let me know if I completely misinterpreted what you described.
Never mind, I see that I misinterpreted what you were saying. Basically the D in my scenario will be different when it comes after an E versus when it comes after a C, so without additional information it can predict the next note will be C and not E, because of the previous step. In other words, I chose a bad example for when hierarchy might be needed to learn a larger sequence of notes.
I find this confusing too… but here is how I’m interpreting it.
You have 4 options (let’s say for a region of size 5x1000):
(1) Vertically aggregate the active cells: SDR of 1000 bits
(2) Vertically aggregate the union of active and predicted cells: SDR of 1000 bits
(3) All active cells: 5000 bits
(4) Union of active and predicted cells: 5000 bits
You are free to use any of those, depending on the circumstances.
For example (1) and (2) should work if you want to directly compare input and output w/o the need to map/classify the value. I use it and it works OK. Or you can use it to easily chain TM’s.
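The four options above can be sketched as follows. This is a rough illustration under my own assumptions: cells are stored as (column, cell) pairs, the helper names are made up, and the example cell coordinates are arbitrary.

```python
CELLS_PER_COLUMN = 5
NUM_COLUMNS = 1000


def per_column(cells):
    """Collapse vertically: a column bit is 1 if any of its cells is set."""
    cols = {col for col, _ in cells}
    return [1 if c in cols else 0 for c in range(NUM_COLUMNS)]


def per_cell(cells):
    """Flatten the region: one bit per cell."""
    bits = [0] * (NUM_COLUMNS * CELLS_PER_COLUMN)
    for col, cell in cells:
        bits[col * CELLS_PER_COLUMN + cell] = 1
    return bits


def outputs(active, predicted):
    """Return the four candidate outputs, keyed by option number."""
    both = active | predicted
    return {
        1: per_column(active),  # 1000 bits, active only
        2: per_column(both),    # 1000 bits, active OR predicted
        3: per_cell(active),    # 5000 bits, active only
        4: per_cell(both),      # 5000 bits, active OR predicted
    }


active = {(10, 0), (20, 3)}     # arbitrary example cells
predicted = {(20, 4), (30, 1)}
out = outputs(active, predicted)
print({k: (len(v), sum(v)) for k, v in out.items()})
```

Options (1) and (2) have the same width as the column-level input, which is why they are convenient for direct comparison or for chaining; options (3) and (4) keep the per-cell context at the cost of a wider SDR.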
What is important is that the region as a whole represents everything you know so far.
The particular active bits represent the specific item in the current context.
It is all the available information you will get, you can choose to ignore part of it if you wish.
Thanks mraptor, that is a very good outline of your understanding of the theory, and very helpful. I am doing the same thing you did – writing my own implementation of components of the HTM theory as an exercise to understand it better (prior to looking at NuPic source code).
Oh, I didn’t understand that you were looking for ways hierarchy is needed for larger sequences.
I think a single region could learn a sequence of any length, but it would need to see that sequence many times. Hierarchy lets it learn sequences of sequences. For example, the lowest region might learn words and the next highest might learn phrases. It doesn’t need to learn every phrase as a completely different sequence. Instead, it just learns sequences of words. This cuts down on the sequence length for learning phrases, letting it learn phrases with fewer exposures.
Feedback is another reason for using a hierarchy. I think feedback has been implemented, unlike all other aspects of hierarchy. Feedback helps deal with noise of various types, like item deletion, insertion, misordering, repeats, and so forth. Let’s say it knows the sequence ABCDE, but it gets the sequence with C and D reversed (ABDCE). Normally, HTM would completely lose track of the sequence once it gets D instead of C. To solve this issue, a union pooler keeps track of the current sequence. It also maintains the representation of the same sequence when it gets an unusual input, at least for a little while. As such, when the system gets D instead of C, the union pooler doesn’t forget the current sequence, unlike the temporal memory. The union pooler predicts every temporal memory cell which typically activates at some point during the sequence, so the same cells will be used for each input even when the sequence encounters brief noise. Once the sequence progresses beyond the noise, the temporal memory can continue tracking the sequence normally.
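Here is a very rough sketch of the ABDCE case, to show the shape of the idea rather than any actual implementation. Everything in it is an assumption for illustration: one learned cell per letter, a pooler that switches on once an input is correctly predicted, and a made-up `hold` parameter for how many unexpected inputs it tolerates before giving up.

```python
SEQUENCE = "ABCDE"
# One learned cell per letter, identified by (letter, position).
CELL = {letter: (letter, i) for i, letter in enumerate(SEQUENCE)}
# Learned lateral transitions: each cell predicts its successor.
NEXT = {CELL[a]: CELL[b] for a, b in zip(SEQUENCE, SEQUENCE[1:])}


def run(inputs, use_pooler, hold=2):
    """Return, per input, whether its learned cell was predicted."""
    lateral = set()  # cells predicted by the previous input
    pooled = set()   # cells predicted by feedback from the pooler
    misses = 0
    was_predicted = []
    for letter in inputs:
        cell = CELL[letter]
        by_lateral = cell in lateral
        was_predicted.append(by_lateral or cell in pooled)
        if use_pooler:
            if by_lateral:
                # Sequence recognised: predict every cell of the sequence.
                pooled, misses = set(CELL.values()), 0
            else:
                misses += 1
                if misses > hold:
                    pooled = set()  # too much noise, drop the sequence
        # Predicted or not, the letter's cell becomes active and
        # predicts its learned successor, if it has one.
        lateral = {NEXT[cell]} if cell in NEXT else set()
    return was_predicted


print(run("ABCDE", use_pooler=False))
print(run("ABDCE", use_pooler=False))
print(run("ABDCE", use_pooler=True))
```

Without the pooler, the swapped D and C and the following E all arrive unpredicted. With the pooler holding the sequence for a couple of steps, the same learned cells are still predicted through the swap.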
The temporal memory has two outputs to the union pooler. The first is the active/inactive cell states, and the second is only those which were predicted. One way these may be distinguished in biology is the number of times the cell fires when its column turns on. For example, a predicted cell might fire twice while an unexpected cell might fire once.
For example, the lowest region might learn words and the next highest might learn phrases. It doesn’t need to learn every phrase as a completely different sequence. Instead, it just learns sequences of words.
I agree. But how does a region decide on the “cut-off” point when to pass its current state on to an upper region? For the NLP case one could simply hard-code punctuation marks as the cut-off from the word-region to the higher-level phrase-region, use a full-stop as the cut-off for the sentence-region, etc. But that’s probably not how the brain does it. So how to decide on correct point/time of hierarchical transition?
To solve this issue, a union pooler keeps track of the current sequence. It also maintains the representation of the same sequence when it gets an unusual input, at least for a little while. As such, when the system gets D instead of C, the union pooler doesn’t forget the current sequence, unlike the temporal memory. The union pooler predicts every temporal memory cell which typically activates at some point during the sequence, so the same cells will be used for each input even when the sequence encounters brief noise.
What exactly is this union pooler and where in the algorithm is it applied?
There isn’t a clear cut-off point. Language is probably a bad example because it has clear cut-off points.
Each neuron does its own thing (except for effects of competition). A given neuron learns its own sequence, and that sequence probably won’t be as long as more obvious sequences. For each new input, some neurons will stay on and some will turn off, so they aren’t all changing states at once after every word, sentence, etc. Not as many outputs will change as inputs, so the output is more stable than the input.
Each neuron in the union pooler decides when to change states based on learning. Neurons don’t need to stay on a fixed amount of time. They just need to stay on longer than inputs in response to sequences.
I might be using that name wrong. I don’t know of any difference between the union pooler and the previous temporal pooler except for how they work.
The union pooler receives input from the temporal memory and sends feedback to the temporal memory. In the algorithm, I guess it would do its thing after the temporal memory states are found but before predictions are found, because it contributes to the temporal memory’s predictions. I don’t know what order Numenta uses, though.
@Casey Thanks for your response and for sharing the links which helped to clarify the terminology!
Here’s an extract… maybe helpful for others as well:
History of the term “temporal pooler”
Temporal pooling has been an active area of research for HTMs and Numenta for several years. The meaning of temporal pooling and the overall goals of temporal pooling have been largely consistent. However the term “temporal pooler” has been used for a number of different implementations and looking through the code and past documentation can be somewhat confusing.
The original CLA Whitepaper used the term temporal pooler to describe a particular implementation (let’s call this temporal pooling version 1). This implementation was intricately tied in with sequence memory. As such the sequence memory and temporal pooling were both referred to as “temporal pooling” and the two functions were confounded. In NuPIC the code files TP.py and TP10x.py implement sequence memory but use the old terminology. The newer files, such as temporal_memory.py, avoid the term “temporal pooling” and just implement sequence memory.
That’s an interesting question. I think this is where “patterns being pushed down the hierarchy” comes in, which I’ve read Mr Hawkins talk about. As someone said, the length of a sequence is not determined, which means that the same pattern can be recognized by just one region, or by two or more regions in the hierarchy working together.
For example, the sequence ABCDEFGHI can be recognized by one region, or it can be recognized as a sequence of sequences: ABC-DEF-GHI.
The lowest region recognizes the separate sequences and represents them as stable union SDRs as they come along. First ABC, then DEF, then GHI. Higher up, the sequence of union SDRs is recognized by the next region: union(ABC), union(DEF), union(GHI).
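As a small sketch of what union(ABC), union(DEF), union(GHI) could look like, assuming each letter has a random sparse SDR and that the chunk boundaries are simply given by hand (the SDR size, sparsity, and helper names are all made up for illustration):

```python
import random

SDR_SIZE = 2048    # assumed width of the representation
ACTIVE_BITS = 40   # assumed number of active bits per letter

rng = random.Random(0)  # fixed seed so the sketch is repeatable
letters = {
    ch: frozenset(rng.sample(range(SDR_SIZE), ACTIVE_BITS))
    for ch in "ABCDEFGHI"
}


def union(chunk):
    """Union SDR of a chunk: every bit that is active for any of its letters."""
    bits = set()
    for ch in chunk:
        bits |= letters[ch]
    return frozenset(bits)


# Chunk boundaries are hard-coded here; a real region would have to learn them.
chunks = ["ABC", "DEF", "GHI"]
higher_input = [union(chunk) for chunk in chunks]

# The lower region sees 9 inputs; the higher region sees 3.
print(len("ABCDEFGHI"), len(higher_input))
# Each union stays fairly sparse: at most 3 * 40 of 2048 bits.
print([len(sdr) for sdr in higher_input])
```

The higher region then only has to learn a sequence of three stable inputs instead of nine changing ones, and each letter can still be checked against its chunk by testing whether its bits are contained in the union.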
However, as the lowest region gets exposed to ABCDEFGHI many times, the whole sequence becomes recognizable to that region and represented in a union SDR. So its recognition has been “pushed down the hierarchy”.
To continue with the example of language, it is likely that often-used phrases consisting of several words get pushed down this way. Words that are used less often, and in a variety of contexts, are likely to have a sequence the length of the word, and not longer. Words that we have never heard before, or that we are hearing from someone with a strange accent or something, might be recognized as sequences of syllables, syllables which have their own union SDRs.
So it might go for learning how to read, starting with recognizing individual letters, moving on to words and more.
If that is true, the “correct” point of transition does not really exist. Pushing recognition down is probably a matter of quicker, deeper understanding. The lower in the hierarchy a concept is recognized, the more room there is left for more abstract understanding, but at this point I’m speculating.
As I understand it, a hierarchy would sort of push sequences down. However, there might be a fixed maximum length for the temporal pooler because, while tracking the longest sequences possible would be useful, it’s also useful to relegate some regions to tracking shorter sequences so that the system can use more specific locations in the sequence to learn new trends or generate motor outputs. You wouldn’t want it to use the same columns for hours in the lowest region.
One thing to keep in mind is that it probably splits up the input stream into sequences differently for different cells. One cell in a region might respond to “regi” while another might respond to “gion”. This is still a more stable representation than dividing the sequence into single letters (“r”, “e”, “g”, “i” etc.), but some cells turn on or off every time it gets a new input.
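A toy way to see that this is more stable than the raw input: assume, purely for illustration, that a cell responding to a chunk stays on for every letter inside that chunk, and count how often the set of active cells changes compared with how often the input letter changes. The helper names are hypothetical.

```python
def active_chunk_cells(text, chunks):
    """For each position in text, the chunk cells whose span covers it."""
    steps = []
    for i in range(len(text)):
        on = set()
        for chunk in chunks:
            start = text.find(chunk)
            if start != -1 and start <= i < start + len(chunk):
                on.add(chunk)
        steps.append(on)
    return steps


def count_changes(steps):
    """Number of consecutive steps whose representation differs."""
    return sum(1 for a, b in zip(steps, steps[1:]) if a != b)


text = "region"
cell_steps = active_chunk_cells(text, ["regi", "gion"])
letter_steps = [{ch} for ch in text]

print(cell_steps)
print(count_changes(letter_steps), count_changes(cell_steps))
```

The letter-level representation changes at every step, while the chunk-level one changes only when a cell turns on or off, and the two cells do so at different times.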