Constraint satisfaction shows up in many situations. For instance, when you hear the word “can” in a sentence, it could be a verb or a noun, and the surrounding context disambiguates it. I was curious how HTM handles this: does it just take a union of all possibilities and winnow them down as new information comes in, or is there an actual contest going on, as in other models where competing high-level concepts inhibit each other?
The BAMI documentation on Temporal Memory explains this really well:
Consider hearing two spoken sentences, “I ate a pear” and “I have eight pears”. The words “ate” and “eight” are homophones; they sound identical. We can be certain that at some point in the brain there are neurons that respond identically to the spoken words “ate” and “eight”. After all, identical sounds are entering the ear. However, we also can be certain that at another point in the brain the neurons that respond to this input are different in different contexts. The representations for the sound “ate” will be different when you hear “I ate” vs. “I have eight”. Imagine that you have memorized the two sentences “I ate a pear” and “I have eight pears”. Hearing “I ate…” leads to a different prediction than “I have eight…”. There must be different internal representations after hearing “I ate” and “I have eight”.
Encoding an input differently in different contexts is a universal feature of perception and action and is one of the most important functions of an HTM layer. It is hard to overemphasize the importance of this capability.
Each column in an HTM layer consists of multiple cells. All cells in a column get the same feed-forward input. Each cell in a column can be active or not active. By selecting different active cells in each active column, we can represent the exact same input differently in different contexts. For example, say every column has 4 cells and the representation of every input consists of 100 active columns. If only one cell per column is active at a time, we have 4^100 ways of representing the exact same input. The same input will always result in the same 100 columns being active, but in different contexts different cells in those columns will be active. Now we can represent the same input in a very large number of contexts, but how unique will those different representations be? Nearly all randomly chosen pairs of the 4^100 possible patterns will overlap by about 25 cells. Thus two representations of a particular input in different contexts will have about 25 cells in common and 75 cells that are different, making them easily distinguishable.
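The overlap figure above is easy to sanity-check with a quick simulation (plain Python, not HTM code; the constants mirror the example: 100 columns, 4 cells per column, one active cell per column per context):

```python
import random

CELLS_PER_COLUMN = 4
NUM_COLUMNS = 100

def random_context():
    # One hypothetical context: pick one active cell in each column.
    return [random.randrange(CELLS_PER_COLUMN) for _ in range(NUM_COLUMNS)]

def overlap(a, b):
    # Count the columns in which both contexts activated the same cell.
    return sum(1 for x, y in zip(a, b) if x == y)

random.seed(0)
trials = 10_000
avg = sum(overlap(random_context(), random_context()) for _ in range(trials)) / trials
print(avg)  # each column matches with probability 1/4, so this lands near 25
```

Since each column independently matches with probability 1/4, the expected overlap is 100 × 1/4 = 25 cells, in agreement with the text.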
After rereading your question, I think I didn’t entirely answer it. Temporal memory describes how different contexts for the same input are represented (i.e. columns represent the input, and cells within the columns represent the context).
I think you are also asking how the context gets established. Say I had learned “I ate a pear” and “I have eight pears”. If I hear a word sounding like “ate/eight” without any context, the columns representing the input “ate/eight” burst (all cells in those columns activate), and the next possible inputs “a” and “pears” both become predictive. If the next input is “a”, then we have locked in on a specific context (“I ate a pear”). In other words, as more elements of the sequence come in, the system becomes more certain about the context.
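That lock-in step can be sketched in a few lines. This is a toy illustration, not the NuPIC implementation; the cell assignments and names are made up for the example:

```python
# Each learned transition maps (word, context cell) to the predicted next word.
learned = {
    ("ate/eight", 0): "a",      # cell 0 = the "I ate ..." context (assumed)
    ("ate/eight", 1): "pears",  # cell 1 = the "I have eight ..." context (assumed)
}

def hear_without_context(word):
    # No prior context: the word's columns burst (all cells active), so every
    # learned continuation becomes predictive at once.
    return sorted({nxt for (w, _cell), nxt in learned.items() if w == word})

def lock_in(word, observed_next):
    # The actual next word selects the matching context cell.
    for (w, cell), nxt in learned.items():
        if w == word and nxt == observed_next:
            return cell
    return None

print(hear_without_context("ate/eight"))  # ['a', 'pears']: both predicted
print(lock_in("ate/eight", "a"))          # 0: the "I ate a pear" context
```

Hearing the ambiguous word alone predicts both continuations; the very next input collapses the ambiguity to one context cell.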
Presumably regions higher in a hierarchy would also provide a biasing signal back to the lower regions, allowing higher-level contexts (as well as output from other parallel regions) to bias the next predicted input, allowing the context to be established more quickly.
This same effect can also be accomplished by another layer in the same region, one which receives proximal input from other layers, establishes long-distance distal connections within its own layer, and provides apical feedback to those other layers (this is described in more detail by Jeff in the recent HTM Chat where he covered the current theories on sensorimotor integration).
From what you say, I get this picture. (Instead of 100 columns for a particular input, I’ll simplify for this post to 3 columns.) Let’s say the triplet of columns represents the sound you hear when a person says “can”. Let’s suppose the pattern for ‘can’ (as in a can of sardines) is 4-3-4 and the pattern for ‘can’ (as in ‘Can I be excused?’) is 1-1-2. So initially, the ambiguity is indicated by all of these cells being on, along with the other cells in the same columns.
All 4 neurons in all 3 columns are firing. I’ll go on with the example, but I should mention that it seems counter-intuitive to me that ambiguity is represented this way. The reason is this:
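To make the simplified example concrete, here is a small sketch (toy code, not HTM; “4-3-4” and “1-1-2” are the hypothetical cell assignments from above, with cells numbered 1 through 4):

```python
CELLS_PER_COLUMN = 4
noun_can = (4, 3, 4)  # 'can' as in a can of sardines (assumed assignment)
verb_can = (1, 1, 2)  # 'can' as in "Can I be excused?" (assumed assignment)

def burst():
    # Ambiguous input: every cell in each of the 3 active columns fires.
    return [set(range(1, CELLS_PER_COLUMN + 1)) for _ in range(3)]

def in_context(pattern):
    # Disambiguated input: exactly one cell per column fires.
    return [{cell} for cell in pattern]

print(burst())               # [{1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 3, 4}]
print(in_context(noun_can))  # [{4}, {3}, {4}]
```

The burst state covers both learned patterns (and any not yet learned), which is exactly the “all neurons on” ambiguity being questioned here.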
Suppose a sentence occurs to me such as “I walked to the store”. Now obviously I’ve been to many stores in many cities and villages and countries in my life. And yet, when I hear this sentence, I don’t have a jumble of all these stores in my head. In fact, the picture I get is of me walking from current home to the only store in the neighborhood, which is a deli. (Actually the deli has closed, but you get the idea). So I have a kind of default picture of myself walking to this deli, a picture which is susceptible to revision.
HTM seems not to represent ambiguity with a tentative default scenario; instead it represents ambiguity by activating every possible scenario and then lets priming select one of them.
So that bothers me a little.
But apart from that, does HTM theory (and I admit I really should watch Matt’s video series before asking for details) say that signals from other areas or other senses come into the layer to help disambiguate, suppressing all the wrong answers and enhancing the one cell in each column that is correct? And suppose you have a gradual narrowing down: you know that pattern 1-1-1 is wrong, but you can’t tell whether 1-2-4 or 1-2-3 is correct. Do the columns oscillate between the two? Or if an answer seems correct and then has to be revised, do you get an interim period where all cells in all three columns are active?
It may seem that I’m just nitpicking on details, but sometimes such details really clarify how something works.
In a scenario where an input enters the system without any context, the current implementation will bias all possible next inputs (those connected above some threshold). To me, this makes sense, because you really don’t know what input will come next when there is no context. When you add a hierarchy into the picture (or a layer with long-distance distal connections) that is able to provide feedback from other points of input, the context can be established much more quickly (and the scenario of “an input without any context” becomes less frequent).
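The “connected above some threshold” part can be sketched like this. It is a minimal illustration with assumed structures and names, not the actual NuPIC implementation: a cell becomes predictive when one of its distal segments sees enough currently active cells.

```python
ACTIVATION_THRESHOLD = 2  # hypothetical value

# cell -> list of distal segments; each segment is a set of presynaptic cells
segments = {
    "ate_context_cell":   [{"I_cell"}],
    "eight_context_cell": [{"I_cell", "have_cell"}],
}

def predictive_cells(active_cells):
    # A cell is predictive if any of its segments has at least
    # ACTIVATION_THRESHOLD active presynaptic cells.
    predictive = set()
    for cell, segs in segments.items():
        if any(len(seg & active_cells) >= ACTIVATION_THRESHOLD for seg in segs):
            predictive.add(cell)
    return predictive

# With no context at all, nothing crosses the threshold:
print(predictive_cells(set()))                    # set()
# After hearing "I have", only the "eight" context is predicted:
print(predictive_cells({"I_cell", "have_cell"}))  # {'eight_context_cell'}
```

With no active context the system predicts nothing specific (and the next input will burst its columns); with partial context, only the sufficiently connected continuations become predictive.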
I think the important point here is that we are describing an input without any context (or one that was completely unexpected). In the example of a sentence like “I walked to the store”, it is unlikely that the sentence itself will occur without any surrounding context. So when the word “store” enters the system, the context has been narrowed down not only by the words “I walked to the…” but also by other inputs that happened before the sentence, as well as by other parallel regions via a hierarchy. For example, presumably you actually walked to the deli before you returned and someone asked you where you went. That context, too, biases which representation of “store” gets activated.
This is of course a rather more sophisticated application than you would be able to do with current implementations, but hopefully you get the basic idea. Higher-order contexts bias which representations are activated as inputs come in (i.e. it isn’t just the inputs immediately before a particular input that define its context).