After rereading your question, I think I didn’t entirely answer it. Temporal memory describes how different contexts for the same input are represented (i.e. columns represent the input, and cells within the columns represent the context).
I think you are also asking how the context gets established. Say I had learned “I ate a pear” and “I have eight pears”. If I hear a word sounding like “ate/eight” without any context, the columns representing the input “ate/eight” burst (all cells in those columns activate), and cells in the columns for the possible next inputs, “a” and “pears”, become predictive. If the next input is “a”, then we have locked in on a specific context (“I ate a pear”). In other words, the more elements of the sequence that come in, the more certain the system becomes about the context.
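To make that concrete, here is a toy sketch in Python. It is not NuPIC code and not the real temporal memory algorithm: there are no SDRs, segments, or synapse permanences. Each input symbol stands in for a set of columns, each learned occurrence of a symbol gets its own cell, and the class name `ToySequenceMemory`, its methods, and the symbol `"eyt"` (standing for the sound shared by “ate” and “eight”) are all made up for illustration.

```python
from collections import defaultdict


class ToySequenceMemory:
    """Hypothetical, heavily simplified stand-in for temporal memory."""

    def __init__(self):
        self.cells = defaultdict(list)       # symbol ("column") -> its cells
        self.next_cells = defaultdict(set)   # cell -> cells it predicts
        self.active = set()
        self.predicted = set()

    def learn(self, sequence):
        # Simplification: allocate one new cell per element, so the same
        # symbol learned in two sequences ends up with two distinct cells.
        prev = None
        for symbol in sequence:
            cell = (symbol, len(self.cells[symbol]))
            self.cells[symbol].append(cell)
            if prev is not None:
                self.next_cells[prev].add(cell)
            prev = cell

    def reset(self):
        # Forget all context, as if hearing a word out of the blue.
        self.active = set()
        self.predicted = set()

    def step(self, symbol):
        column = set(self.cells[symbol])
        matched = column & self.predicted
        burst = not matched
        # No predicted cell in the column: burst (activate every cell).
        # Otherwise only the predicted cells become active.
        self.active = column if burst else matched
        self.predicted = set()
        for cell in self.active:
            self.predicted |= self.next_cells[cell]
        next_symbols = sorted({s for s, _ in self.predicted})
        return burst, next_symbols


tm = ToySequenceMemory()
tm.learn(["I", "eyt", "a", "pear"])        # "I ate a pear"
tm.learn(["I", "have", "eyt", "pears"])    # "I have eight pears"

tm.reset()
print(tm.step("eyt"))   # bursts; both "a" and "pears" are predicted
print(tm.step("a"))     # no burst; context is now "I ate a pear"
```

After the burst on `"eyt"`, both continuations are predicted at once. The following `"a"` matches one of those predictions, so only that cell becomes active and the only remaining prediction is `"pear"`.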
Presumably, regions higher in the hierarchy would also send a biasing signal back to the lower regions, so that higher-level contexts (as well as output from other parallel regions) can bias the next predicted input and the context can be established more quickly.
The same effect can also be accomplished by another layer in the same region, one which receives proximal input from the other layers, forms long-distance distal connections within its own layer, and sends apical feedback to those other layers (Jeff describes this in more detail in the recent HTM Chat, where he covered the current theories on sensory-motor integration).