Temporal pooling and generalization

Normally we think of objects as part of a hierarchy. A chair is furniture. A wolf is a carnivore. An ostrich is a bird.
Does HTM theory have anything to say about this?
Does generalization mean that two input SDRs that are different but similar have an identical representation in the brain somewhere?
In HTM, the spatial pooler will represent similar input-SDRs with a similar, but not identical, set of columns.
The temporal pooler maintains a representation over time, as different features are observed.
It does generalize in a way, because:

  1. If it has learned two similar objects, and it starts out observing features that the two objects have in common, the active cells in the higher layer will represent a union of the features of the two objects. Then, once a differentiating feature is observed, the representation gets more specific (and sparser). So it goes from general to specific.
  2. It learns an allocentric representation, which I think means it is invariant to orientation, so it generalizes over orientation.
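
The general-to-specific narrowing in point 1 can be illustrated with a toy sketch. This is not Numenta's implementation: the objects, the (location, feature) pairs, the cell numbers, and the function names below are all made up for illustration, and plain Python sets stand in for SDRs.

```python
# Hypothetical toy objects: each object is a set of (location, feature) pairs.
OBJECTS = {
    "cube": {("corner", "sharp"), ("face", "flat"), ("edge", "straight")},
    "pyramid": {("corner", "sharp"), ("face", "flat"), ("apex", "pointed")},
}

# Hypothetical, arbitrary output-layer cells assigned to each object.
OBJECT_CELLS = {"cube": {1, 5, 9, 12}, "pyramid": {2, 5, 7, 14}}


def candidates_after(observations, objects=OBJECTS):
    """Return the names of objects consistent with every observation so far."""
    return {
        name
        for name, pairs in objects.items()
        if all(obs in pairs for obs in observations)
    }


def output_representation(observations, object_cells=OBJECT_CELLS):
    """Union of the output-layer cells of all still-consistent objects."""
    active = set()
    for name in candidates_after(observations):
        active |= object_cells[name]
    return active


# A shared feature leaves both objects possible, so the union is active.
general = output_representation([("corner", "sharp")])
# A differentiating feature removes the cube, so the representation gets sparser.
specific = output_representation([("corner", "sharp"), ("apex", "pointed")])
```

Here `general` contains the cells of both objects, while `specific` contains only the pyramid's cells, which is the "general to specific" behavior described above.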

But is the temporal pooler invariant to size? Or if it were a visual representation, would it be invariant to the angle of light that strikes the object? If it were observing letters in different fonts, I would think it would not generalize.

A hierarchy of layers that left out some information coming from below might have more general concepts near the top. But what information would be left out?
I doubt there are answers at this point, but any ideas?


The way I understand it, the vision is that patterns that often occur together in time get represented similarly. So you can get scale invariance by the fact that objects tend to get larger and smaller smoothly as we move forward and backward. You can get pose invariance by the fact that the subject will move smoothly between poses through time. You can get translation and rotation invariance by the same process. I haven’t seen a convincing implementation of this yet, but I think that’s the core idea.


This is an interesting question. I assume you mean that the lighting changes the coloring and contrast of the object so that it does not lead to the same or a similar visual pattern. If it is different enough it might not get generalized; we also get confused at times when we look at objects from weird angles with respect to light sources. On the other hand, the lighting effects may be represented as a context from “another” region. This could come from place cells in the hippocampus, as they encode very subtle information about the place. For example, combinations of smell properties are good cues for a place, which may work as the context for the objects that you are observing in your childhood kitchen or your home.

In short, lighting might be context (apical/distal input) for the object representation, if we assume place cells of the hippocampus encode environments including lighting/pattern properties. You would not expect the same coloring or even edge structures when you observe an object inside a cave versus on a beach. The context would transform the expected representation.

Thanks for the interesting discussion. I think of the object under the light as an allocentric representation. The angle of light will still create sensations associated with similar locations in space, which fills in the details of the object under scrutiny. Under some light angles or even spectrums, more information about the object is available. It doesn’t change the agent’s representation of the object, only adds to it.


Depending on the angle of a light on an object, the shadows that are cast are different, the reflections are different, the colors are different. For instance, if you look at any webpage showing examples of computer graphics and lighting (here is one: https://www.cs.uic.edu/~jbell/CourseNotes/ComputerGraphics/LightingAndShading.html) you see that there are issues. But if you already know what the object is, perhaps by examining edges, or seeing contiguous areas no matter what the lighting, then the lighting may not cause a problem. Instead, it might be a source of additional information, as Matt says.
It is safe to say that the brain has to account for many sources of variability. When you look at an object, are you moving? How is it oriented? How big is it? How prototypical is it? What is the color of the light on it?
Temporal pooling finds a constant representation (in L2/3) for the changing patterns in L4. A question is what happens when there is some odd influence that has to be taken into account. Perhaps, as Sunguralikaan says, the “taking into account” comes from a higher level.

I think you have a fair point @rhyolight

Not exactly sure on this but doesn’t allocentric location change the expected representation of the object? The union representation of the object expands as it is observed in different conditions and by different movements as you say. But in a given context, the expected representation of the object would be determined by the allocentric location. This is what I meant by transformation which was the expression @jhawkins used before allocentric location was the norm if I am not mistaken.

Assuming it only adds to it and does not transform it, I have a question:
When we encounter a new lighting condition, say some weird light effect caused by engineered devices or even being underwater, do I have to observe every object that I know of to be able to predict how their color patterns will change? Or do I observe a couple of them and then “transform” other objects’ color patterns based on that?

I do not have an exact answer but I am inclined to believe that the latter happens.

I believe what you are describing here is a very important mechanism, but primarily for separating an object from a scene and tracking it. Sure, the resulting patterns are useful for generalization, but it’s not the core part of it.
Generalization is, first of all, the ability to find some semantic similarity between an absolutely new pattern and learned classes. The described mechanism can’t solve that by itself.

BTW, as I remember, you are working on some kind of computer vision task using HTM, so you have to tackle the generalization problem somehow. Could you share any related insights from your own experience?

I’m pretty sure generalization is a cornerstone of any AI theory. To me, the current discussion is moving in a slightly wrong direction. Forget about different lighting conditions and other sophisticated aspects. Find a solution (theoretical at least) for simple black-and-white figures first. Everything else is implementation details.

I think to understand generalization, we need to look past the ideas of the spatial pooler and temporal memory algorithms as “spatial” and “temporal” processing components, and start viewing the layer as a generic processing unit with proximal, distal, and apical inputs.

Now we can talk about “feedforward” input coming across the proximal signal. This means input generally moving away from the senses. Distal input to a layer is a contextual signal. In the case of SMI, it might contain the location on an object associated with a sensory sensation coming feedforward. (Apical input might be used to play back memories without getting proximal stimulus at all, but that is another topic.) The layer processing unit does not know or care where these inputs originate.

In our SMI circuit example from the Columns Paper, the “input layer” learns how the sensor input associated with it typically moves across objects and predicts what the sensor will feel next based not upon the object being sensed, but the way in which the sensor interacts with the world. The “output layer” is doing temporal pooling and classification of objects that sensor column has felt in the past, and constantly associating the current stream of location/sensation data coming from the input layer with what objects fit that stream.

This is a spatial generalization happening over time as an agent interactively senses an object. This generalization is achieved by using the union property of SDRs to narrow down the object representations in the output layer as new sensations/locations are received from the input layer.

The “output layer” is temporal because its distal input comes from other neurons in the same layer (either within its own cortical column, or from neighboring columns). Remember, if a layer’s context is other neurons in the same layer (or the same layer in neighboring columns), that makes the context temporal. In the case of our SMI circuit, some of the distal connections are to neurons within the same layer, and others are from neurons in the same layer in neighboring columns. You do not need this cross-column communication to do object classification, but it happens in the brain and it helps classify objects with fewer sensations.

Unfortunately, unions can’t help recognize the same pattern when it has somehow changed (for example, rotated, in the case of visual recognition) and so presents a new perspective to the model.
The initial idea of @jhawkins was that hierarchy plays the key role in generalization, and it looks very convincing. However, Numenta doesn’t use hierarchy at all at the moment (even though it perfectly well could). It puzzles me, because I don’t see any practical use for an AI theory without the core ability to generalize patterns.

It is true, I’m not sure how orientation or object rotation is represented at this point, but I think that’s an implementation detail. We remember objects in the form of allocentric representations. You can imagine a coffee cup in any orientation or scale. I’m not sure what the current thinking is on the research team about orientation, but for me it doesn’t affect the idea of how temporal pooling provides object generalization.

Are you talking about generalization in TM for sequence patterns (classic version) or using a similar structure for 3D-object recognition? If the latter, it works without generalization. Even in the animation for the last paper, for example, if you just rotate the mug around its vertical axis, you’ll run into the problem.
Unfortunately, it’s not just implementation details, it’s the core issue. I’m eager to hear about an update from Numenta which is going to solve it.

The important concept for object recognition is that allocentric locations must be independent of orientation. In the case of rotating the mug, the allocentric locations of its features should not change, but their egocentric locations should. I think the answer to the problem lies in understanding the mechanism for conversion/feedback between egocentric and allocentric representations (from what I’ve seen so far, this has not been worked out yet). Another related question that puzzles me is how the allocentric “axes” are established, or whether they even exist, i.e. what is the frame of reference for allocentric locations? Perhaps feature locations are somehow defined in reference to each other.
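
To make the egocentric/allocentric distinction concrete, here is a minimal geometric sketch. It only illustrates the coordinate-frame idea, not any mechanism in HTM or the cortex: the mug features, the pose parameters, and the function names are hypothetical, and the object frame is simply assumed to be given.

```python
import math


def rotate_z(point, angle_rad):
    """Rotate a 3D point about the vertical (z) axis."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)


def to_egocentric(allocentric_point, object_angle, object_position):
    """Map a point from the object's own frame into the observer's frame.

    Assumes the object is rotated about z by object_angle and then
    translated by object_position (both are hypothetical pose parameters).
    """
    rx, ry, rz = rotate_z(allocentric_point, object_angle)
    px, py, pz = object_position
    return (rx + px, ry + py, rz + pz)


# Hypothetical mug features, stored once in the mug's own reference frame.
MUG_FEATURES = {"handle": (1.0, 0.0, 0.5), "rim": (0.0, 0.0, 1.0)}


def egocentric_view(angle, position=(2.0, 0.0, 0.0)):
    """Egocentric location of every mug feature for a given mug rotation."""
    return {
        name: to_egocentric(point, angle, position)
        for name, point in MUG_FEATURES.items()
    }


upright = egocentric_view(0.0)
turned = egocentric_view(math.pi / 2)  # mug rotated a quarter turn about z
```

The stored allocentric locations in `MUG_FEATURES` are identical in both cases. Only the egocentric location of the handle differs between `upright` and `turned`, while the rim, which lies on the rotation axis, stays where it was.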

Showing where the top of an object is helps humans recognize it, so, yes, there is a kind of spatial map associated with the object itself, which is important for recognition.
Basically, supporting such a map, making it invariant, and using hierarchy to scale in terms of complexity are apparently the required components for generalization.

@rhyolight: I was thinking about generalization, and came up with the following tweak to the temporal pooling algorithm.

Currently for every new object in the output layer, the algorithm requires that you select an arbitrary subset of cells to be activated. This subset stays activated during the learning of all the feature/location pairs.

In my proposed change you would still do that, but you would also activate a few cells in the output layer based on existing connections between previously learned objects and the features in the input layer.

For example, suppose object #1 and object #2 share a feature “f1”. Suppose the system has already learned object #1, so a few cells in the output layer have learned patterns on their proximal dendrites that overlap with f1.

Now you present object #2.
Instead of being completely arbitrary in the cells you select, you would be partially arbitrary. You would select N cells arbitrarily, but you would also allow activation of (some) cells that were activated by feature f1 in object #1.

The learning process that you already have, if I understand it correctly, increases interconnections between the arbitrary cells in the output layer that were selected for a particular object. It could also strengthen connections between them and a few of the cells that respond to f1.
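
A rough sketch of the "partially arbitrary" selection described above might look like the following. It is only an illustration of the proposal, not the existing temporal pooling code: the function name, the `feature_to_cells` mapping (standing in for learned proximal connections), and all sizes are assumptions.

```python
import random


def choose_object_cells(new_features, feature_to_cells, num_cells,
                        n_random, n_shared, rng):
    """Pick output-layer cells for a new object.

    n_random cells are chosen arbitrarily; up to n_shared more are taken
    from cells that already respond to the new object's features.
    """
    # Cells that earlier objects have linked to any feature of the new object.
    responsive = set()
    for feature in new_features:
        responsive |= feature_to_cells.get(feature, set())

    shared = set(rng.sample(sorted(responsive), min(n_shared, len(responsive))))
    # Draw the arbitrary part from the cells not already picked as shared.
    remaining = [cell for cell in range(num_cells) if cell not in shared]
    arbitrary = set(rng.sample(remaining, n_random))
    return shared | arbitrary


rng = random.Random(0)
NUM_CELLS = 200

# Object #1 was learned with a fully arbitrary set of cells.
object1_cells = set(rng.sample(range(NUM_CELLS), 10))
# Hypothetical learned connections: f1 drives four of object #1's cells.
feature_to_cells = {"f1": set(sorted(object1_cells)[:4])}

# Object #2 shares f1, so three of its cells are reused from f1's cells.
object2_cells = choose_object_cells(
    {"f1", "f2"}, feature_to_cells, NUM_CELLS,
    n_random=8, n_shared=3, rng=rng,
)
shared_with_object1 = object1_cells & object2_cells
```

With this choice, the two object representations are guaranteed to overlap in at least the reused cells, so objects that share features also share some active output cells. Whether that overlap is useful, or whether it just makes the two objects harder to tell apart, is exactly the open question of the proposal.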

I’m not sure this would work, but I can give a reason to try it out:

There are different types of generalization you could aim for:

  1. generalization over rotation
  2. generalization over translation
  3. generalization over size differences
  4. generalization over similarity between objects.

The above idea would not help with 1 thru 3, but it might help with #4.

This assumes that generalization has something to do with similar concepts having more active cells in common than dissimilar concepts.

What do you (or other thread viewers) think?

HTM already does this pretty well. It initializes the random connections to the elements of the encoder only when the model is created, and this set is different for every cell of the SP. As a result, some cells are responsible for particular features.
The problem is that, to be recognized, the same features have to be at the same position and have the same size.
So, for generalization, your first three cases need to be solved (the fourth is mostly solved).

@spin - I don’t understand your explanation. Suppose the HTM pooler learns a cube. After that cube is thoroughly learned, we present it with a pyramid. We present feature #1 at location #1, which might be the bottom left front corner. This activates a particular pattern in the first layer, which is reproducible. Certain cells (or at least columns) in layer #1 will always correspond to that feature. But the second layer is where I think the problem is. Just because feature #1 is present in both a cube and a pyramid (for example) doesn’t mean that any similarity shows up in layer #2 between the two objects.

On a related topic, I once read about rotation that if a person is asked whether two objects are the same, and one object is a rotated version of the other, he does a simulated rotation (in his mind) to match them. The more one object is rotated vs the other, the longer it takes for the person to decide whether they are the same.

Why? You can think of the output of the first-layer TM as an encoder for the second layer. Let’s suppose the encoder receives the same input for the cube at every step. You’ll have a stable set of columns that are sensitive to some features of the encoder, and you can use it as the input for the second layer.
It’s a bit weird to discuss this without talking about sequences, because that is what TM is focused on, but that’s what we have for your thought experiment.

That’s true, we use mental transformation to compare unfamiliar structures or to find subtle differences in transformed familiar ones. Nevertheless, we can recognize learned spatial patterns without extra mental manipulation. For instance, we can read upside down. Even though we do it more slowly, our mind doesn’t have time to rotate each letter while doing it.

I’m thinking of the temporal pooler, not the temporal memory.
In the article that describes it (if I understand it correctly), the first-layer neurons have distal connections to a location vector. They also have proximal connections to the feature vector.
So over time, as (let’s say) a robot feels the different parts of an object, the first layer creates its own encoding for each location/feature pair.
But this is where the theory ran into a problem. It takes several time-steps to feel all the locations in the object. The solution was to create in the higher layer a totally arbitrary pattern, which stands for the object, and all the changing activity in layer #1 is then associated with this stable representation in layer #2.
The advantage of this is that a particular object SDR in layer #2 does not fade away from memory - it is there all the time, as features are being encountered, and when they are encountered, connections between the corresponding SDR in layer #1 are made to this stable representation in layer 2.
The disadvantage of this is that the SDR in layer #2 was chosen randomly. (If I understand the article correctly). It is as if you are making up a symbol in layer #2 to stand for the set of location/feature pairs of an object in layer #1.
To take an example, the word for “horse” in French is “cheval”, and though French people associate all sorts of features of horses with “cheval”, the word “cheval” is an arbitrary collection of sounds, and doesn’t sound remotely like "horse."
My idea is that the SDR in layer #2 should not be totally random (or arbitrary), maybe just partly random. You would get an initial stab at an SDR there by presenting the features of a new object, given that some connections between layer #1 and layer #2 have already been learned for previous objects, and see what cells in layer #2 tend to turn on. You would use some of those cells in the new SDR that you create for layer 2. That means that the new SDR is not completely arbitrary.

On the other hand, if you look at the temporal layer-pair as a whole, you could say that similar objects will be similar in layer #1, and there is no need for them to be similar in the upper layer. In that case yes, there is a similarity between similar objects. If you are looking only at layer #2, there is not.

Ok, but I don’t think this is biologically correct. The neurons in your brain that represent a horse are not the same as those in my brain, because we both have had completely different learning experiences involving horses. Taking it one step further, each relevant cortical column has a representation of “horse” that is entirely different from every other column’s representation. They still share their learning about “horse” with their distal connections even though they represent objects differently.
