How is HTM supposed to deal with spatial invariance?

The main difference in working with spatially meaningful data (like images) is that a feature needs to be recognized wherever it is in the input. An eagle should be recognized as an eagle anywhere in the image. Conventional NNs deal with this using convolution (weight sharing). So how is HTM supposed to deal with that? As I understand it, the HTM spatial pooler can also be implemented convolutionally, but then it would require a ridiculously massive amount of memory. So how is it going to get done? Or how is it done in the brain? Is that up to the sensory-motor theory, like by iterating over the image?


People have a retina, and they have saccadic motion of the eye to center it on features of interest. We don’t attempt to recognize everything anywhere in the spatial field. Why not try something similar, where a control mechanism drives the “eye” to “look” at features of interest, putting them essentially into a canonical pose? Then you don’t have to try to recognize a feature anywhere.

Some low-level features such as simple textures can probably be handled in parallel, but for anything complex, just LOOK AT IT. In fact, the distance the eye saccades between different salient features probably forms a very useful functional description of objects (e.g., I had to move my eye 0.3 degrees to the right to get from one corner to the next, which describes the width of the object in a functionally useful manner).

Humans constantly move their eyes. It is almost impossible to “see” effectively without doing so.


How can saccades help to recognize the same object if it’s scaled/tilted/rotated, etc.?

We are able to visually recognize the same object in different poses the same way we recognize objects by touch. At the end of each saccade is a glance. Each glance is like a touch, but with thousands of sensors simultaneously (rods and cones on the retina). There is spatial information encoded in the distribution of light on the retina and certain features (low-level patterns) are recognized from this information. But it is the temporal sequence of recognized spatial patterns due to successive glances (as well as the sensorimotor information gleaned from the saccadic motion itself) that forms the invariant representation of the object in our minds.

To use the example of an eagle given in the previous post: We notice an object in the sky above us and glance in that direction. Within a couple of glances, we have determined its basic size and coloration, and then we notice that what we recognize as the head feature is bright white. Within a couple more glances we have confirmed that the head feature is consistent with the many images and sightings of bald eagles that we have been exposed to in our lifetime. Thus, the classification as a bald eagle is made. This same process is at play whether the eagle is in flight, in its nest, or is being rendered stylistically on any number of government seals.

I believe that our brains have a natural tendency to rapidly drive saccades towards the most distinguishing features of an object because it wants to reduce uncertainty as quickly as possible. That categorization must happen almost immediately in order for us to recognize and respond to imminent danger in our immediate environment or to potentially advantageous opportunities.


There is a problem with this vision: we can understand a general scene and recognize key objects in less time than a single saccade takes. So, even though saccades play an important role in detailed vision, they are not the key mechanism of invariant perception.

Our researchers call this “flash inference”, and we know that HTM must support this. It is an active topic of conversation.


I’m intrigued by this. I know how it’s supposed to be done with conventional NNs, but is there any documentation addressing this, or prototype code to look at? Just out of curiosity.

The basic idea is a generalization of the sequence memory, where the depolarized (predictive) neurons encode the location instead of the state (so active cells are features at locations). It’s not clear how that location signal is generated, but it could be the state (history) of the motor commands projected from a neighboring region. There should also be a mechanism for encoding the recent feature-locations in a way where ordering doesn’t matter (temporal pooling), so that the set of those feature-locations can then represent an object. The biological mechanism for that is not clear either, but we can think of some (non-biological) ideas to do it.
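
To make the "ordering doesn’t matter" part concrete, here is a minimal non-biological sketch, not HTM code. It assumes an object can be stored as a plain set of (feature, location) pairs; the object names, feature names, grid coordinates, and the helper functions `pool`, `match_score`, and `classify` are all made up for illustration.

```python
# Hypothetical sketch: an object as an unordered set of feature-at-location pairs.

def pool(sensations):
    """Union pooling: keep what was sensed and where, forget the order."""
    return frozenset(sensations)

def match_score(pooled, model):
    """Fraction of a stored object's feature-locations that have been sensed."""
    return len(pooled & model) / len(model)

# Toy object models (all names and coordinates are invented).
MODELS = {
    "mug": frozenset({("rim", (0, 2)), ("handle", (1, 1)), ("base", (0, 0))}),
    "bowl": frozenset({("rim", (0, 2)), ("curve", (1, 1)), ("base", (0, 0))}),
}

def classify(sensations):
    """Return the stored object that best explains the pooled sensations."""
    pooled = pool(sensations)
    return max(MODELS, key=lambda name: match_score(pooled, MODELS[name]))

# Two different exploration orders over the same object.
path_a = [("rim", (0, 2)), ("handle", (1, 1)), ("base", (0, 0))]
path_b = list(reversed(path_a))

print(pool(path_a) == pool(path_b))       # same pooled representation
print(classify(path_a), classify(path_b))
```

A set throws away order completely, which is the crudest possible pooling; it says nothing about how neurons would do it.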

Does that mean that HTM would only be useful with videos (as they are a sequence of images) as opposed to highly varying pictures?
That would make sense from a biological standpoint, but it would be quite hard to do classification and object detection tasks with the spatial pooler that is currently implemented (since objects have highly varying sizes and locations across different images or sequences of images). Plus, getting validation data seems harder for videos than for images.

It means HTMs need streaming temporal input. Videos are an example of this. So is sound, or an accelerometer, or any of the millions of IoT devices that emit streams of data over wifi today.

When thinking about HTM, you always have to think about temporal context and moving through an environment. In the world, sometimes things move in relation to you, but most of the time, things move with respect to your coordinate frame because you are moving.

A temporal stream of highly varying pictures might work fine in a reality where physics worked differently and things jumped from place to place without moving through the points between them, but that’s not the reality where our brains evolved. :stuck_out_tongue: (or of course if you had motor commands to process and the jumps were caused by self-movement)

As someone already stated in this thread, the field of view of your retina is about the size of your thumb at arm’s length, and you’re constantly moving that same retina in order to see the world effectively. So you see a whole image by iterating your retina over it. That’s in some ways similar to convolution (because of weight sharing), but the point is not to store the result of the convolution kernel in spatial order; it is to view the results as a sequence of feature-locations.
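
A rough sketch of that difference, assuming a tiny grayscale image stored as nested lists and a hand-picked list of fixation points (the functions `glimpse` and `scan`, the image, and the fixations are all hypothetical). The same small window is reused at every fixation, which is the weight-sharing part, but the output is a sequence of (patch, location) pairs instead of a spatial feature map.

```python
def glimpse(image, row, col, size):
    """Return the size x size patch whose top-left corner is (row, col)."""
    return tuple(tuple(image[r][col:col + size]) for r in range(row, row + size))

def scan(image, fixations, size=2):
    """Visit the fixation points in order, emitting (patch, location) pairs."""
    return [(glimpse(image, r, c, size), (r, c)) for r, c in fixations]

# A 4x4 toy image with a bright 2x2 square in the middle.
IMAGE = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# The "saccade plan" here is fixed by hand; a real system would choose it.
sequence = scan(IMAGE, fixations=[(0, 0), (1, 1), (2, 2)])
for patch, location in sequence:
    print(location, patch)
```

Memory stays proportional to the window size and the number of glances, not to the whole image, which is the point the original question about a convolutional spatial pooler was worried about.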


That makes sense: instead of processing a series of images sequentially, process the image itself as a sequence of features in space within a given window, and from those features determine whether there is one object or many, what they are, and/or their extent.

This would mean that for the first encoding, a sparse array could be created for each subset of the data, containing something along the lines of the type of feature and its scale (or scale could be treated as a different feature type depending on the size). That way it would be quite similar to typical convolutional neural networks that learn features, but instead of processing the whole image with many “cells”/kernels, it would be done sequentially.
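
Something like the following is what I have in mind, as a toy sketch only. The feature types, the scales, the number of active bits per bucket, and the `encode` function are all assumptions of mine, not an existing HTM encoder.

```python
# Hypothetical vocabulary; a real system would learn or design these.
FEATURE_TYPES = ["edge", "corner", "blob"]
SCALES = [1, 2, 4]
ACTIVE_BITS = 3  # bits switched on for each (type, scale) bucket

def encode(feature, scale):
    """Return a sparse binary list with one block of active bits per bucket."""
    n_buckets = len(FEATURE_TYPES) * len(SCALES)
    bits = [0] * (n_buckets * ACTIVE_BITS)
    bucket = FEATURE_TYPES.index(feature) * len(SCALES) + SCALES.index(scale)
    start = bucket * ACTIVE_BITS
    for i in range(start, start + ACTIVE_BITS):
        bits[i] = 1
    return bits

sdr = encode("corner", 2)
print([i for i, b in enumerate(sdr) if b])  # indices of the active bits
```

With this layout, different (type, scale) pairs never share bits, so there is no notion of similarity between neighboring scales; overlapping the blocks would be one way to add that.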

That would make a lot of sense since the most important part is the first encoding of anything in order to make sure that no critical information is lost.

Thanks again for those replies, they are truly helpful!

That’s too simplified a view, which doesn’t explain how we can recognize anything in less time than a saccade takes. Actually, our brain works pretty well with sequences of rapidly changing pictures: as I remember, the experimentally proven time is about 13 ms, in contrast to the roughly 200-300 ms needed for just one saccade.

Perhaps the very fast scene recognition registers in 13 ms, but NOT details. The lower resolution image (whole retina) may be rapidly processed by the older subcortical structures.

I strongly suspect that drives the scanning for feature recognition by the cortex. This takes time to filter from V1 to the temporal lobe and enter into conscious perception and autobiographical memory.

You are right, but we don’t need details to recognize a familiar pattern; we don’t even need all parts of the pattern. We need details only to learn it.

As far as HTM and spatial invariance - how does the brain do this?

It’s not all done in one go.

We know something about the processing that is done in the subcortical structures. We know of place cells, grid cells[7], head position cells[6], border cells (more than likely whisker cell signaling in rodents), and goodness knows what else.[13][14] It is clear that the brain is forming an abstract representation of space. We even know the native data structure format - distributed grids.

In part of this system, we have a gyro stabilized reference platform (the vestibular system) that is directly mapped to the eye tracking system to keep the eyes from being distracted by self-motion. (At that point, in that little spot on the brain stem, there is the closest thing that if you had the correct instrumentation, you could measure as the neural correlates to a sense of self.[1][2]) I believe that this is fed in a stream to the hippocampus. We know about the head position thing. I’m sure there is much more. [3][5][10]

As you progress around two oddly shaped hippocampal structures there is an interesting data reduction. This grid changes scales 1:1.4 (square root of 2) relative to the adjacent area.[8] That means as you sample longitudinally you have multiple scales of the same thing. If that pattern is projected to the same general cortex area you would get a “halo” of scaled representations all being imprinted/learned at the same time. It would have some of the properties of a scratch hologram[4]; these features are self-reinforcing for retrieval. You might call that scale invariance. I certainly do.
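
To put numbers on that ratio, here is a small worked sketch. The base spacing of 1.0 and the count of five scales are arbitrary assumptions; only the square-root-of-2 ratio comes from the text above.

```python
import math

def module_scales(base_spacing, n_modules, ratio=math.sqrt(2)):
    """Grid spacing of each successive module, growing by a fixed ratio."""
    return [base_spacing * ratio ** k for k in range(n_modules)]

scales = module_scales(base_spacing=1.0, n_modules=5)
for k, s in enumerate(scales):
    print(f"module {k}: spacing {s:.3f}, area factor {s * s:.3f}")
```

Because the ratio is the square root of 2, the spacing doubles every two steps and the area covered by one grid cell doubles at every step, so a handful of modules already spans a wide range of scales.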

One of the virtues of well distributed neural implementation is the massive redundancy of processing power and data integration. The grid things are as good a distributed data format as anything I could have thought of. How exactly does that map to columns? I suspect that the column features are supportive of processing the kind of thing the hippocampus likes to signal. Look at the “place cells” activation area or size. Compare that to the cortical columns that connect to that area and the population of grid cells being sampled by those cells. That should give you a working approximation for the scope of the local processing that is being done to form the place recognition. What is it in the local grid representation that can communicate the idea of a place? [9] While grids provide a rich framework for this level of coding I don’t think anyone has looked for them as an answer before:
“At the top of the cortical hierarchy, where information is combined across sensory systems, it was often no longer possible to match the firing patterns to any experimentally defined stimulus patterns” [7]

The hippocampus is famous for being longish, with the end being next to the amygdala, which is well known for being hardwired to sense certain patterns.[11] This system also demonstrates an interesting property: the translation from a retina-centric reference to the coordinate system you perceive around you. I used to think that this was some complicated math transformation and wondered how the brain could calculate that. I see now that I was making it too hard: this is the “anchor” on the other end of the vestibular system. You could think of the hippocampus as a spatial or geometry co-processor. Call it an HTM reality encoder if it sounds better.

If the amygdala is able to decode facial features out of the coded data, SDRs should be able to read it too. The level of communication at this point is actual situational recognition.[12]

Summary: If HTM/SDRs in combination with some supporting neural network types are configured in a SYSTEM that provides the functions that are known to be in the subcortical processing system, they should be able to do coordinate transformations, recognition, and scaling of a stream of objects in images. A spatial reference must be provided with the stream.

[3] The role of vestibular and somatosensory systems in intersegmental control of upright stance - Rob Creath, Tim Kiemel, Fay Horak, and John J. Jeka
[5] How Basal Ganglia Outputs Generate Behavior
[6] Topography of head direction cells in medial entorhinal cortex
[7] Grid Cells and Neural Coding in High-End Cortices
[8] Topography of Place Maps along the CA3-to-CA2 Axis of the Hippocampus
[9] Computational Models of Grid Cells
[10] Time Finds Its Place in the Hippocampus
[11] The role of the amygdala in face perception and evaluation
[12] The Amygdala Modulates Memory Consolidation of Fear-Motivated Inhibitory Avoidance Learning
[13] Neuronal Implementation of Hippocampal-Mediated Spatial Behavior: A Comparative Evolutionary Perspective
[14] Landmark-Based Updating of the Head Direction System by Retrosplenial Cortex: A Computational Model


As far as I’m aware, it does not (not yet at least). The inputs to each cell are local to the receptive field. Like you’ve mentioned, convolution is an easy workaround, but the brain doesn’t do that.

The premise for HTM in Jeff’s book On Intelligence (and many other publications) suggests that hierarchy deals with invariance. The combination of feedforward and feedback from regions above and below (and from regions onto themselves) forms a stable representation of the sensory input based upon overall context from other regions. In theory, invariance does not occur in one place (i.e., V1 layer 4 / spatial pooler) but instead is a distributed process occurring throughout all the regions of the ‘what’ pathway of the visual cortex.

Invariance is a problem of perception. Sometimes you will need to scan a scene to form a stable representation because your cortex will need to build up the context and content of that scene. Other times you can instantly form a representation in a blink of an eye. It all depends on how much context has already been built up about the scene.

Perception is controlled by top-down bias but promoted by bottom-up stimulation, meaning input from regions higher in the cortex biases the input from lower regions of the cortex. While you are looking at a scene/object, your cortex is in an unstable state caused by competition between neuronal groups in different layers in different regions. Eventually the cells settle into an attractor state that forms a stable representation.
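
For anyone who hasn’t seen "settling into an attractor" in code, here is a toy illustration. To be clear, this is not HTM and not a model of cortex: it is a classic Hopfield-style network swapped in as the simplest attractor system I know, with two made-up 8-unit patterns and hypothetical helper names `train` and `settle`.

```python
def train(patterns):
    """Hebbian weights: units that are active together get a positive link."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    w[i][j] += p[i] * p[j]
    return w

def settle(w, state, max_sweeps=10):
    """Update units one at a time until a full sweep changes nothing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            h = sum(w[i][j] * state[j] for j in range(len(state)))
            new = state[i] if h == 0 else (1 if h > 0 else -1)
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state

# Two stored "representations" (values are +1 / -1).
P1 = [1, 1, 1, 1, -1, -1, -1, -1]
P2 = [1, -1, 1, -1, 1, -1, 1, -1]
W = train([P1, P2])

noisy = list(P1)
noisy[0] = -1  # corrupt one unit of the input
print(settle(W, noisy) == P1)
```

The corrupted input gets pulled back to the nearest stored pattern, which is the sense in which the state is "stable". How much context is already present decides how far the starting point is from an attractor, and so how long the settling takes.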

Of course, this problem of invariance applies to more than just vision, but to other sensory inputs, and the combination of them all.

While HTM is still in the TM phase, it seems a lot of invariance problems will be resolved by implementing hierarchical structure of regions, in the H phase.


Yes and no. The moving of the eye gaze really does do much the same thing. I would follow this up in directing the configuration of the focus in a convolutional network based on a pre-analysis from a prior level. For bonus points, you could put this in the middle of a RSS loop.

You could make a case that the feedback from the frontal cortex is doing much the same thing in the cocktail party effect.

I used to think the same thing, however it does not explain how you can recognize an object before your eyes even have a chance to perform a saccade. Nor does it explain how you can recognize an object that is outside your gaze in your peripheral vision. This suggests there is more going on than weight sharing.

There is distortion of the incoming visual scene caused by the retina, which blows up the center and shrinks the periphery, making the center of gaze the dominant input space. But because the center is ‘zoomed in’, the saccadic movement of gaze is used to break down the image into finer details. The more time you spend looking at something, the more you see, simply because you are spending more time scanning the scene.
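
A crude one-dimensional sketch of that kind of sampling, assuming the spacing between samples simply doubles with each step away from the center of gaze. The doubling rule, the ring count, and the functions `foveated_offsets` and `sample_row` are my own simplifications, not a measured retinal model.

```python
def foveated_offsets(n_rings, growth=2):
    """Sample offsets around the gaze center; spacing grows with eccentricity."""
    offsets = [0]
    step, pos = 1, 0
    for _ in range(n_rings):
        pos += step
        offsets.append(pos)
        step *= growth
    # Mirror the positive offsets to cover both sides of the center.
    return [-o for o in reversed(offsets[1:])] + offsets

def sample_row(row, center, n_rings):
    """Read a row of pixels through the foveated sampling pattern."""
    picks = (center + o for o in foveated_offsets(n_rings))
    return [row[i] for i in picks if 0 <= i < len(row)]

row = list(range(20))  # pixel values equal their own index
print(foveated_offsets(3))
print(sample_row(row, center=10, n_rings=3))
print(sample_row(row, center=4, n_rings=3))  # after a "saccade" to index 4
```

Neighbors of the center are sampled densely and the edges sparsely, so moving the center is the only way to get fine detail about a different part of the row.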

If saccades were a requirement in order for you to recognize something, then we would be too slow to react to dangerous situations. If a predator instantly appeared in your sight and you had to make a number of saccading movements in order to recognize it as a tiger, you would probably already be dinner.

Both the speed and the fact that a scene has to be scanned by the cortex for fine details suggest that this fast recognition is done, at least in part, by the evolutionarily older subcortical structures using the mostly lower-resolution total retinal image.
