As someone had already stated in this thread, the high-acuity field of view of your retina (the fovea) is about the size of your thumb at arm’s length, and you’re constantly moving that same retina in order to effectively see the world. So you see a whole image by iterating your retina over it. That is in some ways similar to convolution (because of weight sharing), but its nature is not to store the result of the convolution kernel in spatial order; it is to view that result as a sequence of feature-locations.
That makes sense: instead of processing a series of images sequentially, process the image itself as a sequence of features in space within a given window, and from those features determine whether there is one object or many, what they are, and/or their extent.
This would mean that for the first encoding, a sparse array could be created for each subset of the data, containing something along the lines of the type of feature and its scale (or scale could be treated as a different feature type depending on the size). That way it would be quite similar to typical convolutional neural networks that learn features, but instead of covering the whole image with many “cells”/kernels, it would be done sequentially.
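To make that first encoding concrete, here is a minimal sketch of turning an image into a sequence of feature-locations instead of a spatial feature map. Everything in it is an assumption for illustration only: the `classify_patch` detector, the feature labels, and the window and stride sizes are hypothetical and not part of HTM or any library.

```python
from typing import List, Tuple

Glimpse = Tuple[int, int, str]  # (row, col, feature label)

def classify_patch(patch: List[List[int]]) -> str:
    """Toy feature detector (hypothetical): label a patch by its dominant edge."""
    h, w = len(patch), len(patch[0])
    # Changes along a row indicate a vertical edge.
    dx = sum(abs(patch[r][c + 1] - patch[r][c]) for r in range(h) for c in range(w - 1))
    # Changes along a column indicate a horizontal edge.
    dy = sum(abs(patch[r + 1][c] - patch[r][c]) for r in range(h - 1) for c in range(w))
    if dx == 0 and dy == 0:
        return "blank"
    if dx > dy:
        return "vertical_edge"
    if dy > dx:
        return "horizontal_edge"
    return "corner"

def scan(image: List[List[int]], window: int = 2, stride: int = 2) -> List[Glimpse]:
    """Move one shared detector over the image and keep only non-blank glimpses."""
    glimpses: List[Glimpse] = []
    for r in range(0, len(image) - window + 1, stride):
        for c in range(0, len(image[0]) - window + 1, stride):
            patch = [row[c:c + window] for row in image[r:r + window]]
            label = classify_patch(patch)
            if label != "blank":  # sparse: blank regions produce no entry
                glimpses.append((r, c, label))
    return glimpses

demo_image = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
sequence = scan(demo_image)
```

The same detector is reused at every position (the weight-sharing part), but the output is a list ordered by visit time, so a downstream sequence learner would see features paired with locations rather than a grid.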
That would make a lot of sense since the most important part is the first encoding of anything in order to make sure that no critical information is lost.
Thanks again for those replies, they are truly helpful!
That’s too simplified a view; it doesn’t explain how we can recognize anything in a period shorter than a saccade. Actually, our brain works pretty well with sequences of rapidly changing pictures. As I remember, the experimentally proven time is about 13 ms, in contrast to the roughly 200-300 ms needed for just one saccade.
Perhaps the very fast scene recognition registers in 13 ms, but NOT details. The lower resolution image (whole retina) may be rapidly processed by the older subcortical structures.
I strongly suspect that drives the scanning for feature recognition by the cortex. This takes time to filter from V1 to the temporal lobe and enter into conscious perception and autobiographical memory.
You are right, but we don’t need details to recognize a familiar pattern; we don’t even need all parts of the pattern. We need details only to learn it.
As far as HTM and spatial invariance - how does the brain do this?
It’s not all done in one go.
We know something about the processing that is done in the subcortical structures. We know of place cells, grid cells, head position cells, border cells (more than likely whisker cell signaling in rodents), and goodness knows what else. It is clear that the brain is forming an abstract representation of space. We even know the native data structure format - distributed grids.
In part of this system, we have a gyro stabilized reference platform (the vestibular system) that is directly mapped to the eye tracking system to keep the eyes from being distracted by self-motion. (At that point, in that little spot on the brain stem, there is the closest thing that if you had the correct instrumentation, you could measure as the neural correlates to a sense of self.) I believe that this is fed in a stream to the hippocampus. We know about the head position thing. I’m sure there is much more. 
As you progress around two oddly shaped hippocampal structures there is an interesting data reduction. This grid changes scales 1:1.4 (square root of 2) relative to the adjacent area. That means as you sample longitudinally you have multiple scales of the same thing. If that pattern is projected to the same general cortex area you would get a “halo” of scaled representations all being imprinted/learned at the same time. It would have some of the properties of a scratch hologram; these features are self-reinforcing for retrieval. You might call that scale invariance. I certainly do.
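The scale progression described above is just a geometric series, which a few lines can illustrate. This is a sketch under stated assumptions: the base spacing of 30 units and the count of five modules are arbitrary illustrative values, and `module_scales` is a hypothetical helper name.

```python
import math

def module_scales(base_spacing: float, n_modules: int, ratio: float = math.sqrt(2)) -> list:
    """Grid spacing per module, each one `ratio` times larger than the previous."""
    return [base_spacing * ratio ** k for k in range(n_modules)]

# Assumed values: 30 units for the finest module, five modules in total.
scales = module_scales(30.0, 5)
for k, spacing in enumerate(scales):
    print(f"module {k}: spacing = {spacing:.1f}")
```

With a ratio of the square root of 2, every second module doubles the spacing, so sampling along the axis gives the same pattern at several nested scales at once.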
One of the virtues of well distributed neural implementation is the massive redundancy of processing power and data integration. The grid things are as good a distributed data format as anything I could have thought of. How exactly does that map to columns? I suspect that the column features are supportive of processing the kind of thing the hippocampus likes to signal. Look at the “place cells” activation area or size. Compare that to the cortical columns that connect to that area and the population of grid cells being sampled by those cells. That should give you a working approximation for the scope of the local processing that is being done to form the place recognition. What is it in the local grid representation that can communicate the idea of a place?  While grids provide a rich framework for this level of coding I don’t think anyone has looked for them as an answer before:
“At the top of the cortical hierarchy, where information is combined across sensory systems, it was often no longer possible to match the firing patterns to any experimentally defined stimulus patterns” 
The hippocampus is famous for being longish, with the end being next to the amygdala, which is well known for being hardwired to sense certain patterns. This system also demonstrates an interesting property: the translation from a retina-centric reference to the coordinate system you perceive around you. I used to think that this was some complicated math transformation and wondered how the brain could calculate that. I see now that I was making it too hard: this is the “anchor” on the other end of the vestibular system. You could think of the hippocampus as a spatial or geometry co-processor. Call it an HTM reality encoder if it sounds better.
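For comparison, the explicit math version of a retina-centric to world-centric translation is small in two dimensions. This is only a sketch of the geometry, not a claim about how the brain does it: the function name, the convention that the retinal x axis points along the gaze direction, and the example numbers are all assumptions.

```python
import math

def retina_to_world(retinal_xy, gaze_angle, head_xy):
    """Rotate a retina-centered offset by the gaze direction, then shift by head position.

    retinal_xy: (ahead, left) offset of the object relative to the gaze direction.
    gaze_angle: gaze direction in world coordinates, in radians (0 = +x axis).
    head_xy:    head position in world coordinates.
    """
    rx, ry = retinal_xy
    c, s = math.cos(gaze_angle), math.sin(gaze_angle)
    return (head_xy[0] + c * rx - s * ry,
            head_xy[1] + s * rx + c * ry)

# Assumed example: head at (1, 2), looking along +y, object one unit straight ahead.
world = retina_to_world((1.0, 0.0), math.pi / 2, (1.0, 2.0))
```

The point of the example is that the transform needs a gaze direction and a head position as extra inputs, which is the role a vestibular reference and head-direction signal would play.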
If the amygdala is able to decode facial features out of the coded data SDRs should be able to read it too. The level of communications at this point is actual situational recognition.
Summary: If HTM/SDRs in combination with some supporting neural network types are configured in a SYSTEM that provides the functions that are known to be in the subcortical processing system, they should be able to do coordinate transformations, recognition, and scaling of a stream of objects in images. A spatial reference must be provided with the stream.
 The role of vestibular and somatosensory systems in intersegmental control of upright stance - Rob Creath, Tim Kiemel, Fay Horak, and John J. Jeka
 Abrasion Holograms: Frequently Asked Questions
 How Basal Ganglia Outputs Generate Behavior
 Topography of head direction cells in medial entorhinal cortex
 Grid Cells and Neural Coding in High-End Cortices
 Topography of Place Maps along the CA3-to-CA2 Axis of the Hippocampus
 Computational Models of Grid Cells
 Time Finds Its Place in the Hippocampus
 The role of the amygdala in face perception and evaluation
 The Amygdala Modulates Memory Consolidation of Fear-Motivated Inhibitory Avoidance Learning
 Neuronal Implementation of Hippocampal-Mediated Spatial Behavior: A Comparative Evolutionary Perspective
 Landmark-Based Updating of the Head Direction System by Retrosplenial Cortex: A Computational Model
As far as I’m aware, it does not (not yet at least). The inputs to each cell are local to the receptive field. Like you’ve mentioned, convolution is an easy workaround, but the brain doesn’t do that.
The premise for HTM in Jeff’s book On Intelligence (and many other publications) suggests that hierarchy deals with invariance. The combination of feedforward and feedback signals from the regions above and below (and from each region onto itself) forms a stable representation of the sensory input based upon overall context from other regions. In theory, invariance does not occur in one place (i.e. V1 layer 4 / spatial pooler) but instead is a distributed process occurring throughout all the regions of the ‘what’ pathway of the visual cortex.
Invariance is a problem of perception. Sometimes you will need to scan a scene to form a stable representation because your cortex will need to build up the context and content of that scene. Other times you can instantly form a representation in a blink of an eye. It all depends on how much context has already been built up about the scene.
Perception is controlled by top-down bias but promoted by bottom-up stimulation, meaning input from regions higher in the cortex biases the input from lower regions of the cortex. While you are looking at a scene/object, your cortex is in an unstable state caused by competition between neuronal groups in different layers in different regions. Eventually the cells settle into an attractor state that forms a stable representation.
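As a toy picture of “settling into an attractor state”, a Hopfield-style network can be swapped in as a generic stand-in. To be clear about the assumptions: this is not HTM and not a model of cortical layers; the two stored patterns, the network size, and the function names are all hypothetical and chosen only to show an unstable input relaxing to a stable stored representation.

```python
def train(patterns):
    """Hebbian outer-product weights for a small Hopfield-style network."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    w[i][j] += p[i] * p[j] / n
    return w

def settle(w, state, max_sweeps=10):
    """Update units one at a time until no unit changes (a fixed point)."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            total = sum(w[i][j] * state[j] for j in range(n))
            new = 1 if total >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state

# Two assumed, mutually orthogonal "known objects" encoded as +1/-1 vectors.
pattern_a = [1, 1, 1, 1, -1, -1, -1, -1]
pattern_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train([pattern_a, pattern_b])

# A corrupted view of pattern_a (first unit flipped) settles back to pattern_a.
noisy_a = [-1, 1, 1, 1, -1, -1, -1, -1]
recalled = settle(weights, noisy_a)
```

The stored patterns play the role of context that has already been built up; the more of it there is, the fewer update sweeps an ambiguous input needs before it stops changing.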
Of course, this problem of invariance applies to more than just vision, but to other sensory inputs, and the combination of them all.
While HTM is still in the TM phase, it seems a lot of invariance problems will be resolved by implementing hierarchical structure of regions, in the H phase.
Yes and no. The moving of the eye gaze really does do much the same thing. I would follow this up in directing the configuration of the focus in a convolutional network based on a pre-analysis from a prior level. For bonus points, you could put this in the middle of a RSS loop.
You could mount a case that the feedback from the frontal cortex is doing much the same in the cocktail party effect.
I used to think the same thing, however it does not explain how you can recognize an object before your eyes even have a chance to perform a saccade. Nor does it explain how you can recognize an object that is outside your gaze in your peripheral vision. This suggests there is more going on than weight sharing.
There is distortion of the incoming visual scene caused by the retina that blows up the center and shrinks the periphery, making the center of gaze the dominant input space. But because the center is ‘zoomed in’, the saccadic movement of gaze is used to break down the image into finer details. The more time you spend looking at something, the more you see, simply because you are spending more time scanning the scene.
If saccades were a requirement in order for you to recognize something, then we would be too slow to react to dangerous situations. If a predator instantly appeared in your sight and you had to make a number of saccading movements in order to recognize it as a tiger, you would probably already be dinner.
Both the speed and the fact that the scene has to be scanned by the cortex for fine details suggest that this fast recognition is done, at least in part, by the evolutionarily older subcortical structures using the mostly lower-resolution total retinal image.
Thank you for providing your vision of the whole picture with supporting links.
Could you also elaborate on this statement?
I support this point of view, but could you point to any supporting evidence?
Please look at links , , and  above for some basic information on grid cells.
I believe everybody here is familiar with the Mosers’ work, but you described it as a universal data structure format. To the best of my knowledge, it’s only a part of the navigation system. Do you have another vision of it?
Expand your view of navigation.
We know from patient HM that without the hippocampus you lose your ability to encode your navigation through experience. It makes some sense to think that the hippocampus samples the grid structure and rearranges or encodes the data that is proximal to the temporal lobe to be learned as your autobiographical experience.
The cortex is still forming the “global workspace” but the mapping of navigation in this workspace is not being encoded and stamped into the temporal lobe.
 Consciousness and Cognitive Access, Ned Block; see figure 4.
This could be the where versus the what pathways of vision. If I am a fish and believe I am in a dangerous location and something appears to the right I know to fish fast to the left. On the other hand if I believe mates are near I choose to proceed in a courtly manner to the right.
For the detailed “what”, saccading to center the region of interest makes sense. Orientation is covered by both Hawkins’ theory and Hinton’s capsule networks, with Hawkins’ being closer to the biology and more generalizable.
Sometimes when a thing is highly rotated I rotate my head to see it better, which is not something a mature adult does, due to vanity.
I also appreciate your informative answer above. Can you please explain “distributed grids”? Thanks.
It looks like you are talking about place cells (let’s generalize the term to cover direction and other functionally described types), not grid cells, which are located in the medial entorhinal cortex. Place cells don’t have the grid structure.
There is good reason to believe that the place cells sample the grid cells with SDRs to form a pocket of response.
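One way to picture a “pocket of response” is to sum several periodic grid inputs of different spacing and apply a threshold. The sketch below is a one-dimensional toy under stated assumptions: cosine-shaped grid tuning, five modules spaced by a factor of the square root of 2, all phases aligned at one assumed preferred position, and a plain sum-and-threshold readout standing in for SDR sampling. None of the names or numbers come from the thread or from an HTM library.

```python
import math

def grid_response(x, spacing, phase):
    """Periodic firing of one idealized 1-D grid cell, in the range 0..1."""
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * (x - phase) / spacing))

# Assumed module spacings: 30 units, growing by sqrt(2) per module.
SPACINGS = [30.0 * math.sqrt(2) ** k for k in range(5)]
PREFERRED_PLACE = 80.0  # assumed position where all sampled grid phases line up

def place_drive(x):
    """Summed input from one sampled grid cell per module."""
    return sum(grid_response(x, s, PREFERRED_PLACE) for s in SPACINGS)

def place_cell_active(x, threshold=4.0):
    """The toy place cell fires only where most of its grid inputs agree."""
    return place_drive(x) >= threshold

track = range(0, 161)
peak = max(track, key=place_drive)
field = [x for x in track if place_cell_active(x)]
print("peak at", peak, "field spans", min(field), "to", max(field))
```

Each grid input alone repeats along the track, but because the spacings differ, the inputs only agree near the shared phase, so the thresholded sum is localized even though none of its inputs is.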
Sure, it’s quite clear that the grid cells provide an important part of the input to activate some place cells.
I mean, grid cells are just one of such inputs (albeit a very important one for real-space navigation). Or, perhaps, it would be clearer to say that place cells can work without grid cells, but grid cells are useless without place cells.
Place cells by themselves don’t have a grid structure or any other structure known to me. That’s why I was surprised to hear about a universal data structure format in the brain.