In vision, but also in sound, things are grouped together. When looking at a group of things, we are perfectly capable of telling one item or one person apart from another. What I am trying to understand is how the brain does this according to HTM theory.
Even in a picture (a static, non-moving image), we are perfectly able to identify individual items. How does this work in HTM theory? I know it’s about learned patterns, meaning that we already have a pattern for what we see, and therefore we recognize the pattern. That part makes sense. What I’m trying to understand is how we can distinguish these patterns from the background noise in the image.
I’m guessing eye movement and head movement help to isolate individual objects from the background. Even the position of both eyes and the amount of focus help, because they give information about distance. But put basically and simply: in HTM theory, how does the system stay focused on one thing only, and how is it able to isolate this one thing from the background?
During training, you learn to identify spatial features that commonly co-occur. You also learn temporally pooled representations that respond to any of the elements in a learned, predictable sequence. The objects that we identify can be learned in these two ways (among others).
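To make the temporal pooling idea concrete, here is a minimal sketch using plain Python sets as stand-ins for SDRs. This is not a documented Numenta algorithm; union-style pooling is just one of the techniques that has been experimented with, and all the bit patterns below are made up for illustration.

```python
# Minimal sketch of union-style temporal pooling (one experimental technique,
# not a documented algorithm). SDRs are modeled as plain Python sets of
# active cell indices; all values are invented for illustration.

def temporal_pool(sequence_sdrs):
    """Pool the SDRs of a learned sequence into one stable representation.

    Because the pooled set is the union over all elements, it overlaps
    strongly with *any* single element of the sequence.
    """
    pooled = set()
    for sdr in sequence_sdrs:
        pooled |= sdr
    return pooled

# Hypothetical SDRs for three successive views (fixations) of one object.
view_a = {1, 5, 9}
view_b = {2, 5, 12}
view_c = {5, 9, 20}

object_rep = temporal_pool([view_a, view_b, view_c])
```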
Now when you see a static image, the learned objects in the image will activate the spatial and temporally pooled representations that represent the learned object. The fact that temporal pooling during training can help static/flash inference isn’t intuitive, but it’s a powerful property.
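Continuing the same toy sketch, this is why flash inference falls out of temporal pooling: a single static view overlaps the pooled object representation enough to activate it, with no sequence playing at all. The threshold here is an arbitrary assumption.

```python
# Flash inference on a static image, continuing the sketch above: one glance
# is enough to activate a representation that was learned from a sequence.

object_rep = {1, 2, 5, 9, 12, 20}  # pooled representation from the sketch above

def overlap(sdr, pooled):
    """Count how many active cells the input shares with the pooled rep."""
    return len(sdr & pooled)

static_view = {2, 5, 9}  # hypothetical SDR from a single static glance
THRESHOLD = 2            # assumed recognition threshold, chosen arbitrarily

print("object recognized:", overlap(static_view, object_rep) >= THRESHOLD)
```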
This is a hand-wavy explanation. We don’t have specific temporal pooling algorithms documented, but we have experimented with various techniques and are actively working on figuring out how this works in complex systems. The spatial pooling part is well understood.
As for attention, that’s a less understood topic. We generally work without attention and assume that the representations capture everything in the sensory space. But there are certainly attention mechanisms that allow you to narrow your focus, and I don’t understand those well. Jeff may have some ideas about it, though.
I would like to approach this from a different perspective.
We know that identifying “what belongs together” is both a temporal (sequence of images) and a spatial (locations on the image) problem.
Let’s say you saw a car that had a similar form to your favorite car model. What you automatically do is produce saccades (eye movements) that check the important places for identifying the car, like the logo, headlights, etc. From the outside it looks as if you are focusing on one object. But internally, the form of the car activated some learned pattern, and some cells were depolarized (predictions) because of this activation. These predictions create the necessary motor behavior (moving your eyes toward identifying places) to further understand which pattern is really being observed. This in turn leads to a sequence of images that further defines what is being seen, and as a result new or extended activation will take place based on this temporal information. Consequently, the predictions of this new activation will make you look at more accurate and better places on the object that further help identification, until you are satisfied with the result. So identification is a process that involves temporal and spatial data and sensorimotor activity, and it cannot be decoupled from attention, as far as my understanding goes.
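Here is a hand-wavy, self-contained sketch of that loop. Everything in it is made up for illustration (the objects, the features, and the “look where candidates disagree” fixation policy); it is not an HTM implementation, just the shape of the predict → saccade → sense cycle described above.

```python
# Toy sketch of the predict -> saccade -> sense loop described above.
# All objects, features, and the fixation policy are invented for
# illustration; this is not an HTM implementation.

# Toy "world": the features actually present at locations on the object.
car = {"front": "logo", "top-left": "headlight", "top-right": "headlight"}

# Toy learned memory: object -> {location: predicted feature}.
memory = {
    "favorite_car": {"front": "logo",
                     "top-left": "headlight", "top-right": "headlight"},
    "truck":        {"front": "grille",
                     "top-left": "headlight", "top-right": "headlight"},
}

def recognize(world, memory, start="top-left", max_saccades=5):
    candidates = set(memory)  # every learned object is possible at first
    location = start
    for _ in range(max_saccades):
        feature = world.get(location)  # sense at the current fixation point
        # Keep only objects whose prediction at this location matches input.
        candidates = {o for o in candidates
                      if memory[o].get(location) == feature}
        if len(candidates) <= 1:
            return candidates  # "satisfied with the result"
        # Predictions drive the next saccade: fixate a location where the
        # remaining candidates disagree, i.e. the most informative place.
        locations = {loc for o in candidates for loc in memory[o]}
        disagreeing = [loc for loc in locations
                       if len({memory[o].get(loc) for o in candidates}) > 1]
        location = disagreeing[0] if disagreeing else location
    return candidates

print(recognize(car, memory))  # -> {'favorite_car'}
```

The point is only the shape of the loop: each sensed feature narrows the set of active hypotheses, and the surviving predictions decide where the eyes go next.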
These are hot topics that are being researched and not much is clear about these processes.