TM-driven toy robot learning strategy

I think the fovea-motion stream mentioned here deserves a bit of detail.

Why this matters: it might feed some experimental meat to HTM theory in order to move it further, via a setup conceptually simple enough for an average software engineer to follow with less effort than previous experiments on HTM reinforcement learning.

Here are the core points of this proposal:

  • We assume a visual robot is exposed to a picture.
  • Unlike other vision-modelling strategies (CNNs, Transformers, whatever) where the model ingests the whole picture in order to “make sense” of it, this one has two important properties which it uses to navigate the picture it is exposed to:
    • it can only see a foveal patch of the picture - which means two things:
      • the patch image is provided by a low-resolution camera.
      • this camera is movable along three axes - x, y, z: horizontal, vertical and zoom. What is important is that as the camera zooms out it provides a blurry low-res image, so in order to “see” more detail of an area it needs to move closer to that area.
    • the second property is that it is able to move the camera. Motions of the camera are encoded as a set of three scalars representing either the new absolute position of the camera or a position relative to its current one. (I think the relative motion encoding is more fruitful in the beginning.)
  • So an (H)TM learning the “world” has to make a good prediction of the next step in an arbitrary State->Action->State->Action… stream, where “State” is an SDR encoding what the camera sees, and “Action” is an SDR encoding a camera movement.
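A minimal sketch of the foveal camera described above, assuming a grayscale numpy image; the fixed retina size, the point-sampling, and the clamping at image borders are my own choices, not part of the proposal:

```python
import numpy as np

PATCH = 16  # fixed "retina" resolution - an assumption of this sketch

def foveal_patch(image, cx, cy, zoom):
    """Return a PATCH x PATCH view of `image` centred at (cx, cy).

    `zoom` is the half-width in pixels of the sampled window: small
    zoom = close-up detail, large zoom = wide but coarse view, because
    the window is always resampled down to the same PATCH x PATCH grid.
    A real camera model would low-pass filter before subsampling.
    """
    h, w = image.shape
    ys = np.clip(np.linspace(cy - zoom, cy + zoom, PATCH).astype(int), 0, h - 1)
    xs = np.clip(np.linspace(cx - zoom, cx + zoom, PATCH).astype(int), 0, w - 1)
    return image[np.ix_(ys, xs)]

img = np.arange(100 * 100, dtype=float).reshape(100, 100)
close = foveal_patch(img, 50, 50, 8)   # fine detail around the centre
wide = foveal_patch(img, 50, 50, 40)   # most of the scene, coarsely sampled
```

“Zooming out” here simply means sampling the same 16x16 grid over a larger window, which is what makes the wide view blurry relative to the close one.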

Well, the above has a few problems:

  • if camera x/y/z movements are random, that might hinder the TM’s ability to learn.
  • if they are not random, that might hinder the TM’s ability to predict the state following a new motion that was absent from the training samples.
  • most important, this strategy might not be good at teaching the model which motion it should make next, simply because, in principle, any motion is possible and hence valid.

For now I’ll leave these questions here, and follow up soon.
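To make the State->Action stream concrete before returning to these questions, here is a rough sketch of an action encoder; the bucket-style scalar encoder, the bit widths, and the per-axis concatenation are my own assumptions, not taken from any existing HTM library:

```python
import numpy as np

def scalar_sdr(value, vmin, vmax, size=64, active=8):
    """Bucket-style scalar encoder: a run of `active` on-bits whose
    position within `size` bits tracks where `value` sits in
    [vmin, vmax], so nearby values share bits."""
    sdr = np.zeros(size, dtype=np.uint8)
    frac = np.clip((value - vmin) / (vmax - vmin), 0.0, 1.0)
    start = int(round(frac * (size - active)))
    sdr[start:start + active] = 1
    return sdr

def action_sdr(dx, dy, dz, limit=10):
    """Relative camera motion -> SDR: one scalar encoder per axis,
    concatenated. `limit` bounds the step size of a single move."""
    return np.concatenate([scalar_sdr(d, -limit, limit) for d in (dx, dy, dz)])

# Similar motions overlap heavily, dissimilar ones much less - the
# property a TM needs to generalise across nearby camera moves.
a = action_sdr(1, 0, 0)
b = action_sdr(2, 0, 0)    # almost the same move as `a`
c = action_sdr(-9, 0, 0)   # very different move along x
```

The state side would be encoded analogously (for example, a spatial pooler over the foveal patch), and the stream fed to the TM would simply alternate state and action SDRs.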


When discussing machine visual perception, the concept of ‘zooming out’ becomes relevant, particularly in a visual field where pixels adhere to a fixed pitch.

The human eye possesses high resolution in the fovea, while the size of perceived ‘pixels’ varies as we move towards peripheral vision.

During small local saccades, the visual presentation towards the edges of the field remains relatively stable.

As a result, I have a coarse ‘proto’ image that remains mostly unchanged as I fill in the finer details. This proto image serves as a crucial element for navigation and motion detection.

These saccades appear to follow highly stereotyped patterns based on the gross features of objects, leading to similar sequences being fed to the recognition unit for similar objects.

Think circular, rectangular, linear, faces, …

The subcortical structures that guide the eyes receive projections from the frontal eye fields and have connections to the visual stream. It is reasonable to assume that this pathway primarily picks up gross features and directs the eyes toward key points, leaving object recognition to the cerebral cortex. The outcome of this process is a stream of local features plus saccade lengths, the latter giving relative size.

Notice how this scanning process mostly solves the size invariance problem.
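A toy demonstration of that invariance (the key points and scan order are invented for the example): scaling an object scales every saccade by the same factor, so saccade vectors normalised by the first saccade's length form a scale-free signature of the scan path.

```python
import numpy as np

# Hypothetical key points of an object (e.g. corners of a shape),
# visited in a stereotyped scan order.
keypoints = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])

def normalized_saccades(points):
    """Saccade vectors between successive fixations, divided by the
    first saccade's length - a scale-free signature of the scan path."""
    sacc = np.diff(points, axis=0)
    return sacc / np.linalg.norm(sacc[0])

small = normalized_saccades(keypoints)
large = normalized_saccades(2.5 * keypoints)  # same object, 2.5x bigger
```

The two signatures are identical, while the raw (un-normalised) saccade lengths still carry the relative size information mentioned above.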

Based on my understanding of the literature (unfortunately, I cannot provide a specific reference at the moment), larger saccades seem to involve a mechanism that pre-fills my perception with the contents of the destination, ensuring proper orientation at the end of the motion. This mechanism could be related to the patterns mentioned earlier.

Given these foundational assumptions, you may find a starting point to address the problems you have raised.