Of course, one need not be an Einstein to realize that, in perception at least, movement is relative. Suppose your cup is balanced on a model train-car at a Silly-Club banquet: whether you move or the train moves relative to the room, your purely visual HTM learns from the movement.
Ray Kurzweil may have missed one of the model boats when he wrote that the 'Temporal' in HTM reflects an over-emphasis on movement as a source of sensory evidence revealing the invariant structure of our visual worlds. In How to Create a Mind he points out that in George and Hawkins's implementation of HTM, simulated scanning eye-movements are required to handle static forms like letter shapes. [Why data fire-hoses are NuPIC-friendly].
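The idea of simulated scanning can be made concrete with a minimal sketch, not drawn from NuPIC's actual API: a static letter bitmap becomes a temporal sequence simply by "fixating" a series of points and emitting the foveal patch at each one. The bitmap, scan path, and function names here are all illustrative assumptions.

```python
import numpy as np

# A static 8x8 "letter": on its own it carries no temporal structure.
letter_T = np.zeros((8, 8), dtype=int)
letter_T[0, :] = 1   # top bar of the T
letter_T[:, 3] = 1   # vertical stroke of the T

def saccade_stream(image, fixations, fovea=3):
    """Convert a static image into a temporal sequence of small
    foveal patches by 'fixating' a series of points in turn."""
    half = fovea // 2
    padded = np.pad(image, half)  # so edge fixations stay in bounds
    for r, c in fixations:
        # Emit the fovea-sized patch centered on (r, c).
        yield padded[r:r + fovea, c:c + fovea].flatten()

# A fixed scan path; a real system would pick fixations adaptively.
scanpath = [(0, 1), (0, 3), (0, 5), (3, 3), (6, 3)]
stream = list(saccade_stream(letter_T, scanpath))
print(len(stream))   # one patch per fixation
```

The sequence of patches, not the static bitmap, is what a temporal learner would consume, which is why movement (real or simulated) is doing the work.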
However, we should not let success in the sensory-motor layers fool us into ignoring the wealth of natural constraints that an HTM working from an immobile monocular eye could use. In The Senses Considered as Perceptual Systems, J. J. Gibson explores a host of information sources that give ample evidence that the cup and the train are not mere patterns of color glued to the background. And if our monocular eye is moving, all the visible objects stand out from the background.
Binocular eyes get still more evidence, even without movement. But that raises the issue of perceptual fusion. We humans -- and presumably real intelligences made of silicon or quanta -- don't normally hear sounds or see patterns of light: we perceive objects, events, and spaces, using all our senses along with memory and other constraints. Even meaningless sounds and patterns of light are, by the time we are aware of them, already fused images. Direct awareness of raw, unfused sensory input is difficult or impossible. We can't hear the signals arriving at our two ears as separate sounds; we hear one sound coming from a particular location. If we are listening through earphones, that location may be in the middle of our head!
Perceptual fusion is also the norm in multi-modal perception. The sound of a drum being played by a visible drummer comes in through both our ears -- but the event of the sound being created comes also through our eyes, if we have a functional visual system. If the drum is on a big movie screen and the sound comes from a loudspeaker below the screen, the perceived source of the sound is displaced away from the location of the speaker toward the location of the image on the screen.
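That displacement is well captured by the textbook variance-weighted cue-combination model (not something this essay proposes): each cue is weighted by its reliability, and since vision localizes far more sharply than hearing, the fused estimate lands near the image. The numbers below are hypothetical.

```python
def fuse(loc_v, var_v, loc_a, var_a):
    """Maximum-likelihood fusion of a visual and an auditory
    location estimate: each cue is weighted by the inverse of
    its variance (its reliability)."""
    w_v = (1 / var_v) / (1 / var_v + 1 / var_a)
    return w_v * loc_v + (1 - w_v) * loc_a

# Hypothetical numbers: vision localizes sharply (small variance),
# hearing coarsely (large variance), so the fused estimate lands
# near the on-screen drummer, not the loudspeaker.
speaker_azimuth = 0.0    # degrees: where the sound really originates
drummer_azimuth = 10.0   # degrees: where the image appears
perceived = fuse(drummer_azimuth, 1.0, speaker_azimuth, 25.0)
print(round(perceived, 1))   # -> 9.6, pulled almost all the way to the image
```

Nothing in the model singles out "vision" or "hearing" as such; the pull toward the screen falls out of the reliability weights alone.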
I suspect that a fleshed-out HTM model will reveal perceptual fusion to be a non-problem. Any HTM capable of multi-modal perception will have perceptual fusion dropping into the out basket as an unintended but welcome side effect.