As for matching parts of an image, one possible approach is to have receptive fields of several sizes at different levels or stages of processing. A “bigger” receptive field at a higher level feeds back to the lower level whether or not it saw a change. The “edges” that get feedback of no change at the higher level, but see a change at the lower level, would indicate movement of a relatively larger object.
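To make the feedback idea concrete, here is a rough sketch, not a model of cortex. It assumes block averaging as a stand-in for a receptive field, two levels only, and frame sizes that divide evenly by both field sizes. The names `pool`, `changed`, `fine_only_change` and the threshold value are made up for illustration.

```python
import numpy as np

def pool(frame, size):
    """Average non-overlapping size x size blocks (a crude receptive field)."""
    h, w = frame.shape
    return frame.reshape(h // size, size, w // size, size).mean(axis=(1, 3))

def changed(old, new, size, threshold):
    """True wherever a pooled field differs between the old and new frame."""
    return np.abs(pool(new, size) - pool(old, size)) > threshold

def fine_only_change(old, new, fine=2, coarse=8, threshold=0.05):
    """Flag fine fields that changed while their coarse parent reports no change."""
    fine_change = changed(old, new, fine, threshold)
    coarse_change = changed(old, new, coarse, threshold)
    # "Feedback": copy each coarse result down onto the fine fields it covers.
    factor = coarse // fine
    feedback = coarse_change.repeat(factor, axis=0).repeat(factor, axis=1)
    return fine_change & ~feedback

# A 4x4 blob shifts two pixels to the right but stays inside one coarse field.
old = np.zeros((16, 16))
old[2:6, 0:4] = 1.0
new = np.zeros((16, 16))
new[2:6, 2:6] = 1.0

flags = fine_only_change(old, new)
print(np.argwhere(flags))  # fine fields at the trailing and leading edges
```

In this toy case the coarse field sees the same average before and after, so only the fine fields on the two moving edges of the blob are flagged. If the blob crosses into a neighbouring coarse field, the coarse level reports a change as well and nothing is flagged at this pair of levels.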
An “object” is a filled-in space in the processing map at that level, composed of Calvin Tiles, as I have described many times in this forum. These roughly correspond to “grid cells.”
In the brain the spatial scaling in the various grid fields of the Entorhinal Cortex is about 1:1.14.
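As a quick illustration of what a fixed scale step implies, the sketch below builds a geometric series of grid spacings using the 1:1.14 figure quoted above, and reads out a position as one phase per module. The base spacing, the number of modules, and the function names are assumptions picked for the example, not measured values.

```python
import numpy as np

RATIO = 1.14  # scale step between grid fields, as quoted in this post

def module_spacings(base=1.0, n_modules=5, ratio=RATIO):
    """Grid spacing per module: base, base*ratio, base*ratio**2, ..."""
    return base * ratio ** np.arange(n_modules)

def phases(x, spacings):
    """Position of x inside each module's repeating period, in [0, 1)."""
    return (x / spacings) % 1.0

spacings = module_spacings()
print(spacings)
print(phases(3.7, spacings))
```

Because each module repeats at a slightly different period, the set of phases changes in a different pattern for each position over a fairly long range, which is the general idea behind the encoder thread linked below.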
Please see this page for more details:
Number encoder based off of entoehinal grid cells - #2 by Bitking
The “new” vs. “old” that @cezar_t mentions could be successive cycles of the alpha (about 10 Hz) basic processing rate in cortex. The relation between the fields can be pooled both spatially (giving edges) and temporally (giving movement).
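A toy sketch of the two kinds of comparison, assuming one snapshot per alpha cycle (100 ms) and a 1D row of cells. An edge is a difference between neighbouring cells within one snapshot, and movement is the shift of that edge between the old and new snapshots. The function names and the "first edge only" shortcut are illustrative assumptions.

```python
import numpy as np

ALPHA_HZ = 10.0           # assumed snapshot rate, one frame per alpha cycle
CYCLE_S = 1.0 / ALPHA_HZ  # time between the "old" and "new" frame

def edges(row):
    """Spatial comparison: indices where a cell differs from its right neighbour."""
    return np.flatnonzero(np.diff(row) != 0)

def edge_velocity(old_row, new_row):
    """Temporal comparison: shift of the first edge, in cells per second."""
    old_e, new_e = edges(old_row), edges(new_row)
    if old_e.size == 0 or new_e.size == 0:
        return None  # no edge in one of the frames, nothing to track
    return (new_e[0] - old_e[0]) / CYCLE_S

old_row = np.array([0, 0, 1, 1, 1, 0, 0, 0])
new_row = np.array([0, 0, 0, 1, 1, 1, 0, 0])
velocity = edge_velocity(old_row, new_row)
print(velocity)
```

Here the object moves one cell between cycles, so the estimate works out to about 10 cells per second. A real system would need to match edges between frames rather than just take the first one.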