A naive approach I had thought of in the past is similar to the quad-tree demonstrated above. Imagine all the ‘pixels’ of the retina feeding into the cortex as a 16x16 layer. Parallel to that layer, the same retina feed goes into a layer of 8x8 pixels. Each pixel in the 8x8 layer is on or off depending on how many pixels are on within its ‘receptive field’ in the 16x16 layer below it: if over half the children are on, then the parent is on.
Putting it together into a 2D hierarchy (or quad-tree), you have a representation of the image at various levels.
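To make the rule concrete, here is a minimal sketch in plain Python. It assumes each parent's receptive field is a non-overlapping 2x2 block of children (my assumption, the sizes 16x16 and 8x8 only imply it), and `downsample` and `build_pyramid` are names I made up for illustration.

```python
def downsample(layer):
    """Halve a square on/off grid.

    Assumption: each parent sees a non-overlapping 2x2 block of children
    and is on only if over half of them (3 or 4) are on.
    """
    n = len(layer) // 2
    parent = []
    for r in range(n):
        row = []
        for c in range(n):
            on_children = (layer[2 * r][2 * c] + layer[2 * r][2 * c + 1]
                           + layer[2 * r + 1][2 * c] + layer[2 * r + 1][2 * c + 1])
            row.append(1 if on_children > 2 else 0)
        parent.append(row)
    return parent


def build_pyramid(retina):
    """Return the layers from finest (the retina) to coarsest (1x1)."""
    layers = [retina]
    while len(layers[-1]) > 1:
        layers.append(downsample(layers[-1]))
    return layers


# Example: a 16x16 retina with a filled 8x8 square in the top-left corner.
retina = [[1 if r < 8 and c < 8 else 0 for c in range(16)] for r in range(16)]
layers = build_pyramid(retina)
sizes = [len(layer) for layer in layers]
print(sizes)      # [16, 8, 4, 2, 1]
print(layers[3])  # [[1, 0], [0, 0]]
```

One thing the sketch makes obvious: with a strict "over half" rule, thin features such as single-pixel edges disappear after one or two levels, so the threshold is probably something you would want to tune.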
(the below drawing is not exact, but good enough for illustration)
The purpose of this is to control the movement of saccades top-down through the hierarchy/tree. In the 2x2 layer there are two on pixels, representing two areas of interest. Down at 4x4 the form becomes clearer, but it still mainly serves to focus attention on the relevant objects/corners/edges. The control then feeds further down until you get to an exact edge or corner in 32x32. A jump (say, from corner to corner) can easily be done by feeding the target ‘features’ down from 4x4 to the target corners in 16x16.
As you can see above, the movement from one point to another gets smaller as you go up the hierarchy. The saccades still occur on the 16x16 layer, but the control works on all levels in a coordinated fashion.
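The top-down control could be sketched roughly like this. It is only an illustration under my own assumptions: the layers are stored finest-first, each cell has 2x2 children, and the hypothetical `descend` function simply picks the first on child at each step. It lands on an on pixel, not necessarily on an edge or corner, since that would need actual feature detection.

```python
def descend(layers, level, row, col):
    """Follow an on cell from `level` down to the finest layer (level 0).

    Returns the list of (level, row, col) steps taken. Assumes layers[0]
    is the finest grid and each cell covers a 2x2 block in the layer below.
    """
    path = [(level, row, col)]
    while level > 0:
        level -= 1
        child_layer = layers[level]
        candidates = [(2 * row + dr, 2 * col + dc)
                      for dr in (0, 1) for dc in (0, 1)]
        on = [(r, c) for r, c in candidates if child_layer[r][c]]
        if not on:
            break  # nothing to attend to below this cell
        row, col = on[0]  # naive choice: first on child in reading order
        path.append((level, row, col))
    return path


# Hand-written toy layers (not derived from any rule), finest first.
toy_layers = [
    [[0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 0]],   # 4x4
    [[0, 0],
     [0, 1]],         # 2x2
    [[1]],            # 1x1
]
path = descend(toy_layers, 2, 0, 0)
print(path)  # [(2, 0, 0), (1, 1, 1), (0, 2, 2)]
```

A one-cell move at the 2x2 level corresponds to a two-cell move at 4x4, and so on, which is the "smaller movement higher up" idea: the coarse level picks the region, and the finer levels refine the actual saccade target.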
This could help with scale invariance too. If features were detected at each level, then a feature seen close-up or far away would be captured as the same feature, just at a different level.
But of course, pure theory.