Assuming that we have the stable representation forming in the association area, each node in the grid is following its own sequence as the saccades progress.
Consider one of the most basic and highly tuned human skills: face recognition. The saccade scan pattern is highly stereotyped for any given individual.
First we get the corner of an eye, and lots of mini-columns try to match the static pattern through their proximal inputs. Many apical dendrites will also try to predict something.
After the first saccade brings in the next feature, the sequential predictions will not match as many features, and with lateral voting many possibilities will be eliminated. Each saccade will strengthen the confirming sequences and eliminate the sequences that don’t match.
What you should end up with is a stable hex grid that means John and not Mary.
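To make the elimination step concrete, here is a toy sketch of the idea, not a model of real cortex. The names, the feature labels, and the `recognize` helper are all made up for illustration, and strict set filtering stands in for what would really be graded lateral voting.

```python
# Toy illustration only: names, features, and sequences are invented.
learned_sequences = {
    "John": ["eye_corner", "nose_bridge", "thick_brow", "square_chin"],
    "Mary": ["eye_corner", "nose_bridge", "thin_brow", "round_chin"],
    "Alex": ["eye_corner", "wide_nose", "thin_brow", "round_chin"],
}


def recognize(observed_features, sequences):
    """Return the list of surviving candidates after each saccade."""
    candidates = set(sequences)
    history = []
    for step, feature in enumerate(observed_features):
        # Keep only candidates whose stored sequence predicted this
        # feature at this position in the scan.
        candidates = {
            name
            for name in candidates
            if step < len(sequences[name]) and sequences[name][step] == feature
        }
        history.append(sorted(candidates))
    return history


history = recognize(["eye_corner", "nose_bridge", "thick_brow"], learned_sequences)
for step, survivors in enumerate(history):
    print(step, survivors)
```

The first feature is shared by everyone, so nothing is eliminated; each later saccade narrows the field until only one candidate is left.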
Similar objects will have similar hex-grid patterns and different objects will have very different hex-grid patterns. The patterns that are similar should have the same basic grid shape with the bits around the edges of the pattern differing to set the object apart.
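A rough way to picture "same basic shape, different edges" is to treat each pattern as a set of active grid nodes and measure how much two sets overlap. The coordinates below are invented, and using Jaccard overlap as the similarity measure is my own assumption for the sketch, not something the theory specifies.

```python
# Toy illustration: each pattern is a set of active grid-node coordinates.
# All coordinates are invented for the example.
core = {(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)}  # shared "face" shape
john = core | {(3, 0), (2, 2)}                   # edge bits that mean John
mary = core | {(3, 1), (0, 2)}                   # edge bits that mean Mary
teapot = {(5, 5), (6, 5), (5, 6), (7, 7), (6, 8), (0, 0), (8, 8)}


def overlap(a, b):
    """Jaccard overlap: shared nodes divided by all nodes in either pattern."""
    return len(a & b) / len(a | b)


print("john vs mary:  ", round(overlap(john, mary), 3))
print("john vs teapot:", round(overlap(john, teapot), 3))
print("bits that set John apart from Mary:", sorted(john - mary))
```

The two faces share most of their nodes and differ only in a few edge bits, while the unrelated object barely overlaps at all.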
I think that things like number and letter recognition involve multiple maps working together, and the explanation gets too long to go through in a post like this - but the basic principle is the same. Consider that speech in humans involves several maps in both Broca’s and Wernicke’s areas, over and above simple object recognition.
Going the other way - towards the senses - as Jeff Hawkins has mentioned before, there is an increase in connections as you go from V1 forward. What I think is going on is processing to split out and enhance features and present a rich cocktail of features to select from (edges and such), so that the hex-grids have the most contrast to use to form a pattern.
The feedback connections from the hex-grid act as a filter to focus this feature extraction; perception is active recall of learned features.