24x24 is a bit extreme. I imagine you got that by cropping out the 2-pixel white border of the MNIST images.
I’ll stick to grays (no colors).
Before continuing: my plan is to experiment with fixed 128-long vectors.
Any encoder (aka scope) produces a 128-long dense vector, which can be fed to an attached agent
OR
the dense vectors from a few scopes are added up by a… correlator scope.
I touched on the rationale for simply adding a few dense embeddings, and for the term "correlator", in the PS here
In order to allow potentially many agents to have different perspectives about the same image I assume a useful approach is to use patches,
and to keep things simple I’d start with squares (aka windows of focus) of various sizes.
And I assume the scope embedding should start encoding both where and what.
- where: the patch's x, y coordinates and its size, encoded as three 128-long scalar embeddings, added together and normalized to produce a 128-long "where" vector. Here's some arguing for that
- what: a 128-long vector obtained, somewhat as here, from whatever the small window contains.
The output of this particular kind of scope is obtained by adding the above where and what vectors.
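A minimal sketch of such a scope, to make the where + what addition concrete. The two encoders are placeholders, not the methods from the linked posts: `scalar_embedding` (sinusoids at fixed random frequencies) and `what_vector` (a fixed random projection of the patch pixels) are hypothetical stand-ins, and so are all the function names.

```python
import numpy as np

DIM = 128  # fixed embedding length used throughout


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def scalar_embedding(value, max_value, seed, dim=DIM):
    """ASSUMPTION: stand-in scalar encoder, sinusoids with fixed random
    frequencies and phases. The seed picks a separate code per quantity."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.5, 8.0, size=dim)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=dim)
    return np.sin(freqs * (value / max_value) + phases)


def where_vector(x, y, size, image_side=24):
    """Add the three scalar embeddings (x, y, size), then normalize."""
    summed = (scalar_embedding(x, image_side, seed=1)
              + scalar_embedding(y, image_side, seed=2)
              + scalar_embedding(size, image_side, seed=3))
    return normalize(summed)


def what_vector(patch, seed=4, dim=DIM):
    """ASSUMPTION: stand-in content encoder, a fixed random projection of
    the flattened gray pixels. One projection matrix per patch size."""
    flat = patch.astype(np.float64).ravel() / 255.0
    rng = np.random.default_rng(seed * 10_000 + flat.size)
    projection = rng.standard_normal((dim, flat.size))
    return normalize(projection @ flat)


def scope_output(image, x, y, size):
    """Scope output = where + what for one square window of focus."""
    patch = image[y:y + size, x:x + size]
    return where_vector(x, y, size, image.shape[0]) + what_vector(patch)


# Random gray image standing in for a 24x24 MNIST crop.
image = np.random.default_rng(0).integers(0, 256, size=(24, 24))
vec = scope_output(image, x=4, y=8, size=6)
print(vec.shape)
```

The same window always yields the same vector, while moving or resizing the window changes the where part even if the pixels happen to match.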
This 128-long dense output can be used alone by its containing agent's learning/estimator ML model(s) OR can be used by a compounding scope that adds it to outputs from:
- another scope looking at a different patch in the same image
- or a patch from recent past to emphasize motion
- or a scope looking at an entirely different source, e.g. a sound, pressure or the content of the pocket.
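The compounding (correlator) scope can be sketched as a plain element-wise sum. The input vectors below are random stand-ins for real scope outputs, and `correlate` is a hypothetical name. Whether the sum should be re-normalized afterwards is left open here, so the sketch does not do it.

```python
import numpy as np

DIM = 128  # every scope emits a vector of this length


def correlate(*scope_outputs):
    """Correlator scope: add up the dense outputs of a few other scopes."""
    stacked = np.stack(scope_outputs)
    if stacked.shape[1] != DIM:
        raise ValueError(f"expected {DIM}-long vectors, got {stacked.shape[1]}")
    return stacked.sum(axis=0)


# ASSUMPTION: random vectors stand in for the outputs of real scopes.
rng = np.random.default_rng(0)
patch_now = rng.standard_normal(DIM)    # this scope's own patch
patch_other = rng.standard_normal(DIM)  # a different patch, same image
patch_past = rng.standard_normal(DIM)   # same patch, recent past
sound = rng.standard_normal(DIM)        # an entirely different source

combined = correlate(patch_now, patch_other, patch_past, sound)
print(combined.shape)
```

Because the output has the same length as each input, a correlator can feed an agent or another correlator exactly like a plain scope does.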
Regarding your question on overlapping.
At first glance, at least, it makes little sense:
- to have two different scopes looking at the same patch. A potential exception would be if they use entirely different algorithms. Maybe a small CNN trained to output a 128-long embedding on low-resolution images could be more useful. Or several CNNs trained with different criteria, different hyperparameters or different datasets.
- to have two agents fed from the same scope. A possible exception would be that they're needed in different contexts, but I think such a case is pretty far down the line.
It should make sense to have many different correlative scopes that all share the feed from one wide-view scope (e.g. the 24x24 full image), each combining it independently with a different smaller patch at a different position.
This is pretty similar to the attention mechanism in vision transformers, which computes the value of the correlation between two patches. Unlike there, what I described here places no restriction on patch size and position.
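A rough sketch of that layout: one wide-view vector is computed once and reused by several correlative scopes, each adding its own smaller window. `fake_scope` is a hypothetical stand-in that just returns a deterministic pseudo-random vector per window, and the window list is arbitrary.

```python
import numpy as np

DIM = 128


def fake_scope(x, y, size):
    """ASSUMPTION: stand-in for a real scope. Returns a deterministic
    pseudo-random 128-long vector for each (x, y, size) window."""
    seed = x * 10_000 + y * 100 + size
    return np.random.default_rng(seed).standard_normal(DIM)


# One wide-view scope over the full 24x24 image, computed once and shared.
wide = fake_scope(0, 0, 24)

# Arbitrary smaller windows (x, y, size); sizes and positions are free.
windows = [(0, 0, 8), (8, 8, 8), (16, 4, 6), (2, 14, 10)]

# Each correlative scope independently adds the shared wide view to its patch.
correlators = {w: wide + fake_scope(*w) for w in windows}

for window, vec in correlators.items():
    print(window, vec.shape)
```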