Yes, it’s part of it - the article I linked shows a simple CNN outperforming classical ML methods and an MLP NN.
However, I picked MNIST only as an example, not intending to limit this method to visual recognition problems.
One way to make it more spatially aware (for the MNIST case at least) would be to use small patches instead of random pixels.
K-means (or another clustering algorithm) can easily “segment” an MNIST digit into small adjacent patches, e.g. 10 patches per image.
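For illustration, here is a minimal sketch of that segmentation step, assuming a 28x28 MNIST image and scikit-learn’s KMeans; the patch count of 10 is just the example figure above:

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_into_patches(image, n_patches=10, threshold=50):
    # (row, col) coordinates of "ink" pixels above the intensity threshold
    coords = np.argwhere(image > threshold)
    # Clustering the coordinates gives spatially compact groups of pixels,
    # i.e. small adjacent patches of the digit.
    km = KMeans(n_clusters=n_patches, n_init=10, random_state=0).fit(coords)
    patches = []
    for k in range(n_patches):
        pts = coords[km.labels_ == k]
        patches.append({"pixels": pts, "centroid": pts.mean(axis=0)})
    return patches
```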
If each patch’s position and shape could be encoded with a space-aware algorithm (like this one), then we could train the sequential model on a series of SDRs, one per patch, all belonging to the same image.
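The linked encoder would be the proper choice here; the crude stand-in below only illustrates the kind of binary vector (a position part plus a shape part) one could feed to the sequence model:

```python
import numpy as np

def encode_patch(patch, grid=7, mask_size=8, image_size=28):
    # Position part: one-hot over a grid x grid partition of the image,
    # selected by the patch centroid.
    pos = np.zeros(grid * grid, dtype=np.uint8)
    r, c = np.clip((patch["centroid"] / image_size * grid).astype(int), 0, grid - 1)
    pos[r * grid + c] = 1
    # Shape part: the patch's pixels rasterised into a small binary mask,
    # normalised to the patch's own bounding box.
    pts = patch["pixels"]
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    span = np.maximum(maxs - mins, 1)
    scaled = ((pts - mins) / span * (mask_size - 1)).astype(int)
    mask = np.zeros((mask_size, mask_size), dtype=np.uint8)
    mask[scaled[:, 0], scaled[:, 1]] = 1
    return np.concatenate([pos, mask.ravel()])  # the patch "SDR"
```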
In general any “thing” can be defined by the set of its “parts”.
I just assume here that the thing can be described simply as a long “random walk” between its parts.
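A minimal sketch of that random-walk idea, reusing the hypothetical helpers above; the next part is chosen uniformly at random, which is the simplest possible walk:

```python
import random

def random_walk_sequence(image, walk_length=30, n_patches=10):
    # One training/query sequence: the SDRs of the parts visited by a random walk.
    patches = segment_into_patches(image, n_patches=n_patches)
    current = random.randrange(len(patches))
    sequence = []
    for _ in range(walk_length):
        sequence.append(encode_patch(patches[current]))
        current = random.randrange(len(patches))  # hop to a random next part
    return sequence
```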
PS (a few more thoughts)
- there are maybe cleverer ways to navigate the “thing” than randomly walking between its parts. I don’t know - maybe our saccades have a few tricks for picking the next movement of the eye/fovea.
- one quality of the algorithm above is that it can be trivially parallelised: different threads can run their own walks on the same query image using the same sequence predictor (see the sketch after this list).
- there might be a general principle here: living intelligences convert any seemingly static problem into a dynamic, sequential one. Maybe we always navigate our inner representations of the world; both the fact that our awareness can’t really stop on a single thing/thought and the generality of transformers suggest that “sequentiation” is a powerful strategy.
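A rough sketch of the parallel-walk point from the list above; `predictor.classify(seq)` is a placeholder for whatever interface the trained sequence predictor actually exposes:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def classify_by_parallel_walks(image, predictor, n_walks=8, walk_length=30):
    # Each worker runs its own independent walk over the same query image and
    # asks the shared, already-trained predictor for a label; the per-walk
    # labels are combined by majority vote.
    def one_walk(_):
        seq = random_walk_sequence(image, walk_length=walk_length)
        return predictor.classify(seq)  # hypothetical predictor API
    with ThreadPoolExecutor(max_workers=n_walks) as pool:
        votes = list(pool.map(one_walk, range(n_walks)))
    return Counter(votes).most_common(1)[0][0]
```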