I don’t think “flash inference” correctly characterizes GLOM. It is explicitly temporal and takes on a dynamical systems flavor, which can be seen, for instance, in Hinton’s reference to 2D Ising models. Imagine a random array of 2D vectors at rest. As new input arrives, those vectors “spin” and update their neighbors. If the input is constant (like a static image), then the system will settle on a stable configuration. If the input changes (like in video), then the system will be perturbed and continually fall toward a new attractor.
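To make the settling dynamics concrete, here is a minimal sketch (not GLOM's actual update rule, which also involves bottom-up, top-down, and attention terms): a grid of vectors is repeatedly pulled toward the average of its spatial neighbors while being anchored to an input field. With a constant input, the grid relaxes to a stable, uniform configuration, just as described above. The function name, weights, and step count are all illustrative choices.

```python
import numpy as np

def settle(vectors, inputs, steps=50, lr=0.5):
    """Relax a grid of unit vectors toward agreement with their four
    spatial neighbours while staying anchored to a fixed input field.
    `vectors` and `inputs` both have shape (H, W, D)."""
    v = vectors.copy()
    for _ in range(steps):
        # average of the four neighbours (periodic edges via roll)
        nbr = (np.roll(v, 1, 0) + np.roll(v, -1, 0)
               + np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        target = 0.5 * nbr + 0.5 * inputs   # consensus + bottom-up evidence
        v = (1 - lr) * v + lr * target
        v /= np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8
    return v

rng = np.random.default_rng(0)
H, W, D = 8, 8, 4
start = rng.normal(size=(H, W, D))                     # random "spins" at rest
inp = np.tile(rng.normal(size=(1, 1, D)), (H, W, 1))   # constant "image"
settled = settle(start, inp)
# with constant input, every location converges to nearly the same vector
spread = np.ptp(settled.reshape(-1, D), axis=0).max()
```

If `inp` were changed partway through, the grid would be knocked off this fixed point and relax toward a new one, which is the video case described above.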
How will it represent changes like pose? Higher-level representations will carry over from previous time steps and flow downward to condition lower-level representations at the next time step. From page 32:
An advantage of using islands of identical vectors to represent an object is
that motions between successive frames that are small compared to the size of
the object only require large changes to a small subset of the locations at the
object level. All of the locations that remain within the object need to change
only slightly to represent the slight change in pose of the object relative to the camera. …
If the changes in an image are small and predictable, the time-steps immediately
following a change of fixation point can be used to allow the embeddings
at all levels to settle on slowly changing islands of agreement that track the
changes in the dynamic image. The lowest level embeddings may change quite
rapidly but they should receive good top-down predictions from the more stable
embeddings at the level above…
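The mechanism in that quote, slow high-level embeddings conditioning fast low-level ones, can be sketched with a toy two-level loop (my own simplification, with made-up blend weights, not the paper's update equations). The lower level mixes noisy bottom-up frames with a top-down prediction; the upper level is a slow average of the lower one. The top-down term makes the lower level track the underlying scene better than the raw frames do:

```python
import numpy as np

rng = np.random.default_rng(1)
D, T = 8, 200
signal = rng.normal(size=D)          # the slowly changing "scene"

low = np.zeros(D)                    # lower-level embedding
high = np.zeros(D)                   # higher-level embedding
raw_err, low_err = [], []
for t in range(T):
    frame = signal + 0.5 * rng.normal(size=D)   # noisy bottom-up input
    # lower level: blend bottom-up evidence with the top-down prediction
    low = 0.5 * frame + 0.5 * high
    # higher level: slow exponential average -> a stable island
    high = 0.9 * high + 0.1 * low
    raw_err.append(np.mean((frame - signal) ** 2))
    low_err.append(np.mean((low - signal) ** 2))

# compare tracking error after the system has settled
raw_mse = float(np.mean(raw_err[50:]))
low_mse = float(np.mean(low_err[50:]))
```

The comparison of `low_mse` against `raw_mse` is the point: "good top-down predictions from the more stable embeddings at the level above" stabilize the rapidly changing lowest level.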
GLOM also solves a problem that Yannic Kilcher misses and Hinton himself somewhat glosses over: it implements image parsing without dynamic node allocation and without a fixed set of node types. Unlike traditional image-parsing methods, GLOM bounds the number of possible parses only by the dimensionality of the embedding vectors, which yields much higher capacity.
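The parse here is implicit in the "islands of identical vectors": a node of the parse tree is just a connected set of locations sharing (nearly) the same object-level vector, so no nodes are ever allocated. A rough sketch of reading such a parse back out, by greedily grouping locations whose vectors agree (the threshold `tau` and the grouping scheme are my illustrative choices, not anything from the paper):

```python
import numpy as np

def islands(obj_vectors, tau=0.95):
    """Group grid locations into islands of near-identical object-level
    vectors; each island plays the role of one parse-tree node.
    Greedy cosine-similarity grouping against running prototypes."""
    H, W, D = obj_vectors.shape
    flat = obj_vectors.reshape(-1, D)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    labels = -np.ones(H * W, dtype=int)
    protos = []
    for i, v in enumerate(flat):
        for k, p in enumerate(protos):
            if v @ p > tau:        # close enough to an existing island
                labels[i] = k
                break
        else:                      # start a new island
            labels[i] = len(protos)
            protos.append(v)
    return labels.reshape(H, W)

# two objects -> two islands, with no explicit nodes allocated anywhere
grid = np.zeros((4, 6, 3))
grid[:, :3] = [1.0, 0.0, 0.0]      # object A fills the left half
grid[:, 3:] = [0.0, 1.0, 0.0]      # object B fills the right half
parse = islands(grid)
n_islands = len(np.unique(parse))
```

Since the vectors live in a D-dimensional space, the number of distinguishable islands, and hence parses, grows with the embedding dimensionality rather than with any preallocated node budget.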
Why is parsing important? Hinton answers “interpretability”, but parsing and compositionality matter even more from a theory-of-computation standpoint. Intelligent sensory systems must be more powerful than the family of regular languages and finite automata, which exhibit the easy-to-fool pattern matching of CNNs, frogs, and ducklings. If GLOM works, then Hinton has moved computer vision up at least one tier of the Chomsky hierarchy.
lucidrains on GitHub has a (probably) reliable implementation of GLOM, though without any experimental results attached. I can’t vouch for it, but he is well regarded for his transformer implementations, and GLOM is itself a kind of transformer.