Paper Review - GLOM: How to Represent Part-Whole Hierarchies in a Neural Network by Geoffrey Hinton

Through the lens of Numenta’s Thousand Brains Theory, Marcus Lewis reviews the paper “How to represent part-whole hierarchies in a neural network” by Geoffrey Hinton. Focusing on parts of the GLOM model presented in the paper, he bridges Numenta’s theory to GLOM and highlights the similarities and differences between the models’ voting mechanisms, structure, and use of neural representations. Finally, Marcus explores how GLOM might handle movement.

Paper: [2102.12627] How to represent part-whole hierarchies in a neural network

Happy to see y’all discussing this paper. Always glad to hear neuroscientifically-informed reads on conceptual papers like these.

I think there was a bit of misunderstanding that might have gone unclarified in the context of the discussion.

Take one of the “columns” in the proposed GLOM. A few times in the chat it was said that this column is associated with some fixed pixel of an image, which was rightly pointed out as implausible. Instead, I take Hinton to mean that the column is associated with (for instance) a particular location on the retina. It can then represent the pose and identity of objects identified at that location in a reference frame relative to that sensor location (a location which itself may transform with movement of the body).

Similar to how you get retinotopic and other topographic maps in the nervous system, I think he has in mind a similar concept in how he proposes to build representations. The spatial relation between regions on the sensor may then be preserved in representations by way of the locality of voting (i.e. nearby locations on the hand, being locally connected to one another, can vote and converge on islands of similar models of held objects).


I don’t think “flash inference” correctly characterizes GLOM. It is explicitly temporal and takes on a dynamical systems flavor, which can be seen, for instance, in Hinton’s reference to 2D Ising models. Imagine a random array of 2D vectors at rest. As new input arrives, those vectors “spin” and update their neighbors. If the input is constant (like a static image), then the system will settle on a stable configuration. If the input changes (like in video), then the system will be perturbed and continually fall toward a new attractor.
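The settling dynamics can be caricatured with a toy script. To be clear, this is my own construction for illustration, not the paper’s actual update rule: a grid of 2D unit vectors (represented as angles), each repeatedly replaced by the circular mean of its local neighborhood, drifts from a random configuration toward islands of agreement, much like the 2D spin systems Hinton alludes to.

```python
# Toy sketch (my own illustration, not Hinton's update equations): a torus of
# 2D unit vectors that each relax toward the circular mean of their neighbors.
import math
import random

def settle(grid, steps=50):
    """Repeatedly replace each angle with the circular mean of its
    4-neighborhood (plus itself), with periodic boundaries."""
    n = len(grid)
    for _ in range(steps):
        new = [row[:] for row in grid]
        for i in range(n):
            for j in range(n):
                sx = sy = 0.0
                for di, dj in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                    a = grid[(i + di) % n][(j + dj) % n]
                    sx += math.cos(a)
                    sy += math.sin(a)
                new[i][j] = math.atan2(sy, sx)  # circular mean of the patch
        grid = new
    return grid

def agreement(grid):
    """Mean cosine similarity between horizontally adjacent vectors."""
    n = len(grid)
    sims = [math.cos(grid[i][j] - grid[i][(j + 1) % n])
            for i in range(n) for j in range(n)]
    return sum(sims) / len(sims)

random.seed(0)
n = 8
grid = [[random.uniform(-math.pi, math.pi) for _ in range(n)] for _ in range(n)]
before = agreement(grid)
after = agreement(settle(grid))
# Local voting drives the random field toward islands of agreement.
assert after > before
```

With a static “input” (here, none at all) the field settles toward a stable configuration; in the GLOM picture, a changing input would keep perturbing the field so that it continually falls toward new attractors.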

How will it represent changes like pose? Higher-level representations will carry over from previous time steps and flow downward to condition lower-level representations at the next time step. From page 32:

An advantage of using islands of identical vectors to represent an object is
that motions between successive frames that are small compared to the size of
the object only require large changes to a small subset of the locations at the
object level. All of the locations that remain within the object need to change
only slightly to represent the slight change in pose of the object relative to the…

If the changes in an image are small and predictable, the time-steps immediately
following a change of fixation point can be used to allow the embeddings
at all levels to settle on slowly changing islands of agreement that track the
changes in the dynamic image. The lowest level embeddings may change quite
rapidly but they should receive good top-down predictions from the more stable
embeddings at the level above…

GLOM also solves a problem that Yannic Kilcher misses and Hinton himself somewhat glosses over: it implements image parsing without dynamic node allocation or a fixed set of node types. Unlike traditional image-parsing methods, in GLOM the number of possible parses is bounded only by the dimensionality of the embedding vectors, yielding much higher capacity.
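To make that capacity point concrete, here is a toy sketch. Everything in it (`count_islands`, the similarity threshold, and the example vectors) is invented for illustration and is not from the paper: if each column publishes an object-level embedding, the parse can be read off by grouping near-identical vectors into islands, with no node types and no allocation step.

```python
# Toy sketch (my own framing, not the paper's): count the distinct "islands"
# in a row of columns by grouping near-identical embedding vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def count_islands(columns, threshold=0.95):
    """Greedily assign each column's vector to an existing island if it is
    nearly identical to that island's exemplar; otherwise open a new island."""
    exemplars = []
    for vec in columns:
        if not any(cosine(vec, e) >= threshold for e in exemplars):
            exemplars.append(vec)
    return len(exemplars)

# Two objects in view: columns 0-2 agree on one vector, columns 3-4 on another.
cup = [1.0, 0.0, 0.1]
pen = [0.0, 1.0, 0.0]
columns = [cup, cup, cup, pen, pen]
assert count_islands(columns) == 2
```

Because islands are just regions of agreement in embedding space, the number of distinguishable parse nodes is limited by how many well-separated directions the embedding space can hold, which grows with its dimensionality, rather than by a fixed inventory of node types.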

Why is parsing important? Hinton’s answer is “interpretability”, but parsing and compositionality matter even more from a theory-of-computation standpoint. Intelligent sensory systems must be more powerful than the family of regular languages and finite automata, whose easy-to-fool pattern matching we see in CNNs, frogs, and ducklings. If GLOM works, then Hinton has moved computer vision up at least one tier of the Chomsky hierarchy.

lucidrains on GitHub has a (probably) reliable implementation of GLOM, without any experimental results attached. I can’t vouch for it, but he is well regarded for his transformer implementations, and GLOM is a kind of transformer.


Agreed. I never (consciously) implied that GLOM was focused only on flash inference. I mentioned that “voting on object-at-pose” was an important addition to our model in part because it enables flash inference, but I think that’s the only statement I made on this topic.


Yep, you never implied that, but one of your interlocutors did. Thanks for the presentation and discussion.