Is Jeff's sensorimotor multiple-view theory like Hinton's capsule intuition?



I just learned that Hinton has a capsule theory in which he thinks neurons could be organized into capsules that see the same data from different points of view and vote on features that they agree exist.

It seems to me that many of the intuitions behind that way of thinking are already understood at a higher resolution through HTM theory. SDRs provide the mathematical mechanism for voting on which features exist, and in the latest sensorimotor discussion videos Jeff outlines the idea that every region of the neocortex essentially tries to build a whole model of the world from everything it has ever seen, not just a detailed piece of the world puzzle.
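To make the voting idea a bit more concrete, here is a minimal sketch, assuming an SDR is represented as a set of active bit indices and that a "match" means the overlap reaches a threshold. The names `random_sdr`, `overlap`, and `matches`, and the sizes used, are illustrative assumptions and not part of any Numenta API.

```python
import random

def random_sdr(size, active, rng):
    """Return a random SDR as a frozenset of active bit indices."""
    return frozenset(rng.sample(range(size), active))

def overlap(a, b):
    """Each shared active bit counts as one 'vote' that the patterns agree."""
    return len(a & b)

def matches(a, b, threshold):
    """Two SDRs match when enough bits vote for the same pattern."""
    return overlap(a, b) >= threshold

rng = random.Random(0)
SIZE, ACTIVE, THRESHOLD = 2048, 40, 20  # illustrative values only

stored = random_sdr(SIZE, ACTIVE, rng)

# Build a noisy copy: keep 30 of the 40 active bits and add 10 wrong ones.
kept = set(rng.sample(sorted(stored), 30))
inactive = [i for i in range(SIZE) if i not in stored]
noisy = frozenset(kept | set(rng.sample(inactive, 10)))

unrelated = random_sdr(SIZE, ACTIVE, rng)

print(overlap(stored, noisy), matches(stored, noisy, THRESHOLD))
print(overlap(stored, unrelated), matches(stored, unrelated, THRESHOLD))
```

The noisy copy still matches because most of its bits agree with the stored pattern, while a random sparse pattern is very unlikely to share enough bits to cross the threshold.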

What does the community think? All of this is based on about an hour of research on my part, so I may not know what I’m talking about at all.

Geoffrey Hinton 's "Capsule" Neural Net is SIMILAR to HTM?
Hinton's capsule and what is wrong with convolutional neural nets
Dynamic Routing Between Capsules (new paper from Hinton)
Capsule Network

Agreed. I said a similar thing a few years back when he gave a preliminary talk about it, but sensorimotor theory provides a good new perspective on the voting bit, in addition to the parallels between capsule encoded “attributes” and context encoded by the temporal memory.


I think Hinton is going in the right direction, but I also think he’s a long way behind HTM. Some of the things he says are incorrect (like that visual perception is a “routing problem” when an entity’s perspective changes). I think he misunderstands SDRs as some type of multidimensional array (which indeed might have some similar matching properties). But he says nothing about neuronal receptivity zones, which are core to HTM theory. And his association of “capsules” with mini-columns seems mistaken. Take this screenshot, for example:

Each of those long ellipses is supposed to represent a capsule, and each capsule is detecting whether a feature exists (nose, mouth, face). But each capsule is also supposed to be a mini-column. The addition of a hierarchy confuses the matter further. This seems like it’s reaching in the same direction as HTM, but I think HTM explains the workings of mini-columns much more realistically.

I also think Mr. Hinton has conflated some micro and macro concepts. In some spots he seems to be talking about mini-columns and spatial feature recognition, but then he jumps to explaining object orientation in space using the same terms, which makes this confusing.

He is framing this as a computer problem: how can we make computers intelligent? He’s using neuroscience to inform his research, so logically he’d come across the same structures that underlie HTM theory (mini-columns, layers, columns).

We have always framed our research as an intelligence problem. At the core, the question is “why are we intelligent?” In order to answer this question, we’ve chosen to research the neocortex and build intelligent machines to prove out the theories.

I think this perspective difference is a big reason why there are disconnects within the ML and HTM communities.


The implementation of the capsule solution in the video isn’t elegant and is far from a cortical approach, but I think the idea of the capsule itself is super interesting:


@spin I moved your post here because there was already a discussion on the topic and I wanted to also include your new video.


Yes, I found it quite interesting as well. There are some very loose analogies to our Layers and Columns paper. The basic idea that objects need to be associated with the relative locations of their features is a common thread between the two. We are also trying to deal with sensors that move independently and we explicitly model how information is integrated over time and space. There are a lot of other differences. There is no intent to be biologically plausible, so obviously details vary quite significantly.

Looks like they still need to figure out how to do the “routing” efficiently and scale their system. There is going to be an updated NIPS paper in a couple of months, so I look forward to seeing that:


Adding on to Subutai’s response, there are similarities:

  • Hinton’s capsules act as coincidence detectors similar to HTM minicolumns
  • Hinton’s routing problem is present in our work as well, possibly in the lateral connections between cortical columns, possibly in convergence up the hierarchy

And there are differences:

  • Hinton’s solution requires back prop and doesn’t integrate information over time. This reflects a general difference between the DL approach and HTM
  • Hinton is thinking about vision, which doesn’t require him to handle the cases where sensors move independently. If you have fingers moving independently and feeling the same object, the problem of knowing the relative positions of features being sensed is a little more difficult


He recently said that back prop is the wrong way to go, so its use here should be considered only a temporary solution.

Do you see it as a significant difference?
From my perspective, the only differences are the number of sensors (possibly more than two) and the synchronization of the eyes’ movements: fingers don’t move independently, they are just not synchronized.
BTW, could one-step movements of, say, three fingers be regarded as three simultaneous saccades?


That’s a good point and I don’t know if the difference with sensors that move non-synchronously is fundamental - i.e. whether there are solutions to the synchronous-only problem that won’t solve the non-synchronous version. But it’s useful to keep in mind moving sensors as the more general formulation of the problem.

There is a fundamental difference between batch learning / back prop and online / temporal learning though. Hopefully Hinton and others will continue down this path.


There is a fundamental difference between batch learning / back prop and online / temporal learning

You don’t have to use batches in DL (an ANN can learn from one example at a time), backpropagation can be done through time, and numerous RNN-type models demonstrate temporal learning (e.g. LSTM).
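As a small illustration of the first point, here is a sketch of gradient descent that updates after every single example rather than after a batch. It assumes a one-weight linear model with squared error and a noise-free toy stream; `sgd_step` and the data are hypothetical and only meant to show the update pattern, not any particular DL library.

```python
def sgd_step(w, b, x, y, lr):
    """One online update of the model y_hat = w * x + b on a single example."""
    err = (w * x + b) - y           # gradient of 0.5 * err**2 w.r.t. y_hat
    return w - lr * err * x, b - lr * err

# Toy stream generated from y = 2x + 1 (an assumption for the demo).
stream = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

w, b = 0.0, 0.0
for _ in range(200):                # revisit the stream a number of times
    for x, y in stream:             # weights change after each example
        w, b = sgd_step(w, b, x, y, lr=0.1)

print(round(w, 3), round(b, 3))
```

No batch is ever formed here: each example adjusts the weights as soon as it arrives, which is the sense in which an ANN can learn one example at a time.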


I was trying to identify similarities and differences between Hinton’s capsules-based sensorimotor algorithm (described at the link) and current HTM sensorimotor theory. I’m not sure that general ANNs and LSTMs affect that comparison. Also note that I’m not saying differences from HTM theory are bad; current DL can solve plenty of tasks better than current HTMs.


I just saw this paper about capsules. I don’t really have a firm grasp of the theory, but on the surface they seem similar to encoders from HTM theory. Can anyone explain how they are similar and different?


As I understand it, the main similarity is getting “competition” involved in the feedforward process. Traditionally, CNNs create spatial invariance among the sub-features of a whole by max-pooling (the distance between the eyes differs from person to person, yet “two eyes” should be considered the same feature). But there are a lot of problems with max-pooling. It has no learnable parameters, so the pooling step itself can’t learn, and it’s kind of dumb: it just picks whichever input shouts loudest, and it picks each feature on its own. Capsule routing, on the other hand, routes based on agreement: it combines the features spatially and then routes to wherever the whole result seems more probable.
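For readers who want to see what "routes based on agreement" means mechanically, here is a rough sketch of the iterative routing loop as described in the "Dynamic Routing Between Capsules" paper, written in plain Python. The vote vectors are made up for the demo (in the paper they come from learned weight matrices), and the function names are my own.

```python
import math

def squash(v):
    """Shrink a vector's length into [0, 1) while keeping its direction."""
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0 for _ in v]
    scale = (sq / (1.0 + sq)) / math.sqrt(sq)
    return [scale * x for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(votes, iterations=3):
    """votes[i][j] is lower capsule i's prediction vector for upper capsule j.

    Returns (outputs, couplings) from the final routing iteration.
    """
    n_lower, n_upper, dim = len(votes), len(votes[0]), len(votes[0][0])
    logits = [[0.0] * n_upper for _ in range(n_lower)]
    for _ in range(iterations):
        # Each lower capsule splits its output across the upper capsules.
        couplings = [softmax(row) for row in logits]
        outputs = []
        for j in range(n_upper):
            s = [sum(couplings[i][j] * votes[i][j][d] for i in range(n_lower))
                 for d in range(dim)]
            outputs.append(squash(s))
        # Agreement (dot product) between a vote and the result raises its logit.
        for i in range(n_lower):
            for j in range(n_upper):
                logits[i][j] += sum(a * b for a, b in zip(votes[i][j], outputs[j]))
    return outputs, couplings

def length(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical votes: all three parts agree about upper capsule 0,
# but disagree about upper capsule 1.
votes = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
outputs, couplings = route(votes)
print([round(length(o), 3) for o in outputs])
print([[round(c, 3) for c in row] for row in couplings])
```

Unlike max-pooling, nothing here is chosen by a single loud input: the upper capsule whose incoming votes line up ends up with the longer output vector and attracts more of the coupling.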

Geoffrey Hinton talk 'What is wrong with convolutional neural nets’
Geoffrey Hinton- ‘Does the Brain do Inverse Graphics’


If anyone reads this and wants to post an analysis of how it relates to HTM, please do.


Not an analysis, but for those visual learners like me who struggled to understand the structure of the model described in the paper, this was very helpful:


One may consider to what extent the routed input connections to some capsule implicitly determine a (non-binary) SDR codeword composed of the capsule units of the lower layer. This implicit code may very well exhibit similar properties, such as item class overlap and storage efficiency, championed in the Numenta papers. Note that I’m not necessarily referring to the squashing.


Here’s a paper by what we might assume is Hinton et al. which further develops the capsules idea.

This model bears even more resemblance to the new developments in HTM theory. This time the activation signals are 4x4 matrices capable of representing the affine transformation that relates the object to the viewer. These are multiplied with 4x4 weight matrices representing the relative poses of parts to the object they compose. Hinton speaks less of body location than of frames of reference and of doing “inverse graphics” by taking advantage of the affine manifold in the feature space pertaining to vision in particular.
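A small worked example may help with the matrix part. The sketch below assumes a part's pose is a 4x4 homogeneous transform and that a vote for the whole object is the part pose multiplied by a part-to-whole matrix. The specific matrices (`nose_pose`, `nose_to_face`, `viewpoint_change`) are invented for illustration; in the paper the part-to-whole matrix would be a learned weight.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def rotation_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

# Hypothetical pose of a part (a nose) relative to the viewer.
nose_pose = matmul(translation(1.0, 2.0, 0.0), rotation_z(0.3))
# Stand-in for a learned weight: where the face sits relative to the nose.
nose_to_face = translation(0.0, -0.5, 0.0)

# The part's vote for the pose of the whole face.
face_vote = matmul(nose_pose, nose_to_face)

# Move the viewpoint: the part pose changes, the weight does not.
viewpoint_change = matmul(rotation_z(0.7), translation(-2.0, 0.0, 1.0))
moved_vote = matmul(matmul(viewpoint_change, nose_pose), nose_to_face)

# The vote transforms exactly as the viewpoint did.
expected = matmul(viewpoint_change, face_vote)
max_diff = max(abs(moved_vote[i][j] - expected[i][j])
               for i in range(4) for j in range(4))
print(max_diff)
```

Because the part-to-whole matrix is independent of viewpoint, a change of viewpoint moves every part's vote by the same transform, so parts that agreed before still agree afterwards. That is the sense in which the model takes advantage of the affine structure.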

What I find interesting is that the convergence of the EM voting, the current bottleneck of this model, could, as I understand it, be sped up by introducing contiguous data belonging to the affine manifold, that is to say realistic temporal sequences of images, i.e. video.


We keep getting a lot of questions about this, so I wrote a short blog post comparing the similarities / differences of HTM sensorimotor theory with Capsules. It just appeared here:


Critically, what the models have in common is the assumption that the input belongs to an affine manifold, and that this assumption is architecturally integrated. I don’t agree that capsules are “inherently a batch, supervised learning technique”: dynamic routing is orthogonal to how weights are updated. In fact, what really is inherent in this type of voting design is the possibility of a hysteresis approach, for instance by employing a weight decay so that capsules that have recently reached consensus are more likely to remain in agreement, which amounts to making assumptions about temporally related data. It is also easy to imagine how motor control could influence the voting constellation or the transformation matrices. This is inherent to the design.


Here is a new video of Hinton.

He sounds very much like he is describing SDRs and clustering in a high-dimensional space, using the “supervision” provided by nature to learn.

He also talks about looking at a house and expecting a door, and then expecting a doorknob, an example Jeff Hawkins used in his book On Intelligence.