Nice to see this addressed by you guys. You didn’t mention what Hinton calls doing “inverse graphics”; it’s in the video link, I think. Hinton seems to have the idea of objects being represented in some archetypical form by incorporating the spatial prior into the network architecture. The 4x4 dimensionality of the matrices is almost certainly inspired by the affine transformation matrices of 3D space, though it’s never motivated as such anywhere in the paper. I suppose it’s a more constrained computation than what you are proposing is possible with the grid-cell-like stuff in the minicolumns. Anyway, there are many similarities, and I think it’s exciting to see this kind of bridging between your approaches and fields, with no less than Hinton on the other end.
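To make the 4x4 point concrete, here is a minimal sketch of why homogeneous transform matrices are a natural fit for part/object poses. It is not taken from the capsule papers; the helper names (`rigid`, `rigid_inverse`, `relative_pose`) and the example poses are made up for illustration, and it only handles rotation about one axis plus translation rather than a general affine transform. It shows that the part-relative-to-object pose stays the same when the viewpoint changes.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rigid(angle_z, tx, ty, tz):
    """Hypothetical helper: rotation about z by angle_z, then translation."""
    c, s = math.cos(angle_z), math.sin(angle_z)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def rigid_inverse(m):
    """Inverse of a rigid transform [R t; 0 1], which is [R^T, -R^T t; 0 1]."""
    rt = [[m[j][i] for j in range(3)] for i in range(3)]
    t = [m[i][3] for i in range(3)]
    nt = [-sum(rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [rt[0] + [nt[0]], rt[1] + [nt[1]], rt[2] + [nt[2]],
            [0.0, 0.0, 0.0, 1.0]]

def relative_pose(object_pose, part_pose):
    """Pose of the part expressed in the object's own frame."""
    return matmul(rigid_inverse(object_pose), part_pose)

def close(a, b, tol=1e-9):
    """Element-wise comparison of two 4x4 matrices."""
    return all(abs(a[i][j] - b[i][j]) <= tol
               for i in range(4) for j in range(4))

# Made-up example poses: a part attached to an object, seen from a viewpoint.
part_in_object = rigid(0.3, 1.0, 0.0, 0.0)
object_in_view = rigid(0.7, 2.0, -1.0, 5.0)
part_in_view = matmul(object_in_view, part_in_object)

# Move the viewer: both observed poses change...
viewpoint_change = rigid(-1.1, 0.5, 0.5, 0.0)
object_moved = matmul(viewpoint_change, object_in_view)
part_moved = matmul(viewpoint_change, part_in_view)

# ...but the part-to-object relationship does not.
same = close(relative_pose(object_in_view, part_in_view),
             relative_pose(object_moved, part_moved))
print(same)
```

The observed poses of the object and the part both change with the viewpoint, but their relationship cancels the viewpoint out, which is the property the pose matrices are meant to exploit.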
Just came out: https://arxiv.org/abs/1906.06818
“An object can be seen as a geometrically organized set of interrelated parts. A system that makes explicit use of these geometric relationships to recognize objects should be naturally robust to changes in viewpoint, because the intrinsic geometric relationships are viewpoint-invariant. We describe an unsupervised version of capsule networks, in which a neural encoder, which looks at all of the parts, is used to infer the presence and poses of object capsules. The encoder is trained by backpropagating through a decoder, which predicts the pose of each already discovered part using a mixture of pose predictions. The parts are discovered directly from an image, in a similar manner, by using a neural encoder, which infers parts and their affine transformations. The corresponding decoder models each image pixel as a mixture of predictions made by affine-transformed parts. We learn object- and their part-capsules on unlabeled data, and then cluster the vectors of presences of object capsules. When told the names of these clusters, we achieve state-of-the-art results for unsupervised classification on SVHN (55%) and near state-of-the-art on MNIST (98.5%).”
I’m re-watching a lot of these research meetings again, and I noticed something interesting in this one that I had glossed over the first time through. Marcus talks about it a bit starting around 7:02. Hinton’s 2017 paper discusses it in section 5.
A capsule is able to learn on its own different ways of spanning the space of variations in the way a given digit is drawn (in the case of MNIST). By perturbing these different dimensions, you can see that the capsules learn interesting things like scale, thickness, skew, etc., as well as more abstract distortions.
Personally, I have been focused on the idea of pooling and making associations, but it seems intuitive to me that this extraction of properties/dimensions must also be a core part of what the cortical circuitry is doing.
I know @Bitking has mentioned something related to this as well on a few occasions. For example:
Has anyone given some thought to an HTM-compatible algorithm for this type of extraction process?
Exactly, that is what HTM is doing, just unsupervised and the properties/dimensions are allowed to be more abstract and do not have to map to visual properties like scale, skew or thickness.
Could you elaborate? I am not aware of where this functionality currently exists in the HTM algorithms (or in the theories that I have seen discussed so far).
I don’t think there is a general way to define “instantiation parameters” in CapsNet, this is mostly application-specific. In my model, they are derived by cross-comparing parameters of input capsules (my patterns).
Also, their “object” is defined as a recurring configuration of different parts. But such recurrence can’t be assumed, it should be derived by cross-comparing relative position among parts of matching objects. Which can only be done after their positions are cross-compared, which is after their objects are cross-compared: two levels above the level that forms initial objects. So, objects formed by positional equivariance would be secondary. But they may be stronger, displacing initial similarity-defined objects as a primary representation of the same parts.
I am talking more from a higher level of abstraction: does the idea of “unsupervised” extraction of different dimensions/spaces of variation among concepts (I do realize there aren’t really discrete “concepts” in HTM, but it is difficult to word this in a way which takes a continuum into account) seem like a core function that the cortical circuitry should be doing? If so, has anyone thought about this from an HTM perspective?
As far as I understand, HTM and basic neuronal models only work in a positive fashion, detecting coincidences. Detecting “variation” and “equivariance” would require inverse operations: subtraction, division, etc. I think to do that you have to model deeply coupled neuron-interneuron systems.
This is a good observation. It is clear that basic SDR math supports these types of operations.
For example, I have been working with Cortical.io technology a lot recently, and one thing you can do with a word representation is extract a list of the semantic categories it can exist in (which is a similar class of problem to extracting dimensions in the above scenario). You start with the closest match and subtract those bits, then take the closest match to the resulting representation and subtract those bits, and so on. Repeat until all the bits have been subtracted, and you end up with a nice list of contexts.
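The subtract-and-rematch loop described above can be sketched roughly like this. This is not the Cortical.io API; SDRs are modelled as plain Python sets of active bit indices, and the function name `extract_contexts`, the fingerprints, and the bit values are all made up for illustration. I also stop early if nothing overlaps the leftover bits, which the description above doesn't cover.

```python
def extract_contexts(word_bits, context_fingerprints, min_overlap=1):
    """Greedily peel contexts off a word SDR.

    word_bits: set of active bit indices for the word.
    context_fingerprints: dict mapping a context name to its set of bits.
    Returns (ordered list of context names, leftover unexplained bits).
    """
    remaining = set(word_bits)
    contexts = []
    while remaining:
        best_name, best_overlap = None, 0
        # Closest match = fingerprint sharing the most remaining bits.
        for name in sorted(context_fingerprints):
            overlap = len(remaining & context_fingerprints[name])
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        # Assumption: stop when no context explains the leftover bits.
        if best_name is None or best_overlap < min_overlap:
            break
        contexts.append(best_name)
        # Subtract the matched bits and look again.
        remaining -= context_fingerprints[best_name]
    return contexts, remaining

# Toy data, invented for this example.
fingerprints = {
    "animals": {1, 2, 3, 4, 5},
    "cars": {10, 11, 12, 13},
    "software": {20, 21, 22},
    "food": {30, 31},
}
jaguar = {1, 2, 3, 4, 10, 11, 12, 20, 21}

contexts, leftover = extract_contexts(jaguar, fingerprints)
print(contexts, leftover)
```

With this toy data the contexts come out ordered by how many bits each one explains, strongest first.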
Probably need to think about biologically plausible circuitry for executing these types of subtractive operations. My initial thought is that one place this might potentially happen is around the borders of neighboring grids when they encounter each other as waves of activity spread out and re-integrate.
I think you are talking about synapse pruning in SDRs; anti-Hebbian learning should work fine for that. I meant deriving grey-scale differences or ratios between specific inputs, which could then become an output. Yes, it could be part of grid interactions, where the primary output is input-driven and the secondary / delayed output is driven by lateral inhibition. The problem is that the secondary output won’t be “signed”, so you may need two of them, one for each sign.
The signed input is not that far out of line with what is actually in the sensory stream. Both light and dark spots and lines have responsive cells. With this in mind, it is not a stretch to extend that through the processing streams.
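As a rough illustration of the "two outputs, one per sign" idea, here is a minimal sketch that splits signed differences between neighbouring inputs into two non-negative channels, loosely analogous to separate light-responsive and dark-responsive cells. The function names and the intensity values are hypothetical, and this is only arithmetic, not a model of any actual neuron-interneuron circuit.

```python
def split_signed(value):
    """Represent a signed value as two non-negative channels (on, off)."""
    return max(value, 0), max(-value, 0)

def on_off_differences(inputs):
    """Difference between each neighbouring pair, split by sign.

    'on' carries increases (brighter to the right),
    'off' carries decreases (darker to the right).
    """
    on, off = [], []
    for left, right in zip(inputs, inputs[1:]):
        positive, negative = split_signed(right - left)
        on.append(positive)
        off.append(negative)
    return on, off

def recombine(on, off):
    """Recover the signed differences from the two unsigned channels."""
    return [p - n for p, n in zip(on, off)]

# Made-up intensity profile: a bright bar followed by a dark edge.
intensities = [2, 8, 8, 1]
on, off = on_off_differences(intensities)
print(on, off, recombine(on, off))
```

Neither channel ever carries a negative value, yet together they preserve the full signed difference, so nothing downstream has to represent a sign directly.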