Positive-only matching vs negative matching

I wasn’t going to post this until I had results to publish, but people are starting to talk about it now. So at the risk of generating a lot of abstract discussion… Really, I’d prefer it if we could ground this in concrete examples, standard test problems. (It is crazy there aren’t standard test problems for spatial pooling.)

How is HTM spatial pooling[*] significantly different from mainstream neural networks? I think it is this: that learned patterns are only a positive signal, the “on” bits; there is no negative signal to indicate an anti-match (to a dendrite’s template) as you have in SOM or ANNs:

  • SOM units match the input vector to a full template – which input bits should be “on” and which should be “off”.
  • Neural network units can have negative synaptic weights, meaning those input bits should be “off” to match.
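To make that difference concrete, here is a minimal toy sketch (NumPy, my own made-up sizes, not code from any HTM or SOM implementation): a positive-only overlap score cannot distinguish an exact match of a template from an input that also turns on bits the template would want “off”, whereas a signed-weight score penalizes those extra bits.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
template_on = rng.choice(n, size=8, replace=False)   # the 8 bits the template wants "on"

# HTM-style segment: connected synapses only on the "on" bits (positive-only).
segment = np.zeros(n, dtype=int)
segment[template_on] = 1

# ANN/SOM-style template: +1 on the "on" bits, -1 on the "off" bits.
signed = np.full(n, -1)
signed[template_on] = 1

def overlap(x):        # positive-only score: active bits covered by the segment
    return int(segment @ x)

def signed_score(x):   # also penalizes active bits where the template says "off"
    return int(signed @ x)

exact = segment.copy()               # input identical to the template
superset = exact.copy()              # the template's bits plus 8 spurious "on" bits
superset[rng.choice(np.flatnonzero(exact == 0), size=8, replace=False)] = 1

print(overlap(exact), overlap(superset))             # 8 8  -> indistinguishable
print(signed_score(exact), signed_score(superset))   # 8 0  -> the superset is penalized
```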

HTM neurons are like this because they are modelling excitatory connections, leaving inhibitory connections as implicit in the process of k-winner-takes-all selection. Presumably neuroscience has shown that the important neuron-specific learned connections are excitatory. (?)

From a functional perspective, matching on positive-only subsets may seem to be important in recognising patterns from unions of several input patterns, as in:

  • visually-occluded objects (spatial pooler)
  • matching one of multiple simultaneous predictions (temporal memory)

However, (deep learning) neural networks and SOM seem to do just fine in feature learning and recognition.

I’m not convinced that positive-only matching works well enough.

@subutai’s mathematical work showed that due to sparsity a positive-only match is very likely to be “correct”. However, I think it can be misleading to consider only random SDRs; the real world is defined by richly structured correlations. I’ve started looking at images and natural language as more realistic input.

So in HTM, any input which has enough overlap with a learned template will match it. Example: an input image of a “3”, encoded only by its positive pixels, can match the learned template for an “8” (visually, the shapes overlap). I call this the subset matching problem. And columns will continue to learn on such partial matches; they are greedy and may agglomerate too much.
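A toy version of that example (the bit counts here are made up for illustration, not taken from any real encoder):

```python
# Made-up bit indices: an "8" column has learned a template of 40 input bits,
# and a "3" happens to activate 26 of those same bits plus 14 of its own.
eight_template = set(range(40))
three_input = set(range(26)) | set(range(100, 114))

overlap = len(eight_template & three_input)   # 26
threshold = 20                                # a plausible positive-only match threshold
print(overlap >= threshold)                   # True: the "3" matches the "8" template
```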

Possible responses:

  • currently we rely on an indirect mechanism: any different but overlapping pattern will likely have fewer matches to the original template, so it may lose out in inhibition to other random columns (a toy sketch of this follows the list).
    • this is supposed to be enhanced by the “boosting” mechanism.
    • I remain to be convinced.
    • there’s a tension between robust recognition and discrimination here.
  • the input could be encoded with a positive representation of empty space or edges/boundaries. However,
    • the subset matching problem still applies, to some subset of those edges etc.
    • I don’t see how this could apply in general at other levels, from region to region; this isn’t a problem just at the initial input level.
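As a toy sketch of that indirect mechanism (made-up overlaps and boost factors; the real spatial pooler adapts boost factors from activity duty cycles): columns compete on boosted overlap and only the top k become active, so a column whose overlap comes from a partial match can still lose the inhibition round.

```python
import numpy as np

# Column 0 is the "8" column partially matched by a "3"; the rest are competing columns.
overlaps = np.array([26, 31, 33, 9, 28, 30])          # raw positive-only overlaps
boosts   = np.array([1.0, 1.0, 0.8, 1.5, 1.2, 1.0])   # under-active columns get a boost

k = 3
scores = overlaps * boosts
winners = np.argsort(scores)[-k:]       # k-winner-take-all over boosted overlap
print(sorted(winners.tolist()))         # [1, 4, 5] -> the partial match (column 0) loses
```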

Branched from the bbHTM Spatial pooler/mapper implementation topic, since @mraptor used Hamming distance, which (like SOM) actively matches “off” bits.

[*] In fact this issue applies to both spatial pooling (proximal dendrite matching) and temporal memory (distal dendrite matching).


I agree with you that you can compare HTM with standard ML. I think one of the main differences is the network structure:

  • It’s true that every segment is a linear classifier with binary weights; a positive classification means cell activation. To learn the classifier you can use SP or some other form of SOM or L0-norm sparse coding or… The interesting point is the combination of several segments in one node (at least for me; others may see it differently); see the sketch after this list. I think with the right learning algorithm you could solve the invariance problem with this network structure. Every segment represents another view of the same concept.

  • The same goes for TM, but there you classify over transitions. Again, I think you can use different algorithms to learn the transitions; the important thing is the combination of the spatial and temporal classifiers in one network, and the creation of context by activating only cells that are predicted. I don’t know of anything similar in ML (maybe you can compare the use of context with conditional probabilities, but it’s not the same).

  • The binary weights allow the use of big SDRs. Their power is still under research, and I’m also a little confused about the correlation/de-correlation issue, so I’ll let somebody else answer this.
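A minimal sketch of what I mean by combining several segments in one node (my own toy notation, binary weights, not from any HTM codebase): the cell is effectively an OR over per-segment threshold classifiers, so each segment can capture a different “view” of the same concept.

```python
import numpy as np

def cell_active(x, segments, theta):
    # x: binary input vector; segments: list of binary weight vectors (one per segment).
    # The cell fires if ANY segment's overlap with x reaches the threshold theta.
    return any(int(w @ x) >= theta for w in segments)

n = 32
view_a = np.zeros(n, dtype=int); view_a[:8] = 1      # one "view" of the concept
view_b = np.zeros(n, dtype=int); view_b[16:24] = 1   # another view, on disjoint bits

segments = [view_a, view_b]                          # two segments on the same cell
x = np.zeros(n, dtype=int); x[16:24] = 1             # an input matching only the second view
print(cell_active(x, segments, theta=6))             # True: the cell recognises either view
```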


How is HTM spatial pooling[*] significantly different from mainstream neural networks? I think it is this: that learned patterns are only a positive signal, the “on” bits; there is no negative signal to indicate an anti-match (to a dendrite’s template) as you have in SOM or ANNs:

Hi Felix,

Forgive me if I don’t understand exactly what you mean by this, but I’m confused: if it’s true that the spatial location of bits is semantically meaningful (e.g. 0001100 vs. 0000110), then isn’t the anti-match implied here? And not only implied but encoded and processed, because the dendrite in the 4th position (counting from either end) is its own entity, and so its non-participation in active and lateral accounting is negative information? I don’t see the distinction you’re making (in some respects, at least).

Felix,

Are you saying that instead of an overlap metric HTMs should use a Euclidean metric, Hamming distance, or some other metric like that?

I do know the overlap metric allows at least two very nice properties: extreme noise robustness (a segment simply ignores all the bits it doesn’t care about) and the union property. It may be possible to get these properties some other way too.
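Here is a toy sketch of those two properties (made-up sizes, not nupic code): the segment only looks at the bits it is connected to, so extra active bits elsewhere are simply ignored, and a union of several SDRs still matches each of its members because all of that member’s bits are still present.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2048

def random_sdr(k=40):
    v = np.zeros(n, dtype=int)
    v[rng.choice(n, size=k, replace=False)] = 1
    return v

a, b, c = random_sdr(), random_sdr(), random_sdr()
segment = np.flatnonzero(a)[:20]        # a segment connected to 20 of a's active bits
theta = 15

def matches(x):
    return int(x[segment].sum()) >= theta

noisy_a = a.copy()
noisy_a[rng.choice(n, size=200, replace=False)] = 1   # 200 spurious active bits
union = np.clip(a + b + c, 0, 1)                      # union of three SDRs

print(matches(a), matches(noisy_a), matches(union))   # True True True
print(matches(b))                                     # almost certainly False
```

A Hamming or Euclidean score against a full template would be dragged down by the noise bits and by the other members of the union.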

In biology, segments in neurons don’t compute normal Lp distances because each segment only connects to a tiny percentage of the input space. It dynamically grows connections to parts of the input space as needed as a function of learning. We want to stick closely to biology.

You are right that the current spatial pooler and sequence memory don’t explicitly model inhibitory connections; they are implied. I think there is more we could do with inhibitory connections that we don’t even model today.

Uncorrelated SDRs: here’s an interesting tidbit. If you take any input, such as images, and use normal Hebbian learning with linear neurons plus inhibition, the neurons learn the principal components of the input [1]. These neurons will become uncorrelated through learning even though the inputs are highly correlated. This is one example of the pressure towards uncorrelated features in the brain.

[1] Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267–273. doi:10.1007/BF00275687. http://deeplearning.cs.cmu.edu/pdfs/OJA.pca.pdf
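For reference, the rule in [1] is only a couple of lines; here is a minimal sketch (my own toy data, one linear neuron) showing it converging to the first principal component of correlated input.

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated 2-D data whose leading principal direction is roughly [1, 1] / sqrt(2).
cov = np.array([[3.0, 2.5],
                [2.5, 3.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

w = rng.normal(size=2)
lr = 0.01
for x in X:
    y = w @ x
    w += lr * y * (x - y * w)      # Oja's rule: Hebbian term y*x with a y^2 * w decay

print(w / np.linalg.norm(w))       # approximately +/- [0.71, 0.71]
```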


(Incidentally, Euclidean distance on binary data is order-equivalent to Hamming distance.)
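That is, for binary vectors x, y ∈ {0,1}^n:

$$\|x - y\|_2^2 = \sum_{i=1}^{n} (x_i - y_i)^2 = \#\{\, i : x_i \neq y_i \,\} = d_H(x, y),$$

so squared Euclidean distance is exactly Hamming distance, and ranking candidates by one is the same as ranking them by the other.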

Well no, I guess I’m saying that it’s not clear that positive overlap is actually sufficient. Or what other mechanisms or restrictions are needed together with it. This method is significantly different from previous research so we need a convincing test of whether it works on real data. The hard part of that is defining “works”.

Reconstruction or classification of images? With or without saccades? cue linked topic

Sure, my point was about the lack of a negative signal; by all means it can still be a sparse coding. That could be achieved if some of the learned inputs were inhibitory. (Equivalently, if the inputs were excitatory but some came from sources that were themselves inhibited below a baseline firing rate.)

I think noise robustness and the union property arise from sparsity, not from being only a positive signal. Of course, we would need to account for the potential issue of bias towards negative signal in sparse data…

That paper only considered one neuron in isolation, so it doesn’t demonstrate decorrelation. And it’s the typical ANN neuron with scalar activation and scalar weights, I think.

Anyway, I take your point that there are mechanisms for decorrelation in the brain. I think we need to work harder on doing that in software, noting that the wider ML literature seems to be doing it already.

Thanks for your thoughtful response, I appreciate it!


Yes that is interesting. But note that standard implementations of HTM allow only one proximal (feed-forward, spatial-pooling) segment. I’m told that is consistent with the biology. On the other hand:

  • feedback segments can (should) influence the spatial pooling process. (“paCLA”)
  • it looks like Layer 5 cells are activated by distal/apical dendrites, not just a single proximal dendrite.

http://lists.numenta.org/pipermail/nupic-theory_lists.numenta.org/2016-February/003540.html

So the multiple perspectives aspect you highlighted should still be valid.

Agreed! It’s also hard to look at the SP in isolation and just think in terms of pure information theory. To me the definition of “works” includes being able to form good sequences, support continuous learning, unions, feedback, drive behavior, etc. etc.

Hmm, you’re right. Oja does have a paper with multiple neurons but I can’t find it now. I did find this one [1]. I didn’t read through it in detail, but see Fig 1 on page 465. It looks like Hebbian plus lateral inhibition. In any case there is a lot of evidence for decorrelation in the brain. It is somewhat controversial - if you search you will also find evidence for the opposite :smile: Such is Neuroscience!

[1] http://courses.cs.washington.edu/courses/cse528/09sp/sanger_pca_nn.pdf

@floybix I agree with you that allowing negative weights would make a nicer algorithm from a machine learning point of view. But it is not clear to me that positive-only weights are insufficient to solve the subset matching problem. Take your 3 vs 8 recognition problem as an example. I can construct a segment that responds to 8 but not to 3 by subsampling from the SDR associated with 8. Assume each digit leads to 40 active bits. A segment subsamples 20 of them randomly, and the activation threshold is 15. If you just present a 3 (or any half of the 8), you will activate about 10 bits on this segment, which is not strong enough to cause a “match”.
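A quick Monte Carlo check of that construction (using only the toy numbers above: 40 active bits per digit, the 3 assumed to share exactly half of the 8’s bits, segments of 20 synapses, threshold 15):

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 100_000

eight_bits = np.arange(40)      # indices of the "8" SDR's 40 active bits (toy labelling)
shared = set(range(20))         # assume the "3" activates exactly 20 of them

false_matches = 0
for _ in range(trials):
    segment = rng.choice(eight_bits, size=20, replace=False)   # random subsample of the "8"
    overlap = sum(int(b) in shared for b in segment)           # expected value is 10
    if overlap >= 15:
        false_matches += 1

print(false_matches / trials)   # roughly 0.002: the hypergeometric tail P(overlap >= 15)
```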

I disagree. SP is a potential bottleneck of information. If information doesn’t survive the transformation into SP then it is certainly not available to form sequences or whatever else.

Wow, that paper (Sanger 1989) actually describes the same method as another paper I’ve been interested in recently, Spratling & Johnson 2002.

No, it is not lateral inhibition. Rather, the method “uses up” each input bit’s strength progressively as target neurons are activated, meaning those same inputs can’t contribute to the activation of other target units. Privately I’ve been calling it “suck” decorrelation, because it sucks out the energy of the input bits.
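In code, the mechanism looks roughly like this (my paraphrase of Sanger’s generalized Hebbian algorithm, not the exact notation of either paper): each unit’s Hebbian update sees the input minus the portions already “claimed” by itself and the units before it.

```python
import numpy as np

def sanger_update(W, x, lr=0.005):
    # One step of Sanger's rule. W: (m, n) weights, one row per output unit; x: length-n input.
    y = W @ x
    claimed = np.zeros_like(x, dtype=float)   # the part of the input already "used up"
    for i in range(W.shape[0]):
        claimed += y[i] * W[i]                # unit i claims its share of the input
        W[i] += lr * y[i] * (x - claimed)     # Hebbian update on what was left for it
    return W

# Toy demo: the rows converge (approximately) to the principal directions, in order.
rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 2.5], [2.5, 3.0]], size=20000)
W = rng.normal(scale=0.1, size=(2, 2))
for x in X:
    sanger_update(W, x)
print(W / np.linalg.norm(W, axis=1, keepdims=True))
```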

On the other hand Földiák 1990 does decorrelation (sparse coding) using learned lateral inhibition. That whole paper is blowing my mind. I wish I’d read it 2 years ago when I first got into HTM.


Sure. All I am saying is that constraint (preserving information) is by itself insufficient. The representation also has to work well for sequence memory, continuous learning, etc. and these impose additional constraints. At the end of the day, the SP is one component in a larger system, and you have to consider all the constraints not just pure information theory.
