HTM spatial pooler: how to ensure feature variety?

A while back I implemented the HTM spatial pooler and tested it on MNIST (I don’t have access to the code right now). I was using 1024 hidden units with a top-32 WTA activation scheme, and I also implemented boosting. It was pretty textbook and followed the description of the spatial pooler in the YouTube “HTM School” series almost exactly.
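For concreteness, here’s a minimal numpy sketch of the kind of forward pass I mean (from memory, not the actual code; the sizes match what I described, but the real SP also has potential pools, duty cycles, etc.):

```python
import numpy as np

rng = np.random.default_rng(0)

N_INPUT = 784      # flattened 28x28 MNIST image
N_UNITS = 1024     # hidden units
K = 32             # top-k winners (WTA)
PERM_THRESH = 0.5  # a synapse is "connected" above this permanence

# Random initial permanences in [0, 1]
permanences = rng.uniform(0.3, 0.7, size=(N_UNITS, N_INPUT))

def sp_forward(x, boost):
    """x: binary input vector; boost: per-unit boost factors."""
    connected = (permanences >= PERM_THRESH).astype(float)
    overlap = (connected @ x) * boost   # boosted overlap scores
    h = np.zeros(N_UNITS)
    h[np.argsort(overlap)[-K:]] = 1.0   # top-32 winners activate
    return h

x = (rng.random(N_INPUT) < 0.15).astype(float)  # toy binary input
h = sp_forward(x, np.ones(N_UNITS))
```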

Anyway, I found that most features ended up being simple prototypes: the prototypes most similar to the input were the ones that activated. It seems that, if the network has high enough capacity, it prefers to memorize prototypes rather than learn a factorized representation.

My theory is that this behavior is due to the lack of a negative/contrastive learning phase (as in restricted Boltzmann machines) or any kind of error-driven learning. So I hacked in a simple contrastive learning term, and indeed, the features became much more diverse and the representations became more factorized and less redundant.

I suspect that whitening the input data as an alternative to contrastive learning would also help, although since the HTM spatial pooler uses binary inputs I’m not sure how that would work.

What are your thoughts?


Can you explain more about this?

How did you go about this?

Prototype refers to a feature that basically looks like a training sample. Examining the resulting weights showed that almost all of the features were entire 5s, 6s, etc. There weren’t any features that were just a single stroke, for instance. It was as if I had run k-means on MNIST. When all your features are prototypes, you end up with a very redundant encoding that doesn’t factorize the input well at all (and factorized representations are generally desirable).

I implemented the contrastive term similarly to how it’s done with RBM’s. I created a reconstruction of the input by multiplying the hidden layer activations with the transposed weight matrix and then applying top-k sparsity and binarizing the result (the value of k was determined by counting how many pixels were active in the original input image, so I matched the sparsity). Then I fed the reconstruction forward through the network again. And then, using the reconstructed input and new hidden layer, I computed the permanence matrix update, but inverted it. Basically, contrastive divergence, but for HTM.
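Roughly, the negative phase looked something like this (a sketch from memory rather than the actual code; the outer-product update here is a simplified stand-in for the usual per-synapse permanence increments/decrements, and the learning rate is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
N_INPUT, N_UNITS, K = 784, 1024, 32
LR = 0.01
permanences = rng.uniform(0.3, 0.7, size=(N_UNITS, N_INPUT))

def top_k_binary(v, k):
    """Keep the k largest entries as 1.0, zero the rest."""
    out = np.zeros_like(v)
    out[np.argsort(v)[-k:]] = 1.0
    return out

def forward(x):
    overlap = (permanences >= 0.5).astype(float) @ x
    return top_k_binary(overlap, K)

def contrastive_update(x):
    h = forward(x)                                    # positive phase
    k_in = int(x.sum())                               # match input sparsity
    x_recon = top_k_binary(permanences.T @ h, k_in)   # reconstruct input
    h_recon = forward(x_recon)                        # negative phase
    # Reinforce (h, x) pairs, weaken (h_recon, x_recon) pairs --
    # a CD-1 analogue; 2*x - 1 gives +1 for active bits, -1 for inactive.
    dperm = LR * (np.outer(h, 2 * x - 1) - np.outer(h_recon, 2 * x_recon - 1))
    np.clip(permanences + dperm, 0.0, 1.0, out=permanences)
```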

There’s actually another way of encouraging variety that I’ve seen used in other approaches. The gist is to have the top-k winners “learn” the input, but then have the top-n “almost winners” “unlearn” the input. So for example the permanence update would be computed as normal for the top 20 features, but features 21–40 would have the permanence update inverted to “unlearn” what they are seeing. Basically, Hebbian learning for strongly activated features, anti-Hebbian learning for weakly activated features, and no learning for unactivated features.
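Sketched out (the exact k/n values and learning rate are just placeholders, and again the outer-product form is a simplification of the per-synapse rule):

```python
import numpy as np

rng = np.random.default_rng(2)
N_INPUT, N_UNITS = 784, 1024
K_LEARN, N_UNLEARN = 20, 20   # top-20 learn, next 20 unlearn
LR = 0.01
permanences = rng.uniform(0.3, 0.7, size=(N_UNITS, N_INPUT))

def hebbian_antihebbian_update(x):
    overlap = (permanences >= 0.5).astype(float) @ x
    order = np.argsort(overlap)[::-1]                 # units by overlap, descending
    winners = order[:K_LEARN]                         # strongest: learn
    runners_up = order[K_LEARN:K_LEARN + N_UNLEARN]   # "almost winners": unlearn
    delta = 2 * x - 1                                 # +1 active input, -1 inactive
    permanences[winners] += LR * delta                # Hebbian update
    permanences[runners_up] -= LR * delta             # anti-Hebbian (inverted) update
    np.clip(permanences, 0.0, 1.0, out=permanences)   # unactivated units: no change
```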


Would that mean that the HTM implementation (your first version) did not detect badly written characters at all? (So, for instance, a ‘5’ with a closed bottom loop would never be recognised as a 5, even when a human clearly does recognise it?)

As interesting as this sounds, it seems a very unbiological process.

Wouldn’t the problem be solved with a better input stream? Maybe more crooked 5s, or almost no good 5s but lots of different partial 5s.

It’s also possible that I completely misunderstand your point. In that case, sorry.

(Kudos on your work btw. I am so jealous of your persistence.)

Another option might be to use a tighter topology configuration for your SP, so the minicolumns aren’t all looking at the whole image. Boosting might also lead to better diversity.
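Something like restricting each minicolumn’s potential pool to a local patch of the image, e.g. (the radius and unit count here are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
SIDE = 28       # MNIST image side length
N_UNITS = 1024
RADIUS = 5      # receptive-field radius (arbitrary choice)

# Each unit is centred somewhere on the image and can only form
# synapses inside a (2*RADIUS+1)^2 window around its centre.
centers = rng.integers(0, SIDE, size=(N_UNITS, 2))
ys, xs = np.mgrid[0:SIDE, 0:SIDE]

masks = np.zeros((N_UNITS, SIDE * SIDE), dtype=bool)
for i, (cy, cx) in enumerate(centers):
    local = (np.abs(ys - cy) <= RADIUS) & (np.abs(xs - cx) <= RADIUS)
    masks[i] = local.ravel()

# Permanences exist only inside each unit's local window.
permanences = rng.uniform(0.3, 0.7, size=(N_UNITS, SIDE * SIDE)) * masks
```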


I don’t know much about RBMs or contrastive learning, but I have some experience with spatial pooling on MNIST. I agree with Paul_Lamb; I think this problem has to do with topology, and with MNIST in particular.
My input on factorization: MNIST is notoriously easy, and any spatial pooler with sufficient capacity can just memorize a large number of digit shapes. If the pooler is looking at the entire image, adding topology might make it learn more factorized representations (in a somewhat trivial way).

As far as diversity goes, I think this is a subtle point about boosting. With too little boosting you would expect to see some decent variety, but not extremely rare digit shapes. More aggressive boosting would learn rarer shapes, but risks instability.

Another interesting thing to look into is saccades. In my experience, showing random crops of images increased the data variety dramatically and led to more general, CNN-like features with oriented edges (especially interesting with more complex, colored ImageNet images; the features looked very CNN-like).
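E.g. something as simple as this, fed to the pooler instead of the full image (the crop size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
CROP = 16  # crop side length (arbitrary choice)

def random_crop(img, size=CROP):
    """Take a random size x size crop of a 2-D image -- a crude 'saccade'."""
    h, w = img.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

img = rng.random((28, 28))   # stand-in for an MNIST digit
patch = random_crop(img)
```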


I agree with @Paul_Lamb and @thejbug.
Topology and boosting are definitely the things that immediately popped up in my head while reading through this.
I see you’ve already implemented boosting, though. But topology seems to always help when there’s information in the spatial arrangement of the input.


Topology and random crops seem like good ideas. At some point I might put together a comparison of different approaches.