How Can We Be So Dense? The Benefits of Using Highly Sparse Representations

The WTA AE paper is about classification.

Sorry for the confusion. It’s true that the WTA AE is used for classification, but it’s done by sending the encoded vector into an SVM and then using the SVM to classify the image (if I read the paper correctly). The sparsity is never part of the classifier itself. In this paper, however, it’s an end-to-end approach and sparsity is part of the classifier.

I actually implemented the WTA AE a couple of years ago, when I was evaluating various autoencoders for feature extraction on a custom dataset. First of all, the type of classifier used on the vector of top-level features usually does not matter; what matters is the quality of the features. I used a linear classifier in my implementation, and it didn’t make much difference compared to an SVM. I also trained it end-to-end (rather than one layer at a time, as they did in the paper); it trained just fine and, again, didn’t make much difference. Sparsity is used during training so that the network learns to extract better features. During testing we remove the sparsity constraints, because they are not needed and don’t make sense if we want to do classification. By the way, in Numenta’s paper they also reduce sparsity to 50% during testing.

My point is that it’s not clear whether Numenta’s modification leads to better noise robustness (or better noise-free classification accuracy), because it has not been tested against the original method. It might be superior, but we simply don’t know at this point. I might test it when I have time, but this really should have been done in the paper.


After further messing around: KWinner is by far the best solution against noise without training on noisy data. KWinner is also effective against all sorts of noise: setting random pixels to random values, adding random values to pixels, setting random pixels to 0.5, etc. In fact, in most cases the raw network without regularization performs better against noise than networks trained with any regularization besides KWinner + boosting.

For example, corrupting images with Gaussian noise: image_new = clamp(image + 0.8*gaussian_noise, 0, 1)
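For reference, that corruption step can be sketched in a few lines of numpy (the function name and the assumption that pixels live in [0, 1] are mine, not from the original experiment):

```python
import numpy as np

def corrupt(image, sigma=0.8, rng=None):
    """Sketch of the corruption above: add scaled Gaussian noise,
    then clamp back into [0, 1].

    Implements image_new = clamp(image + sigma * gaussian_noise, 0, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(image.shape)
    return np.clip(image + sigma * noise, 0.0, 1.0)
```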

Method      Raw      Dropout  Batch Norm  KWinner  KWinner + boosting
Accuracy    61.21%   35.41%   44.21%      45.18%   69.67%

@marty1885 maybe a Contractive AE could perform better?

Maybe, maybe not. I’m using KWinner in a classifier rather than in an AE (as a replacement for dropout/batchnorm). I’ll investigate.

I wrote a follow-up to my previous post. Hopefully it helps someone.
I never thought boosting would be so powerful in NNs. Like… jaw-droppingly amazing. This is how well the different regularization methods fare against Gaussian noise (more realistic than my previous test of setting random pixels to random values).

Accuracy (%) per noise level:

noise   Raw     Dropout  Batch Norm  KWinner  KWinner + boosting
0.0     98.91   98.11    97.79       98.47    98.22
0.05    94.58   76.99    62.23       71.35    95.68
0.1     94.33   76.97    61.77       70.89    95.55
0.2     93.40   76.39    60.60       68.28    95.35
0.3     91.29   75.69    58.73       66.03    94.81
0.4     88.58   74.25    56.49       62.31    94.15
0.5     85.57   71.75    53.05       57.43    92.66
0.6     82.28   68.25    48.23       51.25    88.96
0.7     77.52   63.22    42.65       43.39    82.41
0.8     71.72   53.76    32.94       35.75    74.18

K-winners and boosting go hand-in-hand, evolutionarily speaking. The minicolumns are under pressure to represent as much diverse spatial input as possible. Without homeostatic regulation (boosting), the same small percentage of winners will just keep winning; k-winners alone does not help much until you add that inhibitory pressure. It’s like pressing down a histogram to squelch the top performers, spreading patterns across the entire layer.
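A minimal sketch of that homeostatic pressure (the exponential boost rule follows HTM convention, but the names and code here are my paraphrase, not nupic’s implementation): columns whose running duty cycle is below the target density get their overlaps scaled up, and chronic winners get scaled down.

```python
import numpy as np

def boost_overlaps(overlaps, duty_cycles, target_density, boost_strength=1.0):
    # Columns that have been active less often than the target density
    # receive an exponential bonus; over-active columns are suppressed.
    factors = np.exp(boost_strength * (target_density - duty_cycles))
    return overlaps * factors
```

With boost_strength = 0 the factors are all 1 and boosting is disabled, which is why the raw k-winners rule on its own lets the same columns keep winning.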


To follow on from the previous posts: I originally thought I could get around using k-winners or boosting of any form in my own, more temporal version of HTM. Boy, was I wrong. Even with learning and a specific hex grid / lateral connection-inhibition approach, you just end up with the same columns dominating.

Also, as a side note regarding MNIST: a long time ago I whipped up a NuPIC HTM version to solve it using just the spatial pooler, getting around 95%. I was happy until last week, when I took a quick double check and noticed I had all learning set to 0.

It seems that using the spatial pooler’s k-winners-take-all to simply pick the top 5% of columns and feed them into a KNN is enough, even without learning. Ha, typical.
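Assuming plain numpy, that pipeline (top-k winners over column overlaps, then a nearest-neighbor lookup on the resulting binary SDRs) could be sketched like this — all names here are mine, not nupic’s:

```python
import numpy as np

def topk_sdr(overlaps, k):
    """Binary SDR keeping only the k columns with the highest overlap."""
    sdr = np.zeros(overlaps.shape, dtype=bool)
    sdr[np.argsort(overlaps)[-k:]] = True
    return sdr

def knn_predict(train_sdrs, train_labels, query_sdr):
    """1-NN classification by SDR overlap (count of shared active bits)."""
    shared = (train_sdrs & query_sdr).sum(axis=1)
    return train_labels[int(np.argmax(shared))]
```

With an untrained pooler this amounts to template matching on random projections, which is consistent with the ~95% result above.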


You get ~95% on MNIST using template-and-match… Well… So what happens if you turn on learning?


Depends on the receptive field size. If I use a practically global size, i.e. input dimensions (28, 28) == potential radius (28, 28), it just screws up, and the number of active columns shrinks down to 5-10 when it is normally 240.

The settings I use for no learning are:

spParamTopologyWithBoostingGlobalInhibition = {
    "inputDimensions": (28, 28),
    "columnDimensions": (64, 64),
    "potentialRadius": 28,
    "potentialPct": 0.8,
    "globalInhibition": True,
    "localAreaDensity": -1,
    "numActiveColumnsPerInhArea": 240,
    "wrapAround": True,
    "stimulusThreshold": 0.0,
    "synPermInactiveDec": 0.0,
    "synPermActiveInc": 0.0,
    "synPermConnected": 0.8,
    "minPctOverlapDutyCycle": 0.001,
    "dutyCyclePeriod": 100,
    "boostStrength": 1.0,
    "seed": 7777,
    "spVerbosity": 0
}
Final Results - Error: 4.27% Total Correct: 9573 Total Incorrect: 427

However, this could be wrong elsewhere in the code; I’m not sure how NuPIC handles a param value of 0 for the stimulus threshold, increments, etc.

Seems odd, so it’s probably something wrong on my end, but I thought it was interesting.

After messing with the concept of sparse networks when I should have been studying for final exams, here are my conclusions (I re-implemented the paper from scratch in PyTorch):

  1. KWinner + boosting works very well against noise on MNIST.
  2. In fact, it also protects the network against adversarial attacks (IFGSM and MIFGSM).
  3. On MNIST, a Dropout layer can also provide the same level of protection against noise, and even against adversarial attacks.

Bumping up the difficulty, I also tried CIFAR-10 with VGG11.

  1. KWinner + boosting provides the same level of protection against adversarial attacks and noise (that is, barely any, like 2%).
  2. The sparse network does not perform any sort of gradient masking, so a straightforward attack works.
  3. Adversarial attacks do transfer from a normal network to a sparse network.
  4. Funny thing: adversarial examples from a network with dropout do not transfer well to a sparse network.

Hi @marty1885, awesome job! Is your implementation available on GitHub? We are interested in evaluating robustness against adversarial attacks as well, and it seems you’ve already done some work on that. Great blog posts, by the way.

The latest paper presented at ICML Uncertainty and Robustness workshop includes some results on CIFAR-10. We’ve also run experiments on CIFAR-100.

As an update, we are currently working on dynamically changing the “sparsity map” during training (changing which weights are set to zero and frozen), which would be equivalent to learning the structure along with learning the weights. For that, in our new version the connections will be allowed to grow or decay during training. As @michaelklachko correctly pointed out, in the current implementation the zeroed-out weights are randomly chosen when the model is initialized, which is far from optimal.

Please keep the contributions coming, I feel they are extremely relevant and helpful to our ongoing research on the topic and would love to discuss collaborations.

Some side notes:

A model with a high degree of weight decay, whatever norm is used, can also push the weights to zero during training and promote sparsity - however, we are looking at structure learning methods which are preferably not gradient based and do not rely on the main loss function. Dropout techniques (such as weight dropout) are powerful regularizers and a related topic, but not the same. In dropout you still need to store all the weights, so memory requirements are the same as regular dense models (calculations can be done on sparse matrices, however you still need the full weight matrix in memory). But most importantly, sparsity in dropout only happens during training and not at inference time.
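The distinction can be made concrete in a toy numpy sketch (the names and the 50% sparsity level are arbitrary choices of mine): a static sparsity mask is fixed at initialization, so the zeroed weights never need to be stored or updated, while a DropConnect-style mask is redrawn on every forward pass during training and disappears at inference, leaving the full dense matrix in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))

# Static sparsity: one mask chosen at init, applied forever.
# Only the surviving weights ever need to be stored or trained.
static_mask = rng.random(w.shape) > 0.5
w_sparse = w * static_mask

# DropConnect-style regularization: a fresh random mask per forward
# pass, training only; the full weight matrix must stay in memory.
def dropconnect_forward(w, x, rng, p=0.5):
    return (w * (rng.random(w.shape) > p)) @ x
```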

Research related to pruning is also very relevant. The main difference is that pruning is usually done after a full round of training - the models are first trained as dense models, and then pruned. Our end goal with this work is to have a sparse network during and at the end of training, leading to networks that can do faster predictions with less power requirement.


Thanks for the WTA AE reference, @michaelklachko, I will go over the paper. You are correct, it would be really interesting to see that comparison.


Hang on. It is a hot mess right now. I’ll post it as soon as I have it cleaned up a bit.


Hey @marty1885, by ‘kwinner’ do you mean the orig. WTA AE model (i.e. both spatial and temporal sparsity)? By the way, another good method to increase noise robustness is gradient regularization:

I’m going to take a look at this in the next few days (impact of sparsity on noise tolerance and power consumption).

@lucasosouza, do you have any more details on what you’re working on? Perhaps we can collaborate.


@michaelklachko I ensured both lifetime and spatial sparsity, but unlike WTA AE’s solution, I ensured sparsity using boosting, which enforces sparsity across batches.
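For concreteness, here is roughly what I mean by a k-winners layer with duty-cycle boosting, as a standalone numpy sketch (my paraphrase of the idea, not Numenta’s nupic.torch implementation; all names are mine):

```python
import numpy as np

class KWinners:
    """k-winners-take-all with boosting driven by a running duty cycle."""

    def __init__(self, n_units, k, boost_strength=1.0, duty_period=100):
        self.k, self.boost_strength = k, boost_strength
        self.duty_period = duty_period
        self.target = k / n_units
        self.duty = np.full(n_units, self.target)  # start at target density

    def __call__(self, x):  # x: (batch, n_units), assumed non-negative
        # Winners are picked on boosted activations, but the layer
        # passes through the *unboosted* values of those winners.
        boost = np.exp(self.boost_strength * (self.target - self.duty))
        winners = np.argsort(x * boost, axis=1)[:, -self.k:]
        out = np.zeros_like(x)
        np.put_along_axis(out, winners,
                          np.take_along_axis(x, winners, axis=1), axis=1)
        # Update the running duty cycle each batch, so under-used units
        # get boosted in later batches -- sparsity is enforced across batches.
        active = (out > 0).mean(axis=0)
        self.duty += (active - self.duty) / self.duty_period
        return out
```

With boost_strength = 0 this degenerates to plain k-winners, which matches the tables above where KWinner alone does much worse than KWinner + boosting.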

@lucasosouza I’m still working on cleaning up the code. On what subject could we collaborate? Maybe, with some knowledge and know-how from Numenta, we can solve the adversarial attack problem once and for all.


Would be great to collaborate, @michaelklachko @marty1885. Let me write down some thoughts and the research directions we are pursuing with this, and perhaps we can draft a joint plan.

@marty1885 Regarding your dropout vs. k-winners + boosting suggestion: the only issue I see is that dropout models increase regularization and improve accuracy on the test set, while our model does not. We specifically show it improves accuracy when noise is added, but not on the regular test set.

If we can get to a working model that performs better than the dense + dropout model in the regular test set, then it could be advertised as a replacement for dropout. I know you have done a lot of experiments in this, I’m curious to hear your thoughts.


We would love to get some feedback on the nupic.torch implementation; it is open source, same as NuPIC.

This is the latest paper presented at Uncertainty and Robustness workshop at ICML’19. It includes results on CIFAR-10 as well.

I have a question about this paper.
The sparse weights seem to reset a fixed subset of the weights to zero, whereas DropConnect randomly sets weights to zero in each epoch.
What is the difference between fixed and random zeroing, and why does the fixed-zero method improve robustness?