How Can We Be So Dense? The Benefits of Using Highly Sparse Representations

Thanks for the comments. Looks like we put in the wrong figure for 5B (it’s a really old one). Thank you for pointing that out - we will update this in a revision. Attached is the correct figure:

Dropout was attempted for MNIST (see the section on “Impact of dropout”). Basically it helped a little for dense networks and hurt sparse networks. As far as we can tell there was no overfitting - the raw test scores were always high for the configurations listed.

With GSC, we managed to get pretty good results with a smaller network. However, for Kaggle, the last few decimal points are important - we definitely did not get to their best score. I’m not 100% sure why dropout had a negative effect, but note that we were using batch norm for GSC. Some people have reported that dropout does not help too much if batch norm is used.

Sparse weights are a randomized sampling of the inputs, unlike dilated convolutions. As such, it should be able to pick up on a larger set of patterns (similar to the compressed sensing literature). Also, we can use it for regular linear layers, not just convolutions.
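As a rough sketch of what we mean by a fixed random sampling of the inputs (illustrative numpy only, not the actual implementation; function and parameter names here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_weight_mask(n_out, n_in, weight_sparsity=0.4):
    """Fixed binary mask: each output unit keeps a random
    `weight_sparsity` fraction of its input connections."""
    n_keep = int(round(weight_sparsity * n_in))
    mask = np.zeros((n_out, n_in), dtype=np.float32)
    for i in range(n_out):
        keep = rng.choice(n_in, size=n_keep, replace=False)
        mask[i, keep] = 1.0
    return mask

mask = sparse_weight_mask(4, 10)  # each of 4 units samples 4 of 10 inputs
```

The key difference from dropout-style methods is that the mask is sampled once at initialization and then stays fixed for the life of the network.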

The backward pass is to make k-winners behave just like ReLU. A gradient of 1 for the winners, zero everywhere else. (This was explained in the paragraph just before Boosting.) Unless we made some mistake in the code, I don’t think it is adding any noise - it’s the right thing to do here, analogous to ReLU.
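In numpy terms, the intended behavior is just this (a sketch for illustration, not our actual Torch code):

```python
import numpy as np

def kwinners_forward(x, k):
    """Keep the k largest activations per row, zero the rest."""
    mask = np.zeros_like(x)
    winners = np.argpartition(x, -k, axis=1)[:, -k:]
    np.put_along_axis(mask, winners, 1.0, axis=1)
    return x * mask, mask

def kwinners_backward(grad_out, mask):
    """Gradient of 1 for the winners, 0 everywhere else -- the same
    straight-through treatment ReLU gives its positive inputs."""
    return grad_out * mask

x = np.array([[0.1, 0.9, 0.3, 0.7]])
y, m = kwinners_forward(x, 2)          # keeps 0.9 and 0.7
g = kwinners_backward(np.ones_like(x), m)
```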

Note: I will be out for the next week, but will keep an eye on this page for any other corrections / comments. @lscheinkman and I plan to do an updated version with errata - really appreciate the feedback from the community.


Wow, that’s a great blog post! Thanks for doing this so quickly!

One quick note: you say that “In the paper the gradient are the sign of the sum of the not-discarded gradients.” This is not quite correct. The gradient for k-winners should be 1 for the winners, 0 everywhere else.


I haven’t read the paper yet, but reading through this thread, it sounds like the k-winner layer Numenta introduced is not differentiable, which would introduce critical noise into the gradients of DL networks.
Am I interpreting this correctly?

  1. k-winner is an approximation of the various inhibitory interneurons, which have a definite spatial extent. You could play with some form of area integration if you were so inclined, but set membership and intersections may be a better descriptive tool.
  2. Sparse binary in general does not lend itself to calculus tools. Did I mention Set Theory?

I just looked at the code again with fresh eyes, and yes indeed the backward pass looks correct. My mistake. Not sure what @marty1885 meant regarding the gradient calculation.

Regarding sparse weights, if it’s “a randomized sampling of the inputs” then how is it different from dropconnect [1]?

To summarize the paper, I see three main ideas:

  1. Dropconnect: set random weights to zero (does not help for convolutional layers).

  2. Spatial sparsity: select k largest activations from each layer, setting all others to zero, at each forward pass.

  3. Lifetime sparsity: keep track of each neuron activity and boost activations which have been frequently set to zero (making it more likely to get selected in step 2), so that each neuron is equally active, on average.

Is this correct?
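In toy numpy form, my reading of the combined forward pass would be something like this (all names are mine, not the paper's):

```python
import numpy as np

def sparse_layer_forward(x, W, weight_mask, duty_cycle, k, boost_strength=1.5):
    """Toy layer combining the three ideas above: (1) fixed sparse
    weights, (2) k-winner spatial sparsity, (3) duty-cycle boosting
    for lifetime sparsity."""
    a = x @ (W * weight_mask).T                     # (1) sparse weights
    target = k / a.shape[1]
    boost = np.exp(boost_strength * (target - duty_cycle))
    winners = np.argpartition(a * boost, -k, axis=1)[:, -k:]  # (3) boosted competition
    out = np.zeros_like(a)
    np.put_along_axis(out, winners,
                      np.take_along_axis(a, winners, axis=1),
                      axis=1)                       # (2) keep only the k winners
    return out

x = np.array([[1.0, 2.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 units, 2 inputs
out = sparse_layer_forward(x, W, np.ones_like(W), np.zeros(3), k=1)
# only the strongest unit (activation 1 + 2 = 3.0) stays active
```

Note that boosting here only influences which units win; the unboosted activations are what pass through.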

Spatial and lifetime sparsities have been combined in WTA autoencoder paper [2] where the lifetime sparsity was enforced by selecting k largest activations of a particular position across time (minibatch), and the spatial sparsity was enforced by selecting k largest activations across space (separately from each feature map).

Your main contribution seems to be a novel method to enforce lifetime sparsity, but you haven’t demonstrated that it leads to any improvement over the original method. To be fair, you don’t claim it to be superior. Is it?



I believe the claim is superior performance with a noisy data-set.
See figure 5B in post 12 above for an example.

Superior to WTA AE? Where is it mentioned?

I think the main contribution is that sparsity can be used for classification, and that sparsity can provide extra resistance against noise.

I have been messing with it lately and it seems to provide better regularization performance than dropout.

Never mind - it was my confusion between how Torch works and how tiny-dnn works.

WTA AE paper is about classification.

Sorry for the confusion. It’s true that WTA AE is used for classification, but it’s done by feeding the encoded vector into an SVM and then using the SVM to classify the image (if I read the paper correctly). The sparsity part is never part of the classifier. In this paper, however, the approach is end-to-end and sparsity is part of the classifier itself.

I actually implemented WTA AE a couple of years ago, when I was evaluating various autoencoders for feature extraction on a custom dataset. First of all, the type of classifier applied to the vector of top-level features usually does not matter; what matters is the quality of the features. I used a linear classifier in my implementation, and it didn’t make much difference compared to an SVM. I also trained it end-to-end (rather than one layer at a time, like they did in the paper); it trained just fine and, again, didn’t make much difference.

The sparsity is used during training so that the network learns to extract better features. During testing we remove the sparsity constraints, because they are not needed and don’t make sense if we want to do classification. By the way, in Numenta’s paper they also reduce sparsity during testing to 50%.

My point is it’s not clear if Numenta’s modification leads to better noise robustness (or better noise-free classification accuracy) because it has not been tested against the original method. It might be superior, but we simply don’t know at this point. I might test it when I have time, but this really should have been done in the paper.


After further messing around, KWinner is by far the best solution against noise without training on noisy data. KWinner is also effective against all sorts of noise: setting random pixels to random values, adding random values to pixels, setting random pixels to 0.5, etc. In fact, in most cases the raw network without regularization performs better against noise than networks trained with any regularization besides KWinner + boosting.

For example, corrupting images with Gaussian noise: image_new = clamp(image + 0.8*gaussian_noise, 0, 1)

Accuracy:

| Raw | Dropout | Batch Norm | KWinner | KWinner + boosting |
| --- | --- | --- | --- | --- |
| 61.21% | 35.41% | 44.21% | 45.18% | 69.67% |
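The corruption I used is just the formula above; in numpy terms (a sketch of my test setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(image, noise_level=0.8):
    """Additive Gaussian noise, clamped back to [0, 1]:
    image_new = clamp(image + noise_level * gaussian_noise, 0, 1)."""
    noise = rng.standard_normal(image.shape)
    return np.clip(image + noise_level * noise, 0.0, 1.0)

img = rng.random((28, 28))   # stand-in for a normalized MNIST digit
noisy = corrupt(img)
```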

@marty1885 maybe a contractive AE could perform better?

Maybe, maybe not. I’m using KWinner in a classifier instead of in an AE (as a replacement for dropout/batchnorm). I’ll investigate.

I wrote a follow-up to my previous post. Hopefully it helps someone.
I never thought boosting would be so powerful in NNs. Like… jaw-dropping amazing. This is how well different regularization methods fare against Gaussian noise (more realistic than my previous test of setting random pixels to random values).

Accuracy (%) by noise level:

| Noise | Raw | Dropout | Batch Norm | KWinner | KWinner + boosting |
| --- | --- | --- | --- | --- | --- |
| 0.0 | 98.91 | 98.11 | 97.79 | 98.47 | 98.22 |
| 0.05 | 94.58 | 76.99 | 62.23 | 71.35 | 95.68 |
| 0.1 | 94.33 | 76.97 | 61.77 | 70.89 | 95.55 |
| 0.2 | 93.4 | 76.39 | 60.6 | 68.28 | 95.35 |
| 0.3 | 91.29 | 75.69 | 58.73 | 66.03 | 94.81 |
| 0.4 | 88.58 | 74.25 | 56.49 | 62.31 | 94.15 |
| 0.5 | 85.57 | 71.75 | 53.05 | 57.43 | 92.66 |
| 0.6 | 82.28 | 68.25 | 48.23 | 51.25 | 88.96 |
| 0.7 | 77.52 | 63.22 | 42.65 | 43.39 | 82.41 |
| 0.8 | 71.72 | 53.76 | 32.94 | 35.75 | 74.18 |

K-winners and boosting go hand in hand, evolutionarily speaking. The minicolumns are under pressure to represent as much diverse spatial input as possible. Without homeostatic regulation (boosting), the same small percentage of winners will just keep winning. Boosting does not help much until you add the inhibitory-neuron pressure. It’s like pressing down a histogram to squelch the top performers, spreading patterns across the entire layer.
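The homeostatic rule is roughly what nupic's spatial pooler does: an exponential boost driven by running duty cycles. A toy sketch (function names and defaults here are mine, not nupic's API):

```python
import numpy as np

def update_duty_cycle(duty_cycle, active, period=1000):
    """Exponential moving average of how often each unit wins."""
    return duty_cycle + (active.astype(float) - duty_cycle) / period

def boost_factors(duty_cycle, target_density, strength=1.5):
    """Units that win less often than the target get boosted above 1;
    chronic winners get pushed below 1 -- the pressed-down histogram."""
    return np.exp(strength * (target_density - duty_cycle))

# A silent unit, an on-target unit, and a chronic winner:
b = boost_factors(np.array([0.0, 0.02, 0.2]), target_density=0.02)
```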


To follow on from the previous posts: I originally thought you could get away without using k-winner or boosting of any form in my own, more temporal version of HTM. Boy, was I wrong. Even with learning and a specific hex grid / lateral-connection inhibition approach, you just end up with the same columns dominating.

Also, as a side note regarding MNIST: a long time ago I whipped up a nupic HTM version to solve it using just the spatial pooler, getting around 95%. I was happy until last week, when I took a quick double-check and noticed I had all learning set to 0.

It seems that using the spatial pooler’s k-winner-take-all to simply pick the top 5% of columns and feed them into a KNN is enough even without learning. Ha, typical.
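For reference, the no-learning pipeline I mean is basically this (toy numpy sketch, not my actual nupic code):

```python
import numpy as np

def top_k_sdr(overlaps, k):
    """Binary SDR of the k columns with the highest overlap scores
    (the spatial pooler's k-winner-take-all step)."""
    sdr = np.zeros(overlaps.shape, dtype=np.int8)
    sdr[np.argsort(overlaps)[-k:]] = 1
    return sdr

def knn_predict(train_sdrs, train_labels, query_sdr):
    """1-nearest-neighbour by SDR overlap (count of shared active bits)."""
    shared = train_sdrs @ query_sdr
    return train_labels[int(np.argmax(shared))]

train_sdrs = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=np.int8)
train_labels = np.array([7, 3])
query = top_k_sdr(np.array([0.9, 0.8, 0.1, 0.2]), k=2)  # -> bits 0 and 1
pred = knn_predict(train_sdrs, train_labels, query)
```

With frozen permanences the overlaps are just a fixed random projection of the input, which is why this works at all: it is essentially template matching in SDR space.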


You get ~95% on MNIST using template-and-match… Well… So what happens if you turn on learning?


It depends on the receptive field size. If I use a practically global size, input dimensions (28, 28) == potential radius (28, 28), it just screws up and the number of active columns shrinks down to 5-10, where it is normally 240.

The settings I use for no learning are:

spParamTopologyWithBoostingGlobalInhibition = {
    "inputDimensions": (28, 28),
    "columnDimensions": (64, 64),
    "potentialRadius": 28,
    "potentialPct": 0.8,
    "globalInhibition": True,
    "localAreaDensity": -1,
    "numActiveColumnsPerInhArea": 240,
    "wrapAround": True,
    "stimulusThreshold": 0.0,
    "synPermInactiveDec": 0.0,
    "synPermActiveInc": 0.0,
    "synPermConnected": 0.8,
    "minPctOverlapDutyCycle": 0.001,
    "dutyCyclePeriod": 100,
    "boostStrength": 1.0,
    "seed": 7777,
    "spVerbosity": 0
}

Final Results - Error: 4.27% Total Correct: 9573 Total Incorrect: 427

However, this could be wrong elsewhere in the code; I’m not sure how nupic handles a param value of 0 for stimulus threshold, increments, etc.

Seems odd, so it’s probably something wrong on my end, but I thought it was interesting.