How Can We Be So Dense? The Benefits of Using Highly Sparse Representations

This paper, about using aspects of Spatial Pooling in deep learning networks, is now available on arXiv:

@subutai also talks about this work in the last half of this presentation at Microsoft Research.


Very exciting paper! Good job guys!


Such humor in the title of a scientific paper. I love it. :-D.

Thanks @rhyolight for posting.
And huge thanks to @subutai and @lscheinkman for writing it! Congratulations on the publication.


ArXiv is actually a pre-print server, so no congrats are in order (pretty much any paper can be published there). But thank you anyway! :slight_smile:


I made a quick partial implementation in tiny-dnn (with only k-winner) and used it to classify MNIST. Its noise resistance surpasses batch norm and dropout! Amazing! Really. wow.

Accuracy (%) at each noise level:

| Noise | Raw   | Dropout | Batch Norm | KWinner |
|-------|-------|---------|------------|---------|
| 0.00  | 98.36 | 96.62   | 92.98      | 94.38   |
| 0.05  | 93.97 | 94.66   | 92.52      | 92.87   |
| 0.10  | 83.78 | 90.95   | 90.30      | 89.78   |
| 0.20  | 64.13 | 73.79   | 79.98      | 80.23   |
| 0.30  | 51.28 | 51.52   | 61.45      | 69.60   |
| 0.40  | 42.74 | 31.82   | 37.54      | 62.41   |
| 0.50  | 37.26 | 21.09   | 18.53      | 54.75   |
| 0.60  | 31.93 | 16.40   | 11.73      | 46.97   |
| 0.70  | 27.77 | 14.63   | 10.26      | 37.71   |
| 0.80  | 20.88 | 13.52   | 10.04      | 25.66   |

Time to clean the code up and make a blog post :smiley:
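For reference, the core of the k-winner forward pass is tiny. Here is a NumPy sketch of the idea (not my actual tiny-dnn code; the shapes and the value of k are just illustrative):

```python
import numpy as np

def k_winners(x, k):
    """Keep the k largest activations in each row; zero out the rest."""
    out = np.zeros_like(x)
    # indices of the top-k entries along the feature axis
    top_k = np.argpartition(x, -k, axis=1)[:, -k:]
    rows = np.arange(x.shape[0])[:, None]
    out[rows, top_k] = x[rows, top_k]
    return out

x = np.array([[0.1, 0.9, 0.4, 0.7],
              [0.5, 0.2, 0.8, 0.3]])
print(k_winners(x, 2))
```

Each row ends up with exactly k nonzero activations, which is what gives the layer its fixed sparsity.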


Nice results. Thanks

Just looked at the paper, now looking at the code. A few things I noticed so far:

  1. Table 1 compares Dense CNN-2 to Sparse CNN-2; however, Table 3 shows that Sparse CNN-2 has 64 filters in the second layer, while Dense CNN-2 has 30. Also, Fig. 5B shows that Dense CNN-1 performs better than Dense CNN-2, which is strange.

  2. The Dense CNN-2 config would severely overfit on non-augmented MNIST, making it significantly less robust to noise in the test set. I’m curious why dropout, or any other regularization method (e.g. an L1 or L2 penalty), was not used for the MNIST results.

  3. GSC results: if the Kaggle top results were 97.0 - 97.5% using large models (ResNets and VGG), then how could you achieve similar results (96.5 - 97.2%) using a small and simple 2-layer CNN? Also, it’s a bit strange that dropout had a negative effect on the noise score for the Dense CNN-2 model in Table 2.

  4. The sparse weights method looks like a form of dilated convolution. Are the weights to be zeroed chosen randomly? It seems so, unless I’m missing something.

  5. Last but not least: in the backward pass for k_winner, why do you override the gradients like that? This introduces a lot of noise into the training process, so it’s possible that this alone makes the sparse models more noise-tolerant. I’m very curious to understand the reasoning there.


Wait a sec. It seems I didn’t read the paper carefully enough and missed that part. In my partial implementation, I did proper backpropagation. After some tests, it seems that overriding the gradients indeed introduces a lot of noise (accuracy dropped from 94.38% to 77% at best).

Here it is.


@marty1885 excellent job, especially with tiny_dnn!


Thanks for the comments. Looks like we put in the wrong figure for 5B (it’s a really old one). Thank you for pointing that out - we will update this in a revision. Attached is the correct figure:

Dropout was attempted for MNIST (see section on “Impact of dropout”). Basically it helped a little for Dense, and hurt for sparse networks. As far as we can tell there was no overfitting - the raw test scores were always high for the configurations listed.

With GSC, we managed to get pretty good results with a smaller network. However, for Kaggle, the last few decimal points are important - we definitely did not get to their best score. I’m not 100% sure why dropout had a negative effect, but note that we were using batch norm for GSC. Some people have reported that dropout does not help too much if batch norm is used.

Sparse weights are a randomized sampling of the inputs, unlike dilated convolutions. As such, it should be able to pick up on a larger set of patterns (similar to the compressed sensing literature). Also, we can use it for regular linear layers, not just convolutions.
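A rough sketch of the idea in NumPy (illustrative only, not our actual code; `density` meaning the fraction of weights kept nonzero is my naming here):

```python
import numpy as np

def sparsify_weights(w, density, rng):
    """Zero out a random subset of a layer's weights.

    The mask is sampled once at initialization and then stays fixed
    for all of training and inference, rather than being resampled
    on every forward pass.
    """
    mask = rng.random(w.shape) < density  # True = keep this weight
    return w * mask, mask

rng = np.random.default_rng(42)
w = rng.standard_normal((4, 8))           # e.g. a small linear layer
w_sparse, mask = sparsify_weights(w, 0.5, rng)
```

Because the mask is fixed, each unit permanently sees only a random subsample of its inputs, which is the "randomized sampling" mentioned above.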

The backward pass is to make k-winners behave just like ReLU. A gradient of 1 for the winners, zero everywhere else. (This was explained in the paragraph just before Boosting.) Unless we made some mistake in the code, I don’t think it is adding any noise - it’s the right thing to do here, analogous to ReLU.
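Concretely, the two passes can be sketched like this (an illustrative NumPy version of the idea, not our actual PyTorch code):

```python
import numpy as np

def k_winners_forward(x, k):
    """Keep the top-k activations per row; remember which units won."""
    idx = np.argpartition(x, -k, axis=1)[:, -k:]
    mask = np.zeros(x.shape, dtype=bool)
    mask[np.arange(x.shape[0])[:, None], idx] = True
    return x * mask, mask

def k_winners_backward(grad_output, mask):
    """Gradient of 1 for the winners, 0 everywhere else (like ReLU)."""
    return grad_output * mask
```

The winner mask saved in the forward pass plays the same role as the positive-input mask in ReLU's backward pass.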

Note: I will be out for the next week, but will keep an eye on this page for any other corrections / comments. @lscheinkman and I plan to do an updated version with errata - really appreciate the feedback from the community.


Wow, that’s a great blog post! Thanks for doing this so quickly!

One quick note: you say that “In the paper the gradient are the sign of the sum of the not-discarded gradients.” This is not quite correct. The gradient for k-winners should be 1 for the winners, 0 everywhere else.


I haven’t read the paper yet, but reading through this thread, it sounds like the k-winner layer Numenta introduced is not differentiable, which would introduce critical noise into the gradients of DL networks.
Am I interpreting this correctly?

  1. k-winner is an approximation of the various inhibitory interneurons, which have a definite spatial extent. You could play with some form of area integration if you were so inclined, but set membership and intersections may be a better descriptive tool.
  2. Sparse binary representations in general do not lend themselves to calculus tools. Did I mention set theory?

I just looked at the code again with fresh eyes, and yes indeed the backward pass looks correct. My mistake. Not sure what @marty1885 meant regarding the gradient calculation.

Regarding sparse weights, if it’s “a randomized sampling of the inputs” then how is it different from dropconnect [1]?

To summarize the paper, I see three main ideas:

  1. Dropconnect: set random weights to zero (does not help for convolutional layers).

  2. Spatial sparsity: select k largest activations from each layer, setting all others to zero, at each forward pass.

  3. Lifetime sparsity: keep track of each neuron activity and boost activations which have been frequently set to zero (making it more likely to get selected in step 2), so that each neuron is equally active, on average.

Is this correct?
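For concreteness, here is how I’d sketch steps 2 and 3 together (NumPy; the exponential boost rule, parameter names, and constants are my reading of the paper, so treat the details as guesses):

```python
import numpy as np

def boosted_k_winners(x, duty_cycle, k, target_density, beta=5.0, alpha=0.01):
    """Spatial sparsity (top-k) with lifetime sparsity (boosting).

    Units whose duty cycle (running win frequency) is below the target
    density get exponentially boosted, so they are more likely to be
    selected; the layer's output keeps the raw, unboosted activations.
    """
    boost = np.exp(beta * (target_density - duty_cycle))
    scores = x * boost                       # winners chosen on boosted values
    idx = np.argpartition(scores, -k, axis=1)[:, -k:]
    mask = np.zeros(x.shape, dtype=bool)
    mask[np.arange(x.shape[0])[:, None], idx] = True
    out = x * mask                           # output keeps raw activations
    duty_cycle = (1 - alpha) * duty_cycle + alpha * mask.mean(axis=0)
    return out, duty_cycle
```

With this rule, a unit that has rarely won lately can out-compete units with larger raw activations, which is what equalizes activity across neurons over time.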

Spatial and lifetime sparsities have been combined in WTA autoencoder paper [2] where the lifetime sparsity was enforced by selecting k largest activations of a particular position across time (minibatch), and the spatial sparsity was enforced by selecting k largest activations across space (separately from each feature map).

Your main contribution seems to be a novel method to enforce lifetime sparsity, but you haven’t demonstrated that it leads to any improvement over the original method. To be fair, you don’t claim it to be superior. Is it?



I believe the claim is superior performance with a noisy data-set.
See figure 5B in post 12 above for an example.

Superior to WTA AE? Where is it mentioned?

I think the main contribution is that sparsity can be used for classification, and that sparsity can provide extra resistance against noise.

I have been messing with it lately, and it seems to provide better regularization than dropout.

Never mind, it was my confusion between how Torch works and how tiny-dnn works.