Thanks for the comments. Looks like we put in the wrong figure for 5B (it’s a really old one). Thank you for pointing that out - we will update this in a revision. Attached is the correct figure:
Dropout was attempted for MNIST (see the section "Impact of dropout"). Basically it helped a little for dense networks and hurt for sparse ones. As far as we can tell there was no overfitting - the raw test scores were always high for the configurations listed.
With GSC, we managed to get pretty good results with a smaller network. For Kaggle, though, the last few decimal places matter - we definitely did not match their best score. I'm not 100% sure why dropout had a negative effect, but note that we were using batch norm for GSC, and some people have reported that dropout does not help much when combined with batch norm.
Sparse weights are a randomized sampling of the inputs, unlike dilated convolutions, which sample the input on a fixed regular grid. As such, they should be able to pick up on a larger set of patterns (similar in spirit to the compressed sensing literature). They can also be used in regular linear layers, not just convolutions.
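To make the idea concrete, here is a minimal PyTorch sketch of a linear layer with randomly sparse weights - this is an illustration of the concept, not our actual code; the name `SparseLinear` and the `weight_sparsity` parameter are just for this example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseLinear(nn.Linear):
    """Linear layer with a fixed random sparse weight mask.

    Each output unit is connected to a random subset of the inputs,
    chosen once at construction time (a randomized sampling of the
    inputs, as opposed to the fixed grid of a dilated convolution).
    """

    def __init__(self, in_features, out_features, weight_sparsity=0.3):
        super().__init__(in_features, out_features)
        # For each output unit, keep a random subset of input connections.
        n_keep = int(round(weight_sparsity * in_features))
        mask = torch.zeros(out_features, in_features)
        for i in range(out_features):
            keep = torch.randperm(in_features)[:n_keep]
            mask[i, keep] = 1.0
        self.register_buffer("mask", mask)
        # Zero out the pruned weights up front.
        with torch.no_grad():
            self.weight *= self.mask

    def forward(self, x):
        # Re-apply the mask so pruned weights stay zero after
        # gradient updates touch the dense weight tensor.
        return F.linear(x, self.weight * self.mask, self.bias)

layer = SparseLinear(784, 128, weight_sparsity=0.3)
out = layer(torch.randn(4, 784))
```

Since the mask is independent per output unit, two units rarely see the same input subset, which is where the "larger set of patterns" intuition comes from.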
The backward pass makes k-winners behave just like ReLU: a gradient of 1 for the winners, zero everywhere else. (This was explained in the paragraph just before Boosting.) Unless we made some mistake in the code, I don't think it is adding any noise - it's the right thing to do here, analogous to ReLU.
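For anyone wanting to check the behavior, here is a minimal PyTorch sketch of that gradient rule (not the paper's code - just an illustration using a custom `autograd.Function`):

```python
import torch

class KWinners(torch.autograd.Function):
    """Keep the top-k activations per sample; zero the rest.

    Backward: gradient passes through unchanged for the winning
    units and is zero everywhere else - analogous to ReLU, whose
    gradient is 1 where the unit is active and 0 where it is not.
    """

    @staticmethod
    def forward(ctx, x, k):
        # Indices of the k largest activations in each row.
        _, idx = x.topk(k, dim=1)
        mask = torch.zeros_like(x)
        mask.scatter_(1, idx, 1.0)
        ctx.save_for_backward(mask)
        return x * mask

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        # Gradient of 1 for winners, zero everywhere else.
        return grad_output * mask, None

x = torch.randn(2, 8, requires_grad=True)
y = KWinners.apply(x, 3)
y.sum().backward()
# x.grad is exactly the winner mask: 1 at the 3 winners per row, 0 elsewhere
```

No noise is injected anywhere - the backward pass is a deterministic function of which units won in the forward pass.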
Note: I will be out for the next week, but will keep an eye on this page for any other corrections / comments. @lscheinkman and I plan to do an updated version with errata - really appreciate the feedback from the community.