How Can We Be So Dense? The Benefits of Using Highly Sparse Representations

K-winners and boosting go hand in hand, evolutionarily speaking. The minicolumns are under pressure to represent as much diverse spatial input as possible. Without homeostatic regulation (boosting), the same small percentage of winners would just keep winning. It does not help much until you add the inhibitory-neuron pressure. It’s like pressing down on a histogram to squelch the top performers, spreading patterns across the entire layer.
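To make that concrete, here is a minimal sketch of k-winner selection with homeostatic boosting (my own illustration: the exponential boost rule and the duty-cycle moving average are assumptions loosely modeled on the HTM spatial pooler, not Numenta's exact code):

import numpy as np

class BoostedKWinners:
    """Minimal k-winners with homeostatic boosting (illustrative sketch)."""

    def __init__(self, n_units, k, boost_strength=1.0, duty_cycle_period=100):
        self.k = k
        self.boost_strength = boost_strength
        self.period = duty_cycle_period
        self.duty_cycles = np.zeros(n_units)       # running fraction of wins per unit
        self.target_density = k / float(n_units)   # desired average activity

    def __call__(self, overlaps):
        # Boost units that have been winning less often than the target density.
        boost = np.exp(self.boost_strength * (self.target_density - self.duty_cycles))
        boosted = overlaps * boost

        # Keep only the k units with the largest boosted overlap.
        winners = np.argsort(boosted)[-self.k:]
        active = np.zeros_like(boosted)
        active[winners] = 1.0

        # Update the duty cycles as a moving average over the last `period` steps.
        self.duty_cycles = ((self.period - 1) * self.duty_cycles + active) / self.period
        return active

Without the boost factor the same units keep their head start; with it, units that win less often than the target density have their overlaps scaled up until activity spreads across the layer.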

3 Likes

To follow on from the previous posts: I originally thought I could get away without using k-winners or boosting of any form in my own more temporal version of HTM. Boy, was I wrong. Even with learning and a specific hex-grid / lateral-connection inhibition approach, you just end up with the same columns dominating.

Also, as a side note regarding MNIST: a long time ago I whipped up a NuPIC HTM version to solve it using just the spatial pooler, getting around 95%. I was happy until last week, when I took a quick double check and noticed I had all learning set to 0.

It seems that using the spatial pooler’s k-winners-take-all to simply pick the top 5% of columns and feed them into a KNN classifier is enough, even without learning. Ha, typical.
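For the record, the pipeline described here amounts to roughly the following (a sketch, not the original code; `encode_with_sp` stands in for whatever runs an image through the spatial pooler and returns its binary top-5% column SDR, and n_neighbors=5 is an arbitrary choice):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def sdr_features(images, encode_with_sp):
    # encode_with_sp(image) -> binary vector of winning columns (the ~5% k-winners).
    return np.array([encode_with_sp(im) for im in images])

def knn_on_sdrs(train_images, train_labels, test_images, test_labels, encode_with_sp):
    X_train = sdr_features(train_images, encode_with_sp)
    X_test = sdr_features(test_images, encode_with_sp)
    knn = KNeighborsClassifier(n_neighbors=5)   # the SDRs, not the classifier, do the real work
    knn.fit(X_train, train_labels)
    return knn.score(X_test, test_labels)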

2 Likes

You get ~95% on MNIST using template matching… Well… So what happens if you turn on learning?

1 Like

It depends on the receptive field size. If I use a practically global field (input dimensions of (28, 28) with a matching potential radius), it just screws up and the number of active columns shrinks to 5–10, where it is normally 240.

The settings I use for no learning are:

spParamTopologyWithBoostingGlobalInhibition = {
    "inputDimensions": (28, 28),
    "columnDimensions": (64, 64),
    "potentialRadius": 28,
    "potentialPct": 0.8,
    "globalInhibition": True,
    "localAreaDensity": -1,
    "numActiveColumnsPerInhArea": 240,
    "wrapAround": True,
    "stimulusThreshold": 0.0,
    "synPermInactiveDec": 0.0,
    "synPermActiveInc": 0.0,
    "synPermConnected": 0.8,
    "minPctOverlapDutyCycle": 0.001,
    "dutyCyclePeriod": 100,
    "boostStrength": 1.0,
    "seed": 7777,
    "spVerbosity": 0
}
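(For context, here is a sketch of how a dict like this is typically fed to NuPIC; the import path, the placeholder binary_image, and the surrounding code are my assumptions, not the poster's script. With synPermActiveInc and synPermInactiveDec both at 0.0, the usual Hebbian permanence updates are disabled, which is presumably why turning learning on or off made little difference.)

import numpy as np
from nupic.algorithms.spatial_pooler import SpatialPooler  # import path as in NuPIC 1.x

sp = SpatialPooler(**spParamTopologyWithBoostingGlobalInhibition)

n_columns = 64 * 64
active = np.zeros(n_columns, dtype=np.uint32)

# binary_image is a placeholder for a thresholded 28x28 MNIST digit.
sp.compute(binary_image.flatten().astype(np.uint32), False, active)
sdr = np.nonzero(active)[0]   # indices of the ~240 winning columns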

Final Results - Error: 4.27% Total Correct: 9573 Total Incorrect: 427

However, this could be wrong elsewhere in the code; I’m not sure how NuPIC handles a parameter value of 0 for the stimulus threshold, the permanence increments, etc.

It seems odd, so it’s probably something wrong on my end, but I thought it was interesting.

After messing with the concept of sparse networks when I should have been studying for my final exam, here are my conclusions (I re-implemented the paper from scratch in PyTorch):

  1. KWinner + Boosting works very well against noise on MNIST
  2. In fact it also protects the network against adversarial attacks (IFGSM and MIFGSM)

But

  1. In MNIST, a Dropout layer can also provide the same level of protection against noise and even adversarial attacks

Bumping up the difficulty, I also tried CIFAR-10 with VGG11.

  1. KWinner + Boosting provides the same level of protection against adversarial attacks and noise (that is, barely any, like 2%)
  2. The sparse network does not perform any sort of gradient masking, so a straightforward attack (a generic IFGSM sketch follows below) works.
  3. The adversarial attacks do transfer from a normal network to a sparse network.
  4. Funny thing: the adversarial examples from a network with dropout do not transfer well to a sparse network.
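For readers unfamiliar with the attack, an iterative FGSM (IFGSM) looks roughly like this in PyTorch (a generic sketch, not the code used above; the model, loss, and epsilon/alpha/step values are placeholders):

import torch
import torch.nn.functional as F

def ifgsm(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=10):
    """Generic iterative FGSM: repeatedly step along the sign of the input
    gradient while staying inside an L-infinity ball of radius epsilon."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)  # project back into the ball
            x_adv = x_adv.clamp(0.0, 1.0)                                  # keep a valid image
    return x_adv.detach()

MIFGSM adds a momentum term to the accumulated gradient; the sign step and the projection are otherwise the same.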
3 Likes

Hi @marty1885, awesome job! Is your implementation available on GitHub? We are also interested in evaluating robustness against adversarial attacks, and it seems you’ve already done some work on that. And great blog posts, by the way.

The latest paper presented at ICML Uncertainty and Robustness workshop includes some results on CIFAR-10. We’ve also run experiments on CIFAR-100.

As an update, we are currently working on dynamically changing the “sparsity map” during training (changing which weights are set to zero and frozen), which would be equivalent to learning the structure along with the weights. To that end, in our new version the connections will be allowed to grow or decay during training. As @michaelklachko correctly pointed out, in the current implementation the zeroed-out weights are randomly chosen when the model is initialized, which is far from optimal.
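As an illustration only (not Numenta's implementation), structure learning of this kind is often sketched as a periodic prune-and-regrow step on a binary weight mask, roughly:

import torch

def prune_and_regrow(weight, mask, regrow_fraction=0.1):
    # weight: dense weight tensor; mask: 0/1 float tensor of the same shape,
    # where 0 marks connections that are currently removed.
    n_active = int(mask.sum().item())
    n_swap = int(regrow_fraction * n_active)

    # Prune: drop the n_swap weakest surviving connections.
    magnitudes = (weight.abs() * mask).detach().flatten()
    magnitudes[mask.flatten() == 0] = float('inf')          # never "prune" already-dead ones
    prune_idx = torch.topk(magnitudes, n_swap, largest=False).indices

    # Regrow: re-enable the same number of currently dead connections at random.
    dead_idx = torch.nonzero(mask.flatten() == 0).squeeze(1)
    grow_idx = dead_idx[torch.randperm(dead_idx.numel())[:n_swap]]

    new_mask = mask.flatten().clone()
    new_mask[prune_idx] = 0.0
    new_mask[grow_idx] = 1.0
    with torch.no_grad():
        weight.view(-1)[grow_idx] = 0.0   # regrown connections start from zero
    return new_mask.view_as(mask)

In this sketch the regrow choice is random; smarter criteria (e.g. gradient magnitude) are an obvious variation.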

Please keep the contributions coming, I feel they are extremely relevant and helpful to our ongoing research on the topic and would love to discuss collaborations.

Some side notes:

A model with a high degree of weight decay, whatever norm is used, can also push the weights to zero during training and promote sparsity - however, we are looking at structure learning methods which are preferably not gradient based and do not rely on the main loss function. Dropout techniques (such as weight dropout) are powerful regularizers and a related topic, but not the same. In dropout you still need to store all the weights, so memory requirements are the same as regular dense models (calculations can be done on sparse matrices, however you still need the full weight matrix in memory). But most importantly, sparsity in dropout only happens during training and not at inference time.

Research related to pruning is also very relevant. The main difference is that pruning is usually done after a full round of training - the models are first trained as dense models, and then pruned. Our end goal with this work is to have a sparse network during and at the end of training, leading to networks that can do faster predictions with less power requirement.

3 Likes

Thanks for the WTA AE reference, @michaelklachko, I will go over the paper. You are correct, it would be really interesting to see that comparison.

2 Likes

Hang on. It is a hot mess right now. I’ll post it as soon as I have it cleaned up a bit.

2 Likes

Hey @marty1885, by ‘kwinner’ do you mean the orig. WTA AE model (i.e. both spatial and temporal sparsity)? By the way, another good method to increase noise robustness is gradient regularization: https://arxiv.org/abs/1904.01705

I’m going to take a look at this in the next few days (impact of sparsity on noise tolerance and power consumption).

@lucasosouza, do you have any more details on what you’re working on? Perhaps we can collaborate.

2 Likes

@michaelklachko I ensured both lifetime and spatial sparsity. But unlike the WTA AE’s solution, I enforce the sparsity using boosting, which ensures sparsity across batches.

@lucasosouza I’m still working on cleaning up the code. On what subject could we collaborate? Maybe with some knowledge and know-how from Numenta we can solve the adversarial attack problem once and for all.

2 Likes

Would be great to collaborate, @michaelklachko @marty1885. Let me write down some thoughts and the research directions we are going in with this, and perhaps we can put together a joint plan.

@marty1885 Regarding your dropout vs k-winners + boosting suggestion: the only issue I see is that dropout models increase regularization and improve accuracy in the testing set, while our model does not. We specifically show it improves accuracy when noise is added, but not in the regular test set.

If we can get to a working model that performs better than the dense + dropout model in the regular test set, then it could be advertised as a replacement for dropout. I know you have done a lot of experiments in this, I’m curious to hear your thoughts.

2 Likes

We would love to get some feedback on the nupic.torch implementation; it is open source, same as nupic.

This is the latest paper presented at Uncertainty and Robustness workshop at ICML’19. It includes results on CIFAR-10 as well.

I have a question about this paper.
The sparse_weight layer seems to set a fixed part of the weights to zero.
Dropconnect randomly sets weights to zero in each epoch.
What is the difference between this fixed and random zeroing, and why does the fixed-zero method improve the robustness?

Hi @beginner

In dropconnect, at each round during training, a random set of weights is set to zero. During inference, all weights are used - it is still a dense model, since all connections have a weight attributed to them. A common interpretation of dropout techniques (but not the only one) is that they allow you to learn several different models with one single network, so you are actually learning an ensemble of smaller networks that share some parameters.

In the paper you cited, weights are sparse at initialization and at inference. But most importantly, what leads to robustness is not the sparse weights alone, but the combination of sparse weights and sparse activation functions (k-winners with boosting).
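To make the contrast concrete, here is a minimal sketch of the two masking schemes (my own illustration, not code from the paper or nupic.torch):

import torch

def dropconnect_mask(weight, drop_prob=0.5):
    # DropConnect: a NEW random mask is sampled at every training step,
    # and at inference time no mask is applied at all (the model stays dense).
    return (torch.rand_like(weight) > drop_prob).float()

def fixed_sparse_mask(weight_shape, sparsity=0.5):
    # Static sparse weights: one random mask is drawn once at initialization,
    # then kept fixed for both training and inference.
    return (torch.rand(weight_shape) > sparsity).float()

# In both cases the forward pass is y = x @ (weight * mask).t() + bias;
# what differs is when the mask is drawn and whether it is used at inference.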

4 Likes

The weighted sum, the cosine angle, and the central limit theorem.
Where the angle between the weights and the input is smallest, you get the maximum output for a fixed-length input vector, and the central limit theorem applies. To get the same output at a larger angle, all the weights have to be multiplied by a scalar greater than one. Then, when you go to calculate the central-limit-theorem noise reduction, you find the variance has proportionally increased.

If you deliberately pick the weighted sum with the maximum output, then in a certain way you are doing the best you can with regard to the error correction the central limit theorem can provide.
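One way to formalize that argument (my own sketch, assuming i.i.d. input noise of variance $\sigma^2$): for an input $x$ at angle $\theta$ from the weight vector $w$,

$$ y = w \cdot x = \lVert w \rVert \, \lVert x \rVert \cos\theta. $$

To hit a fixed target response $r$, the weights must satisfy $\lVert w \rVert = r / (\lVert x \rVert \cos\theta)$, so a larger angle forces a longer weight vector. Noise $n$ passed through the same weights then has variance

$$ \operatorname{Var}(w \cdot n) = \sigma^2 \lVert w \rVert^2 = \frac{\sigma^2 r^2}{\lVert x \rVert^2 \cos^2\theta}, $$

so the averaging benefit of the central limit theorem is progressively cancelled as $\lVert w \rVert$ grows.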

Isn’t that slightly tricky to understand? It explains (if you work out the details) why the more information you store in a weighted sum, the less effect the central limit theorem has.
It is brutal that there is no tutorial about this on the internet, which rather suggests that AI researchers simply don’t know or understand it.
Actually, I’d like to read such a tutorial myself, since I only have the outline.

So as you store more <pattern,response> items in a weighted sum, the vector length of the weights increases while the average magnitude of the response scalars stays about the same. The central-limit-theorem reduction in noise is increasingly cancelled out by the increase in the weight vector length. Also, the average angle of the patterns increases away from a central point.
One interesting thing is that learning many <pattern,response> items where the patterns are close to each other, and the response scalars are also close, only slightly increases the length of the weight vector. This is rather positive for learning decision regions and paths in higher-dimensional space, as opposed to just point responses.
Of course the weighted sum has its usual problem of strong additive spurious responses; however, higher-dimensional space is a big place and almost all vectors are close to orthogonal. The linear separability issue can also be dealt with by placing a nonlinearity before the weighted sum.

@lucasosouza BTW, this is my C++ implementation of the paper.

I’ll upload the PyTorch version soon.

2 Likes

This paper that came out a few days ago may be of interest to @subutai: https://arxiv.org/abs/1907.04840

3 Likes

You could imagine that if all the weighted sums in a layer were wired up to a common input vector, many of them would be redundant, unable to find any worthwhile different thing that another wasn’t already doing.
The effectiveness of sparsity in such a situation would simply be a reflection of that redundancy.
Just a thought.

A weighted nonlinearity is a weak learner, and you can sum weak learners together to get a strong learner. In a conventional artificial neural network there are only n nonlinear terms per layer, yet there are n squared weight terms.
How much better would it be to pair each weight with its own nonlinear term?
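As a sketch of what that could look like (my interpretation of the suggestion, not an established layer), compare a standard layer with one that applies a nonlinearity to every weight-input product before summing:

import torch

def standard_layer(x, W, b):
    # n nonlinear terms per layer: one ReLU per output unit, applied after the sum.
    return torch.relu(x @ W.t() + b)

def per_weight_nonlinearity_layer(x, W, b):
    # n^2 nonlinear terms: a nonlinearity on each weight-input product before the sum
    # (ReLU used here purely as an example choice).
    # x: (batch, n_in), W: (n_out, n_in), b: (n_out,)
    products = x.unsqueeze(1) * W.unsqueeze(0)   # (batch, n_out, n_in)
    return torch.relu(products).sum(dim=-1) + b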

Well, yeah, I don’t know how much better. It could be anywhere from “not worth the extra computational effort” to “you’ve been doing it all wrong for decades”.