How Can We Be So Dense? The Benefits of Using Highly Sparse Representations

Hi @beginner

In DropConnect, a random set of weights is set to zero at each training step. During inference, all weights are used, so it is still a dense model: every connection has a weight attributed to it. A common interpretation of dropout techniques (though not the only one) is that they let you learn several different models with a single network, so you are actually learning an ensemble of smaller networks that share some parameters.
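
To make the distinction concrete, here is a minimal PyTorch sketch of a DropConnect-style layer (my own illustration, not from either paper; the usual 1/(1−p) rescaling is omitted to keep it short). The weight mask is resampled every forward pass during training and dropped at inference, so the model you actually run is still dense:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropConnectLinear(nn.Module):
    """Linear layer that zeroes a random subset of weights each training step."""
    def __init__(self, in_features, out_features, drop_prob=0.5):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.training:
            # Sample a fresh binary mask over the weights every forward pass.
            mask = (torch.rand_like(self.linear.weight) >= self.drop_prob).float()
            weight = self.linear.weight * mask
        else:
            # Inference uses all weights: the model is still dense.
            weight = self.linear.weight
        return F.linear(x, weight, self.linear.bias)
```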

In the paper you cited, weights are sparse at initialization and at inference. Most importantly, what leads to robustness is not the sparse weights alone, but the combination of sparse weights and sparse activations (k-winners with boosting).
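
As a rough sketch of what a k-winners activation with boosting looks like (simplified and written by me, not Numenta's reference code; the exact boosting formula and duty-cycle bookkeeping in the paper differ in the details):

```python
import torch

def k_winners(x, duty_cycle, k, boost_strength=1.0, target_density=None):
    """Keep the k most active units per sample, boosted by how rarely they fire.

    x: (batch, n) pre-activations; duty_cycle: (n,) running firing frequency.
    """
    if target_density is None:
        target_density = k / x.shape[1]
    # Units that have fired less often than their target get boosted upward.
    boost = torch.exp(boost_strength * (target_density - duty_cycle))
    boosted = x * boost
    # Indices of the k largest boosted activations for each sample.
    top = boosted.topk(k, dim=1).indices
    mask = torch.zeros_like(x).scatter_(1, top, 1.0)
    # The unboosted activations pass through only at the winning positions.
    return x * mask
```

The paper's actual layer also updates the duty cycle after each batch, which is discussed further down this thread.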

4 Likes

The weighted sum, cosine angle and the central limit theorem.
When the angle between the weight vector and the input is smallest, you get the maximum output for a fixed-length input vector, and the central limit theorem applies. To get the same output at a larger angle, all the weights have to be multiplied by a scalar greater than one. Then, when you go to apply the central limit theorem, you find the output variance has increased in proportion.
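
A quick numerical illustration of that point (my own toy example, assuming a unit-length input and iid Gaussian input noise): fix the clean output at 1.0, put the weight vector at increasing angles to the input, rescale, and watch the output noise variance grow with the squared weight length:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.normal(size=n)
x /= np.linalg.norm(x)          # unit-length input pattern

for angle_deg in (5, 30, 60, 85):
    theta = np.radians(angle_deg)
    # Build a weight vector at the requested angle to x.
    r = rng.normal(size=n)
    r -= r @ x * x              # component orthogonal to x
    r /= np.linalg.norm(r)
    w = np.cos(theta) * x + np.sin(theta) * r
    # Rescale so the clean response w.x is exactly 1.0.
    w /= w @ x
    # Output variance for iid noise of variance sigma^2 is sigma^2 * ||w||^2.
    noise = rng.normal(scale=0.1, size=(10000, n))
    out = (x + noise) @ w
    print(angle_deg, np.linalg.norm(w), out.var())
```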

If you deliberately pick the weighted sum with the maximum output then, in a certain way, you are doing the best that you can with regard to the error correction the central limit theorem can provide.

Isn’t that slightly tricky to understand? It explains (if you work out the details) why the more information you store in a weighted sum, the less effect the central limit theorem has.
It is brutal that there is no tutorial about this on the internet, which rather suggests that AI researchers simply don’t know or understand it.
Actually, I’d like to read such a tutorial myself, since I only have the outline.

So as you store more <pattern,response> items in a weighted sum, the vector length of the weights increases while the average magnitude of the response scalars remains about the same. The central limit theorem reduction in noise is increasingly cancelled out by the increase in the weight vector length. Also, the average angle of the patterns away from a central point increases.
One interesting thing is that learning many <pattern,response> items where the patterns are close to each other, and the response scalars are also similar, only slightly increases the length of the weight vector. This is rather positive for learning decision regions and paths in higher-dimensional space, as opposed to just point responses.
Of course the weighted sum has its normal problems of strong additive spurious responses; however, higher-dimensional space is a big place and almost all vectors are orthogonal. And the linear separability issue can be dealt with by placing nonlinearity before the weighted sum.
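
Here is a small numpy sketch of that trade-off (my own construction, not from any paper): store an increasing number of random <pattern,response> items in one weighted sum via least squares and watch the weight vector length, and with it the recall error under input noise, grow:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512                               # dimension of the weighted sum

for m in (8, 64, 256, 448):           # number of <pattern,response> items stored
    patterns = rng.normal(size=(m, n)) / np.sqrt(n)   # roughly unit-length patterns
    responses = rng.normal(size=m)                     # target scalars, magnitude ~1
    # Minimum-norm weights that reproduce every stored response exactly.
    w, *_ = np.linalg.lstsq(patterns, responses, rcond=None)
    # Recall one stored pattern with additive input noise.
    noisy = patterns[0] + rng.normal(scale=0.05, size=n)
    print(m, np.linalg.norm(w), abs(noisy @ w - responses[0]))
```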

@lucasosouza BTW, this is my C++ implementation of the paper.

I’ll upload the PyTorch version soon.

2 Likes

This paper that came out a few days ago may be of interest to @subutai: https://arxiv.org/abs/1907.04840

3 Likes

You could imagine that if all the weighted sums in a layer were wired up to a common input vector, many of them would be redundant, unable to find any worthwhile thing to do that another one wasn’t already doing.
The effectiveness of sparsity in such a situation would simply be a reflection of that redundancy.
Just a thought.
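
One crude way to check for that (just a sketch of the idea, nothing from the paper): measure how similar the weight vectors of the units in a layer end up being, e.g. by mean pairwise cosine similarity; highly aligned rows are doing nearly the same weighted sum on the common input:

```python
import numpy as np

def redundancy(weight_matrix):
    """Mean absolute pairwise cosine similarity between the rows (units) of a layer."""
    w = weight_matrix / np.linalg.norm(weight_matrix, axis=1, keepdims=True)
    cos = w @ w.T
    off_diag = cos[~np.eye(len(cos), dtype=bool)]
    return np.abs(off_diag).mean()
```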

A weighted nonlinearity is a weak learner, and you can sum weak learners together to get a strong learner. In a conventional artificial neural network there are only n nonlinear terms per layer, yet there are n squared weight terms.
How much better would it be to pair each weight with its own nonlinear term?
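
For concreteness, here is one way that pairing could look (my own reading of the suggestion, with a shifted ReLU as an arbitrary choice of per-weight nonlinearity), so a layer has one nonlinear term per weight rather than one per output unit:

```python
import torch
import torch.nn as nn

class PerWeightNonlinearLinear(nn.Module):
    """Each weight gets its own shifted ReLU applied to its input before summing."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features**0.5)
        self.shift = nn.Parameter(torch.zeros(out_features, in_features))

    def forward(self, x):
        # x: (batch, in_features) -> broadcast to (batch, out_features, in_features)
        h = torch.relu(x.unsqueeze(1) - self.shift)   # one nonlinear term per weight
        return (h * self.weight).sum(dim=-1)          # then the weighted sum
```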

Well, yeah, I don’t know how much better. It could be anywhere from “not worth the extra computational effort” to “you’ve been doing it all wrong for decades”.

Hi, I can’t run this code, please help me. When I installed requirements.txt, these errors appeared:
ERROR: htmpaper-how-can-we-be-so-dense 1.0 has requirement numpy==1.13.3, but you’ll have numpy 1.16.5 which is incompatible.
ERROR: htmpaper-how-can-we-be-so-dense 1.0 has requirement torch==1.0.1, but you’ll have torch 1.2.0 which is incompatible.
and when I install numpy==1.13.3 and torch==1.0.1, these two errors appear:
ERROR: librosa 0.7.0 has requirement numpy>=1.15.0, but you’ll have numpy 1.13.3 which is incompatible.
ERROR: htmpaper-how-can-we-be-so-dense 1.0 has requirement torch==1.0.1, but you’ll have torch 1.2.0 which is incompatible.

I don’t know what I should do. Please help me run this code.

2 Likes

Thanks for the report. I will look into this by the end of today.

1 Like

Did you install this by running python setup.py develop?

If so, try updating the librosa requirement in requirements.txt to librosa==0.6.2. Does that help?

1 Like

Yes, it worked great.
Thanks a lot :pray:

The central limit theorem behavior of the weighted sum seems to be little known, unknown, or simply ignored by neural network researchers. I personally think it is a key property to be aware of.
Anyway, with the sparseness operator you describe you basically lose n − k noise terms. There is then a difference between the output variance of the weighted sum depending on whether or not you use sparsity. You could look into that, I suppose.
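
A toy numpy experiment of my own to look at that difference: push iid noise through the same weight vector with and without k-sparse inputs and compare the output variances.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 1000, 100
w = rng.normal(size=n)
noise = rng.normal(scale=0.1, size=(20000, n))

# Dense: every one of the n inputs carries noise into the sum.
dense_out = noise @ w

# Sparse: pretend the k winners are fixed by the signal, so the other
# n - k noise terms simply never reach the weighted sum.
winners = rng.choice(n, size=k, replace=False)
sparse_out = noise[:, winners] @ w[winners]

print(dense_out.var(), sparse_out.var())   # roughly a factor of n / k apart
```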

A couple of metrics on a weighted sum in a neural network:
1/ The length of the weight vector (which determines the output variance for iid input noise).
2/ The minimum and maximum angles between the input vectors (resulting from the training examples) and the weight vector. The closer those angles are to 90 degrees, the worse the signal-to-noise ratio of the weighted sum; near zero you will actually get some error correction. That is because near 90 degrees the length of the weight vector must be very large to get even a moderate output value (the dot product is zero at 90 degrees), whereas close to zero the weight vector length is small and most of the noise from other directions is averaged out.
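
Both metrics are easy to pull out of a trained layer; a rough numpy sketch, assuming you already have one unit's weight vector and a batch of training inputs as arrays:

```python
import numpy as np

def weighted_sum_metrics(w, inputs):
    """w: (n,) weight vector of one unit; inputs: (m, n) training input vectors."""
    w_len = np.linalg.norm(w)                      # metric 1: output noise scales with this
    cos = inputs @ w / (np.linalg.norm(inputs, axis=1) * w_len)
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return w_len, angles.min(), angles.max()       # metric 2: angle range to the inputs
```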

Weighted sums of weighted sums can be converted back to a single weighted sum. It should be possible to “freeze” one of those ReLU Numenta sparse neural networks using one particular input from the training or test data, work out the single effective weighted sum for one of the output neurons, and start figuring out some metrics on it. See if that weighted sum is ever reused, etc.
Metrics are :nauseated_face:
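
To make the “freeze and collapse” step concrete, here is a small numpy sketch for a two-layer ReLU net (my own, with hypothetical variable names): for a fixed input the ReLU gating pattern is fixed, so the whole network reduces to one effective weight vector and bias per output neuron:

```python
import numpy as np

def collapse_for_input(x, W1, b1, W2, b2):
    """Effective single weighted sum of a 2-layer ReLU net, frozen at input x."""
    mask = (W1 @ x + b1 > 0).astype(float)      # which hidden ReLUs are active for x
    W_eff = W2 @ (mask[:, None] * W1)           # fold the frozen mask into the weights
    b_eff = W2 @ (mask * b1) + b2
    # Sanity check: W_eff @ x + b_eff equals the real forward pass at x.
    return W_eff, b_eff
```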

You can probably even see from the weights what the network is looking at.
:nauseated_face::nauseated_face:

I finally got back to dealing with this. I have uploaded my implementation and adversarial attack experiments. Please have a look if anyone is interested.

2 Likes

@subutai I am trying to implement the k-winners layer for the fastai library from scratch.

The plan is to try to train a larger network on a large dataset and examine the effects of the sparsity introduced in the paper.

However, I got stuck trying to translate Equation (6) from the paper into code. Could you please clarify how to compute the running average of the duty cycle?

Some of the questions:

  1. What is the value of \alpha?
  2. What does the \alpha \cdot \left[i \in \operatorname{topIndices}^{l}\right] component mean, precisely? Do we multiply \alpha by the indices of the maximally activating neurons? Sorry for asking, but this notation is not explained in the paper.
  3. What is the initial value of the duty cycle? By looking at the code base it seems that it is zero; can you please confirm this?

It just determines how easily the duty cycle changes. It’s not an exact value, but it ranges between 0 and 1.

I’ll just explain this in pseudocode:

if i in topIndices then
    d(t) = (1 - α) * d(t - 1) + α
else
    d(t) = (1 - α) * d(t - 1)
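
In vectorized form (my sketch in PyTorch, not the reference implementation), that update would be:

```python
import torch

def update_duty_cycle(duty_cycle, top_indices, alpha):
    """duty_cycle: (n,) running average; top_indices: (k,) winners this step."""
    fired = torch.zeros_like(duty_cycle)
    fired[top_indices] = 1.0                 # the Iverson bracket [i in topIndices]
    return (1 - alpha) * duty_cycle + alpha * fired
```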

I’m not entirely sure about this, but I think it must be the sparsity.

It seems that here you are updating the duty cycle using an older version (Equation 8) from the earlier paper, The HTM Spatial Pooler—A Neocortical Algorithm for Online Sparse Distributed Coding, which uses time steps to calculate the duty cycle.

However, in “How Can We Be So Dense”, the duty cycle is updated using Equation 6, with an \alpha parameter whose value is not provided in the paper. In addition, the right-most term (the one with square brackets) is not clear: the notation is not explained and I am not sure what it is doing.

I am trying to implement the paper from scratch to explore training a bigger model and testing its noise resilience, so could you please clarify:

  • the value of \alpha as per Equation 6 from the paper;
  • what the term \alpha \cdot \left[i \in \text{topIndices}^{l}\right] is actually doing;
  • whether I should implement Equation 6 as it is presented in the paper, or use your implementation, which relies on Equation 8 from the earlier “The HTM Spatial Pooler” paper.
1 Like

We start with 1000.

1 Like

What about the value of \alpha?

Is @hsgo correct in assuming that \left[i \in \text{topIndices}^{l}\right] is just an indicator function with values either 0 or 1?

1 Like