How Can We Be So Dense? The Benefits of Using Highly Sparse Representations

Hi @beginner

In DropConnect, a random set of weights is set to zero at each training step. During inference, all weights are used, so it is still a dense model: every connection has a weight attributed to it. A common interpretation of dropout techniques (though not the only one) is that they let you learn several different models with a single network, so you are actually learning an ensemble of smaller networks that share some parameters.
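
To make the distinction concrete, here is a minimal PyTorch sketch of a DropConnect-style layer (my own illustration, not from either paper; the usual 1/(1−p) rescaling is omitted to keep it short). The weight mask is resampled every forward pass during training and dropped at inference, so the model you actually run is still dense:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropConnectLinear(nn.Module):
    """Linear layer that zeroes a random subset of weights each training step."""
    def __init__(self, in_features, out_features, drop_prob=0.5):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.training:
            # Sample a fresh binary mask over the weights every forward pass.
            mask = (torch.rand_like(self.linear.weight) >= self.drop_prob).float()
            weight = self.linear.weight * mask
        else:
            # Inference uses all weights: the model is still dense.
            weight = self.linear.weight
        return F.linear(x, weight, self.linear.bias)
```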

In the paper you cited, weights are sparse at initialization and at inference. Most importantly, what leads to robustness is not the sparse weights alone, but the combination of sparse weights and sparse activations (k-winners with boosting).
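
As a rough sketch of what a k-winners activation with boosting looks like (simplified and written by me, not Numenta's reference code; the exact boosting formula and duty-cycle bookkeeping in the paper differ in the details):

```python
import torch

def k_winners(x, duty_cycle, k, boost_strength=1.0, target_density=None):
    """Keep the k most active units per sample, boosted by how rarely they fire.

    x: (batch, n) pre-activations; duty_cycle: (n,) running firing frequency.
    """
    if target_density is None:
        target_density = k / x.shape[1]
    # Units that have fired less often than their target get boosted upward.
    boost = torch.exp(boost_strength * (target_density - duty_cycle))
    boosted = x * boost
    # Indices of the k largest boosted activations for each sample.
    top = boosted.topk(k, dim=1).indices
    mask = torch.zeros_like(x).scatter_(1, top, 1.0)
    # The unboosted activations pass through only at the winning positions.
    return x * mask
```

The paper's actual layer also updates the duty cycle after each batch, which is discussed further down this thread.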

4 Likes

The weighted sum, cosine angle and the central limit theorem.
When the angle between the weight vector and the input is smallest, you get the maximum output for a fixed-length input vector, and the central limit theorem applies. To get the same output at a larger angle, all the weights have to be multiplied by a scalar greater than one. Then, when you go to apply the central limit theorem, you find the output variance has increased in proportion.
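
A quick numerical illustration of that point (my own toy example, assuming a unit-length input and iid Gaussian input noise): fix the clean output at 1.0, put the weight vector at increasing angles to the input, rescale, and watch the output noise variance grow with the squared weight length:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.normal(size=n)
x /= np.linalg.norm(x)          # unit-length input pattern

for angle_deg in (5, 30, 60, 85):
    theta = np.radians(angle_deg)
    # Build a weight vector at the requested angle to x.
    r = rng.normal(size=n)
    r -= r @ x * x              # component orthogonal to x
    r /= np.linalg.norm(r)
    w = np.cos(theta) * x + np.sin(theta) * r
    # Rescale so the clean response w.x is exactly 1.0.
    w /= w @ x
    # Output variance for iid noise of variance sigma^2 is sigma^2 * ||w||^2.
    noise = rng.normal(scale=0.1, size=(10000, n))
    out = (x + noise) @ w
    print(angle_deg, np.linalg.norm(w), out.var())
```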

If you deliberately pick the weighted sum with the maximum output then, in a certain way, you are doing the best that you can with regard to the error correction the central limit theorem can provide.

Isn’t that slightly tricky to understand? It explains (if you work out the details) why the more information you store in a weighted sum, the less effect the central limit theorem has.
It is brutal that there is no tutorial about this on the internet, which rather suggests that AI researchers simply don’t know or understand it.
Actually, I’d like to read such a tutorial myself, since I only have the outline.

So as you store more <pattern,response> items in a weighted sum, the vector length of the weights increases while the average magnitude of the response scalars remains about the same. The central limit theorem reduction in noise is increasingly cancelled out by the increase in the weight vector length. Also, the average angle of the patterns away from a central point increases.
One interesting thing is that learning many <pattern,response> items where the patterns are close to each other, and the response scalars are also similar, only slightly increases the length of the weight vector. This is rather positive for learning decision regions and paths in higher-dimensional space, as opposed to just point responses.
Of course the weighted sum has its normal problems of strong additive spurious responses; however, higher-dimensional space is a big place and almost all vectors are orthogonal. And the linear separability issue can be dealt with by placing nonlinearity before the weighted sum.
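
Here is a small numpy sketch of that trade-off (my own construction, not from any paper): store an increasing number of random <pattern,response> items in one weighted sum via least squares and watch the weight vector length, and with it the recall error under input noise, grow:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512                               # dimension of the weighted sum

for m in (8, 64, 256, 448):           # number of <pattern,response> items stored
    patterns = rng.normal(size=(m, n)) / np.sqrt(n)   # roughly unit-length patterns
    responses = rng.normal(size=m)                     # target scalars, magnitude ~1
    # Minimum-norm weights that reproduce every stored response exactly.
    w, *_ = np.linalg.lstsq(patterns, responses, rcond=None)
    # Recall one stored pattern with additive input noise.
    noisy = patterns[0] + rng.normal(scale=0.05, size=n)
    print(m, np.linalg.norm(w), abs(noisy @ w - responses[0]))
```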

@lucasosouza BTW, this is my C++ implementation of the paper.

I’ll upload the PyTorch version soon.

2 Likes

This paper that came out a few days ago may be of interest to @subutai: https://arxiv.org/abs/1907.04840

3 Likes

You could imagine that if all the weighted sums in a layer were wired up to a common input vector, many of them would be redundant, unable to find any worthwhile thing to do that another one wasn’t already doing.
The effectiveness of sparsity in such a situation would simply be a reflection of that redundancy.
Just a thought.
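
One crude way to check for that (just a sketch of the idea, nothing from the paper): measure how similar the weight vectors of the units in a layer end up being, e.g. by mean pairwise cosine similarity; highly aligned rows are doing nearly the same weighted sum on the common input:

```python
import numpy as np

def redundancy(weight_matrix):
    """Mean absolute pairwise cosine similarity between the rows (units) of a layer."""
    w = weight_matrix / np.linalg.norm(weight_matrix, axis=1, keepdims=True)
    cos = w @ w.T
    off_diag = cos[~np.eye(len(cos), dtype=bool)]
    return np.abs(off_diag).mean()
```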

A weighted nonlinearity is a weak learner, and you can sum weak learners together to get a strong learner. In a conventional artificial neural network there are only n nonlinear terms per layer, yet there are n squared weight terms.
How much better would it be to pair each weight with its own nonlinear term?
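
For concreteness, here is one way that pairing could look (my own reading of the suggestion, with a shifted ReLU as an arbitrary choice of per-weight nonlinearity), so a layer has one nonlinear term per weight rather than one per output unit:

```python
import torch
import torch.nn as nn

class PerWeightNonlinearLinear(nn.Module):
    """Each weight gets its own shifted ReLU applied to its input before summing."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features**0.5)
        self.shift = nn.Parameter(torch.zeros(out_features, in_features))

    def forward(self, x):
        # x: (batch, in_features) -> broadcast to (batch, out_features, in_features)
        h = torch.relu(x.unsqueeze(1) - self.shift)   # one nonlinear term per weight
        return (h * self.weight).sum(dim=-1)          # then the weighted sum
```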

Well, yeah, I don’t know how much better. It could be anywhere from “not worth the extra computational effort” to “you’ve been doing it all wrong for decades”.

Hi, I can’t run this code, please help me. When I installed requirements.txt, these errors appeared:
ERROR: htmpaper-how-can-we-be-so-dense 1.0 has requirement numpy==1.13.3, but you’ll have numpy 1.16.5 which is incompatible.
ERROR: htmpaper-how-can-we-be-so-dense 1.0 has requirement torch==1.0.1, but you’ll have torch 1.2.0 which is incompatible.
and when I install numpy==1.13.3 and torch==1.0.1, these two errors appear:
ERROR: librosa 0.7.0 has requirement numpy>=1.15.0, but you’ll have numpy 1.13.3 which is incompatible.
ERROR: htmpaper-how-can-we-be-so-dense 1.0 has requirement torch==1.0.1, but you’ll have torch 1.2.0 which is incompatible.

I don’t know what I should do. Please help me run this code.

2 Likes

Thanks for the report. I will look into this by the end of today.

1 Like

Did you install this by running python setup.py develop?

If so, try updating the librosa requirement in requirements.txt to librosa==0.6.2. Does that help?

1 Like

Yes, it worked great.
Thanks a lot :pray:

The central limit theorem behavior of the weighted sum seems to be little known, unknown, or simply ignored by neural network researchers. I personally think it is a key property to be aware of.
Anyway, with the sparseness operator you describe you basically lose n − k noise terms. There is then a difference between the output variance of the weighted sum depending on whether or not you use sparsity. You could look into that, I suppose.
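
A toy numpy experiment of my own to look at that difference: push iid noise through the same weight vector with and without k-sparse inputs and compare the output variances.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 1000, 100
w = rng.normal(size=n)
noise = rng.normal(scale=0.1, size=(20000, n))

# Dense: every one of the n inputs carries noise into the sum.
dense_out = noise @ w

# Sparse: pretend the k winners are fixed by the signal, so the other
# n - k noise terms simply never reach the weighted sum.
winners = rng.choice(n, size=k, replace=False)
sparse_out = noise[:, winners] @ w[winners]

print(dense_out.var(), sparse_out.var())   # roughly a factor of n / k apart
```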

A couple of metrics on a weighted sum in a neural network:
1/ The length of the weight vector (which determines the output variance for iid input noise).
2/ The minimum and maximum angles between the input vectors (resulting from the training examples) and the weight vector. The closer those angles are to 90 degrees, the worse the signal-to-noise ratio of the weighted sum; near zero you will actually get some error correction. That is because near 90 degrees the length of the weight vector must be very large to get even a moderate output value (the dot product is zero at 90 degrees), whereas close to zero the weight vector length is small and most of the noise from other directions is averaged out.
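
Both metrics are easy to pull out of a trained layer; a rough numpy sketch, assuming you already have one unit's weight vector and a batch of training inputs as arrays:

```python
import numpy as np

def weighted_sum_metrics(w, inputs):
    """w: (n,) weight vector of one unit; inputs: (m, n) training input vectors."""
    w_len = np.linalg.norm(w)                      # metric 1: output noise scales with this
    cos = inputs @ w / (np.linalg.norm(inputs, axis=1) * w_len)
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return w_len, angles.min(), angles.max()       # metric 2: angle range to the inputs
```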

Weighted sums of weighted sums can be converted back to a single weighted sum. It should be possible to “freeze” one of those ReLU Numenta sparse neural networks using one particular input from the training or test data, work out the single effective weighted sum for one of the output neurons, and start figuring out some metrics on it. See if that weighted sum is ever reused, etc.
Metrics are :nauseated_face:
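
To make the “freeze and collapse” step concrete, here is a small numpy sketch for a two-layer ReLU net (my own, with hypothetical variable names): for a fixed input the ReLU gating pattern is fixed, so the whole network reduces to one effective weight vector and bias per output neuron:

```python
import numpy as np

def collapse_for_input(x, W1, b1, W2, b2):
    """Effective single weighted sum of a 2-layer ReLU net, frozen at input x."""
    mask = (W1 @ x + b1 > 0).astype(float)      # which hidden ReLUs are active for x
    W_eff = W2 @ (mask[:, None] * W1)           # fold the frozen mask into the weights
    b_eff = W2 @ (mask * b1) + b2
    # Sanity check: W_eff @ x + b_eff equals the real forward pass at x.
    return W_eff, b_eff
```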

You can probably even see from the weights what the network is looking at.
:nauseated_face::nauseated_face:

I finally got back to dealing with this. I have uploaded my implementation and adversarial attack experiments. Please have a look if anyone is interested.

2 Likes

@subutai I am trying to implement the k-winners layer for the fastai library from scratch.

The plan is to try to train a larger network on a large dataset and examine the effects of the sparsity introduced in the paper.

However, I got stuck trying to translate Equation (6) from the paper into code. Could you please clarify how to compute the running average of the duty cycle?

Some of the questions:

  1. What is the value of \alpha?
  2. What does the \alpha \cdot \left[i \in \operatorname{topIndices}^{l}\right] component mean, precisely? Do we multiply \alpha by the indices of the maximally activating neurons? Sorry for asking, but this notation is not explained in the paper.
  3. What is the initial value of the duty cycle? By looking at the code base it seems that it is zero; can you please confirm this?

It just determines how easily the duty cycle changes. It’s not an exact value, but it ranges between 0 and 1.

I’ll just explain this in pseudocode:

if i in topIndices then
    d(t) = (1 - α) * d(t - 1) + α
else
    d(t) = (1 - α) * d(t - 1)
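
In vectorized form (my sketch in PyTorch, not the reference implementation), that update would be:

```python
import torch

def update_duty_cycle(duty_cycle, top_indices, alpha):
    """duty_cycle: (n,) running average; top_indices: (k,) winners this step."""
    fired = torch.zeros_like(duty_cycle)
    fired[top_indices] = 1.0                 # the Iverson bracket [i in topIndices]
    return (1 - alpha) * duty_cycle + alpha * fired
```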

I’m not entirely sure about this, but I think it must be the sparsity.

It seems that here you are updating the duty cycle using an older version (Equation 8) from the earlier paper, The HTM Spatial Pooler—A Neocortical Algorithm for Online Sparse Distributed Coding, which uses time steps to calculate the duty cycle.

However, in “How Can We Be So Dense”, the duty cycle is updated using Equation 6, with an \alpha parameter whose value is not provided in the paper. In addition, the right-most term (the one with square brackets) is not clear: the notation is not explained and I am not sure what it is doing.

I am trying to implement the paper from scratch to explore training a bigger model and testing its noise resilience, so could you please clarify:

  • the value of \alpha as per Equation 6 from the paper;
  • what the term \alpha \cdot \left[i \in \text{topIndices}^{l}\right] is actually doing;
  • whether I should implement Equation 6 as it is presented in the paper, or use your implementation, which relies on Equation 8 from the earlier “The HTM Spatial Pooler” paper.
1 Like

We start with 1000.

1 Like

What about the value of \alpha?

Is @hsgo correct in assuming that \left[i \in \text{topIndices}^{l}\right] is just an indicator function with values either 0 or 1?

1 Like