Research on NN sparsity

cezar_t · February 13, 2023, 11:22am

Hi I wish to bring attention to work of D.C.Mocanu with very promising results on training large sparse networks on CPU (even on single thread numba or cython) and training times comparable with those of dense networks trained on same task on GPU

Here-s just one of the papers - Truly Sparse Neural Networks at Scale from the list above

It seems quite related to the relatively recent turn towards sparsification of large NNs made by Numenta.

JarvisGoBrr · February 13, 2023, 2:12pm

i’d like to know what kind of datastructure they use for storing the weights and efficiently updating them, thats the major issue for me.

all datastructures I tried to implement fail in performance when compared to the good ol’ big square matrix with mostly zeros.

cezar_t · February 13, 2023, 9:21pm

The method they use (SET from Sparse Evolutionary Training) not only updates weights but also structure.

In a previous paper they tested existing various sparse matrices and none was suitable for both backpropagation and changing structure yet sparse matrix conversions were very fast. https://arxiv.org/pdf/1901.09181.pdf

This is one of the repositories I found GitHub - dcmocanu/sparse-evolutionary-artificial-neural-networks: Always sparse. Never dense. But never say never. A Sparse Training repository for the Adaptive Sparse Connectivity concept and its algorithmic instantiation, i.e. Sparse Evolutionary Training, to boost Deep Learning scalability on various aspects (e.g. memory and computational time efficiency, representation and generalization power).

cezar_t · February 13, 2023, 9:24pm

Also pure performance in FLOPs might be distracting. I mean they train a dense network in GPU and a 1-2% sparser one (yeah 100x fewer parameters) in CPU in the same time (which would imply CPU/sparse is way less efficient) yet they have comparable performance on various tests.

PS see first paper table 2 at page 7 - there-s comparisons between full dense MLPs (trained on GPU) and their SET sparse versions with 50-100times fewer weights, trained on CPU. And slightly better at same amount of overall time spent on training

SeanOConnor · February 14, 2023, 5:30am

I will read those papers.
There is also this information about Fourier Analysis of deep neural networks.
https://techxplore.com/news/2023-02-scientific-ai-black-year-old-method.html
The FFT is actually a dot product method. 100% linear. Just the way it is taught with complex numbers would make you not realize. And like I said many times a ReLU neural net is a system of dot products that are switch connected and disconnected to and from each other. With x>=0 as an independent switching predicate. Like in your house you decide to switch on or off a light, which is the switching decision, and the contacts in the switch allow a 120 volt sine wave through or stop it on the basis of the decision.
The computer science confusion about switches is there is one fixed unchanging voltage in digital switches in CPU chips that is connected and disconnected. What is switched is one fixed unchanging voltage, by happenstance due to the design of digital CPU circuits, it is not a general property of switches. Hence the failure to understand ReLU as a switch.

EEProf · February 14, 2023, 1:48pm

Gottah love this guy’s middle name…and he’s an EE no less. Nice work, his home page at Twente:

SeanOConnor · February 17, 2023, 8:38am

You would think an electrical engineer would be primed to understand ReLU as a switch and not use SReLU. Such is the brainwashing though…

CollinsEM · February 17, 2023, 2:15pm

Almost all performant NN implementations rely on some form of accelerated linear algebra library to do the matrix products. So I would hazard a guess that they are either using a sparse matrix library (e.g. cuSPARSE) or some other custom operation.

As for how they come up with the structure for the weights, it appears as though they are using some form of weight pruning during the training stage. If I understand the paper correctly, they seem to be averaging the updates from each worker thread and then pruning back to the prescribed sparsity level. The updates are then distributed back to the worker threads.

So if they are able to express the weight matrices in a sparse format, they could potentially leverage existing sparse linear algebra libraries.

SeanOConnor · February 18, 2023, 3:41am

I’m not saying that sReLU does not result in better performance, it probably does.
If you view ReLU as a function sReLU is one of a set of possible improvements that may naturally occur to you.
If you view ReLU as a switch there is a set of possible improvements that may naturally occur to you.
The 2 sets are not the same. Though there is a little bit of intersection.
Personally I think the switch viewpoint gives the greater improvement.

The FFT, the WHT and other nlog(n) fast transforms are the one of the greatest forms of linear algebra acceleration known, especially if you want full connectivity.
There are even sparse FFT and WHT algorithms if you wanted to explore that, though for sure they will not provide consistant one to all connectivity.

cezar_t · February 18, 2023, 1:16pm

Just chimed in to mention there-s support for SET(*) in numpy-ml

I don’t know if this is the compiler optimized version, papers mention either cython or numba tests.

(*): Sparse Evolutionary Training

SeanOConnor · February 19, 2023, 2:38pm

One kind of switching scheme is to realize that the output of 1 neuron in layer N is connected to all the neurons in layer N+1 via a single weight for each neuron instance in N+1. You are projecting like this vector pattern onto the summing functions of the neurons in N+1, of varying intensity according to the output of the neuron in N.
If the single neuron in N is ReLU then either the pattern is not projected at all (x<0) or projected with intensity of x (x>=0).
Why not then actually project an alternative pattern when (x<0) rather than nothing? You need to provide an extra set of weights to do the negative projection.
You could create a dense neural network that way, and maybe redefine the exact meaning of a layer.
However you could create multiple tiny width 4 layers with such forward weight switching and combine their outputs with a fast transform, specifically the Walsh Hadamard transform which provides one to all connectivty with no information loss. I find that most effective.
With multiple width 2 layers of that type you start to get the sort of information loss that plagues ReLU nets, and results in messy ad hoc solutions like sReLU, leaky ReLU etc.
Width 8 and above layers are difficult to squash into efficient machine code because of register availability pressure.

Topic		Replies	Views
Poster Overview: How Can We Be So Slow? Realizing the Performance Benefits of Sparse Networks Current Research	9	946	May 5, 2022
Numenta Technology Demonstration: Sparse networks perform inference 50 times faster than dense networks, with competitive accuracy Machine Learning	6	1216	November 12, 2020
Anyone can explain why Numenta latest algo optimizes Deep Learning 100x? Machine Learning	15	1165	May 15, 2023
OpenAI Paper Review: GPU Kernels for Block-Sparse Weights Current Research live , journal-club	3	1060	August 20, 2019
How Can We Be So Dense? The Benefits of Using Highly Sparse Representations Related Papers	81	4413	December 5, 2019

Research on NN sparsity

Related topics