Hi I wish to bring attention to work of D.C.Mocanu with very promising results on training large sparse networks on CPU (even on single thread numba or cython) and training times comparable with those of dense networks trained on same task on GPU

The method they use (SET from Sparse Evolutionary Training) not only updates weights but also structure.

In a previous paper they tested existing various sparse matrices and none was suitable for both backpropagation and changing structure yet sparse matrix conversions were very fast. https://arxiv.org/pdf/1901.09181.pdf

Also pure performance in FLOPs might be distracting. I mean they train a dense network in GPU and a 1-2% sparser one (yeah 100x fewer parameters) in CPU in the same time (which would imply CPU/sparse is way less efficient) yet they have comparable performance on various tests.

PS see first paper table 2 at page 7 - there-s comparisons between full dense MLPs (trained on GPU) and their SET sparse versions with 50-100times fewer weights, trained on CPU. And slightly better at same amount of overall time spent on training

I will read those papers.
There is also this information about Fourier Analysis of deep neural networks. https://techxplore.com/news/2023-02-scientific-ai-black-year-old-method.html
The FFT is actually a dot product method. 100% linear. Just the way it is taught with complex numbers would make you not realize. And like I said many times a ReLU neural net is a system of dot products that are switch connected and disconnected to and from each other. With x>=0 as an independent switching predicate. Like in your house you decide to switch on or off a light, which is the switching decision, and the contacts in the switch allow a 120 volt sine wave through or stop it on the basis of the decision.
The computer science confusion about switches is there is one fixed unchanging voltage in digital switches in CPU chips that is connected and disconnected. What is switched is one fixed unchanging voltage, by happenstance due to the design of digital CPU circuits, it is not a general property of switches. Hence the failure to understand ReLU as a switch.

Almost all performant NN implementations rely on some form of accelerated linear algebra library to do the matrix products. So I would hazard a guess that they are either using a sparse matrix library (e.g. cuSPARSE) or some other custom operation.

As for how they come up with the structure for the weights, it appears as though they are using some form of weight pruning during the training stage. If I understand the paper correctly, they seem to be averaging the updates from each worker thread and then pruning back to the prescribed sparsity level. The updates are then distributed back to the worker threads.

So if they are able to express the weight matrices in a sparse format, they could potentially leverage existing sparse linear algebra libraries.

I’m not saying that sReLU does not result in better performance, it probably does.
If you view ReLU as a function sReLU is one of a set of possible improvements that may naturally occur to you.
If you view ReLU as a switch there is a set of possible improvements that may naturally occur to you. The 2 sets are not the same. Though there is a little bit of intersection.
Personally I think the switch viewpoint gives the greater improvement.

The FFT, the WHT and other nlog(n) fast transforms are the one of the greatest forms of linear algebra acceleration known, especially if you want full connectivity.
There are even sparse FFT and WHT algorithms if you wanted to explore that, though for sure they will not provide consistant one to all connectivity.

One kind of switching scheme is to realize that the output of 1 neuron in layer N is connected to all the neurons in layer N+1 via a single weight for each neuron instance in N+1. You are projecting like this vector pattern onto the summing functions of the neurons in N+1, of varying intensity according to the output of the neuron in N.
If the single neuron in N is ReLU then either the pattern is not projected at all (x<0) or projected with intensity of x (x>=0).
Why not then actually project an alternative pattern when (x<0) rather than nothing? You need to provide an extra set of weights to do the negative projection.
You could create a dense neural network that way, and maybe redefine the exact meaning of a layer.
However you could create multiple tiny width 4 layers with such forward weight switching and combine their outputs with a fast transform, specifically the Walsh Hadamard transform which provides one to all connectivty with no information loss. I find that most effective.
With multiple width 2 layers of that type you start to get the sort of information loss that plagues ReLU nets, and results in messy ad hoc solutions like sReLU, leaky ReLU etc.
Width 8 and above layers are difficult to squash into efficient machine code because of register availability pressure.