Obstacles to widespread commercial adoption of sparsity in the ML industry?

There are two issues here that are not necessarily (or entirely) interdependent: sparsity of representation and sparsity of computation.
Representational sparsity, which means using SDRs all over the place instead of dense float vectors, is indeed a tough nut to crack within the DL paradigm.

Computational sparsity, however, is a different thing. In general it simply means that, even with a relatively large model, each inference operates on a relatively small subset of the model’s parameters.

Of course, at certain stages within the sparse computing paradigm there has to be one or more SDRs to select the subset of “active” parameters at any given stage, but that doesn’t fundamentally change how backpropagation-trained NNs work, so at least in theory GPUs would not have to operate very differently from how they already do:

Let’s say somewhere in a NN you have a 1000-unit hidden layer sandwiched between a 100-unit predecessor layer and a 100-unit successor layer. That’s a 100x1000 weight matrix “in” and a 1000x100 one “out” on which the machine needs to do matrix multiplication.
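For concreteness, here is what that dense baseline might look like in NumPy; the sizes follow the numbers above, while the variable names and the ReLU activation are just illustrative assumptions:

import numpy as np

# Dense baseline from the example above: 100 -> 1000 -> 100.
rng = np.random.default_rng(0)
W_in = rng.standard_normal((100, 1000))    # the 100x1000 "in" weight matrix
W_out = rng.standard_normal((1000, 100))   # the 1000x100 "out" weight matrix

x = rng.standard_normal(100)               # activation coming from the predecessor layer
hidden = np.maximum(x @ W_in, 0.0)         # all 1000 hidden units get computed
y = hidden @ W_out                         # 100-unit successor layer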

In general, depending on the number and complexity of the problems the model needs to solve, a wider hidden layer - e.g. 20000 units instead of 1000 - performs better, but inference for that layer gets 20 times more expensive, and training via backpropagation probably even more than 20 times.
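(To put numbers on that: the 1000-unit layer costs roughly 100*1000 + 1000*100 = 200k multiply-accumulate operations per input, while a 20000-unit one costs 100*20000 + 20000*100 = 4M, i.e. 20 times as much, before even counting the backward pass.)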

If we could use a 20000-bit-wide “selector SDR” at 5% sparsity, which simply selects 1000 lines out of the 20k on which both forward inference and backpropagation are performed, then in theory instead of

for line in weight_matrix:
    compute(line)

we’ll have:

for line_id in sparse_SDR:
    line = weight_matrix[line_id]
    compute(line)

(edit: the selection above would be only slightly more expensive than the plain 1000-line matrix multiplication, yet it operates with a 20000-line model.)
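For illustration, here is a minimal NumPy sketch of that selector idea, with the selector SDR represented by the indices of its active bits; the names and sizes are assumptions for this example, and a real implementation would also need the backward pass to touch only the selected lines:

import numpy as np

rng = np.random.default_rng(0)

# A 20000-unit hidden layer; only 5% of it is used per inference.
W_in = rng.standard_normal((100, 20000))
W_out = rng.standard_normal((20000, 100))
x = rng.standard_normal(100)

# The "selector SDR", here simply the indices of its 1000 active bits
# (how those indices get chosen is a separate question).
active = rng.choice(20000, size=1000, replace=False)

# The selected hidden units correspond to columns of W_in and rows of W_out.
# Gather them, then do the same matmuls the plain 1000-unit layer would do.
hidden_active = np.maximum(x @ W_in[:, active], 0.0)   # 1000 active hidden values
y = hidden_active @ W_out[active, :]                   # 100-unit output

The two gathers copy about as many numbers as the two 1000-line matmuls touch anyway, which is the sense in which the selection is only slightly more expensive than the plain 1000-unit layer.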

I don’t know the details of how GPUs work, but that looks like a small step to make in hardware for a giant leap in how DL could be done.

PS: I think all the fuss in the avoiding catastrophe thread can be reduced to this simple selector-SDR idea.

PS2: And considering the trend of integrating GPU and CPU without separating their memory spaces, e.g. in ARM SoCs, most notably Apple’s M1, this might be even easier to implement via dynamic memory mapping between the GPU and CPU.
