There are two issues here that are not necessarily (or entirely) interdependent: sparsity of representation and sparsity of computation.
Representational sparsity, which means using SDRs all over the place instead of dense float vectors, is indeed a tough nut to crack within the DL paradigm.
Computational sparsity, however, is a different thing. In general it simply means that, given a relatively large model, each inference operates on a relatively small subset of the model’s parameters.
Of course, at certain stages within the sparse computing paradigm there has to be one or more SDRs to select the subset of “active” parameters at any given stage, but that doesn’t fundamentally change how backpropagated NNs work, so at least in theory the way GPUs would have to do their work should not be very different from how they already do it:
Let’s say somewhere in a NN you have a 1000-unit-wide hidden layer sandwiched between a 100-unit predecessor layer and a 100-unit successor layer. That’s one 100x1000 weight matrix “in” and another 100x1000 one “out” on which the machine needs to do matrix multiplication.
In general, depending on how many problems the model needs to solve and how complex they are, a wider hidden layer - e.g. 20000 instead of 1000 - performs better, but inference for that layer gets 20 times more expensive, and training via backpropagation probably even more than 20 times.
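For concreteness, here’s a minimal NumPy sketch of the dense case above - the variable names and the ReLU are just illustrative choices, nothing prescribed:

import numpy as np

d_in, d_hidden, d_out = 100, 1000, 100      # the 100 -> 1000 -> 100 sandwich

W_in = np.random.randn(d_in, d_hidden)      # 100 x 1000 weights "in"
W_out = np.random.randn(d_hidden, d_out)    # 1000 x 100 weights "out"

x = np.random.randn(d_in)                   # one input vector
hidden = np.maximum(x @ W_in, 0.0)          # dense forward pass through the hidden layer
y = hidden @ W_out

# Widening d_hidden to 20000 multiplies the work in both matrix products
# (and in their backward passes) by roughly 20x.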
If we could use a 20000-bit-wide “selector SDR” at 5% sparsity, which simply selects 1000 lines out of the 20k on which both forward inference and backpropagation are performed, then in theory instead of
for line in weight_matrix:
    compute(line)
we’ll have:
for line_id in sparse_SDR:
    line = weight_matrix[line_id]
    compute(line)
(edit: The above selection would be only slightly more expensive than the plain 1000-line matrix multiplication, yet it operates with a 20000-line model.)
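Here is a small NumPy sketch of that selector idea, under the assumption that the selector SDR can be represented simply as the indices of its active bits (active_ids below is a placeholder for whatever actually produces them):

import numpy as np

d_in, d_hidden, d_out = 100, 20000, 100
k = 1000                                    # ~5% of the 20000 hidden lines are active

W_in = np.random.randn(d_in, d_hidden)
W_out = np.random.randn(d_hidden, d_out)

x = np.random.randn(d_in)

# The selector SDR, as the sorted indices of its active bits.
active_ids = np.sort(np.random.choice(d_hidden, size=k, replace=False))

# Gather only the selected rows/columns and multiply the small slices:
# the matmul cost is that of a 1000-wide layer, not a 20000-wide one.
hidden_active = np.maximum(x @ W_in[:, active_ids], 0.0)   # shape (k,)
y = hidden_active @ W_out[active_ids, :]                   # shape (d_out,)

# Backprop would touch the same k lines, so gradients are computed and
# applied only to the selected 5% of the parameters.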
I don’t know how GPUs work internally, but that looks like a small step to make in hardware in order to reach a giant leap in how DL could be done.
PS: And I think all the fuss in the “avoiding catastrophe” thread can be reduced to this simple selector-SDR idea.
PS2: And considering the trend of integrating GPU and CPU without separating their memory spaces, e.g. in ARM SoCs, most notably Apple’s M1, this might be even easier to implement via dynamic memory mapping between GPU & CPU.