Numenta has shown sparsity has order(s) of magnitude effects on price-performance at least under some conditions. What obstacles remain before sparsity is widely or even universally adopted in commercial applications of deep learning?
The short answer is that the cases where sparsity would help are not the common case faced in deep learning deployments.
- It is rare to invest in FPGAs (required for the biggest speedups), as you would be committing yourself to a very specific development model that requires a bunch of domain experts to operate. Also the frameworks that everyone uses aren’t built for FPGAs, so you lose out on a lot of common infrastructure and whatnot.
- Unstructured sparsity doesn’t lead to speedups on GPUs, because sparse memory accesses / computations aren’t much faster than dense ones, so you may just end up loading the whole thing and zeroing out entries.
- Structured sparsity loses even more accuracy than unstructured sparsity. You will sometimes see it used, though, in the form of block sparsity in certain parts of the network. Still, it is typically beneficial to keep dense computation in much of the network.
- The deep learning frameworks don’t put much priority on sparse computation, so support and optimizations are spotty. Kind of a chicken and egg scenario here.
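To make the unstructured-sparsity point concrete, here is a small sketch (my own illustration using NumPy/SciPy, not from any framework's docs) showing that a CSR sparse multiply produces exactly the same numbers as the dense one; the difference is purely in the memory-access pattern, which is what GPUs dislike:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Dense weight matrix with ~90% of entries zeroed out (unstructured sparsity)
W = rng.standard_normal((256, 256))
W[rng.random(W.shape) < 0.9] = 0.0
x = rng.standard_normal(256)

y_dense = W @ x               # contiguous, hardware-friendly loads
W_csr = sparse.csr_matrix(W)  # stores only the ~10% nonzero entries
y_sparse = W_csr @ x          # irregular, index-chasing loads

# Same result either way; only the access pattern differs, which is
# why unstructured sparsity rarely speeds things up on GPUs.
print(np.allclose(y_dense, y_sparse))  # True
```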
Just to add my 2 cents: instead of fully switching to sparsity, DL models tend to use a hybrid approach of partially sparse computation (like modified attention), and some newer work proposes sparsely activated subnetworks within the full network. So it’s not that the industry has totally ditched sparsity - it’s simply that dense computations are much easier and more convenient, as cfoster pointed out…
Another common example is companies “sparsifying” models for deployment, leading to much higher performance on CPU at inference time - though this is all still unstable and at the cutting edge (unless things have changed drastically and I’m not aware).
I think using non-linear number systems could result in learned sparsity.
Basically, where the training mechanism indicates an increase for a weight, instead of applying it directly, an associated integer counter would be incremented.
Likewise, decrements for decreases.
The idea then is to set the actual weights according to a non-linear function applied to each integer counter - for example, squaring it.
Or use an exponential version (instead of squaring) to mimic the Multiplicative Weights Update algorithm.
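For what it’s worth, here is a rough sketch of how I read the counter idea (the function names and exact formulas are my own guesses, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Integer counters stand in for the weights; gradient steps only
# increment or decrement them, never touch the weights directly.
counters = np.zeros(8, dtype=np.int64)

def weights_square(c):
    # Sign-preserving square: small counters give tiny weights, while
    # heavily-incremented ones dominate -> "spiky", effectively sparse.
    return np.sign(c) * c.astype(float) ** 2

def weights_exp(c, beta=0.5):
    # Exponential version, in the spirit of Multiplicative Weights Update.
    return np.sign(c) * (np.exp(beta * np.abs(c)) - 1.0)

# Fake training signal: each "gradient" step just bumps the counters.
for g in rng.choice([-1, 1], size=(100, 8)):
    counters += g

w = weights_square(counters)  # the actual weights used at inference
```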
It is maybe not well explained in the paper here, which is why you have to think for yourself in this case:
And perhaps not well implemented here:
Cute idea, anyway I’m retired and don’t want to go back to all that obsessive thinking, I’m quite happy just blogging about cakes and baking.
And the video is here:
Anyway, intuitively the idea is to make the weights very spiky: most of them small, and those that have been incremented more than average very large.
I kind of imagine it would also make training quicker, as you can ramp up to large-magnitude weights faster than with conventional additive changes.
Personally, I found that whole thing a bit strange.
The primary benefit of sparsity is that it fundamentally changes how things are represented.
The computational efficiency is nice, but it’s kind of a side effect?
I think that more ppl would pay attention to it if they focused on the increased noise resistance of sparse representations. In particular, to my knowledge, Numenta never tested their sparse networks for resistance to adversarial attacks…
In this figure, Numenta applied random noise to the input.
In an adversarial attack, the noise would not be random. The attacker would reverse engineer the network and then carefully choose which bits to corrupt. The attacker can control the output classification while corrupting the inputs by an imperceptibly small amount.
Resistance to random noise is a promising start. But the next step, demonstrating resistance to adversarial attacks, is the subject of DARPA grants.
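To illustrate the difference between the two noise models, here is a toy linear classifier (purely illustrative numbers, nothing to do with Numenta’s setup): random noise of a given per-component size barely moves the score, while an attacker spending the same budget pushes every component against the weights (the FGSM direction for a linear model) and flips the class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear classifier: class = sign(w . x)
w = rng.standard_normal(100)
x = 2.0 * w / (w @ w)        # clean input, classified +1 with margin 2
eps = 0.05                   # per-component perturbation budget

def predict(v):
    return np.sign(w @ v)

# Random noise: the score moves by ~0.5 on average, so the
# classification almost certainly survives.
x_rand = x + eps * rng.choice([-1.0, 1.0], size=100)

# Adversarial noise: every component pushed against the weights.
# Same per-component size, but the shifts all add up coherently.
x_adv = x - eps * np.sign(w)

print(predict(x), predict(x_adv))
```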
I agree. This is the most striking feature of this entire direction: the SDR represents values in a way that is quite unlike anything we are familiar with in von Neumann computing. We learn computing entirely based on small numbers. 8/16/32-bit integers, 32/64-bit floats, 256 ASCII characters, a million or so Unicode characters. SDRs are a total paradigm shift at the most fundamental level.
What would computation look like if every ‘object’ was an SDR? SDRs for integers, floats, dates, characters, colours, locations, etc. And it doesn’t stop there: SDRs to identify sounds, words, images, anything you can give a name to. What would computing look like then? I don’t know, but I think that is the path to AGI.
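For readers unfamiliar with the encoding, a minimal sketch of the idea - SDRs as sets of active bits, overlap as similarity - might look like this (the sizes are typical HTM choices; the code itself is my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, W = 2048, 40              # 2048 bits, ~2% active: typical HTM sizing

def random_sdr():
    """An SDR, stored as the set of its active bit indices."""
    return set(rng.choice(N, size=W, replace=False).tolist())

cat, dog, car = random_sdr(), random_sdr(), random_sdr()

def overlap(a, b):
    # Similarity = number of shared active bits. Two unrelated random
    # SDRs overlap in under 1 bit on average, so collisions are rare.
    return len(a & b)

# Codes with shared meaning share bits by construction: keep 30 of
# dog's bits and add 10 fresh ones to get a "dog-like" code.
dog_like = set(list(dog)[:30]) | set(rng.choice(N, size=10, replace=False).tolist())
```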
If anyone knows of writing on that topic, please share.
If y’all want a read on how modern DL works, and the principles underlying why sparsity is generally not efficient, I recommend this article.
Fundamentally, with the Von Neumann style architectures we have today, memory and compute are far apart, so the way to do anything efficiently is: loading large, contiguous ranges from memory, transporting those to the compute elements, and doing many computations on those at once (for example, dense matrix multiplications). A good word to keep in mind is arithmetic intensity, which is the ratio of arithmetic operations done to bytes of memory accessed.
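As a back-of-the-envelope illustration of arithmetic intensity (counting only ideal FLOPs and unavoidable memory traffic, ignoring caches):

```python
# Arithmetic intensity = FLOPs performed / bytes moved to and from memory.
def matmul_intensity(m, n, k, bytes_per_el=4):
    flops = 2 * m * n * k                          # one multiply + one add
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

# Large dense matmul: hundreds of FLOPs per byte, i.e. compute-bound.
print(matmul_intensity(4096, 4096, 4096))   # ~683

# Matrix-vector product (the regime sparse gathers resemble):
# about half a FLOP per byte, i.e. hopelessly memory-bound.
print(matmul_intensity(4096, 1, 4096))      # ~0.5
```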
Well, yeah, sparsity does not mesh well with GPUs etc. for a number of reasons, including memory caching.
You can have dense sparsity if you are willing to cast aside enshrined convention and go on a little mathematical journey.
There are two issues here that are not necessarily (or entirely) interdependent: sparsity of representation and sparsity of computation.
Representational sparsity, which means using SDRs all over the place instead of dense float vectors, is indeed a tough nut to crack within the DL paradigm.
Computational sparsity, however, is a different thing. In general it simply means that, given a relatively large model, each inference operates on a relatively small subset of the model’s parameters.
Of course, at certain stages within the sparse computing paradigm there has to be one or more SDRs to select the subset of “active” parameters at any given stage, but that doesn’t fundamentally change how backpropagated NNs work. So, at least in theory, the way GPUs would have to do their work should not be very different from how they already do it:
Let’s say somewhere in a NN you have a 1000-unit-wide hidden layer sandwiched between a 100-unit-wide predecessor layer and a 100-unit-wide successor layer. That’s one 100x1000 weight matrix “in” and another 100x1000 one “out” on which the machine needs to do matrix multiplication.
In general, depending on how many problems the model needs to solve and how complex they are, a wider hidden layer - e.g. 20000 instead of 1000 - performs better, but inference for that layer gets 20 times more expensive, and training via backpropagation probably even more than 20 times.
If we could use a 20000-bit-wide, 5%-sparsity “selector SDR” which simply selects 1000 lines from the 20k, on which both forward inference and backpropagation are performed, then in theory instead of

    for line in weight_matrix:
        compute(line)

we would run

    for line_id in sparse_SDR:
        line = weight_matrix[line_id]
        compute(line)

(edit: The above selection would be only slightly more expensive than the simple 1000-line matrix multiplication, yet it operates with a 20000-line model.)
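In NumPy terms, the selector-SDR forward pass sketched above could look like this (all names are mine, and how the selector indices actually get chosen - the hard part - is left open):

```python
import numpy as np

rng = np.random.default_rng(0)
wide, active, d_in = 20000, 1000, 100

# Input weights of a 20000-unit hidden layer
W_in = rng.standard_normal((wide, d_in)).astype(np.float32)
x = rng.standard_normal(d_in).astype(np.float32)

# Selector SDR: indices of the 1000 active lines (5% of 20000);
# here they are just random for illustration.
selector = rng.choice(wide, size=active, replace=False)

# Sparse forward pass: gather the active rows, then an ordinary dense
# matmul -- roughly the cost of a 1000-wide layer.
h_active = W_in[selector] @ x

# It agrees with slicing the full dense result:
assert np.allclose(h_active, (W_in @ x)[selector])
```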
I don’t know the details of how GPUs work, but that looks like a small step to make in hardware for a giant leap in how DL could be done.
PS: And I think all the fuss within the avoiding-catastrophe thread can be reduced to this simple selector-SDR idea.
PS2: And considering the trend of integrating GPU and CPU without separating their memory spaces - e.g. in ARM SoCs, most notably Apple’s M1 - it might be even easier to implement via dynamic memory mapping between GPU and CPU.
Indeed, which is what I mentioned above. However,
I highly doubt that - reducing computation load is great and all, but it’s not much in terms of actually getting to AGI.
That’s basically what NNs already do: create an alternative high-dimensional representation - the simple difference being that it is dense.
I don’t see however how that (sparsification) helps in achieving AGI at all…
This type of approach is becoming more common under the heading of “sparse mixture of experts” models. In fact, the Numenta researchers just did a paper review on it here:
Many of the largest transformer models trained today use such a paradigm. Dense representations with sparse/conditional computation.
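A minimal sketch of that paradigm - dense activations in and out, top-k conditional routing over experts - might look like the following (my own toy illustration, not any specific paper’s architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d, k = 8, 16, 2       # route each input to 2 of 8 experts

experts = rng.standard_normal((n_experts, d, d)).astype(np.float32)
gate_w = rng.standard_normal((d, n_experts)).astype(np.float32)

def moe_layer(x):
    """Dense representation throughout; only k expert matmuls executed."""
    logits = x @ gate_w
    top = np.argsort(logits)[-k:]            # indices of the top-k experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                     # softmax over the top-k only
    # Conditional computation: 6 of the 8 experts are never touched.
    return sum(p * (experts[i] @ x) for p, i in zip(probs, top))

y = moe_layer(rng.standard_normal(d).astype(np.float32))
```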
There is always the option of compiling a trained sparse neural network as though it were C code, rather than pinning the operations within a fixed computational structure that is probably less than optimal.
So I’m going to stick my neck out and say ANNs and SDRs have nothing in common, but might just work in tandem.
In pseudo-anatomical terms, the ANN relates to a network of neurons or columns, but an SDR relates to individual synapses on neurons. The weights in an ANN might relate to favoured paths as SDRs are matched.
But my computational argument is that ANNs represent nothing; they only recognise. There is no conceivable way to take ANNs that recognise the numbers 1 and 2 and ‘plus’ and combine them to get a recogniser for 3, but SDRs for 1, 2 and ‘plus’ might, given the right framework, generate an SDR for 3.
ANNs recognise, SDRs represent. That’s a critical difference in the hunt for AGI.
The little that was known about the weighted sum as an information-storage device from the 1960s through the 1980s has long been forgotten. Now we are just aquaplaning on a sea of assumptions - which is not good for self-driving cars, for example. I wish someone would knuckle down, apply basic scientific methodology to artificial neural networks and their component parts, and then write a clear book about it.
Amen to that. And a chapter on what they can’t do and why, so we know where to focus on other approaches.
You are in luck; it’s been done!
Note the clever bit about what they can’t do.
This is why I love this forum. So much nonsense punctuated every so often by reality. Reminiscent of what Julius Tou used to start talks with back in the 80’s: “We were doing AI when AI wasn’t cool” playing (no pun intended) on the eponymous country song at the time.
This paper provides a pretty solid explanation in debunking alternate theories of how DL works: [2110.09485] Learning in High Dimension Always Amounts to Extrapolation Very very interesting ideas.
I said the representations produced by ANNs - there is a huge difference from what you are saying. Obviously, SDRs aren’t comparable to ANNs at all.
That’s a pretty big misunderstanding right there. I can easily concatenate both representations so that they can be perceived as 3 by another ANN. (I’m not going into what numbers exactly are and how they arise from nothingness, yadda yadda), but rather pointing out that, as you so eminently put it, “given the right framework” it works the same for neural networks too.