Poster Overview: How Can We Be So Slow? Realizing the Performance Benefits of Sparse Networks

Numenta Director of ML Architecture Lawrence Spracklen gives an overview of the poster he presented at the SNN Workshop on July 8–9, 2021. In this poster, “How Can We Be So Slow? Realizing the Performance Benefits of Sparse Networks,” by Lawrence Spracklen, Kevin Hunter, and Subutai Ahmad, we present the techniques Numenta has developed to achieve a 100x inference speedup from sparsity and discuss how many of these learnings can be applied to develop fast sparse networks on CPUs.

Link to poster and abstract: SNN Workshop 2021: How Can We Be So Slow? Realizing the Performance Benefits of Sparse Networks


As a follow-up, here is the paper published by @khunter, @sprack and @subutai:

And here is a presentation @subutai gave at SigOpt Summit 2021:


Great presentation.

I’m curious why the densification-by-overlapping technique that makes the FPGA performance boost possible doesn’t work on GPUs too.

Here’s what I’m referring to: Complementary Sparsity
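For reference, here’s my toy understanding of the trick (a minimal sketch, not Numenta’s implementation): several sparse matrices whose non-zero positions are disjoint get packed into one dense matrix, so dense hardware stays fully utilized.

```python
import numpy as np

# Toy sketch of complementary sparsity (illustrative only): pack k sparse
# weight matrices with mutually disjoint supports into a single dense matrix.
rng = np.random.default_rng(0)
shape = (8, 8)
k = 4  # each packed matrix is 75% sparse

# Partition all 64 positions into k disjoint boolean masks.
perm = rng.permutation(shape[0] * shape[1])
masks = [np.zeros(shape, dtype=bool) for _ in range(k)]
for i, flat in enumerate(perm):
    masks[i % k].flat[flat] = True

# Each sparse matrix has weights only where its mask is True...
weights = [rng.standard_normal(shape) * m for m in masks]

# ...so they sum into one dense matrix with no collisions,
packed = sum(weights)

# and each original sparse matrix can be recovered exactly by masking.
for w, m in zip(weights, masks):
    assert np.allclose(packed * m, w)
```

Since the supports are carved out of a partition, overlap is impossible by construction rather than something that has to be checked after the fact.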


Hugely impressed with the results, particularly the performance. Some thoughts and observations.

Page 7, Figure 6: an 8-core CPU (no detail on memory or channel quantity), which appears to mean that AVX is totally irrelevant because it’s a memory-bound compute problem (ref. the 625x theoretical). The correct scale is lost due to the lack of detail: running on a laptop with a small CPU cache, and potentially with only a single memory channel populated, is >10x different from a Threadripper/EPYC with 8 channels, depending on how the code is parallelized.
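Rough roofline arithmetic behind the “memory bound” claim (all hardware numbers here are my assumptions, not from the poster):

```python
# Back-of-envelope roofline check: a dense layer streams every weight once
# per inference, so once the model spills out of cache the DIMMs, not the
# AVX units, set the speed limit.
params = 2_522_128            # dense weight count quoted from the paper
bytes_per_weight = 1          # 8-bit quantized weights
flops = 2 * params            # one multiply + one add per weight

mem_bw = 25e9                 # assumed: ~25 GB/s, single DDR4-3200 channel
int8_peak = 2 * 3e9 * 2 * 64  # assumed: 2 FMA ports x 3 GHz x 2 ops x 64 int8 lanes

time_mem = params * bytes_per_weight / mem_bw  # ~0.10 ms to stream weights
time_cpu = flops / int8_peak                   # ~0.007 ms of raw compute

# time_mem dwarfs time_cpu: more memory channels help, wider vectors don't.
```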

Also, if the model fits in one of the larger EPYC CPU caches, the performance increase would be significant enough to change some of your conclusions. Try running the model so that it lives entirely in CPU cache to quantify this difference, which is dominated by the CPU strangler known as the DIMM.

Switching the CPU to a Xeon 8275CL (again with no memory-channel details) then eliminates any consistent comparison for Fig. 8. Single-socket setup or dual? NUMA performance hit?

Intel Xeon 8275CL: 35 MB cache / 6 memory channels / 24 cores
AMD EPYC 7763: 256 MB cache / 8 memory channels / 64 cores

The cache can make a 10x difference on some code.

Figure 13(c), “CPUs”: is this implying more than one CPU, and thus a third CPU hardware configuration?

If the FPGA could implement a fully pipelined model, the results would have been over 100x faster again in aggregate throughput, but single-response latency would be far slower than a GPU’s. The FPGAs are just not big enough yet…

The FPGA results give a speech-recognition rate of 15.8 days of audio per second, which, if spread over, say, 12 “awake” hours per day, means roughly one month of heard audio per second. This implies the FPGA can recognise a lifetime of audio in under 16 minutes. Why do we think this is slow?

Also, the energy to recognise a lifetime’s audio is then about 54 Wh, less energy than is stored in my flashlight (based on the 215 W spec sheet of the U250 and my flashlight’s 3 × 26650 cells at 19.5 Wh). A typical Alexa device consumes that in just over a day on standby (at 2 W)… hmmm… so the recognition compute is then 0.00392% of the device’s energy footprint.
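Checking my arithmetic (assumed: 75-year lifetime; the ~54 Wh figure comes from rounding 31.6 awake-days per second up to one month per second):

```python
# Back-of-envelope reproduction of the lifetime-audio numbers above.
audio_days_per_sec = 15.8            # FPGA recognition rate from the poster
awake_hours_per_day = 12
lifetime_years = 75                  # assumed lifetime

awake_days_per_sec = audio_days_per_sec * 24 / awake_hours_per_day  # 31.6
months_per_sec = awake_days_per_sec / 30                            # ~1.05

seconds_total = lifetime_years * 12 / months_per_sec  # ~854 s
minutes_total = seconds_total / 60                    # ~14 min, "under 16"

u250_watts = 215                                      # U250 spec-sheet power
energy_wh = u250_watts * seconds_total / 3600         # ~51 Wh (~54 Wh when
                                                      # 31.6 days ≈ 1 month)
```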

Perhaps in the future you will be able to compile a trained sparse neural network into an optimal sequence of machine-language instructions, like a programming language.
Anyway, for sub-optimal reasons, there is a booklet on the WHT on


I think that particular model fits in CPU cache even on many laptops.

The baseline dense version of the network contained 2,522,128 parameters, while the sparse network contained 127,696 non-zero weights, or about 95% sparse.

Both activations and weights are quantized to 8-bits
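A quick sanity check of the cache-fit claim (assuming 1 byte per 8-bit weight and ignoring any index/metadata overhead of the sparse format):

```python
# Footprint arithmetic for the quoted network sizes.
dense_params = 2_522_128
sparse_nonzero = 127_696

sparsity = 1 - sparse_nonzero / dense_params  # ~0.95, i.e. ~95% sparse

dense_bytes = dense_params * 1     # ~2.5 MB: fits many desktop L3 caches
sparse_bytes = sparse_nonzero * 1  # ~125 KB: fits comfortably in L2
```

So even the dense baseline fits the L3 of most recent laptop CPUs, and the sparse weights fit in L2.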

What I don’t understand (but who am I?) is how enough of those complementary systems can be found without risk of overlap, and how the operations to verify the non-overlap aren’t more expensive than computing the matrices themselves.

I suppose there are mathematical and statistical techniques that allow this. Maybe someone can shed some noob-friendly light on this?

Subutai’s presentation was at a SigOpt conference; all I gathered about SigOpt is that it’s an optimization tool they used to tune the sparse networks.

And the sparsification/optimization counts as “training,” since it is performed once, while computing the dot product affects inference time every time the model is evaluated.

Question: does this complementary sparsity approach work on “ordinary” accelerators like GPUs/TPUs/NPUs? The posted CPU and FPGA speedups are impressive but I’m somewhat skeptical those can provide the same kinds of speedups at the relevant compute scales.

Re the “Two Sparsities” paper, Ivan Godard of Mill Computing, Inc. has asked a question regarding the compaction algorithm. Godard is the founder of Mill Computing and has deep instruction set architecture competency, so it would probably be a good idea to contact him about hardware sparsity.
