Numenta Director of ML Architecture Lawrence Spracklen gives an overview of the poster he presented at the SNN Workshop on July 8th and 9th, 2021. In this poster “How Can We Be So Slow? Realizing the Performance Benefits of Sparse Networks” by Lawrence Spracklen, Kevin Hunter and Subutai Ahmad, we present the techniques Numenta has developed to achieve a 100x inference task speedup from sparsity and discuss how many of the learnings could be applied to develop fast sparse networks on CPUs.
Hugely impressed with the results, particularly the performance. Some thoughts and observations.
Page 7, Figure 6: an 8-core CPU (no detail on memory and channel quantity), which appears to mean that AVX is totally irrelevant because this is a memory-bound compute issue (ref. the 625x theoretical figure). The correct scale is lost due to the lack of detail. Running on a laptop with a small CPU cache, and potentially with only a single memory channel populated, is >10x different from a Threadripper/EPYC with 8 channels, depending on how the code is also parallelized.
Also, if the model fit in one of the larger EPYC CPU caches, the increase in performance would be significant enough to change some of your conclusions. Try running the model so that it sits entirely in CPU cache to quantify this difference, which is dominated by the CPU strangler known as the DIMM.
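For a rough sense of scale (my own assumptions, not numbers from the poster), here is the kind of back-of-envelope I have in mind: peak DRAM bandwidth scales with channel count, and a kernel that has to stream weights from DRAM scales with it, before cache effects are even considered.

```python
# Rough, assumption-laden estimate of how memory channel count bounds a
# memory-bound sparse kernel. DDR4-3200 figures are my guesses, not the poster's.

DDR4_3200_GBPS_PER_CHANNEL = 3200e6 * 8 / 1e9   # 3200 MT/s * 8 bytes ~= 25.6 GB/s

for channels in (1, 2, 8):                      # laptop, desktop, EPYC/Threadripper
    peak = channels * DDR4_3200_GBPS_PER_CHANNEL
    print(f"{channels} channel(s): ~{peak:.0f} GB/s peak DRAM bandwidth")

# A kernel streaming weights from DRAM scales roughly with this peak (~8x from
# 1 to 8 channels); fitting the model in cache removes the DRAM bound entirely.
```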
Switching the CPU to a Xeon 8275CL (also with no memory channel details) then prevents a consistent comparison for Fig. 8. Single-CPU setup or dual? NUMA performance hit?
Figure 13 (c) “CPUs”: is this implying more than one CPU, and therefore a third CPU hardware configuration?
If the FPGA could implement a fully pipelined model, the results would have been over 100x faster again in aggregate throughput, but single-response latency would be far slower than a GPU. The FPGAs are just not big enough yet…
The FPGA results give a speech recognition rate of 15.8 days of audio per second, which, if spread over say 12 hours of “awake” time per day, means roughly 1 month of waking audio per second. This would then imply that the FPGA can recognise a lifetime of spoken audio in under 16 minutes. Why do we think this is slow?
Also, the energy to recognise a lifetime’s audio is then about 54 Wh, less energy than is stored in my flashlight (based on the 215 W spec sheet of the U250 and my flashlight with 3 x 26650 19.5 Wh cells). A typical Alexa device consumes that in just over a day on standby (at 2 W)… hmmm… so the recognition compute energy is then 0.00392% of the device’s own energy footprint.
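Spelling my arithmetic out, with my assumptions stated (roughly 80-year lifetime, 12 waking hours per day, the U250’s 215 W spec, Alexa at 2 W standby):

```python
# Back-of-envelope for the FPGA throughput/energy claims above.
# Assumptions (mine): 80-year lifetime, 12 waking hours/day, U250 at 215 W,
# Alexa standby at 2 W, flashlight = 3 x 26650 cells at ~19.5 Wh each.

AUDIO_DAYS_PER_SECOND = 15.8                 # recognition rate from the FPGA result

waking_fraction = 12 / 24
waking_months_per_second = AUDIO_DAYS_PER_SECOND / waking_fraction / 30.4
lifetime_months = 80 * 12
seconds_for_lifetime = lifetime_months / waking_months_per_second
print(f"~{waking_months_per_second:.1f} waking month(s) of audio per second")
print(f"lifetime of waking audio in ~{seconds_for_lifetime / 60:.0f} minutes")

u250_watts = 215
energy_wh = u250_watts * seconds_for_lifetime / 3600
print(f"energy: ~{energy_wh:.0f} Wh (flashlight holds ~{3 * 19.5:.0f} Wh)")
print(f"Alexa standby (2 W) burns that in ~{energy_wh / 2:.0f} hours")
```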
Perhaps in the future you will be able to compile a trained sparse neural network into an optimal sequence of machine-language instructions, much like compiling a programming language.
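A toy illustration of that idea (entirely hypothetical, nothing like this appears in the paper): unroll the non-zero weights of one sparse layer into straight-line code and compile it, so the zeros never cost anything at run time.

```python
import numpy as np

def compile_sparse_layer(weights):
    """Toy sketch: unroll one sparse weight matrix into straight-line code.

    Emits one multiply-accumulate per non-zero weight and compiles the
    generated source with Python's own compiler. Purely illustrative of the
    "compile the network" idea above, not a real machine-code kernel.
    """
    rows, cols = weights.shape
    lines = ["def layer(x):", f"    y = [0.0] * {rows}"]
    for i in range(rows):
        for j in range(cols):
            w = float(weights[i, j])
            if w != 0.0:
                lines.append(f"    y[{i}] += {w!r} * x[{j}]")
    lines.append("    return y")
    namespace = {}
    exec(compile("\n".join(lines), "<sparse-layer>", "exec"), namespace)
    return namespace["layer"]

# Example: a 4x4 weight matrix with roughly 75% of the entries zeroed out.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)) * (rng.random((4, 4)) > 0.75)
layer = compile_sparse_layer(w)
print(layer([1.0, 2.0, 3.0, 4.0]))
```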
Anyway, for sub-optimal reasons, there is a booklet on the WHT on archive.org: https://archive.org/details/whtebook-archive
I think that particular model fits in CPU cache even on many laptops.
The baseline dense version of the network contained 2,522,128 parameters, while the sparse network contained 127,696 non-zero weights, or about 95% sparse.
…
Both activations and weights are quantized to 8-bits
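A quick check of those quoted numbers against the cache-fit claim (assuming 8-bit weights and ignoring any index/metadata overhead on my part):

```python
# Quick check of the quoted numbers and the "fits in CPU cache" claim.
# Assumes 8-bit weights and ignores any index/metadata overhead.

dense_params = 2_522_128
nonzero_weights = 127_696

sparsity = 1 - nonzero_weights / dense_params
print(f"sparsity: {sparsity:.1%}")                  # ~94.9%, i.e. "about 95%"

sparse_kib = nonzero_weights / 1024                 # 1 byte per 8-bit weight
dense_kib = dense_params / 1024
print(f"sparse weights: ~{sparse_kib:.0f} KiB vs dense ~{dense_kib:.0f} KiB")
# ~125 KiB comfortably fits in L2 on most recent laptop cores; even the dense
# 8-bit model (~2.4 MiB) fits in many L3 caches.
```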
What I don’t understand (but who am I) is how enough of those complementary systems can be found without risk of overlap, and how the operations to verify the non-overlap are not more expensive than calculating the matrices themselves.
I suppose there are mathematical and statistical techniques that allow this. Maybe someone can shed some noob-friendly light on this?
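One way to see that the non-overlap check can at least be cheap (purely illustrative; I don’t know how the paper actually does it): represent each kernel’s non-zero positions as a bitmask, and a group of kernels is overlap-free exactly when no bit is set in two masks. That is a handful of bitwise operations done once, offline, versus the dot products that recur at every inference.

```python
# Illustrative only -- not Numenta's actual packing/verification step.
# Each kernel's support (positions of its non-zero weights) becomes a bitmask;
# a group of kernels is "complementary" iff no position is set in two masks.

def support_mask(weights):
    """Bitmask of the non-zero positions of a flat weight vector."""
    mask = 0
    for pos, w in enumerate(weights):
        if w != 0:
            mask |= 1 << pos
    return mask

def are_complementary(kernels):
    """True iff no two kernels share a non-zero position."""
    combined = 0
    for k in kernels:
        m = support_mask(k)
        if combined & m:          # overlap found
            return False
        combined |= m
    return True

# Example: three 8-element kernels with disjoint supports could be packed together.
k1 = [0, 3, 0, 0, 0, 0, 0, 0]
k2 = [0, 0, 0, 5, 0, 0, 1, 0]
k3 = [2, 0, 0, 0, 0, 4, 0, 0]
print(are_complementary([k1, k2, k3]))   # True
print(are_complementary([k1, k1]))       # False (same positions reused)
```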
Subutai’s presentation is at a SigOpt conference; all I gathered about SigOpt is that it is an optimization tool they used to optimize the sparse networks.
And the sparsification/optimization counts as “training”, since it is performed once, while computing the dot product affects inference time every time the model is evaluated.
Question: does this complementary sparsity approach work on “ordinary” accelerators like GPUs/TPUs/NPUs? The posted CPU and FPGA speedups are impressive, but I’m somewhat skeptical that those platforms can provide the same kinds of speedups at the relevant compute scales.
Re the “Two Sparsities” paper: Ivan Godard of Mill Computing, Inc. has asked a question regarding the compaction algorithm. Godard is the founder of Mill Computing and has deep instruction-set-architecture expertise, so it would probably be a good idea to contact him about hardware sparsity.