Numenta Technology Demonstration: Sparse networks perform inference 50 times faster than dense networks, with competitive accuracy

Hi all,

Today we announced a technology demonstration with results showing 50x speed-up on inference tasks in deep learning networks using sparse algorithms, with no loss in accuracy. We ran sparse and dense networks on Xilinx FPGAs using the Google Speech Commands dataset. This proof of concept, highlighting the benefits of sparsity to scale deep learning, is part of our ongoing effort to apply neuroscience principles to machine learning systems.

If you’re interested in learning more, you can read the press release or the accompanying white paper.

And of course if you have questions about the work, let us know here on the Forum!

Christy

12 Likes

I thought mixing HTM and Deep Learning would lead to conflicts rather than work this well! Congrats.

I’m also an IC engineer, so I’m curious about a few things. First, are the benchmarks running in FP32, FP16, FP8, or fixed-point? Second, which NN accelerator are you using? I’ll assume you are using Xilinx’s Vitis AI core, or are you using a custom IP core? My university lab is currently developing both a DNN accelerator and an HTM accelerator, so I may look into merging the two.

Also, and more importantly: is this just training a dense network and then pruning it, or are you training a sparse network from the ground up? Do k-winners and boosting have an effect on the network? If you are training from scratch, how much faster or slower is training a sparse NN compared with a dense one?

Congrats again!

3 Likes

This is a semi-ignorant conjecture but here goes:

Starting with sparse networks can work to migrate the “golden ticket” onto the existing connections.

Breaking down the Lottery Ticket Hypothesis: Distilling the ideas from MIT CSAIL’s intriguing paper: “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks”.

3 Likes

Congrats on the result, and great to see @lucasosouza’s research paying off (and many others in the team I’m sure).

The part that probably impresses me the most is that only the network needed to be modified, rather than specialised compression or feature engineering on the source data.

The throughput comparison with GPUs looks promising too, given that an Alveo U250 costs only about twice as much as a Tesla K80.

1 Like

Hi Marty! I will try to answer some of your questions, but I will defer some of the hardware questions to @khunter.

The benchmark runs in 8-bit fixed-point. For this particular project, using the GSC dataset, we train a network whose sparse topology is fixed from the start and does not change during training. That is the same setup as in the How Can We Be So Dense paper, with some modifications to fit the hardware, such as block sparsity and better tie-breaking in k-winners.
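To make the "fixed sparse topology" idea concrete, here is a minimal sketch in plain Python. It is not the actual training code: the helper names (`make_fixed_mask`, `masked_sgd_step`), the random choice of connections, the density, and the plain SGD update are all assumptions for illustration. The only point it shows is that the mask is chosen once and re-applied after every update, so the set of active connections never changes.

```python
import random

def make_fixed_mask(n_out, n_in, density, seed=0):
    """Choose, once, which inputs each output unit may connect to.

    Hypothetical helper: connections are picked uniformly at random here,
    which is an assumption, not necessarily how the real topology is chosen.
    """
    rng = random.Random(seed)
    n_keep = max(1, round(density * n_in))
    mask = []
    for _ in range(n_out):
        kept = set(rng.sample(range(n_in), n_keep))
        mask.append([1 if j in kept else 0 for j in range(n_in)])
    return mask

def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Plain SGD update followed by the mask, so pruned weights stay zero."""
    return [
        [(w - lr * g) * m for w, g, m in zip(w_row, g_row, m_row)]
        for w_row, g_row, m_row in zip(weights, grads, mask)
    ]

mask = make_fixed_mask(n_out=4, n_in=10, density=0.3)
weights = [[0.5 * m for m in row] for row in mask]
grads = [[1.0] * 10 for _ in range(4)]  # dummy gradients, dense on purpose
weights = masked_sgd_step(weights, grads, mask)
```

Even though the dummy gradients are dense, only the three connections per unit allowed by the mask end up with non-zero weights after the step.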

On the ImageNet project, we have been researching extensively how to find better sparse topologies, whether at the start of training (foresight pruning), during training (dynamic sparse), or after training (pruning). We experimented with most of the state-of-the-art algorithms, but in the end we came up with our own, which is better adapted to the hardware restrictions and incorporates insights from HTM. @mrcslws is leading that research and can speak to it better. We have had a few research meetings on this topic, but the work hasn’t been published yet.

Training a sparse network on GPUs is actually a bit slower than training a dense one, since we don’t get the benefits of sparse matrix products on SIMD architectures, and there is the added cost of KWinners, which ranks the activations at each forward pass. The benefits highlighted in the announcement are for inference only. We are thinking about how to speed up training too, but that is for a future endeavor.
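For readers unfamiliar with k-winners, the ranking step can be sketched roughly as follows. This is an illustration only: the function name `k_winners` and the tie-breaking rule (lower index wins) are assumptions, and the real layer also involves boosting, which is left out here.

```python
def k_winners(activations, k):
    """Keep the k largest activations and zero out the rest.

    Ties are broken by position (lower index wins), so exactly k units
    survive even when several activations are equal. This particular
    tie-breaking rule is an assumption made for the sketch.
    """
    if k <= 0:
        return [0.0] * len(activations)
    # Rank indices by activation, descending; the index is the tie-breaker.
    ranked = sorted(range(len(activations)), key=lambda i: (-activations[i], i))
    winners = set(ranked[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]

layer_output = [0.2, 0.9, 0.9, 0.1, 0.9, 0.4]
sparse_output = k_winners(layer_output, k=2)
# sparse_output == [0.0, 0.9, 0.9, 0.0, 0.0, 0.0]
```

The sort on every forward pass is the extra cost mentioned above: a dense layer would simply apply an element-wise activation such as ReLU and skip the ranking.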

1 Like

We actually found that dynamic sparse algorithms perform better than iterative pruning in terms of final accuracy on the ImageNet dataset.

Dynamic sparse algorithms also train faster. Since we grow and prune connections during training, we only need to train it once. Iterative pruning, as proposed in the Lottery Ticket paper, requires several cycles where you fully train a dense network, prune it, and then re-train. I wrote a little bit more about this here.
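As a rough picture of what "grow and prune connections during training" can look like, here is a generic sketch of a single prune-and-regrow step. It is not our algorithm: the function name `prune_and_regrow`, magnitude-based pruning, random regrowth, and the `fraction` parameter are all assumptions borrowed from the general dynamic sparse literature.

```python
import random

def prune_and_regrow(weights, mask, fraction, rng):
    """One dynamic-sparse update on a flat weight list.

    Drops the smallest-magnitude active weights, then activates the same
    number of previously inactive positions at random (initialised to zero),
    so the total number of active connections stays constant.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    n_swap = min(int(fraction * len(active)), len(inactive))

    # Prune: weakest active connections by absolute value.
    to_prune = sorted(active, key=lambda i: abs(weights[i]))[:n_swap]
    # Regrow: random positions that were inactive before this step.
    to_grow = rng.sample(inactive, n_swap)

    new_weights, new_mask = list(weights), list(mask)
    for i in to_prune:
        new_weights[i], new_mask[i] = 0.0, 0
    for i in to_grow:
        new_weights[i], new_mask[i] = 0.0, 1
    return new_weights, new_mask

rng = random.Random(0)
mask = [1, 1, 1, 1, 0, 0, 0, 0]
weights = [0.05, -0.8, 0.3, -0.01, 0.0, 0.0, 0.0, 0.0]
weights, mask = prune_and_regrow(weights, mask, fraction=0.5, rng=rng)
```

A step like this would run every so often inside a single training run, which is why the network only has to be trained once, instead of going through repeated train, prune, and re-train cycles.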

1 Like

For inference we are using Vitis AI for dense, and custom IP for sparse. In both we are quantizing the model to 8-bit integers.

Kevin

2 Likes