I could not find any discussion about Numenta’s latest news:
I would like to know how this compares with the performance of BERT on GPU - any ideas?
The key line is this:
Using the new Intel Xeon CPU Max Series, Numenta demonstrates it can optimize the BERT-Large model to process large text documents, enabling unparalleled 20x throughput speed-up for long sequence lengths of 512.
I don’t know where they got the 123x figure from - probably from the smallest models - but all of this is pretty standard. Check out something like NeuralMagic:
They have fully deployed Huggingface pipelines (often used in industry), along with CV models, with similar speedups. Additionally, they hold the SOTA for sparsifying LLMs: [2301.00774] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
I’ve disapproved of Numenta’s new direction for quite some time - it aims to cash out by partnering with companies like Intel and doing bog-standard stuff like sparsification, which the industry adopted ages ago, rather than working on TBT/HTM.
That would be quite nice. AFAIK GPUs can’t really exploit sparsity as well as CPUs.
But recent papers have come up with more efficient sparse-dense GEMMs.
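For concreteness, here is what a sparse-dense GEMM looks like in a minimal SciPy sketch - the shapes and ~90% sparsity level are arbitrary assumptions on my part, not anyone’s actual kernel:

```python
# Minimal sketch of a sparse-dense GEMM with SciPy (illustrative only;
# the matrix shapes and 90% sparsity level are arbitrary assumptions,
# not Numenta's or Neural Magic's actual kernels).
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)

# Sparse weight matrix W (e.g. a pruned linear layer) stored in CSR format.
W = sparse_random(1024, 1024, density=0.10, format="csr", random_state=0)

# Dense activation matrix X (a batch of inputs).
X = rng.standard_normal((1024, 256))

# Sparse-dense GEMM: only the ~10% stored nonzeros of W participate,
# which is where the CPU-side speedup comes from.
Y = W @ X
print(Y.shape)  # (1024, 256)
```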
I think it remains to be seen, but I would expect at least a benchmark against NeuralMagic if Numenta is serious about its research on sparsifying DL models.
It’s an improvement over the competing AMD Milan processor running the same, unmodified model.
Intel’s embedded AI hardware alone can be credited with roughly an 8x improvement; the rest is due to Numenta’s optimisation/sparsifier, which, unlike other CPU sparsifiers, has to take the underlying accelerator hardware into account.
Model size should not affect the ratio. I guess Xeons like that can have sufficient RAM even for pretty big models. BERT-Large is only 345M parameters.
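A quick back-of-envelope check (assuming plain FP32/BF16/INT8 storage, nothing Numenta-specific) shows why RAM is a non-issue here:

```python
# Back-of-envelope memory footprint for BERT-Large (345M parameters),
# just to show why fitting the model in CPU RAM is a non-issue.
params = 345e6
for name, bytes_per_param in [("FP32", 4), ("BF16", 2), ("INT8", 1)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.2f} GiB")
# FP32: ~1.29 GiB, BF16: ~0.64 GiB, INT8: ~0.32 GiB
```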
Another big question is whether this acceleration is inference-only or covers training too. If I recall correctly, Numenta showcased its sparsification techniques in training too.
Their page claims a 3x improvement on BERT (inference only).
That’s against a GPU, and Numenta’s benchmark is also inference-only
Intel doesn’t have any AI hardware at all - though you may be on the right track here: certain optimizations may exploit hardware-specific features. An independent benchmark is definitely needed here before any conclusions can be reached.
Model size does play a significant part, as more complex circuits are harder to sparsify. Maybe it won’t matter at the ~0.4B scale, but then again, we can’t be sure without comparing against other methods.
I highly doubt it’s for training - they always mention inference. Calculating gradients over sparse matrices is extremely difficult, which is why people prefer to sparsify post-training.
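For reference, post-training sparsification in its simplest form is just one-shot magnitude pruning. A minimal PyTorch sketch (the 90% level is an arbitrary choice; real methods like SparseGPT are far more sophisticated):

```python
# Minimal post-training (one-shot) magnitude-pruning sketch with PyTorch,
# illustrating why people prefer to sparsify after training: no sparse
# gradients are ever needed. The 90% sparsity level is arbitrary.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(1024, 1024)

# Zero out the 90% of weights with the smallest magnitude (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.9)
prune.remove(layer, "weight")  # make the pruning permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.1%}")  # ~90.0%
```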
tomato - tomato.
In a remarkable example leveraging Intel’s new Intel Advanced Matrix Extensions (Intel AMX), Numenta reports a stunning 123X throughput improvement vs. current generation AMD Milan CPU implementations for BERT inference on short text sequences, while smashing the 10ms latency barrier required for many language model applications. BERT is the popular Transformer-based machine learning technology for Natural Language Processing (NLP) pre-training developed by Google.
See chapter 3:
https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf
Given how small the tile sets are, extracting the sparse values to be operated on makes sense.
It’s fuzzy marketing, as always whenever chip companies are involved.
Firstly, AMX introduces 2D tile registers (Advanced Matrix Extensions - Wikipedia) and supports BF16 and INT8. That’s about it, as far as I can gather.
Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids - Phoronix does a nice set of analyses of AMX out in the wild - you can see the maximum performance improvement is often in the ~1.5-2x region, far from the claimed 10x and 8.6x in their marketing material.
Why? Because they compare AMX+BF16 vs FP32+AVX-512. Typical. The above benchmark compares the same datatypes, which is why the speedup isn’t so drastic.
Most importantly, the TMUL (Tile Matrix Multiply) units are just standard units for performing GEMMs - they’re called “tiles” because that’s how Intel refers to the new registers internally.
Meaning they’re still vanilla matrix-multiplication units, optimized simply for dense-dense GEMMs. But since Numenta banks on sparsity, they don’t really offer a significant speedup (obviously they will help, but not by some insane 10x-20x figure).
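To make the point concrete, here’s a toy illustration - plain NumPy matmul standing in for the TMUL, nothing here reflects Numenta’s actual kernels:

```python
# Toy illustration: a dense GEMM unit (here, plain NumPy matmul standing
# in for AMX's TMUL) performs the same work whether or not the operand is
# mostly zeros. Skipping zeros has to happen in software/data layout,
# which is the sparsifier's job, not the matrix unit's.
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2048, 2048)).astype(np.float32)
W_dense = rng.standard_normal((2048, 2048)).astype(np.float32)
W_sparse = W_dense * (rng.random((2048, 2048)) < 0.1)  # ~90% zeros

for name, W in [("dense", W_dense), ("90% zeros", W_sparse)]:
    t0 = time.perf_counter()
    for _ in range(5):
        _ = X @ W
    print(f"{name}: {(time.perf_counter() - t0) / 5:.3f} s")
# Both take essentially the same time: the zeros get multiplied anyway.
```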
TL;DR Marketing is pure bull; Intel’s AMX has little to do with anything here, and it’s mostly Numenta’s methodology and optimizations. I’m happy to change my opinion if anybody else has other sources of information - Intel’s docs are pretty sparse on this.
PS: For those interested in the “specifics” Intel shares about AMX: With AMX, Intel Adds AI/ML Sparkle to Sapphire Rapids - The Next Platform
@neel_g
At this point I must file your comments in the SWAG* category.
Make sure to get back to me when you understand how Numenta is working with the sparse values and using the matrix math unit to do the heavy lifting. Working demonstration code would be nice.
Until then, you have no valid basis to make any claims for or against how it is all working.
*Scientific Wild Ass Guess
Intel’s announcement also mentions Numenta’s result: Intel® Xeon® CPU Max Series - AI, Deep Learning, and HPC Processors
Intel makes discrete GPU cards now:
They did an event here in Seattle, live training a (small) spiking neural network model and giving away beer.
I would like to see the performance of the normal dense BERT model running on the latest GPU compared with Numenta’s sparsified version running on the latest CPU.
I agree it would be good to see a benchmark against Neural Magic too.
Neural Magic has chosen to open-source their code.
Even if this test evaluates sparsity only for inference, I wouldn’t be surprised if they (both Numenta and Intel) aim to implement it at the training level.
Calculating gradients with sparsity is hard, especially for GPUs, because they are poor at making dynamic choices about which sub-slices of a matrix multiplication to perform or skip.
Let’s not forget that all DL/LLM model architectures evolved to be optimized for the underlying hardware, not the other way around (with the exception of reducing parameter bit size).
Without a reliable means to parse and compute only a fraction of the parameters (e.g. 5%) in both the forward and backward passes, NNs evolved towards deep networks - lots of relatively narrow layers stacked on top of each other.
The ability to sparsify processing would allow widening the layers themselves, which opens up new pathways to explore.
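Purely for illustration, a toy version of that idea - compute only ~5% of a very wide layer’s outputs, where the “selection” step here is random and merely stands in for whatever routing/winner-take-all mechanism one would actually use:

```python
# Toy sketch of "compute only a fraction of a wide layer": pick ~5% of the
# output units and multiply only those columns of W. Illustrative only;
# the random selection stands in for a real routing/WTA mechanism.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 1024, 16384              # a very wide layer
k = int(0.05 * d_out)                  # 5% of units active

W = rng.standard_normal((d_in, d_out)).astype(np.float32)
x = rng.standard_normal(d_in).astype(np.float32)

# Decide which output units to compute at all (stand-in for a cheap
# scoring/routing pass).
active = rng.choice(d_out, size=k, replace=False)

# Only k columns of W are touched in the forward pass.
y = np.zeros(d_out, dtype=np.float32)
y[active] = x @ W[:, active]
```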
Comparing only raw FLOPs might be akin to pointing out how much higher the resolution and frame rate of a $200 phone’s camera are compared with a $2,000 100-kilopixel thermal imager: the latter opens up a different spectrum.