Can anyone explain why Numenta's latest algorithm optimizes Deep Learning 100x?

One would think Numenta would keep working on a non-DL solution (like HTM 10 years ago) in line with the Thousand Brains Theory. DL is a dead end on the way to AGI, or am I missing something about Numenta's current work…

Some of your questions are addressed in this interview with @Subutai Ahmad that was just posted to Numenta’s YouTube channel.


Extraordinary claims require extraordinary evidence.

As far as I can tell from their vague statements in the video, it's pretty much sparsification of DNNs, and such speedups are already commercialized/OSS, offering around 10-30x, especially for inference. So I doubt it's a novel improvement - most likely the numbers are inflated and it's just another commercialized sparsification system, which is nothing new.
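For context, the weight sparsification those existing tools typically rely on is magnitude pruning plus sparse inference kernels. A rough sketch of the pruning step (my own illustration, not Numenta's technique):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until the target sparsity is reached."""
    k = int(weights.size * sparsity)                     # number of weights to drop
    threshold = np.sort(np.abs(weights), axis=None)[k]   # k-th smallest magnitude
    return np.where(np.abs(weights) < threshold, 0.0, weights)

# Example: prune a 512x512 dense layer to 90% weight sparsity.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
W_sparse = magnitude_prune(W, sparsity=0.9)
print("fraction of weights kept:", np.count_nonzero(W_sparse) / W.size)  # ~0.1
```

The actual speedup then depends entirely on whether the hardware or library can skip the zeros, which is presumably where those 10-30x figures come from.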

If that's the overall direction Numenta is going to take in the future - commercializing small advances just to make a quick buck - I'd say I'm sorely disappointed.


I think they’re probably talking about their recent publication:

Two sparsities are better than one: unlocking the performance benefits of sparse–sparse networks

Abstract

In principle, sparse neural networks should be significantly more efficient than traditional dense networks. Neurons in the brain exhibit two types of sparsity; they are sparsely interconnected and sparsely active. These two types of sparsity, called weight sparsity and activation sparsity, when combined, offer the potential to reduce the computational cost of neural networks by two orders of magnitude. Despite this potential, today’s neural networks deliver only modest performance benefits using just weight sparsity, because traditional computing hardware cannot efficiently process sparse networks. In this article we introduce Complementary Sparsity, a novel technique that significantly improves the performance of dual sparse networks on existing hardware. We demonstrate that we can achieve high performance running weight-sparse networks, and we can multiply those speedups by incorporating activation sparsity. Using Complementary Sparsity, we show up to 100× improvement in throughput and energy efficiency performing inference on FPGAs. We analyze scalability and resource tradeoffs for a variety of kernels typical of commercial convolutional networks such as ResNet-50 and MobileNetV2. Our results with Complementary Sparsity suggest that weight plus activation sparsity can be a potent combination for efficiently scaling future AI models.

LINK: Two sparsities are better than one: unlocking the performance benefits of sparse–sparse networks - IOPscience
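The headline 100x is easier to believe once you see that the two sparsities multiply. A back-of-the-envelope sketch (not the paper's Complementary Sparsity FPGA kernel, just counting useful multiply-accumulates):

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_out = 1024, 1024

# ~90% weight sparsity: only 10% of connections exist.
weight_density = 0.10
W = rng.standard_normal((n_out, n_in)) * (rng.random((n_out, n_in)) < weight_density)

# ~90% activation sparsity: only 10% of inputs are non-zero (k-winners style).
act_density = 0.10
x = rng.standard_normal(n_in) * (rng.random(n_in) < act_density)

dense_macs = n_in * n_out                      # work a dense layer would do
useful_macs = np.count_nonzero(W[:, x != 0])   # nonzero-weight x nonzero-input pairs only
print(f"dense MACs:  {dense_macs}")
print(f"useful MACs: {useful_macs}  (~{dense_macs / useful_macs:.0f}x fewer)")
```

The hard part, which the paper addresses, is getting real hardware to actually skip the other ~99% of the work instead of multiplying by zero.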


Quite timely video - they read my mind :wink: Yes, I watched it earlier today. Commercializing some ideas on existing hardware/software is quite understandable, and in the future they will build new algorithms/hardware to incorporate their ideas more fully.

Don't be just yet :wink: It's a pragmatic approach, considering how many resources (people, software and hardware) are devoted to DL. I wonder how they sparsified DL, though. According to the video mentioned above, they do want to eventually implement the Thousand Brains Theory ideas, but I suspect there's not much concrete in terms of specific software to show right now.
Doing backprop after each training instance is fed into a DNN is overkill, and it is not observed in the brain. 10-100x is peanuts considering we have a new kind of NN with much more structured data (which Subutai mentions) and no backprop.
We have built a prototype algorithm that achieves exactly that and will post a link to a demo. Highly structured, no gradient descent, near-constant pattern recognition time (!), a training mechanism that is the same as recognition, and no complex computation per se. Sparsity is built in, and so is the temporal nature of any data type. It faithfully follows the principles of the mammalian (and bird?) brain. We estimate 100-1000x in software as a minimum…
Are you familiar with HTM as it was implemented a while ago?

The two sparsity types are in fact one and the same. All the talk about sparsity is triggered by the unnatural (as in "not natural" :wink:) density of DNNs, which are just a linear algebra approach to the narrow task of image (read: sequence) classification.

It's nothing new - they utilize some traditional techniques with a few tricks. The real speedup comes from the FPGAs - traditional papers optimize for GPUs and TPUs since they are more plentiful and cheaper.

AFAIK, TBT is still being implemented, and HTM is a poor take on a recurrent NN. The properties are unquestionably interesting, but the fact remains that it doesn't really have any real results - a topic I've griped about on this forum multiple times, and possibly no one wants to hear it again :wink: You're welcome to browse my comments on other threads…


I believe a considerable portion of HTM enthusiasts are like me: people who are not confident they understand math well enough to dive into deep learning but still want to do something meaningful. HTM, Assembly Calculus, Adaptive Resonance Theory and others are easy to grasp and give that gut feeling of actually understanding; the hope is that if we plug those pieces together just right, we may end up with something that works. But I have to admit, nothing has worked well enough so far.

This lack of applicability does not surprise me.

The cortex has evolved to do certain tasks and it does them very well. The things the cortex does generally don't apply if you are not doing the same tasks: detecting changes and memorizing affordances and experiences for keyed recall at some future time.

Please note that speech is an affordance.

Most of the "special sauce" of intelligence is driven by the subcortex, and to the best of my knowledge, nobody is spending much time trying to re-create those functions. I suspect that if you were, you would find that the cortex and cerebellum are exactly what you need to turn it into an AGI.

Now I’m curious

I would note that oftentimes the perception of math in DL is more of a barrier than the content itself. I'd urge you to give it a try - trust me, it's simple enough for even high schoolers to grasp. One doesn't really need to prove convergence or other properties of certain algorithms to get a low-level understanding of NNs themselves.
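To back that up, here is roughly all the math a tiny network needs: two matrix multiplies forward, the chain rule backward (a generic toy example, nothing Numenta-specific):

```python
import numpy as np

# A one-hidden-layer network learning XOR with plain gradient descent.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.standard_normal((2, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(5000):
    # Forward pass: two matrix multiplies and a squashing function.
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: the chain rule applied twice (squared-error loss).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * (1 - h ** 2)

    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```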

I agree, but HTM/TBT are far from even simulating the basic mechanisms found there. The sheer space of all possible interactions is simply too complex to model, so the entire field is bottlenecked by the lack of a unified theory explaining every system together.

Note that DL itself also doesn't have a unified theory, but because its models are fundamentally mathematical objects, they are much easier to work with, allowing relatively sharp advances to be made by approaching the problem from a different perspective. Diffusion models are one of my favorite examples - taking inspiration from thermodynamics in physics to model the generative process as a differential equation - the core behind DALL-E 2 and the now popular Stable Diffusion.
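For the curious, the "thermodynamics" part is just the forward noising process; the learned part is a network trained to reverse it. A minimal sketch of the forward half of a DDPM-style model (a generic illustration, not the DALL-E 2 or Stable Diffusion code):

```python
import numpy as np

# Forward (noising) process of a DDPM-style diffusion model: data is gradually
# destroyed by Gaussian noise according to a fixed variance schedule; a network
# is then trained to undo this process one step at a time.
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # variance schedule
alphas_bar = np.cumprod(1.0 - betas)     # cumulative fraction of signal kept

def q_sample(x0: np.ndarray, t: int, rng=np.random.default_rng(0)) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.ones(4)                          # a toy "image"
for t in (0, 100, 500, 999):
    print(t, q_sample(x0, t).round(2))   # the signal fades, the noise takes over
```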

While I agree that the biological approach has merit, I cannot help but wonder how effective DL would be at simulating the highly complex function evolved by cortical columns. It would certainly be interesting (with the required technology) for DL models to simulate neuron spikes - learning those mechanisms implicitly (there's a paper in that direction, but we'll need Neuralink-type advances to simulate it properly and to collect much more data).

Do you mean, one-shot learning?
Or is it the whole sequence learning thing?
Or maybe the surprise detection?
Or maybe the whole hex-grid sparsity thing?

I am not familiar with any DL models that do all that.
Or for that matter, any of that.
Why not just use the model that does all that inherently?

I just want to add that, as far as I know, networks in the brain are actually very shallow; you can go from the lateral geniculate nucleus to the prefrontal cortex in just a few synapses. To me it seems that the power the brain has comes from the clever placement of recurrent connections and modulatory signals, not from "mathematical objects" in the sense we are used to in deep learning.

The brain seems to use a fundamentally different paradigm that doesn't need gradients, global optimization, or convergence; it just stores and retrieves data in clever ways.
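To make "stores and retrieves data in clever ways" concrete, here is a toy content-addressable memory over sparse binary codes - loosely in the spirit of SDRs, and only an illustration, not a claim about how the brain actually does it:

```python
import numpy as np

# Store sparse binary patterns, then retrieve the best match for a noisy cue
# by overlap count. No gradients, no global optimization, no convergence.
rng = np.random.default_rng(1)
n_bits, n_active = 2048, 40              # large, very sparse codes

def random_sdr() -> np.ndarray:
    sdr = np.zeros(n_bits, dtype=bool)
    sdr[rng.choice(n_bits, n_active, replace=False)] = True
    return sdr

memory = [random_sdr() for _ in range(100)]   # "training" is just storing

cue = memory[17].copy()
flipped = rng.choice(np.flatnonzero(cue), 10, replace=False)
cue[flipped] = False                          # corrupt 25% of the active bits

overlaps = [np.count_nonzero(cue & m) for m in memory]
print("retrieved index:", int(np.argmax(overlaps)))   # almost always 17
```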

It may be that we can reproduce what the brain does using DL, but I believe it would be very inefficient.


No, simply replicating the activations in the brain exactly as they are, given the corresponding input :slight_smile: It would provide an effective prior for large language models to pre-train on and then be fine-tuned to text or other multi-modal domains.

There is also a paper which I believe does exactly that, except on a few cortical columns or neurons in isolation - it was cited somewhere along this thread long ago.

I definitely agree with that - the approach as a whole is inefficient. But it stands to reason that we might be able to make it more efficient in the future, given the lack of viable alternatives currently. There are very few techniques that can even go head to head with gradient-based methodologies - most notably, GA/ES algorithms are more compute-efficient in simpler environments but immediately lose out when optimizing more complex systems.
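For anyone unfamiliar, the ES side of that comparison can be surprisingly small. A minimal OpenAI-style evolution strategy on a toy black-box objective (my own sketch; the objective and hyperparameters are chosen arbitrarily):

```python
import numpy as np

def f(theta: np.ndarray) -> float:
    """Black-box objective to maximize; its gradient is never computed."""
    return -float(np.sum((theta - 3.0) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(5)
sigma, lr, pop = 0.1, 0.05, 50

for step in range(200):
    eps = rng.standard_normal((pop, theta.size))           # random perturbations
    rewards = np.array([f(theta + sigma * e) for e in eps])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    theta += lr / (pop * sigma) * eps.T @ rewards           # search-gradient estimate

print(theta.round(2))   # should approach [3, 3, 3, 3, 3]
```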


I suppose you could view the brain as an extremely wide neural network that doesn't have the width-squared compute-time scaling problems of a conventional artificial neural network layer.
One-to-all connective random or sub-random projections should also be possible and easy in the biological brain. If those are used for dimensionality reduction, you should get averaging, and also a situation where neurons can work together locally to cancel out errors globally via the one-to-all connective behavior.
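A quick sketch of the dimension-reduction part - a fixed random "one-to-all" projection that roughly preserves distances (Johnson-Lindenstrauss style; just an illustration of the idea, not a brain model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 10_000, 256

# Fixed random projection matrix; the 1/sqrt(d_out) scaling keeps norms comparable.
R = rng.standard_normal((d_in, d_out)) / np.sqrt(d_out)

x = rng.standard_normal(d_in)
y = rng.standard_normal(d_in)

orig = np.linalg.norm(x - y)
proj = np.linalg.norm(x @ R - y @ R)
print(f"original distance {orig:.1f}, projected distance {proj:.1f}")  # within a few percent
```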

What’s the latest on Numenta sparse nets?
