Why do members here think DL-based methods can't achieve AGI?

It’s very interesting to observe members here discussing TBT (of which, I will declare beforehand, I know very little, along with neuroscience).

I am concerned, however, by the contrast between the approaches taken by Deep Learning, with methods that actually scale and work, and Numenta’s. Even so, I am still a strong supporter of biologically inspired networks being the way to possible ASI.

I think it’s better if I open the question and ask people why they think DL is far from AGI, and understand (or refute) accordingly. As a primer, I spotted this paper https://www.biorxiv.org/content/biorxiv/early/2021/10/26/2021.10.25.465651.full.pdf which is highly comparable to the Mixture-of-Experts architectures DL seems to be converging to.

I would love to elaborate more, but would be more interested in first seeing the differences and judging whether DL and HTM/TBT are actually similar or not…

Numenta’s methods scale and work. That’s what their papers show. You can also run them yourselves as the code is on GitHub.

DL has borrowed too much from theoretical neuroscience, so much that I think we should press back on the dichotomy of DL vs neuroscience, but I get that a lot of people do view the world that way. I think of it like a search problem. Big DL breakthroughs motivated by neuroscience have got us to a good spot, then DL will search around that point essentially randomly trying new things. Numenta’s approach might just be regularizing/nudging that search trajectory to stay closer to biological understanding/considerations.

Our brains are the product of evolution, which has been doing massively parallel hill climbing R&D for billions of years (i.e. a lot of entropy was burned in making us!). Just a little over 100 years ago the Wright brothers took a lot of inspiration from a similar, yet simpler, design of evolution (bird wings) to create planes. If planes required cheating off of evolutionary design to get us close enough to connect the final dots, then brains probably will too (well they already have, but will likely continue to do so).


Does DL really do anything other than learning central tendencies - a branch of statistics?
These can be spread over many dimensions but it is still the same task.

Are learning and discriminating central tendencies sufficient for whatever it is you choose to define as AI?

If you choose to add other behaviors to this branch of statistics, are these added behaviors sufficient for whatever it is you choose to define as AI?

Reinforcement learning / dynamic programming / Monte Carlo methods all rely on stats too. Evolution from a gene’s eye view can be seen as stats too, i.e. the genes around today were good for the average organism that they found themselves in over the eons. This is the way Richard Dawkins talks about them in The Selfish Gene & The Extended Phenotype.

I like to ask, “How much of learning is just memorizing?” I don’t know the answer :smiley:


Yes, it’s a be all and end all. I call that pattern discovery and it doesn’t have to be statistical. The only issue is scalability in complexity of discoverable patterns.

Ok, how does that lead to attention, valence, “common sense,” and the myriad animal traits often levied as missing in criticisms of current efforts on machine intelligence?
I would rank a common lizard or amphibian as being more intelligent (adaptive and flexible) than any current DL model.
When you get up to the level of a chimp you set a very high bar for machine intelligence.
You and I have gone around on the topic of the embedded judgement that the brain accomplishes with the limbic system, but in the few human cases with damage to these systems the remaining intelligence has the kinds of malfunctions that are commonly portrayed as nightmare AI in bad science fiction works.

You know, create peace by killing all the humans?

Yes, we discussed it before, so I will be repeating myself here. My comment was on GI, which I think is synonymous with unsupervised learning, because RL (conditioning) and supervision (instincts) are by definition not general. Those are separate functions, which can be optional add-ons. Not everything in the brain is GI, if that needs to be said. Heck, sometimes it feels like GI is an unwanted bastard here, especially in lower animals.

Second, I was talking about pattern discovery as a function, not the way DL implements it. Which is actually quite shallow, it clearly doesn’t scale well in complexity of discoverable patterns. But even that has multiple forms of attention already. My own approach is not statistical, compared to that ANN and BNN are pretty similar, and similarly handicapped.


Thanks everyone for the replies! Now this is a scientific forum, so I prefer most of our controversial points to be backed by papers/citations (and supply them when needed) so please don’t take it in the wrong way if I demand sources :slight_smile:

It is really interesting to see that people make claims here without being fully invested in Deep Learning advances, for instance - @bkaz thinks that DL is inherently shallow, yet things like GPT-3 are consistently proved to have reasoning capabilities.

I think these misconceptions also arise from the way these models are marketed. For instance, very few tout GPT-3’s multiplication and division abilities, which show that it does represent common-sense reasoning to some degree. Even Codex/Copilot, if anyone has used it, can generate custom code that often performs complex tasks (as I have noted personally).

Also, models like DALL-E consistently create images that haven’t existed (and have been proved so) along with showing commonsense reasoning to a level, being able to find the most sensible image with so little data.

I really agree with you here, except for the little point that the architecture in question was not biologically motivated - from the very inception of MoEs, it was an architecture to combat catastrophic forgetting and allow complex transfer learning over several domains.

I was simply remarking on how DL converges on the same biological patterns, the difference being that it does so in a way that is practical and works in a wide range of areas. This is flexibility Numenta’s TBT/HTM still has to demonstrate.

Another observation I find quite interesting is that people here expect to achieve AGI from text only - and are disappointed when transformers trained on these huge corpora don’t show any giant leaps in humanity - even if they do help in other widely complicated tasks like AlphaFold. Expecting AGI from a single modality is like expecting a blind man to count.

However, my main point would be that, in the spirit of the “Bitter Lesson”, the brain is simply a very complicated function; that much is clear from loads of evidence. The universal approximation theorem then provides a theoretical guarantee that NNs would work, just not whether they would be the first, or the last behind other methods.

Theory often doesn’t agree with reality, but in NNs the trend is clear - they approximate the brain very well. If someone still thinks that GPT-3 is overfitted and simply repeating what it knows - I can attach further proof.

Mostly I am interested in the intersection of these domains, and to see exactly what other scientists think of DL as a whole…


There are four categories of AI: supervised, unsupervised, reinforcement, and heuristics/hard-coded. The brain does all of them and so it stands to reason that AGI will require all four as well.

  • The cerebellum does supervised learning.
  • The cortex does unsupervised learning.
  • The basal ganglia does reinforcement learning.
  • The brainstem has hardcoded & heuristic knowledge, which was learned through evolution.

Deep learning is based on a technique called “error backpropagation”. There is evidence that backpropagation does happen in the cerebellum; however, the cerebellum is a few layers deep, not the 50+ layers that deep learning uses. The layers of the real cerebellum do not repeat / they are not stacked. Furthermore, there is scant evidence that backpropagation is happening anywhere else in the brain.

Edit: for more see


Firstly, the boundaries between all those fields have been slowly breaking down - LMs alone do the first 3, with “hardcoded knowledge” bootstrapped from experience/data.

Secondly, Deep Learning doesn’t aim to emulate the brain but replicate it. It’s a function mapper essentially, and since the brain is a function it can be mapped. The whole point is that DL presents more convincing capabilities than other methods like GOFAI.

Lastly, evolution has often been inefficient, as have other biological methods. We can take inspiration from them, but ultimately develop better tools. No bird can travel at Mach speeds, carry tons of cargo, deliver payloads, or go to space. We take biological inspiration and improve upon it - thinking that the brain is an exception to this requires quite some sophisticated evidence.


The theorem says that a NN is capable of approximating arbitrary functions.
Then what function should they approximate?

In DL the function being approximated is usually either:

  • The correct classification of an input, for supervised learning.
  • The expected value of future rewards, for reinforcement learning.
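
To make the two targets above concrete, here is a minimal sketch in plain Python. The cross-entropy loss and the discount factor `gamma` are assumptions chosen for illustration (the post does not name a specific loss or discounting scheme), and a single sampled return is only an estimate of the expected value.

```python
import math

def cross_entropy(predicted_probs, correct_class):
    """Supervised target: penalize low probability on the correct class."""
    return -math.log(predicted_probs[correct_class])

def discounted_return(rewards, gamma=0.9):
    """RL target: one sampled return, i.e. rewards summed with discounting.

    A value function is trained to predict the expectation of this quantity.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# A classifier that puts 80% on the correct class has a small loss.
loss = cross_entropy([0.1, 0.8, 0.1], correct_class=1)

# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
value_sample = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

In both cases the network is fitted to a task-defined quantity, which is the point being made: neither target mentions the brain.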

I strongly disagree:

  • DL-NN are not being trained to approximate the brain. They are being trained to solve useful problems, so why would they spontaneously start approximating the brain when they were not trained to do so?
  • DL-NN’s activity/representations fail to have many of the expected properties of a real brain.

Technically yes, but you can’t define that function without knowing how the brain works and so you can not possibly train a DL-NN to approximate it.

I agree with you on this point but it does not mean that DL-NN are capable of AGI.


Because they can be - we have already tried to simulate cortical columns accurately.

But the banger is, even if we are able to produce a model that interacts just like a human and reasons like one by reproducing exactly what we do correctly, arguing it is not AGI would be a hard case. Just like Gopher, if a model says to you that it is itself indeed a p-zombie, then even if the underlying mechanism is just statistics, claiming that some level of reasoning exists would be correct.

For example, read what LaMDA, a SOTA chatbot, produces: Do large language models understand us? | by Blaise Aguera y Arcas | Dec, 2021 | Medium. Now you can say all it’s acknowledging is just statistics, nothing more: someone somewhere claimed that they too are a “p-zombie”, which the model copy/pasted. But that’s beside the fact that we ourselves are just like that too.

The model’s objective is not to conserve energy - just to be as accurate as possible. We are using sparse models now with the advent of MoEs (GLaM), but the brain uses its sparse structure mostly for efficient computation.

If we impose this restriction (which we do for pruning and other distillation based methods) that’s exactly what comes out of the box.

Very correct - but I don’t need to define a function, say x+1, if I just give you this sequence [0-->1, 2-->3, ...]. Data is a very strong proxy for the function, which is what DL uses. And so far, methods have been wildly successful at using data to map humans.
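
As a toy illustration of data standing in for the function, the sketch below recovers x+1 from input/output pairs alone. It uses ordinary least squares on a straight line rather than a neural network, purely to keep it dependency-free; the helper name `fit_line` and the extra data points are made up for this example.

```python
def fit_line(pairs):
    """Ordinary least squares for y = a*x + b, fitted from (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

# Only examples are given; the rule "x + 1" is never stated anywhere.
pairs = [(0, 1), (2, 3), (5, 6), (9, 10)]
a, b = fit_line(pairs)

# Query an input that was not in the data.
prediction = a * 7 + b
```

Note that this only works because the model family (a line) can express the target and the data cover it; that caveat is exactly what the disagreement in this thread is about.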

Take for example YouTube. I doubt many here use it frequently, but a lot of people are absolutely hooked on it for hours on end ( :stuck_out_tongue_winking_eye: ). A simple model just taking the data as a proxy understands their desires and interests possibly better than their family members do, just by the simple act of keeping them engrossed for a long time.

I simply request everyone to keep an open mind about everything, and not to be stuck on outdated opinions/biases, as that leads to a much purer discourse :slightly_smiling_face:


That’s not the whole picture, yes they can be more efficient but also: the mathematics of sparse representations are fundamentally different than for dense representations.
See: [1601.00720] How do neurons operate on sparse distributed representations? A mathematical theory of sparsity, neurons and active dendrites
EDIT: also see: Hinton 1984 Distributed Representations https://stanford.edu/~jlmcc/papers/PDP/Chapter3.pdf
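
One way to see why the math differs is to count how likely two random binary patterns are to overlap by accident. The sketch below is a standard hypergeometric calculation; the sizes (2048 bits with 40 active, versus 64 bits with 32 active) and the thresholds are illustrative assumptions, not figures quoted from either reference.

```python
from math import comb

def false_match_probability(n, w, theta):
    """Chance that a random pattern with w of n bits active shares at least
    theta active bits with a fixed pattern of the same size."""
    total = comb(n, w)
    matches = sum(comb(w, b) * comb(n - w, w - b) for b in range(theta, w + 1))
    return matches / total

# Sparse: 40 of 2048 bits active, match if half the active bits overlap.
sparse = false_match_probability(n=2048, w=40, theta=20)

# Dense: 32 of 64 bits active, same "half the active bits" threshold.
dense = false_match_probability(n=64, w=32, theta=16)
```

With these numbers the sparse case gives a vanishingly small chance of an accidental match, while the dense case matches more often than not, so a partial-overlap test is only meaningful for the sparse code.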

The GLaM model which you linked to does not use sparse representations, at least not in the same sense that HTM does:

We leverage sparsely activated Mixture-of-Experts (MoE) in GLaM models.
Each MoE layer consists of a collection of independent feed-forward networks as the ‘experts’.
During inference, [our model] dynamically picks the two best experts for each token.
The final learned representation of a token will be the weighted combination of the outputs from the selected experts.
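
The routing described in that quote can be sketched in a few lines of plain Python. This is a toy version under stated assumptions, not GLaM’s implementation: the gate is a single linear scoring layer, the two selected scores are normalized with a softmax over just those two, and the "experts" are trivial scaling functions standing in for feed-forward networks.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, experts, gate_weights, k=2):
    """Route one token to its k best-scoring experts and mix their outputs."""
    # One gating score per expert: dot product of the token with a gate row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    outputs = [experts[i](token) for i in top]  # only k experts are evaluated
    combined = [
        sum(wt * out[d] for wt, out in zip(weights, outputs))
        for d in range(len(token))
    ]
    return combined, top

# Four stand-in experts: each just scales the token by a different factor.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

output, chosen = moe_layer([1.0, 0.0], experts, gate_weights)
```

What is sparse here is which sub-networks run for a given token. The vector coming out of the layer is an ordinary dense vector, which is the distinction being drawn against HTM-style sparse representations.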


Astute observation. MoEs are sparsely activated subnetworks, which are quite similar to Hawkins’ recent paper on the topic which I linked above.

The reason DL does not do true sparse computations is that they aren’t friendly to accelerators, which are designed for dense computations. You can indeed use a vanilla CPU as SET does ([1901.09181] Sparse evolutionary Deep Learning with over one million artificial neurons on commodity hardware)

Again, experiments are naught until they scale up, which SET doesn’t; hence we use a sparsely activated subnetwork (which also utilizes block-sparse attention rather than vanilla cross+self-attention) as a trade-off to maximize sparsity vs. computations/size for using them on accelerators.

Surprisingly, this hybrid model leads to much better results. Maybe this indicates the brain is not as sparse as you think? Who knows.

Can I get a citation? It would have to be a recent GPT3 paper, because the ones I’ve read do not show this. I would go so far as to say that the authors are not even making this claim:

GPT-3 required a super dataset & super computer to train, yet the ANLI questions that trip it up (and other statistical contemporary NLP methods based on DL) are consistently ones involving reasoning. So reasoning is what GPT-3 actually does worst at. Plus, how are you going to get a 10x improvement in compute & data (which the scaling laws of the “few shot learners” paper demonstrated)? The compute was already 10s of millions of dollars, and the data couldn’t even be properly cleaned as is. They started training, found a f*ck up, and couldn’t even afford/justify restarting their experiments. I have also played around with the API, it is not impressive. Since you’re talking in terms of AGI as well, where’s the newly reasoned facts coming out of GPT-3?

I think this is wrong on several levels. Who cares in a benchmark against human level AGI that a computer program can create new images? Dall-e also used CLIP to rank 100s of images produced by Dall-e to ensure that the best guesses were shown first. Dall-e alone was trained on massive unsupervised dataset (250 million image-text pairs), as was CLIP (400 million). Sensible images? It can’t even put hard borders around generated text in images. Commonsense reasoning? Can you give some examples of it deducting, inducting, abducting, etc?

I know what MoEs are. This is to my point about DL practitioners searching around unconstrained by neuroscience. How can we be certain that a path thru MoEs leads to human level general intelligence, like we can be certain that understanding neuro would? You could ask the same question of adding backprop. How many MoEs/backprop-like discoveries until we get there? What if our default architecture is too different such that we can’t look to biology anymore? That’d be a shame. It’s much easier to cheat off nature’s crazy expensive design than to make one of our own (we’re not as smart as many would have us think).


Sensor fusion is an issue for current DL models. E.g. Tesla recently went full RGB inputs and omitted radar data because of this very problem (that they’ve been working on for a while…).

The bitter lesson assumes a LOT of compute. A lot of methods are “enough” when assuming crazy amounts of compute (and data): Evolution, AIXI, logical theorem proving, etc.

Does the universal approximation theorem even allow you to deduce that skip connections (e.g. Resnet) would be necessary in deep enough networks?

This is just an assertion. I’m a DNN practitioner and would never say this. If you’re holding up GPT-3 as your prime example, you’re going to have to try harder to prove your point :slight_smile: It does repeat what it knows. That’s why it is biased in language usage surrounding women, religions, etc. With these statistical methods one must remember: garbage in, garbage out.

How about the last 60+ years of AI practitioners over-estimating their abilities and under-delivering? It took 2 bike bros in the late 1800s to figure out flight.

I do like your point a lot though, don’t get me wrong! I guess my take would be that we need to keep copying off of nature’s design before trying to improve it. We can obviously do both approaches at the same time :slight_smile:

We haven’t beaten birds’ energy efficiency though, have we? Assuming that, some metrics of our airplanes are still not better than biology’s. There are a lot of biological constraints we should be able to toss aside once we figure out how the brain does its magic.


This article discusses sparsity in the weights / connectivity between neurons.
That’s not quite what I was talking about, I was talking about sparsity in the patterns of activity.
But it is correct that neurons are very sparsely connected!
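
The two kinds of sparsity are easy to conflate, so here is a small sketch that separates them. The layer sizes, the 10% connection density, and the k-winners rule used to sparsify activity are all assumptions picked for illustration; they are not taken from the article or from HTM code.

```python
import random

def fraction_nonzero(values):
    """Fraction of entries that are not exactly zero."""
    return sum(1 for v in values if v != 0) / len(values)

def layer_output(weights, inputs):
    """Plain weighted sums, one per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def k_winners(activations, k):
    """Keep the k largest activations and zero the rest."""
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    winners = set(order[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]

rng = random.Random(0)
n_in, n_out = 100, 200
inputs = [rng.random() for _ in range(n_in)]

# Sparse connectivity: roughly 10% of the weights are nonzero.
sparse_weights = [
    [rng.gauss(0, 1) if rng.random() < 0.1 else 0.0 for _ in range(n_in)]
    for _ in range(n_out)
]
weight_density = fraction_nonzero([w for row in sparse_weights for w in row])

# Sparse weights alone still give a dense pattern of activity...
dense_activity = layer_output(sparse_weights, inputs)

# ...whereas sparse activity means only a few units are on at once.
sparse_activity = k_winners(dense_activity, k=10)
```

Comparing `fraction_nonzero(dense_activity)` with `fraction_nonzero(sparse_activity)` shows that pruning connections does not by itself produce a sparse activity pattern.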

I had a different take away from “the bitter lesson”:

  • Brute force will eventually be a viable solution: to simulate the brain wholesale instead of trying to build efficient / abstracted models of it. It works because all of the components work and with a big enough computer we can simulate all of those components.

Ask any blind person whether they are unintelligent just because they lack vision.
The lack of any one (or more) senses does not mean that the system is incapable of intelligence. Vision in AGI is like a glittering object to a magpie: it looks nice, sells well and gets funding, but it is an easy win with no real intelligence. It’s just an unnecessary input for AGI. Again, ask any blind person.

GPT-3 has a defined window size (recurrent depth), which limits the complexity of the input-output relationship. If GPT-3 could really do math, why is it unable to perform math on numbers of 4 or more digits with any reliability?

Math is a recurrent process that needs recursive temporal state. Just think of basic long division, it’s just a simple “set” of patterns that we apply to any size number. GPT-x does not do the math that way.

Take a look at the C4 dataset (1TB) and you will see why GPT-3 appears to be able to perform math: the input set contains all the data for 3 digit math. The model lacks the ability to perform unlimited recursion, for long division for example. Basic recursive patterns.

Beware of assuming that a system which has learned these types of patterns can do all math, rather than just apply a pattern: https://www.cuemath.com/learn/math-tricks/
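
For reference, this is what the recursive pattern looks like when written out: one small step (bring down a digit, divide, keep the remainder) repeated for as many digits as the input has. It is a minimal sketch of schoolbook long division, with a made-up function name, and says nothing about how GPT-3 works internally.

```python
def long_division(dividend, divisor):
    """Schoolbook long division on a decimal string of any length.

    The same step is applied to every digit, carrying only the
    running remainder forward as state.
    """
    quotient_digits = []
    remainder = 0
    for digit in dividend:
        remainder = remainder * 10 + int(digit)  # bring down the next digit
        quotient_digits.append(str(remainder // divisor))
        remainder %= divisor
    quotient = "".join(quotient_digits).lstrip("0") or "0"
    return quotient, remainder

# The identical loop handles 3 digits or 19 digits.
small = long_division("987", 7)
large = long_division("1234567890123456789", 7)
```

The state carried between steps is a single remainder, yet the number of steps grows with the input, which is the kind of unbounded iteration a fixed-depth pass cannot unroll.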

Does the creation of custom hardware (NVIDIA Bluefield-4 / Ampere Next Net) really mean the approach is scalable? Any problem can be scaled by hardware development.

Up to a point, and then they will need custom hardware (e.g. past experiments with implementations on FPGAs).

The upcoming hardware over the next decade eclipses what an AGI system actually needs; just look at the spec for Bluefield-4 (due 2024), let alone the likes of Cerebras with 850k cores and 2.6tn transistors that are operational now.

Scalability is a totally invalid criterion/argument because market economics will make anything scalable if it works.

The real question is which code works best to learn.

Yes, when the response is “No, I prefer Crete as it is much bigger, and more interesting, especially the Palace of Knossos.” after being asked for a comparison to an archipelago in the Antarctic. The response in context is nonsensical. If that was what an AGI came out with I would be really, really scared. Very much like the Wizard of Oz, until you look behind the curtain.