Why do members here think DL-based methods can't achieve AGI?

Apart from it being the world's largest consumer of industrial robotics.

Given the quality of “scientific” justification needed to secure the compliance of most of the population over the last two years, it would seem an AGI will not need to be very smart to convince society of the why, even if the why is very dubious.

Proving or disproving requires a formal representation, which much of science lacks. Science is more about intuitions, statistics, and Occam’s razor than most would like to admit.

True, but it seems likely that if an AGI emerges it will not be a copy of the brain.

We learned a great deal about flying by observing nature. Early aircraft were inspired by nature but were technically very different; today there are flying devices much closer to birds (e.g. with flapping wings). I imagine human-like AGI being a curiosity developed after the emergence of AGI.

Consider AlphaGo: bootstrapped on human-played games, yet inferior to AlphaGo Zero, which uses no human games at all. If we want to master Go, we would do better to learn from AlphaGo Zero’s insights than to worry about AlphaGo Zero playing Go in a human style.

DL-based could be interpreted to mean a many-layer ANN trained using gradient descent. Given the empirical results and investment, such networks are likely to be a part of future improvements in AI and to contribute to new learning algorithms that use many different optimization techniques.

Another general remark, which may not apply to some people here: when referring to DL in the past, I’ve thought of systems like large CNNs, but deep reinforcement learning (DRL) is far more sophisticated. DRL can include recurrent neural networks and predictive models, so the potential seems much greater. This is not to say that any current architecture is going to achieve AGI. But looking at the evolution in the sophistication of DL architectures over the last 10 years, the progress is impressive and I don’t see it slowing down any time soon.

4 Likes

I think whatever pops out as a consequence of this experimentation will be an alien, frozen state of something exhibiting intelligence. Will it be general, will it be self-understanding with the ability to introspect, make decisions, and change itself over time?

I guess we’ll see, but I suspect our bottleneck on a DL-based approach will continue to be the offline nature of its training… and again, to emphasize, whatever we do produce will be completely alien from us and our way of “thinking”: an ephemeral, stateless entity, existing entirely within the box we place it in, trapped like a ghost in a machine.

Impressive it will be. AGI? Let’s wait another decade and see what happens.

1 Like

In DRL there is both an offline and an online aspect. It is the online aspect (closer to search/optimization) that amplifies RL results and explains a great deal of their success. The “bleeding edge” of research in DRL is on adaptive/autonomous systems, e.g. Yann LeCun’s recent talks. In these types of systems the objective function is no longer static, e.g. it is based on self-supervised learning using predictions in space/time.

Alien: absolutely. Frozen? Probably not. Make decisions: of course! Stateless: no, we would build it to preserve state because it’s more useful that way.

Introspection: problematic. How would you go about proving scientifically that you, your friend and your dog do or do not have introspection?

The essence of intelligence is to find solutions to problems in aid of survival faster than evolution or genetic adaptation, and to pass those solutions on to progeny through learning instead of genetic inheritance.

2 Likes

Any state mechanism will be bolted onto the inference engine, rather than embedded as it is in ourselves. So its internal weights (all 100 trillion of them) will likely remain frozen, barring another leap in technology to allow simultaneous inference, feedback, and weight updates… the thought of that, though, is pretty terrifying to me, in that it would become unpredictable while still being alien (though still confined to its box). At least humans, with our physical state machines attached to muscle outputs providing some level of interpersonal cues and feedback, can relate to each other… this ‘thing’ wouldn’t be relatable, not for a while anyway.

2 Likes

Not sure what you mean here. The cost of computing gradients with respect to parameters is only between 1x and 3x the cost of the original computational graph. Considering this is true no matter how big the graph is (i.e. how many parameters in the network, how many steps of recurrence, how difficult the underlying problem is), this isn’t “too expensive”: it’s linear in the original cost, with a small factor of overhead. Alternative approaches like Hebbian learning take orders of magnitude longer to converge, if they converge at all.
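
To make the constant-factor point concrete, here’s a rough numpy sketch (a toy 2-layer MLP of my own invention, not any particular framework): the backward pass is just a handful of transposed matmuls over cached forward activations, so its cost is a small constant multiple of the forward pass whatever the layer sizes are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP: forward costs 2 matmuls; backward costs ~4 matmuls
# (2 for input gradients, 2 for weight gradients) -- a small constant
# multiple of the forward cost, regardless of layer sizes.
W1 = rng.normal(size=(64, 128))
W2 = rng.normal(size=(128, 10))
x = rng.normal(size=(32, 64))       # batch of 32 inputs
y = rng.normal(size=(32, 10))       # dummy targets

# Forward pass, caching the linearization point (h) for reuse in backward.
h = np.maximum(x @ W1, 0.0)         # ReLU hidden layer
out = h @ W2
loss = 0.5 * np.mean((out - y) ** 2)

# Backward pass: the same matmul shapes, transposed, so the same order of cost.
d_out = (out - y) / out.shape[0]
dW2 = h.T @ d_out
d_h = (d_out @ W2.T) * (h > 0)      # ReLU mask from the cached forward
dW1 = x.T @ d_h

assert dW1.shape == W1.shape and dW2.shape == W2.shape
```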

The cost of modern deep learning has nothing to do with backprop, and everything to do with the number of training points required, both via large batches and many training steps. However, we don’t currently have any promising general-purpose methods that learn faster, so we’re stuck with it for the time being.

2 Likes

That is called policy iteration, and it is a basic principle behind how DRL systems like AlphaZero work. Consider how people reacted to move 37 in the second game of AlphaGo vs. Lee Sedol: it was thought to show creativity and beauty in the machine’s “thinking”. Updating systems online has been studied for decades; it is a hard problem, but there are methods like adaptive model predictive control that can be recast as DRL problems (this is, to some extent, what Yann LeCun is doing).
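
For anyone unfamiliar with the term, here’s a minimal tabular sketch of policy iteration on a made-up two-state MDP (the MDP and all the names are illustrative, nothing to do with AlphaZero’s actual implementation): alternate evaluating the current policy and acting greedily with respect to those values.

```python
import numpy as np

# Tiny deterministic MDP: 2 states, 2 actions; taking action a moves to
# state a. Reward +1 for choosing action 1 (the "good" move), else 0.
n_states, n_actions, gamma = 2, 2, 0.9
next_state = lambda s, a: a
reward = lambda s, a: 1.0 if a == 1 else 0.0

policy = np.zeros(n_states, dtype=int)   # start with "always action 0"
V = np.zeros(n_states)

for _ in range(50):
    # Policy evaluation: iterate the Bellman expectation backup to (near)
    # convergence for the current fixed policy.
    for _ in range(100):
        V = np.array([reward(s, policy[s]) + gamma * V[next_state(s, policy[s])]
                      for s in range(n_states)])
    # Policy improvement: act greedily with respect to V.
    new_policy = np.array([
        max(range(n_actions), key=lambda a: reward(s, a) + gamma * V[next_state(s, a)])
        for s in range(n_states)])
    if np.array_equal(new_policy, policy):
        break                            # stable policy: we're done
    policy = new_policy

assert list(policy) == [1, 1]            # converges to the rewarding action
```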

This is not clear to me. In pattern-matching approaches there is the possibility of one-shot learning for a classifier. Consider an architecture like ART: even when iterating, it does not require the same amount of data as a gradient-descent approach.

I don’t see how these things can be separated out: to use the method of gradient descent (i.e. backprop) requires a large amount of training. One iteration of backprop might be relatively cheap, but the method assumes you will do many iterations, which makes it expensive.

3 Likes

Good points. To be honest, I have had trouble finding comparisons between ART and gradient-based methods on tasks I’m familiar with. If you’ve got references for that, I’m all ears.

Regarding your second point, there is a difference between backpropagation—which is a method of efficiently computing exact error gradients of a function (i.e. applying the chain rule while caching the linearization points)—and the particular loss functions, optimizers, and architectures that make use of those error gradients. Disentangling these is important, because from my vantage the latter set of concerns, rather than backpropagation itself, are the dominant factors determining data efficiency.

To be clear: SGD, RMSProp, and Adam are optimizers that use gradients; language modeling and adversarial objectives are loss functions for which you could compute gradients; and convnets, transformers, and sparse graph networks are architectures that can be updated via gradient information. These are the choices that, empirically, can dramatically change the number of steps needed to converge to good solutions.
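
To illustrate the separation, here’s a toy sketch (my own, on a trivial quadratic loss where the exact gradient is known in closed form): one and the same gradient routine feeds both a plain SGD step and an Adam-style step, and only the optimizer choice changes the trajectory.

```python
import numpy as np

# Loss: f(w) = 0.5 * ||w||^2, so grad f(w) = w exactly (the chain rule is
# trivial here; in a real network backprop would supply this gradient).
def grad(w):
    return w.copy()

w_sgd = np.ones(3)
w_adam = np.ones(3)
m = np.zeros(3); v = np.zeros(3)
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 101):
    g = grad(w_sgd)
    w_sgd -= lr * g                      # plain SGD step

    g = grad(w_adam)
    m = b1 * m + (1 - b1) * g            # Adam: first-moment estimate
    v = b2 * v + (1 - b2) * g * g        # Adam: second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias corrections
    v_hat = v / (1 - b2 ** t)
    w_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

# Same gradient routine, different update rules, different trajectories;
# both head toward the minimum at 0.
assert np.all(np.abs(w_sgd) < 1e-3) and np.all(np.abs(w_adam) < 0.5)
```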

1 Like

Here’s a great thread highlighting the sort of phenomenon I’m describing:

3 Likes

I’m not aware of direct comparisons between ART and DL. For ART to do things at the typical scale of DL (e.g. image recognition), I think the comparison would need an architecture built on ART, e.g. the SMART architecture. Neurala probably has the data.

Thanks for the clarifications. My assumption is that backpropagation involves the propagation of the error (and the calculation of that error), i.e. it is of no use without an algorithm that iterates, and the choice to use backprop limits the possible algorithms.

One way to think of a simple ART implementation would be as a two-layer network with back-propagation of an error, rather than a gradient. This allows for one-shot learning. Because different categories have different weights, it does not suffer from the catastrophic forgetting of DL; i.e. it can jump to an optimal weight value rather than slowly descend a gradient toward the optimum. This is an over-simplification.
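
In that spirit, here’s a much-simplified, ART-inspired sketch (my own toy, not Grossberg’s actual equations; the vigilance test and fast-learning rule are heavily reduced): if no existing category resonates with the input, a new category is committed in one shot, and existing categories’ weights are left untouched.

```python
import numpy as np

def art_like_learn(patterns, vigilance=0.7):
    """Much-simplified ART-style categorization (illustrative only)."""
    prototypes = []          # one weight vector per committed category
    labels = []
    for x in patterns:
        # Match score: overlap between input and prototype (fuzzy AND).
        scores = [np.minimum(x, p).sum() / x.sum() for p in prototypes]
        best = int(np.argmax(scores)) if scores else -1
        if best >= 0 and scores[best] >= vigilance:
            # Resonance: fast learning snaps the prototype toward x.
            prototypes[best] = np.minimum(prototypes[best], x)
            labels.append(best)
        else:
            # No resonance: commit a brand-new category in one shot,
            # leaving the other categories' weights frozen (no forgetting).
            prototypes.append(x.copy())
            labels.append(len(prototypes) - 1)
    return labels, prototypes

# Two very different binary patterns land in two separate categories,
# each learned from a single presentation.
a = np.array([1., 1., 0., 0.])
b = np.array([0., 0., 1., 1.])
labels, protos = art_like_learn([a, b, a])
assert labels == [0, 1, 0]
```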

3 Likes

Just to point out, GPT4 need not have 100 trillion params (scaling laws estimate around 5-50T, assuming they hold), and that’s perfectly doable, inviting the “millionth of a Manhattan Project” comparisons, which I sort of agree with. Nor does it require a “supercomputer” in the traditional sense, as current supercomputers far outperform the GPU clusters used at ClosedAI.

Also, OAI has expressly stated that GPT4 will more likely be only a little bigger, but trained properly on better data, with better techniques, etc.

You need not update all the weights (as demonstrated by MoE-style architectures), and sparing a couple of backward passes for some few-shot supervised training would take milliseconds on a GPU cluster.
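
The sparse-update idea can be sketched in a few lines (a toy router of my own devising, nothing like a real MoE gating network): route the input to one expert, update only that expert’s weights, and leave the rest frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# 8 tiny "experts"; a toy router picks one per input, MoE-style.
experts = [rng.normal(size=(4, 4)) for _ in range(8)]
before = [e.copy() for e in experts]

x = rng.normal(size=4)
# Toy routing rule: pick the expert with the largest response magnitude.
active = int(np.argmax([np.abs(e @ x).sum() for e in experts]))

grad = np.outer(x, x)            # stand-in gradient for the active expert
experts[active] -= 0.01 * grad   # only one expert's weights change

changed = [not np.allclose(b, e) for b, e in zip(before, experts)]
assert sum(changed) == 1         # 7 of 8 experts stayed frozen
```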

Mostly, I disagree with that. RL environments are often very simple, algorithms that are SOTA on some environments don’t outperform others, stochasticity makes reproducibility a nightmare, and all in all the field has devolved into a mess. IMO RL is the furthest from AGI, and its lack of consistent results puts it in an outlier category in a field that values results first and foremost.

If someone is interested in how transformers can also be applied to offline RL, check this out: [2201.12122] Can Wikipedia Help Offline Reinforcement Learning?

1 Like

Can you give an example of an RL algorithm that does not rely on online optimization? It does not matter if most implementations are not at the leading edge of research; it is the leading edge that matters in terms of progress. That RL is outperformed by some algorithms on some tasks is not of much interest — nobody claims RL outperforms all algorithms on all tasks. Monkeys outperform humans on some tasks; I doubt that makes you want to be a monkey?

1 Like

While I’ll be the first to agree and rant about RL having simple environments and very poor reproducibility, I don’t understand how that leads to it being the furthest from AGI.

As far as I can tell (I’m in the early years of a PhD in RL), RL seems to be almost the only field with an embodied thing that makes decisions in space/time in an environment. It also seems like the only field seriously integrating online learning methods into the algorithms themselves. Both embodiment and online learning seem essential, at least for my definition of AGI.

Sure, most models including my own use backprop (which I consider a serious problem), the required sample sizes are way too large, and unsupervised learning is almost certainly going to be an essential component. But I’m not aware of other major fields that are even working on the right problem. (Maybe evolutionary/genetic learning?)

4 Likes

I am not sure what specific examples you want me to provide — offline RL is a vast topic with many papers on arXiv, and the paper I linked above is also offline :man_shrugging:

Not to sound rude, but I don’t see how anything I “like” or consider “essential” helps distinguish the viability of methods. That said, RL’s intelligent behaviour is mostly guessing a highly rewarding action and building upon it. Some work tries to explicitly encourage curiosity and world modelling (like Schmidhuber’s AAC), but whenever “explicit” comes up I become wary, because the path then sooner or later edges into subtly injecting domain knowledge in pursuit of higher scores on the benchmark.

As impressive as epsilon-greedy strategies are, anything “learnt” is just randomly encountering some action that leads to a reward and building on it, which doesn’t seem intelligent per se. They have little generalizability (though MCTS optimization methods maintain some), nor do they show intelligent behavior. It’s less about being sample-efficient and more about understanding dynamics — which, as I said, world models move toward, but doing it explicitly is a huge pitfall.
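
For concreteness, this is the sort of thing I mean (a standard epsilon-greedy bandit loop, toy numbers mine): the agent only “discovers” the best arm by random chance and then exploits it, with no model of why that arm pays.

```python
import numpy as np

rng = np.random.default_rng(0)

# 5-armed bandit: arm 3 pays the most. Epsilon-greedy stumbles onto it
# through random exploration, then locks on -- no understanding involved.
true_means = np.array([0.1, 0.2, 0.3, 0.9, 0.4])
Q = np.zeros(5)           # running value estimates
counts = np.zeros(5)
epsilon = 0.1

for _ in range(2000):
    if rng.random() < epsilon:
        arm = int(rng.integers(5))       # explore: pure chance
    else:
        arm = int(np.argmax(Q))          # exploit the current guess
    reward = rng.normal(true_means[arm], 0.1)
    counts[arm] += 1
    Q[arm] += (reward - Q[arm]) / counts[arm]   # incremental mean estimate

assert int(np.argmax(Q)) == 3            # the lucky discovery wins out
```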

Transformers implicitly do a bit of meta-learning and generalization, and understand some dynamics to an extent. They show a little reasoning behavior (better than traditional systems) but have a long way to go before achieving anything human-like. Yet they can perform more tasks, understand those tasks better by using multiple modalities effectively (it seems every day there is a new paper), and have a uniform architecture, much like the brain.

There are some who argue that transformers may model the brain as well, though that may just be a correlation due to the similarity of tasks… [https://twitter.com/JeanRemiKing/status/1496425017474695169, https://www.nature.com/articles/s42003-022-03036-1]

Lastly, I don’t get the hype around online learning. It’s not too difficult: few-shot learning exists, and even then a couple of backward passes aren’t problematic. What the beef is, is beyond me. Fine-tuning is an old, tried-and-tested technique.

True, there is the field of “offline RL”, but an AGI is going to need to adapt and explore, so I don’t see the relevance to this thread.

At least keep the contradictions to separate paragraphs :wink:

Grossberg’s work on autonomous systems comes to mind. It depends what you mean by “other major field”; Grossberg’s approach is not a major field in and of itself, but it fits within computational neuroscience.

Good luck with the PhD.

2 Likes

Well, if it helps, explicitness is often associated with GOFAI methods, embedding knowledge bases in their agents and all sorts of other nasty tricks, which are ambiguous but which I consider simply cheating :wink:

I would prefer everything be built from scratch, meta-learning in its purest essence. :slight_smile:

Not necessarily; it doesn’t need to be online to reach its bootstrap phase, after which it can easily be deployed in an online-learning setting…

1 Like

I have not read anyone in this thread claiming that there should be no offline process; you seem to be debating someone who has not participated in the thread. Now that you agree the online process is essential, I think we are back on the main thread.

1 Like

Who’s your dissertation advisor? Just curious, I won’t tell them you are slumming on this board :wink:

1 Like

I personally don’t think DL is necessarily incapable of achieving AGI, but I do think that without some major breakthroughs in the field it probably won’t.

1 Like