Are billions of parameters (or more) the only solution?

One caveat with this assessment is that humans might be just as bad, or worse, if exposed to nothing but text.
More precisely, the failures you mention might be caused by the fact that its input isn't produced by a "real" agent's interaction with a real world; therefore, the text it consumes and regurgitates isn't anchored in actual experience.
Yet it still works despite that, mapping only text-level references.

Not saying this is really the case, just a possibility.

2 Likes

I feel that is a bad definition - because by that logic, LLMs, which can already in-context learn so much, are technically 'brain-like'. IMO brain-like capabilities have a much wider breadth than just that, but I do agree it's a crucial component.

I think you're misunderstanding; achieving meta-learning is not something backprop would be helpful for - for instance, current LLMs already meta-learn to an impressive extent, hence the whole hype about ChatGPT despite the underlying model being years old. Recent evidence in fact suggests that LLMs may implement gradient descent when in-context learning, which is quite interesting.

The difference would be that it's not one learner, it's millions. You don't learn from one modality; you learn priors by processing several in parallel. That's where the scaling part steps in.

I have a better variation. The brain processes sensory input with respect to the world or environment in which it resides. Most brains do that on Earth. However, researchers quickly realized that iterating in the real world with a robot brain is fundamentally unscalable - hence the entire NLP revolution.

The brain models our world through all our different modalities. LLMs model the one-dimensional world of tokens and predict it very effectively - essentially trying to perform the same objective as their biological counterpart.

Why, then, can it twist material that has never existed before into new and exciting forms? There's a reason LLMs score 0% on even the most sophisticated plagiarism detectors in the world.

It does produce stupid output sometimes - but the point of scaling is that it produces stupid output less often as we scale. If you hooked up an animal brain to a large corpus, it would be closer to GPT-1 than anything. Everything would sound stupid until you plug in a more complex brain.

Absolutely nothing you say about LLMs is backed by any references. I'd look forward to them if you have some.

You put it much better than me. LLMs have little to no reason to be grounded in the "real world" - the same way our imagination is rarely grounded, because it doesn't need to be. Our brain doesn't punish ungrounded imagination, but it does punish when those imaginings don't align with the environment it is currently in to fulfill its goals.

When you apply RLHF, you ground the LLM somewhat in normal human context & conversations and thus imbue some priors - hence came the whole ChatGPT hype (it's just GPT-3 + RLHF). It behaves in a more human-like and natural way, which is what makes it more convincing.

2 Likes

OK, talking won't take us far.

Would you (or anyone you know) be willing to try to solve a simple RL task via LLM prompt engineering (or one-/few-shot/meta-learning, whatever it should be called)?

CartPole, for example, has a very small state space (4 numbers; a 0-50 resolution is more than sufficient), a very simple action space (LEFT or RIGHT), and relatively simple reward information.
What I know for sure is that the number of learnable time steps needed to solve it is under 100 (occasionally as low as 20), so that makes 6 tokens/timestep x 100 = 600 example tokens, preceded by the "programming" story.

But I can provide samples of learning in 20-30 timesteps to keep the “programming” compact.

Various rules could apply, e.g. telling it or not that it has to solve the cart-pole. That might make for an interesting follow-up experiment: e.g. whether a prompt explaining the physics of the cart-pole helps it learn faster, or, even better, having it explain on what grounds it makes the correct decision.

PS: if successful, it would be worth publishing, I guess.

1 Like

Like Actor distillation? It leverages the meta-learning of a frozen LLM to handle unseen RL environments. Though technically GATO does it too - AD performs it in a more efficient and explicit manner.

Yep, it shouldn't be hard to try on your own with ChatGPT. Just sample CartPole once, and if you want you can instruct it with natural language or give it multiple examples of how to solve the env.
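To make that concrete, here is a minimal sketch of what "sample CartPole once and serialize it for a prompt" could look like. It assumes gymnasium's CartPole-v1; the 0-50 discretization, the per-line format, and the random placeholder policy are my own assumptions, not part of the original proposal.

```python
import gymnasium as gym
import numpy as np

BOUNDS = np.array([2.4, 3.0, 0.21, 3.0])  # rough ranges for the 4 state variables

def to_bins(obs, n=50):
    # Map each continuous state value to an integer bin in [0, n].
    scaled = (obs + BOUNDS) / (2 * BOUNDS)
    return np.clip((scaled * n).astype(int), 0, n)

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)
lines = []
for t in range(30):                      # 20-30 timesteps, as suggested above
    action = env.action_space.sample()   # placeholder policy; swap in a real one
    next_obs, reward, terminated, truncated, _ = env.step(action)
    state = " ".join(map(str, to_bins(obs)))
    lines.append(f"{state} {'RIGHT' if action else 'LEFT'} {int(reward)}")
    obs = next_obs
    if terminated or truncated:
        break

# Roughly 6 tokens per timestep: 4 state bins, 1 action word, 1 reward.
prompt = "Cart-pole transcript (state action reward per line):\n" + "\n".join(lines)
print(prompt)
```

The resulting transcript can then be pasted into a ChatGPT prompt, preceded by whatever natural-language "programming" story you want to test.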

1 Like

I think you are interpreting my simple statements too shallowly. An NN is fundamentally a function; remove its parameters and, like any function, its output range will obviously diminish - that is simple math. Therefore it will not work the way it is supposed to work.

The brain is biological; it has chaotic chemical and electrical interactions in it, so it cannot easily be characterized like a static system. We may not know its full capabilities yet.

1 Like

The more I read about meta-learning, the more this definition sounds stupid to me, sorry.

Can you instead state the problem statement of a meta-learner, its expected inputs, and its expected outputs? Perhaps show "learning to learn" as an equation so that you can make your point about it better?
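For reference, one concrete formalization people often use is the MAML-style bi-level objective; this is offered only as an example of what such an equation can look like, not as the definitive answer:

$$\min_\theta \; \mathbb{E}_{\tau \sim p(\mathcal{T})} \Big[ \mathcal{L}^{\text{val}}_\tau \big( \theta - \alpha \, \nabla_\theta \mathcal{L}^{\text{train}}_\tau(\theta) \big) \Big]$$

Here the input is a distribution over tasks $p(\mathcal{T})$, each task $\tau$ providing its own training and validation losses; the output is a set of meta-parameters $\theta$ (an initialization) from which one or a few inner gradient steps of size $\alpha$ adapt well to a new task drawn from the same distribution.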

3 Likes

Thanks for the reference; however, that paper uses the term Algorithm instead of Actor - are you sure it is the one you wanted to suggest?

PS: anyway, it's interesting, although at first glance it does a different thing.

I don't understand what you mean here; what do you mean by "inherent link"?

DL maximizes the likelihood of the model parameters given a dataset; doesn't this make parameters first-class objects in DL? Shrink these parameters and the approximation power shrinks as well.
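For what it's worth, writing the standard maximum-likelihood objective out makes the point concrete: given a dataset $D = \{x_i\}_{i=1}^{N}$, training picks

$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_\theta(x_i),$$

and shrinking the parameter vector $\theta$ shrinks the family of distributions $p_\theta$ the model can represent.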

1 Like

If it were an equation, then you wouldn't need an NN at all…?

Yeah, it's just a bit of waffled terminology - it actually has a different name, but AD seems to be the most convenient one to use. The same way as foundation models, I guess :person_shrugging:

As in, even if you have a single neuron, that's still DL - "deep", I suppose. There isn't any set threshold beyond which networks become deep. We call them deep when they consume a sizeable portion of our compute resources, but it's utterly subjective.

1 Like

A PID controller can also solve the cart-pole problem. The parameters of the controller need to be tuned ahead of time, and it does not learn at run time.
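A minimal sketch of that idea, assuming gymnasium's CartPole-v1: a hand-tuned PD-style controller on pole angle plus cart position, thresholded into the discrete LEFT/RIGHT actions. The gains are illustrative guesses, not tuned values from this post.

```python
import gymnasium as gym

def controller(obs, k_x=0.5, k_xdot=1.0, k_theta=10.0, k_thetadot=2.0):
    # obs = [cart position, cart velocity, pole angle, pole angular velocity]
    x, x_dot, theta, theta_dot = obs
    u = k_x * x + k_xdot * x_dot + k_theta * theta + k_thetadot * theta_dot
    return 1 if u > 0 else 0  # 1 = push right, 0 = push left

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)
total, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, _ = env.step(controller(obs))
    total += reward
    done = terminated or truncated
print(total)  # with gains in this ballpark it usually lasts the full 500 steps
```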

Interestingly, the spinal cord and brain stem also use feedback-based controllers to enact muscle control and balance. For more info, see How The Spinal Cord Generates Behavior.

3 Likes

@dmac yes, a classical PID controller can solve this problem well.
In the meantime, I read some works using TD learning in the striatum for RL. It is only an idea and needs more optimization work to run robustly.

1 Like

Sure, a PID can solve it, but that is beside the point.
It can still be framed as an ML problem in order to verify and understand the capabilities of an AI algorithm.

E.g. some folks gave GPT-3 simple math examples (addition, I think) in order to check whether it actually learned simple math operations from its large training dataset. The advantage of math is that you can pick arbitrary numbers that were almost surely not seen in its training data, so a solution would imply that it does more than just regurgitate/rearrange its training data in a statistical manner.
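A rough sketch of that kind of probe; the `ask_llm` callable is a hypothetical stand-in for whatever model interface you use, and nothing here comes from the original experiments:

```python
import random

def make_probe(digits=12):
    # Random operands this long are very unlikely to appear verbatim in training data.
    a = random.randint(10**(digits - 1), 10**digits - 1)
    b = random.randint(10**(digits - 1), 10**digits - 1)
    return f"What is {a} + {b}?", a + b

def accuracy(ask_llm, n=100):
    correct = 0
    for _ in range(n):
        question, answer = make_probe()
        reply = ask_llm(question)        # hypothetical model call, e.g. an API wrapper
        correct += str(answer) in reply  # count a hit if the exact sum appears
    return correct / n
```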

3 Likes

I’ve been into brains and machine learning (more brains, for purposes of machine learning) for the last 20 years. What still surprises me to this day is how everyone completely disregards, overlooks, and omits the intelligence of much smaller creatures with much smaller brains. Even insects exhibit powerfully adaptive behaviors that we still cannot replicate even with orders of magnitude more parameters than they have. That should be a huge red flag as to how misled the mainstream’s approaches to machine learning are.

The answer isn't scale; it's design/architecture. Backpropagation isn't the solution either; it's a red herring. It can do stuff, and it looks like it could potentially do amazing stuff, but only with exorbitant amounts of compute and massive parameter counts. It's impractical and never going to be useful in an online-learning solution.

Real machine intelligence will be the product of clever architectures that don't rely on any backprop. Backprop is weak, slow, and expensive. Does that sound to you like something that can adapt quickly enough to sustain its survival?

Lastly, when people compare the number of neurons/synapses in a brain, they're not taking into account that a biological brain includes many neurons and synapses devoted exclusively to biological functions that are not relevant to intelligence.

7 Likes

Completely online learning needs more time, but I believe that the combination of self-learning and online learning makes sense.

1 Like

Reminds me of Stephen Wolfram's concept of computational equivalence.

1 Like

Any hint on what those clever architectures might be? I am guessing you think they should rely on some version of Hebbian learning instead? I am not deeply involved, but my impression is that backprop won over Hebbian learning in ML because it's better at static, easily defined tasks, where the GPU pipeline can operate without interruption. Hebbian learning is far more flexible, and thus efficient in open-ended online learning, but this flexibility comes with networking overhead costs: stateful neurons operate in a largely asynchronous fashion.
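For readers who haven't seen it written down, the local rule being contrasted with backprop here is simply (with $\eta$ a learning rate, $x_i$ presynaptic and $y_j$ postsynaptic activity), along with Oja's stabilized variant that keeps the weights bounded:

$$\Delta w_{ij} = \eta \, x_i \, y_j \quad \text{(plain Hebbian)}, \qquad \Delta w_{ij} = \eta \, y_j \,(x_i - y_j \, w_{ij}) \quad \text{(Oja's rule)}$$

The update depends only on locally available quantities, with no global error signal, which is what makes it amenable to the asynchronous, stateful operation described above.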

1 Like

I agree true AGI would be far superior to ML, but it might be overkill. I think ML will stop improving once the datasets are basically the whole internet - the internet in a bottle, so to speak. That's like a 2x multiplier for progress, whereas true AGI wouldn't have that limit, so it might be more like a 1000x multiplier. Unless society drastically changes, that's bad.

1 Like

I know what you mean, but data is changing all the time, and ML/data mostly doesn't answer the WHY questions. Again, data is changing, so there is data decay to consider, which makes the multiplication operation alone insufficient - AGI deals with forgetting better, IMO, and that is a first-class feature.

1 Like

No it won't; it can learn from generative processing, like AlphaZero. And then there is endless physical-world exploration - DL is not limited to language.

1 Like

Don't keep us in suspense - how did it go?

Simple example: you can add 101010101010 to 202020202020 easily in your head. Can GPT?
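(For reference, the carry-free sum is 303030303030, so any answer the model gives is trivial to check.)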

1 Like