Yup. I don’t usually link papers here because I don’t think this is the right forum, but I’ll be happy to oblige if you want me to back up my most assertive statements:
- https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- [2009.07118] It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
- Andrea Madotto | Language Model as Few-Shot Learners for Task-Oriented Dialogue Systems is a brief and interesting blog post
- and of course, [2005.14165] Language Models are Few-Shot Learners
No, that’s not a concern at all. In practice, even for the most complex tasks you don’t need more than 2k dimensions in a layer at most. It goes back to why GloVe and other word/sentence embeddings were in the [256, 512] range. We don’t know why, but empirically they work just as well as ones in the thousands.
I don’t get what you mean. Gradient descent doesn’t brute-force the entire possible parameter space - in fact, that’s the entire problem it was supposed to solve.
Again, if you actually attempt to brute-force the parameters - let’s say each parameter is bounded in [0, 1] for simplicity, and you bucket it to 0.00001 precision (though in practice, precision is much lower with fp16, and you can push fp32 to about 7 decimal digits).
This discretization gives us 100000 possible values per parameter, and we’ll ignore the biases for simplicity. Thus, with 175 billion parameters each taking one of 100000 values, your entire space of possible parameter configurations for GPT-3 is: 100000^{175000000000}
So yea, just a tiny bit away from brute-forcing it
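To put a rough number on that, here’s a quick back-of-the-envelope sketch (my own, just to illustrate the scale; the [0, 1] bound, 1e-5 buckets, and ~175B parameter count are the same simplifying assumptions as above):

```python
import math

# Assumptions from the argument above (illustrative only):
# each parameter bounded in [0, 1], discretized to 1e-5 precision,
# ~175B parameters, biases ignored.
values_per_param = 100_000
num_params = 175_000_000_000

# Number of distinct configurations = values_per_param ** num_params.
# Far too large to materialize, so only compute its order of magnitude:
# log10(values_per_param ** num_params) = num_params * log10(values_per_param)
log10_configs = num_params * math.log10(values_per_param)
print(f"roughly 10^{log10_configs:.2e} possible configurations")
# -> roughly 10^(8.75e+11), i.e. a 1 followed by ~875 billion zeros.

# For contrast, gradient descent never enumerates any of this: it takes on
# the order of 1e5-1e6 update steps, each just following the local gradient.
```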
Hm? The energy… is spent on GPUs. What?
I believe there were some recent papers arguing that meta-learning is basically the language model learning gradient descent encoded in its weights. I haven’t read them, nor does there seem to be any particular consensus in the community, but I guess that’s always on the table.
I feel it just learns a more complex and nuanced algorithm that’s kinda near full meta-learning, but not quite. Eventually, scale would get us there, but we’d have to do it smartly to be compute- and data-efficient.