Are billions of parameters (or more) the only solution?

So where will it get information about the physical world, if not from documents and images captured or created by people?

It has no way to construct models of the world. It’s just a fancy search engine.

1 Like

Absolutely. Could not have put it better myself.

The missing piece is models. All animals create models of the world they live in based on evolved templates and sensory input. We have absolutely no idea how they do that, or how we might replicate it.

2 Likes

I think your write-up actually helped me understand a lot more about why DL is perceived the way it is. Cognitive biases aside, I feel the fundamental problem is simply this: the average person outside the DL field (who isn’t really invested in every single piece of the jigsaw puzzle and how it fits together) sees that branch of AI as more of a mixed bag of scaling with some backprop thrown into the mix.

Most people are so far removed from where the current “cutting-edge” research actually is (nor can I blame them) that it’s easy to latch onto claims in popular media that every single DL researcher is working on the same stuff, and that backpropagation + simple neural networks is the most popular path to AGI.

I think it’s a bit like quantum physics: nothing makes any sense whatsoever, so the general public often assumes no one knows anything about it. Obviously we still don’t have all the answers, but experts in the field have much better intuitions and hypotheses about why certain things are the way they are.

We’ve barely scratched the surface in terms of modalities. Models that are cross-modal exhibit higher data efficiency and compute efficiency because they’re implicitly forced to meta-learn.

In short, even YouTube alone could provide enough data for AGI - but the focus is more on diversity & quality than quantity. GATO-style reinforcement learning trajectories would make up a significant portion, language would always be there, and images + audio would be next.

Video is a popular one because it’s explicitly world modelling and closest to the task humans perform, but quite a bit of work remains before it reaches its full potential.

There is definitely some extremely impressive research on LLMs deployed on the edge and able to perform complex tasks. But the physical world is a better learning environment for an AGI bootstrapped on the internet first, IMO.

1 Like

Sorry, not an answer. Animal brains have sense organs they use to construct models of parts of the real world. ML has neither sense organs nor models, just a vast trove of documents and images, all collected or created by humans. Forget ‘modalities’ and ‘meta-learning’; show me an ML system producing something that is not just a better search engine (of human artefacts).

1 Like

I’m pretty sure that I’m one of these “average” persons. It’s also easy to guess that even the average person in the DL field is not invested in every jigsaw puzzle. I personally know ML/DL people who are making a lot of money because of their skills, but even then, most of them admit that they do not have good intuitions about the underlying computation. After all, they don’t have to understand it unless explainability is part of their job. Right?

Another thing to note: in this forum I have observed for quite a while people who are very invested in brain theories and the like. I’m not one of them; I’m merely a computation-loving guy. That said, these people have most probably built strong constraints around how the brain works, and that’s why, when they are presented with simple algorithms such as backprop, they cannot easily reconcile them with their picture of the brain. I admire this behavior because it ensures the resulting algorithms will be biologically constrained, but I’m not saying it’s always helpful. Numenta does this, by the way; they strictly constrain their theories within the realms of neuroscience.

If I may suggest: if you think that some of the people here are clueless about DL, then please challenge them with things that you know; disprove them by asking questions and presenting facts. That will be more helpful. Otherwise it’ll be your word against other people’s word - but that’s OK if the discussion is meant for brainstorming.

1 Like

Well, generally, it appears that LLMs cannot do simple maths.

What does that mean? (the first question)
They’re pretty fine with adding (maybe multiplying?) 2-3 digit numbers, but not more.

But what does that mean? (the second question)
Are they just wasteful, gigantic statistical idiots incapable of “actual intelligence”, or is reality more complex than that?

And that’s a much more nuanced issue.

For example, forget your case, which requires me to notice a repetition prone to algorithmic simplification. Take two random 4-digit numbers and add them. I know it sounds simple, but do not cheat. Close your eyes. I myself would have a problem even remembering two 4-digit numbers.

We have immediate answers for single-digit addition & multiplication, but we lack a one-shot answer for adding arbitrarily big numbers. Like LLMs, we simply do not know.

Unlike LLMs, as Piaget brilliantly put it, we also know what to do when we do not know the answer: “Take a pen, a paper, bla, bla, …”
We know to search for an algorithm/recipe we once learned for adding numbers, recall it, and apply it.
And what is funny is that whatever follows the “Oops! I don’t know” moment can be put into text, and LLMs are pretty good at learning the next “good” word for an existing text.

Because I wouldn’t be surprised if, instead of saying:

  • “What is the sum of 67214 and 55421”

we begin with:

  • “Can you describe an algorithm to mentally add two arbitrary numbers?”
  • “yes, bla, bla…”.
  • “Ok then apply the algorithm you described to add 67214 and 55421”

we might get a surprising result. Even without “cheating” with external memory in the form of pen and paper.
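For what it’s worth, here is a rough sketch of that two-step prompting idea in code. The `call_llm` function is a hypothetical placeholder for whatever LLM client you happen to use (it is not a real API); only the shape of the conversation matters here.

```python
# Sketch of "describe the algorithm first, then apply it" prompting.
# `call_llm` is a hypothetical stand-in for any LLM client; it is NOT a real API.

def call_llm(conversation: list[str]) -> str:
    """Placeholder: send the conversation so far to a model and return its reply."""
    raise NotImplementedError("wire this up to your LLM client of choice")

def add_via_described_algorithm(a: int, b: int) -> str:
    conversation = ["Can you describe an algorithm to mentally add two arbitrary numbers?"]
    conversation.append(call_llm(conversation))   # the model writes the recipe down
    conversation.append(
        f"Ok, then apply the algorithm you described to add {a} and {b}."
    )
    return call_llm(conversation)                 # the recipe is now part of the prompt
```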

They resemble us (as in emulating natural intelligence) in certain key aspects, while failing to incorporate other important ones.

The backprop crowd still abhors recurrence - that’s what transformers were designed to avoid. When they figure out they can apply it, a lot of things will change.

Sure, there are still missing pieces, one of the most important being an algorithm to answer the question “how do I know that I do not know?”. But that’s probably easy.

The megalithic size/training data puts them at a disadvantage here; it makes them look like they have all the answers!

Put otherwise, the next big step for LLMs will not come from further scaling but from incorporating recurrent prompt engineering.

2 Likes

Yeah, claiming ML people know & understand what is going on in the domain as a whole is bonkers. The vast majority of them don’t even know the dozens of brilliant ideas Jürgen Schmidhuber had decades ago, while claiming them as their own new brilliant ideas.

2 Likes

I have a simple counterpoint - RETRO is pretty much an explicit “search engine”, as you call it, powered by LLMs. Yet it has none of the meta-learning, few-shot capabilities of other models. This idea that LLMs are just search engines is quite a laughable one - it’s like saying the brain is just a bag of electricity. You won’t get AlphaFold-level generalization capabilities with a fancy search engine. It’s not a trivial feat to predict protein folding for a huge bank of 200 million+ proteins with enough accuracy to be usable in the real world (like the famous case of AlphaFold predicting how the COVID-19 spike protein worked 7-9 months before labs were able to verify it experimentally).

Well, devs are always different to researchers. A CS grad may be able to program a webapp very well, but that doesn’t mean they can necessarily understand all CS research. Skills and interests matter here.

That’s pretty much every single one of my posts for the past few months :stuck_out_tongue_winking_eye:

In short, tokenization and the inability to allocate computational resources dynamically (you spend the same FLOPs per token irrespective of the complexity of the problem)

Which is why it isn’t able to learn arithmetic easily. It can learn somewhat (for instance, less than 0.1% of all possible 2-digit/3-digit arithmetic problems appear in the dataset according to the GPT-3 paper - yet it still generalized across 2-digit/3-digit arithmetic almost perfectly).
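To see the tokenization part of this concretely, here’s a quick sketch using the tiktoken package; the encoding name is just an example, and the exact chunking will vary.

```python
# Long digit strings get chopped into arbitrary multi-digit chunks, so the model
# never sees digits as aligned columns. Requires the `tiktoken` package; the
# "gpt2" encoding is used purely as an example.

import tiktoken

enc = tiktoken.get_encoding("gpt2")
number = "124013240931538957319024873290564390247395874239"
print([enc.decode([t]) for t in enc.encode(number)])
# Typically prints uneven chunks like ['124', '013', '240', ...] rather than one
# token per digit, which makes column-wise carrying hard to learn.
```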

We still haven’t solved dynamic computation for LLMs, but we have found a temporary solution - CoT: https://arxiv.org/pdf/2211.09066.pdf


Basically, the model doesn’t have a mechanism to allot dynamic computation, so we move some of the computation to the prompt itself. This allows it to augment memory and computation by breaking the problem into parts.

The neatest thing is that a model can enter CoT by itself too, using <work> sentinel tokens - so it’s quite an autonomous method with zero drawbacks whatsoever (every possible task gets a huge % boost).
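As a rough illustration, a scratchpad-style prompt looks something like the sketch below; the <work> delimiters mimic the sentinel-token style mentioned above, and the exact format in the linked paper may differ.

```python
# Minimal sketch of the scratchpad / CoT idea: intermediate computation lives in
# the prompt text itself. Delimiters and formatting are illustrative only.

FEW_SHOT_PROMPT = """\
Q: What is 29 + 57?
<work>
9 + 7 = 16, write 6, carry 1.
2 + 5 + 1 = 8.
</work>
A: 86

Q: What is 67214 + 55421?
<work>
"""

# The model is expected to continue inside <work>...</work>, doing the
# column-by-column addition in text before emitting the final "A:" line.
print(FEW_SHOT_PROMPT)
```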

Incidentally, that’s kinda what I’m working on

OK, I’m not going to get political about this here, but note that the “brilliant” ideas of Schmidhuber are still a point of contention - and only a small percentage of the wider deep learning community takes them seriously. In the end, it’s drama between two famous personalities over a very subjective and touchy issue - not easily black and white.

If Schmidhuber’s ideas were so great, they would have actually scaled and worked - and he wouldn’t have become such a huge meme for claiming every single DL model is a “special case” of his work. Conversely, LeCun et al. could have cited him here and there - but I do kinda agree that his impact isn’t that big.

In the end, every tech bumpkin and scientist has ideas. It’s how you execute and present them that makes them successful. Just because you had the idea doesn’t mean the authors who made it work outright plagiarized you…

2 Likes

The Schmidhuber part was a joke. Still, the guy is fine, as it’s fine to forget or simply miss his ideas :slight_smile:

2 Likes

My hypothesis is that the best of ML is still just a super search engine with a gloss on formatting the output. No-one seems willing to falsify it.

My sum was chosen to use numbers that would likely not be found in a search, and yet would yield easily to quite a simplistic algorithm. That’s all, nothing subtle.

My stronger hypothesis is that a better search engine does not lead to AGI. What is required IMO is the ability to create models, from which predictions can be made of things that are not found in the source data. A model of numbers is required before arithmetic.

1 Like

Well, I would say it’s necessary. A “search engine” is just an indexed storage for knowledge.

We ourselves are search engines, too.
What do I do when I don’t know what to do? I start searching in what I already know for paths towards what I apparently do not know. And it can be a laborious search, because I haven’t yet built a suitable index.

2 Likes

I have a simple challenge to this, in addition to the points I made above. Try:

12401324093153895731902487329056439024739587423975392759129489214921849218492184921849128428412395730275
+ 1

In ChatGPT. The issues still stand - tokenization on numbers is absolute shite, there is no dynamic computation happening, the model simply can’t store a number this big in its STM.

But it still learnt a simple circuit in its billions of parameters: when you add 1 or another small number, no matter how big the number is, only the last few digits change. Which is something it can easily handle (remember, it learns 2-digit/3-digit arithmetic almost perfectly despite seeing only 0.1% of the total possible examples).

The simple fact is that LLMs learn millions of these circuits (the induction-heads work I cite above explores exactly that), and they clearly learn algorithms.
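A toy, non-LLM illustration of that “only the last few digits change” shortcut (the number and tail length below are arbitrary):

```python
# Adding a small number to a huge one only requires touching the tail - a much
# simpler "circuit" than full long addition. Tail length of 12 is arbitrary.

def add_small(big: str, small: int, tail: int = 12) -> str:
    head, last = big[:-tail], big[-tail:]
    bumped = str(int(last) + small).zfill(tail)
    if len(bumped) > tail:               # rare carry out of the tail
        return str(int(big) + small)     # fall back to full addition
    return head + bumped

n = "39024739587423975392759129489214921849218492184921849128428412395730275"
assert add_small(n, 1) == str(int(n) + 1)
print(add_small(n, 1)[-8:])   # 95730276 - only the tail moved
```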

1 Like

Speaking of numbers, I remember someone said ChatGPT is stupid because when asked:

  • which number is bigger, 1253 or 1237?

It answered “they are equal”, which isn’t totally wrong: if we were shown two apple piles, each with the respective number of apples, we would have given the same answer. (With some doubt, OK. Transformers should learn proper not-knowing, which is quite lacking in the training data :smiley: )

1 Like

I tried 4 times, and it answered correctly each time, in different phrasings. But ChatGPT fixes temperature > 0 by default with no way to change it, so it won’t be reproducible (with enough iterations it would indeed eventually give the incorrect answer) - try GPT-3 with T=0 and that will get you the desired answer deterministically.
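For anyone unfamiliar with why T=0 is deterministic and T>0 isn’t, here’s a small standalone sketch of temperature-scaled sampling (the logits are made up, not from any particular model):

```python
# Temperature rescales the logits before sampling; at T=0 sampling collapses to
# argmax, so the output is deterministic. Logits below are made up.

import math, random

def sample(logits, temperature):
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])   # greedy pick
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]                   # unnormalized softmax
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.8, 0.5]   # e.g. "1253 is bigger", "1237 is bigger", "they are equal"
print([sample(logits, 0.0) for _ in range(5)])   # always index 0
print([sample(logits, 0.9) for _ in range(5)])   # varies from run to run
```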

1 Like

Maybe you have interpreted this search-engine analogy too shallowly.

It’s not rocket science to me that these neural-network-based models are like search engines. WHY? They all have a common goal: maximize/minimize an objective function. HOW? They search for the right parameters/coefficients. WHAT? Yes, they are simply functions with a massive number of parameters as their core architectural component. Knowing this is very important, especially in designing a new breed of algorithms.

In my first question I asked whether billions of params are the only solution because, obviously, these params need to increase to represent important structures in the data. The more complicated the data gets, the more structure it needs - and the more randomness, the more structure and the more params, because randomness cannot be compressed. Does the brain suffer from the same thing? Maybe not, I don’t know. If it doesn’t, then what could its search algorithm be?

Searching here is simply an observation/description by the viewer, not a goal of the engine.

1 Like

I hate to say this, but I think you are missing the fundamental intuition of NN learning (I hope not) when you say I misunderstood this.

I’ll tell you WHY. The process of learning in NNs is essentially a function-approximation problem. There are classes of functions out there that can produce the desired output for the task (prediction). Those functions’ parameters/coefficients are what the NN approximates, with backpropagation adjusting them based on error gradients. So the NN is basically a cartoony version of the approximated function.
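To make that picture concrete, here is a bare-bones sketch of a one-hidden-layer net fit to a target function with plain gradient descent; the layer sizes, step count and learning rate are arbitrary choices, not anyone’s reference implementation.

```python
# A cartoon of the function-approximation view: a tiny net approximates sin(x)
# by following error gradients. Sizes, steps and learning rate are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x)                                   # the "true" function to approximate

W1 = rng.normal(0, 0.5, (1, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, (32, 1)); b2 = np.zeros(1)
lr = 0.05

for _ in range(5000):
    h = np.tanh(x @ W1 + b1)                    # forward pass
    err = (h @ W2 + b2) - y                     # error signal
    gW2 = h.T @ err / len(x); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)            # backprop through tanh
    gW1 = x.T @ dh / len(x);  gb1 = dh.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2              # nudge params along the gradient
    W1 -= lr * gW1; b1 -= lr * gb1

print("final MSE:", float(np.mean(err ** 2)))   # should drop well below the initial ~0.5
```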

Now, from the definition of “learning how to learn” (which is not helpful), I could easily say “approximator of an approximator” by substitution, assuming that definition is not a joke. While reading more about meta-learning, I learned that there are many ways to do it; gradient descent is not the only way. There are a bunch of approaches out there, and thus it’s a distribution - it’s not black or white, “meta-learning can blah”, etc.

In his podcast with Lex Fridman, Schmidhuber mentioned a meta-learner consuming a model in order to learn its internal params/structures; hence that model must be consumable. See? In this instance the meta-learner is approximating the model by guessing its most important priors, and thus it can do few-shot learning. Those priors are built by existing large networks and thus require large numbers of params. Most importantly, meta-learners may reduce the number of params, but only relative to the models they are trying to learn. It’s only an improvement on top of an expensive DL architecture (or several). Pretty much like EVs.

1 Like

Billions of parameters the only solution? As per Deftware: ask an elephant about excess, or a leaf-cutter ant.

Biology can’t wire optimally as required, so it pre-loads the system with high synaptic probability (i.e. excess, to try and ensure every possible synaptic connection can form - which it can’t).

Computers can effectively wire as required - optimally (I think the cart pole experiments by cesar_t potentially touch on this).

GPUs are used (which distorts the math approach) because there are not a lot of ‘very good’ coders who are also very good at math - these are very rare people who normally end up in banks rather than doing academic research (I’m guilty of hiring one lead AI guy in the UK over a decade ago into a non-academic world).

Hardware plays a distorting role: the memory throughput of a GPU, thanks to its very wide memory bandwidth (bits per request), makes reading everything faster than a single server with 8 memory channels reading randomly. This distorts many people’s perspective on what should be coded and how. I have 22 decade-old servers (6 memory channels each) that can perform random memory requests at only the same aggregate bandwidth (MB/sec) as a single high-end GPU card - but I then have 1 TB or more of memory attached to play with.

Only 1 GPU equivalent and 1 TB is not a lot at all at today’s bleeding edge (maybe 1000x smaller and 10^6 slower), but consider what you can do with just over 1bn updates per second when those updates are optimally targeted and not broad-brush, nudge-type updating. The network is indexed at the neuron and dendrite-synapse level and builds an evolving network - similar to the hash-index paper concept from a few years back, to increase efficiency massively. Then iteratively pulse the network with senses and follow-up waves to evolve an area of attention, which may or may not be the final answer.

“Build it and they will come” is the Field of Dreams view of hardware; in practice it’s “code it and they will build”. Hardware follows software around, not the other way round. Look at the full-custom wafer-scale builds going on.

DL methods seem to emulate a proxy for the cerebellum (instincts that lack emotional feedback iterations - i.e. the cortex forecasts “that will hurt…” and inhibits the instinct, so the cerebellum tries something different), so we end up with some dumb answers with no feedback filter. Not to say the DL approach is wrong as such, just missing the rest of the package.

You have obviously never encountered challenges with patent law. Just imagine getting HTM to work properly and Numenta sitting back saying: wow, that’s cool, shame we did not do that, we only had the idea…

1 Like

I agree, this is a different category (I wondered how long it would take before it came up; RETRO is not one of them). Here there is no trove of labelled human-created documents or pictures as the training set. Instead the training data is algorithmic or mechanistic in nature: AlphaGo, AlphaFold, ANNs that play video games, General Game Playing, the link to grokking, and so on.

This is obviously a domain of a different character, and yes, it does seem that within it something more ‘intelligent’ is going on. But think about games like chess: there are no original moves, only original positions, and these must bear a resemblance to some previous position reached either by a human player or by the program itself. So we’re back to the search engine again, aren’t we?

But the successes are few, each is for a single narrow domain with sharp borders, and despite early promise I don’t see signs that they generalise.

1 Like

We can use HTM theory to cast light on this.

Dendrites have an activation threshold, somewhere between 8 and 20 active synapses.
Let’s call this number “T”.

For a group of neurons to detect a pattern in their inputs and represent it in a way that other neurons can detect, there need to be at least “T” many neurons, each with “T” many synapses.

So every time a signal is transmitted between two groups of cortical neurons, at least T^2 synapses are involved.
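For concreteness, plugging in the 8-20 threshold range quoted above:

```python
# Back-of-the-envelope numbers for the T^2 argument, using the 8-20 range above.
for T in (8, 20):
    print(f"T = {T:2d}: at least {T * T} synapses per transmitted pattern")
# T =  8: at least 64 synapses per transmitted pattern
# T = 20: at least 400 synapses per transmitted pattern
```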

1 Like

I think the same. And further, I would suggest that the key missing architectural piece of AGI seems to be, roughly, “the structure” of knowledge. That design of a “knowledge structure” should come along with algorithms to deal with the semantics that each concrete configuration (of knowledge) represents.

So the essential problem turns out to be: how does an AGI agent gain the ability to “create” knowledge from empirical input, and then “interpret” that knowledge to purposefully make decisions by simulating arbitrary versions of the future?

1 Like