Are billions of parameters (or more) the only solution?

Papers don’t guarantee proof, especially in AI, but I understand what you mean here. IMHO, at least presenting a simple analogy through an equation (no need for a paper) is helpful. After all, there is probably a high chance that people who come here have some computer science/mathematics background and basically ate linear algebra for breakfast at some point in their lives. For example, I use the term “function approximator” because I think an ANN is one, without getting into how it actually does the approximating. Now, what is meta-learning with respect to ANNs? From what you’ve mentioned above and the papers that I’ve scanned through, meta-learning is basically a function approximator of function approximators. It didn’t get rid of the first-order function approximators, and this is what I meant about no energy being lost: it’s still doing the first-order function approximation, except that its main concern is a higher-order function approximation.

I’m very interested in this part here, hence my original question. Current ANN advancements didn’t come to fruition without scaling the number of parameters that need to be tuned by backpropagation. Of course, one can argue against this statement by saying that in some applications, scaling of parameters (e.g. billions of params) is not necessary. I’m talking about the Google-scale DL systems of today that are being used commercially.

Now, I don’t like to convince myself that backpropagation + massive parameter scaling is the ultimate solution, because even the most advanced ML/AI today is very far from achieving AGI, and in that case it’s Game Over. Or maybe I’m wrong about this and the algorithm is enough, and the problem is more in the representation of the inputs.

If the brain doesn’t use a similar algorithm, then it must be using a much better one, something that does not search for parameter values at very high resolution. Intuitively, a high-resolution search can become fault-intolerant. Stochastic gradient descent, for example, is the opposite: it’s noisy but moves faster. Maybe the brain is doing something like this, but with the noise tolerance of HTM. I don’t know; it’s just an intuitive guess on my part.

1 Like

The number of parameters isn’t the problem. First, the brain seems to have way more synapses, which should be taken as a sign that they’re useful. Second, gigabytes are cheap.

The problem is a learning paradigm so inefficient that it needs to weigh ALL of them at inference and finely adjust ALL of them during learning. Not only once, but many times for every data sample. And not a few samples for each category, but thousands at least.

PS
What’s needed is an AI/ML equivalent of indexes in databases. Without indexing data, all computing would be nonsense: every database search would have to comb all storage, and every data item stored would have to rewrite all the other data items already in. That’s what DL/NNs actually do.
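To make the analogy concrete, here is a toy sketch (my own illustration with made-up data, not any real system): without an index, every query has to touch every record, which is roughly what I mean dense DL does with all of its parameters on every inference and update.

```python
# Toy contrast between combing all storage and consulting an index.
# All names and sizes here are made up for the sketch.
import random

records = [{"id": i, "value": random.random()} for i in range(100_000)]

# "No index": every lookup scans every record.
def lookup_scan(target_id):
    for r in records:
        if r["id"] == target_id:
            return r
    return None

# "Index": a one-time structure that turns each lookup into a single hash probe.
index = {r["id"]: r for r in records}

def lookup_indexed(target_id):
    return index.get(target_id)

assert lookup_scan(42) == lookup_indexed(42)
```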

2 Likes

I kind of think the fault lies with the mathematical method in this case. Turing’s universal computing model (essentially λ-calculus) assumes a closed global state that can only change according to the “head”’s write operations, and there is only a single “head”, so the computation model is essentially serial (or “single-threaded”).

Various process calculi seek to model concurrent/parallel computations in “reasonable” ways, by taking into account the “open communication” among many “closed computation agents”.

Though I haven’t seen any concurrent/parallel computation model as successful as Turing’s, our mind seems to favor “intuitive consistency” very strongly, yet mathematically provable “consistency” seems rather hard, if possible at all, unless the situation can be reasoned about in a serial manner, i.e. with “single-threaded” reasoning.

Isn’t that a limitation as well as a feature of the brain? With math solely being a product of the brain.

I think this is a good analogy, but I would like to add that data indices only solved it for the “read” aspect; the same problem remains in the “write” aspect (as the “index data” and the corresponding “principal data” get concurrently updated in parallel), i.e. the mathematical model of data consistency in database systems asserts far fewer guarantees than are actually needed in real-world cases. Deadlocks and livelocks are still pervasive in modern computer-driven systems, including databases.

2 Likes

Well, Turing machines are what I’m stuck with; my focus is on trying to use them instead of building another architecture, simply because I lack the knowledge or funds to experiment with hardware.

Parallelism isn’t that bad, as long as the needed communication bandwidth is reasonably low.

Locks/synchronisation I also do not see as an actual issue. Unlike databases, where consistency of every memory location is inescapable, a NN weight (or its “synapse” equivalent in other models) is both the statistical result of many adjustments and a tiny voter in the final decision. Concurrency errors should be both rare and irrelevant, amounting to small random changes in the model.
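As a hedged illustration of what I mean, here is a toy Hogwild-style sketch (made-up sizes and hyperparameters, not a claim about any particular system): several threads adjust the same weight vector with no locks at all, and the occasional race just acts like a bit of extra noise.

```python
# Lock-free ("Hogwild"-style) SGD on a toy least-squares problem.
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
true_w = rng.normal(size=20)
y = X @ true_w + 0.01 * rng.normal(size=10_000)

w = np.zeros(20)   # shared weights, updated by every thread without any lock
lr = 0.01

def worker(seed, n_steps=2_000, batch=32):
    local = np.random.default_rng(seed)
    for _ in range(n_steps):
        idx = local.integers(0, len(X), size=batch)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w[:] = w - lr * grad          # racy in-place update, no synchronisation

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance from true weights:", np.linalg.norm(w - true_w))
```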

Database indexes, yes, are more expensive to change, yet they’re still very far from rewriting the whole index on every insertion/update. Often it’s a tree structure that only requires updating the paths to the modified/new leaves.

1 Like

At the surface, yes, but since we don’t exactly know how the brain actually uses its synapses to learn the world, I’ll keep my doubts about whether the billions of params that DL needs are anywhere near optimal.

1 Like

That’s because you’re misunderstanding it. Meta-learning aims to approximate the core process of learning itself - I’ve used the analogy of differential equations several times. It basically amounts to learning a dynamic process.
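To make that concrete, here’s a toy sketch of the idea (my own illustration, not any published method): the thing being optimised is a parameter of the inner update rule itself, judged by how well the learners it produces perform across a whole family of tasks.

```python
# "Learning to learn" in miniature: the meta-parameter is the inner learning
# rate, i.e. we optimise the update rule that generates learners, not any
# single learner. All sizes here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def inner_loss_after_training(lr, n_tasks=20, steps=10, dim=5):
    """Average final loss over a family of random linear-regression tasks
    when each one is trained with plain GD at the given learning rate."""
    total = 0.0
    for _ in range(n_tasks):
        X = rng.normal(size=(50, dim))
        w_true = rng.normal(size=dim)
        y = X @ w_true
        w = np.zeros(dim)
        for _ in range(steps):               # inner loop: ordinary learning
            w -= lr * X.T @ (X @ w - y) / len(X)
        total += np.mean((X @ w - y) ** 2)
    return total / n_tasks

# Outer loop: "learn the learning" by searching over the update rule's
# parameter. Here it's a crude grid search; gradient-based meta-optimisation
# plays this role in real meta-learning systems.
candidate_lrs = np.linspace(0.01, 0.5, 25)
best_lr = min(candidate_lrs, key=inner_loss_after_training)
print("meta-learned inner learning rate:", best_lr)
```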

Scaling fundamentally is on the right path. However, scaling different architectures and techniques leads to vastly different results. Look at UL2 for instance - it leads to a flat-out 2x compute efficiency gain. This is a small factor, but it could be a game changer, all from tweaking a simple technique to match up with the theory.

SGD is in fact quite noise-tolerant, because you usually approximate the gradient over a minibatch and split it across devices. That’s basically the whole point of SGD and how its convergence guarantees work.
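A quick numeric sketch of that claim (toy linear-regression data, made-up sizes): the minibatch gradient is a noisy estimate of the full gradient, and the noise shrinks as the batch grows, which is why averaging across a batch or across devices tolerates per-sample noise.

```python
# How minibatch gradient noise shrinks with batch size.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + rng.normal(size=100_000)   # noisy labels
w = np.zeros(10)

full_grad = X.T @ (X @ w - y) / len(X)

for batch in (8, 64, 512):
    errs = []
    for _ in range(200):
        idx = rng.integers(0, len(X), size=batch)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        errs.append(np.linalg.norm(g - full_grad))
    print(f"batch={batch:4d}  mean gradient error={np.mean(errs):.3f}")
```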

NN parameters are quite tolerant too - you can wipe out nearly 60% of BLOOM’s weights for sparsity and still obtain only negligible drops in perplexity. That’s because LLMs converge to sparse representations without any external aid, for a variety of reasons (I don’t know of any easy explanations, but there’s a lot of literature out there on such emergent phenomena); it is clear, though, that this sparsity is leveraged quite well internally by the LLM.
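For anyone unfamiliar, this is roughly what unstructured magnitude pruning looks like (a toy sketch with made-up matrices, not BLOOM): if most of the signal sits in a few large weights, zeroing out the smallest ones barely changes the output.

```python
# Magnitude pruning on a synthetic weight matrix with sparse structure.
import numpy as np

rng = np.random.default_rng(0)

# A few large weights carry the signal; the rest are small.
signal = rng.normal(size=(256, 256)) * (rng.random((256, 256)) < 0.2)
W = signal + 0.01 * rng.normal(size=(256, 256))

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude fraction of weights."""
    threshold = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= threshold, W, 0.0)

x = rng.normal(size=256)
for sparsity in (0.0, 0.6, 0.9):
    W_pruned = magnitude_prune(W, sparsity)
    rel_err = np.linalg.norm((W_pruned - W) @ x) / np.linalg.norm(W @ x)
    print(f"sparsity={sparsity:.1f}  relative output change={rel_err:.4f}")
```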

Yeah, I feel a lot of the time I spend here is just giving a tl;dr and bringing members up to date :slight_smile:
There are plenty of techniques that do that with a very small computational cost and overhead [1] [2]

Right now, the focus is on meta-learning. If we achieve full meta-learning, then all you need is a really large ctxlen and a way to store and retrieve vectors really fast (which you can do by allocating GPU memory). In that way, it’s a learning paradigm which separates the parts: the network remains frozen and updates its weights only when we’ve reached the limits of our database. It all depends on the “In-Context Learning” (ICL) abilities of the model. If a model is able to meta-learn to a large extent, then it would already be close to a superhuman system.
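A hedged sketch of that “frozen network plus fast vector store” setup (every name and the embedding function here are illustrative stand-ins, not a real system): new knowledge goes into a memory of (embedding, text) pairs, and retrieval pulls the nearest entries back in as extra context for the frozen model.

```python
# Toy retrieval memory feeding context to a frozen model.
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: a deterministic pseudo-random unit vector per
    text. A real system would use a frozen model's embeddings instead."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

class VectorStore:
    def __init__(self):
        self.vectors, self.texts = [], []

    def add(self, text: str):
        self.vectors.append(embed(text))
        self.texts.append(text)

    def retrieve(self, query: str, k: int = 2):
        sims = np.stack(self.vectors) @ embed(query)
        return [self.texts[i] for i in np.argsort(-sims)[:k]]

store = VectorStore()
for fact in ["HTM uses sparse distributed representations.",
             "BLOOM is a 176B-parameter language model.",
             "SGD averages gradients over minibatches."]:
    store.add(fact)

query = "How does HTM represent information?"
context = store.retrieve(query)
prompt = "\n".join(context) + "\n\nQ: " + query   # handed to the frozen model
print(prompt)
```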

2 Likes

I’ll add that the brain has many different parts (and cortical columns), so the number of synapses devoted to a given function (e.g. vision, or representing visual objects with an emphasis on motion, or 2D, or whatever, or any of those restricted to a small patch of the sensor) is a small fraction of the total.

1 Like

You just said it :grinning:, so how did I misunderstand it? I guess you just want everyone to be an ML engineer. Let’s see.

Learning the params and optimizing an optimizer (a function) is basically, in simple terms, one function approximating another function.

1 Like

Here is my simple answer to my own question, which seems to have touched on areas of interest and attracted ML authority. I know it may be wrong; that’s why I asked for input, and please excuse me that ML is not my first language :slightly_smiling_face:.

Let’s remove the billions of params from some DL systems: will they still work? Definitely NOT.

Let’s remove some neurons from the brain in proportion to the params we removed from the DL system. Will that part of the brain still work? I don’t know. WHY? Because nobody knows how the brain actually models an object. But the interesting paper above about epilepsy is worth reading.

Are billions of params the only solution? I think yes for now, because we don’t know how the brain models objects; we just assume it has the best solution, as implied in the responses (the brain has more synapses). But anyone like me can also assume that the brain’s synapses are just enough to memorize objects, and that generalization is simply an emergent behavior.

This gives me hope that someday there will be a much better algorithm than gradient descent, as it is very inefficient with parameter scaling. Please don’t tell me it’s efficient, otherwise you are ignoring the billions of dollars and the energy spent in this AI era, mostly on research.

Hopefully the better algorithm will be the brain’s algorithm, or a subset of it, and the AI community will have other level-1 architectural options when building their AI systems. This was my hope for HTM.

1 Like

Well… I’m not a fan of DL, yet your argument isn’t quite useful. By saying DL doesn’t need so many parameters because we don’t know how synapses are used in the brain, you’d still have to:

  1. provide some plausible hypothesis about what else synapses are used for instead;
  2. prove that DL models, despite having parameters with quite different/alien mechanics/behavior than those assumed for synapses, should be subject to the unknown mechanics you assume make a brain intelligent.

If we admit brains and DL NNs are very different creatures I see little reason to force any architectural choices on one just because it’s implemented by the other.

There are some important resemblances sure but… not everything is necessary to match.

1 Like

No, that’s not what I mean. I mean DL is DL; it is what it currently is, and I reserve the other possibility that, in the space of optimization algorithms, there might be other algorithms that use somewhat fewer params or less brute-force-ish updates to their params.

Of course, I agree.

I’m amazed by the advancement of the DL-based ML algorithms of today, and the reason I am questioning them is that we may find brain algorithms by learning or understanding why these ML algorithms are progressing and working well. Again, that is why I asked whether billions of params are the only solution: the reality today is that the most successful models (relatively speaking), like Google’s, are trained with billions of params, and I’m simply trying to benchmark my ideas against this. Maybe we really do need these params and their optimization algorithms (e.g. gradient descent and family), and the problem is not their requirement of massive resources; it may be that our current computer architecture is just not designed to run these massive DL models as first-class programs.

With this, I’m actually trying to introspect, from first principles, on how the SP algorithm works. I’m not talking about the code; I’m talking about what it is fundamentally doing, and hopefully I can compare it well with current DL algorithms such as gradient descent. For now, I think a voting algorithm is inherent/fundamental in HTM, and also important to understand for a TBT implementation, IMO.

1 Like

In terms of synapse-level learning, the brain probably won’t offer anything superior, because it’s more limited by physics. Where it would outshine ML would probably be in larger-scale things, like complex systems of circuits, etc. It might be best to still use gradient descent rather than the brain’s learning. It seems like synapses generally learn by Hebbian learning, basically.
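For reference, the Hebbian idea (“cells that fire together wire together”) can be sketched in a few lines; this uses Oja’s variant so the weights stay bounded, and it is purely illustrative rather than a claim about any specific circuit.

```python
# Oja's rule on correlated 2-D inputs: a Hebbian term plus a decay term.
import numpy as np

rng = np.random.default_rng(0)

# Correlated inputs: most of the variance lies along the [1, 1] direction.
C = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=C, size=5_000)

w = rng.normal(size=2)
eta = 0.01
for x in X:
    y = w @ x                          # post-synaptic activity
    w += eta * y * (x - y * w)         # Hebbian growth with Oja's normalisation

# The weight vector converges toward the principal direction of the input.
print("learned weights (normalised):", w / np.linalg.norm(w))
```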

1 Like

Meta-learning is learning the process of learning.
You said,

Therefore, you imply that the process of learning itself… is a “function approximator”. That’s what I was referring to when I said you’re misunderstanding the concept of meta-learning.

Yes. The same goes for the brain, as you might have seen in this recent news which made headlines: Neurons in a dish learn to play Pong — what’s next?

They use 800,000 biological neurons; you can wire up a shallow neural network of fewer than 100k neurons on a Raspberry Pi and still outperform the biological counterpart.

DL systems have no inherent link to parameter count; they are linked to the complexity of the function they’re approximating. Right now, the human brain is extremely big - so large that we can’t even come close to training a NN of that size - so scale can’t really be ruled out until it’s verified apples-to-apples.

I’ve said it before, when @Casey brought it up: you simply can’t compare the brain to backpropagation. So any estimates of “efficiency” are subjective, which is great if you like opining on subjects, but I don’t see the point in chit-chatting about something which no one has any idea about.

Probably because there is more than one way to model a complex environment. I suppose you’ve interacted with ChatGPT and its counterparts; it’s interesting how the same emergent phenomena from neuroscience align so particularly well with DL models.

1 Like

Not any kind of learning, only various architectures of backprop? It can’t get away from backprop because the training base contains nothing else?

2 Likes

FWIW my working hypothesis is:

  1. Current ML is not AI; it’s mostly just a fancy search engine. It searches a multidimensional space for matches on text/images. It takes a lot of parameters to define interesting regions in that space. Practical ‘AI’ adds an output layer of engineered code to do something useful with the search results.

  2. The brain is quite different. It feeds sensory inputs in parallel into columns (of neurons) executing computational algorithms. The connections from one column to another are a data representation, possibly an SDR. The final output is always a motor action, based on an input and a goal. Short-term storage is synaptic; longer-term storage is something different, maybe molecular (DNA?), convertible back into the SDR?

IOW there are no lines of similarity, and trying to draw them will fail.

2 Likes

Yeah, you failed, so must everyone else.

2 Likes

Not really - I suppose every researcher’s wet dream is to approximate human learning by tracking a baby’s every move as it grows up. You can learn through any source, and that source agent can be a backpropped one, a human, or an alien. Right now, RL seems to be a much easier medium for observing humans in their element, learning a new environment, than tracking babies :wink:

Just curious, what is your take on Jean-Remi King’s work? (Brains and algorithms partially converge in natural language processing | Communications Biology), (https://arxiv.org/pdf/2206.01685.pdf) and (Deep language algorithms predict semantic comprehension from brain activity | Scientific Reports), which have impressive stats considering it’s a tiny GPT-2.

So you think that the brain computes algorithms, but Neural Networks don’t? Why then do they converge to elegant, reverse-engineerable algorithms and circuits? (https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking) and (https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)

Seeing how our understanding of the brain is so vague, I can’t help but pin your working hypotheses down as weak opinions with little to no substance behind them.

I’d especially love a few citations on the

Because that’s the farthest abstraction of what NNs do

2 Likes

The brain is still the fanciest search engine. It doesn’t need thousands or millions of samples for every new… thing it learns in order to figure out where to place it on the “world map”, such that it is later able to reliably retrieve and model it.

1 Like

Definitely not mine, I think human learning is grossly deficient. I meta-learn my algo.

I guess it’s possible in principle, but it will take forever in practice. Backprop is notoriously incremental, so it’s kind of like evolution taking billions of years to achieve human-level GI. OK, backprop is a lot faster than evolution, but you are talking about it learning some existing learner. That may work for a human, because humans already work, but not for an actually novel learning method.

As for hacking humans, it’s far easier to backprop neuron-level brain activity than human behaviour. In science, DL is progressing bottom-up: first molecular dynamics, then protein folding, now protein complexes, and in the future organelles, cells, organoids, fruit flies, etc.

1 Like

These are fair questions. Thank you for asking them.

The purpose of my post was not to present a theory; it’s barely even a hypothesis. But it does focus on some rather severe shortcomings of current offerings.

Re your quotes: my hypothesis is that GPT-2 is just a crazy-clever search engine, with a bit of gloss on top. That everything GPT-2 outputs is a copy of something written by some human somewhere in the millions of training documents, with minimal reformatting and reworking achieved by hand-written, software-engineered code. These references do not support any other view.

Re the brain: no, my key point is that the brain processes sensory input, not training documents, and generates motor output based on a language ‘engine’ and retained memory. The brain can easily create entirely original but perfectly formed, logical, and comprehensible text; GPT can only output what it has been trained on, with minor tweaks, and it routinely produces totally stupid output. I can give you examples of that if you care, but you’ll find them easily enough.

So that part of my hypothesis is consistent with available data. Can you say otherwise? Can you falsify this claim?

1 Like