Why you should forget about hardware and focus on the code

How much compute do you really need above what is already being developed and technically viable to implement? That is, by the time you have some code ready to run, will you really have any issue scaling it, or will your code be hugely inefficient and still need far more hardware?

This is also part of the increasing hardware overhang conundrum for AGI development as a whole.

On a simplistic view, as humans we might listen to words at a rate of, say, 1 word per second for 6 hours per day (some more, some less), which is 21,600 words per day. Over a year that's 7.9 million words, and by the time we learn to speak we have heard well under 30 million words. Not 30 million words repeated 500 times, but heard only once, and when we grow up we can barely remember 1,000 of them (in a sentence we heard).

If we encode each word as a 32-bit number, we have an input data set that is well under 1GB, even after adding punctuation and temporal relativity.
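A back-of-the-envelope sketch of that estimate, using only the assumptions above (1 word/second, 6 hours/day, each word heard once, 4 bytes per word; the 3-year cutoff is my assumption):

```python
# Rough estimate of the language input a child receives before speaking.
WORDS_PER_SECOND = 1
HOURS_PER_DAY = 6
YEARS_TO_SPEECH = 3            # assumed age by which a child speaks

words_per_day = WORDS_PER_SECOND * HOURS_PER_DAY * 3600   # 21,600
words_per_year = words_per_day * 365                       # ~7.9 million
words_total = words_per_year * YEARS_TO_SPEECH             # ~23.7 million

bytes_total = words_total * 4   # one 32-bit (4-byte) token per word
print(f"{words_per_day:,} words/day, {words_per_year:,} words/year")
print(f"~{words_total / 1e6:.1f}M words by age {YEARS_TO_SPEECH}, "
      f"~{bytes_total / 1e6:.0f} MB at 4 bytes/word")      # well under 1 GB
```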

For those who will point to seeing and vision, let's assume we are talking about someone who is blind. When we sleep, different things happen and those words are replayed in a manner of speaking, but we don't receive any new inputs. REM sleep also only lasts a relatively short time and is proportionately a small fraction of the input duration. We don't sleep 100x longer than we are awake, but I'm happy to work with models that effectively do.

Take NVIDIA's new Hopper architecture, particularly the top-end GH100, and consider what it really means.

The GH100 has 80GB of memory per board, arranged as 12 x 512-bit memory channels with 1.9TB/sec of local bandwidth, or the ability to scan the full memory at 24Hz. That 80GB can also be added to a shared memory cluster via 18 NVLink links totalling 900GB/sec per board, so a 256-GPU cluster can have an aggregate (quoted) bandwidth of 57TB/sec (after the 1:2:2 switching reduction).

If you take those 256 boards, each scanning its full memory at 24Hz, that's an aggregate of roughly 491TB/sec across a 20.5TB memory space. Looking at the shared memory pool at 57TB/sec, the scan rate is just under 3Hz (but this is moving the whole data set machine-to-machine). The main interest here should be the implied very low message latency, not the raw bandwidth, in the context of dealing with the whole memory at 10-30Hz (assuming vast inefficiency, i.e. that you need to process all of the memory all of the time).

The off-board speed is slower than on-board, but the ability to have a shared memory pool of 20.5TB becomes very interesting and far bigger than the majority of models in existence. Not forgetting that the GH100 has 18,432 FP32 cores, so that 256-GPU cluster has 4,718,592 cores available. Still smaller than the 34+ million cores in the exascale compute arena, but still.
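For anyone who wants to check the arithmetic, the figures above fall straight out of the quoted per-board specs (taken as given; GB/TB used loosely as decimal units):

```python
# Reproduce the cluster-level figures from the quoted per-board specs.
BOARD_MEMORY_GB = 80
LOCAL_BW_TB_S = 1.9            # on-board memory bandwidth, TB/s
CLUSTER_BOARDS = 256
NVLINK_AGG_TB_S = 57           # quoted shared-memory bandwidth after 1:2:2 reduction
FP32_CORES_PER_BOARD = 18_432

local_scan_hz = LOCAL_BW_TB_S * 1000 / BOARD_MEMORY_GB     # ~24 Hz per board
pool_tb = CLUSTER_BOARDS * BOARD_MEMORY_GB / 1000          # ~20.5 TB
aggregate_tb_s = CLUSTER_BOARDS * LOCAL_BW_TB_S            # ~490 TB/s
shared_scan_hz = NVLINK_AGG_TB_S / pool_tb                 # just under 3 Hz
total_cores = CLUSTER_BOARDS * FP32_CORES_PER_BOARD        # 4,718,592

print(f"per-board scan rate:     {local_scan_hz:.0f} Hz")
print(f"shared pool:             {pool_tb:.1f} TB")
print(f"aggregate local BW:      {aggregate_tb_s:.0f} TB/s")
print(f"shared-memory scan rate: {shared_scan_hz:.2f} Hz")
print(f"total FP32 cores:        {total_cores:,}")
```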

Granted, those 4.7 million cores will initially cost more than the majority of us will make in our lifetimes; however, compute gets cheaper very quickly...

How much compute do you really need for AGI? How much compute do you need for your code to run a fraction of AGI? Code...

2 Likes

It should be that way in scientific terms, but in the real world, resources count.

I agree with the general idea here, except that there has to be some way for the "mind" to:

  • integrate actual experience(s) with "words" (admittedly quite a vague requirement)
  • generate its own inner "words" out of experience(s).

What would that mean?
First, why I put quotes on "words": because the term plays a misdirection trick on us, narrowing our attention towards spoken/written words and language.

Slightly more adequate terms (at least for programmers) are "pointers", "identifiers", "handles".

And these "identifiers" are not only linguistic in nature; they are literally any recognizable thing, with or without an associated word for it. A neighbor's face you recognize without knowing their name, a familiar smell, or a melody you like but don't recall where you heard before. The dreaded qualia are simply that: recognizable things, pointers, identifiers, handles. Exactly like words.

Handles/pointers to/towards what? we should ask. Generally, everything (aka the world) is made of things and only things. We cannot conceive/imagine a no-thing. So any identifier/thing is the handle which, when "pulled", recalls one or more slightly larger experiencing contexts, each of which is a relatively small grouping of other identifiers/words/things.

OK, all the above resembles a graph of knowledge, which all of us are already using to describe what we know about anything we have words or descriptions for.
There has to be a catch as to why, knowing all the above, we haven't been able to engineer an "artificial mind". The old-school symbolic AI failed to create a mind by simply describing "things" and "connections" within knowledge graphs.
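As a toy illustration only (the names below are made up, not a proposed design), the handle/context idea above is easy to sketch as exactly such a graph:

```python
# Toy sketch: identifiers ("handles") that, when pulled, recall the small
# experiencing contexts (groupings of other identifiers) they belong to.
from collections import defaultdict

contexts = defaultdict(set)    # identifier -> other identifiers it co-occurred with

def experience(*identifiers):
    """Record one experiencing context; every member becomes a handle onto the rest."""
    for a in identifiers:
        contexts[a].update(i for i in identifiers if i != a)

def pull(handle):
    """Recall what a handle points to."""
    return contexts.get(handle, set())

experience("neighbor_face", "front_yard", "dog_barking")
experience("melody_x", "summer", "front_yard")
print(pull("front_yard"))   # {'neighbor_face', 'dog_barking', 'melody_x', 'summer'}
```

Which is of course just the kind of structure symbolic AI could already build by hand, which reinforces the point that the missing ingredient lies elsewhere.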

There is something we are missing.
One important process we haven't replicated is the one by which any-and-every-thing pops into existence.

One very important part of what we humans consider "learning" is exactly this: (the means by which we) generate new identifiable things.

It could be this?

1 Like

Yes, very much so. I use the word 'symbol' to refer to the same idea: an identifiable something that might represent a thing, a sound, an image, a property, a feature, a place, etc. I'm sure animal intelligence does it too, but we are unique in associating many of them with words, which turns out to be a really useful thing.

Given the vast address space of an SDR as a unique identifier, I speculate a close relationship between these symbols and SDRs.
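To give a sense of that address space, here is a quick calculation using typical HTM-style SDR parameters (2048 bits with 40 active; those two numbers are my assumption, not from the post above):

```python
# Count the distinct SDRs for an assumed size of n = 2048 bits with w = 40 active.
from math import comb, log10

n, w = 2048, 40
unique_sdrs = comb(n, w)                              # n choose w
print(f"C({n}, {w}) ~ 10^{log10(unique_sdrs):.0f}")   # roughly 10^84
```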

3 Likes

The point I was trying to make is that the volume of sensory information (whatever form that may take; no particular model methodology in mind) needed to achieve AGI should be quite small compared to a lot of the thinking and assumptions being made.

Underneath all models there is a very high degree of computational necessity, purely by virtue of the minimum volume of input needed to achieve a sufficient level of context and the inherent degree of cross-association that involves.

With systems being developed with hundreds of millions of cores (a CS-2 cluster), these systems will have far more cores than there are mini-columns in the brain. Those cores also run at a clock speed several orders of magnitude faster than the human brain...

The main message I'm trying to get across is that it can be easy to be distracted from the real problem by disappearing into the rabbit hole of hardware (from my own experience of disappearing into that hole...). It's really all about the model/code...

2 Likes

My opinion on this headline is that algorithmic improvements will beat incremental improvements.

It's better to find fundamentally new ways of solving tasks than to chase optimizations. To do this you should have an understanding of computational complexity, i.e. big-O notation.
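A generic illustration of that point (a made-up task, nothing to do with the ray-tracing example below): the same problem drops from O(n^2) to O(n) by changing the algorithm rather than the hardware.

```python
# Same task, two algorithms: does any pair in `values` sum to `target`?
import random
import time

def has_pair_quadratic(values, target):
    # O(n^2): compare every pair.
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] + values[j] == target:
                return True
    return False

def has_pair_linear(values, target):
    # O(n): single pass, remembering the complements seen so far.
    seen = set()
    for v in values:
        if target - v in seen:
            return True
        seen.add(v)
    return False

values = [random.randrange(10**9) for _ in range(5_000)]
target = -1   # unreachable, so both functions hit their worst case

for fn in (has_pair_quadratic, has_pair_linear):
    t0 = time.perf_counter()
    fn(values, target)
    print(f"{fn.__name__}: {time.perf_counter() - t0:.4f}s")
```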

For example, look at the recent advent of real-time ray tracing. Part of the solution is to use an NN to scale up the resolution of the video feed, which is an algorithmic improvement. Another part of the solution is to use specialized hardware for tracing the rays, which may seem like an incremental improvement but has large benefits because traditional computer hardware is ill-suited for that task.

2 Likes

True, but I think we need to redefine a bit what we should look for; I mean the targets/benchmarks used in ML. It would be nice to have a more nuanced definition of intelligence than current statistically-oriented benchmarks provide. A four-year-old not knowing what the digit "5" means would fail an MNIST benchmark miserably. A dog would not achieve a significant "score" in its lifetime.

What is the right tool/benchmark for estimating "intelligence" that gives meaningful scores for humans, animals and machines alike?

2 Likes