How much compute do you really need above what is already being developed and technically viable to implement? In other words, by the time you have some code ready to run, will you really have any issue scaling it, or will your code be hugely inefficient and still need far more hardware?
This is also part of the increasing hardware overhang conundrum for AGI development as a whole.
On a simplistic view, we humans might listen to words at a rate of, say, 1 word per second for 6 hours per day (some more, some less), which is 21,600 words per day. Over a year that’s 7.9 million words, and by the time we learn to speak we have heard far fewer than 30 million words. Not 30 million words repeated 500 times, but each heard only once, and when we grow up we can barely remember 1,000 of them verbatim (from a sentence we heard).
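The exposure figures above can be checked with back-of-envelope arithmetic. A minimal sketch, assuming the rates stated in the text (1 word per second, 6 waking hours per day) and roughly three years before speech emerges:

```python
# Back-of-envelope check of the word-exposure budget.
# All rates are assumptions taken from the text: 1 word/second, 6 hours/day.
words_per_day = 1 * 6 * 60 * 60        # 21,600 words per day
words_per_year = words_per_day * 365   # 7,884,000 ≈ 7.9 million words/year
words_by_age_3 = words_per_year * 3    # ~3 years, roughly when speech emerges

print(words_per_day, words_per_year, words_by_age_3)
```

Even three full years of exposure lands at about 23.7 million words, comfortably under the 30 million figure.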
If we encode each word as a 32-bit number, we have an input data set well under 1GB, even after adding punctuation and temporal relativity.
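To make that size claim concrete, here is a sketch of the encoding arithmetic. The 30-million-word budget comes from the text; the 2x allowance for punctuation and timing metadata is an assumption added for illustration:

```python
# Size of a lifetime word stream if each word is one 32-bit (4-byte) token.
WORDS = 30_000_000            # lifetime exposure budget from the text
bytes_words = WORDS * 4       # 120 MB of raw tokens
bytes_total = bytes_words * 2 # assumed 2x for punctuation/timing metadata

print(bytes_total / (1 << 30))  # fraction of one GiB
```

Even after doubling for metadata, the whole stream is around 240MB, roughly a quarter of a gigabyte.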
For those who will raise seeing and vision, let’s assume we are talking about someone who is blind. When we sleep, different things happen and those words are replayed in some manner, but we don’t receive any new inputs. REM sleep also lasts a relatively short duration and is proportionately a small fraction of the waking input duration. We don’t sleep 100x longer than we are awake, but I’m happy to work with models that effectively do.
Take the new Hopper architecture out of NVIDIA, particularly the top-end GH100, and consider what it really means.
The GH100 has 80GB of memory per board, arranged as 12 x 512-bit memory channels with 1.9TB/sec of local bandwidth, or the ability to scan the full memory at 24Hz. That 80GB can also be added to a shared clustered memory pool via 18 NVLink links totalling 900GB/sec per board, so a 256-GPU cluster has an aggregate (quoted) bandwidth of 57TB/sec (with a 1:2:2 switching reduction).
If you take those 256 boards, each scanning its full memory at 24Hz, that’s an aggregate of 491TB/sec across a 20.5TB memory space. Considering the shared memory pool at 57TB/sec, the scan rate is just under 3Hz (but this is sending the whole data machine-to-machine). The main interest here should be the implied very low message latency, not the raw bandwidth capability, in the context of dealing with the whole memory at 10–30Hz (assuming vast inefficiency, i.e. that you need to process all of the memory all of the time).
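The scan-rate figures above can be reproduced directly from the quoted specs. A sketch, using the rounded numbers from the text (1.9TB/sec local bandwidth, 57TB/sec quoted NVLink aggregate):

```python
# Scan-rate arithmetic for a 256-board GH100 cluster, using the
# rounded figures quoted in the text.
BOARDS = 256
HBM_GB = 80              # per-board memory
LOCAL_BW_GBS = 1900      # ~1.9 TB/s local HBM bandwidth per board
NVLINK_POOL_TBS = 57     # quoted aggregate NVLink bandwidth

local_scan_hz = LOCAL_BW_GBS / HBM_GB         # ~24 full-memory scans/sec
pool_tb = BOARDS * HBM_GB / 1000              # 20.48 TB shared pool
aggregate_tbs = BOARDS * LOCAL_BW_GBS / 1000  # ~486 TB/s with these rounded inputs
pool_scan_hz = NVLINK_POOL_TBS / pool_tb      # just under 3 Hz over NVLink
```

With these rounded inputs the aggregate comes out near 486TB/sec rather than the 491TB/sec quoted, which suggests the original figure used a slightly higher per-board bandwidth; the conclusions are unchanged.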
The off-board speed is slower than on-board, but the ability to have a shared memory pool of 20.5TB becomes very interesting, and far bigger than the majority of models in existence. Not forgetting that each GH100 has 18,432 FP32 cores, so that 256-GPU cluster has 4,718,592 cores available. Still smaller than the 34+ million cores in the exascale compute arena, but still.
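The core-count aggregate above is simple multiplication; a quick check, with the "34+ million" exascale figure taken from the text for comparison:

```python
# Cluster core-count arithmetic from the figures in the text.
FP32_CORES_PER_GPU = 18_432
GPUS = 256
cluster_cores = FP32_CORES_PER_GPU * GPUS  # 4,718,592 FP32 cores
EXASCALE_CORES = 34_000_000                # "34+ million" figure from the text

print(cluster_cores, cluster_cores / EXASCALE_CORES)
```

So the 256-GPU cluster sits at roughly a seventh of the exascale core count.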
Granted, those 4.7 million cores will initially cost more than the majority of us will make in our lifetimes; however, compute gets cheaper very quickly…
How much compute do you really need for AGI? How much compute do you need for your code to run a fraction of AGI? Code…