GPT3 or AGI?

Why is no one here talking about GPT-3? It’s not an AGI, but it definitely has some deep understanding of the world in some way. If we could train that monkey with loads of text data (and maybe visual data in future), it could easily be the first true AGI model. Does anyone have some thoughts?

1 Like

I guess the next fairly dramatic step forward will be to integrate external memory banks into artificial neural networks. That may require a round trip across the neural symbolic barrier.
Anyway, it was always a very weird subject area: heavily repetitious, persistent research locked into limited ways of doing things.
That is even worse now with massive egos, salaries, pride, social position creating invested positions.

4 Likes

GPT-3 (Generative Pre-trained Transformer 3) is an insanely massive neural network with 175 BILLION parameters. It was trained on about 570 GB of data from the Common Crawl dataset and, I think, other internet data. It cost something like $4-12 million to train… (Note, past tense… It has already been trained with loads of text data).

The results and applications have been very cool so far, and seem to be a step towards non-sentient AGI, BUT it has the same problems as other A.I. approaches. It actually DOESN’T understand anything that it’s doing. An example of this is that GPT-3 will continue generating text after answering a question, because it has no idea whether it has answered or not. GPT-3 was tested fairly rigorously to see if it could succeed at analogy making, which researchers Melanie Mitchell and Douglas Hofstadter, among others, believe is a very important marker of “understanding” and “intelligence”. The result is that GPT-3 was unable to do one-shot learning when it came to analogies (which humans frequently succeed at), and when it was trained with further examples, it failed to generalize what it “learned” to other examples. https://medium.com/@melaniemitchell.me/can-gpt-3-make-analogies-16436605c446

Finally, it seems GPT-3 is heavily biased, or one might even say racist, sexist and antisemitic, just like other AI models before it, because it doesn’t understand the data that it is being trained on. It only recognizes patterns in that data and generates data that follows the pattern. Unfortunately, there is a lot of bias in the data we feed it, and so we get out what we put in.

Basically, this makes GPT-3 unpredictable and thus not suited to mission critical applications where mistakes can’t be tolerated. GPT-3 is a pretty good virtual assistant for now. I think it’s the most “general” neural network created so far. Maybe it can get you started on a paper, generate the first part of your web page for you, and do other awesome interesting things, but it really does not understand what it’s doing just like every other current AI model, and should definitely be used in conjunction with humans.

It seems we still need a new approach to get to true machine intelligence, because I don’t think that GPT-4,5,6… (I.E. scaled up versions of GPT-3), are going to cut it.

7 Likes

My take is that it does somehow understand everything that it is doing, except that, unlike us, its worldview is built on streams of “tokens” representing words instead of a stream of phenomenological experience and motor actions. No human brain would do any better with a similar handicap.

I would be curious what its predictions would look like if it were trained with a continuous webcam movie passed through a few “translators” of visual and other sensory experience to text… like CNN classifiers describing what is seen, overlapped with voice recognition and general sound classifiers (outputting text of what’s “heard”, e.g. bird songs, broken glass, honks, street noise, etc.) and even inertial sensor data that outputs camera movement in text form.

Another guess I have is that a significant improvement might come not from an even larger model but from taking all the API feedback from people using it and “replaying” it as a form of semi-supervised tuning in order to improve its … common sense.

1 Like

That’s an interesting point… I agree with that, but can any traditional style of neural network really be said to ‘understand’ anything without the self-prediction feedback loop (the same as brains use)? Does GPT use anything like this? (I don’t consider cost-gradient backpropagation to fit the bill.)

1 Like

I’m not sure I get what you mean by self-prediction feedback loop, but from what I understand about transformers, what they do is generate a prediction sequence continuing an input sequence. I see no reason why we can’t make it loop again on its own prediction, and have that paired against (or in parallel with) “real world feedback” to keep improving that loop. Even if that would mean training/tuning it at night with the stream of its daytime inner chatter.
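
To make the “loop again on its own prediction” idea concrete, here is a minimal Python sketch. It assumes a toy bigram counter as a stand-in for the model (nothing like a real transformer), and the names `train_bigram`, `generate` and `stop_token` are made up for illustration. The point is only the loop: each predicted token is appended to the input and fed back in, and generation stops only if an explicit stop token shows up or the length budget runs out.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count which token follows which (a toy stand-in for a trained model)."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows


def generate(follows, prompt, max_new_tokens=10, stop_token=None):
    """Repeatedly predict the next token and feed it back in as input."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        candidates = follows.get(sequence[-1])
        if not candidates:
            break  # nothing ever followed this token in the training data
        next_token = candidates.most_common(1)[0][0]
        sequence.append(next_token)  # the prediction becomes new input
        if next_token == stop_token:
            break
    return sequence


corpus = "the cat sat on the mat . the cat ate . <end>".split()
model = train_bigram(corpus)
print(generate(model, ["the"], max_new_tokens=8, stop_token="<end>"))
```

On this tiny corpus the loop never reaches `<end>` on its own and just keeps cycling until `max_new_tokens` is used up, which is a small-scale version of a model carrying on after it has already answered.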

And I also believe we don’t actually “understand” anything except in relation with other self-generated “tokens” we build upon past self-generated tokens.
One of the biggest advantages we have is that our inner model is checked/retrained against our actual, real-time sensory input paired with our motor responses, and against real-time social interactions that give us common-sense feedback.
Of course not the only one, but a really big one.

Any system that works strictly on predictive principles will not “understand” what it is doing. Autocomplete may know what is a likely sequence based on similar sequences in the past but that is not based on intent.

Sequences of sequences (trees) are just more of the same thing.

Autocorrect can’t write a complete original paper (such as what a grad student does) or a novel. I don’t discount music or poetry writing software, but all of them must be primed with constraints, and what is produced tends to be “meaningless” in terms of content: no intent or ideas to convey. What “meaning” is there is what was in the original training stream.

We can’t have experienced every situation in the past. We may have learned smaller sequences and it is likely that much of our behavior consists of stereotyped actions - the subcortical structures call these behaviors up to fill very basic animal drives.

The Markov sequence (or whatever prediction system used) does not understand - memory and goal selection based on elaboration of needs adds intention.

Memory (programmed or learned) contains goal objects, and the subcortical structures build “bridges” from “here” to the goal state.

“Understanding” has to be forming connections to internal goal objects and access methods to those goal states - it has to be relevant to be relatable. To understand something you have to add the perceptions into your store of methods - since humans share much the same experiences, needs, and methods to achieve the goal objects, we build much the same database of perceptions. This shared database is our common frame of reference - the things we don’t have to add to our communications as it is the assumed starting point that does not have to be stated. An agent that did not have this common reference frame would be very alien to us.

An alligator can chase a prey object over a submerged car and understand the shape without understanding a car. A human diver fleeing the alligator can understand the car shape and understand that it might afford shelter from being eaten. The added car facts and sequences give a wider selection of behaviors and affordances.

2 Likes

Failing to understand that ReLU is a switch, and how it interacts with dot products, is very expensive, it seems!
I was just looking at a paper that mentions ReLU neural networks can efficiently implement multiplication. If you understand ReLU as a switch and the binary multiplication algorithm as a switched system of additions, you can definitely say that is true.
Paper:
https://arxiv.org/abs/2008.02545v1
Binary Multiplier algorithm:
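
The textbook shift-and-add method shows what “a switched system of additions” means. This is a minimal Python sketch of that classical algorithm only, not the construction from the paper, and the function name is hypothetical: each bit of one operand acts as a switch deciding whether a shifted copy of the other operand gets added.

```python
def shift_and_add_multiply(a, b):
    """Multiply two non-negative integers using only shifts, bit tests and additions."""
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative integers")
    result = 0
    shift = 0
    while b >> shift:              # stop once no higher bits of b remain
        if (b >> shift) & 1:       # the "switch": is this bit of b on?
            result += a << shift   # if so, add a shifted copy of a
        shift += 1
    return result


print(shift_and_add_multiply(13, 11))
```

For 13 × 11, the bits of 11 (binary 1011) switch on the additions 13 + 26 + 104, and the shifted copy for the zero bit is simply skipped.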

2 Likes

The question is, how is HTM different? I feel that it may be, but cannot explain why.

1 Like

HTM, by itself, is not different.
It predicts and signals when the sensory input does not match its predictions.
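
The predict-then-compare idea can be sketched in a few lines of Python. To be clear, this is not HTM (no SDRs, no minicolumns, no temporal memory algorithm), just a hypothetical first-order transition table, with made-up names, that flags any input it did not predict from the previous one.

```python
from collections import defaultdict


class TransitionMemory:
    """Toy first-order sequence memory: remembers which inputs followed which."""

    def __init__(self):
        self.seen_next = defaultdict(set)
        self.previous = None

    def step(self, current):
        """Return True if `current` was not predicted from the previous input, then learn it."""
        if self.previous is None:
            predicted = set()  # nothing to predict from on the first input
        else:
            predicted = self.seen_next[self.previous]
        surprised = current not in predicted
        if self.previous is not None:
            self.seen_next[self.previous].add(current)
        self.previous = current
        return surprised


memory = TransitionMemory()
stream = list("ABCABCABXABC")
flags = [memory.step(symbol) for symbol in stream]
for symbol, surprised in zip(stream, flags):
    print(symbol, "mismatch" if surprised else "predicted")
```

Everything is a mismatch on the first pass through A, B, C, the repeats are predicted, and the unexpected X (and the A that follows it) is flagged again. Pure memorization like this never gets beyond the transitions it has already seen.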

Pure memorization will only elevate to a certain level of the semantic hierarchy.
I think you can get quite far though with extra complications like HTM, context trees or If-Except-If trees.

Then you have the jump to pure artificial neural networks, where you try to get the network to do everything. Forcing it to do everything from low level spelling predictions to high level semantics.

In between there is the largely unexplored area of deep neural networks (trained by evolution) feeding memorization algorithms. And then maybe post processing the symbolic results with a further neural network.
The main reason to do that is it allows the network workable read write access to the vast memory resources of current digital computers at high speed, which can reach to 10s of gigabytes even on a home PC (should you be doing reasonably well in life.)

A 5 TB hard drive is 90 bucks at Costco at the moment. I expect to see solid-state drives following that price sometime “soon.”

1 Like

The hardware is 30 years ahead of the software to run on it.
A few software breakthroughs here and there, in how to get the most out of clock cycles and mass memory devices, and AGI will come knocking on the door.
For example the few attempts to give artificial neural networks external memory banks have been crude and few and far between. The main question is exactly what type of addressing mechanism would be successful.
I did try with a type of associative memory which actually has a reasonable addressing scheme, but it has to look at too many large vectors in memory, to the point that it is, say, 100 times slower than the network it is serving.
There are Neural Turing Machines but the addressing mechanism is terrible.

Probably a better answer is incremental variable length addressing.
Where the network provides only a few address bits at first and more later if necessary. Short addresses for common items to read and store, longer addresses for more specific things. And some fallback mechanism when an address provided for a read isn’t an exact match for anything.
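
One way to picture this, as a rough Python sketch under some loud assumptions: addresses are bit strings of any length, the store is a plain dictionary, and the fallback for an inexact read is the longest stored prefix of the requested address. `PrefixMemory` and its methods are hypothetical names, and nothing here says how a network would learn to emit the bits.

```python
class PrefixMemory:
    """Toy memory addressed by bit strings of any length."""

    def __init__(self):
        self.cells = {}

    def write(self, address, value):
        """Store `value` under a bit-string address such as '0' or '110101'."""
        self.cells[address] = value

    def read(self, address):
        """Return (matched_address, value) for the longest stored prefix of `address`.

        An exact match is just the longest possible prefix; if no prefix
        is stored at all, return (None, None).
        """
        for length in range(len(address), 0, -1):
            prefix = address[:length]
            if prefix in self.cells:
                return prefix, self.cells[prefix]
        return None, None


memory = PrefixMemory()
memory.write("0", "the")            # short address for a common item
memory.write("10", "cat")
memory.write("110101", "aardvark")  # longer address for a specific item

print(memory.read("10"))        # exact match
print(memory.read("1101011"))   # inexact read falls back to the longest stored prefix
print(memory.read("111"))       # nothing stored along this path
```

Longest-prefix fallback is only one possible choice; a nearest-address or similarity-based fallback would fit the same description.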

It kind of reminds me of Huffman Coding, for some reason.
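
For comparison, a minimal Huffman sketch in Python (the word frequencies are invented for illustration, and the function name is made up): frequent symbols end up with short codes and rare ones with long codes, which is the same trade-off as short addresses for common items.

```python
import heapq


def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a Huffman code over `frequencies`."""
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {symbol: 1 for symbol in frequencies}
    # Heap entries: (total weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(weight, index, {symbol: 0})
            for index, (symbol, weight) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them gets one bit deeper.
        weight1, _, depths1 = heapq.heappop(heap)
        weight2, _, depths2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (weight1 + weight2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]


lengths = huffman_code_lengths({"the": 50, "cat": 20, "dax": 5, "aardvark": 1})
print(lengths)
```

Here “the” gets a 1-bit code, “cat” 2 bits, and the two rare words 3 bits each.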

Didn’t it also take a whole week to train? Isn’t data increasing at a colossal rate too? We currently have about 33 ZB of data, which is expected to increase to 175 ZB in 5 years’ time. What happens to other versions being built in later years? How much more data will they have to be trained on to know enough? What does this mean for explainability? Adding sparsity was shown to improve the adversarial attack situation; what will sparsity do to the space complexity? I suppose then they will just be trained against every example?
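
Taking those two figures at face value, the implied compound growth rate is easy to check. A small Python sketch (the function name is made up for illustration):

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


rate = annual_growth_rate(33, 175, 5)  # zettabytes now vs. in five years
print(f"{rate:.1%} per year")
```

That works out to just under 40% growth per year, a bit more than a fivefold increase over the five years.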

Explainable? You mean - a lot of people said things this way? And they had these key words as context?

Somehow I don’t see that as an explanation.

Hi, there. From what I think I know, there are several techniques used in ANNs to account for outcomes, which often involve extracting commonly occurring features thought to be important contributors to some classification. (I honestly also seriously object to that being regarded as an explanation, but it has become the accepted norm. I see it as more of a reason fitted to the data we observed.) I’m not sure whether this technique has been tried on GPT-3, but I imagine that it will also have to undergo some sort of regulation, which I imagine would involve some sort of accountability, no matter how cool and sophisticated it appears. I’d really appreciate it if anyone could share what they know about the scalability of current XAI techniques to GPT-3 and onward, given that this strategy sometimes stumbles even on smaller models than this. Or does the training help?
I saw in glancing that an extra area of parameterization is involved. I’ll only really get the chance to look at GPT3 properly perhaps in the next couple of weeks. Apparently the algorithm is quite small, it’s just trained on everything and utilises a much larger network which has an additional set of three parameterizations. I was so bummed.

1 Like

Should we find a way to explicitly represent memory?

1 Like

And now this:
(not quite GPT3 but transformer derived agent)

After a single introduction to a novel object via continuous visual perception and a language prompt (“This is a dax”), the agent can re-identify the object and manipulate it as instructed (“Put the dax on the bed”). In doing so, it seamlessly integrates short-term, within-episode knowledge of the appropriate referent for the word “dax” with long-term lexical and motor knowledge acquired across episodes (i.e. “bed” and “putting”). We find that, under certain training conditions and with a particular memory writing mechanism, the agent’s one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects.

One question is the way artificial neural networks learn. Statistical adjustments or deep dense function composition?
Anyone who has done engineering optimization with algorithms would say the function composition approach would be completely impossible with 1 million parameters. Statistical nudging is the only way to explain learning in current large artificial neural networks. And with Fast Transform neural networks, each transform basis vector acts as a statistical channel. There would seem to be no other way to explain learning in such systems. Very probably, statistical channels exist in and explain learning in conventional neural networks.