(Reminder yet again that I don’t think this is an essential component of the Stuart Russell argument; I think the argument is valid even for systems with complex human-like goals and motivations. But it’s still worth discussing.)
I’m thinking that there are two main categories of scenarios where we might wind up with a brain-like AGI that has one simple all-consuming goal (albeit with lots of different sub-goals that help achieve it). I don’t think that outcome is inevitable, but I do think it could happen.
The first category would be that the programmer purposely tries to give it a single simple all-consuming goal.
Why would they do that? First, it’s an easy and well-known way to measure how well your system is working, and thus do automatic hyperparameter search, publish high-impact papers at NeurIPS, beat benchmarks, get tenure and high-paying jobs, etc. Second, maybe the programmer has a certain goal in mind for the system, one that motivated them to build the system in the first place, possibly flowed down from management and investors. “Earn as much money as possible.” “Find a more efficient solar cell.” “Find a cure for Alzheimer’s.” “Hack into my adversary’s weapons systems.” You get the idea. In this case, the programmer would find it natural and intuitive to give the system exactly that goal.
How would they do that? I imagine that the programmer has a way to detect progress towards the goal, associates a reward with that progress in the virtual basal ganglia, and gradually ramps the reward up higher and higher until that goal (and its derivative sub-goals) dominates everything else.
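To make that a bit more concrete, here is a minimal toy sketch of the “ramp it up” idea, under my own assumptions: a hypothetical progress detector, a hypothetical bundle of other reward signals, and a single combined reward whose goal weight the programmer keeps cranking up. None of the names or numbers come from a real system; they’re just for illustration.

```python
# A minimal sketch of the "detect progress, reward it, ramp it up" idea. All the
# names and numbers here are my own illustrative assumptions, not a real system.

def goal_progress(world_state) -> float:
    """Hypothetical detector of progress towards the programmer's goal
    (say, 'find a more efficient solar cell')."""
    return world_state.get("solar_cell_efficiency_gain", 0.0)

def other_drives(world_state) -> float:
    """Stand-in for whatever other reward signals the brain-like AGI has
    (curiosity, social instincts, etc.)."""
    return world_state.get("curiosity_bonus", 0.0)

def reward(world_state, goal_weight: float) -> float:
    """Reward sent to the virtual basal ganglia: a weighted sum in which the
    programmer keeps cranking up goal_weight across training runs."""
    return goal_weight * goal_progress(world_state) + other_drives(world_state)

# The programmer ramps goal_weight until the goal term swamps everything else.
state = {"solar_cell_efficiency_gain": 0.01, "curiosity_bonus": 0.5}
for goal_weight in (1, 10, 100, 1000):
    r = reward(state, goal_weight)
    goal_share = goal_weight * goal_progress(state) / r
    print(f"goal_weight={goal_weight:5d}: total reward={r:8.2f}, goal term is {goal_share:5.1%} of it")
```

The point of the toy numbers is just that, as the goal weight grows, every other reward signal becomes a rounding error, so every learned sub-goal ends up in service of the one goal.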
Now, we responsible foresighted people can go tell this hypothetical programmer that whatever their goal is, they’re better off not building an AGI that single-mindedly pursues it, because the system may get out of control. Unfortunately, the programmer might not be listening to this advice, or might think it’s not true, for whatever dumb reasons you can imagine. (Just as it seems hard to keep powerful AGIs out of the hands of bad actors, it seems equally hard to keep powerful AGIs out of the hands of careless actors, or actors with confused ideas about how their systems will behave.) The problem is actually worse than mere carelessness and confusion. Up to a point, building an AGI that single-mindedly pursues Goal X probably is, in fact, the best way to pursue Goal X! It suddenly stops being the best way when the system becomes intelligent and self-aware enough to seize control of its off-switch, etc. So we would be asking programmers of early-stage systems not only to “not be careless”, but actually to slow down their own progress: making less effective systems, making less money, getting less impressive results on benchmarks, and so on. I can easily imagine a well-intentioned person trying to have their AGI design better solar cells, deciding that the real, urgent risks of climate change outweigh the speculative risk that their AGI is more capable than they realize and may get out of control. Or imagine a person at a struggling startup, desperate to avoid laying off their employees, etc.
So that’s the first category: the programmer purposely tries to give their brain-like AGI a single simple all-consuming goal, and succeeds.
The second category is when the programmer gives the AGI write access to its own neocortical connections. There’s an obvious appeal and excitement and opportunity in trying to allow an AGI to edit itself to improve its own operation. However, there has never been a human or other animal with fine-grained (deliberate and conscious) editing power over their own neocortex. We have no experience of what would happen! And I think one possibility is that the AGI would modify itself, not entirely on purpose, to single-mindedly pursue a single simple goal.
For example, whenever I read a description of gestation crates, I feel a very strong resolve not to eat pork. But then the resolve fades over time, and sooner or later I’m eating pork again…right until the next time I read about gestation crates, and then the process repeats. I imagine that, if I could edit my own neocortex and I read about gestation crates, I would not want that resolve to fade; I would edit my neocortical models so that the memory and feeling come back, at full strength, whenever I am about to eat pork, forever, without the resolve ever fading.
This is a general pattern: Sometimes I feel a strong pull towards some goal, but it doesn’t dominate my life, because at other times I have different goals that pull me in different directions. If I could edit my own brain connections, then during those times when I am feeling a pull towards a particular goal, I might go in and delete drives that conflict with that goal. This would make me pursue the goal more confidently and consistently in the future … and then maybe later on I would be even more aggressive at deleting drives that conflict with that goal, until the process spirals out of control and I wind up as a single-minded extremist.
(Or maybe I would foresee this whole process playing out, and not do any of that, but rather throw my neocortical-connection-editing machine in the garbage. Hard to say for sure what would happen!)
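To illustrate why I expect a spiral rather than a stable balance, here’s a toy simulation I made up (not a claim about real brains, and all the drive names and numbers are arbitrary): a handful of drives compete, whichever drive happens to be in control gets to slightly weaken the others, and being stronger makes a drive take control more often. Even starting from perfect symmetry, one drive tends to snowball.

```python
# Toy model (my own illustration, not neuroscience): several drives compete for
# control. Whichever drive is active in a given episode slightly weakens its
# rivals, and stronger drives get to be active more often, so any small lead
# tends to compound.
import random

random.seed(0)

drives = {
    "avoid_factory_farmed_pork": 1.0,
    "enjoy_tasty_food": 1.0,
    "social_ease": 1.0,
    "curiosity": 1.0,
}
EDIT_STRENGTH = 0.05  # fraction by which the active drive weakens each rival per episode

for episode in range(2000):
    # The drive that takes control is sampled in proportion to current strength.
    active = random.choices(list(drives), weights=list(drives.values()))[0]
    # While in control, it edits the (virtual) neocortex: rivals are weakened a bit.
    for name in drives:
        if name != active:
            drives[name] *= 1 - EDIT_STRENGTH

total = sum(drives.values())
for name, strength in sorted(drives.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {strength / total:6.1%} of total motivation")
```

The point is just the shape of the dynamic: any mechanism where being in charge right now lets a drive entrench itself is a positive feedback loop, and positive feedback loops tend to end with one winner.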
We’re used to having contradictory goals and context-dependent goals (it feels like an inevitable part of cognition), but it seems to me that those could, at least potentially, be systematically driven away in an AGI with deliberate control over its own neocortical connections.
So, as before, we can go take out advertisements and give speeches to warn all the AI programmers that it’s a bad idea to give brain-like AGIs direct write access to their own neocortical connections. But just like a few paragraphs above, those programmers might not be listening to us, or they might not believe us, or they might not care.