AGI and Wireheading

Ah, I think I see where you’re coming from (correct me if I’m wrong).

I say: “Imagine a guy whose one and only goal is to understand the brain.” What are you imagining?

I suspect that the image in your head is an obsessive—he desperately scans through neuroscience books for answers, won’t pause to eat or drink or sleep, and drops dead a couple of days later.

The image in my head is more like a cold and calculating guy with a long-term goal, strategically making moves towards the goal of understanding the brain—just as a chess master strategically makes moves towards the goal of capturing the opponent’s king. He studies neuroscience, sure, but he also perfects the art of sweet-talking venture capitalists, recruiting and retaining the best employees and collaborators, stealing secrets from private research programs—whatever will help with the goal. He’s very strategic in all respects. He eats and sleeps well … but only because staying healthy will help him achieve his goal! He is kind to employees … but only because low employee turnover will help him achieve his goal! Etc. etc.

Obviously the latter “cold and calculating” guy winds up understanding the brain much more successfully than the former “obsessive” guy.

If we’re in a scenario where a brain-like AGI has one simple all-consuming goal (as opposed to a more complex, human-like system of goals and motivations and habits), is it more likely to be like the “obsessive” story or the “cold and calculating” story? I don’t know—I imagine that either could happen, depending on details of the architecture, training procedure, data, etc. In particular, I think the “cold and calculating” story could happen, and that’s the much more dangerous one, and that’s the one I’ve been talking about, without trying to say that it’s the only possibility, or that it’s inevitable.

If you gave me a brain-editing machine, I would certainly be reluctant to use it, but maybe I would anyway, trying as best I could to be careful, for what I judge to be a sufficiently worthy cause.

It’s funny: every time I learn a new fact or think a new thought, I’m changing my brain architecture in an irreversible, uncontrolled way. If I go read a philosophy book advocating for nihilism, will I stop caring about my family? “Well, maybe it will undermine all my deeply-held values,” I say, “but it’s the book club pick of the month, gee, I guess I have to read it and hope that things turn out for the best.” So in a sense, I’m already used to the idea of doing irreversible, uncontrolled experiments on my brain, just not in quite the same way, nor with the same consequences.

I’m sure I would try to anticipate the downstream consequences of using my brain-editing machine before using it, but I might not anticipate them correctly. In fact, I don’t think there’s any way to know for sure what the consequences would be. Even if there were a way to run tests, no test can be 100% faithful.

Speaking of which, simulating an AGI in a virtual environment, before letting it act in the real world, is an excellent idea that I’m strongly in favor of, insofar as that’s possible. Unfortunately, no virtual environment is exactly the same as the real world, and I think that the differences may be relevant in important ways.

If you gave me a machine that let me deliberately modify my own brain connections, I would not crank up pleasure to infinity, because I am not a hedonist. I would probably use the machine to try to turn myself more into the kind of person I want to be, e.g. more prosocial. (Or maybe I would just be too scared to use the machine at all. :stuck_out_tongue:) So I think instead of “it will do so”, we have to say “it might do so”, right? I am an animal after all…

Also, I don’t think it’s necessarily the case that a wireheading AGI is not dangerous. I would say: it might or might not be dangerous.

The optimistic (wireheading AGIs are not dangerous) story might be: When humans are in the thick of a really intense drug high, overwhelmed with pleasure, they can’t think straight. Why not? I think the dopamine-based reward system is involved in the normal process of thinking, and a giant flood of dopamine interferes with that (or something like that). If AGIs have that property as well, then a wireheading AGI wouldn’t be highly intelligent; it wouldn’t really be thinking at all, so you could just walk over and turn it off, no problem.

The pessimistic (wireheading AGIs are dangerous) story might be: Maybe our AGIs won’t have that property. Maybe for whatever reason, either on purpose or as a side-effect of some other implementation detail, the AGIs can continue to maintain their normal cognition regardless of the background level of valence / reward. In that case, as in the paper @Paul_Lamb linked (here’s the link again), a wireheading AGI (of the “cold and calculating” variety I mentioned above) would presumably be willing and able to seize control of its off-switch etc. to prevent humans from making it stop wireheading.

Sorry if I’m misunderstanding.

Your position is “highly likely”; mine is “I have no idea how likely”. I’ll try to explain why I think that…

As I mentioned above, I’m not a hedonist, so I wouldn’t wirehead if you gave me fine-grained deliberate control over my brain connections today.

Why not? Here’s an oversimplified example.

Humans have a dopamine-based reward system which can be activated by either (1) having a family or (2) wireheading (direct electrical brain stimulation).

People who have a family would be horrified at the thought of neglecting their family in favor of wireheading.

Conversely, people who are intensely addicted to wireheading might be horrified at the thought of stopping wireheading in favor of having a family!

(Not a perfect example, but you get the idea.)

I think the order in which things happen is very important. All of our preferences, goals, and values are shaped by past rewards; those preferences are our basis for making decisions, and those decisions might or might not maximize future rewards.

If we’ve never wireheaded before, then we know at some intellectual level that wireheading will lead to a large reward, but that intellectual knowledge doesn’t make us intensely addicted to it. Why not? Because the basal ganglia is stupid. It doesn’t understand the world. All it does is memorize which meaningless patterns of neocortical activity have created rewards in the past. One such pattern of neocortical activity is the one that means “I am going to wirehead now”. If we’ve never wireheaded before, this pattern of activity means nothing to the basal ganglia, so we don’t feel driven to do that. We might wirehead anyway, because the basal ganglia recognizes the “I am going to feel really good” pattern of activity. But that’s a much weaker (indirect and theoretical) prediction, and hence a weaker drive, and it can be outvoted by, say, concern that we’ll neglect our family.

As soon as we wirehead once, the basal ganglia catches on. It memorizes the “I am going to wirehead” neocortical activity pattern, and it knows that that pattern got a super-high reward last time. Now the basal ganglia is aggressively pushing that thought into our mind all the time. This is the point where people become addicted.
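
To make the order-dependence concrete, here’s a deliberately simplistic toy sketch in Python. All the pattern names and numbers are invented for illustration; the “basal ganglia” here is just a lookup table of learned values, not a claim about how the real thing is implemented.

```python
# Toy model: the "basal ganglia" as a lookup table from neocortical
# thought-patterns to values learned purely from rewards experienced in the
# past. Decisions consult these learned values, not true future reward, so
# behavior depends on the order in which rewards were experienced.

learned_value = {
    "spend time with family": 8.0,          # reinforced many times in the past
    "I am going to feel really good": 3.0,  # generic pleasure association
    "I am going to wirehead now": 0.0,      # never experienced -> means nothing yet
}

def drive(thought: str) -> float:
    """Strength of the pull toward a candidate thought."""
    value = learned_value.get(thought, 0.0)
    # The intellectual knowledge "wireheading will feel really good" lends only
    # a weak, indirect drive (a fraction of the generic pleasure association).
    if thought == "I am going to wirehead now":
        value += 0.3 * learned_value["I am going to feel really good"]
    return value

def choose(candidates):
    """Pick whichever candidate thought the learned values push hardest."""
    return max(candidates, key=drive)

options = ["spend time with family", "I am going to wirehead now"]

# Before ever wireheading: the indirect drive (0.9) loses to family (8.0).
print(choose(options))  # -> "spend time with family"

# Wirehead once: the table memorizes that this pattern got a huge reward.
learned_value["I am going to wirehead now"] = 100.0

# Afterward, the very same decision procedure picks wireheading every time.
print(choose(options))  # -> "I am going to wirehead now"
```

The point is just that the decision procedure only consults values learned from past rewards, so the same agent behaves completely differently before and after its first wireheading episode.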

So anyway: If people are designing brain-like AGIs, they will very quickly discover that you can’t give a “baby” AGI easy access to its reward channel, or it will just try it immediately and wind up wireheading to the exclusion of everything else. The more interesting question is: if we wait until the AGI is a smart, self-aware “adult”, and then give it easy access to its reward channel (along with an explanation of what it is), will it use that reward channel to wirehead? I think the answer is “maybe, maybe not”…

I’m not sure what would make an AGI (or brain) more or less likely to consciously endorse hedonism, as opposed to having goals outside oneself.