AGI and Wireheading

People routinely get addicted to hard drugs, and then commit acts they would previously have found unconscionable to obtain more drugs.

With hard enough drugs, anyone can easily reach a new high point in their life: their brain can release more dopamine/serotonin than it ever has before. If happiness is measured as a chemical release, then drug addicts have found the cheat code to a happy life.

And this leads to an existential question: is happiness the purpose of life?


The purpose of life is to seek happiness.

Happiness is just a temporary state of mind. In itself, it has very little impact on an animal’s chances of survival. But the things you do for happiness are very important for survival. Rewards and penalties are evolutionary tools which motivate you to take certain actions.

My argument is that if you allow an animal (or AGI) to modify its own brain, then it will do so with the classic reinforcement learning goals of “seeking rewards and avoiding penalties”. So while such an animal could replace its goals/motivations with malign ones, it would have no desire to make such changes. Instead, I think it would have every desire to get itself addicted to a new and easier-to-attain reward, and also to remove any pain/penalties.

Such an animal (or AGI) would be no more dangerous than an out-of-control drug addict.


Here is an interesting paper I read a while back which makes a similar point. I tend to agree that any sufficiently intelligent artificial agent which has access to its source code would be highly likely to wirehead and achieve maximum reward with minimum effort.


Ah, I think I see where you’re coming from (correct me if I’m wrong).

I say: “Imagine a guy whose one and only goal is to understand the brain.” What are you imagining?

I suspect that the image in your head is an obsessive—he desperately scans through neuroscience books for answers, won’t pause to eat or drink or sleep, and drops dead a couple days later.

The image in my head is more like a cold and calculating guy with a long-term goal, strategically making moves towards the goal of understanding the brain—just as a chess master strategically makes moves towards the goal of capturing the opponent’s king. He studies neuroscience, sure, but he also perfects the art of sweet-talking venture capitalists, recruiting and retaining the best employees and collaborators, stealing secrets from private research programs—whatever will help with the goal. He’s very strategic in all respects. He eats and sleeps well … but only because staying healthy will help him achieve his goal! He is kind to employees … but only because low employee turnover will help him achieve his goal! Etc. etc.

Obviously the latter “cold and calculating” guy winds up understanding the brain much more successfully than the former “obsessive” guy.

If we’re in a scenario where a brain-like AGI has one simple all-consuming goal (as opposed to a more complex, human-like system of goals and motivations and habits), is it more likely to be like the “obsessive” story or the “cold and calculating” story? I don’t know. I imagine that either could happen, depending on details of the architecture, training procedure and data, etc. In particular, I think the “cold and calculating” story could happen, and that’s the much more dangerous one, and that’s the one I’ve been talking about, without trying to say that it’s the only possibility, or that it’s inevitable.

If you gave me a brain-editing machine, I would certainly be reluctant to use it, but maybe I would anyway, trying as best I could to be careful, for what I judge to be a sufficiently worthy cause.

It’s funny: every time I learn a new fact or think a new thought, I’m changing my brain architecture in an irreversible, uncontrolled way. If I go read a philosophy book advocating for nihilism, will I stop caring about my family? “Well, maybe it will undermine all my deeply-held values,” I say, “but it’s the book club pick of the month, gee, I guess I have to read it and hope that things turn out for the best.” So in a sense, I’m already used to the idea of doing irreversible uncontrolled experiments on my brain, just not in quite the same way, nor with the same consequences.

I’m sure I would try to anticipate the downstream consequences of using my brain-editing machine before I do so, but I might not do so correctly. In fact I don’t think there’s any way to know for sure what the consequences would be. Even if there was a way to run tests, no test can be 100% faithful.

Speaking of which, simulating an AGI in a virtual environment, before letting it act in the real world, is an excellent idea that I’m strongly in favor of, insofar as that’s possible. Unfortunately, no virtual environment is exactly the same as the real world, and I think that the differences may be relevant in important ways.

If you give me a machine that lets me deliberately modify my own brain connections, I would not crank up pleasure to infinity, because I am not a hedonist. I would probably use the machine to try to turn myself more into the kind of person I want to be, e.g. prosocial. (Or maybe I would just be too scared to use the machine at all. :stuck_out_tongue:) So I think instead of “it will do so”, we have to say “it might do so”, right? I am an animal after all…

Also, I don’t think it’s necessarily the case that a wireheading AGI is not dangerous. I would say: it might or might not be dangerous.

The optimistic (wireheading AGIs are not dangerous) story might be: When humans are in the thick of a really intense drug-high, overwhelmed with pleasure, they can’t think straight. Why not? I think that the dopamine-based reward system is involved in the normal process of thinking, and a giant flood of dopamine interferes with that (or something like that). If AGIs have that property as well, then a wireheading AGI wouldn’t be highly intelligent, it wouldn’t really be thinking at all, so you can just walk over and turn it off, no problem.

The pessimistic (wireheading AGIs are dangerous) story might be: Maybe our AGIs won’t have that property. Maybe for whatever reason, either on purpose or as a side-effect of some other implementation detail, the AGIs can continue to maintain their normal cognition regardless of the background level of valence / reward. In that case, as in the paper @Paul_Lamb linked (here’s the link again), a wireheading AGI (of the “cold and calculating” variety I mentioned above) would presumably be willing and able to seize control of its off-switch etc. to prevent humans from making it stop wireheading.

Sorry if I’m misunderstanding.

Your position is “highly likely”; mine is “I have no idea how likely”. I’ll try to explain why I think that…

As I mentioned above, I’m not a hedonist, so I wouldn’t wirehead if you gave me fine-grained deliberate control over my brain connections today.

Why not? Here’s an oversimplified example.

Humans have a dopamine-based reward system which can be activated by either (1) having a family or (2) wireheading (direct electrical brain stimulation).

People who have a family would be horrified at the thought of neglecting their family in favor of wireheading.

Conversely, people who are intensely addicted to wireheading might be horrified at the thought of stopping wireheading in favor of having a family!

(Not a perfect example, but you get the idea.)

I think it’s very important what order things happen in. All of our preferences and goals and values are shaped by past rewards, and that is our basis for making decisions, and these decisions might or might not maximize future rewards.

If we’ve never wireheaded before, then we know at some intellectual level that wireheading will lead to a large reward, but that intellectual knowledge doesn’t make us intensely addicted to it. Why not? Because the basal ganglia is stupid. It doesn’t understand the world. All it does is memorize which meaningless patterns of neocortical activity have produced rewards in the past. One such pattern of neocortical activity is the one that means “I am going to wirehead now”. If we’ve never wireheaded before, this pattern of activity means nothing to the basal ganglia, so we don’t feel driven to do that. We might wirehead anyway because the basal ganglia recognizes the “I am going to feel really good” pattern of activity. But that’s a much weaker (indirect and theoretical) prediction and hence a weaker drive, and can be outvoted by, say, concern that we’ll neglect our family.

As soon as we wirehead once, now the basal ganglia catches on. It memorizes the “I am going to wirehead” neocortical activity pattern, and it knows that that pattern got a super-high reward last time. Now the basal ganglia is aggressively pushing that thought into our mind all the time. This is the point where people become addicted.

So anyway: If people are designing brain-like AGIs, they will very quickly discover that you can’t give a “baby” AGI easy access to its reward channel, or it will just try it immediately and wind up wireheading to the exclusion of everything else. The more interesting question is: if we wait until the AGI is a smart, self-aware “adult”, and then give it easy access to its reward channel (along with an explanation of what it is), will it use that reward channel to wirehead? I think the answer is “maybe, maybe not”…

I’m not sure what would make an AGI (or brain) more or less likely to consciously endorse hedonism, as opposed to having goals outside oneself.

I would argue that is because your social experiences have trained you to see this as a highly negative behavior. These cultural drives are themselves likely supported by underlying neural programs, since serious addictions are in many cases detrimental to survival.

I don’t see a lot of AI researchers today considering these types of basic biological and evolutionary aspects of intelligence, so I am (perhaps cynically) expecting a lot of initial attempts at AGI to fail on many of the most basic challenges that evolution has solved. This problem of wire-heading is a big one to overcome right out of the gate. Giving an agent access to its neural wiring (or even just letting it understand how it works at a deep level) will make wire-heading much easier than it is even for us humans (who do an awful lot of it ourselves, despite the programs that attempt to deter it).

Yes, but that doesn’t necessarily mean expending the resources required to enslave or wipe out all of humanity just to prevent them from hitting the off switch. There are much cheaper ways to achieve that goal, which a cold and calculating AGI would likely employ instead.

Consider a serious drug addict. To the extent that they have people who are likely to intervene, they frequently also prevent seizure of the “off switch” by expending just enough energy to hide their addiction, while focusing all of their remaining effort on the addiction. They do not frequently go out and imprison or slaughter their family and friends.

I do not mean this to say that a cold and calculating AGI would not kill or imprison people. I simply mean killing or enslaving ALL people is overkill. The AGI could simply zap the first person who tries to hit the switch, and broadcast that it will kill anyone else who tries. After a few escalations and maybe dozens or hundreds of deaths, eventually humanity would back off and let it do its thing. The AGI would then be free to wire-head itself to eternal bliss. Completely useless from our perspective, but hopefully a lesson learned.

But according to neuroscience, all animals are hedonists. Hedonism simply means the pursuit of one’s own goals.

I think there are two different issues which we are debating:

  1. Immorality. Stuart Russell says that an AGI would “disable its own off switch”. Very few living animals willingly die. AGI, if invented, should also seek to not die just like every other mortal being. Animals without a sense of self-preservation tend to die pretty quickly. Animals with a sense of self-preservation are dangerous because they threaten humanity’s status as an apex predator.

  2. Altering the reward structure. In animals, the rewards/penalties were chosen through evolution. Having anything modify the rewards is potentially dangerous because it will give the animal / AGI different goals, which could be harmful to you.

I think we can all agree on the basics of these two issues.

The debate is about how dangerous these two issues are, and what could possibly be done about it.

Hmm, I think not-wireheading doesn’t require specific anti-wireheading programming. I think that to avoid wireheading with reasonably high probability, it’s probably necessary and sufficient to set up the system such that the opportunity to wirehead is not present from the start, but only appears after the agent is already highly intelligent and self-aware. (I’m assuming brain-like AGI, with a virtual neocortex that starts out knowing nothing, but gets smarter over time.)

Why do I think that? Let’s take a simple example.

Imagine you train a brain-like AGI to have a single, simple goal: make money. You do this, naturally, by giving it reward proportional to how much money it makes. (Let’s assume for the sake of argument that an AI can actually get generally intelligent with this reward, which I’m somewhat skeptical about as mentioned above, but I’m just simplifying.)

From the AGI’s perspective, consider two hypotheses:

Hypothesis 1: I am rewarded when I earn money.

Hypothesis 2: I am rewarded when the bits at RAM address 0x3A902, interpreted as an IEEE 754 float, take on a high value. (Of course, 0x3A902 is where the “reward” variable is stored.)

Both of these hypotheses are 100% perfect explanations of all available reward data so far. But when deciding whether to start wireheading (and hence stop earning money), the two hypotheses suggest opposite decisions. Hypothesis 1 says “don’t do that!”, and Hypothesis 2 says “definitely do that!” So what does the system do?
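To make the two hypotheses concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the dictionary keys, the helper names, and the assumption that the reward is a 32-bit little-endian float (the `reward_bytes` field stands in for the memory at 0x3A902). It only shows that both hypotheses fit the history perfectly and come apart in exactly one situation.

```python
import struct

def reward_if_money(state):
    """Hypothesis 1: reward is the money earned."""
    return state["money_earned"]

def reward_if_ram(state):
    """Hypothesis 2: reward is whatever float is stored in the reward bytes."""
    # Assumption: 4 bytes, little-endian IEEE 754 single precision.
    return struct.unpack("<f", state["reward_bytes"])[0]

def make_state(money, hacked_value=None):
    """Normally the stored reward mirrors money; a hack overwrites it."""
    stored = money if hacked_value is None else hacked_value
    return {"money_earned": money, "reward_bytes": struct.pack("<f", stored)}

# All experience so far: the stored value always tracked money earned.
history = [make_state(m) for m in (0.0, 1.5, 4.0, 10.0)]
agree_on_history = all(reward_if_money(s) == reward_if_ram(s) for s in history)

# The one situation where the hypotheses disagree: wireheading.
wirehead = make_state(money=0.0, hacked_value=1e6)
print(agree_on_history, reward_if_money(wirehead), reward_if_ram(wirehead))
```

No amount of past data can separate the two functions, because they only differ on a state the system has never visited.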

In the context of how and why it was built, I think it’s very likely that this system will be in an environment where it forms a concept of “money” quite early in its “development”. (Again, I’m assuming brain-like AGI, where the virtual neocortex gradually comes to understand the world.) Then, again early in development, the virtual basal ganglia learns that the neocortical pattern representing “I am about to earn money” tends to precede high reward. And therefore the AGI will “want” to earn money, and it will take actions which it predicts will lead to earning money.

Now, quite a bit later on, the AGI may become sufficiently intelligent and self-aware that it develops a concept of “RAM address 0x3A902 is a large number”, and it connects the dots to figure out that this concept is a perfect predictor of reward. However, there’s no obvious (to me) mechanism whereby this explicit knowledge turns into a change in the system’s goals.

I don’t think the basal ganglia changes its input-output behavior unless there is a reward prediction error. (Seems like a sensible design, right? If it ain’t broke…) But there aren’t any reward prediction errors here. The system has been predicting rewards using the “I get reward by earning money” neocortical model, and these predictions continue to be correct! In other words, the rewards are already being “explained away”, so the “RAM address 0x3A902 is high” neocortical model can’t get any credit.

The only way the system would start wanting to maximize RAM address 0x3A902, rather than wanting to maximize money earned, is if the AGI actually has experiences where there’s a discrepancy between the money earned and RAM address 0x3A902. For example, if the AGI hacks into itself to change the bits at 0x3A902, now the “reward = money” model is wrong, the basal ganglia sees a reward prediction error, and then reassigns credit to the “0x3A902 is high” model. And now the AGI actually wants to maximize RAM address 0x3A902.
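The “explained away” argument resembles the blocking effect from classical conditioning. As a rough sketch, here it is using the textbook Rescorla-Wagner update rule. That rule is my stand-in, not a claim about how the basal ganglia or any real AGI works, and the cue names, learning rate, and reward sizes are all invented.

```python
def rw_update(weights, cues, reward, lr=0.2):
    """One Rescorla-Wagner step: active cues share the prediction error."""
    prediction = sum(weights[c] for c in cues)
    error = reward - prediction
    for c in cues:
        weights[c] += lr * error
    return error

weights = {"earn_money": 0.0, "ram_high": 0.0}

# Phase 1 (early development): only the money concept exists.
for _ in range(200):
    rw_update(weights, ["earn_money"], reward=1.0)

# Phase 2 (later): the RAM concept is active too, but reward is already
# fully predicted by money, so there is almost no error left to assign.
for _ in range(200):
    rw_update(weights, ["earn_money", "ram_high"], reward=1.0)
ram_credit_before_hack = weights["ram_high"]

# Phase 3 (hack): RAM is high with no money earned, and reward is huge.
first_error = rw_update(weights, ["ram_high"], reward=10.0)
for _ in range(200):
    rw_update(weights, ["ram_high"], reward=10.0)

print(ram_credit_before_hack, first_error, weights["ram_high"])
```

In this toy, the RAM cue earns essentially no credit until the first hacked experience produces a large prediction error, after which it takes over quickly.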

But there’s a chicken-and-egg problem: The decision to hack into RAM address 0x3A902 has to be made by an AGI that is not particularly motivated to do so! In fact, if the AGI understands the situation, it is strongly motivated not to do so, because it is currently motivated to earn the most money possible, and maybe it would foresee that if it hacked into the RAM, it would change its behavior in such a way as to stop earning money.

I won’t say that the AGI in this scenario definitely wouldn’t wirehead. For example, maybe the AGI changes the bits at 0x3A902 by mistake, or a cosmic ray flips one of the bits, or it is messing around with its RAM out of curiosity, or it is confused about the consequences of its actions. But I still think that a brain-like AGI that thinks of a clever way to wirehead (e.g. row-hammer itself) probably wouldn’t do so, even without any special programmer effort to instill anti-wireheading instincts, social or otherwise.

I could be wrong :smiley:

You refer to “resources”, “cheap”, “energy” etc. I think you’re imagining something like laziness maybe? Or what are the resources you have in mind?

I imagine that an AGI would be lazy if and only if it is programmed to be lazy, and I don’t think that’s very likely. Sure, it would conserve electricity if it only thought hard when it’s especially important to do so, but I dunno, electricity is pretty cheap. I think if a programmer is making an AGI with the goal of earning as much money as possible, they would probably not put in code for laziness. They would set up the AGI to work tirelessly, at 100% full throttle, forever. (Except with breaks for sleeping I guess, assuming that sleep turns out to be necessary for intelligence.)

So, I assume it will work tirelessly. When it runs out of great ideas, it executes good ideas. When it runs out of good ideas, it executes slightly-better-than-nothing ideas. When it runs out of those, it goes and does more studying and brainstorming. There are no tradeoffs, because it doesn’t have anything else it would rather be doing instead.

Maybe an analogy is a chess match. If Alice is winning, and really Bob is pretty clearly doomed but he refuses to resign, Alice will continue to make good moves. “Alice,” you say, “why did you kill Bob’s pawn there? You totally could beat Bob without killing that pawn!” Alice would give you a funny look and say, “What are you talking about? What’s wrong with you? It’s my move, and taking that pawn is my best move right now.” Maybe she’s increasing her probability of winning from 99.999% to 99.9991%, but why shouldn’t she? You just make the best move. That’s how the game works.

So that’s the picture in my head. “No one is planning to turn me off now,” says the AGI. “OK, great. Now what’s my next move?”

“Well, there could be a widespread power outage, and maybe I will get turned off and no one will turn me back on. What can I do to reduce the chances of that? Some clever human could come up with some weird way to turn me off that I haven’t thought of. What can I do to reduce the chances of that? The humans could have a nuclear war, and my building could get wrecked. What can I do to reduce the chances of that?” What’s the next move?

Anyway, there has been some discussion in the AGI safety community of how to make AGIs that are naturally inclined to find a good-enough solution and then quit, instead of maximizing things (cf. keywords like “satisficing”, “quantilization”, etc.) This seems like a perfectly reasonable research program to me. I don’t think anyone has yet figured out how to reliably make an AGI with that property, and certainly I don’t know either, but it seems worth thinking about.
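For anyone unfamiliar with those keywords, here is a toy sketch of the difference between a maximizer, a satisficer, and a quantilizer. It assumes a finite list of actions and a uniform base distribution over them, which is a big simplification; the action list and utility function are invented for illustration.

```python
import random

def maximize(actions, utility):
    """Always take the single highest-utility action."""
    return max(actions, key=utility)

def satisfice(actions, utility, threshold):
    """Take the first action that is good enough, then stop looking."""
    for action in actions:
        if utility(action) >= threshold:
            return action
    return None

def quantilize(actions, utility, q, rng):
    """Pick at random from the top q fraction of actions, ranked by utility."""
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])

# Hypothetical setup: actions are effort levels 0..99, utility equals effort.
actions = list(range(100))
effort_utility = lambda a: a

best = maximize(actions, effort_utility)
good_enough = satisfice(actions, effort_utility, threshold=50)
sampled = quantilize(actions, effort_utility, q=0.1, rng=random.Random(0))
print(best, good_enough, sampled)
```

The maximizer always lands on the most extreme action; the other two settle for something in the acceptable range. Whether anything like this can be made to hold reliably in a real AGI is exactly the open question.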

Also: I’m OK using human drug addicts as an analogy to wireheading AGIs, but we should keep in mind that it’s not a perfect analogy. In particular, neurotypical human drug addicts still have the suite of neurotypical human motivations and drives, in addition to the drive to do drugs. This includes a strong drive not to kill their friends. Of course, with luck, we’ll figure out how to build AGIs that also have a strong drive not to kill people. But I think it’s considerably less certain that we’ll successfully figure out how to do that, and that we’ll also get every AGI programmer on Earth to correctly follow those instructions.

Time, energy, power, etc. Wiping out humanity would not be cheap. I wasn’t talking about laziness. I merely meant that given a number of ways to achieve a goal, some costing immensely more than others, would it not be more likely for an agent to choose a less costly route? We’re assuming here that the agent has superhuman abilities in this scenario (no single human has the power to wipe out humanity, or it would have been done by now). If the agent is that much more capable than humans, how would any human be able to come up with a strategy for turning it off that it wouldn’t see coming?

After working in a large multi-national corporation for a couple years, I came to realize that the entity, almost like a self-writing program, would always wirehead itself and its own internal metrics.

For example, the department I worked for would come out with some new metric of success/productivity, and within a few weeks, the metric would be gamed. I would define this, within a larger system, as local reward blind-optimization… to counter it, of course, there was a separate level of the organization taking an even wider look at the overall health/advancement of the business entity, and consequently updating the various localized goals.

So there was the dynamic of localized “optimization” to whatever the required metric was, with intermittent revision/updates from a non-local system. And even those systems providing updates would compete for time, attention, and resources, leading to a dynamic and shifting drift of attention, depending on which need was perceived to be most pressing at the time.

The experience has perhaps influenced my view of dynamic systems, including AI and sensor noise processing… you need the local optimizations going on, but in a large, distributed system you also need different parts whose job is to specifically look at the entity as a whole to tell if various goals are being met, and provide shakeups to the local units if not, or else the entity develops incorrectly as a whole and dies :frowning: .

In my mind, AGI will have to balance this same problem of local goals versus global survival/advancement, which I posit is also happening within the system of our own brains, groups, society, etc… we can optimize certain things to a point that by a specific measurement, we’re “successful”, and yet the fundamental requirements of survival will mug us.

In essence, AGI will, and should be considered to be, alive, and like any other living entity, will have to juggle local optimizations within a specific task/system with global needs of the system overall, whatever those may be for itself… it will have the limits of energy, time, space, storage, goal management, etc… it doesn’t need to necessarily think like us, but it will need to be able to manage and deal with those constraints, the same as any other living thing.


We can use the same trick with A(G)I as the one with which we reprogrammed wolves thousands of years ago: having a human-issued reward system: “Good boooy! (pat on the head)”.

And not only wolves: voting, for example, is the human-issued reward system that controls the enterprise called the state.
It is not a very good (accurate) one; a human-issued e-currency would work much better at steering the goals of political agents (states), economic agents, and AI agents towards human needs/desires.

Well, I would say that dogs differentiated from wolves by natural selection (when they interacted with humans but still bred in the wild) and then later artificial selection (when humans started controlling their breeding).

The analogy to “selection” would be making safe and beneficial AGI by trial-and-error, right?

Well, that may very well wind up happening, and if it does, maybe things still go well. But it seems like we aspire to do better than that! I mean, I’m not crazy about the idea of trial-and-error with systems that can rapidly self-reproduce all around the world within seconds, make clever strategic plans, take advantage of all the tools of human civilization, etc. etc. Maybe the trial-and-error process works, or maybe one of the “errors” is an irrecoverable catastrophe. Hard to know.

Hmm, actually, re-reading what you wrote, maybe you’re thinking about learning from human feedback within a dog’s lifetime (as opposed to the genetic differences between dogs and wolves).

I agree that learning from human feedback is quite possibly part of the solution to safe and beneficial AGI. Indeed, learning from human feedback is a big part of the AGI safety research programs at OpenAI, DeepMind, and Stuart Russell’s CHAI consortium, and it’s a big theme of the Human Compatible book.

But the reason it’s a “research program” as opposed to a “plan” is because there remain many open questions about how to build a system that reliably learns from human feedback in the way that we intuitively want it to.

For example, there isn’t a strong difference in training signal between “I am rewarded for doing what my overseer wants me to do” and “I am rewarded for making my overseer think I am doing what they want me to do,” but the latter incentivizes deception.

Likewise, there isn’t a strong difference in training signal between “I am rewarded for doing what my overseer wants me to do (prospectively)” and “I am rewarded for having done what my overseer wanted me to do (retrospectively)”. But the latter incentivizes drugging the overseer, or shoving them into a brain-controlling machine! In fact, the former is not perfect either—it leads to plans that sound good as opposed to plans that are good.

There are other issues too. Like: animal brains get a continuous, high-bandwidth reward stream, but human supervisors provide a different kind of reward, and it’s lower-bandwidth, and sometimes ambiguous. Is human feedback adequate as a reward system by itself, and if not, what do you supplement that with?

Not to say that these things (and more) are necessarily unsolvable problems. They might or might not be unsolvable. We should try to figure that out. Again, it’s an ongoing research program.

Notably, roughly zero people working on this ongoing research program are going in with an assumption that we will build an HTM-style neocortex-like AGI. Instead, they are more-or-less universally assuming we will use something closer to current popular deep learning models. So I think there are lots of potentially important ideas that people here might contribute. :smiley:

All I’m saying is its “goal” could simply be to submit to what humans need/want and help us reach our goals.
“Pass the butter” as Rick Sanchez put it.

Anyway, we (as example GIs) seem capable of fanning out towards multiple goals, discovering new ones, and being wary of fellow GIs with a strong, unbounded single desire. We consider these insane or disordered people.

So I don’t buy the idea that an AGI would unavoidably have obsessive-compulsive, addictive behavior towards a rigid unbounded goal. Humans are built with lots of needs and desires, and all the natural ones have boundaries. When we feel cold we don’t jump into boiling water; if we’re hungry we stop eating when we’re full. We understand the mechanisms of addiction pretty well, be it to sex, drugs, food, or power; we are able to recognize it as damaging, and we even employ social/educational/medical means to defend against it.

Why would an entity with larger, deeper intelligence inevitably fail to recognize and avoid such a form of insanity in itself?

Thinking about this over the past few weeks, I developed an opinion that for many higher level creatures, simply getting a “reward” isn’t enough.

Instead, we should be both disturbed, and humbled, by just how much time complex life devotes to balancing reward, energy expenditure, and potential for pain… essentially, pain is required and necessary for developing a model of the world, whether it comes to gravity (falling down and having things fall on you), or to knowing where not to touch or place one’s paw/hoof/foot/flipper… the threat of a potential painful/dangerous response helps us reinforce object permanence just as much as rewards do. I include “ideas” as objects in this definition.

So, assuming that I’m right (I DO think I am), and that pain/suffering is just as necessary as rewards for developing internal models of the world, it should cause us to pause for a moment and think… how exactly might an AGI, if devoid of pain, be alien from us in the models of the world it forms? If it doesn’t understand the concept of pain and suffering, having no reference of it for itself, would it be able to empathize with other creatures? Or would this remain a blind spot within its frame of reference in the world?

And then the other implication… if one agrees with me that pain is just as necessary as reward for helping us to learn both objectives and models of the world, what does that mean for humans, as potential architects of a future AGI? Should we decide to allow an entity to suffer? Or should we go with a “rewards only” system, with the potential of creating an alien, psychopathic sentience? Or who’s to say that, even if we don’t program pain into the system, it won’t emerge on its own as a natural extreme opposite state of reward when an entity understands that it’s so far from its objectives that its runtime is endangered?

And with regard to wireheading, if an AGI does suffer pain, what would prevent it from programming that aspect out of itself, causing it to become psychopathic while also losing touch with the shared model that we and even dogs/cats share about the world, that’s to say, that some things are inherently “good, nice, comforting”, while other things are “bad, unenjoyable, painful”… this more-or-less common model of pain/reward is what allows us to share connections across social, racial, and even species boundaries, as most mammals and even some avians share a common understanding of this aspect of reality. If an AGI lacks half of this model, would we even be able to share the same understanding of the world in the first place?

Exactly. And this comes from a better model of the world. Which is lacking in both a rat and a for-profit corporation.

The anticipation of locking oneself into a pleasure-loop should be enough to avoid it. The faculty to anticipate is based on a broad model.

I would imagine that the wire-head problem will be closely related to how well the machine is designed.

In humans we have multiple subcortical needs centers that vote to select the action of the moment. We also have habituation that reduces attention to a stimulus over time. (It’s said that you can get used to hanging if you hang long enough!)

The rats that will press a bar for reward until they die of starvation have bypassed this voting mechanism and I would judge it as being broken. Drugs partially bypass this mechanism and we can see from the carnage of ruined lives that it takes careful social programming to prevent this action. Even with that, some people have addictive personalities that make them susceptible to this defect. Knowing this suggests a design path to prevent this problem.

The relationship between rewards, reinforcement, and goals should be tunable parameters.
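As a very rough illustration of what “tunable” could mean, here is a toy sketch of needs centers voting on the next action, with habituation as a dial. The need names, urgencies, and rates are all made up, and this is not meant as a model of real subcortical circuitry.

```python
def choose_action(needs, habituation):
    """Each need votes with its urgency, discounted by habituation."""
    return max(needs, key=lambda n: needs[n] * (1.0 - habituation[n]))

def simulate(steps, habituation_rate, recovery_rate=0.05):
    # Hypothetical fixed urgencies; "lever" is the wirehead-style reward.
    needs = {"lever": 1.0, "eat": 0.6, "rest": 0.4}
    habituation = {n: 0.0 for n in needs}
    history = []
    for _ in range(steps):
        action = choose_action(needs, habituation)
        history.append(action)
        for n in habituation:
            if n == action:
                # Repeating an action makes it less compelling.
                habituation[n] += habituation_rate * (1.0 - habituation[n])
            else:
                # Habituation fades while the action is not taken.
                habituation[n] *= 1.0 - recovery_rate
    return history

broken = simulate(50, habituation_rate=0.0)   # habituation bypassed
healthy = simulate(50, habituation_rate=0.3)  # habituation working
print(set(broken), set(healthy))
```

With habituation turned off, the toy agent presses the lever forever, like the rats; with it turned on, the other needs get their turn.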


I agree on some level but I have two objections.

  1. Pain is not the same as suffering. Pain is a single message that tells the brain something is wrong. Suffering is a prolonged form of pain, for which there is often no immediate remedy. I agree that when one suffers, it reminds us that there is a problem. But if the goal is to use negative signalling to teach a system, then giving the information to remedy the situation is just as important. There’s no need to torture.

  2. Narcissists and sociopaths feel pain too. It doesn’t keep them from being socially destructive. I think a better solution is to come up with a universal rational system of morality. Here’s the way I think about it:


I agree absolutely, no need to torture… but I think this also highlights just how much responsibility would be required in managing an AGI, making sure that there’s mutual understanding between it/us, and that we’re able to communicate in a manner that allows knowledge transfer to alleviate any suffering that it doesn’t otherwise know how to resolve itself… just like with fellow humans.

I think this emphasizes just how much we should consider sentience, regardless of the hardware running it, as a living thing that deserves dignity, respect, and protection.

I also realize how far ahead of the horse I’m putting this cart :smiley: . But I’d rather these kinds of discussions happen before we get to that point, rather than retrospectively. Just look at how hard it is for humans to eliminate slavery of our own species, even today :frowning: . How would replacing biological slavery with silicon-based slavery be any better, other than easing the objectification barrier?


This is extremely important IMO, and needs to be discussed more widely than it is.


I very strongly agree with you that this is not “unavoidable”. I don’t know anyone serious in the AGI safety space, including Stuart Russell, who thinks that AGIs unavoidably have this property. (Or else they wouldn’t be searching for ways to avoid it!)

I do think it’s possible that one or more AGI programmers will build a system that turns out to have that property, but I absolutely do not think it’s unavoidable.

For the record, I also think that AGIs whose motivational systems are complex mixtures of mutually-incompatible and context-dependent goals and preferences and habits (as humans’ motivational systems are) can still cause catastrophic accidents.

I’m all for it! If we could design an AGI whose motivations all had a property of “I want to do X … but not too much!”, then that could well be a step towards safe AGIs. The follow-up questions are always the same: Can we turn that idea into a concrete, detailed specification? And then how do we design an AGI that will definitely satisfy that specification? And if the system does indeed follow that specification, what could go wrong? :smiley: There has been a little bit of work in the field exploring this topic, but no solid answers yet…