Tradeoff between generality and optimality with regard to AI alignment issues




It’d be nice to have AI that makes decisions and acts on them on its own, but until we solve the control problem, can’t we just use AI which thinks and moves its sensor but does not care about anything, nor has any chance of killing even a fly with it? If you frame AI as optimizing some value, then there is a problem, but the cortex just thinks.

Even if you do add behavior and emotion, just don’t tell it to make as many paper clips as possible or whatnot. Tell it to make five paper clips by pushing buttons on this machine (which cannot possibly hurt or manipulate anyone) using this spool of metal wire, and then to wipe itself at a specific time (saving a modified backup for reboot). But I think it’d be pretty stupid anyway to give it behavior when you can just read its thoughts and translate them into motor output, which is easier.

A thinking machine does not inevitably have goals. You tweak the outputs of individual neurons according to some learning rules, optimizing, for instance, how well they represent a pattern. That leads to intelligence, but each neuron is optimizing its own local quantity, and what neurons optimize is the quality of their response as a representation, not some outcome in the world. You do not optimize synapses to make paper clips.
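As a toy illustration of that point (a minimal sketch, not HTM or any specific cortical model), here is a single linear “neuron” trained with Oja’s rule, a local learning rule whose only objective is representational: the weights end up describing the dominant correlation in the input stream. Nothing in the rule refers to outcomes in the world.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D input stream: most variance lies along the (1, 1) direction.
x = rng.normal(size=(5000, 2)) @ np.array([[1.0, 0.9], [0.9, 1.0]])

w = rng.normal(size=2) * 0.1      # synaptic weights, small random start
lr = 0.01                          # learning rate
for xi in x:
    y = w @ xi                     # the neuron's response
    w += lr * y * (xi - y * w)     # Oja's rule: Hebbian term plus decay

# w converges toward a unit-norm vector along the input's first principal
# component -- a statement about representation quality, not about goals.
print(w, np.linalg.norm(w))
```

The rule is local (it uses only the neuron’s own input and output), which is the sense in which what a neuron optimizes is its response as a representation rather than anything out in the world.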


I finished watching it, and I still argue that this is really coming from the perspective of building an AI that is pre-programmed to function out of the box at or above the level of a human with years of experience, while trying to take into account all possible variables up front. Of course that is an astronomically difficult problem. Humans don’t have these “utility functions” dictating our decisions, so why do we assume AIs must have them?

Another assumption seems to be that we would be taking these ultra-advanced AGIs and having them work on optimizing a limited set of specific problems (collect resources, maximize the number of smiles, make burritos, etc.). But humans do not operate this way - we have countless conflicting goals that are forged by our life experiences, which together prevent us from mindlessly optimizing one singular goal ad nauseam. Why should human-level AGIs be applied differently?

It’s almost as if Eliezer is arguing that we humans are somehow smarter than the hypothetical super-human AGIs (despite their being, by definition, more intelligent than their human predecessors). Why do we assume they won’t be at least as good as we are at ensuring their own creations don’t destroy the society they are members of? If the AGI builders designing these next-gen smarter AGIs have learned a value system by living within the norms of a society (being punished when they stray from those accepted norms, and rewarded when they conform to them), then it won’t only be humans enforcing the values of that society who fear creating something that destroys it.

As long as the new, smarter AGIs aren’t suddenly exponentially more capable than their predecessors (in resources and intelligence) right out of the box, their predecessors will have just as much motivation as their own human predecessors to make sure the new AGIs are functioning normally in society, and will enforce those norms when things go wrong.


My requirement, perhaps ridiculously high, for AGI is that it be able to construct abstract connections to arrive at novel solutions.

For humans, this seems to rely on several aspects: knowledge (the rule book), goal chasing (motivation), and, perhaps surprisingly, boredom (time to mull things over and create connections). As somebody who uses deep learning to earn my money, I see it partially fulfilling the goal-chasing aspect, via optimization through gradient descent and backpropagation. Rule-book-style approaches were tried, with mixed success, using LISP, but ran into limits due to the amount of manual work required, the lack of computing power at the time, and failures to transfer small-world models into the real world with good enough results.
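The goal-chasing aspect can be reduced to a few lines: gradient descent blindly decreases whatever scalar loss it is handed. A minimal sketch (the data and the single parameter are made up purely for illustration):

```python
# Fit y = w * x to data generated by the rule y = 2x, by descending the
# gradient of the mean squared error with respect to w.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0      # single learnable parameter
lr = 0.01    # learning rate
for _ in range(500):
    # d/dw of (1/n) * sum((w*x - y)^2)
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad   # step against the gradient

print(w)  # converges to ~2.0
```

Backpropagation is just the bookkeeping that computes the same kind of gradient through many layers; the “motivation” never amounts to more than this descent.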

Finally, I see SDRs within a neocortical model (HTM or otherwise) as that “abstract connection”, due to pooling and distal connections between synapse regions, each taking in different types of input.

Ultimately I think we’re going to get there with a combination of different approaches. After all, there’s more than the neocortex in our skulls.


This is a good observation, I think. If an AI is designed with a singular limited goal, and it is given enough intelligence to perform super-human feats in its quest to complete that goal, then it will likely find unusual (potentially destructive) ways of achieving it (turn the universe into paperclips, etc.). But if it is designed in a way that it forms many competing goals, then its optimal policy is far less likely to be something so extreme (turning the universe into paperclips may be fun and all, but man, that’s a lot of work…).


On AI alignment, I sometimes fear that if a machine has to learn values based on observing humans and our behavior over history, we’re done for.

What I consider is that even if we have machines that need to learn from scratch (akin to an infant), if they have some way to pool their accumulated knowledge centrally, one day of human-like experience for 1000 machines would equal multiple years of experience. Day 1 would be the infantile state; day 2 would begin with roughly 2.7 years of pooled experience. At least in humans, that age then shows an explosion of learning and curiosity - imagine all the “why” questions coming out on day 2. After a week, even limited to human learning rates (and there’s no reason it should be), we would have an intelligence with nearly two decades of experience. Imagine even more input sources, and the composition of experiences.
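The arithmetic behind that claim, assuming (hypothetically) perfect pooling with no overlap or communication loss:

```python
# Hypothetical back-of-the-envelope: 1000 machines each gather one day of
# experience per calendar day and pool it centrally.
machines = 1000
for day in (1, 2, 7):
    pooled_years = machines * day / 365
    print(f"end of day {day}: ~{pooled_years:.1f} years of pooled experience")
```

This prints roughly 2.7, 5.5, and 19.2 years for days 1, 2, and 7 respectively, under the perfect-pooling assumption above.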

AI that learns in that way would very much manifest the best and the worst of humanity. A lot could go wrong in those first few days - any misunderstanding, and humanity would very quickly see the results of its behavior and treatment of others. I’m not sure whether we’ll like the result of that yet or not.

Fact is, we would have little idea of what we were doing, and the development speed of such an intelligent entity might quickly just dismiss us as stupid pets (in a positive outcome) or dispose of us as a disease that harms the planet (in a less positive outcome). For this reason, I’m in the camp that says we should figure out how to approach a super intelligence prior to having it come into existence.


AI alignment concerns itself with what happens when an AI becomes super-intelligent. It does not have to be programmed like that from the start; it can evolve to be like that. Once an agent becomes super-intelligent, we can assume it does not do manifestly stupid things like spending all of its resources navigating a circular path through the space of world states. This entails that it has at least a consistent (non-circular) ordering over world states, which means we can represent its preferences with a real-valued utility function.
Again, humans don’t have a well-defined utility function; we are circular in that way sometimes. The argument is that a super-intelligence does not have those circular preferences (or that they are negligible).
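The step from “no circular preferences” to “a real-valued utility function” can be made concrete with a toy sketch (the state names are made up). Treat each judgment “a is preferred to b” as an edge in a graph: if the graph has no cycle, a topological order yields numbers that respect every preference; if it has a cycle, no such numbers can exist.

```python
from graphlib import TopologicalSorter, CycleError

def utility_from_preferences(prefers):
    """Map each world state to a number so that preferred states score
    higher, or return None if the preferences are circular."""
    graph = {}  # state -> set of states it is preferred over
    for better, worse in prefers:
        graph.setdefault(better, set()).add(worse)
        graph.setdefault(worse, set())
    try:
        # static_order lists "worse" states before the states preferred
        # over them, so position in the order works as a utility.
        order = list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None  # circular preferences: no consistent utility exists
    return {state: rank for rank, state in enumerate(order)}

consistent = [("cake", "bread"), ("bread", "gruel")]
circular = consistent + [("gruel", "cake")]

print(utility_from_preferences(consistent))  # e.g. gruel < bread < cake
print(utility_from_preferences(circular))    # None
```

(For infinite sets of world states, standard utility theory needs extra assumptions such as continuity, but the finite case captures the non-circularity argument made above.)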
You can try to “raise” an AGI in human society but that does not guarantee it will behave nicely once it becomes far more intelligent.

The utility function may include conflicting goals and the AI would need to reconcile the conflicts in some way to achieve the optimal solution.

Once you accept the utility function assumption you should also look at the assumption of Instrumental Convergence which states that optimizing most utility functions (except some trivial ones) entails optimizing some common goals (like acquiring resources, becoming more intelligent, etc…).

The super-intelligent AGI will certainly want to make sure its subsequent creations also maximize the same utility function (and do it better than itself). The field of AI alignment is about exactly this in the case of humans: how do we make sure our creation shares our human values?

An exponential rise in intelligence will occur as soon as we reach the threshold of self-improvement; this is called an intelligence explosion.

Again, the question is about super-intelligent AGI, not human-level AGI. Of course human-level AGI can behave nicely, but once you reach super-intelligence you’re in a different ballpark.


Isn’t that belief running contrary to your point on the balance between generalization and optimization? That would put some hard limit on that growth curve.


Yes. My point was in fact that this trade-off, if it exists, might mean that an AGI like the one described in the AI alignment community cannot exist in practice, and we shouldn’t be worrying about it.
But I’m not completely sure about it, so my default position is to worry :slight_smile: .


(Yes, I have to tag @Bitking every other post, but just watched this interview thanks to him…)

I was not sitting on the worried benches of the semicircle to begin with, but he sure can drive the point home that current AI technologies, and their foreseeable future, are not really leading us to a fearsome HAL 9000.


Yes, but I don’t think the problem should be dismissed without serious thought. Most people think it’s ridiculous on the face of it, but can you provide some solid arguments against it? I think solving the control problem may be impossible, but it probably is possible to prove that such a problem cannot actually manifest in reality in the first place, and we should also dump some resources into this approach.


I don’t think it is dismissed. Aren’t there ethics gatherings pondering lethal drones and such?
I grew up with Asimov books at my side and as much soap sci-fi material as the next geek, enough to entertain fear of Skynet scenarios. Yet precisely when you know a little more about computing and think a little deeper about how current AI actually gets developed… nothing requires us to give our AI human-like drives or greedy impulses.
Or a nuclear propelled exoskeleton, for that matter.


Also, some point about that “threshold of self-improvement” thing.

At present (since we’re currently not envisioning a LISP-like approach to AI, but a neural sim with concrete limitations of size and learning rules, just like our minds), I do not see this as a “threshold” on a curve at all.
“Self-improvement” in that context would mean that you already have such an AGI, and a quite good one at that, and you ask it to develop a better version, then provide the physical resources or allow it to bootstrap and run its byproduct.

An analogy: a computer-side simulation of DNA is no threat to world health and cannot “escape control” by itself, unless you start to physically mess with test tubes and viral vectors. You may then worry about viral stuff escaping the lab, but that would mean someone f* up really badly. And it was not the computer sim.

Sure, this is something you could envision happening. But there is no threshold out of the blue here. That’s more of a mad-scientist scenario at this point, no?


That’s child’s play; the control problem is about a super-intelligence, and we have no idea how to even approach it.

That’s right, but whatever “drive” (utility function) we give it should align well with our human values; otherwise the misalignment will get out of control as the AGI becomes very smart.

I don’t think it needs us to give it a weapon to be dangerous; it’s dangerous just because it’s more intelligent than us and we can’t anticipate its actions. Sure, you can lock it up and prevent it from communicating with the outside world, but then you risk it breaking through your security measures, and if it doesn’t, you’ve just built a useless box.

Of course you can deny the AGI access to resources, but the point is that you wouldn’t want to. The AGI is going to be beneficial for whatever reason it was built. It’s going to be valuable; otherwise no one would build it. It is precisely this economic incentive which will drive us to give it more and more resources. Whatever its goal is, it’s going to be very good at achieving it, and we would want it (initially) to pursue it. You could say the solution is to be responsible enough as a civilization and deny the AGI its resources, but we don’t have anywhere near the capability to make intelligent collective decisions as a species, let alone enough to resist or reverse the huge economic incentives in building such an AGI and providing it with resources.
The problem is how we stop it once it has enough resources to resist. Solution: make sure the AI wants you to stop it if it behaves badly. Problem: how do we define that as a utility function? We have no idea.
Also, I don’t think the AGI needs to be that sophisticated in order to self-improve; it only needs enough compute resources, and it can pretty much brute-force its way.

I actually wrote an article about how I think an AGI would (most probably) rise in real life:

The AI of today is capable of next to nothing; the problem I think needs addressing is the eventual rise of super-intelligence.


It takes a little getting used to, for me, to view, for example, HTM online-learning mechanisms as a “utility function”. But I believe you’d have all the arguments to convince me that it’s conceptually one and the same as a utility function used for more classic ANNs. So, fair enough; I’d agree with your point.

Now, would you please take that same step in reverse, toward the Jeffy view? What people think of when imagining a “drive” for AI (especially when debating catastrophe scenarios, or together with words such as “values”) is in my mind still very anthropomorphic compared to what’s really at our disposal. A very evolved AGI used to forecast weather, or to provide knowledge as a preceptor, could be very different from an embodied, lifelike android with a drive to survive, right? And so that weather bot could be programmed to follow… say… “curiosity”? Where does curiosity fit in the category of “human values”? Rightfully, nobody would care, if those technologies were perceived that way.

I can agree with that view. Knowing humankind and economics, it is almost ensured that people will give them those resources at some point.
I was simply nitpicking about that “threshold” concept. When the time comes - if it comes - people themselves will give those AGIs the resources to self-improve. It will not be a sudden, unexpected thing crossing a threshold in one of today’s research labs.

That’s why I specified that, most probably, the first AGIs will not be “LISP-like”. They won’t be able to metaprogram themselves if they’re some version of NN. They can simply tweak internal parameters while continuing to follow the same fundamental NN update rules. By doing so they will learn, true. Or improve, if you will, but just as we do. So one such AGI could be very intelligent, why not, but that is not exponential anything. “Exponential” self-improvement would only happen when there can be a change at the meta level. For NN-based AI, that means starting another NN with other rules.

And that’s provided that, apart from a size increase, there are indeed “other rules” which can continually be found that are better than the previous ones, and that your own proposed tradeoff is never a barrier to this.


The super-intelligence does have one drive, and that’s to maximize its utility function. This function can encompass whatever you like, including human values, etc. A lot of people fall into the trap of anthropomorphising, but the way we should think of an AGI is as a very good optimization process.

You need to also consider the AGI’s agency in all of this. Once it has enough resources and power, it’s no longer in human hands. That’s why I said in brackets: “initially”. The AGI will eventually self-sustain without needing humans and will grow on its own.

Why not? It’s equivalent to a human tinkering with another human’s brain; it’s perfectly doable. Whatever method the AGI is constructed with, if it’s smart enough, it can probably invent a better AGI.
My hunch tells me that such improvements (to the algorithm) will quickly run out of steam, and the AGI will then be on the hunt to gather more and more computing resources until it reaches some equilibrium where getting more compute would undermine its final goal.

(P.S. don’t give up on the LISP approach so soon, it may be useful sometime in the future :wink:)


You win - you get to optimize the ONE and ONLY drive: Don’t be evil.


That (was!) Google’s motto.


Why does it need a utility function? I don’t think it has to optimize anything. Everything I’ve read and watched about the control problem assumes it optimizes something, so it’s clearly something smart people think, but it seems ridiculous. For example, let’s say we want to build a high-tech ladder to the moon out of nanobots which build towards the moon. Would you say those nanobots optimize the ladder’s height? You could, but you cannot take that claim and infer that the nanobots will kill anyone who gets in their way.

Intelligence seems a bit different from a ladder to the moon because we are intelligent agents. But intelligence does not necessitate goal making, it just enables it.

I completely agree that the control problem is really difficult for AI with a utility function. But if someone were to tell it to optimize something ad nauseam, I would say they are just trying to make a mechanical god because it sounds like the future society envisions, not because they want to make sure the future is better. Something will inevitably go wrong in the execution of an idealization, because we have limited knowledge, and also because it would massively change society, which is much easier to break than to fix.

If it needs to do things in the world, either give it automatic behavior which doesn’t optimize anything, or tell a superintelligent non-behaving AI to think about the control problem. I don’t see why this is considered a difficult problem. It’s just difficult if you assume it will optimize things.

The control problem distracts from the more important problem, which is how AI will be used and who will use it. AI will take away jobs - slowly at first, but all of them will disappear within decades or less as AI becomes superintelligent. Everyone will need to be cared for, which requires a level of cooperation and selflessness society is not capable of right now. Even if equality is not a problem, people might still not be happy. We’ll need to make the transition flawlessly.

Personally, I would get really sad in some ways if we suddenly had massive abundance from AI. Most people want to do slightly difficult things. I’m not talking about scientific accomplishments or anything like that. For example, if I had the option to be told how to make perfect microwave nachos, they’d be pretty good, but making better than usual nachos makes me happy, whereas being handed perfect nachos just gives pleasure. If, given the option, I chose not to be told how to make perfect nachos, trying to make good nachos would feel pointless.

It’d be absolutely fine if I didn’t have to think about some things, but having the option not to have to think about anything would be soul crushing.


This is not a new problem and the assessment of “technological unemployment” in 1930 was not encouraging.
I don’t see much changing from then to now.