Rationality- From AI to Zombies

Page 126

by Eliezer Yudkowsky

But if you discard every procedure that evolution gave you and all its products, then you discard your whole brain. You discard everything that could potentially recognize morality when it sees it. You discard everything that could potentially respond to moral arguments by updating your morality. You even unwind past the unwinder: you discard the intuitions underlying your conclusion that you can’t trust evolution to be moral. It is your existing moral intuitions that tell you that evolution doesn’t seem like a very good source of morality. What, then, will the words “right” and “should” and “better” even mean?

Humans do not perfectly recognize truth when they see it, and hunter-gatherers do not have an explicit concept of the Bayesian criterion of evidence. But all our science and all our probability theory was built on top of a chain of appeals to our instinctive notion of “truth.” Had this core been flawed, there would have been nothing we could do in principle to arrive at the present notion of science; the notion of science would have just sounded completely unappealing and pointless.

One of the arguments that might have shaken my teenage self out of his mistake, if I could have gone back in time to argue with him, was the question:

Could there be some morality, some given rightness or wrongness, that human beings do not perceive, do not want to perceive, will not see any appealing moral argument for adopting, nor any moral argument for adopting a procedure that adopts it, et cetera? Could there be a morality, and ourselves utterly outside its frame of reference? But then what makes this thing morality—rather than a stone tablet somewhere with the words “Thou shalt murder” written on them, with absolutely no justification offered?

So all this suggests that you should be willing to accept that you might know a little about morality. Nothing unquestionable, perhaps, but an initial state with which to start questioning yourself. Baked into your brain but not explicitly known to you, perhaps; but still, that which your brain would recognize as right is what you are talking about. You will accept at least enough of the way you respond to moral arguments as a starting point to identify “morality” as something to think about.

But that’s a rather large step.

It implies accepting your own mind as identifying a moral frame of reference, rather than all morality being a great light shining from beyond (that in principle you might not be able to perceive at all). It implies accepting that even if there were a light and your brain decided to recognize it as “morality,” it would still be your own brain that recognized it, and you would not have evaded causal responsibility—or evaded moral responsibility either, on my view.

It implies dropping the notion that a ghost of perfect emptiness will necessarily agree with you, because the ghost might occupy a different moral frame of reference, respond to different arguments, be asking a different question when it computes what-to-do-next.

And if you’re willing to bake at least a few things into the very meaning of this topic of “morality,” this quality of rightness that you are talking about when you talk about “rightness”—if you’re willing to accept even that morality is what you argue about when you argue about “morality”—then why not accept other intuitions, other pieces of yourself, into the starting point as well?

Why not accept that, ceteris paribus, joy is preferable to sorrow?

You might later find some ground within yourself or built upon yourself with which to criticize this—but why not accept it for now? Not just as a personal preference, mind you; but as something baked into the question you ask when you ask “What is truly right”?

But then you might find that you know rather a lot about morality! Nothing certain—nothing unquestionable—nothing unarguable—but still, quite a bit of information. Are you willing to relinquish your Socratic ignorance?

I don’t argue by definitions, of course. But if you claim to know nothing at all about morality, then you will have problems with the meaning of your words, not just their plausibility.

*

1. Rorty, “Out of the Matrix: How the Late Philosopher Donald Davidson Showed That Reality Can’t Be an Illusion.”

273

Morality as Fixed Computation

Toby Ord commented:

Eliezer, I’ve just reread your article and was wondering if this is a good quick summary of your position (leaving apart how you got to it):

“I should X” means that I would attempt to X were I fully informed.

Toby’s a pro, so if he didn’t get it, I’d better try again. Let me try a different tack of explanation—one closer to the historical way that I arrived at my own position.

Suppose you build an AI, and—leaving aside that AI goal systems cannot be built around English statements, and all such descriptions are only dreams—you try to infuse the AI with the action-determining principle, “Do what I want.”

And suppose you get the AI design close enough—it doesn’t just end up tiling the universe with paperclips, cheesecake or tiny molecular copies of satisfied programmers—that its utility function actually assigns utilities as follows, to the world-states we would describe in English as:

: +20

: 0

: 0

: +60

You perceive, of course, that this destroys the world.

. . . since if the programmer initially weakly wants “X” and X is hard to obtain, the AI will modify the programmer to strongly want “Y,” which is easy to create, and then bring about lots of Y. The referent of “Y ” might be, say, iron atoms—those are highly stable.

Can you patch this problem? No. As a general rule, it is not possible to patch flawed Friendly AI designs.

If you try to bound the utility function, or make the AI not care about how much the programmer wants things, the AI still has a motive (as an expected utility maximizer) to make the programmer want something that can be obtained with a very high degree of certainty.

If you try to make it so that the AI can’t modify the programmer, then the AI can’t talk to the programmer (talking to someone modifies them).

If you try to rule out a specific class of ways the AI could modify the programmer, the AI has a motive to superintelligently seek out loopholes and ways to modify the programmer indirectly.

As a general rule, it is not possible to patch flawed FAI designs.

We, ourselves, do not imagine the future and judge that any future in which our brains want something, and that thing exists, is a good future. If we did think this way, we would say: “Yay! Go ahead and modify us to strongly want something cheap!” But we do not say this, which means that this AI design is fundamentally flawed: it will choose things very unlike what we would choose; it will judge desirability very differently from how we judge it. This core disharmony cannot be patched by ruling out a handful of specific failure modes.

There’s also a duality between Friendly AI problems and moral philosophy problems—though you’ve got to structure that duality in exactly the right way. So if you prefer, the core problem is that the AI will choose in a way very unlike the structure of what is, y’know, actually right—never mind the way we choose. Isn’t the whole point of this problem that merely wanting something doesn’t make it right?

So this is the paradoxical-seeming issue which I have analogized to the difference between:

A calculator that, when you press “2,” “+,” and “3,” tries to compute:

“What is 2 + 3?”

A calculator that, when you press “2,” “+,” and “3,” tries to compute:

“What does this calculator output when you press ‘2,’ ‘+,’ and ‘3’?”

The Type 1 calculator, as it were, wants to output 5.

The Type 2 “calculator” could return any result; and in the act of returning that result, it becomes the co
rrect answer to the question that was internally asked.

We ourselves are like unto the Type 1 calculator. But the putative AI is being built as though it were to reflect the Type 2 calculator.

Now imagine that the Type 1 calculator is trying to build an AI, only the Type 1 calculator doesn’t know its own question. The calculator continually asks the question by its very nature—it was born to ask that question, created already in motion around that question—but the calculator has no insight into its own transistors; it cannot print out the question, which is extremely complicated and has no simple approximation.

So the calculator wants to build an AI (it’s a pretty smart calculator, it just doesn’t have access to its own transistors) and have the AI give the right answer. Only the calculator can’t print out the question. So the calculator wants to have the AI look at the calculator, where the question is written, and answer the question that the AI will discover implicit in those transistors. But this cannot be done by the cheap shortcut of a utility function that says “All X: { calculator asks ‘X?,’ answer X}: utility 1; else: utility 0” because that actually mirrors the utility function of a Type 2 calculator, not a Type 1 calculator.

This gets us into FAI issues that I am not going into (some of which I’m still working out myself).

However, when you back out of the details of FAI design, and swap back to the perspective of moral philosophy, then what we were just talking about was the dual of the moral issue: “But if what’s ‘right’ is a mere preference, then anything that anyone wants is ‘right.’”

The key notion is the idea that what we name by “right” is a fixed question, or perhaps a fixed framework. We can encounter moral arguments that modify our terminal values, and even encounter moral arguments that modify what we count as a moral argument; nonetheless, it all grows out of a particular starting point. We do not experience ourselves as embodying the question “What will I decide to do?” which would be a Type 2 calculator; anything we decided would thereby become right. We experience ourselves as asking the embodied question: “What will save my friends, and my people, from getting hurt? How can we all have more fun? . . .” where the “. . .” is around a thousand other things.

So “I should X” does not mean that I would attempt to X were I fully informed.

“I should X” means that X answers the question, “What will save my people? How can we all have more fun? How can we get more control over our own lives? What’s the funniest jokes we can tell? . . .”

And I may not know what this question is, actually; I may not be able to print out my current guess nor my surrounding framework; but I know, as all non-moral-relativists instinctively know, that the question surely is not just “How can I do whatever I want?”

When these two formulations begin to seem as entirely distinct as “snow” and snow, then you shall have created distinct buckets for the quotation and the referent.

*

274

Magical Categories

We can design intelligent machines so their primary, innate emotion is unconditional love for all humans. First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.

—Bill Hibbard (2001), Super-Intelligent Machines1

That was published in a peer-reviewed journal, and the author later wrote a whole book about it, so this is not a strawman position I’m discussing here.

So . . . um . . . what could possibly go wrong . . .

When I mentioned (sec. 7.2)2 that Hibbard’s AI ends up tiling the galaxy with tiny molecular smiley-faces, Hibbard wrote an indignant reply saying:

When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of “human facial expressions, human voices and human body language” (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by “tiny molecular pictures of smiley-faces.” You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.

As Hibbard also wrote “Such obvious contradictory assumptions show Yudkowsky’s preference for drama over reason,” I’ll go ahead and mention that Hibbard illustrates a key point: There is no professional certification test you have to take before you are allowed to talk about AI morality. But that is not my primary topic today. Though it is a crucial point about the state of the gameboard that most AGI/FAI wannabes are so utterly unsuited to the task that I know no one cynical enough to imagine the horror without seeing it firsthand. Even Michael Vassar was probably surprised his first time through.

No, today I am here to dissect “You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.”

Once upon a time—I’ve seen this story in several versions and several places, sometimes cited as fact, but I’ve never tracked down an original source—once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.

The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set—output “yes” for the 50 photos of camouflaged tanks, and output “no” for the 50 photos of forest.

Now this did not prove, or even imply, that new examples would be classified correctly. The neural network might have “learned” 100 special cases that wouldn’t generalize to new problems. Not, “camouflaged tanks versus forest,” but just, “photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive . . .”

But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed!

The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos.

It turned out that in the researchers’ data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest.

This parable—which might or might not be fact—illustrates one of the most fundamental problems in the field of supervised learning and in fact the whole field of Artificial Intelligence: If the training problems and the real problems have the slightest difference in context—if they are not drawn from the same independently identically distributed process—there is no statistical guarantee from past success to future success. It doesn’t matter if the AI seems to be working great under the training conditions. (This is not an unsolvable problem but it is an unpatchable problem. There are deep ways to address it—a topic beyond the scope of this essay—but no bandaids.)

As described in Superexponential Conceptspace, there are exponentially more possible concepts than possible objects, just as the number of possible objects is exponential in the number of attributes. If a black-and-white image is 256 pixels on a side, then the total image is 65,536 pixels. The number of possible images is 265,536. And the number of possible concepts that classify images into positive and negative instances—the number of possible boundaries you could draw in the space of images—is 2265,536. From this, we see that even supervised learning is almost entirely a matter of inductive bias, without which it would take a minimum of 265
,536 classified examples to discriminate among 2265,536 possible concepts—even if classifications are constant over time.

So let us now turn again to:

First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.

and

When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of “human facial expressions, human voices and human body language” (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by “tiny molecular pictures of smiley-faces.” You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.

It’s trivial to discriminate a photo of a picture with a camouflaged tank, and a photo of an empty forest, in the sense of determining that the two photos are not identical. They’re different pixel arrays with different 1s and 0s in them. Discriminating between them is as simple as testing the arrays for equality.

Classifying new photos into positive and negative instances of “smile,” by reasoning from a set of training photos classified positive or negative, is a different order of problem.

When you’ve got a 256×256 image from a real-world camera, and the image turns out to depict a camouflaged tank, there is no additional 65,537th bit denoting the positiveness—no tiny little XML tag that says “This image is inherently positive.” It’s only a positive example relative to some particular concept.

‹ Prev Next ›