
Rationality: From AI to Zombies


by Eliezer Yudkowsky


  In this case the word “wiggin” does not help describe reality more compactly—it is not defined by someone sending the shortest message—it has no role in the simplest explanation. Equivalently, the word “wiggin” will be of no help to you in doing any Bayesian inference. Even if you do not call the word a lie, it is surely an error.

  And the way to carve reality at its joints is to draw your boundaries around concentrations of unusually high probability density in Thingspace.

  *

  176

  Superexponential Conceptspace, and Simple Words

  Thingspace, you might think, is a rather huge space. Much larger than reality, for where reality only contains things that actually exist, Thingspace contains everything that could exist.

  Actually, the way I “defined” Thingspace to have dimensions for every possible attribute—including correlated attributes like density and volume and mass—Thingspace may be too poorly defined to have anything you could call a size. But it’s important to be able to visualize Thingspace anyway. Surely, no one can really understand a flock of sparrows if all they see is a cloud of flapping cawing things, rather than a cluster of points in Thingspace.

  But as vast as Thingspace may be, it doesn’t hold a candle to the size of Conceptspace.

  “Concept,” in machine learning, means a rule that includes or excludes examples. If you see the data {2:+, 3:-, 14:+, 23:-, 8:+, 9:-} then you might guess that the concept was “even numbers.” There is a rather large literature (as one might expect) on how to learn concepts from data . . . given random examples, given chosen examples . . . given possible errors in classification . . . and most importantly, given different spaces of possible rules.
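  To make that concrete, here is a minimal sketch in Python (the names are my own, not drawn from the literature): a concept is just a rule, and it "fits" the data if it reproduces every classification.

    # The data from above, and one candidate concept: "even numbers."
    data = {2: True, 3: False, 14: True, 23: False, 8: True, 9: False}

    def is_even(x):
        return x % 2 == 0

    # The concept fits the data if it reproduces every classification.
    print(all(is_even(x) == label for x, label in data.items()))  # True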

  Suppose, for example, that we want to learn the concept “good days on which to play tennis.” The possible attributes of Days are

  Sky: {Sunny, Cloudy, Rainy}

  AirTemp: {Warm, Cold}

  Humidity: {Normal, High}

  Wind: {Strong, Weak}.

  We’re then presented with the following data, where + indicates a positive example of the concept, and - indicates a negative classification:

  + Sky: Sunny; AirTemp: Warm; Humidity: High; Wind: Strong.

  - Sky: Rainy; AirTemp: Cold; Humidity: High; Wind: Strong.

  + Sky: Sunny; AirTemp: Warm; Humidity: High; Wind: Weak.

  What should an algorithm infer from this?

  A machine learner might represent one concept that fits this data as follows:

  {Sky: ?; AirTemp: Warm; Humidity: High; Wind: ?}.

  In this format, to determine whether this concept accepts or rejects an example, we compare element-by-element: ? accepts anything, but a specific value accepts only that specific value.

  So the concept above will accept only Days with AirTemp = Warm and Humidity = High, but the Sky and the Wind can take on any value. This fits both the negative and the positive classifications in the data so far—though it isn’t the only concept that does so.

  We can also simplify the above concept representation to

  {?, Warm, High, ?}.
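  As a sketch of that element-by-element comparison (Python again; the helper name is my own):

    # A concept is a tuple over (Sky, AirTemp, Humidity, Wind);
    # "?" accepts any value, a specific value accepts only itself.
    def matches(concept, example):
        return all(c == "?" or c == e for c, e in zip(concept, example))

    concept = ("?", "Warm", "High", "?")
    print(matches(concept, ("Sunny", "Warm", "High", "Strong")))  # True  (positive example)
    print(matches(concept, ("Rainy", "Cold", "High", "Strong")))  # False (negative example)
    print(matches(concept, ("Sunny", "Warm", "High", "Weak")))    # True  (positive example)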

  Without going into details, the classic algorithm would be:

  Maintain the set of the most general hypotheses that fit the data—those that positively classify as many examples as possible, while still fitting the facts.

  Maintain another set of the most specific hypotheses that fit the data—those that negatively classify as many examples as possible, while still fitting the facts.

  Each time we see a new negative example, we strengthen all the most general hypotheses as little as possible, so that the new set is again as general as possible while fitting the facts.

  Each time we see a new positive example, we relax all the most specific hypotheses as little as possible, so that the new set is again as specific as possible while fitting the facts.

  We continue until we have only a single hypothesis left. This will be the answer if the target concept was in our hypothesis space at all.

  In the case above, the set of most general hypotheses would be

  {{?, Warm, ?, ?},{Sunny, ?, ?, ?}},

  while the set of most specific hypotheses contains the single member {Sunny, Warm, High, ?}.

  Any other concept you can find that fits the data will be strictly more specific than one of the most general hypotheses, and strictly more general than the most specific hypothesis.

  (For more on this, I recommend Tom Mitchell’s Machine Learning, from which this example was adapted.1)
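  One way to see where those boundary sets come from, a brute-force sketch rather than the candidate-elimination algorithm itself, is to enumerate every concept in this representation, keep those consistent with the three examples, and pick out the maximally general and maximally specific survivors:

    from itertools import product

    # Attribute values for Sky, AirTemp, Humidity, Wind.
    ATTRS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"),
             ("Normal", "High"), ("Strong", "Weak")]

    def matches(concept, example):
        # "?" accepts any value; a specific value accepts only itself.
        return all(c == "?" or c == e for c, e in zip(concept, example))

    data = [(("Sunny", "Warm", "High", "Strong"), True),
            (("Rainy", "Cold", "High", "Strong"), False),
            (("Sunny", "Warm", "High", "Weak"), True)]

    # The 108 concepts expressible in this representation (the 109th,
    # "reject everything," is eliminated by the first positive example).
    concepts = list(product(*(values + ("?",) for values in ATTRS)))
    consistent = [c for c in concepts
                  if all(matches(c, ex) == label for ex, label in data)]

    # c1 is at least as general as c2 if c1 accepts everything c2 accepts.
    def as_general(c1, c2):
        return all(a == "?" or a == b for a, b in zip(c1, c2))

    G = [c for c in consistent
         if not any(d != c and as_general(d, c) for d in consistent)]
    S = [c for c in consistent
         if not any(d != c and as_general(c, d) for d in consistent)]

    print(sorted(G))  # [('?', 'Warm', '?', '?'), ('Sunny', '?', '?', '?')]
    print(sorted(S))  # [('Sunny', 'Warm', 'High', '?')]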

  Now you may notice that the format above cannot represent all possible concepts. E.g., “Play tennis when the sky is sunny or the air is warm.” That fits the data, but in the concept representation defined above, there’s no quadruplet of values that describes the rule.

  Clearly our machine learner is not very general. Why not allow it to represent all possible concepts, so that it can learn with the greatest possible flexibility?

  Days are composed of these four variables, one variable with 3 values and three variables with 2 values. So there are 3 × 2 × 2 × 2 = 24 possible Days that we could encounter.

  The format given for representing Concepts allows us to require any of these values for a variable, or leave the variable open. So there are 4 × 3 × 3 × 3 = 108 concepts in that representation. For the most-general/most-specific algorithm to work, we need to start with the most specific hypothesis “no example is ever positively classified.” If we add that, it makes a total of 109 concepts.

  Is it suspicious that there are more possible concepts than possible Days? Surely not: After all, a concept can be viewed as a collection of Days. A concept can be viewed as the set of days that it classifies positively, or isomorphically, the set of days that it classifies negatively.

  So the space of all possible concepts that classify Days is the set of all possible sets of Days, whose size is 2^24 = 16,777,216.
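  For the suspicious reader, those counts are quick to check:

    print(3 * 2 * 2 * 2)         # 24 possible Days
    print(4 * 3 * 3 * 3 + 1)     # 109 concepts in the value-or-? representation
    print(2 ** (3 * 2 * 2 * 2))  # 16777216 arbitrary sets of Days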

  This complete space includes all the concepts we have discussed so far. But it also includes concepts like “Positively classify only the examples {Sunny, Warm, High, Strong} and {Sunny, Warm, High, Weak} and reject everything else” or “Negatively classify only the example {Rainy, Cold, High, Strong} and accept everything else.” It includes concepts with no compact representation, just a flat list of what is and isn’t allowed.

  That’s the problem with trying to build a “fully general” inductive learner: it can’t learn concepts until it has seen every possible example in the instance space.

  If we add on more attributes to Days—like the Water temperature, or the Forecast for tomorrow—then the number of possible days will grow exponentially in the number of attributes. But this isn’t a problem with our restricted concept space, because you can narrow down a large space using a logarithmic number of examples.

  Let’s say we add the Water: {Warm, Cold} attribute to days, which will make for 48 possible Days and 325 possible concepts. Let’s say that each Day we see is, usually, classified positive by around half of the currently plausible concepts, and classified negative by the other half. Then when we learn the actual classification of the example, it will cut the space of compatible concepts in half. So it might only take 9 examples (2^9 = 512) to narrow 325 possible concepts down to one.

  Even if Days had forty binary attributes, it should still only take a manageable amount of data to narrow down the possible concepts to one. Sixty-four examples, if each example is classified positive by half the remaining concepts. Assuming, of course, that the actual rule is one we can represent at all!
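  A back-of-the-envelope check of those numbers, assuming the same value-or-? representation throughout:

    import math

    # Five attributes (Sky has 3 values, the other four have 2 each):
    print(3 * 2 * 2 * 2 * 2)            # 48 possible Days
    print(4 * 3 * 3 * 3 * 3 + 1)        # 325 representable concepts
    print(math.ceil(math.log2(325)))    # 9 examples, if each one halves the set

    # Forty binary attributes, same representation: 3**40 + 1 concepts.
    print(math.ceil(math.log2(3 ** 40 + 1)))  # 64 examples suffice
    print(2 ** 40)                            # ~1.1 trillion possible examples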

  If you want to think of all the possibilities, well, good luck with that. The space of all possible concepts grows superexponentially in the number of attributes.

  By the time you’re talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power. To narrow down that superexponential concept space, you’d have to see over a trillion examples before you could say what was In, and what was Out. You’d have to see every possible example, in fact.

  That’s with forty binary attributes, mind you. Forty bits, or 5 bytes, to be classified simply “Yes” or “No.” Forty bits implies 2^40 possible examples, and 2^(2^40) possible concepts that classify those examples as positive or negative.

  So, here in the real world, where objects take more than 5 bytes to describe and a trillion examples are not available and there is noise in the training data, we only even think about highly regular concepts. A human mind—or the whole observable universe—is not nearly large enough to consider all the other hypotheses.

  From this perspective, learning doesn’t just rely on inductive bias, it is nearly all inductive bias—when you compare the number of concepts ruled out a priori, to those ruled out by mere evidence.

  But what has this (you inquire) to do with the proper use of words?

  It’s the whole reason that words have intensions as well as extensions.

  In the last essay, I concluded:

  The way to carve reality at its joints is to draw boundaries around concentrations of unusually high probability density.

  I deliberately left out a key qualification in that (slightly edited) statement, because I couldn’t explain it until now. A better statement would be:

  The way to carve reality at its joints, is to draw simple boundaries around concentrations of unusually high probability density in Thingspace.

  Otherwise you would just gerrymander Thingspace. You would create really odd noncontiguous boundaries that collected the observed examples, examples that couldn’t be described in any shorter message than your observations themselves, and say: “This is what I’ve seen before, and what I expect to see more of in the future.”

  In the real world, nothing above the level of molecules repeats itself exactly. Socrates is shaped a lot like all those other humans who were vulnerable to hemlock, but he isn’t shaped exactly like them. So your guess that Socrates is a “human” relies on drawing simple boundaries around the human cluster in Thingspace. Rather than, “Things shaped exactly like [5-megabyte shape specification 1] and with [lots of other characteristics], or exactly like [5-megabyte shape specification 2] and [lots of other characteristics], . . . , are human.”

  If you don’t draw simple boundaries around your experiences, you can’t do inference with them. So you try to describe “art” with intensional definitions like “that which is intended to inspire any complex emotion for the sake of inspiring it,” rather than just pointing at a long list of things that are, or aren’t art.

  In fact, the above statement about “how to carve reality at its joints” is a bit chicken-and-eggish: You can’t assess the density of actual observations until you’ve already done at least a little carving. And the probability distribution comes from drawing the boundaries, not the other way around—if you already had the probability distribution, you’d have everything necessary for inference, so why would you bother drawing boundaries?

  And this suggests another—yes, yet another—reason to be suspicious of the claim that “you can define a word any way you like.” When you consider the superexponential size of Conceptspace, it becomes clear that singling out one particular concept for consideration is an act of no small audacity—not just for us, but for any mind of bounded computing power.

  Presenting us with the word “wiggin,” defined as “a black-haired green-eyed person,” without some reason for raising this particular concept to the level of our deliberate attention, is rather like a detective saying: “Well, I haven’t the slightest shred of support one way or the other for who could’ve murdered those orphans . . . not even an intuition, mind you . . . but have we considered John Q. Wiffleheim of 1234 Norkle Rd as a suspect?”

  *

  1. Tom M. Mitchell, Machine Learning (McGraw-Hill Science/Engineering/Math, 1997).

  177

  Conditional Independence, and Naive Bayes

  Previously I spoke of mutual information between X and Y, written I(X;Y), which is the difference between the sum of the entropies of the marginal distributions, H(X) + H(Y), and the entropy of the joint probability distribution, H(X,Y).

  I gave the example of a variable X, having eight states, X1 through X8, which are all equally probable if we have not yet encountered any evidence; and a variable Y, with states Y1 through Y4, which are all equally probable if we have not yet encountered any evidence. Then if we calculate the marginal entropies H(X) and H(Y), we will find that X has 3 bits of entropy, and Y has 2 bits.

  However, we also know that X and Y are both even or both odd; and this is all we know about the relation between them. So for the joint distribution (X,Y) there are only 16 possible states, all equally probable, for a joint entropy of 4 bits. This is a 1-bit entropy defect, compared to 5 bits of entropy if X and Y were independent. This entropy defect is the mutual information—the information that X tells us about Y, or vice versa, so that we are not as uncertain about one after having learned the other.
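  If you want that arithmetic spelled out, here is a small sketch (Python; the distribution is the uniform one just described, and the helper names are my own):

    import itertools, math

    # X has 8 equiprobable states, Y has 4; the only constraint is that
    # X and Y are both even or both odd, so 16 joint states are possible.
    joint = {(x, y): 1 / 16
             for x, y in itertools.product(range(1, 9), range(1, 5))
             if x % 2 == y % 2}

    def entropy(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def marginal(dist, index):
        out = {}
        for outcome, p in dist.items():
            out[outcome[index]] = out.get(outcome[index], 0) + p
        return out

    H_X, H_Y, H_XY = entropy(marginal(joint, 0)), entropy(marginal(joint, 1)), entropy(joint)
    print(H_X, H_Y, H_XY)    # 3.0 2.0 4.0
    print(H_X + H_Y - H_XY)  # 1.0 -- the mutual information I(X;Y)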

  Suppose, however, that there exists a third variable Z. The variable Z has two states, “even” and “odd,” perfectly correlated to the evenness or oddness of (X,Y). In fact, we’ll suppose that Z is just the question “Are X and Y even or odd?”

  If we have no evidence about X and Y, then Z itself necessarily has 1 bit of entropy on the information given. There is 1 bit of mutual information between Z and X, and 1 bit of mutual information between Z and Y. And, as previously noted, 1 bit of mutual information between X and Y. So how much entropy for the whole system (X,Y,Z)? You might naively expect that

  H(X,Y,Z) = H(X) + H(Y) + H(Z) - I(X;Z) - I(Z;Y) - I(X;Y),

  but this turns out not to be the case.

  The joint system (X,Y,Z) only has 16 possible states—since Z is just the question “Are X and Y even or odd?”—so H(X,Y,Z) = 4 bits.

  But if you calculate the formula just given, you get

  (3 + 2 + 1 - 1 - 1 - 1) bits = 3 bits = WRONG!

  Why? Because if you have the mutual information between X and Z, and the mutual information between Z and Y, that may include some of the same mutual information that we’ll calculate exists between X and Y. In this case, for example, knowing that X is even tells us that Z is even, and knowing that Z is even tells us that Y is even, but this is the same information that X would tell us about Y. We double-counted some of our knowledge, and so came up with too little entropy.

  The correct formula is (I believe):

  H(X,Y,Z) = H(X) + H(Y) + H(Z) - I(X;Z) - I(Z;Y) - I(X;Y|Z).

  Here the last term, I(X;Y|Z), means, “the information that X tells us about Y, given that we already know Z.” In this case, X doesn’t tell us anything about Y, given that we already know Z, so the term comes out as zero—and the equation gives the correct answer. There, isn’t that nice?

  “No,” you correctly reply, “for you have not told me how to calculate I(X;Y|Z), only given me a verbal argument that it ought to be zero.”

  We calculate I(X;Y|Z) just the way you would expect. We know I(X;Y) = H(X) + H(Y) - H(X,Y), so

  I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z).

  And now, I suppose, you want to know how to calculate the conditional entropy? Well, the original formula for the entropy is

  H(S) = ∑_i -P(S_i) log2(P(S_i)).

  If we then learned a new fact Z_0, our remaining uncertainty about S would be

  H(S|Z_0) = ∑_i -P(S_i|Z_0) log2(P(S_i|Z_0)).

  So if we’re going to learn a new fact Z, but we don’t know which Z yet, then, on average, we expect to be around this uncertain of S afterward:

  H(S|Z) = ∑_j P(Z_j) ∑_i -P(S_i|Z_j) log2(P(S_i|Z_j)).

  And that’s how one calculates conditional entropies; from which, in turn, we can get the conditional mutual information.
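  And here, as an illustrative computation rather than a proof, is the whole even/odd example extended with Z, checking both the naive formula and the corrected one (Python; the coordinate indices and helper names are my own):

    import itertools, math

    # Same distribution as before, now with Z = "are X and Y even or odd?"
    joint = {(x, y, "even" if x % 2 == 0 else "odd"): 1 / 16
             for x, y in itertools.product(range(1, 9), range(1, 5))
             if x % 2 == y % 2}

    def entropy(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def H(*indices):  # joint entropy of the chosen coordinates
        out = {}
        for outcome, p in joint.items():
            key = tuple(outcome[i] for i in indices)
            out[key] = out.get(key, 0) + p
        return entropy(out)

    def I(a, b):      # mutual information I(A;B)
        return H(a) + H(b) - H(a, b)

    X, Y, Z = 0, 1, 2
    # I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z), with H(A|Z) = H(A,Z) - H(Z)
    I_XY_given_Z = (H(X, Z) - H(Z)) + (H(Y, Z) - H(Z)) - (H(X, Y, Z) - H(Z))

    print(H(X, Y, Z))                                             # 4.0
    print(H(X) + H(Y) + H(Z) - I(X, Z) - I(Z, Y) - I(X, Y))       # 3.0 (the naive, wrong formula)
    print(H(X) + H(Y) + H(Z) - I(X, Z) - I(Z, Y) - I_XY_given_Z)  # 4.0 (the corrected formula)
    print(I_XY_given_Z)                                           # 0.0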

  There are all sorts of ancillary theorems here, like

  H(X|Y) = H(X,Y) - H(Y)

  and

  if I(X;Z) = 0 and I(Y;X|Z) = 0 then I(X;Y) = 0,

  but I’m not going to go into those.

  “But,” you ask, “what does this have to do with the nature of words and their hidden Bayesian structure?”

  I am just so unspeakably glad that you asked that question, because I was planning to tell you whether you liked it or not. But first there are a couple more preliminaries.

  You will remember—yes, you will remember—that there is a duality between mutual information and Bayesian evidence. Mutual information is positive if and only if the probability of at least some joint events P(x,y) does not equal the product of the probabilities of the separate events P(x)P(y). This, in turn, is exactly equivalent to the condition that Bayesian evidence exists between x and y:

  I(X;Y) > 0 ⇒

  P(x,y) ≠ P(x)P(y)

  P(x,y) / P(y) ≠ P(x)

  P(x|y) ≠ P(x).

  If you’re conditioning on Z, you just adjust the whole derivation accordingly:

  I(X;Y|Z) > 0 ⇒

  P(x,y|z) ≠ P(x|z)P(y|z)

  P(x,y|z) / P(y|z) ≠ P(x|z)

  (P(x,y,z)/P(z)) / (P(y,z)/P(z)) ≠ P(x|z)

  P(x,y,z) / P(y,z) ≠ P(x|z)

  P(x|y,z) ≠ P(x|z).

  Which last line reads “Even knowing Z, learning Y still changes our beliefs about X.”

  Conversely, as in our original case of Z being “even” or “odd,” Z screens off X from Y—that is, if we know that Z is “even,” learning that Y is in state Y4 tells us nothing more about whether X is X2, X4, X6, or X8. Or if we know that Z is “odd,” then learning that X is X5 tells us nothing more about whether Y is Y1 or Y3. Learning Z has rendered X and Y conditionally independent.
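  To see the screening-off numerically, using the same uniform distribution (a sketch; the predicate helpers are my own): conditioning on Z = even, the probability of any particular even X is 1/4 whether or not we also learn Y.

    import itertools

    # Uniform over the 16 allowed (x, y) pairs; z is determined by parity.
    states = [(x, y, "even" if x % 2 == 0 else "odd")
              for x, y in itertools.product(range(1, 9), range(1, 5))
              if x % 2 == y % 2]

    def P(pred):  # probability of a predicate under the uniform distribution
        return sum(1 for s in states if pred(s)) / len(states)

    def P_cond(pred, given):  # conditional probability P(pred | given)
        return P(lambda s: pred(s) and given(s)) / P(given)

    # P(X = 2 | Z = even) versus P(X = 2 | Z = even, Y = 4): Y adds nothing.
    print(P_cond(lambda s: s[0] == 2, lambda s: s[2] == "even"))                # 0.25
    print(P_cond(lambda s: s[0] == 2, lambda s: s[2] == "even" and s[1] == 4))  # 0.25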

  Conditional independence is a hugely important concept in probability theory—to cite just one example, without conditional independence, the universe would have no structure.

  Here, though, I only intend to talk about one particular kind of conditional independence—the case of a central variable that screens off other variables surrounding it, like a central body with tentacles.

  Let there be five variables U, V, W, X, and Y; and moreover, suppose that for every pair of these variables, one variable is evidence about the other. If you select U and W, for example, then learning U = U1 will tell you something you didn’t know before about the probability that W = W1.

 
