If we wrote down the function for the payoff, it would be Payoff = P(winner), where P(winner) is the amount of play money you bet on the winning color on that round. If we wrote down the function for the expected payoff given that Payoff rule, it would be:
Expected payoff = ∑ P(color) × F(color), summed over all colors.
P(color) is the amount of play money you bet on a color, and F(color) is the frequency with which that color wins.
Suppose that the actual frequencies of the lights are 30% blue, 20% green, and 50% red. And suppose that on each round I bet 40% on blue, 50% on green, and 10% on red. I would get 40 cents 30% of the time, 50 cents 20% of the time, and 10 cents 50% of the time, for an average payoff of $0.12 + $0.10 + $0.05 or $0.27. That is:
P(color) = play money assigned to that color
F(color) = frequency with which that color wins
Payoff = P(winner) = amount of play money allocated to winning color.
Actual frequencies of winning:
F(blue) = 30%
F(green) = 20%
F(red) = 50%.
In the long run, red wins 50% of the time, green wins 20% of the time, and blue wins 30% of the time. So our average payoff on each round is 50% of the payoff if red wins, plus 20% of the payoff if green wins, plus 30% of the payoff if blue wins.
The payoff is a function of the winning color and the betting scheme. We want to compute the average payoff, given a betting scheme and the frequencies at which each color wins. The mathematical term for this kind of computation, taking a function of each case and weighting it by the frequency of that case, is an expectation. Thus, to compute our expected payoff we would calculate:
= P(blue) × F(blue) + P(green) × F(green) + P(red) × F(red)
= $0.40 × 30% + $0.50 × 20% + $0.10 × 50%
= $0.12 + $0.10 + $0.05
= $0.27.
With this betting scheme I’ll win, on average, around 27 cents per round.
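For readers who prefer to check the arithmetic in code, here is a minimal Python sketch (the names `expected_payoff`, `bets`, and `frequencies` are just my illustration) that reproduces the 27-cent figure:

```python
# Frequencies with which each color wins, and the arbitrary betting scheme above.
frequencies = {"blue": 0.30, "green": 0.20, "red": 0.50}
bets = {"blue": 0.40, "green": 0.50, "red": 0.10}  # play-money dollars

def expected_payoff(bets, frequencies):
    """Average payoff per round: the sum over colors of P(color) * F(color)."""
    return sum(bets[color] * frequencies[color] for color in frequencies)

print(expected_payoff(bets, frequencies))  # approximately 0.27, i.e. 27 cents per round
```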
I allocated my play money in a grossly arbitrary way, and the question arises: Can I increase my expected payoff by allocating my play money more wisely? Given the scoring rule provided, I maximize my expected payoff by allocating my entire dollar to red. Despite my expected payoff of 50 cents per round, the light might actually flash green, blue, blue, green, green and I would receive an actual payoff of zero. However, the chance of the light’s coming up non-red on five successive rounds is approximately 3%. Compare the red/blue card game in Lawful Uncertainty.
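Two quick checks of the claims in this paragraph, again as an illustrative sketch rather than anything from the original text:

```python
frequencies = {"blue": 0.30, "green": 0.20, "red": 0.50}

# Expected payoff from betting the entire dollar on red: $1.00 * 50%.
print(1.00 * frequencies["red"])      # 0.5, i.e. 50 cents per round

# Chance that the light comes up non-red on five successive rounds.
print((1 - frequencies["red"]) ** 5)  # 0.03125, roughly 3%
```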
A proper scoring rule is a rule for scoring bets so that you maximize your expected payoff by betting play money that exactly equals the chance of that color flashing. We want a scoring rule so that if the lights actually flash at the frequencies 30% blue, 20% green, and 50% red, you can maximize your average payoff only by betting 30 cents on blue, 20 cents on green, and 50 cents on red. A proper scoring rule is one that forces your optimal bet to exactly report your estimate of the probabilities. (This is also sometimes known as a strictly proper scoring rule.) As we’ve seen, not all scoring rules have this property; and if you invent a plausible-sounding scoring rule at random, it probably won’t have the property.
One rule with this proper property is to pay a dollar minus the squared error of the bet, rather than the bet itself—if you bet 30 cents on the winning light, your error would be 70 cents, your squared error would be 49 cents (0.7² = 0.49), and a dollar minus your squared error would be 51 cents.3 (Presumably your play money is denominated in the square root of cents, so that the squared error is a monetary sum.)
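A tiny sketch of the rule exactly as described here, scoring only the bet placed on the winning color (the function name is mine):

```python
def squared_error_score(bet_on_winner):
    """A dollar minus the squared error of the bet on the winning color."""
    error = 1.0 - bet_on_winner
    return 1.0 - error ** 2

print(squared_error_score(0.30))  # approximately 0.51, i.e. 51 cents
```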
We shall not use the squared-error rule. Ordinary statisticians take the squared error of everything in sight, but not Bayesian statisticians.
We add a new requirement: we require, not only a proper scoring rule, but that our proper scoring rule gives us the same answer whether we apply it to rounds individually or combined. This is what Bayesians do instead of taking the squared error of things; we require invariances.
Suppose I press the button twice in a row. There are nine possible outcomes: green-green, green-blue, green-red, blue-green, blue-blue, blue-red, red-green, red-blue, and red-red. Suppose that green wins, and then blue wins. The experimenter would assign the first score based on our probability assignments for P(green1) and the second score based on P(blue2|green1).4 We would make two predictions, and get two scores. Our first prediction was the probability we assigned to the color that won on the first round, green. Our second prediction was our probability that blue would win on the second round, given that green won on the first round. Why do we need to write P(blue2|green1) instead of just P(blue2)? Because you might have a hypothesis about the flashing light that says “blue never follows green,” or “blue always follows green” or “blue follows green with 70% probability.” If this is so, then after seeing green on the first round, you might want to revise your prediction—change your bets—for the second round. You can always revise your predictions right up to the moment the experimenter presses the button, using every scrap of information; but after the light flashes it is too late to change your bet.
Suppose the actual outcome is green1 followed by blue2. We require this invariance: I must get the same total score, regardless of whether:
I am scored twice, first on my prediction for P(green1), and second on my prediction for P(blue2|green1).
I am scored once for my joint prediction P(green1 and blue2).
Suppose I assign a 60% probability to green1, and then the green light flashes. I must now produce probabilities for the colors on the second round. I assess the possibility blue2, and allocate it 25% of my probability mass. Lo and behold, on the second round the light flashes blue. So on the first round my bet on the winning color was 60%, and on the second round my bet on the winning color was 25%. But I might also, at the start of the experiment and after assigning P(green1), imagine that the light first flashes green, imagine updating my theories based on that information, and then say what confidence I will give to blue on the next round if the first round is green. That is, I generate the probabilities P(green1) and P(blue2|green1). By multiplying these two probabilities together we would get the joint probability, P(green1 and blue2) = 15%.
A double experiment has nine possible outcomes. If I generate nine probabilities for P(green1, green2), P(green1, blue2), . . . , P(red1, blue2), P(red1, red2), the probability mass must sum to no more than one. I am giving predictions for nine mutually exclusive possibilities of a “double experiment.”
We require a scoring rule (and maybe it won’t look like anything an ordinary bookie would ever use) such that my score doesn’t change regardless of whether we consider the double result as two predictions or one prediction. I can treat the sequence of two results as a single experiment, “press the button twice,” and be scored on my prediction for P(green1, blue2) = 15%. Or I can be scored once for my first prediction P(green1) = 60%, then again on my prediction P(blue2|green1) = 25%. We require the same total score in either case, so that it doesn’t matter how we slice up the experiments and the predictions—the total score is always exactly the same. This is our invariance.
We have just required:
Score[P(green1,blue2)] = Score[P(green1)] + Score[P(blue2|green1)].
And we already know:
P(green1,blue2) = P(green1) × P(blue2|green1).
The only possible scoring rule is:
Score(P) = log(P).
The new scoring rule is that your score is the logarithm of the probability you assigned to the winner.
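A minimal numeric check of the invariance, using the green-then-blue example above (I have chosen base 2 purely for illustration):

```python
import math

p_green1 = 0.60              # probability assigned to green on the first round
p_blue2_given_green1 = 0.25  # probability assigned to blue on the second round, given green
p_joint = p_green1 * p_blue2_given_green1  # 0.15

# Scored twice, or scored once on the joint prediction: the totals agree.
print(math.log2(p_green1) + math.log2(p_blue2_given_green1))  # about -2.74 bits
print(math.log2(p_joint))                                     # about -2.74 bits
```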
The base of the logarithm is arbitrary—whether we use the logarithm base ten or the logarithm base two, the scoring rule has the desired invariance. But we must choose some actual base. A mathematician would choose base e; an engineer would choose base ten; a computer scientist would choose base two. If we use base ten, we can convert to decibels, as in the Intuitive Explanation; but sometimes bits are easier to manipulate.
The logarithm scoring rule is proper—it has its expected maximum when we say our exact anticipations; it rewards honesty. If we think the blue light has a 60% probability of flashing, and we calculate our expected payoff for different betting schemas, we find that we maximize our expected payoff by telling the experimenter “60%.” (Readers with calculus can verify this.) The scoring rule also gives an invariant total, regardless of whether pressing the button twice counts as “one experiment” or “two experiments.” However, payoffs are now all negative, since we are taking the logarithm of the probability and the probability is between zero and one. The logarithm base ten of 0.1 is -1; the logarithm base ten of 0.01 is -2. That’s okay. We accepted that the scoring rule might not look like anything a real bookie would ever use. If you like, you can imagine that the experimenter has a pile of money, and at the end of the experiment they award you some amount minus your large negative score. (Er, the amount plus your negative score.) Maybe the experimenter has a hundred dollars, and at the end of a hundred rounds you accumulated a score of -48, so you get $52.
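Readers without calculus can verify the same claim numerically. This sketch (the names `belief`, `expected_log_score`, and the grid of candidate reports are my own) sweeps possible reports for a light we believe flashes blue 60% of the time:

```python
import math

belief = 0.60  # how strongly we actually expect the blue light to flash

def expected_log_score(report):
    """Expected log score if blue really flashes with frequency `belief`."""
    return belief * math.log(report) + (1 - belief) * math.log(1 - report)

reports = [i / 100 for i in range(1, 100)]   # candidate reports from 1% to 99%
print(max(reports, key=expected_log_score))  # 0.6: honesty maximizes the expectation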
A score of -48 in what base? We can eliminate the ambiguity in the score by specifying units. Ten decibels equals a factor of 10; negative ten decibels equals a factor of 1/10. Assigning a probability of 0.01 to the actual outcome would score -20 decibels. A probability of 0.03 would score -15 decibels. Sometimes we may use bits: 1 bit is a factor of 2, -1 bit is a factor of 1/2. A probability of 0.25 would score -2 bits; a probability of 0.03 would score around -5 bits.
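The unit conversions in this paragraph are easy to reproduce; a small sketch (helper names are mine):

```python
import math

def decibels(p):
    return 10 * math.log10(p)

def bits(p):
    return math.log2(p)

print(decibels(0.01), decibels(0.03))  # -20.0 and about -15
print(bits(0.25), bits(0.03))          # -2.0 and about -5
```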
If you arrive at a probability assessment P for each color, with P(red), P(blue), P(green), and the lights actually flash at the frequencies you assign, then your expected score is:
Expectation(Score) = P(red) × log(P(red)) + P(blue) × log(P(blue)) + P(green) × log(P(green)).
Suppose you had probabilities of 25% red, 50% blue, and 25% green. Let’s think in base 2 for a moment, to make things simpler. Your expected score is:
Score(red) = −2 bits, flashes 25% of the time,
Score(blue) = −1 bit, flashes 50% of the time,
Score(green) = −2 bits, flashes 25% of the time,
Expectation(Score) = −1.5 bits.
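The −1.5 bits is just the probability-weighted sum of the individual scores; a quick check in code (names mine):

```python
import math

probabilities = {"red": 0.25, "blue": 0.50, "green": 0.25}

# Each color's log score, weighted by how often that color flashes.
expected_score = sum(p * math.log2(p) for p in probabilities.values())
print(expected_score)  # -1.5 bits
```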
* * *
Contrast our Bayesian scoring rule with the ordinary or colloquial way of speaking about degrees of belief, where someone might casually say, “I’m 98% certain that canola oil contains more omega-3 fats than olive oil.” What they really mean by this is that they feel 98% certain—there’s something like a little progress bar that measures the strength of the emotion of certainty, and this progress bar is 98% full. And the emotional progress bar probably wouldn’t be exactly 98% full, if we had some way to measure. The word “98%” is just a colloquial way of saying: “I’m almost but not entirely certain.” It doesn’t mean that you could get the highest expected payoff by betting exactly 98 cents of play money on that outcome. You should only assign a calibrated confidence of 98% if you’re confident enough that you think you could answer a hundred similar questions, of equal difficulty, one after the other, each independent from the others, and be wrong, on average, about twice. We’ll keep track of how often you’re right, over time, and if it turns out that when you say “90% sure” you’re right about seven times out of ten, then we’ll say you’re poorly calibrated.
If you say “98% probable” a thousand times, and you are surprised only five times, we still ding you for poor calibration. You’re allocating too much probability mass to the possibility that you’re wrong. You should say “99.5% probable” to maximize your score. The scoring rule rewards accurate calibration, encouraging neither humility nor arrogance.
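As a rough illustration of the bookkeeping (entirely my own sketch), a calibration check is just a comparison of stated confidence against observed frequency:

```python
def observed_accuracy(times_right, times_total):
    """How often you were actually right when claiming a given confidence."""
    return times_right / times_total

# Saying "98%" a thousand times and being surprised only five times:
print(observed_accuracy(995, 1000))  # 0.995, so you should have said "99.5%"

# Saying "90% sure" and being right seven times out of ten:
print(observed_accuracy(7, 10))      # 0.7, which is poor calibration
```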
At this point it may occur to some readers that there’s an obvious way to achieve perfect calibration—just flip a coin for every yes-or-no question, and assign your answer a confidence of 50%. You say 50% and you’re right half the time. Isn’t that perfect calibration? Yes. But calibration is only one component of our Bayesian score; the other component is discrimination.
Suppose I ask you ten yes-or-no questions. You know absolutely nothing about the subject, so on each question you divide your probability mass fifty-fifty between “Yes” and “No.” Congratulations, you’re perfectly calibrated—answers for which you said “50% probability” were true exactly half the time. This is true regardless of the sequence of correct answers or how many answers were Yes. In ten experiments you said “50%” on twenty occasions—you said “50%” to Yes1, No1, Yes2, No2, Yes3, No3, . . . On ten of those occasions the answer was correct, the occasions: Yes1, No2, No3, . . . And on ten of those occasions the answer was incorrect: No1, Yes2, Yes3, . . .
Now I give my own answers, putting more effort into it, trying to discriminate whether Yes or No is the correct answer. I assign 90% confidence to each of my favored answers, and my favored answer is wrong twice. I’m more poorly calibrated than you. I said “90%” on ten occasions and I was wrong two times. The next time someone listens to me, they may mentally translate “90%” into 80%, knowing that when I’m 90% sure, I’m right about 80% of the time. But the probability you assigned to the final outcome is 1/2 to the tenth power, which is about 0.001 or 1/1,024. The probability I assigned to the final outcome is 90% to the eighth power times 10% to the second power, 0.9⁸ × 0.1², which works out to about 0.004 or 0.4%. Your calibration is perfect and mine isn’t, but my better discrimination between right and wrong answers more than makes up for it. My final score is higher—I assigned a greater joint probability to the final outcome of the entire experiment. If I’d been less overconfident and better calibrated, the probability I assigned to the final outcome would have been 0.8⁸ × 0.2², which works out to about 0.007 or 0.7%.
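The three joint probabilities compared in this paragraph, as a quick check (variable names are mine):

```python
coin_flipper  = 0.5 ** 10            # 50% on every answer, ten questions
overconfident = 0.9 ** 8 * 0.1 ** 2  # 90% on each favored answer, wrong twice
recalibrated  = 0.8 ** 8 * 0.2 ** 2  # 80% on each favored answer, wrong twice

print(coin_flipper)   # about 0.001 (1/1,024)
print(overconfident)  # about 0.004
print(recalibrated)   # about 0.007
```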
Is it possible to do even better? Sure. You could have guessed every single answer correctly, and assigned a probability of 99% to each of your answers. Then the probability you assigned to the entire experimental outcome would be 0.99¹⁰ ≈ 90%.
Your score would be log (90%), which is -0.45 decibels or -0.15 bits. We need to take the logarithm so that if I try to maximize my expected score, ∑ P × log (P), I have no motive to cheat. Without the logarithm rule, I would maximize my expected score by assigning all my probability mass to the most probable outcome. Also, without the logarithm rule, my total score would be different depending on whether we counted several rounds as several experiments or as one experiment.
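Here is my own illustrative check of that last claim: under the original "pay the bet on the winner" rule the expected score is gamed by piling everything onto the most probable color, while the log rule is maximized by honest reporting. The grid keeps every bet strictly positive so the logarithm stays defined; all names are mine.

```python
import math
from itertools import product

freqs = {"blue": 0.30, "green": 0.20, "red": 0.50}  # true frequencies

def linear_expected_score(report):
    """Expected payoff under the original 'pay the bet on the winner' rule."""
    return sum(freqs[c] * report[c] for c in freqs)

def log_expected_score(report):
    """Expected payoff under the logarithmic scoring rule."""
    return sum(freqs[c] * math.log(report[c]) for c in freqs)

# A coarse grid of probability assignments (each color gets at least 5%).
grid = [i / 20 for i in range(1, 20)]
reports = [dict(zip(freqs, (b, g, 1 - b - g)))
           for b, g in product(grid, grid) if 1 - b - g > 0.04]

print(max(reports, key=linear_expected_score))  # piles as much mass as allowed onto red
print(max(reports, key=log_expected_score))     # roughly the true frequencies
```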
A simple transform can fix poor calibration by decreasing discrimination. If you are in the habit of saying “million-to-one” on 90 correct and 10 incorrect answers for each hundred questions, we can perfect your calibration by replacing “million-to-one” with “nine-to-one.” In contrast, there’s no easy way to increase (successful) discrimination. If you habitually say “nine-to-one” on 90 correct answers for each hundred questions, I can easily increase your claimed discrimination by replacing “nine-to-one” with “million-to-one.” But no simple transform can increase your actual discrimination such that your reply distinguishes 95 correct answers and 5 incorrect answers. From Yates et al.:5 “Whereas good calibration often can be achieved by simple mathematical transformations (e.g., adding a constant to every probability judgment), good discrimination demands access to solid, predictive evidence and skill at exploiting that evidence, which are difficult to find in any real-life, practical situation.” If you lack the ability to distinguish truth from falsehood, you can achieve perfect calibration by confessing your ignorance; but confessing ignorance will not, of itself, distinguish truth from falsehood.
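The "simple transform" really is just a relabeling of habitual confidence statements with the empirically observed odds, as in this tiny sketch (numbers taken from the example above):

```python
# If you habitually say "million-to-one" on answers you get right 90 times out of 100,
# the calibration-fixing transform simply relabels that habit as the observed odds.
times_right, times_wrong = 90, 10
observed_odds = times_right / times_wrong
print(f"{observed_odds:.0f}-to-one")  # 9-to-one
```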
We thus dispose of another false stereotype of rationality, that rationality consists of being humble and modest and confessing helplessness in the face of the unknown. That’s just the cheater’s way out, assigning a 50% probability to all yes-or-no questions. Our scoring rule encourages you to do better if you can. If you are ignorant, confess your ignorance; if you are confident, confess your confidence. We penalize you for being confident and wrong, but we also reward you for being confident and right. That is the virtue of a proper scoring rule.
* * *
Suppose I flip a coin twenty times. If I believe the coin is fair, the best prediction I can make is to predict an even chance of heads or tails on each flip. If I believe the coin is fair, I assign the same probability to every possible sequence of twenty coinflips. There are roughly a million (1,048,576) possible sequences of twenty coinflips, and I have only 1.0 of probability mass to play with. So I assign to each individual possible sequence a probability of (1/2)²⁰—odds of about a million to one; -20 bits or -60 decibels.
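The numbers in that paragraph, checked in code (names mine):

```python
import math

n_flips = 20
n_sequences = 2 ** n_flips   # 1,048,576 possible sequences
p_sequence = 0.5 ** n_flips  # probability assigned to any one exact sequence

print(n_sequences)                  # 1048576
print(math.log2(p_sequence))        # -20.0 bits
print(10 * math.log10(p_sequence))  # about -60 decibels
```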
I made an experimental prediction and got a score of -60 decibels! Doesn’t this falsify the hypothesis? Intuitively, no. We do not flip a coin twenty times and see a random-looking result, then reel back and say, why, the odds of that are a million to one. But the odds are a million to one against seeing that exact sequence, as I would discover if I naively predicted the exact same outcome for the next sequence of twenty coinflips. It’s okay to have theories that assign tiny probabilities to outcomes, so long as no other theory does better. But if someone used an alternate hypothesis to write down the exact sequence in a sealed envelope in advance, and she assigned a probability of 99%, I would suspect the fairness of the coin. Provided that she only sealed one envelope, and not a million.
That tells us what we ought common-sensically to answer, but it doesn’t say how the common-sense answer arises from the math. To say why the common sense is correct, we need to integrate all that has been said so far into the framework of Bayesian revision of belief. When we’re done, we’ll have a technical understanding of the difference between a verbal understanding and a technical understanding.