Statistical Inference as Severe Testing


by Deborah G. Mayo


  The performance philosophy sees the key function of statistical method as controlling the relative frequency of erroneous inferences in the long run of applications. For example, a frequentist statistical test, in its naked form, can be seen as a rule: whenever your outcome exceeds some value (say, X > x*), reject a hypothesis H0 and infer H1. The value of the rule, according to its performance-oriented defenders, is that it can ensure that, regardless of which hypothesis is true, there is both a low probability of erroneously rejecting H0 (rejecting H0 when it is true) and a low probability of erroneously accepting H0 (failing to reject H0 when it is false).
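  To make that rule concrete (a minimal sketch of my own, not from the text; the cutoff and the alternative value of θ are chosen purely for illustration), consider a binomial experiment with 16 trials where the rule says: reject H0: θ = 0.5 whenever the number of successes X reaches some cutoff x*. Both error probabilities can then be computed directly.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, cutoff = 16, 12           # rule: reject H0 if X >= 12 (cutoff chosen for this sketch)
theta0, theta1 = 0.5, 0.8    # null value and an illustrative alternative

type_I  = binom_tail(cutoff, n, theta0)       # P(reject H0; H0 true)          ~ 0.04
type_II = 1 - binom_tail(cutoff, n, theta1)   # P(fail to reject; theta = 0.8) ~ 0.20

print(f"Type I error:  {type_I:.3f}")
print(f"Type II error: {type_II:.3f}")
```

  On the performance philosophy, it is these long-run error rates, not anything about the particular case at hand, that give the rule its value.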

  The second philosophy, probabilism, views probability as a way to assign degrees of belief, support, or plausibility to hypotheses. Many keep to a comparative report, for example that H0 is more believable than is H1 given data x; others strive to say H0 is less believable given data x than before, and offer a quantitative report of the difference.

  What happened to the goal of scrutinizing BENT science by the severity criterion? Neither “probabilism” nor “performance” directly captures that demand. Taking these goals at face value, it’s easy to see why they come up short. Potti and Nevins’ strong belief in the reliability of their prediction model for cancer therapy scarcely made up for the shoddy testing. Neither is good long-run performance a sufficient condition. Most obviously, there may be no long-run repetitions, and our interest in science is often just the particular statistical inference before us. Crude long-run requirements may be met by silly methods. Most importantly, good performance alone fails to get at why methods work when they do; namely – I claim – to let us assess and control the stringency of tests. This is the key to answering a burning question that has caused major headaches in statistical foundations: why should a low relative frequency of error matter to the appraisal of the inference at hand? It is not probabilism or performance we seek to quantify, but probativeness.

  I do not mean to disparage the long-run performance goal – there are plenty of tasks in inquiry where performance is absolutely key. Examples are screening in high-throughput data analysis, and methods for deciding which of tens of millions of collisions in high-energy physics to capture and analyze. New applications of machine learning may lead some to say that only low rates of prediction or classification errors matter. Even with prediction, “black-box” modeling, and non-probabilistic inquiries, there is concern with solving a problem. We want to know if a good job has been done in the case at hand.

  Severity (Strong): Argument from Coincidence

  The weakest version of the severity requirement (Section 1.1), in the sense of easiest to justify, is negative, warning us when BENT data are at hand, and a surprising amount of mileage may be had from that negative principle alone. It is when we recognize how poorly certain claims are warranted that we get ideas for improved inquiries. In fact, if you wish to stop at the negative requirement, you can still go pretty far along with me. I also advocate the positive counterpart:

  Severity (strong): We have evidence for a claim C just to the extent it survives a stringent scrutiny. If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet none or few are found, then the passing result, x, is evidence for C.

  One way this can be achieved is by an argument from coincidence. The most vivid cases occur outside formal statistics.

  Some of my strongest examples tend to revolve around my weight. Before leaving the USA for the UK, I record my weight on two scales at home, one digital, one not, and the big medical scale at my doctor’s office. Suppose they are well calibrated and nearly identical in their readings, and they also all pick up on the extra 3 pounds when I’m weighed carrying three copies of my 1-pound book, Error and the Growth of Experimental Knowledge (EGEK). Returning from the UK, to my astonishment, not one but all three scales show anywhere from a 4–5 pound gain. There’s no difference when I place the three books on the scales, so I must conclude, unfortunately, that I’ve gained around 4 pounds. Even for me, that’s a lot. I’ve surely falsified the supposition that I lost weight! From this informal example, we may make two rather obvious points that will serve for less obvious cases. First, there’s the idea I call lift-off.

  Lift-off: An overall inference can be more reliable and precise than its premises individually.

  Each scale, by itself, has some possibility of error, and limited precision. But the fact that all of them have me at an over 4-pound gain, while none show any difference in the weights of EGEK, pretty well seals it. Were one scale off balance, it would be discovered by another, and would show up in the weighing of books. They cannot all be systematically misleading just when it comes to objects of unknown weight, can they? Rejecting a conspiracy of the scales, I conclude I’ve gained weight, at least 4 pounds. We may call this an argument from coincidence, and by its means we can attain lift-off. Lift-off runs directly counter to a seemingly obvious claim of drag-down.

  Drag-down: An overall inference is only as reliable/precise as is its weakest premise.

  The drag-down assumption is common among empiricist philosophers: As they like to say, “It’s turtles all the way down.” Sometimes our inferences do stand as a kind of tower built on linked stones – if even one stone fails they all come tumbling down. Call that a linked argument.

  Our most prized scientific inferences would be in a very bad way if piling on assumptions invariably leads to weakened conclusions. Fortunately we also can build what may be called convergent arguments, where lift-off is attained. This seemingly banal point suffices to combat some of the most well-entrenched skepticisms in philosophy of science. And statistics happens to be the science par excellence for demonstrating lift-off!

  Now consider what justifies my weight conclusion, based, as we are supposing it is, on a strong argument from coincidence. No one would say: “I can be assured that by following such a procedure, in the long run I would rarely report weight gains erroneously, but I can tell nothing from these readings about my weight now.” To justify my conclusion by long-run performance would be absurd. Instead we say that the procedure had enormous capacity to reveal if any of the scales were wrong, and from this I argue about the source of the readings: H: I’ve gained weight. Simple as that. It would be a preposterous coincidence if none of the scales registered even slight weight shifts when weighing objects of known weight, and yet were systematically misleading when applied to my weight. You see where I’m going with this. This is the key – granted, with a homely example – that can fill a very important gap in frequentist foundations: Just because an account is touted as having a long-run rationale, it does not mean it lacks a short-run rationale, or even one relevant for the particular case at hand.

  Nor is it merely the improbability of all the results, were H false; it is rather like denying an evil demon has read my mind just in the cases where I do not know the weight of an object, and deliberately deceived me. The argument to “weight gain” is an example of an argument from coincidence to the absence of an error, what I call:

  Arguing from Error: There is evidence an error is absent to the extent that a procedure with a very high capability of signaling the error, if and only if it is present, nevertheless detects no error.

  I am using “signaling” and “detecting” synonymously: It is important to keep in mind that we don’t know if the test output is correct, only that it gives a signal or alert, like sounding a bell. Methods that enable strong arguments to the absence (or presence) of an error I call strong error probes. Our ability to develop strong arguments from coincidence, I will argue, is the basis for solving the “problem of induction.”

  Glaring Demonstrations of Deception

  Intelligence is indicated by a capacity for deliberate deviousness. Such deviousness becomes self-conscious in inquiry: An example is the use of a placebo to find out what it would be like if the drug has no effect. What impressed me the most in my first statistics class was the demonstration of how apparently impressive results are readily produced when nothing’s going on, i.e., “by chance alone.” Once you see how it is done, and done easily, there is no going back. The toy hypotheses used in statistical testing are nearly always overly simple as scientific hypotheses. But when it comes to framing rather blatant deceptions, they are just the ticket!

  When Fisher offered Muriel Bristol-Roach a cup of tea back in the 1920s, she refused it because he had put the milk in first. What difference could it make? Her husband and Fisher thought it would be fun to put her to the test (1935a). Say she doesn’t claim to get it right all the time but does claim that she has some genuine discerning ability. Suppose Fisher subjects her to 16 trials and she gets 9 of them right. Should I be impressed or not? By a simple experiment of randomly assigning milk first/tea first Fisher sought to answer this stringently. But don’t be fooled: a great deal of work goes into controlling biases and confounders before the experimental design can work. The main point just now is this: so long as lacking ability is sufficiently like the canonical “coin tossing” (Bernoulli) model (with the probability of success at each trial of 0.5), we can learn from the test procedure. In the Bernoulli model, we record success or failure, assume a fixed probability of success θ on each trial, and that trials are independent. If the probability of getting even more successes than she got, merely by guessing, is fairly high, there’s little indication of special tasting ability. The probability of at least 9 of 16 successes, even if θ = 0.5, is 0.4. To abbreviate, Pr(at least 9 of 16 successes; H0: θ = 0.5) = 0.4. This is the P-value of the observed difference; an unimpressive 0.4. You’d expect as many or even more “successes” 40% of the time merely by guessing. It’s also the significance level attained by the result. (I often use P-value as it’s shorter.) Muriel Bristol-Roach pledges that if her performance may be regarded as scarcely better than guessing, then she hasn’t shown her ability. Typically, a small value such as 0.05, 0.025, or 0.01 is required.
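  A quick check of the 0.4 figure (a sketch of my own, not part of the text) is to sum the binomial tail directly:

```python
from math import comb

def p_value_at_least(k, n, theta=0.5):
    """P(at least k successes in n Bernoulli trials with success probability theta)."""
    return sum(comb(n, i) * theta**i * (1 - theta)**(n - i) for i in range(k, n + 1))

# Pr(at least 9 of 16 successes; H0: theta = 0.5)
print(round(p_value_at_least(9, 16), 2))  # ~0.40
```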

  Such artificial and simplistic statistical hypotheses play valuable roles at stages of inquiry where what is needed are blatant standards of “nothing’s going on.” There is no presumption of a metaphysical chance agency, just that there is expected variability – otherwise one test would suffice – and that probability models from games of chance can be used to distinguish genuine from spurious effects. Although the goal of inquiry is to find things out, the hypotheses erected to this end are generally approximations and may be deliberately false. To present statistical hypotheses as identical to substantive scientific claims is to mischaracterize them. We want to tell what’s true about statistical inference. Among the most notable of these truths is:

  P-values can be readily invalidated due to how the data (or hypotheses!) are generated or selected for testing.

  If you fool around with the results afterwards, reporting only successful guesses, your report will be invalid. You may claim it’s very difficult to get such an impressive result due to chance, when in fact it’s very easy to do so, with selective reporting. Another way to put this: your computed P-value is small, but the actual P-value is high! Concern with spurious findings, while an ancient problem, is considered sufficiently serious to have motivated the American Statistical Association to issue a guide on how not to interpret P-values (Wasserstein and Lazar 2016); hereafter, ASA 2016 Guide. It may seem that if a statistical account is free to ignore such fooling around then the problem disappears! It doesn’t.
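  To put numbers on the “computed versus actual” contrast, here is a minimal simulation sketch (mine, not from the text; the choice of 20 hypotheses per study, and the modeling of null P-values as uniform, are illustrative assumptions): try 20 independent hypotheses on which nothing is going on, report only the smallest P-value, and ask how often that report looks significant.

```python
import random

random.seed(1)
trials, k = 100_000, 20      # k hypotheses "tried" per study; only the best is reported

hits = 0
for _ in range(trials):
    # Under a true null with a continuous test statistic, each P-value is uniform on [0, 1].
    best_p = min(random.random() for _ in range(k))
    if best_p <= 0.05:       # the reported (computed) P-value looks impressive
        hits += 1

print(f"Actual probability of reporting P <= 0.05 by selection alone: {hits / trials:.2f}")
# ~0.64, i.e., 1 - 0.95**20: the computed P-value is small, but the actual one is high.
```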

  Incidentally, Bristol-Roach got all the cases correct, and thereby taught her husband a lesson about putting her claims to the test.

  Peirce

  The philosopher and astronomer C. S. Peirce, writing in the late nineteenth century, is acknowledged to have anticipated many modern statistical ideas (including randomization and confidence intervals). Peirce describes how “so accomplished a reasoner” as Dr. Playfair deceives himself by a technique we know all too well – scouring the data for impressive regularities (2.738). Looking at the specific gravities of three forms of carbon, Playfair seeks and discovers a formula that holds for all of them (each is a root of the atomic weight of carbon, which is 12). Can this regularity be expected to hold in general for metalloids? It turns out that half of the cases required Playfair to modify the formula after the fact. If one limits the successful instances to ones where the formula was predesignated, and not altered later on, only half satisfy Playfair’s formula. Peirce asks, how often would such good agreement be found due to chance? Again, should we be impressed?

  Peirce introduces a mechanism to arbitrarily pair the specific gravity of a set of elements with the atomic weight of another. By design, such agreements could only be due to the chance pairing. Lo and behold, Peirce finds about the same number of cases that satisfy Playfair’s formula. “It thus appears that there is no more frequent agreement with Playfair’s proposed law than what is due to chance” (2.738).

  At first Peirce’s demonstration seems strange. He introduces an accidental pairing just to simulate the ease of obtaining so many agreements in an entirely imaginary situation. Yet that suffices to show Playfair’s evidence is BENT. The popular inductive accounts of his time, Peirce argues, do not prohibit adjusting the formula to fit the data, and, because of that, they would persist in Playfair’s error. The same debate occurs today, as when Anil Potti (of the Duke scandal) dismissed the whistleblower Perez thus: “we likely disagree with what constitutes validation” (Nevins and Potti 2015). Erasing genomic data that failed to fit his predictive model was justified, Potti claimed, by the fact that other data points fit (Perez 2015)! Peirce’s strategy, as that of Coombes et al., is to introduce a blatant standard to put the method through its paces, without bogus agreements. If the agreement is no better than bogus agreement, we deny there is evidence for a genuine regularity or valid prediction. Playfair’s formula may be true, or probably true, but Peirce’s little demonstration is enough to show his method did a lousy job of testing it.
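  In modern terms, Peirce’s maneuver is a permutation-style baseline: scramble the pairings and see how often the “law” is satisfied by accident. A minimal sketch of that idea (mine; the predicate agrees is a placeholder standing in for Playfair’s formula, not the formula itself):

```python
import random

def chance_agreement_rate(gravities, weights, agrees, n_shuffles=10_000, seed=0):
    """Estimate how often a formula is satisfied under arbitrary (chance) pairings.

    agrees(gravity, weight) is a placeholder predicate standing in for
    'this pairing satisfies Playfair's formula'."""
    rng = random.Random(seed)
    shuffled = list(weights)
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)                       # pair each gravity with a random weight
        total += sum(agrees(g, w) for g, w in zip(gravities, shuffled))
    return total / (n_shuffles * len(gravities))    # average fraction of chance agreements
```

  If the agreement rate for the real pairing is no higher than this chance rate, there is no evidence of a genuine regularity, which is just Peirce’s verdict on Playfair.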

  Texas Marksman

  Take an even simpler and more blatant argument of deception. It is my favorite: the Texas Marksman. A Texan wants to demonstrate his shooting prowess. He shoots all his bullets any old way into the side of a barn and then paints a bull’s-eye in spots where the bullet holes are clustered. This fails utterly to severely test his marksmanship ability. When some visitors come to town and notice the incredible number of bull’s-eyes, they ask to meet this marksman and are introduced to a little kid. How’d you do so well, they ask? Easy, I just drew the bull’s-eye around the most tightly clustered shots. There is impressive “agreement” with shooting ability; he might even compute how improbable it would be for so many bull’s-eyes to occur by chance. Yet his ability to shoot was not tested in the least by this little exercise. There’s a real effect all right, but it’s not caused by his marksmanship! It serves as a potent analogy for a cluster of formal statistical fallacies from data-dependent findings of “exceptional” patterns.

  The term “apophenia” refers to a tendency to zero in on an apparent regularity or cluster within a vast sea of data and claim a genuine regularity. One of our fundamental problems (and skills) is that we’re apopheniacs. Some investment funds, none that we actually know, are alleged to produce several portfolios by random selection of stocks and send out only the one that did best. Call it the Pickrite method. They want you to infer that it would be a preposterous coincidence to get so great a portfolio if the Pickrite method were like guessing. So their methods are genuinely wonderful, or so you are to infer. If this had been their only portfolio, the probability of doing so well by luck is low. But the probability of at least one of many portfolios doing so well (even if each is generated by chance) is high, if not guaranteed.
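  The arithmetic behind that last point is simple (the numbers here are mine, purely for illustration):

```python
p_single = 0.02   # chance that one randomly assembled portfolio looks this good by luck
k = 100           # number of portfolios the fund quietly generates

p_at_least_one = 1 - (1 - p_single) ** k
print(f"{p_at_least_one:.2f}")   # ~0.87: some great-looking portfolio is all but assured
```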

  Let’s review the rogues’ gallery of glaring arguments from deception. The lady tasting tea showed how a statistical model of “no effect” could be used to amplify our ordinary capacities to discern if something really unusual is going on. The P-value is the probability of at least as high a success rate as observed, assuming the test or null hypothesis that the probability of success is 0.5. Since even more successes than she got is fairly frequent through guessing alone (the P-value is moderate), there’s poor evidence of a genuine ability. The Playfair and Texas sharpshooter examples, while quasi-formal or informal, demonstrate how to invalidate reports of significant effects. They show how gambits of post-data adjustments or selection can render a method highly capable of spewing out impressive-looking fits even when it’s just random noise.

  We appeal to the same statistical reasoning to show the problematic cases as to show genuine arguments from coincidence.

  So am I proposing that a key role for statistical inference is to identify ways to spot egregious deceptions (BENT cases) and create strong arguments from coincidence? Yes, I am.

  Spurious P-values and Auditing

  In many cases you read about, you’d be right to suspect that someone has gone circling shots on the side of a barn. Confronted with the statistical news flash of the day, your first question is: Are the results due to selective reporting, cherry picking, or any number of other similar ruses? This is a central part of what we’ll call auditing a significance level.

  A key point too rarely appreciated: Statistical facts about P-values themselves demonstrate how data finagling can yield spurious significance. This is true for all error probabilities. That’s what a self-correcting inference account should do. Ben Goldacre, in Bad Pharma (2012), sums it up this way: the gambits give researchers an abundance of chances to find something when the tools assume you have had just one chance. Scouring different subgroups and otherwise “trying and trying again” are classic ways to blow up the actual probability of obtaining an impressive, but spurious, finding – and that remains so even if you ditch P-values and never compute them. FDA rules are designed to outlaw such gambits. To spot the cheating or questionable research practices (QRPs) responsible for a finding may not be easy. New research tools are being developed to detect them. Unsurprisingly, P-value analysis is relied on to discern spurious P-values (e.g., by lack of replication, or, in analyzing a group of tests, finding too many P-values in a given range). Ultimately, a qualitative severity scrutiny is necessary to get beyond merely raising doubts to falsifying purported findings.
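  The parenthetical about “too many P-values in a given range” rests on a fact worth stating plainly: when a null hypothesis is true and the test statistic is continuous, the P-value is uniformly distributed, so a batch of reported P-values piling up just under 0.05 is itself suspicious. A crude sketch of that kind of screen (mine, not a tool named in the text; the 30% flagging threshold is an arbitrary choice for the sketch):

```python
def flag_pileup_near_threshold(p_values, flag_fraction=0.30):
    """Crude screen for a suspicious batch of 'significant' P-values.

    Under a true null (continuous statistic), significant P-values are uniform
    on (0, 0.05], so roughly 20% should fall in (0.04, 0.05]. Genuine effects
    push mass toward very small P-values; a pile-up just under 0.05 is a red flag.
    """
    sig = [p for p in p_values if p <= 0.05]
    if not sig:
        return False
    near_threshold = sum(0.04 < p <= 0.05 for p in sig) / len(sig)
    return near_threshold >= flag_fraction

# A batch of reported P-values clustered just under 0.05 gets flagged.
print(flag_pileup_near_threshold([0.048, 0.047, 0.049, 0.03, 0.045, 0.041]))  # True
```

  A flag of this sort only raises doubts; as the text says, a qualitative severity scrutiny is still needed to move from suspicion to falsifying a purported finding.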

 
