Statistical Inference as Severe Testing


by Deborah G Mayo


  Association Is Not Causation: Hormone Replacement Therapy (HRT)

  Replicable results from high-quality research are sound, except for the sin that replicability fails to uncover: systematic bias. Gaps between what is actually producing the statistical effect and what is inferred open the door by which biases creep in. Stand-in or proxy variables in statistical models may have little to do with the phenomenon of interest.

  So strong was the consensus-based medical judgment that hormone replacement therapy helps prevent heart disease that many doctors deemed it “unethical to ask women to accept the possibility that they might be randomized to a placebo” (The National Women’s Health Network (NWHN) 2002, p. 180). Post-menopausal women who wanted to retain the attractions of being “Feminine Forever,” as in the title of an influential tract (Wilson 1971), were routinely given HRT. Nevertheless, when a large randomized controlled trial (RCT) was finally done, it revealed statistically significant increased risks of heart disease, breast cancer, and other diseases that HRT was supposed to help prevent. The observational studies on HRT, despite reproducibly showing a benefit, had little capacity to unearth biases due to “the healthy women’s syndrome.” There were confounding factors separately correlated with the beneficial outcomes enjoyed by women given HRT: they were healthier, better educated, and less obese than women not taking HRT. (That certain subgroups are now thought to benefit is a separate matter.)
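
  The logic of the bias is easy to exhibit in a toy simulation; the sketch below is entirely my own construction (the numbers are invented for illustration, not drawn from the HRT studies). A single “health” confounder raises the chance of taking HRT and lowers the chance of heart disease, so HRT looks protective even though it does nothing in the simulated world – and the spurious benefit would replicate in every observational sample drawn this way.

```python
# Toy confounding simulation (my own construction; numbers are invented).
# "healthy" raises the chance of taking HRT and lowers disease risk, so
# HRT appears protective although it has no causal effect here.
import random

random.seed(0)
outcomes = {True: [], False: []}  # disease indicators, keyed by HRT use

for _ in range(100_000):
    healthy = random.random() < 0.5
    on_hrt = random.random() < (0.8 if healthy else 0.2)
    disease = random.random() < (0.05 if healthy else 0.15)
    outcomes[on_hrt].append(disease)

for on_hrt, xs in outcomes.items():
    print(f"HRT={on_hrt}: disease rate = {sum(xs) / len(xs):.3f}")
# Prints roughly 0.07 for HRT users vs. 0.13 for non-users: an entirely
# spurious "benefit". Randomizing HRT assignment would break the link
# between "healthy" and "on_hrt" and make the two rates agree.
```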

  Big Data scientists are discovering there may be something in the data collection that results in the bias being “hard-wired” into the data, and therefore even into successful replications. So replication is not enough. Beyond biased data, there’s the worry that lab experiments may be only loosely connected to research claims. Experimental economics, for instance, is replete with replicable effects that economist Robert Sugden calls “exhibits.” “An exhibit is an experimental design which reliably induces a surprising regularity” with at best an informal hypothesis as to its underlying cause (Sugden 2005, p. 291). Competing interpretations remain. (In our museum travels, “exhibit” will be used in the ordinary way.) In analyzing a test’s capability to control erroneous interpretations, we must consider the porousness at multiple steps from data, to statistical inference, to substantive claims.

  Souvenir A: Postcard to Send

  The gift shop has a postcard listing the four slogans from the start of this Tour. Much of today’s handwringing about statistical inference is unified by a call to block these fallacies. In some realms, trafficking in too-easy claims for evidence is, if not a criminal offense, “bad statistics”; in others, notably some social sciences, it is accepted cavalierly – much to the despair of panels on research integrity. We are more sophisticated than ever about the ways researchers can repress unwanted, and magnify wanted, results. Fraud-busting is everywhere, and the most important grain of truth is this: all the fraud-busting is based on error statistical reasoning (if only on the meta-level). The minimal requirement to avoid BENT isn’t met. It’s hard to see how one can grant the criticisms while denying the critical logic.

  We should oust mechanical, recipe-like uses of statistical methods that have long been lampooned, and are doubtless made easier by Big Data mining. They should be supplemented with tools to report magnitudes of effects that have and have not been warranted with severity. But simple significance tests have their uses, and shouldn’t be ousted simply because some people are liable to violate Fisher’s warning and report isolated results. They should be seen as a part of a conglomeration of error statistical tools for distinguishing genuine and spurious effects. They offer assets that are essential to our task: they have the means by which to register formally the fallacies in the postcard list. The failed statistical assumptions, the selection effects from trying and trying again, all alter a test’s error-probing capacities. This sets off important alarm bells, and we want to hear them. Don’t throw out the error-control baby with the bad statistics bathwater.

  The slogans about lying with statistics? View them, not as a litany of embarrassments, but as announcing what any responsible method must register, if not control or avoid. Criticisms of statistical tests, where valid, boil down to problems with the critical alert function. Far from the high capacity to warn, “Curb your enthusiasm!” as correct uses of tests do, there are practices that make sending out spurious enthusiasm as easy as pie. This is a failure for sure, but don’t trade them in for methods that cannot detect failure at all. If you’re shopping for a statistical account, or appraising a statistical reform, your number one question should be: does it embody trigger warnings of spurious effects? Of bias? Of cherry picking and multiple tries? If the response is: “No problem; if you use our method, those practices require no change in statistical assessment!” all I can say is, if it sounds too good to be true, you might wish to hold off buying it.

  We shouldn’t be hamstrung by the limitations of any formal methodology. Background considerations, usually absent from typical frequentist expositions, must be made more explicit; taboos and conventions that encourage “mindless statistics” (Gigerenzer 2004) must be eradicated. The severity demand is what we naturally insist on as consumers. We want methods that are highly capable of finding flaws just when they’re present, and we specify worst case scenarios. With the data in hand, we custom tailor our assessments depending on how severely (or inseverely) claims hold up. Here’s an informal statement of the severity requirements (weak and strong):

  Severity Requirement (weak): If data x agree with a claim C but the method was practically incapable of finding flaws with C even if they exist, then x is poor evidence for C.

  Severity (strong): If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet none or few are found, then the passing result, x, is an indication of, or evidence for, C.
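
  To give the requirements a quantitative face, here is a minimal numerical sketch; the test setup (a one-sided Normal test with known σ) and all of the numbers are my own illustration, not taken from the text. Severity for the claim μ > μ1 asks how probable a result less in accord with that claim would have been, were μ only as large as μ1.

```python
# Minimal severity sketch (my own illustration): one-sided Normal test
# T+ of H0: mu <= mu0 vs. H1: mu > mu0, with sigma known.
from math import erf, sqrt

def normal_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity(xbar, mu1, sigma=1.0, n=25):
    """SEV(mu > mu1): probability of a sample mean no larger than the
    observed xbar, computed under mu = mu1."""
    return normal_cdf((xbar - mu1) / (sigma / sqrt(n)))

# Observed xbar = 0.4 with mu0 = 0, sigma = 1, n = 25 (2 SEs above mu0):
print(severity(0.4, mu1=0.0))  # ~0.98: "mu > 0"   passes severely
print(severity(0.4, mu1=0.3))  # ~0.69: "mu > 0.3" is only weakly probed
print(severity(0.4, mu1=0.5))  # ~0.31: "mu > 0.5" would be unwarranted
```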

  You might aver that we are too weak to fight off the lures of retaining the status quo – the carrots are too enticing, given that the sticks aren’t usually too painful. I’ve heard some people say that evoking traditional mantras for promoting reliability, now that science has become so crooked, only makes things worse. Really? Yes there is gaming, but if we are not to become utter skeptics of good science, we should understand how the protections can work. In either case, I’d rather have rules to hold the “experts” accountable than live in a lawless wild west. I, for one, would be skeptical of entering clinical trials based on some of the methods now standard. There will always be cheaters, but give me an account that has eyes with which to spot them, and the means by which to hold cheaters accountable. That is, in brief, my basic statistical philosophy. The stakes couldn’t be higher in today’s world. Feynman said to take on an “extra type of integrity” that is not merely the avoidance of lying but striving “to check how you’re maybe wrong.” I couldn’t agree more. But we laywomen are still going to have to proceed with a cattle prod.

  1.3 The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon

  How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science?

  (Donald Fraser 2011, p. 329)

  We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology.

  (J. Berger 2003, p. 4)

  From the aerial perspective of a hot-air balloon, we may see contemporary statistics as a place of happy multiplicity: the wealth of computational ability allows for the application of countless methods, with little handwringing about foundations. Doesn’t this show we may have reached “the end of statistical foundations”? One might have thought so. Yet, descending close to a marshy wetland, and especially scratching a bit below the surface, reveals unease on all sides. The false dilemma between probabilism and long-run performance lets us get a handle on it. In fact, the Bayesian versus frequentist dispute arises as a dispute between probabilism and performance. This gets to my second reason for why the time is right to jump back into these debates: the “statistics wars” present new twists and turns. Rival tribes are more likely to live closer and in mixed neighborhoods since around the turn of the century. Yet, to the beginning student, it can appear as a jungle.

  Statistics Debates: Bayesian versus Frequentist

  These days there is less distance between Bayesians and frequentists, especially with the rise of objective [default] Bayesianism, and we may even be heading toward a coalition government.

  (Efron 2013, p. 145)

  A central way to formally capture probabilism is by means of the formula for conditional probability, where Pr(x) > 0:

  Pr(H|x) = Pr(H and x)/Pr(x).

  Since Pr(H and x) = Pr(x|H)Pr(H) and Pr(x) = Pr(x|H)Pr(H) + Pr(x|~H)Pr(~H), we get:

  Pr(H|x) = Pr(x|H)Pr(H)/[Pr(x|H)Pr(H) + Pr(x|~H)Pr(~H)],

  where ~H is the denial of H. It would be cashed out in terms of all rivals to H within a frame of reference. Some call it Bayes’ Rule or inverse probability. Leaving probability uninterpreted for now, if the data are very improbable given H, then our probability in H after seeing x, the posterior probability Pr(H|x), may be lower than the probability in H prior to x, the prior probability Pr(H). Bayes’ Theorem is just a theorem stemming from the definition of conditional probability; it is only when statistical inference is thought to be encompassed by it that it becomes a statistical philosophy. Using Bayes’ Theorem doesn’t make you a Bayesian.
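
  Since the computation is just arithmetic, a few lines of Python make it concrete (the numbers below are invented for illustration): with data improbable under H but not under ~H, the posterior falls below the prior.

```python
# Bayes' Theorem for a hypothesis H and its denial ~H
# (illustrative numbers of my own choosing).
def posterior(prior_H, lik_H, lik_not_H):
    """Return Pr(H|x) given Pr(H), Pr(x|H), and Pr(x|~H)."""
    pr_x = lik_H * prior_H + lik_not_H * (1 - prior_H)  # Pr(x)
    return lik_H * prior_H / pr_x

# Data improbable under H (0.05) but not under ~H (0.5): the posterior
# Pr(H|x) ~ 0.09 drops well below the prior Pr(H) = 0.5.
print(posterior(prior_H=0.5, lik_H=0.05, lik_not_H=0.5))
```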

  Larry Wasserman, a statistician and master of brevity, boils it down to a contrast of goals. According to him (2012b):

  The Goal of Frequentist Inference : Construct procedure with frequentist guarantees [i.e., low error rates].

  The Goal of Bayesian Inference : Quantify and manipulate your degrees of beliefs. In other words, Bayesian inference is the Analysis of Beliefs.

  At times he suggests we use B(H) for belief and F(H) for frequencies. The distinctions in goals are too crude, but they give a feel for what is often regarded as the Bayesian-frequentist controversy. However, they present us with the false dilemma (performance or probabilism) I’ve said we need to get beyond.

  Today’s Bayesian–frequentist debates clearly differ from those of some years ago. In fact, many of the same discussants, who only a decade ago were arguing for the irreconcilability of frequentist P-values and Bayesian measures, are now smoking the peace pipe, calling for ways to unify and marry the two. I want to show you what really drew me back into the Bayesian–frequentist debates sometime around 2000. If you lean over the edge of the gondola, you can hear some Bayesian family feuds starting around then or a bit after. Principles that had long been part of the Bayesian hard core are being questioned or even abandoned by members of the Bayesian family. Suddenly sparks are flying, mostly kept shrouded within Bayesian walls, but nothing can long be kept secret even there. Spontaneous combustion looms. Hard core subjectivists are accusing the increasingly popular “objective (non-subjective)” and “reference” Bayesians of practicing in bad faith; the new frequentist–Bayesian unificationists are taking pains to show they are not subjective; and some are calling the new Bayesian kids on the block “pseudo Bayesian.” Then there are the Bayesians camping somewhere in the middle (or perhaps out in left field) who, though they still use the Bayesian umbrella, are flatly denying the very idea that Bayesian updating fits anything they actually do in statistics. Obeisance to Bayesian reasoning remains, but on some kind of a priori philosophical grounds. Let’s start with the unifications.

  While subjective Bayesianism offers an algorithm for coherently updating prior degrees of belief in possible hypotheses H1, H2, …, Hn, these unifications fall under the umbrella of non-subjective Bayesian paradigms. Here the prior probabilities in hypotheses are not taken to express degrees of belief but are given by various formal assignments, ideally to have minimal impact on the posterior probability. I will call such Bayesian priors default. Advocates of unifications are keen to show that (i) default Bayesian methods have good performance in a long series of repetitions – so probabilism may yield performance; or alternatively, (ii) frequentist quantities are similar to Bayesian ones (at least in certain cases) – so performance may yield probabilist numbers. Why is this not bliss? Why are so many from all sides dissatisfied?
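
  Sense (ii) can be seen in the textbook Normal case; the simulation below is my own sketch, with invented numbers. With σ known and a flat default prior on the mean, the 95% posterior credible interval coincides numerically with the 95% confidence interval, so it also covers the true mean in about 95% of repetitions – a probabilist quantity with frequentist performance.

```python
# Coverage check (my own sketch): for Normal data with known sigma and a
# flat default prior on mu, the 95% credible interval equals the 95%
# confidence interval xbar +/- 1.96*sigma/sqrt(n), so its long-run
# coverage is ~0.95.
import random
import statistics

random.seed(1)
mu_true, sigma, n, reps = 10.0, 2.0, 25, 10_000
half = 1.96 * sigma / n ** 0.5  # same half-width for CI and credible interval
covered = 0
for _ in range(reps):
    xbar = statistics.fmean(random.gauss(mu_true, sigma) for _ in range(n))
    covered += (xbar - half <= mu_true <= xbar + half)
print(covered / reps)  # close to 0.95
```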

  True blue subjective Bayesians are understandably unhappy with non-subjective priors. These priors do not quantify prior beliefs; they are treated as primitives or conventions for obtaining posterior probabilities. Take Jay Kadane (2008):

  The growth in use and popularity of Bayesian methods has stunned many of us who were involved in exploring their implications decades ago. The result … is that there are users of these methods who do not understand the philosophical basis of the methods they are using, and hence may misinterpret or badly use the results … No doubt helping people to use Bayesian methods more appropriately is an important task of our time.

  (p. 457, emphasis added)

  I have some sympathy here: Many modern Bayesians aren’t aware of the traditional philosophy behind the methods they’re buying into. Yet there is not just one philosophical basis for a given set of methods. This takes us to one of the most dramatic shifts in contemporary statistical foundations. It had long been assumed that only subjective or personalistic Bayesianism had a shot at providing genuine philosophical foundations, but you’ll notice that groups holding this position, while they still dot the landscape in 2018, have been gradually shrinking. Some Bayesians have come to question whether the widespread use of methods under the Bayesian umbrella, however useful, indicates support for subjective Bayesianism as a foundation.

  Marriages of Convenience?

  The current frequentist–Bayesian unifications are often marriages of convenience; statisticians rationalize them less on philosophical than on practical grounds. For one thing, some are concerned that methodological conflicts are bad for the profession. For another, frequentist tribes, contrary to expectation, have not disappeared. Ensuring that accounts can control their error probabilities remains a desideratum that scientists are unwilling to forgo. Frequentists have an incentive to marry as well. Lacking a suitable epistemic interpretation of error probabilities – significance levels, power, and confidence levels – frequentists are constantly put on the defensive. Jim Berger (2003) proposes a construal of significance tests on which the tribes of Fisher, Jeffreys, and Neyman could agree, yet none of the chiefs of those tribes concur (Mayo 2003b). The success stories are based on agreements on numbers that are not obviously true to any of the three philosophies. Beneath the surface – while it’s not often said in polite company – the most serious disputes live on. I plan to lay them bare.

  If it’s assumed that an evidential assessment of hypothesis H should take the form of a posterior probability of H – a form of probabilism – then P-values and confidence levels are applicable only through misinterpretation and mistranslation. Resigned to live with P-values, some are keen to show that construing them as posterior probabilities is not so bad (e.g., Greenland and Poole 2013). Others focus on long-run error control, but cede territory wherein probability captures the epistemological ground of statistical inference. Why assume significance levels and confidence levels lack an authentic epistemological function? I say they have one: to secure and evaluate how well probed and how severely tested claims are.

  Eclecticism and Ecumenism

  If you look carefully between dense forest trees, you can distinguish unification country from lands of eclecticism (Cox 1978) and ecumenism (Box 1983), where tools first constructed by rival tribes are separate, and more or less equal (for different aims). Current-day eclecticisms have a long history – the dabbling in tools from competing statistical tribes has not been thought to pose serious challenges. For example, frequentist methods have long been employed to check or calibrate Bayesian methods (e.g., Box 1983); you might test your statistical model using a simple significance test, say, and then proceed to Bayesian updating. Others suggest scrutinizing a posterior probability or a likelihood ratio from an error probability standpoint. What this boils down to will depend on the notion of probability used. If a procedure frequently gives high probability for claim C even if C is false, severe testers deny convincing evidence has been provided, and never mind about the meaning of probability.
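
  As a sketch of that eclectic workflow – not a recipe from the text, and with invented data and an illustrative threshold – one might first run a simple significance test on the model assumption and only then carry out (conjugate) Bayesian updating:

```python
# Eclectic workflow sketch (my own construction, invented data): check
# the Normal model with a simple significance test, then update a
# Normal prior on the mean by conjugacy (sigma treated as known).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=9.8, scale=2.0, size=40)  # stand-in data

# Step 1: simple significance test of the Normality assumption.
_, p_value = stats.shapiro(data)
if p_value < 0.05:  # threshold chosen for illustration only
    raise SystemExit("Normal model fails the check; don't update on it.")

# Step 2: conjugate Bayesian updating, prior mu ~ Normal(mu0, tau0^2).
sigma, mu0, tau0 = 2.0, 0.0, 10.0
n, xbar = data.size, data.mean()
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + n * xbar / sigma**2)
print(f"posterior for mu: Normal({post_mean:.2f}, sd={post_var**0.5:.2f})")
```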

  One argument is that throwing different methods at a problem is all to the good, that it increases the chances that at least one will get it right. This may be so, provided one understands how to interpret competing answers. Using multiple methods is valuable when a shortcoming of one is rescued by a strength in another. For example, when randomized studies are used to expose the failure to replicate observational studies, there is a presumption that the former is capable of discerning problems with the latter. But what happens if one procedure fosters a goal that is not recognized or is even opposed by another? Members of rival tribes are free to sneak ammunition from a rival’s arsenal – but what if at the same time they denounce the rival method as useless or ineffective?

 
