Statistical Inference as Severe Testing
How to Get Beyond the Statistics Wars
Mounting failures of replication in the social and biological sciences give a new urgency to critically appraising proposed reforms. This book pulls back the cover on disagreements between experts charged with restoring integrity to science. It denies two pervasive views of the role of probability in inference: to assign degrees of belief, and to control error rates in a long run. If statistical consumers are unaware of assumptions behind rival evidence reforms, they can’t scrutinize the consequences that affect them (in personalized medicine, psychology, and so on). The book sets sail with a simple tool: If little has been done to rule out flaws in inferring a claim, then it has not passed a severe test. Many methods advocated by data experts do not stand up to severe scrutiny, and are even in tension with successful strategies for blocking or accounting for cherry picking and selective reporting. Through a series of excursions, tours, and exhibits, the philosophy and history of inductive inference come alive, while philosophical tools are put to work to solve problems about science and pseudoscience, induction and falsification.
Deborah G. Mayo is Professor Emerita in the Department of Philosophy at Virginia Tech and is a visiting professor at the London School of Economics and Political Science, Centre for the Philosophy of Natural and Social Science. She is the author of Error and the Growth of Experimental Knowledge (1996), which won the 1998 Lakatos Prize awarded to the most outstanding contribution to the philosophy of science during the previous six years. She co-edited Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (2010, Cambridge University Press) with Aris Spanos, and has published widely in the philosophy of science, statistics, and experimental inference.
“In this lively, witty, and intellectually engaging book, Deborah Mayo returns to first principles to make sense of statistics. She takes us beyond statistical formalism and recipes, and asks us to think philosophically about the enterprise of statistical inference itself. Her contribution will be a welcome addition to statistical learning. Mayo’s timely book will shrink enlarged posteriors and overinflated significance by focusing on whether our inferences have been severely tested, which is where we should be focused.”
– Nathan A. Schachtman, Lecturer in Law, Columbia Law School
“Whether or not you agree with her basic stance on statistical inference, if you are interested in the subject – and all scientists ought to be – Deborah Mayo’s writings are a must. Her views on inference are all the more valuable for being contrary to much current consensus. Her latest book will delight many and infuriate others but force all who are serious about these issues to think. Her capacity to jolt the complacent is second to none.”
– Stephen Senn, author of Dicing with Death
“Deborah Mayo’s insights into the philosophical dimensions of these problems are unsurpassed in their originality, their importance, and the breadth of understanding on which they are based. Here she combines perspectives from philosophy of science and the foundations of statistics to eliminate mirages produced by misunderstandings both philosophical and statistical, while putting into focus the ways in which her error-statistical approach is relevant to current problems of scientific inquiry in various disciplines.”
– Kent Staley, Saint Louis University
“This book by Deborah Mayo is a timely examination of the use of statistics in science. Her severity requirement demands that the scientist provide a sharp question and related data. Absent that, the observer should withhold judgment or outright reject. It is time to get tough. Funding agencies should take note.”
– S. Stanley Young, Ph.D., FASA, FAAAS
Statistical Inference as Severe Testing
How to Get Beyond the Statistics Wars
Deborah G. Mayo
Virginia Tech
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906
Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9781107054134
DOI: 10.1017/9781107286184
© Deborah G. Mayo 2018
This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published 2018
Printed in the United States of America by Sheridan Books, Inc.
A catalog record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Mayo, Deborah G., author.
Title: Statistical inference as severe testing : how to get beyond the statistics wars / Deborah G. Mayo (Virginia Tech).
Description: Cambridge : Cambridge University Press, 2018. | Includes bibliographical references and index.
Identifiers: LCCN 2018014718 | ISBN 9781107054134 (alk. paper)
Subjects: LCSH: Mathematical statistics. | Inference. | Error analysis (Mathematics) | Fallacies (Logic) | Deviation (Mathematics)
Classification: LCC QA276 .M3755 2018 | DDC 519.5/4–dc23
LC record available at https://lccn.loc.gov/2018014718
ISBN 978-1-107-05413-4 Hardback
ISBN 978-1-107-66464-7 Paperback
Additional resources for this publication at www.cambridge.org/mayo
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
To George W. Chatfield
for his magnificent support
Itinerary
Preface
Acknowledgments
Excursion 1 How to Tell What’s True about Statistical Inference
I Beyond Probabilism and Performance
1.1 Severity Requirement: Bad Evidence, No Test (BENT)
1.2 Probabilism, Performance, and Probativeness
1.3 The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon
II Error Probing Tools versus Logics of Evidence
1.4 The Law of Likelihood and Error Statistics
1.5 Trying and Trying Again: The Likelihood Principle
Excursion 2 Taboos of Induction and Falsification
I Induction and Confirmation
2.1 The Traditional Problem of Induction
2.2 Is Probability a Good Measure of Confirmation?
II Falsification, Pseudoscience, Induction
2.3 Popper, Severity, and Methodological Probability
2.4 Novelty and Severity
2.5 Fallacies of Rejection and an Animal Called NHST
2.6 The Reproducibility Revolution (Crisis) in Psychology
2.7 How to Solve the Problem of Induction Now
Excursion 3 Statistical Tests and Scientific Inference
I Ingenious and Severe Tests
3.1 Statistical Inference and Sexy Science: The 1919 Eclipse Test
3.2 N-P Tests: An Episode in Anglo-Polish Collaboration
3.3 How to Do All N-P Tests Do (and More) While a Member of the Fisherian Tribe
II It’s the Methods, Stupid
3.4 Some Howlers and Chestnuts of Statistical Tests
3.5 P-values Aren’t Error Probabilities Because Fisher Rejected Neyman’s Performance Philosophy
3.6 Hocus-Pocus: P-values Are Not Error Probabilities, Are Not Even Frequentist!
III Capability and Severity: Deeper Concepts
3.7 Severity, Capability, and Confidence Intervals (CIs)
3.8 The Probability Our Results Are Statistical Fluctuations: Higgs’ Discovery
Excursion 4 Objectivity and Auditing
I The Myth of “The Myth of Objectivity”
4.1 Dirty Hands: Statistical Inference Is Sullied with Discretionary Choices
4.2 Embrace Your Subjectivity
II Rejection Fallacies: Who’s Exaggerating What?
4.3 Significant Results with Overly Sensitive Tests: Large n Problem
4.4 Do P-Values Exaggerate the Evidence?
4.5 Who’s Exaggerating? How to Evaluate Reforms Based on Bayes Factor Standards
III Auditing: Biasing Selection Effects and Randomization
4.6 Error Control Is Necessary for Severity Control
4.7 Randomization
IV More Auditing: Objectivity and Model Checking
4.8 All Models Are False
4.9 For Model-Checking, They Come Back to Significance Tests
4.10 Bootstrap Resampling: My Sample Is a Mirror of the Universe
4.11 Misspecification (M-S) Testing in the Error Statistical Account
Excursion 5 Power and Severity
I Power: Pre-data and Post-data
5.1 Power Howlers, Trade-offs, and Benchmarks
5.2 Cruise Severity Drill: How Tail Areas (Appear to) Exaggerate the Evidence
5.3 Insignificant Results: Power Analysis and Severity
5.4 Severity Interpretation of Tests: Severity Curves
II How Not to Corrupt Power
5.5 Power Taboos, Retrospective Power, and Shpower
5.6 Positive Predictive Value: Fine for Luggage
III Deconstructing the N-P versus Fisher Debates
5.7 Statistical Theatre: “Les Miserables Citations”
5.8 Neyman’s Performance and Fisher’s Fiducial Probability
Excursion 6 (Probabilist) Foundations Lost, (Probative) Foundations Found
I What Ever Happened to Bayesian Foundations?
6.1 Bayesian Ways: From Classical to Default
6.2 What Are Bayesian Priors? A Gallimaufry
6.3 Unification or Schizophrenia: Bayesian Family Feuds
6.4 What Happened to Updating by Bayes’ Rule?
II Pragmatic and Error Statistical Bayesians
6.5 Pragmatic Bayesians
6.6 Error Statistical Bayesians: Falsificationist Bayesians
6.7 Farewell Keepsake
Souvenirs
References
Index
Preface
The Statistics Wars
Today’s “statistics wars” are fascinating: They are at once ancient and up to the minute. They reflect disagreements on one of the deepest, oldest, philosophical questions: How do humans learn about the world despite threats of error due to incomplete and variable data? At the same time, they are the engine behind current controversies surrounding high-profile failures of replication in the social and biological sciences. How should the integrity of science be restored? Experts do not agree. This book pulls back the curtain on why.
Methods of statistical inference become relevant primarily when effects are neither totally swamped by noise, nor so clear cut that formal assessment of errors is relatively unimportant. Should probability enter to capture degrees of belief about claims? To measure variability? Or to ensure we won’t reach mistaken interpretations of data too often in the long run of experience? Modern statistical methods grew out of attempts to systematize doing all of these. The field has been marked by disagreements between competing tribes of frequentists and Bayesians that have been so contentious – likened in some quarters to religious and political debates – that everyone wants to believe we are long past them. We now enjoy unifications and reconciliations between rival schools, it will be said, and practitioners are eclectic, prepared to use whatever method “works.” The truth is, long-standing battles still simmer below the surface in questions about scientific trustworthiness and the relationships between Big Data-driven models and theory. The reconciliations and unifications have been revealed to have serious problems, and there’s little agreement on which to use or how to interpret them. As for eclecticism, it’s often not clear what is even meant by “works.” The presumption that all we need is an agreement on numbers – never mind if they’re measuring different things – leads to pandemonium. Let’s brush the dust off the pivotal debates, walk into the museums where we can see and hear such founders as Fisher, Neyman, Pearson, Savage, and many others. Doing so lets us simultaneously zero in on the arguments between metaresearchers – those doing research on research – charged with statistical reforms.
Statistical Inference as Severe Testing
Why are some arguing in today’s world of high-powered computer searches that statistical findings are mostly false? The problem is that high-powered methods can make it easy to uncover impressive-looking findings even if they are false: spurious correlations and other errors have not been severely probed. We set sail with a simple tool: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. In the severe testing view, probability arises in scientific contexts to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data. That’s what it means to view statistical inference as severe testing. A claim is severely tested to the extent that it has been subjected to and passes a test that probably would have found flaws, were they present. You may be surprised to learn that many methods advocated by experts do not stand up to severe scrutiny, and are even in tension with successful strategies for blocking or accounting for cherry picking and selective reporting!
The severe testing perspective substantiates, using modern statistics, the idea Karl Popper promoted but never cashed out. The goal of highly well-tested claims differs sufficiently from that of highly probable ones that you can have your cake and eat it too: retaining both for different contexts. Claims may be “probable” (in whatever sense you choose) but terribly tested by these data. In saying we may view statistical inference as severe testing, I’m not saying statistical inference is always about formal statistical testing. The testing metaphor grows out of the idea that before we have evidence for a claim, it must have passed an analysis that could have found it flawed. The probability that a method commits an erroneous interpretation of data is an error probability. Statistical methods based on error probabilities I call error statistics. The value of error probabilities, I argue, lies not merely in controlling error in the long run, but in what they teach us about the source of the data in front of us. The concept of severe testing is sufficiently general to apply to any of the methods now in use, whether for exploration, estimation, or prediction.
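To make the notion concrete, consider the simplest case: a one-sided Normal test of H0: μ ≤ μ0 versus H1: μ > μ0 with σ known. Post-data, the severity with which the claim μ > μ1 passes is the probability the test would have yielded a result less discordant with H0 than the one observed, were μ only μ1. The sketch below (Python with SciPy; the function name and the numbers are hypothetical illustrations, not taken from the book) computes this quantity.

```python
# A minimal sketch of a post-data severity calculation for the claim
# mu > mu_1, after a one-sided Normal test with known sigma.
# Hypothetical illustration only; the book develops the full account.
from math import sqrt
from scipy.stats import norm

def severity(x_bar: float, mu_1: float, sigma: float, n: int) -> float:
    """P(sample mean <= x_bar; mu = mu_1): the probability of a result
    less discordant with H0 than the one observed, were mu only mu_1."""
    return norm.cdf((x_bar - mu_1) / (sigma / sqrt(n)))

# With n = 100, sigma = 10, and an observed mean of 2.0, the claim mu > 0
# passes with high severity, but mu > 1.9 is barely probed by the same data.
print(round(severity(x_bar=2.0, mu_1=0.0, sigma=10, n=100), 3))  # 0.977
print(round(severity(x_bar=2.0, mu_1=1.9, sigma=10, n=100), 3))  # 0.540
```

Read pre-data, the same computation yields a test’s error probabilities; read post-data, it grades how well a particular claim has been probed by the particular result.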
Getting Beyond the Statistics Wars
Thomas Kuhn’s remark that only in the face of crisis “do scientists behave like philosophers” (1970) holds some truth in the current statistical crisis in science. Leaders of today’s programs to restore scientific integrity have their own preconceptions about the nature of evidence and inference, and about “what we really want” in learning from data. Philosophy of science can also alleviate such conceptual discomforts. Fortunately, you needn’t accept the severe testing view in order to employ it as a tool for bringing into focus the crux of all these issues. It’s a tool for excavation, and for keeping us afloat in the marshes and quicksand that often mark today’s controversies. Nevertheless, important consequences will follow once this tool is used. First, there will be a reformulation of existing tools (tests, confidence intervals, and others) so as to avoid misinterpretations and abuses. The debates on statistical inference generally concern inference after a statistical model and data statements are in place, when in fact the most interesting work involves the local inferences needed to get to that point. A primary asset of error statistical methods is their contributions to designing, collecting, modeling, and learning from data. The severe testing view provides the much-needed link between a test’s error probabilities and what’s required for a warranted inference in the case at hand. Second, instead of rehearsing the same criticisms over and over again, challengers on all sides should now begin by grappling with the arguments we trace within. Kneejerk assumptions about the superiority of one or another method will not do. Although we’ll be excavating the actual history, it’s the methods themselves that matter; they’re too important to be limited by what someone 50, 60, or 90 years ago thought, or to what today’s discussants think they thought.
Who Is the Reader of This Book?
This book is intended for a wide-ranging audience of readers. It’s directed to consumers and practitioners of statistics and data science, and anyone interested in the methodology, philosophy, or history of statistical inference, or the controversies surrounding widely used statistical methods across the physical, social, and biological sciences. You might be a researcher or science writer befuddled by the competing recommendations offered by large groups (“megateams”) of researchers (should P-values be set at 0.05 or 0.005, or not set at all?). By viewing a contentious battle in terms of a difference in goals – finding highly probable versus highly well-probed hypotheses – readers can see why leaders of rival tribes often talk right past each other. A fair-minded assessment may finally be possible. You may have a skeptical bent, keen to hold the experts accountable. Without awareness of the assumptions behind proposed reforms you can’t scrutinize consequences that will affect you, be it in medical advice, economics, or psychology.