Although the problem of statistical inference is only a small part of what today goes under the umbrella of formal epistemology, progress in the statistics wars would advance more surely if philosophers regularly adopted the language of statistics. Not only would we be better at the job of clarifying the conceptual discomforts among practitioners of statistics and modeling, but some of the classic problems of confirmation could be scotched using the language of random variables and their distributions.8 Philosophy of statistics was long ahead of its time, in the sense of involving genuinely interdisciplinary work with statisticians, scientists, and philosophers of science. We need to return to that. There are many exceptions, of course; yet to try to list them would surely make me guilty of leaving several out.
1 i.e.,
2 Let HJ be (H & J ). To show: If there is a case where x confirms HJ more than x confirms J , then degree of probability cannot equal degree of confirmation.
(i) C (HJ , x ) > C (J , x ) is given.
(ii) J = ~HJ or HJ by logical equivalence.
(iii) C (HJ , x ) > C (~HJ or HJ , x ) by substituting (ii) in line (i).
Since ~HJ and HJ are mutually exclusive, we have from the special addition rule for probability:
(iv) Pr(HJ , x ) ≤ Pr(~HJ or HJ , x ).
So if Pr = C , (iii) and (iv) yield a contradiction. (Adapting Popper 1959 , p. 391)
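A quick numerical check may help. The sketch below uses my own toy numbers, not anything from the text: it verifies that the posterior probability of a conjunction can never exceed that of a conjunct, while a B-boost measure of confirmation, such as the ratio Pr(·|x)/Pr(·), can rank the conjunction above the conjunct – exactly the wedge the proof exploits.

```python
# Toy check (illustrative numbers of my own): Pr(HJ|x) <= Pr(J|x) always,
# yet a boost-ratio confirmation measure can rank HJ above J.

cells = {  # (H, J): (prior probability of the cell, Pr(x | cell))
    (True, True): (0.1, 0.9),
    (True, False): (0.4, 0.1),
    (False, True): (0.1, 0.2),
    (False, False): (0.4, 0.1),
}

pr_x = sum(p * lik for p, lik in cells.values())

def prior(event):
    return sum(p for cell, (p, _) in cells.items() if event(cell))

def posterior(event):
    # Pr(event | x), summing Bayes' Theorem over the four cells
    return sum(p * lik for cell, (p, lik) in cells.items() if event(cell)) / pr_x

HJ = lambda c: c[0] and c[1]  # the conjunction H & J
J = lambda c: c[1]            # the conjunct J alone

assert posterior(HJ) <= posterior(J)  # probability: conjunction never higher

print("Pr(HJ|x) =", round(posterior(HJ), 3), " Pr(J|x) =", round(posterior(J), 3))
print("C(HJ,x)  =", round(posterior(HJ) / prior(HJ), 3),
      " C(J,x)  =", round(posterior(J) / prior(J), 3))
# Output: probability ranks J above HJ, while the boost ratio ranks HJ above J.
```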
3 Why is a B-boost not necessary for Achinstein? Suppose you know x: the newspaper says Harry won, and it’s never wrong. Then a radio, also assumed 100% reliable, announces y: Harry won. Statement y, Achinstein thinks, should still count as evidence for H: Harry won. I agree.
4 To expand the reasoning, first observe that Pr(H|x)/Pr(H) = Pr(x|H)/Pr(x) and Pr(H & J|x)/Pr(H & J) = Pr(x|H & J)/Pr(x), both by Bayes’ Theorem. So, when Pr(H|x)/Pr(H) > 1, we also have Pr(x|H)/Pr(x) > 1. This, together with Pr(x|H & J) = Pr(x|H) (given), yields Pr(x|H & J)/Pr(x) > 1. Thus, we also have Pr(H & J|x)/Pr(H & J) > 1.
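A minimal numeric sketch of this chain, with made-up numbers of my own (J is stipulated to be independent of H and of x, which gives Pr(x|H & J) = Pr(x|H)): any B-boost that x gives H carries over, undiminished, to the conjunction H & J.

```python
# Toy check (my own illustrative numbers) of the irrelevant-conjunct result:
# when Pr(x | H & J) = Pr(x | H), x "confirms" H & J whenever it confirms H.
pr_H, pr_J = 0.3, 0.5                      # J independent of H and of x
pr_x_given_H, pr_x_given_notH = 0.8, 0.2

pr_x = pr_H * pr_x_given_H + (1 - pr_H) * pr_x_given_notH

pr_HJ = pr_H * pr_J                          # independence of J
pr_HJ_given_x = pr_x_given_H * pr_HJ / pr_x  # Bayes, using Pr(x|H&J) = Pr(x|H)

print("boost on H:  ", round(pr_x_given_H / pr_x, 3))    # ~2.105, i.e. > 1
print("boost on H&J:", round(pr_HJ_given_x / pr_HJ, 3))  # the very same boost
```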
5 Recall that Royall restricts the likelihood ratio to non-composite hypotheses, whereas here ~H is the Bayesian catchall.
6 “I must insist that C(h, e) can be interpreted as degree of corroboration only if e is a report on the severest tests we have been able to design. It is this point that marks the difference between the attitude of the inductivist, or verificationist, and my own attitude. The inductivist or verificationist wants affirmation for his hypothesis. He hopes to make it firmer by his evidence e and he looks out for ‘firmness’ – for ‘confirmation.’ … Yet if e is not a report about the results of our sincere attempts to overthrow h, then we shall simply deceive ourselves if we think we can interpret C(h, e) as degree of corroboration, or anything like it.” (Popper 1959, p. 418).
7 The real problem is that Pr(x; H & J) = Pr(x; H & ~J).
8 For a discussion and justification of the use of “random variables,” see Mayo (1996).
Tour II
Falsification, Pseudoscience, Induction
We’ll move from the philosophical ground floor to connecting themes from other levels, from Popperian falsification to significance tests, and from Popper’s demarcation to current-day problems of pseudoscience and irreplication. An excerpt from our Museum Guide gives a broad-brush sketch of the first few sections of Tour II:
Karl Popper had a brilliant way to “solve” the problem of induction: Hume was right that enumerative induction is unjustified, but science is a matter of deductive falsification. Science was to be demarcated from pseudoscience according to whether its theories were testable and falsifiable. A hypothesis is deemed severely tested if it survives a stringent attempt to falsify it. Popper’s critics denied he could sustain this and still be a deductivist …
Popperian falsification is often seen as akin to Fisher’s view that “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis” (1935a, p. 16). Though scientists often appeal to Popper, some critics of significance tests argue that they are used in decidedly non-Popperian ways. Tour II explores this controversy.
While Popper didn’t make good on his most winning slogans, he gives us many seminal launching-off points for improved accounts of falsification, corroboration, science versus pseudoscience, and the role of novel evidence and predesignation. These will let you revisit some thorny issues in today’s statistical crisis in science.
2.3 Popper, Severity, and Methodological Probability
Here’s Popper’s summary (drawing from Popper, Conjectures and Refutations, 1962, p. 53):
[Enumerative] induction … is a myth. It is neither a psychological fact … nor one of scientific procedure.
The actual procedure of science is to operate with conjectures…
Repeated observation and experiments function in science as tests of our conjectures or hypotheses, i.e., as attempted refutations.
[It is wrongly believed that using the inductive method can] serve as a criterion of demarcation between science and pseudoscience. … None of this is altered in the least if we say that induction makes theories only probable.
There are four key, interrelated themes:
(1) Science and Pseudoscience.
Redefining scientific method gave Popper a new basis for demarcating genuine science from questionable science or pseudoscience. Flexible theories that are easy to confirm – theories of Marx, Freud, and Adler were his exemplars – where you open your eyes and find confirmations everywhere, are low on the scientific totem pole (ibid., p. 35). For a theory to be scientific it must be testable and falsifiable.
(2) Conjecture and Refutation.
The problem of induction is a problem only if it depends on an unjustifiable procedure such as enumerative induction. Popper shocked everyone by denying scientists were in the habit of inductively enumerating. It doesn’t even hold up on logical grounds. To talk of “another instance of an A that is a B” assumes a conceptual classification scheme. How else do we recognize it as another item under the umbrellas A and B? (ibid., p. 44). You can’t just observe; you need an interest, a point of view, a problem.
The actual procedure for learning in science is to operate with conjectures in which we then try to find weak spots and flaws. Deductive logic is needed to draw out the remote logical consequences that we actually have a shot at testing (ibid., p. 51). From the scientist down to the amoeba, says Popper, we learn by trial and error: conjecture and refutation (ibid., p. 52). The crucial difference is the extent to which we constructively learn how to reorient ourselves after clashes.
Without waiting, passively, for repetitions to impress or impose regularities upon us, we actively try to impose regularities upon the world… These may have to be discarded later, should observation show that they are wrong.
(ibid., p. 46)
(3) Observations Are Not Given.
Popper rejected the time-honored empiricist assumption that observations are known relatively unproblematically. If they are at the “foundation,” it is only because there are apt methods for testing their validity. We dub claims observable because or to the extent that they are open to stringent checks. (Popper: “anyone who has learned the relevant technique can test it” (1959, p. 99).) Accounts of hypothesis appraisal that start with “evidence x,” as in confirmation logics, vastly oversimplify the role of data in learning.
(4) Corroboration Not Confirmation, Severity Not Probabilism.
Last, there is his radical view on the role of probability in scientific inference. Rejecting probabilism, Popper not only rejects Carnap-style logics of confirmation, but denies scientists are interested in highly probable hypotheses (in any sense). They seek bold, informative, interesting conjectures and ingenious and severe attempts to refute them. If one uses a logical notion of probability, as philosophers (including Popper) did at the time, the high-content theories are highly improbable; in fact, Popper said universal theories have probability 0. (Popper also talked of statistical probabilities as propensities.)
These themes are in the spirit of the error statistician. Considerable spadework is required to see what to keep and what to revise, so bring along your archeological shovels.
Demarcation and Investigating Bad Science
There is a reason that statisticians and scientists often refer back to Popper; his basic philosophy – at least his most winning slogans – is in sync with ordinary intuitions about good scientific practice. Even people divorced from Popper’s full philosophy wind up going back to him when they need to demarcate science from pseudoscience. Popper’s right that if using enumerative induction makes you scientific, then anyone from an astrologer to one who blithely moves from observed associations to full-blown theories is scientific. Yet the criterion of testability and falsifiability – as typically understood – is nearly as bad. It is both too strong and too weak. Any crazy theory found false would be scientific, and our most impressive theories aren’t deductively falsifiable. Larry Laudan’s famous (1983) “The Demise of the Demarcation Problem” declared the problem taboo. This is a highly unsatisfactory situation for philosophers of science. Now Laudan and I generally see eye to eye; perhaps our disagreement here is just semantics. I share his view that what really matters is determining if a hypothesis is warranted or not, rather than whether the theory is “scientific,” but surely Popper didn’t mean logical falsifiability sufficed. Popper is clear that many unscientific theories (e.g., Marxism, astrology) are falsifiable. It’s clinging to falsified theories that leads to unscientific practices. (Note: The use of a strictly falsified theory for prediction, or because nothing better is available, isn’t unscientific.) I say that, with a bit of fine-tuning, we can retain the essence of Popper to capture what makes an inquiry, if not an entire domain, scientific.
Following Laudan, philosophers tend to shy away from saying anything general about science versus pseudoscience – the predominant view is that there is no such thing. Some say that there’s at most a kind of “family resemblance” amongst domains people tend to consider scientific (Dupré 1993, Pigliucci 2010, 2013). One gets the impression that the demarcation task is being left to committees investigating allegations of poor science or fraud. They are forced to articulate what to count as fraud, as bad statistics, or as mere questionable research practices (QRPs). People’s careers depend on their ruling: they have “skin in the game,” as Nassim Nicholas Taleb might say (2018). The best one I know – the committee investigating fraudster Diederik Stapel – advises making philosophy of science a requirement for researchers (Levelt Committee, Noort Committee, and Drenth Committee 2012). So let’s not tell them philosophers have given up on it.
Diederik Stapel.
A prominent social psychologist “was found to have committed a serious infringement of scientific integrity by using fictitious data in his publications” (Levelt Committee 2012, p. 7). He was required to retract 58 papers, relinquish his university degree, and much else. The authors of the report describe walking into a culture of confirmation and verification bias. They could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (ibid., p. 48). Free of the qualms that give philosophers of science cold feet, they advance some obvious yet crucially important rules with Popperian echoes:
One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing to repeat an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tend to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts.
(ibid., p. 48)
Exactly! This is our minimal requirement for evidence: If it’s so easy to find agreement with a pet theory or claim, such agreement is bad evidence, no test, BENT. To scrutinize the scientific credentials of an inquiry is to determine if there was a serious attempt to detect and report errors and biasing selection effects. We’ll meet Stapel again when we reach the temporary installation on the upper level: The Replication Revolution in Psychology.
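The committee’s example – continuing to repeat an experiment until it works as desired – can be made vivid with a small simulation. The sketch below is my own illustration with assumed parameters (30 normal observations per experiment, up to 10 tries, a two-sided test of H0: mu = 0 at the 0.05 level), not anything from the report: even when the null hypothesis is true, “trying until it works” declares an effect roughly 40% of the time, rendering the hypothesis near-immune to the facts.

```python
import math
import random

random.seed(1)

def one_experiment(n=30):
    """One honest significance test on n draws generated under H0: mu = 0."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(xs) / n) * math.sqrt(n)  # z statistic; N(0,1) under H0
    return abs(z) > 1.96              # "significant" at the 0.05 level

def keep_trying(max_tries=10):
    """Repeat the experiment until it 'works' (or the tries run out)."""
    return any(one_experiment() for _ in range(max_tries))

trials = 10_000
single = sum(one_experiment() for _ in range(trials)) / trials
persistent = sum(keep_trying() for _ in range(trials)) / trials
print(f"one honest try: {single:.3f}   try-till-it-works: {persistent:.3f}")
# roughly 0.05 versus 0.40 (= 1 - 0.95**10): the nominal error rate is a fiction
```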
The issue of demarcation (point (1)) is closely related to Popper’s conjecture and refutation (point (2)). While he regards a degree of dogmatism as necessary, lest we give theories up too readily, the trial and error methodology “gives us a chance to survive the elimination of an inadequate hypothesis – when a more dogmatic attitude would eliminate it by eliminating us” (Popper 1962, p. 52). Despite giving lip service to testing and falsification, many popular accounts of statistical inference do not embody falsification – even of a statistical sort.
Nearly everyone, however, now accepts point (3), that observations are not just “given” – knocking out a crucial pillar on which naïve empiricism stood. To the question: What came first, hypothesis or observation? Popper answers, another hypothesis, only lower down or more local. Do we get an infinite regress? No, because we may go back to increasingly primitive theories and even, Popper thinks, to an inborn propensity to search for and find regularities (ibid., p. 47). I’ve read about studies appearing to show that babies are aware of what is statistically unusual. In one, babies were shown a box with a large majority of red versus white balls (Xu and Garcia 2008, Gopnik 2009). When a succession of white balls is drawn, one after another, with the contents of the box covered by a screen, the babies looked longer than when the more common red balls were drawn. I don’t find this far-fetched. Anyone familiar with preschool computer games knows how far toddlers can get in solving problems without a single word, just by trial and error.
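For a sense of the arithmetic, with a proportion I am assuming purely for illustration (the study’s actual box composition may differ): if 80% of the balls are red, a run of white draws becomes improbable very quickly under random sampling.

```python
# Back-of-envelope arithmetic (assumed 80/20 red/white split, for illustration):
# the probability of drawing white k times in a row at random is 0.2**k.
p_white = 0.2
for k in (1, 2, 3, 4, 5):
    print(f"{k} white draw(s) in a row: {p_white**k:.4f}")
# by k = 4 the run is already a 1-in-625 event under random draws
```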
Greater Content, Greater Severity.
The position people are most likely to take a pass on is (4), his view of the role of probability. Yet Popper’s central intuition is correct: if we wanted highly probable claims, scientists would stick to low-level observables and not seek generalizations, much less theories with high explanatory content. In this day of fascination with Big Data’s ability to predict what book I’ll buy next, a healthy Popperian reminder is due: humans also want to understand and to explain. We want bold “improbable” theories. I’m a little puzzled when I hear leading machine learners praise Popper, a realist, while proclaiming themselves fervid instrumentalists. That is, they hold the view that theories, rather than aiming at truth, are just instruments for organizing and predicting observable facts. It follows from the success of machine learning, Vladimir Cherkassky avers, that “realism is not possible.” This is very quick philosophy! “… [I]n machine learning we are given a set of [random] data samples, and the goal is to select the best model (function, hypothesis) from a given set of possible models” (Cherkassky 2012). Fine, but is the background knowledge required for this setup itself reducible to a prediction–classification problem? I say no, as would Popper. Even if Cherkassky’s problem is relatively theory free, it wouldn’t follow this is true for all of science. Some of the most impressive “deep learning” results in AI have been criticized for lacking the ability to generalize beyond observed “training” samples, or to solve open-ended problems (Gary Marcus 2018).
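The setup Cherkassky describes is easy to state in code. The sketch below is a minimal illustration of my own (made-up data and a made-up candidate set, not Cherkassky’s formulation): given random samples, select the best model from a given set by held-out prediction error. Notice how much is supplied from outside the prediction problem itself – the candidate set, the loss, the train/test split – which is the point about background knowledge.

```python
import random

# Minimal model-selection sketch (illustrative data and candidate set):
# pick the best model from a given set by error on held-out samples.
random.seed(7)
xs = [random.uniform(-1, 1) for _ in range(60)]
ys = [x**3 - x + random.gauss(0, 0.1) for x in xs]   # the unknown "truth"
data = list(zip(xs, ys))
train, test = data[:40], data[40:]

models = {                       # the "given set of possible models"
    "linear": lambda x: x,
    "quadratic": lambda x: x * x,
    "cubic-ish": lambda x: x**3 - x,
}

def fit_scale(g, pairs):
    """Least-squares coefficient a for the one-parameter model y = a*g(x)."""
    num = sum(y * g(x) for x, y in pairs)
    den = sum(g(x) ** 2 for x, _ in pairs)
    return num / den

def held_out_error(g, a, pairs):
    return sum((y - a * g(x)) ** 2 for x, y in pairs) / len(pairs)

scores = {name: held_out_error(g, fit_scale(g, train), test)
          for name, g in models.items()}
print(min(scores, key=scores.get), scores)   # "cubic-ish" should win
```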
A valuable idea to take from Popper is that probability in learning attaches to a method of conjecture and refutation, that is, to testing: it is methodological probability. An error probability is a special case of a methodological probability. We want methods with a high probability of teaching us (and machines) how to distinguish approximately correct and incorrect interpretations of data, even leaving murky cases in the middle, and how to advance knowledge of detectable, while strictly unobservable, effects.
The choices for probability that we are commonly offered are stark: “in here” (beliefs ascertained by introspection) or “out there” (frequencies in long runs, or chance mechanisms). This is the “epistemology” versus “variability” shoehorn we reject (Souvenir D). To qualify the method by which H was tested, frequentist performance is necessary, but it’s not sufficient. The assessment must be relevant to ensuring that claims have been put to severe tests. You can talk of a test having a type of propensity or capability to have discerned flaws, as Popper did at times. A highly explanatory, high-content theory, with interconnected tentacles, has a higher probability of having flaws discerned than low-content theories that do not rule out as much. Thus, when the bolder, higher content, theory stands up to testing, it may earn higher overall severity than the one with measly content. That a theory is plausible is of little interest, in and of itself; what matters is that it is implausible for it to have passed these tests were it false or incapable of adequately solving its set of problems. It is the fuller, unifying, theory developed in the course of solving interconnected problems that enables severe tests.
Methodological probability is not there to quantify my beliefs, but neither is it about a world I came across without considerable effort to beat nature into line. Still less is it about a world-in-itself which, by definition, can’t be accessed by us. Deliberate effort and ingenuity are what allow me to ensure I shall come up against a brick wall, and be forced to reorient myself, at least with reasonable probability, when I test a flawed conjecture. The capabilities of my tools to uncover mistaken claims (their error probabilities) are real properties of the tools. Still, they are my tools, specially and imaginatively designed. If people say they’ve made so many judgment calls in building the inferential apparatus that what’s learned cannot be objective, I suggest they go back and work some more at their experimental design, or develop better theories.