[t]he key distinction between Bayesian and sampling theory statistics is the issue of what is to be regarded as random and what is to be regarded as fixed. To a Bayesian, parameters are random and data, once observed, are fixed…
(Kadane 2011, p. 437)
Kadane’s point is that “[t]o a sampling theorist, data are random even after being observed, but parameters are fixed” (ibid.). When an error statistician speaks of the probability that the results standing before us are a mere statistical fluctuation, she is referring to a methodological probability: the probability that the method used would produce data displays (e.g., bumps) as impressive as these, under the assumption of H0. If you’re a Bayesian probabilist, D-1 through D-3 appear to be assigning a probability to a hypothesis (about the parameter) because, since the data are known, only the parameter remains unknown. But they’re supposed to be scrutinizing a non-Bayesian procedure here. Whichever approach you favor, my point is that the two sides are talking past each other. To get beyond this particular battle, that has to be recognized.
The Real Problem with D-1 through D-3. The error probabilities in U-1 through U-3 are straightforward. In the Higgs experiment, the needed computations are based on simulating relative frequencies of events where H0: μ = 0 (given a detector model). In terms of the corresponding P-value:
(1) Pr(test T would produce a P-value ≤ 0.0000003; H0) ≤ 0.0000003.
D-1, 2, 3 are just slightly imprecise ways of expressing U-1, 2, 3. So what’s the objection to D-1, 2, 3? It’s the danger some find in moving from such claims to their complements. If I say there’s a 0.0000003 probability their results are due to chance, some infer there’s a 0.9999997 (or whatever) probability their results are not due to chance – are not a false positive, are not a fluctuation. And those claims are wrong. If Pr(A; H0) = p, for some assertion A, the probability of the complement is Pr(not-A; H0) = 1 − p. In particular:
(1)′ Pr(test T would not display a P-value ≤ 0.0000003; H0) ≥ 0.9999997.
There’s no transposing! That is, the hypothesis after the “;” does not switch places with the event to the left of the “;”! But despite how the error statistician hears D-1 through D-3, I’m prepared to grant that the corresponding U claims are safer. I assure you that my destination is not merely refining statistical language: when critics convince practitioners that they’ve been speaking Bayesian prose without knowing it (as in Molière), the consequences are non-trivial. I’m about to get to them.
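To see concretely why nothing transposes, here is a minimal sketch (in Python, using the standard normal approximation to the 5-sigma test statistic, not the collaborations’ actual likelihood machinery) computing both (1) and its complement; H0 sits to the right of the “;” in both cases.

```python
from scipy import stats

# Under H0 (background only), the test statistic d(X) is approximately standard normal.
# (1): the probability the method yields a result as extreme as 5 sigma, computed under H0.
p_tail = stats.norm.sf(5)         # one-sided tail area ~ 2.9e-07, about 1 in 3.5 million

# The complement is still computed under H0 -- the hypothesis never switches sides of the ";".
p_complement = stats.norm.cdf(5)  # ~ 0.9999997

print(f"Pr(d(X) >= 5; H0) ~ {p_tail:.2e}")
print(f"Pr(d(X) <  5; H0) ~ {p_complement:.7f}")
```

Both numbers describe the method’s behavior under H0; neither is a probability of H0.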
Detaching Inferences Uses an Implicit Severity Principle
Phrases such as “the probability our results are a statistical fluctuation (or fluke) is very low” are common enough in HEP – although physicists tell me it’s the science writers who reword their correct U-claims as slippery D-claims. Maybe so. But if you follow the physicist’s claims through the process of experimenting and modeling, you find they are alluding to proper error probabilities. You may think they really mean an illicit posterior probability assignment to “real effect” or H1 if you think that statistical inference takes the form of probabilism. In fact, if you’re a Bayesian probabilist, and assume the statistical inference must have a posterior probability, or a ratio of posterior probabilities, you will regard U-1 through U-3 as legitimate but irrelevant to inference; and D-1 through D-3 as relevant only by misinterpreting P-values as giving a probability to the null hypothesis H0.
If you are an error statistician (whether you favor a behavioral performance or a severe probing interpretation), even the correct claims U-1 through U-3 are not statistical inferences! They are the (statistical) justifications associated with implicit statistical inferences, and even though HEP practitioners are well aware of them, they should be made explicit. Such inferences can take many forms, such as those I place in brackets:
U-1. The probability of the background alone fluctuating up by this amount or more is about one in 3 million.
[Thus, our results are not due to background fluctuations.]
U-2. Only one experiment in 3 million would see an apparent signal this strong in a universe [where H0 is adequate].
[Thus H0 is not adequate.]
U-3. The probability that their signal would result by a chance fluctuation was less than one in 3.5 million.
[Thus the signal was not due to chance.]
The formal statistics moves from
(1) Pr(test T produces d(X) ≥ 5; H0) < 0.0000003
to
(2) there is strong evidence for
(first) (2a) a genuine (non-fluke) discrepancy from H0;
(later) (2b) H*: a Higgs (or a Higgs-like) particle.
They move in stages from indications, to evidence, to discovery. Admittedly, moving from (1) to inferring (2) relies on the implicit assumption of error statistical testing, the severity principle. I deliberately phrase it in many ways. Here’s yet another, in a Popperian spirit:
Severity Principle (from low P-value): Data provide evidence for a genuine discrepancy from H0 (just) to the extent that H0 would (very probably) have survived, were H0 a reasonably adequate description of the process generating the data.
What is the probability that H0 would have “survived” (and not been falsified) at the 5-sigma level? It is the probability of the complement of the event {d(X) ≥ 5}, namely {d(X) < 5}, under H0. Its probability is correspondingly 1 − 0.0000003 = 0.9999997. So the overall argument starting from a fortified premise goes like this:
(1)* With probability 0.9999997, the bumps would be smaller, would behave like statistical fluctuations: disappear with more data, wouldn’t be produced at both CMS and ATLAS, in a world adequately modeled by H0.
They did not disappear, they grew (from 5 to 7 sigma). So,
(2a) infer there’s evidence of H1: non-fluke, or (2b) infer H*: a Higgs (or a Higgs-like) particle.
There’s always the error statistical qualification of the inference in (2), given by the relevant methodological probability. Here it is a report of the stringency or severity of the test that the claim has passed, as given in (1)*: 0.9999997. We might even dub it the severity coefficient. Without making the underlying principle of testing explicit, some critics assume the argument is all about the reported P-value. In fact, the P-value is a mere stepping stone to an inductive inference that is detached.
Members of a strict (N-P) behavioristic tribe might reason as follows: if you follow the rule of behavior “interpret 5-sigma bumps as a real effect (a discrepancy from 0),” you’d erroneously interpret data with probability less than 0.0000003 – a very low error probability. Doubtless, HEP physicists are keen to avoid repeating such mistakes as apparently finding particles that move faster than light, only to discover some problem with the electrical wiring (Reich 2012). I claim the specific evidential warrant for the 5-sigma Higgs inferences isn’t low long-run error rates, but the ability to detach an inference based on a stringent test or a strong argument from coincidence.³
Learning How Fluctuations Behave: The Game of Bump-Hunting
Dennis Overbye (2013) wrote an article in the New York Times: “Chasing the Higgs,” based on his interviews with spokespeople Fabiola Gianotti (ATLAS) and Guido Tonelli (CMS). It’s altogether common, Tonelli explains, that the bumps they find are “random flukes” – spuriously significant results – “So ‘we crosscheck everything’ and ‘try to kill’ any anomaly that might be merely random.”
One bump on physicists’ charts … was disappearing. But another was blooming like the shy girl at a dance. … nobody could remember exactly when she had come in. But she was the one who would marry the prince … It continued to grow over the fall until it had reached the 3-sigma level – the chances of being a fluke [spurious significance] were less than 1 in 740, enough for physicists to admit it to the realm of “evidence” of something, but not yet a discovery.
(Overbye 2013)
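As a quick check on the “1 in 740” figure: under the one-sided standard normal convention HEP uses for significance, the tail area beyond 3 sigma is about 0.00135, i.e., roughly 1 in 740. A two-line sketch:

```python
from scipy import stats

p_3sigma = stats.norm.sf(3)   # one-sided tail beyond 3 sigma ~ 0.00135
print(round(1 / p_3sigma))    # ~ 741, i.e., "less than 1 in 740"
```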
What’s one difference between HEP physics and fields where most results are claimed to be false? HEP physicists don’t publish on the basis of a single, isolated (nominal) P-value. That doesn’t mean promising effects don’t disappear. “‘We’ve made many discoveries,’ Dr. Tonelli said, ‘most of them false’” (ibid.).
Look Elsewhere Effect (LEE).
The null hypothesis is formulated to correspond to regions where an excess or bump is found. Not knowing the mass region in advance means “the local p-value did not include the fact that ‘pure chance’ had lots of opportunities … to provide an unlikely occurrence” (Cousins 2017, p. 424). So here a nominal (they call it local) P-value is assessed at a particular, data-determined, mass. But the probability of so impressive a difference anywhere in a mass range – the global P-value – would be greater than the local one. “The original concept of ‘5σ’ in HEP was therefore mainly motivated as a (fairly crude) way to account for a multiple trials factor … known as the ‘Look Elsewhere Effect’” (ibid., p. 425). HEP physicists often report both local and global P-values.
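Here is a crude sketch of the trials-factor idea behind the LEE: if the search effectively scanned some number of independent mass windows (the 100 below is purely hypothetical; real analyses estimate the trials factor from the detector’s mass resolution), the global P-value is inflated relative to the local one, and the global significance drops accordingly.

```python
from scipy import stats

p_local = stats.norm.sf(5)                 # local P-value at the data-determined mass (5 sigma)
n_windows = 100                            # hypothetical effective number of independent mass windows
p_global = 1 - (1 - p_local) ** n_windows  # crude multiple-trials correction
z_global = stats.norm.isf(p_global)        # corresponding global significance

print(f"local  P ~ {p_local:.2e} (5.0 sigma)")
print(f"global P ~ {p_global:.2e} ({z_global:.1f} sigma)")
```

With these hypothetical numbers a 5-sigma local excess corresponds to only about a 4-sigma global one, which is part of why the 5-sigma convention is so demanding.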
Background information enters, not via prior probabilities of the particles’ existence, but as to how researchers might be led astray. “If they were flukes, more data would make them fade into the statistical background … If not, the bumps would grow in slow motion into a bona fide discovery” (Overbye 2013). So they give the bump a hard time: they stress test, look at multiple decay channels, and hide the details of the area where they found it from the other team. When two independent experiments find the same particle signal at the same mass, it helps to overcome the worry of multiple testing, strengthening an argument from coincidence.
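A toy simulation of the “flukes fade, real effects grow” behavior (all rates below are hypothetical, with a simple Poisson counting model standing in for the actual analyses): with a genuine excess the significance tends to grow roughly like the square root of the amount of data, while a fluke’s significance drifts back toward the background.

```python
import numpy as np

rng = np.random.default_rng(7)

def z_score(counts, background):
    """Crude significance of an excess over a known expected background per batch."""
    n = len(counts)
    return (counts.sum() - background * n) / np.sqrt(background * n)

background = 100.0  # expected events per batch under H0 (hypothetical)
signal = 3.0        # extra events per batch if a real effect is present (hypothetical)

for label, extra in [("fluke (no signal)", 0.0), ("real effect", signal)]:
    counts = rng.poisson(background + extra, size=1000)
    for n_batches in (50, 200, 1000):  # growing amounts of data
        z = z_score(counts[:n_batches], background)
        print(f"{label:18s} n={n_batches:4d}  z = {z:5.2f}")
```

Under these assumptions the “real effect” rows climb steadily toward and past 5 sigma, while the “fluke” rows hover near zero.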
Once the null is rejected, the job shifts to testing if various parameters agree with the SM predictions.
This null hypothesis of no Higgs (or Higgs-like) boson was definitively rejected upon the announcement of the observation of a new boson by both ATLAS and CMS on July 4, 2012. The confidence intervals for signal strength θ … were in reasonable agreement with the predictions for the SM Higgs boson. Subsequently, much of the focus shifted to measurements of … production and decay mechanisms. For measurements of continuous parameters, … the tests … use the frequentist duality … between interval estimation and hypothesis testing. One constructs (approximate) confidence intervals and regions for parameters … and checks whether the predicted values for the SM Higgs boson are within the confidence regions.
(Cousins 2017, p. 414)
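A minimal sketch of the duality Cousins describes, assuming a hypothetical measured signal strength and standard error (the real analyses use profile-likelihood intervals over many channels): form an approximate 95% confidence interval for the signal-strength parameter (written θ in the quote above, μ here) and check whether the SM prediction, signal strength 1, lies inside it.

```python
from scipy import stats

mu_hat, se = 1.1, 0.15     # hypothetical measured signal strength and its standard error
z = stats.norm.ppf(0.975)  # ~1.96 for an approximate 95% interval
lo, hi = mu_hat - z * se, mu_hat + z * se

sm_prediction = 1.0        # the SM Higgs corresponds to signal strength mu = 1
print(f"approximate 95% CI for mu: ({lo:.2f}, {hi:.2f})")
print("SM value inside the CI:", lo <= sm_prediction <= hi)
# By the test/interval duality, mu = 1 lying inside the 95% CI means the
# level-0.05 test of "mu = 1" would not reject it with these data.
```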
Now the corresponding null hypothesis (call it the new null) is the SM Higgs boson, and discrepancies from it are probed and estimated with confidence intervals. The most important role for statistical significance tests is actually when results are insignificant, or the P-values are not small: negative results. They afford a standard for blocking inferences that would be made too readily. In this episode, they arose to
(a) block precipitously declaring evidence of a new particle;
(b) rule out values of various parameters, e.g., spin values that would preclude its being “ Higgs-like,” and various mass ranges of the particle.
While the popular press highlighted the great success for the SM, the HEP physicists, at both stages, were vigorously, desperately seeking to uncover BSM (Beyond the Standard Model) physics.
Once again, the background knowledge of fluke behavior was central to curbing their enthusiasm about bumps that hinted at discrepancies with the new null. Even though the July 2012 data gave evidence of the existence of a Higgs-like particle – where calling it “Higgs-like” still kept the door open for an anomaly with the “plain vanilla” particle of the SM – they also showed some hints of such an anomaly.
Matt Strassler, who, like many, is longing to find evidence for BSM physics, was forced to concede: “The excess (in favor of BSM properties) has become a bit smaller each time … That’s an unfortunate sign, if one is hoping the excess isn’t just a statistical fluke” (2013a). Or they’d see the bump at ATLAS … and not CMS. “Taking all of the LHC’s data, and not cherry picking … there’s nothing here that you can call ‘evidence’” for the much sought BSM (Strassler 2013b). They do not say the cherry-picked results ‘give evidence, but disbelief in BSM physics leads us to discount it,’ as Royall’s Likelihoodist may opt to. They say: “There’s nothing here that you can call evidence.”
Considering the frequent statistical fluctuations, and the hot competition between ATLAS and CMS to be first, a tool for when to “curb their enthusiasm” is exactly what was wanted. So this negative role of significance tests is crucial for denying that BSM anomalies are real, and for setting upper bounds on these discrepancies with the SM Higgs. Since each test has its own test statistic, I’ll use g(x) rather than d(x).
Severity Principle (for non-significance): Data provide evidence to rule out a discrepancy δ* to the extent that a larger g(x0) would very probably have resulted if δ were as great as δ*.
This can equivalently be seen as inferring confidence bounds or applying FEV. The particular value of δ* isn’t so important at this stage. What happens with negative results here is that the indicated discrepancies get smaller and smaller as do the bumps, and just vanish. These were not genuine effects, even though there’s no falsification of BSM.
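A sketch of the severity computation behind ruling out a discrepancy δ*, under a normal approximation with g measured in σ units and purely hypothetical numbers: the severity for “discrepancy < δ*” is the probability that a larger g(x0) would have occurred were the discrepancy as great as δ*; when that probability is high, the non-significant result warrants ruling out discrepancies that large.

```python
from scipy import stats

g_obs = 1.2       # observed, non-significant test statistic in sigma units (hypothetical)
delta_star = 3.0  # size of the discrepancy from the SM-Higgs null we hope to rule out

# Severity for "discrepancy < delta_star": the probability of a larger g(X) than
# the one observed, were the discrepancy as great as delta_star.
severity = stats.norm.sf(g_obs - delta_star)
print(f"SEV(discrepancy < {delta_star} sigma) ~ {severity:.3f}")  # ~ 0.96 here
```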
Negative results in HEP physics are scarcely the stuff of file drawers, a serious worry leading to publication bias in many fields. Cousins tells of the wealth of papers that begin “Search for …” (2017, p. 412). They are regarded as important and informative – if only in ruling out avenues for theory development. Here’s an idea for domains confronted with biases against publishing negative results.
Back to O’Hagan and a 2015/2016 Update
O’Hagan published a digest of responses a few days later. When it was clear his letter had not met with altogether enthusiastic responses, he backed off, admitting that he had merely intended to be provocative with the earlier letter. Still, he declares, the Higgs researchers would have been better off avoiding the “ad hoc” 5 sigma by doing a proper (subjective) Bayesian analysis. “They would surely be willing to [announce SM Higgs discovery] if they were, for instance, 99.99 percent certain” [SM Higgs] existed. Wouldn’t it be better to report
Pr(SM Higgs|data) = 0.9999?
Actually, no. Not if it’s taken as a formal probability rather than a chosen way to abbreviate: the reality of the SM Higgs has passed a severe test. Physicists believed in a Higgs particle before building the big billion-dollar collider. Given the perfect predictive success of the SM, and its simplicity, such beliefs would meet the familiar standards for plausibility. But that’s very different from having evidence for a discovery, or information about the characteristics of the particle. Many aver they didn’t expect it to have so small a mass, 125 GeV. In fact, given the unhappy consequences some find with this low mass, some researchers may well have gone back and changed their prior probabilities to arrive at something more sensible (more “natural” in the parlance of HEP). Yet their strong argument from coincidence via significance tests prevented the effect from going away.
O’Hagan/Lindley admit that a subjective Bayesian model for the Higgs would require assigning prior probabilities to scads of high-dimensional “nuisance” parameters of the background and the signal; it would demand multivariate priors, correlations between parameters, joint priors, and the ever-worrisome Bayesian catchall factor: Pr(data|not-H*). Lindley’s idea of subjectively eliciting beliefs from HEP physicists is rather unrealistic here.
Now for the update. When the collider restarted in 2015, it ran at far greater collision energies than before. On December 15, 2015 something exciting happened: “ATLAS and CMS both reported a small ‘bump’ in their data” at a much higher energy level than the Higgs: 750 GeV (compared to 125 GeV) (Cartlidge 2016). “As this unexpected bump could be the first hint of a new massive particle that is not predicted by the Standard Model of particle physics, the data generated hundreds of theory papers that attempt to explain the signal” (ibid.). I believe the count was around 500.
The significance reported by CMS is still far below physicists’ threshold for a discovery: 5 sigma, or a chance of around 3 in 10 million that the signal is a statistical fluke.
(Castelvecchi and Gibney 2016)
We might replace “the signal” with “a signal like this” to avoid criticism. Even though this threshold is more stringent than the usual requirement, the “we’re not that impressed” stance kicks in: it’s not so very rare for even more impressive results to occur by background alone. As the data come in, the significance levels will either grow or wane with the bumps:
Physicists say that by June, or August [2016] at the latest, CMS and ATLAS should have enough data to either make a statistical fluctuation go away – if that’s what the excess is – or confirm a discovery.
(Castelvecchi and Gibney 2016)
Could the Bayesian model wind up in the same place? Not if Lindley/O’Hagan’s subjective model merely keeps updating beliefs in the already expected parameters. According to Savage, “The probability of ‘something else’ … is definitely very small” (Savage 1962, p. 80). It would seem to require a long string of anomalies before the catchall is made sufficiently probable to start seeking new physics. Would they come up with a particle like the one they were now in a frenzy to explain? Maybe, but it would be a far less efficient route to discovery than the simple significance tests.