
Statistical Inference as Severe Testing


by Deborah G Mayo

0. We observe d = x̄_T − x̄_C, where x̄_T and x̄_C are the observed sample means in the treated and control groups, forming the standard test statistic d* = (d − Δ)/SE, where SE is the standard error of d (estimated as √(s²_T/n_T + s²_C/n_C)). Thanks to randomized assignment, we can estimate the standard error and the sampling distribution of d*.

  Under the (strong) null hypothesis, the two groups, treated and control, may be regarded as coming from the same population with respect to mean effect, such as age-related dementia. Think about the RCT reasoning this way: if the HRT treatment makes no difference, people in the treated group would have had (or not had) dementia even if they’d been assigned to the control group. Some will get it, others won’t (of course we can also consider degrees). Under the null hypothesis, any observed difference would be due to the accidental assignment to group T or C. So, if d exceeds 0, it’s just because more of the people who would have gotten dementia anyway happen to end up being assigned to the treated rather than the control group. Thanks to the random assignment, we can determine the probability of this occurring (under H₀). This is a particularly vivid illustration of a difference “due to chance” – where the chance is the way subjects were assigned to treatment. A statistical connection between the observed difference and the theoretical parameter Δ is created by dint of the design and execution of the experiment. 9
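  To make the “due to chance” reasoning concrete, here is a minimal sketch (not from the text, with fabricated outcomes): under H₀ the group labels are exchangeable, so re-shuffling them many times shows how often chance assignment alone yields a difference at least as large as the one observed.

```python
# Hypothetical illustration of the randomization ("due to chance") reasoning.
# Under H0 the labels T/C are exchangeable, so we reassign them at random and
# count how often the reshuffled difference is at least as large as observed.
import numpy as np

rng = np.random.default_rng(0)

# Fabricated outcomes (1 = develops dementia, 0 = does not)
treated = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
control = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1, 0])

observed_d = treated.mean() - control.mean()

pooled = np.concatenate([treated, control])
n_treated = len(treated)
n_reps = 10_000

count = 0
for _ in range(n_reps):
    perm = rng.permutation(pooled)                      # chance reassignment to T or C
    d = perm[:n_treated].mean() - perm[n_treated:].mean()
    if d >= observed_d:
        count += 1

print(f"observed difference d = {observed_d:.2f}")
print(f"P(difference this large under H0) ≈ {count / n_reps:.3f}")
```

  The proportion approximates the probability that the accidental assignment alone could generate so large a difference, which is exactly the error probability the randomized design licenses.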

  Bayesians may find a home for randomized assignment (and possibly, double blindness) in the course of demonstrating “utmost good faith” (Senn 2007, p. 35). In the face of suspicious second parties: “Simple randomization is a method which by virtue of its very unpredictability affords the greatest degree of blinding. Some form of randomization is indispensable for any trial in which the issue of blinding is taken seriously…” (ibid., p. 70). Nevertheless, subjective Bayesians have generally concurred with Lindley and Novick that “[O]ne can do no better than … use an allocation which You think is unlikely to have important confounding effects” (Lindley and Novick 1981, p. 52). 10 Still, Berry and Kadane (1997) maintain that despite the “standard result that Bayesians need not randomize” (p. 813) there are scenarios where, because different actors have different subjective goals and beliefs, randomization is the optimal allocation. Say there are two treatments, 1 and 2, where each either shows the response of interest or does not. Dan, who will decide whether the allocation should be randomized or not, is keen for the result to give a good estimate of the response rates over the whole population, whereas Phyllis, a doctor, believes one type of patient, say healthy ones, does better with treatment 1 than 2, and her goal is giving patients the best treatment (ibid., p. 818). “[I]f Dan has a positive probability that Phyllis, or whoever is allocating, is placing the patients on the two treatments unequally, then randomization is the preferred allocation scheme (optimal)” (ibid.). Presumably, the agent doesn’t worry that he unconsciously biases his own study aimed at learning the success rate in the population. For non-subjective Bayesians, there may be an appeal to the fact that “with randomization, the posterior is much less sensitive to the prior. And I think most practical Bayesians would consider it valuable to increase the robustness of the posterior” (Wasserman 2013). An extensive discussion may be found in Gelman et al. (2004). Still, as I understand it, it’s not the direct protection of error probabilities that drives the concern.

  Randomization and the Philosophers

  It seems surprising that the value of randomisation should still be disputed at this stage, and of course it is not disputed by anybody in the business. There is, though, a body of philosophers who do dispute it.

  (Colquhoun 2011, p. 333)

  It’s a bit mortifying to hear Colquhoun allude to “subversion by philosophers of science” (p. 321). Philosophical arguments against randomization stem largely from influential Bayesian texts (e.g., Howson and Urbach 1993): “On the Bayesian view, randomization is optional, and the essential condition is for the comparison groups in a clinical trial to be adequately matched on factors believed to have prognostic significance” (Howson and Urbach 1993, p. 378). A criticism philosophers often raise is due to the possibility of unknown confounding factors that differentiate the treated from the control groups (e.g., Worrall 2002, p. S324). As Stephen Senn explains (2013b, p. 1447), such “imbalance,” which is expected, will not impugn the statistical significance computations under randomization. The analysis is of a ratio of the between-group variability and the within-group variability. The said imbalance between groups (the numerator) would also impinge on the within-group variability (the denominator). Larger variability will at worst yield larger standard errors and wider confidence intervals, resulting in a non-statistically significant result. The relevance of the unknown factors “is bounded by outcome and if we have randomised, the variation within groups is related to the variation between in a way that can be described probabilistically by the Fisherian machinery” (ibid.). There’s an observed difference between groups, and our question is how readily the observed difference in outcome could be generated by chance alone. Senn goes further.

  It is not necessary for the groups to be balanced. In fact, the probability calculation applied to a clinical trial automatically makes an allowance for the fact that groups will almost certainly be unbalanced, and if one knew that they were balanced, then the calculation that is usually performed would not be correct. Every statistician knows that you should not analyse a matched pairs design as if it were a completely randomised design.

  (ibid., p. 1442)

  The former (the matched pairs design) randomly assigns a treatment within pairs that have been deliberately matched.
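  A toy numerical sketch (invented measurements, not from the text) of why the two analyses differ: the paired analysis works with the within-pair differences, while treating the same data as completely randomised pools all the between-subject variability into the standard error.

```python
# Invented data: 8 matched pairs, one member of each pair randomly given the treatment.
import numpy as np
from scipy import stats

treated = np.array([5.1, 6.4, 4.9, 7.2, 5.8, 6.1, 5.5, 6.9])
control = np.array([4.8, 6.0, 4.7, 6.8, 5.6, 5.7, 5.2, 6.5])

# Appropriate analysis for a matched pairs design: test the within-pair differences.
t_paired, p_paired = stats.ttest_rel(treated, control)

# Analysing it as if it were a completely randomised design ignores the pairing
# and uses the much larger between-subject variability in the standard error.
t_unpaired, p_unpaired = stats.ttest_ind(treated, control)

print(f"paired analysis:   t = {t_paired:.2f}, p = {p_paired:.4f}")
print(f"unpaired analysis: t = {t_unpaired:.2f}, p = {p_unpaired:.4f}")
```

  The error calculation, in other words, must reflect how the randomization was actually carried out.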

  In clinical trials where subjects are assigned as they present themselves, you can’t look over the groups for possibly relevant factors, but you can include them in your model. Suppose the sex of the patient is deemed relevant. According to Senn (see the sketch following the two points):

  (1) If you have sex in the model the treatment estimate is corrected for sex whether or not the design is balanced; balancing makes it more efficient.

  (2) Balancing for sex but not having it in the model does not give a valid inference. 11
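  A minimal simulation sketch of these two points, with invented data and a hypothetical true treatment effect (not the author’s example): fit the treatment estimate with and without sex in the model, under randomized assignment.

```python
# Invented simulation of Senn's point (1): with sex in the model the treatment
# estimate is adjusted for sex whether or not the design is balanced; omitting
# sex remains valid under randomization but is less efficient here.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
sex = rng.binomial(1, 0.5, n)            # 0 = female, 1 = male (hypothetical covariate)
treatment = rng.binomial(1, 0.5, n)      # randomized assignment
outcome = 2.0 * sex + 1.0 * treatment + rng.normal(0, 1, n)   # true effect = 1.0

# Model with sex included
X_with = sm.add_constant(np.column_stack([treatment, sex]))
fit_with = sm.OLS(outcome, X_with).fit()

# Model without sex
X_without = sm.add_constant(treatment)
fit_without = sm.OLS(outcome, X_without).fit()

print(f"with sex in model:    estimate = {fit_with.params[1]:.2f}, SE = {fit_with.bse[1]:.2f}")
print(f"without sex in model: estimate = {fit_without.params[1]:.2f}, SE = {fit_without.bse[1]:.2f}")
```

  Point (2) is the converse warning: if a factor was used to balance (or pair) the allocation, the analysis must reflect that, as in the matched pairs example above.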

  RCT4D

  A different type of debate is cropping up in fields that are increasingly dabbling in randomized controlled trials (RCTs) rather than relying, as is typical, on observational data and statistical modeling. The Poverty Action Lab at MIT led by Abhijit Banerjee and Esther Duflo (2011) is spearheading a major movement in development economics to employ RCTs to test the benefits of various aid programs for spurring economic growth and decreasing poverty, from bed nets and school uniforms, to micro-loans. For those advocating RCTs in development economics (abbreviated RCT4D), if you want to discover if school uniforms decrease teen pregnancy in Mumbai, you take k comparable schools and randomly assign uniforms to some and not to others; at the end of the study, differences in average results are observed. It is hoped thereby to repel criticisms from those who question whether there are scientific foundations guiding aid-driven development.

  Philosopher of science Nancy Cartwright allows that RCTs, if done very well, can show a program worked in a studied situation, but that’s “a long way from establishing that it will work in a particular target” (2012, p. 299). A major concern is that what works in Kenya needn’t work in Mumbai. In general, the results of RCTs apply to the experimental population and, unless that’s a random sample of a given target population, the issue of extrapolating (or external validity) arises. That is true. Merely volunteering to be part of a trial may well be what distinguishes subjects from others.

  The conflicting sides here are largely between those who advocate experimental testing of policies and those who think we can do better with econometric modeling coupled with theory. Opponents, such as Angus Deaton, think the attention should be “refocused toward the investigation of potentially generalizable mechanisms that explain why and in what contexts projects can be expected to work” (Deaton 2010, p. 426) via modeling and trial and error. But why not all of the above? Clearly RCTs limit what can be studied, so they can’t be the only method. The “hierarchies of evidence,” we agree, should include explicit recognition of flaws and fallacies of extrapolation.

  Giving the mother nutritional information improves child nourishment in city X, but not in city Y where the father does the shopping and the mother-in-law decides who eats what. Small classrooms improve learning in one case, but not if applying it means shutting down spaces for libraries and study facilities. I don’t see how either the modelers or the randomistas (as they are sometimes called) can avoid needing to be attuned to such foibles. Shouldn’t the kind of field trials described in Banerjee and Duflo (2011) reveal clues as to why what works in X won’t work in Y? Perhaps one of the most valuable pieces of information emerges from talking with and interacting amongst the people involved.

  Cartwright and Hardie worry that RCTs, being rule oriented, reduce or eliminate the use of necessary discretion and judgment:

  If a rule such as ‘follow the RCTs, and do so faithfully’ were a good way of deciding about effectiveness, then certainly deliberation is second best (or worse) … the orthodoxy, which is a rules system, discourages decision makers from thinking about their problems, because the aim of rules is to reduce or eliminate the use of discretion and judgment, … The aim of reducing discretion comes from a lack of trust in the ability of operatives to exercise discretion well … Thus, the orthodoxy not only discourages deliberation, as unnecessary since the rules are superior, but selects in favor of operatives who cannot deliberate.

  (Cartwright and Hardie 2012, pp. 158–9)

  Do rules to prevent QRPs, conscious and unconscious, reflect a lack of trust? Absolutely: even those trying hard to get it right aren’t immune to the tunnel vision of their tribe.

  The truth is, performing and interpreting RCTs involve enormous amounts of discretion. One of the big problems with experiments in many fields is the way statistical-scientific gaps are filled in interpreting results. In development economics, negative results are hard to hide, but there’s still plenty of latitude for post-data explanations. RCTs don’t protect you from post-data hunting and snooping. One RCT gave evidence that free school uniforms decreased teenage pregnancy, by encouraging students to remain in school. Here, teen pregnancies serve as a proxy for contracting HIV/AIDS. However, combining uniforms with a curriculum on HIV/AIDS gave negative results.

  In schools that had both the HIV/AIDS and the uniforms programs, girls were no less likely to become pregnant than those in the schools that had nothing. The HIV/AIDS education curriculum, instead of reducing sexual activity … , actually undid the positive effect of the [free uniforms].

  (Banerjee and Duflo 2011, p. 115)

  Several different theories are offered. Perhaps the AIDS curriculum, which stresses abstinence before marriage, encouraged marriages and thus pregnancies. Post-data explanations for insignificant results are as seductive here as elsewhere and aren’t prevented by an RCT without prespecified outcomes. If we are to evaluate aid programs, an accumulation of data on bungled implementation might be the best way to assess what works and why. Rather than scrap the trials, a call for explicit attention to how a program could fail in the new environment is needed. Researchers should also be open to finding that none of the existing models captures people’s motivations. Sometimes those receiving aid might just resist being “helped,” or being nudged to do what’s “rational.”

  Randomization is no longer at the top of the evidence hierarchy. It’s been supplanted by systematic reviews or meta-analyses of all relevant RCTs. Here too, however, there are problems of selecting which studies to include and their differing quality. Still, the need for meta-analysis has promoted “all trials,” rather than hiding any in file-drawers, and, with the emphasis by the Cochrane Collaboration, such reviews are clearly not going away. Meta-analytic reviews have received some black eyes, but meta-analysis is one of the central ways of combining results in frequentist statistics. 12 Our itinerary got too full to visit this important topic; you may catch it on a return tour.
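  As a minimal sketch of the frequentist combination being referred to (with invented effect estimates, not data from any of the trials discussed), a fixed-effect, inverse-variance weighted meta-analysis pools study estimates according to their precisions:

```python
# Fixed-effect (inverse-variance) meta-analysis sketch; the estimates are invented.
import numpy as np

# Effect estimates (e.g., mean differences) and standard errors from k hypothetical RCTs.
effects = np.array([0.30, 0.10, 0.45, 0.22])
ses = np.array([0.15, 0.20, 0.25, 0.10])

weights = 1.0 / ses**2                                  # precision weights
pooled = np.sum(weights * effects) / np.sum(weights)    # weighted average effect
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled effect = {pooled:.3f}, 95% CI = "
      f"({pooled - 1.96 * pooled_se:.3f}, {pooled + 1.96 * pooled_se:.3f})")
```

  The selection problems mentioned above enter before this arithmetic ever starts: which trials make it into the pool matters more than the weighting.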

  Batch-Effect Driven Spurious Associations

  There is a relatively unknown problem with microarray experiments, in addition to the multiple testing problems: [microarray] samples should be randomized over important sources of variation; otherwise p-values may be flawed. Until relatively recently, the microarray samples were not sent through assay equipment in random order. … Essentially all the microarray data pre-2010 is unreliable.

  (Young 2013)

  The latest Big Data technologies are not immune to basic experimental design principles. We hear that a decade or more has been lost by failing to randomize microarrays. “Stop Ignoring Experimental Design (or my head will explode)” declares genomics researcher Christophe Lambert (2010). The result is spurious associations due to confounding “to the point, in fact, where real associations cannot be distinguished from experimental artifacts” (ibid., p. 1). Microarray analysis involves a great many steps, plating and processing, and washing and more; minute differences in entirely non-biological variables can easily swamp the difference of interest. Suppose a microarray study, looking for genes differentially expressed between diseased and healthy tissue (cases and controls), processes the cases in one batch, say at a given lab on Monday, and the controls on Tuesday. The reported statistically significant differences may be swamped by artifacts – the tiny differences due to different technicians, different reagents, even ozone levels. A “batch” would be a set of microarrays processed in a relatively homogeneous way, say at a single lab on Monday. Batch effects are defined to be systematic non-biological variations between groups of samples (or batches) due to such experimental artifacts. A paper on genetic associations and longevity (Sebastiani et al. 2010) didn’t live very long because it turned out the samples from centenarians were differently collected and processed than the control group of average people. The statistically significant difference disappeared when the samples were run in the same batch, showing the observed association to be an experimental artifact.
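  A fabricated toy simulation (my own illustration, not Lambert’s) shows how a purely non-biological batch shift can masquerade as a case/control difference when status and batch are confounded, and how randomizing status over batches removes the artifact:

```python
# Fabricated illustration of a batch effect confounded with case/control status.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 50                                   # samples per group
batch_shift = 0.8                        # non-biological difference between Mon/Tue runs
# There is no true biological case/control difference in either scenario.

# Confounded design: all cases processed Monday (batch 1), all controls Tuesday (batch 2).
cases_confounded = rng.normal(batch_shift, 1, n)
controls_confounded = rng.normal(0, 1, n)
print("confounded p-value:", stats.ttest_ind(cases_confounded, controls_confounded).pvalue)

# Randomized design: case/control status assigned to processing slots at random,
# so the batch shift hits both groups roughly equally on average.
status = rng.permutation(np.repeat([1, 0], n))          # 1 = case, 0 = control
batch = np.tile([0, 1], n)                               # alternating batches
y = batch * batch_shift + rng.normal(0, 1, 2 * n)        # outcome: batch effect only
print("randomized p-value:", stats.ttest_ind(y[status == 1], y[status == 0]).pvalue)
```

  With the confounded design the non-biological shift yields a tiny p-value despite there being no true effect; with randomized allocation the p-value is typically unremarkable.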

  By randomly assigning the order of cases and controls, the spurious associations vanish! Ideally they also randomize over data collection techniques and balance experimental units across batches, but “the case/control status is the most important variable to randomize” (Lambert 2010, p. 4). Then, corrections due to site and data collection can be made later. But the reverse isn’t true. As Fisher said, “To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: … to say what the experiment died of” (Fisher 1938, p. 17). Nevertheless, in an attempt to fix the batch-effect driven spurious associations,

  [A] whole class of post-experiment statistical methods has emerged … These methods … represent a palliative, not a cure. … the GWAS research community has too often accommodated bad experimental design with automated post-experiment cleanup. … experimental designs for large-scale hypothesis testing have produced so many outliers that the field has made it standard practice to automate discarding outlying data.

  (Lambert and Black 2012, pp. 196–7)

  By contrast, with proper design they find that post-experiment automatic filters are unneeded. In other words, the introduction of randomized design frees them to deliberate over the handful of extreme values to see if they are real or artifacts. (This contrasts with the worry raised by Cartwright and Hardie (2012).)

  Souvenir T: Even Big Data Calls for Theory and Falsification

  Historically, epidemiology has focused on minimizing Type II error (missing a relationship in the data), often ignoring multiple testing considerations, while traditional statistical study has focused on minimizing Type I error (incorrectly attributing a relationship in data better explained by random chance). When traditional epidemiology met the field of GWAS, a flurry of papers reported findings which eventually became viewed as nonreplicable.

  (Lambert and Black 2012, p. 199)

  This is from Christophe Lambert and Laura Black’s important paper “Learning from our GWAS Mistakes: From Experimental Design to Scientific Method”; it directly connects genome-wide association studies (GWAS) to philosophical themes from Meehl, Popper, and falsification. In an attempt to staunch the non-replication, they explain, adjusted genome-wide thresholds of significance were required as well as replication in an independent sample (Section 4.6).

  However, the intended goal is often thwarted by how this is carried out. “[R]esearchers commonly take forward, say, 20–40 nominally significant signals” that did not meet the stricter significance levels, “then run association tests for those signals in a second study, concluding that all the signals with a p-value ≤ .05 have replicated (no Bonferroni adjustment). Frequently 1 or 2 associations replicate – which is also the number expected by random chance” (ibid.). Next these “replicated” cases are combined with the original data “to compute p-values considered genome-wide significant. This method has been propagated in publications, leading us to wonder if standard practice could become to publish random signals and tell a plausible biological story about the findings” (ibid.).
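  The “expected by random chance” arithmetic is worth making explicit (a back-of-envelope sketch, assuming the carried-forward signals are all null and the tests are independent):

```python
# If roughly 30 null signals (in the quoted 20-40 range) are each tested at
# p <= 0.05 in the second study, 1-2 chance "replications" are expected.
n_signals = 30
alpha = 0.05

expected = n_signals * alpha                          # expected number of chance hits
prob_at_least_one = 1 - (1 - alpha) ** n_signals      # chance of one or more hits

print(f"expected chance replications: {expected:.1f}")
print(f"P(at least one chance replication): {prob_at_least_one:.2f}")
```

  Seeing one or two such “replications” is thus almost guaranteed even if every signal is spurious, which is just Lambert and Black’s point.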

  Instead of being satisfied with a post-data biological story to explain correlations, “[i]f journals were to insist that association studies also suggest possible experiments that could falsify a putative theory of causation based on association, the quality and durability of association studies could increase” (ibid., p. 201). At the very least, the severe tester argues, we should strive to falsify methods of inquiry and analysis. This might at least scotch the tendency Lambert and Black observe, for others to propagate a flawed methodology once seen in respected journals: “[W]ithout a clear falsifiable stance – one that has implications for the theory – associations do not necessarily contribute deeply to science” (ibid., p. 199).

 
