Statistical Inference as Severe Testing
mixture of, 171
standard normal, 143 , 241 , 326 , 348 , 357 , 378
one-sided test of mean of, 142 – 145 , 323 , 348 , 357
two-sided tests of mean of, 42 , 248 , 257 , 430
normative epistemology, 54 , 422
normative statistical requirements, 437 , 441
Nosek, B., 106
novelty requirement, 90 – 92 and severity, 92
and eclipse tests, 119n1
temporal, 90 – 91
theoretical, 90 – 91 , 96 , 119n1
use, 90 – 91
Novick, M., 288 – 289
nuisance parameters, 385 , 392 , 411 , 433 replaced by sufficient statistic, 385
null hypothesis significance testing (NHST), 94 illicit animal, 179 , 438
O’Hagan, T., 202 , 213 – 214 , 412
O’Neil, C., 229
objectivity, 4 three requirements, 223
and Bayesian priors, 232 – 234
as embracing subjectivity, 232
in epistemology, 235 – 236
and equipoise priors, 231
idols of, 224
and journalism, 232
and model checking, 298
and observation, 225
repertoire of mistakes, 235
and sampling distribution, 231
in statistics, 221
and transparency, 236 – 237
and triangulation, 237 , 423
washout theorems, 231 – 232 ; see also default/non-subjective priors
Oishi, S., 101
optional stopping, see stopping rules
outcome-switching, 40 , 439
Overbye, D., 210 – 211
P-curves, 285
P-value distribution, 151 , 325 as measuring sensitivity (D. Cox), 151
P-values, 4 actual (computed) vs. reported (nominal), 17 , 43 , 179 , 274 , 303
invalidated by selection effects, 17 , 285
and N-P tests, 138 , 175 ; see also N-P tests
Bayesian P-values, 305 , 433
can’t be trusted except when used to show can’t be trusted, 284 – 285
and error probabilities, 173 – 176 , 440
exaggerate evidence, 246 – 253 , 260 – 264 , 332 , 411 , 440 – 441
police, 204
precise vs imprecise, 333 – 334 , 366 ; see also significance tests
paradox of replication, 270 – 271 , 441
Parameterized Post Newtonian (PPN) framework, 160 – 161
Pascal’s wager, 381
Pearson, E., 8 , 37 , 47 , 50 , 55 , 59 , 64 – 65 , 83 , 86 , 88 , 93 , 95 , 121 , 132 – 133 , 135 – 137 , 144 , 146 – 147 , 151 , 164 – 165 , 172 – 174 , 182 , 189 , 239 , 269 , 285 , 341 , 371 , 379 – 381 , 384 – 388 , 392 , 403 – 404 , 421 1933 paper with Neyman, 371 – 378
answering Fisher criticism, 140 , 177 , 388
armour-piercing, 180 – 181
on Bayesian priors, 226 , 404
and inferential construal, 180 – 181 , 380 – 381 , 391 – 392
in love with woman his cousin was to marry, 137
on power (post data), 324
rejects behaviorism, 127 , 180 , 381 – 382 , 391
and tail areas, 169
and three steps in test construction, 131 , 178 , 386 ; see also Neyman and Pearson
Pearson, K., 120 – 121 , 131 – 132 , 140 , 146 , 189 , 386 , 404
Peirce, C. S., 18 , 86 on faulty analogy of induction and deduction, 64 – 66 , 380
on inverse chance, 408 – 409
on justifying induction, 113 – 114 , 267 , 307
randomization and predesignation, 89 , 267 , 288n9
and testing assumptions, 307
Perez, B., 6 , 18
performance construal of N-P, 174 – 178 vs. severity, 139 – 140
and Fisher’s fiducial probability, 382 , 390 ; see also N-P tests
pest control, 299 – 300
phenomenon vs. data, 121
philosophy in statistical methodology, xii , 4 – 5 , 8 – 14 , 28 , 49 , 73 , 114 , 432 and cheating, 46 , 270 , 332
in identifying good science, 77
severe testing philosophy, 23 , 195 , 437 , 444
Bayesian philosophy, 24 , 26 , 396
Pickrite method, 19 , 30 , 51 , 276
piecemeal testing, 162 , 308 , 380 , 400 , 443 division of labor, 85 , 392 , 423
Pigliucci, M., 78
Playfair, L., 17 – 18
Pleiades, 373
Poole, C., 26 , 256 – 257 , 264 , 406
Popper, K., 8 – 9 , 27 , 40 , 59 , 66 – 68 , 72 – 73 , 75 – 80 , 82 – 93 , 95 – 96 , 114 , 119 , 125 – 126 , 159 , 195 , 209 , 227 , 229 , 237 , 259 , 294 , 390 , 433 ; see also falsification
positive predictive value (PPV), see diagnostic screening
posterior predictive distribution, 433 – 434 Duhem’s Problem in, 435 ; see also M-S tests
Potti, A., 6 , 13 , 18 , 97 , 230
power, 135 – 139 3 roles accorded by N-P, 324
and the clinically relevant/irrelevant difference, 326 – 327
Cohen’s snafu, 324
detailed discussion, 323 – 341
fallacious transposition in, 331
and Fisher, 325
how to increase, 325
incomplete concept, 353
low power and violated assumptions, 361
predata and postdata, 323 – 325
retrospective (post hoc), 353 – 359 , 359 – 361
and severity, 323 – 332
and Type II error probability, 138 ; see also power attained
power analysis, 323 and CIs, 356 – 358
fallacy of, 353 – 356
Jacob Cohen on, 324 , 338
and Neyman, 339
ordinary, 340
ordinary vs. shpower, 355
vs. severity, 338 , 343 , 350
and significance test reasoning, 339
power attained (att power), and attained sensitivity Π(γ), 151 , 164 , 196 , 324 , 342 – 343 , 355n2 , 358
Power Peninsula, 323 , 353 – 354 , 382
Pratt, J., 44 , 175 , 240 , 248 , 252 , 254 , 339
precautionary principle, 341
prespecification/predesignation, 40 , 106 , 269 , 373 – 377 , 438 and error probabilities, 286 , 320 ; see also novelty requirement
principle of indifference, 386 , 391 , 400
prions, 81 – 82 , 85 , 88 , 109 – 110 , 238 prion protein (PrP), 109
protein folding (pN), 109
protein misfolding (pD), 110
probabilistic instantiation fallacy, 367
probabilistic reduction (Spanos), 312 dememorized data, 316
detrended data, 307 , 316
lags, 314 , 316
menu of assumptions, 312
reparameterizations, 320n5
respecification, 313 ; see also M-S tests
probability, roles of probabilism, performance, and probativeness, 13 – 14 , 24 – 27 , 33 , 77 , 436 – 437
avoiding the need for different, 54 – 55 , 429
events vs. hypotheses, 407
formal vs. informal meanings, 10 , 194 , 214 , 427
performance vs. severity, 15 , 26 – 27 , 50 , 54 , 162
probabilism and performance, 428
probativism vs. probabilism, 127 , 226
variability vs. belief, xi , 54 , 80 , 428 ; see also methodological probability
probable errors, 124
probare, 10 , 226 , 423
protein misfolding cyclical amplification (PMCA), 110
Prusiner, S., 81 – 82 , 109 – 110 , 238 , 369
pseudoscience, see Demarcation Problem
questionable research practices (QRPs), 20 , 78 , 98 , 267 , 271 , 292 , 439
quicksand, 183 , 187 – 188 , 367 , 402
radical skepticism, 229
Raftery, A., 305
Raiffa, H., 44
randomization, 286 – 289 possible Bayesian home for, 288 – 289
and cloud seeding, 126
and deliberation, 292 , 294
in GWAS, 293
and C. S. Peirce, 18 , 267 , 288n9
and the philosophers, 289 – 290
Poverty Action Lab (MIT), 290 – 291
randomized controlled trials (RCTs), 98
RCT4D, 290 – 292
rational reconstruction, 8 , 73 , 85 , 162
Ratliff, K., 101
real random experiments, 111 , 298 – 299
realism vs. antirealism, 79 severe tester agnostic on, 297
theoretical mistakes, 297
Reich, E., 210
Reid, C., 120 – 121 , 137 , 139 , 141 , 146 , 189 – 190 , 372 , 387 – 388 , 404
Reid, N., 54 , 186 , 392 , 396 , 429
rejection ratio, 337 – 338
repertoire of errors, 89 , 234 , 308 , 400 , 414 , 442 in selection effects, 279
replicability/reproducibility, 6 , 20 , 28 ASA definition, 97
and diagnostic testing, 368
equivocation in, 246
and predesignation, 270 , 320
and Popper, 82 – 83
replication (crisis), 59 , 89 , 156 , 221 , 361 in GWAS, 295
in psychology, 78 , 97 – 107 ; see also paradox of replication
residuals, 298 , 303 , 310 – 311 , 317 small residuals vs. adequacy, 318
rigged hypothesis, 108
Robbins, H., 390 , 404
Robert, C., 401 , 402 , 406 , 407 , 413 , 428
Romano, J., 172 , 175 , 191
Rosenkrantz, R., 40 , 69 , 269 , 319 – 320 , 419n9
Rosenthal, R., 239
Rothman, K., 264 , 276 , 272n3
Royall, R., 33 – 39 , 41 , 44 , 50 , 52 , 68 , 70n5 , 82 , 212 , 225 , 243 , 283 , 319 , 332 , 421
rubbing-off construal, 65 , 194 , 244 , 391 , 429
Rubin, D., 47 , 433
Salmon, W., 64 , 95 , 114 , 310
Samanta, T., 51 , 124 , 305n2 , 402 – 403 , 405 , 431 , 434
sampling distribution, 32 , 142 and error probabilities, 130 , 173 , 428 , 438
and bootstrapping, 306
and frequentist objectivity, 231
relevant, 199
as testable meeting ground, 178
sampling plan, freedom from, see Likelihood Principle; stopping rules
sampling theory/philosophy, 55 , 172
Sanna, L., 284
Savage Forum 1959, 46 – 48 , 420 – 421 , 430
Savage, L., 8 , 41 – 43 , 44 – 50 , 173 , 214 , 228 , 230n3 , 248 , 252 , 256 , 260 , 269 , 287 , 302 , 397 , 401 , 417 , 420 , 424 , 430 , 432
Schachtman, N., 272n3
schizophrenia and split personality, 409 , 414 , 424 , 436 , 434
Schlaifer, R., 44
Schnall, S., 100
Schweder, T., 195 , 392n10
SCOTUS, 272n3
Sebastiani, P., 293
Seidenfeld, T., 411n6
selection effects, biasing, 3 , 19 , 21n2 , 40 – 41 , 78 defn., 92 , 285 , 437 ; see also stopping rules
adjusting for, 154 , 268 , 275 – 277 , 364 – 365 , 418
and auditing, 234 , 267 , 269
NHST, 95
preregistration, 106 , 266 , 275 , 286 ; see also Bonferroni correction
self-correcting, 20 , 162 induction as, 114 , 307 ; see also arguments from error and coincidence
self-sealing fallacy, 103
Sellke, T., 175 – 176 , 184 , 248 – 252 , 258 – 260 , 338
Selvin, H., 274 , 279
semantic entailment: severity version, 65
Senn, S., 151 , 162n11 , 247 , 251 – 253 , 259n8 , 264 – 265 , 266 , 287 – 288 , 290 , 293n12 , 326 – 327 , 336 , 345 – 346 , 365n9 , 366 , 413 – 414 , 417 – 419
sensitivity, achieved or attained, 151 function, Π(γ), 151 – 152
and severity, 152 ; see also power attained
sequential trials, 47 ; see also stopping rules
severe tester, tribal features, 9 , 27 , 114 , 437 on comparativism, 79 , 421 , 441
on the demarcation of science, 88 – 89
and Duhem’s problem, 85 – 86
on improving confidence intervals, 194 , 244 – 245 , 358 , 429 , 442
interpretation of probable flukes, 217
on large-scale theories, 129
on Likelihoodism and the LP, 39 – 41 , 48 – 50 , 72
new name, 55
vs. the N-P behavioristic prison, 140
vs. Popperian severity, 83
on the revolution in psychology, 100 , 103 – 104 , 107 , 370
solving the problem of induction, 107 – 114
on statistical falsification, 235
on statistical objectivity, 320
and translation guide, 52
severity, 5 applied, water plant accident, 143 – 145
attained power, 342 – 343
and confidence levels, 193
and difference between two means, 345 – 346
disobeys the probability axioms, 423
and explanatory content/informativeness, 79 – 80 , 237
and Fisherian tribes, 146
function (SEV), 143
and large-scale theories, 128 , 162
in meta-methodology, 9 , 32
and Popperian corroboration, 72 , 75 , 87
vs. power analysis, 343
and replicability, 370
and sensitivity, 152
when not calculable, 200
severity curves, 348 – 349 , 360n4
severity interpretation of negative results (SIN), 143 – 145 , 152 , 212 , 343 , 346 – 347 , 351 ; see also severity
severity interpretation of rejection (SIR), 143 , 265 – 266 , 351
severity requirement/principle, 5 , 92 , 125 , 258 and biasing selection effects, 92
and error control, 269
and failed replication, 158 , 266
to block fallacies of rejection, 144 , 357
as heuristic tool, 12 , 264
informal, 109
from low P-value, 209
as minimal principle of evidence, 5 , 396
for non-significance (Higgs), 212
in terms of solving a problem, 300
vs. fit measures, 72
weak and strong, 22 , 108
strong, 14
Sewell, W., 280
sexy science: severe testing in large-scale theories, 121 , 163 , 300
Shaffer, J., 275
Shalizi, C., 27 , 432 , 434
Sharpe, G., 137
shpower (retrospective power analysis), 354 – 356 howlers of, 355 – 356
vs. severity, 356 ; see also power analysis
significance levels as predesignated, 137
attained vs. predesignated, 173 – 175 , 177 ; see also P-values
significance tests vs. comparativism, 35
Cox definition of, 93
criticisms of, 93 – 95 , 438 ; see also chestnuts and howlers of tests
fallacies of rejection/non-rejection
falsifying alternatives in, 159
Fisherian (pure/simple), 132 , 150
in Higgs, 202
roles for model testing and discovery, 298 – 304 ; see also M-S tests
simple or point hypotheses, 33
test T+, 144 ; see also Neyman and Pearson (N-P) Tests
Silberstein, L., 127
Silver, N., 232 – 233
similar tests, 385 , 386n6
Simmons, J., 43 , 237 , 270
Simonsohn, U., 43 , 237 , 270 , 284 , 285
Singh, K., 391
skin off your nose, 273
Skyrms, B., 62 , 73
Slovic, P., 422
Smeesters, D., 284
Smith, C., 47
Smith, H., 339n4
Sober, E., 35 – 36 , 47 , 92n3 , 242 , 317 – 318 , 380 – 381
Spanos, A., 120 , 133 – 134 , 139 , 146n5 , 200 , 254 – 255 , 305 , 308 , 312 – 313 , 317 – 319 , 331 , 352n10 , 355 , 367 , 387 , 426
Spiegelhalter, D., 204 – 205 , 401 , 404
spike and smear priors, 239 , 248 , 250 – 251 , 259 , 336 Bayesian justification for, 251
coffee shop, 257
criticisms of, 252n4 , 256 , 259 , 406 , 440
cult of the holy, 252
severe tester on, 257 – 258
spongiform diseases, 81 , see also kuru
Sprenger, J., 307
Sprott, D., 180 , 399 , 421
spurious associations, 3 batch effects, 293
in longevity study, 293 , 362
population and number of shoes, 308 , 317
sea level and price of bread, 317
Staley, K., 203 , 235 – 236
standard model (SM) physics, 203 , 206 , 214 , 215
Stapel, D., 78 , 97 , 100 , 276
statistical battles, current state of play, 11 – 12 , 23 – 28 , 395 – 397 , 400 – 402 , 444 proxy, 437 ; see also getting beyond statistics wars
statistical fluctuations/flukes and Higgs, 202 – 205 , 210 – 212 interpreting, 214 – 215
statistical inference, 7 , 20 , 42 , 65 – 66 , 174 and scientific theory appraisal, 119 , 202 ; see also inductive inference
Statistical Methods for Research Workers (SMRW), 387 Fisher backtracks, 182
Statistical Power Analysis for the Behavioral Sciences (Cohen), 324
statistical tests, elements of, 129 – 130 test hypothesis, 31 , 109 , 130 , 133 , 341
test rule, 130
test statistic, 34 , 94 , 129 , 132 , 167
test statistic, pivotal, 378
statistical tests, properties of consistent, 136
monotonicity, 134
powerful, 135 – 136
unbiased, 136 , 141
Steegen, S., 105
Stern, H., 433
Stevens, S., 100
Stigler, S., 288n9
Stone, M., 254 , 413
stopping rules/optional stopping, 42 – 53 , 170 , 270 and Bayesian intervals, 430
principle, 43 , 54 , 431
proper, 42
statisticians on: Armitage, 47
G. Barnard, 48
J. Berger/Wolpert, 49 , 187 , 430 – 431
G. Box, 303
D. Cox and Hinkley, 45
Savage (E, L, & S), 43 , 46 ; see also intentions