Statistical Inference as Severe Testing


by Deborah G Mayo

mixture of, 171

standard normal, 143 , 241 , 326 , 348 , 357 , 378

one-sided test of mean of, 142 – 145 , 323 , 348 , 357

two-sided tests of mean of, 42 , 248 , 257 , 430

normative epistemology, 54 , 422

normative statistical requirements, 437 , 441

Nosek, B., 106

novelty requirement, 90 – 92 and severity, 92

and eclipse tests, 119n1

temporal, 90 – 91

theoretical, 90 – 91 , 96 , 119n1

use, 90 – 91

Novick, M., 288 – 289

nuisance parameters, 385 , 392 , 411 , 433 replaced by sufficient statistic, 385

null hypothesis significance testing (NHST), 94 illicit animal, 179 , 438

O’Hagan, T., 202 , 213 – 214 , 412

O’Neil, C., 229

objectivity, 4 three requirements, 223

and Bayesian priors, 232 – 234

as embracing subjectivity, 232

in epistemology, 235 – 236

and equipoise priors, 231

idols of, 224

and journalism, 232

and model checking, 298

and observation, 225

repertoire of mistakes, 235

and sampling distribution, 231

in statistics, 221

and transparency, 236 – 237

and triangulation, 237 , 423

washout theorems, 231 – 232 ; see also default/non-subjective priors

Oishi, S., 101

optional stopping, see stopping rules

outcome-switching, 40 , 439

Overbye, D., 210 – 211

P-curves, 285

P-value distribution, 151 , 325 as measuring sensitivity (D. Cox), 151

P-values, 4 actual (computed) vs. reported (nominal), 17 , 43 , 179 , 274 , 303

invalidated by selection effects, 17 , 285

and N-P tests, 138 , 175 ; see also N-P tests

Bayesian P-values, 305 , 433

can’t be trusted except when used to show can’t be trusted, 284 – 285

and error probabilities, 173 – 176 , 440

exaggerate evidence, 246 – 253 , 260 – 264 , 332 , 411 , 440 – 441

police, 204

precise vs imprecise, 333 – 334 , 366 ; see also significance tests

paradox of replication, 270 – 271 , 441

Parameterized Post Newtonian (PPN) framework, 160 – 161

Pascal’s wager, 381

Pearson, E., 8 , 37 , 47 , 50 , 55 , 59 , 64 – 65 , 83 , 86 , 88 , 93 , 95 , 121 , 132 – 133 , 135 – 137 , 144 , 146 – 147 , 151 , 164 – 165 , 172 – 174 , 182 , 189 , 239 , 269 , 285 , 341 , 371 , 379 – 381 , 384 – 388 , 392 , 403 – 404 , 421 1933 paper with Neyman, 371 – 378

answering Fisher criticism, 140 , 177 , 388

armour-piercing, 180 – 181

on Bayesian priors, 226 , 404

and inferential construal, 180 – 181 , 380 – 381 , 391 – 392

in love with woman his cousin was to marry, 137

on power (post data), 324

rejects behaviorism, 127 , 180 , 381 – 382 , 391

and tail areas, 169

and three steps in test construction, 131 , 178 , 386 ; see also Neyman and Pearson

Pearson, K., 120 – 121 , 131 – 132 , 140 , 146 , 189 , 386 , 404

Peirce, C. S., 18 , 86 on faulty analogy of induction and deduction, 64 – 66 , 380

on inverse chance, 408 – 409

on justifying induction, 113 – 114 , 267 , 307

randomization and predesignation, 89 , 267 , 288n9

and testing assumptions, 307

Perez, B., 6 , 18

performance construal of N-P, 174 – 178 vs. severity, 139 – 140

and Fisher’s fiducial probability, 382 , 390 ; see also N-P tests

pest control, 299 – 300

phenomenon vs. data, 121

philosophy in statistical methodology, xii , 4 – 5 , 8 – 14 , 28 , 49 , 73 , 114 , 432 and cheating, 46 , 270 , 332

in identifying good science, 77

severe testing philosophy, 23 , 195 , 437 , 444

Bayesian philosophy, 24 , 26 , 396

Pickrite method, 19 , 30 , 51 , 276

piecemeal testing, 162 , 308 , 380 , 400 , 443 division of labor, 85 , 392 , 423

Pigliucci, M., 78

Playfair, L., 17 – 18

Pleiades, 373

Poole, C., 26 , 256 – 257 , 264 , 406

Popper, K., 8 – 9 , 27 , 40 , 59 , 66 – 68 , 72 – 73 , 75 – 80 , 82 – 93 , 95 – 96 , 114 , 119 , 125 – 126 , 159 , 195 , 209 , 227 , 229 , 237 , 259 , 294 , 390 , 433 ; see also falsification

positive predictive value (PPV), see diagnostic screening

posterior predictive distribution, 433 – 434 Duhem’s Problem in, 435 ; see also M-S tests

Potti, A., 6 , 13 , 18 , 97 , 230

power, 135 – 139 3 roles accorded by N-P, 324

and the clinically relevant/irrelevant difference, 326 – 327

Cohen’s snafu, 324

detailed discussion, 323 – 341

fallacious transposition in, 331

and Fisher, 325

how to increase, 325

incomplete concept, 353

low power and violated assumptions, 361

predata and postdata, 323 – 325

retrospective (post hoc), 353 – 359 , 359 – 361

and severity, 323 – 332

and Type II error probability, 138 ; see also power attained

power analysis, 323 and CIs, 356 – 358

fallacy of, 353 – 356

Jacob Cohen on, 324 , 338

and Neyman, 339

ordinary, 340

ordinary vs. shpower, 355

vs. severity, 338 , 343 , 350

and significance test reasoning, 339

power attained (att power), and attained sensitivity Π ( γ ), 151 , 164 , 196 , 324 , 342 – 343 , 355n2 , 358

Power Peninsula, 323 , 353 – 354 , 382

Pratt, J., 44 , 175 , 240 , 248 , 252 , 254 , 339

precautionary principle, 341

prespecification/predesignation, 40 , 106 , 269 , 373 – 377 , 438 and error probabilities, 286 , 320 ; see also novelty requirement

principle of indifference, 386 , 391 , 400

prions, 81 – 82 , 85 , 88 , 109 – 110 , 238 prion protein (PrP), 109

protein folding (pN), 109

protein misfolding (pD), 110

probabilistic instantiation fallacy, 367

probabilistic reduction (Spanos), 312 dememorized data, 316

detrended data, 307 , 316

lags, 314 , 316

menu of assumptions, 312

reparameterizations, 320n5

respecification, 313 ; see also M-S tests

probability, roles of probabilism, performance, and probativeness, 13 – 14 , 24 – 27 , 33 , 77 , 436 – 437

avoiding the need for different, 54 – 55 , 429

events vs. hypotheses, 407

formal vs. informal meanings, 10 , 194 , 214 , 427

performance vs. severity, 15 , 26 – 27 , 50 , 54 , 162

probabilism and performance, 428

probativism vs. probabilism, 127 , 226

variability vs. belief, xi , 54 , 80 , 428 ; see also methodological probability

probable errors, 124

probare , 10 , 226 , 423

protein misfolding cyclical amplification (PMCA), 110

Prusiner, S., 81 – 82 , 109 – 110 , 238 , 369

pseudoscience, see Demarcation Problem

questionable research practices (QRPs), 20 , 78 , 98 , 267 , 271 , 292 , 439

quicksand, 183 , 187 – 188 , 367 , 402

radical skepticism, 229

Raftery, A., 305

Raiffa, H., 44

randomization, 286 – 289 possible Bayesian home for, 288 – 289

and cloud seeding, 126

and deliberation, 292 , 294

in GWAS, 293

and C. S. Peirce, 18 , 267 , 288n9

and the philosophers, 289 – 290

Poverty Action Lab (MIT), 290 – 291

randomized controlled trials (RCTs), 98

RCT4D, 290 – 292

rational reconstruction, 8 , 73 , 85 , 162

Ratliff, K., 101

real random experiments, 111 , 298 – 299

realism vs. antirealism, 79 severe tester agnostic on, 297

theoretical mistakes, 297

Reich, E., 210

Reid, C., 120 – 121 , 137 , 139 , 141 , 146 , 189 – 190 , 372 , 387 – 388 , 404

Reid, N., 54 , 186 , 392 , 396 , 429

rejection ratio, 337 – 338

repertoire of errors, 89 , 234 , 308 , 400 , 414 , 442 in selection effects, 279

replicability/reproducibility, 6 , 20 , 28 ASA definition, 97

and diagnostic testing, 368

equivocation in, 246

and predesignation, 270 , 320

and Popper, 82 – 83

replication (crisis), 59 , 89 , 156 , 221 , 361 in GWAS, 295

in psychology, 78 , 97 – 107 ; see also paradox of replication

residuals, 298 , 303 , 310 – 311 , 317 small residuals vs. adequacy, 318

rigged hypothesis, 108

Robbins, H., 390 , 404

Robert, C., 401 , 402 , 406 , 407 , 413 , 428

Romano, J., 172 , 175 , 191

Rosenkrantz, R., 40 , 69 , 269 , 319 – 320 , 419n9

Rosenthal, R., 239

Rothman, K., 264 , 276 , 272n3

Royall, R., 33 – 39 , 41 , 44 , 50 , 52 , 68 , 70n5 , 82 , 212 , 225 , 243 , 283 , 319 , 332 , 421

rubbing-off construal, 65 , 194 , 244 , 391 , 429

Rubin, D., 47 , 433

Salmon, W., 64 , 95 , 114 , 310

Samanta, T., 51 , 124 , 305n2 , 402 – 403 , 405 , 431 , 434

sampling distribution, 32 , 142 and error probabilities, 130 , 173 , 428 , 438

and bootstrapping, 306

and frequentist objectivity, 231

relevant, 199

as testable meeting ground, 178

sampling plan, freedom from, see Likelihood Principle; stopping rules

sampling theory/philosophy, 55 , 172

Sanna, L., 284

Savage Forum 1959, 46 – 48 , 420 – 421 , 430

Savage, L., 8 , 41 – 43 , 44 – 50 , 173 , 214 , 228 , 230n3 , 248 , 252 , 256 , 260 , 269 , 287 , 302 , 397 , 401 , 417 , 420 , 424 , 430 , 432

Schachtman, N., 272n3

schizophrenia and split personality, 409 , 414 , 424 , 436 , 434

Schlaifer, R., 44

Schnall, S., 100

Schweder, T., 195 , 392n10

SCOTUS, 272n3

Sebastiani, P., 293

Seidenfeld, T., 411n6

selection effects, biasing, 3 , 19 , 21n2 , 40 – 41 , 78 defn., 92 , 285 , 437 ; see also stopping rules

adjusting for, 154 , 268 , 275 – 277 , 364 – 365 , 418

and auditing, 234 , 267 , 269

NHST, 95

preregistration, 106 , 266 , 275 , 286 ; see also Bonferroni correction

self-correcting, 20 , 162 induction as, 114 , 307 ; see also arguments from error and coincidence

self-sealing fallacy, 103

Sellke, T., 175 – 176 , 184 , 248 – 252 , 258 – 260 , 338

Selvin, H., 274 , 279

semantic entailment: severity version, 65

Senn, S., 151 , 162n11 , 247 , 251 – 253 , 259n8 , 264 – 265 , 266 , 287 – 288 , 290 , 293n12 , 326 – 327 , 336 , 345 – 346 , 365n9 , 366 , 413 – 414 , 417 – 419

sensitivity, achieved or attained, 151 function, Π ( γ ), 151 – 152

and severity, 152 ; see also power attained

sequential trials, 47 ; see also stopping rules

severe tester, tribal features, 9 , 27 , 114 , 437 on comparativism, 79 , 421 , 441

on the demarcation of science, 88 – 89

and Duhem’s problem, 85 – 86

on improving confidence intervals, 194 , 244 – 245 , 358 , 429 , 442

interpretation of probable flukes, 217

on large-scale theories, 129

on Likelihoodism and the LP, 39 – 41 , 48 – 50 , 72

new name, 55

vs. the N-P behavioristic prison, 140

vs. Popperian severity, 83

on the revolution in psychology, 100 , 103 – 104 , 107 , 370

solving the problem of induction, 107 – 114

on statistical falsification, 235

on statistical objectivity, 320

and translation guide, 52

severity, 5 applied, water plant accident, 143 – 145

attained power, 342 – 343

and confidence levels, 193

and difference between two means, 345 – 346

disobeys the probability axioms, 423

and explanatory content/informativeness, 79 – 80 , 237

and Fisherian tribes, 146

function (SEV), 143

and large-scale theories, 128 , 162

in meta-methodology, 9 , 32

and Popperian corroboration, 72 , 75 , 87

vs. power analysis, 343

and replicability, 370

and sensitivity, 152

when not calculable, 200

severity curves, 348 – 349 , 360n4

severity interpretation of negative results (SIN), 143 – 145 , 152 , 212 , 343 , 346 – 347 , 351 ; see also severity

severity interpretation of rejection (SIR), 143 , 265 – 266 , 351

severity requirement/principle, 5 , 92 , 125 , 258 and biasing selection effects, 92

and error control, 269

and failed replication, 158 , 266

to block fallacies of rejection, 144 , 357

as heuristic tool, 12 , 264

informal, 109

from low P-value, 209

as minimal principle of evidence, 5 , 396

for non-significance (Higgs), 212

in terms of solving a problem, 300

vs. fit measures, 72

weak and strong, 22 , 108

strong, 14

Sewell, W., 280

sexy science: severe testing in large-scale theories, 121 , 163 , 300

Shaffer, J., 275

Shalizi, C., 27 , 432 , 434

Sharpe, G., 137

shpower (retrospective power analysis), 354 – 356 howlers of, 355 – 356

vs. severity, 356 ; see also power analysis.

significance levels as predesignated, 137

attained vs. predesignated, 173 – 175 , 177 ; see also P-values

significance tests vs. comparativism, 35

Cox definition of, 93

criticisms of, 93 – 95 , 438 ; see also chestnuts and howlers of tests

fallacies of rejection/non-rejection

falsifying alternatives in, 159

Fisherian (pure/simple), 132 , 150

in Higgs, 202

roles for model testing and discovery, 298 – 304 ; see also M-S tests

simple or point hypotheses, 33

test T+, 144 ; see also Neyman and Pearson (N-P) Tests

Silberstein, L., 127

Silver, N., 232 – 233

similar tests, 385 , 386n6

Simmons, J., 43 , 237 , 270

Simonsohn, U., 43 , 237 , 270 , 284 , 285

Singh, K., 391

skin off your nose, 273

Skyrms, B., 62 , 73

Slovic, P., 422

Smeesters, D., 284

Smith, C., 47

Smith, H., 339n4

Sober, E., 35 – 36 , 47 , 92n3 , 242 , 317 – 318 , 380 – 381

Spanos, A., 120 , 133 – 134 , 139 , 146n5 , 200 , 254 – 255 , 305 , 308 , 312 – 313 , 317 – 319 , 331 , 352n10 , 355 , 367 , 387 , 426

Spiegelhalter, D., 204 – 205 , 401 , 404

spike and smear priors, 239 , 248 , 250 – 251 , 259 , 336 Bayesian justification for, 251

coffee shop, 257

criticisms of, 252n4 , 256 , 259 , 406 , 440

cult of the holy, 252

severe tester on, 257 – 258

spongiform diseases, 81 ; see also kuru

Sprenger, J., 307

Sprott, D., 180 , 399 , 421

spurious associations, 3 batch effects, 293

in longevity study, 293 , 362

population and number of shoes, 308 , 317

sea level and price of bread, 317

Staley, K., 203 , 235 – 236

standard model (SM) physics, 203 , 206 , 214 , 215

Stapel, D., 78 , 97 , 100 , 276

statistical battles, current state of play, 11 – 12 , 23 – 28 , 395 – 397 , 400 – 402 , 444 proxy, 437 ; see also getting beyond statistics wars

statistical fluctuations/flukes and Higgs, 202 – 205 , 210 – 212 interpreting, 214 – 215

statistical inference, 7 , 20 , 42 , 65 – 66 , 174 and scientific theory appraisal, 119 , 202 ; see also inductive inference

Statistical Methods for Research Workers (SMRW) , 387 Fisher backtracks, 182

Statistical Power Analysis for the Behavioral Sciences (Cohen), 324

statistical tests, elements of, 129 – 130 test hypothesis, 31 , 109 , 130 , 133 , 341

test rule, 130

test statistic, 34 , 94 , 129 , 132 , 167

test statistic, pivotal, 378

statistical tests, properties of consistent, 136

monotonicity, 134

powerful, 135 – 136

unbiased, 136 , 141

Steegen, S., 105

Stern, H., 433

Stevens, S., 100

Stigler, S., 288n9

Stone, M., 254 , 413

stopping rules/optional stopping, 42 – 53 , 170 , 270 and Bayesian intervals, 430

principle, 43 , 54 , 431

proper, 42

statisticians on: Armitage, 47

G. Barnard, 48

J. Berger /Wolpert, 49 , 187 , 430 – 431

G. Box, 303

D. Cox and Hinkley, 45

Savage (E, L, & S), 43 , 46 ; see also intentions
