Statistical Inference as Severe Testing
As always, different inquiry types have their own error repertoires that need to be mastered. Let me try to cut a path through the vegetation of one of the more treacherous modeling fields: econometrics. Relying on seven figures from Mayo and Spanos (2004), I'll tell the story of a case Spanos presented to me in 2002.
Nonsense Regression
Suppose that in her attempt to find a way to understand and predict changes in the US population, an economist discovers an empirical relationship that appears to provide almost a "law-like" fit:
y_t = 167 + 2x_t + û_t,
where y_t denotes the population of the USA (in millions), and x_t denotes a secret variable whose identity Spanos would not reveal until the end of the analysis. The subscript t is time. There are 35 annual data points for the period 1955–1989 (t = 1 is 1955, t = 2 is 1956, etc.). The data can be represented as 35 pairs z_0 = {(x_t, y_t), t = 1, 2, …, 35}. The coefficients 167 and 2 come from the least squares fit, a purely mathematical operation.
This is an example of fitting a Linear Regression Model (LRM), which forms the backbone of most statistical models of interest:
M_0: y_t = β_0 + β_1 x_t + u_t,  t = 1, 2, …, n.
The term β_0 + β_1 x_t is viewed as the systematic component (and is the expected value of y_t), and u_t = y_t − β_0 − β_1 x_t is the error or non-systematic component. The error u_t is a random variable assumed to be Normal, Independent, and Identically Distributed (NIID) with mean 0 and variance σ². This is called Normal white noise. Figure 4.2 shows what NIID looks like.
Figure 4.2 A typical realization of a NIID process.
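A minimal Python sketch of such a realization (the sample size and error variance here are illustrative choices, not values from the text):

```python
import numpy as np
import matplotlib.pyplot as plt

# A realization of a Normal, Independent, Identically Distributed (NIID)
# process, of the kind Figure 4.2 depicts.
rng = np.random.default_rng(0)
n, sigma = 35, 1.0
u = rng.normal(loc=0.0, scale=sigma, size=n)   # NIID draws: mean 0, variance sigma^2

plt.plot(range(1, n + 1), u, marker="o")
plt.axhline(0.0, linestyle="--")
plt.xlabel("t")
plt.ylabel("u_t")
plt.title("A typical realization of a NIID process")
plt.show()
```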
A Primary Statistical Question: How Good a Predictor Is x_t?
The empirical equation is intended to enable us to understand how y_t varies with x_t. Testing the statistical significance of the coefficients shows them to be highly significant: the P-values are zero (0) to the third decimal, indicating a very strong relationship between the variables. The goodness-of-fit measure of how well this model "explains" the variability of y_t is R² = 0.995, an almost perfect fit (Figure 4.3). Everything looks hunky dory. Is it reliable? Only if the errors are approximately NIID with mean 0 and variance σ².
Figure 4.3 Fitted line plot, y = 167.1 + 1.907x.
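A sketch of this estimation step with statsmodels; since the actual series are not reproduced in the text, the data below are simulated stand-ins given trending means so that the fit mimics the "law-like" regression:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-ins for the two series (the real population and
# secret-variable data are not given in the text).
rng = np.random.default_rng(1)
t = np.arange(1, 36)
x = 10 + 0.9 * t + rng.normal(0, 0.5, 35)      # hypothetical x_t
y = 167 + 2 * x + rng.normal(0, 1.0, 35)       # hypothetical y_t

X = sm.add_constant(x)                         # design matrix [1, x_t]
fit = sm.OLS(y, X).fit()                       # least squares estimation

print(fit.params)    # estimates of (beta_0, beta_1)
print(fit.pvalues)   # coefficient t-test P-values
print(fit.rsquared)  # the R^2 goodness-of-fit measure
```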
The null hypotheses in M-S tests take the form
H_0: the assumption(s) of statistical model M hold for data z,
as against not-H_0, which, strictly speaking, would include all of the ways one or more of its assumptions can fail. To rein in the testing, we consider specific departures with appropriate choices of test statistic d(y).
Residuals Are the Key
Testing Randomness.
The non-parametric runs test for IID is the test I showed Wes Salmon in relation to the Bernoulli model and justifying induction (Section 2.7). It falls under "omnibus" tests in Cox's taxonomy (Section 3.3). It can apply here by re-asking our question regarding the LRM. Look at the graph of the residuals û_t = y_t − β̂_0 − β̂_1 x_t (Figure 4.4), where the "hats" indicate the fitted values for the coefficients.
Figure 4.4 Plot of residuals over time.
If the residuals do not fluctuate like pure noise, it's a sign the sample is not IID. Instead of the value of each residual, record whether the difference between successive observations is positive (+) or negative (−).
Each sequence of pluses only, or minuses only, is a run. We can calculate the probability of the number of runs just from the hypothesis that the assumption of randomness holds. It serves only as an argumentative assumption for the check. The expected number of runs, under randomness, is (2n − 1)/3, or in our case of n = 35 values, 23. Running out of letters, I'll use R again for the number of runs. The distribution of the test statistic (R − E(R))/SE, where SE = √((16n − 29)/90), can be approximated by N(0, 1) under IID for n ≥ 20. We're actually testing H_0: E(R) = (2n − 1)/3 vs. H_1: E(R) ≠ (2n − 1)/3, where E is the expected value. We reject H_0 iff the observed R differs sufficiently (in either direction) from E(R) = 23.
Our data yield 18 runs, around 2.4 SE units, giving a P-value of approximately 0.02. So 98% of the time we'd expect an R closer to 23 if IID holds. Arguing from severity, the data indicate non-randomness. But rejecting the null only indicates a denial of IID: either independence or identical distribution is a problem. The test itself does not indicate whether the fault lies with one or the other or both. More specific M-S testing enables addressing this Duhemian problem.
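A minimal implementation of this runs test, using exactly the E(R) and SE formulas above (a sketch; ties between successive residuals are ignored):

```python
import numpy as np
from scipy.stats import norm

def runs_test(residuals):
    """Runs test for IID based on signs of successive differences.

    Uses E(R) = (2n - 1)/3 and SE = sqrt((16n - 29)/90), as in the text;
    (R - E(R))/SE is approximately N(0, 1) under IID for n >= 20.
    """
    n = len(residuals)
    signs = np.sign(np.diff(residuals))            # + / - for up / down moves
    R = 1 + int(np.sum(signs[1:] != signs[:-1]))   # runs = 1 + number of sign changes
    expected = (2 * n - 1) / 3
    se = np.sqrt((16 * n - 29) / 90)
    z = (R - expected) / se
    p_value = 2 * norm.sf(abs(z))                  # two-sided P-value
    return R, z, p_value
```

Applied to the residuals of the fitted LRM, this returns the observed number of runs together with its standardized distance from the expected 23.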
The Error in Fixing Error.
A widely used parametric test for independence is the Durbin–Watson (D-W) test. Here, all the assumptions of the LRM are retained, except the one under test, independence, which is "relaxed." In particular, the original error term in M_0 is extended to allow for the possibility that the errors u_t are correlated with their own past, that is, autocorrelated:
u_t = ρu_{t−1} + ε_t,  t = 1, 2, …, n.
This is to propose a new overarching model, the Autocorrelation-Corrected LRM:
Proposed M_1: y_t = β_0 + β_1 x_t + u_t,  u_t = ρu_{t−1} + ε_t,  t = 1, 2, …, n.
(Now it's ε_t that is assumed to be a Normal, white noise process.) The D-W test assesses whether or not ρ = 0, assuming we are within model M_1. One way to bring this out is to view the D-W test as actually considering the conjunctions:
H_0: {M_1 & ρ = 0} vs. H_1: {M_1 & ρ ≠ 0}.
With the data in our example, the D-W test statistic rejects the null hypothesis (at level 0.02), which is standardly taken as grounds to adopt H_1. This is a mistake. This move, to infer H_1, is warranted only if we are within M_1. True, if ρ = 0, we are back to the LRM, but ρ ≠ 0 does not entail the particular violation of independence asserted in H_1. Notice we are in one of the "non-exhaustive" pigeonholes ("nested") of Cox's taxonomy. Because the assumptions of model M_1 have been retained in H_1, this check had no chance to uncover the various other forms of dependence that could have been responsible for ρ ≠ 0. Thus any inference to H_1 lacks severity. The resulting model will appear to have corrected for autocorrelation even when it is statistically inadequate. If used for the "primary" statistical inferences, the actual error probabilities are much higher than the ones it is thought to license, and such inferences are unreliable at predicting values beyond the data used. "This is the kind of cure that kills the patient," Spanos warns. What should we do instead?
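For concreteness, a sketch of computing the D-W statistic with statsmodels, continuing with the simulated x and y from the earlier sketch; note that the check presupposes the rest of M_1:

```python
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# The D-W statistic is computed from the LRM residuals. It is approximately
# 2(1 - rho_hat), so values near 2 are consistent with rho = 0. All the other
# M_1 assumptions are retained while rho alone is probed.
resid = sm.OLS(y, sm.add_constant(x)).fit().resid
print(durbin_watson(resid))
```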
Probabilistic Reduction: Spanos
Spanos shows that any statistical model can be specified in terms of probabilistic assumptions from three broad categories: Distribution, Dependence, and Heterogeneity. In other words, a model emerges from selecting probabilistic assumptions from a menu of three groups: a choice of distribution; of type of dependence, if any; and a type of heterogeneity, i.e., how the generating mechanism remains the same or changes over the ordering of interest, such as time, space, or individuals. The LRM reflects just one of many ways of reducing the set of all possible models that could have given rise to the data z_0 = {(x_t, y_t), t = 1, …, n}: Normal, Independent, Identically Distributed (NIID). Statistical inference need not be hamstrung by the neat and tidy cases. As a first step, we partition the set of all possible models coarsely:
                                 Distribution   Dependence    Heterogeneity
LRM                              Normal         Independent   Identically distributed
Alternative (coarse partition)   Non-normal     Dependent     Non-IID
Since we are partitioning or reducing the space of models by means of the probabilistic assumptions, Spanos calls it the Probabilistic Reduction (PR) approach (first in Spanos 1986, 1999). The PR approach to M-S testing weaves together threads from Box–Jenkins and what some dub the LSE (London School of Economics) tradition. Rather than give the assumptions by means of the error term, as is traditional, Spanos specifies them in terms of the observable random variables (x_t, y_t). This has several advantages. For one thing, it brings out hidden assumptions, notably the assumption that the parameters (β_0, β_1, σ²) do not change with t. This is called t-homogeneity or t-invariance. Second, we can't directly probe errors given by the error term, but we can indirectly test them from the data.
We ask: What would be expected if each data series were to have come from a NIID process? Compare a typical realization of a NIID process (Figure 4.2) with the two series (t, x_t) and (t, y_t), in Figures 4.5 and 4.6, called t-plots.
Figure 4.5 USA population (y) over time.
Figure 4.6 Secret variable (x) over time.
Clearly, neither data series looks like the NIID of Figure 4.2. In each case the mean is increasing with time – there's a strong upward trend. Econometricians never use a short phrase when a long one will do: they call the trending mean mean heterogeneity. The data can't be viewed as a realization of identically distributed random variables: ID is false. The very assumption of linear correlation between x and y is that x has a constant mean µ_x and y has a constant mean µ_y. If these are changing over the different samples, your estimate of the correlation makes no sense.
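A small simulation makes the point: two causally unrelated series that each have a trending mean will show a sample correlation near 1 (all numbers here are illustrative):

```python
import numpy as np

# Two unrelated series that share only an upward trend in their means.
rng = np.random.default_rng(3)
t = np.arange(100)
a = 0.5 * t + rng.normal(0, 2, 100)
b = 1.3 * t + rng.normal(0, 5, 100)

print(np.corrcoef(a, b)[0, 1])   # close to 1, despite no relationship
```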
We respecify by adding terms of the form t, t², …, to the model M_0. We don't know how far we'll have to go. We aren't inferring anything yet, just building a statistical model whose adequacy for the primary statistical inference will be tested in its own right. With these data, adding t suffices to capture the trend, but in building the model you may include higher terms, allowing some to be found unnecessary later on (a code sketch follows the table below). We can summarize our progress in detecting potential departures from the LRM model assumptions thus far:
                                 Distribution   Dependence    Heterogeneity
LRM                              Normal         Independent   Identically distributed
Alternative                      ?              ?             Mean heterogeneity
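A sketch of this respecification step, adding trend regressors to the earlier simulated fit (the polynomial degree chosen is illustrative):

```python
import numpy as np
import statsmodels.api as sm

# Add polynomial trend terms t, t^2, ... as regressors. With these data a
# linear trend suffices, but higher powers can be included and tested for
# significance later. Uses the simulated x, y from the earlier sketch.
t = np.arange(1, 36)
degree = 2                                      # include t and t^2, say
trend_terms = np.column_stack([t ** d for d in range(1, degree + 1)])
X_trend = sm.add_constant(np.column_stack([x, trend_terms]))
fit_trend = sm.OLS(y, X_trend).fit()
print(fit_trend.pvalues)                        # which trend terms earn their keep?
```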
What about the independence assumption? We could check dependence if our data were ID and not obscured by the influence of the trending mean. We can "subtract out" the trending mean in a generic way to see what the data would be like without it (a detrending sketch follows the figures below). Figures 4.7 and 4.8 show the detrended x_t and y_t. Reading data plots, and understanding how they connect to probabilistic assumptions, is a key feature of the PR approach.
Figure 4.7 Detrended population data.
Figure 4.8 Detrended secret variable data.
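A generic detrending sketch of the kind just described, assuming the simulated series from earlier:

```python
import numpy as np

def detrend(series, degree=1):
    """Subtract a least squares polynomial trend in t from a series --
    the generic "subtracting out on paper" step described in the text."""
    t = np.arange(len(series))
    coeffs = np.polyfit(t, series, degree)    # fit the trend
    return series - np.polyval(coeffs, t)     # what remains once the trend is removed

x_detrended = detrend(x)   # compare with Figure 4.8
y_detrended = detrend(y)   # compare with Figure 4.7
```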
The detrended data in both figures indicate, to a trained eye, positive dependence or "memory" in the form of cycles – this is called Markov dependence. So the independence assumption also looks problematic, and this explains the autocorrelation detected by the Durbin–Watson and runs tests. As with trends, it comes in different orders, depending on how long the memory is found to be. It is modeled by adding terms called lags: to y_t add y_{t−1}, y_{t−2}, …, as many as needed; likewise to x_t add x_{t−1}, x_{t−2}, … (a lag-building sketch follows the table below). Our assessment so far, just on the basis of the graphical analysis, is:
                                 Distribution   Dependence    Heterogeneity
LRM                              Normal         Independent   Identically distributed
Alternative                      ?              Markov        Mean heterogeneity
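A small helper for building such lag terms (a sketch; the alignment convention is one common choice):

```python
import numpy as np

def add_lags(series, n_lags):
    """Columns of series lagged by 1, ..., n_lags. The first n_lags
    observations are dropped, so align the target series the same way."""
    n = len(series)
    cols = [series[n_lags - k : n - k] for k in range(1, n_lags + 1)]
    return np.column_stack(cols)

# Example: two lags of y, aligned with y itself as target.
y_lags = add_lags(y, 2)     # columns y_{t-1}, y_{t-2}
y_target = y[2:]            # y_t, for t = 3, ..., n
```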
Finally, if we can see what the data z_0 = {(x_t, y_t), t = 1, 2, …, 35} would look like without the heterogeneity ("detrended") and without the dependence ("dememorized"), we could get some ideas about the appropriateness of the Normality assumption. We do this by subtracting them out "on paper" again (shown on graphs). The scatter-plot of (x_t, y_t) shows the elliptical pattern expected for Normality (though I haven't included a figure). We can organize our respecified model as an alternative to the LRM:
                                 Distribution   Dependence    Heterogeneity
LRM                              Normal         Independent   Identically distributed
Alternative                      Normal         Markov        Mean heterogeneity
While there are still several selections under each of the menu headings of Markov dependence and mean heterogeneity – the length of the Markov dependence (m) and the degree (ℓ) of the polynomial in t – I'm imagining we've carried out the subsequent rounds of the probing strategy.
The model derived by re-partitioning the set of all possible models, using the new reduction assumptions of Normality, Markov dependence, and mean heterogeneity, is called the Dynamic Linear Regression Model (DLRM). These data require one trend and two lags:
M_2: y_t = β_0 + β_1 x_t + ct [trending mean] + (γ_1 y_{t−1} + γ_2 y_{t−2}) + (γ_3 x_{t−1} + γ_4 x_{t−2}) [temporal lags] + ε_t.
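A sketch of estimating M_2 by least squares, continuing with the simulated series from earlier (variable names are illustrative):

```python
import numpy as np
import statsmodels.api as sm

# Regressors: x_t, the trend t, two lags of y, and two lags of x.
m = 2                              # two lags, as the text says these data require
n = len(y)
t = np.arange(1, n + 1)

X2 = np.column_stack([
    x[m:], t[m:],                          # x_t and the trending mean term
    y[m - 1 : n - 1], y[m - 2 : n - 2],    # y_{t-1}, y_{t-2}
    x[m - 1 : n - 1], x[m - 2 : n - 2],    # x_{t-1}, x_{t-2}
])
fit_m2 = sm.OLS(y[m:], sm.add_constant(X2)).fit()
```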
Back to the Primary Statistical Inference.
Having established the statistical adequacy of the respecified model M_2, we are then licensed in making primary statistical inferences about the values of its parameters. In particular, does the secret variable help to predict the population of the USA (y_t)? No. A test of the joint significance of the coefficients of (x_t, x_{t−1}, x_{t−2}) yields a P-value of 0.823 (using an F test). We cannot reject the hypothesis that they are all 0, indicating that x contributed nothing towards predicting or explaining y. The regression between x_t and y_t suggested by models M_0 and M_1 turns out to be a spurious or nonsense regression. Dropping the x variable from the model and re-estimating the parameters, we get what's called an Autoregressive Model of order 2 (for the 2 lags) with a trend:
y_t = 17 + 0.2t + 1.5y_{t−1} − 0.6y_{t−2} + ε_t.
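A sketch of the joint F test just described, continuing from the fit_m2 above; the parameter labels are statsmodels' defaults for an unnamed design matrix:

```python
# Joint F test that the coefficients of x_t, x_{t-1}, x_{t-2} in M_2 are all
# zero. With an unnamed numpy design matrix, statsmodels labels the parameters
# const, x1, ..., x6 in column order: x_t, t, y_{t-1}, y_{t-2}, x_{t-1}, x_{t-2}.
print(fit_m2.f_test("x1 = 0, x5 = 0, x6 = 0"))   # a large P-value: x adds nothing
```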
The Secret Variable Revealed.
At this point, Spanos revealed that x_t was the number of pairs of shoes owned by his grandmother over the observation period! She lives in the mountains of Cyprus and, at last count, continues to add to her shoe collection. You will say this is a quirky, made-up example; sure. But it serves as a "canonical exemplar" for a type of erroneous inference. Some of the best known spurious correlations can be explained by trending means. For live exhibits, check out an entire website by Tyler Vigen devoted to exposing them. I don't know who collects statistics on the correlation between death by getting tangled in bed sheets and the consumption of cheese, but it's exposed as nonsense by the trending means. An example from philosophy that is similarly scotched is the case of sea levels in Venice and the price of bread in Britain (Sober 2001), as shown by Spanos (2010d, p. 366). In some cases, x is a variable that theory suggests is doing real work; discovering the misspecification effectively falsifies the theory from which the statistical model is derived.
I've omitted many of the tests, parametric and non-parametric, single assumption and joint (several assumptions), used in a full application of the same ideas, and mentioned only the bare graphs for simplicity. As you add questions you might wish to pose, they become your new primary inferences. The first primary statistical inference might indicate that an effect of a certain magnitude passes with severity, and then background information might enter to tell if it's substantively important. At yet another level, the question might be to test a new model with variables to account for the trending mean of an earlier stage, but this gets beyond our planned M-S testing itinerary. That won't stop us from picking up souvenirs.
Souvenir V: Two More Points on M-S Tests and an Overview of Excursion 4
M-S Tests versus Model Selection: Fit Is Not Enough.
M-S tests are distinct from the model selection techniques that are so popular. Model selection begins with a family of models to be ranked by one or another criterion. Perhaps the most surprising implication of statistical inadequacy is to call into question the most widely used criterion of model selection: the goodness-of-fit/prediction measures. Such criteria rely on the "smallness" of the residuals. Mathematical fit isn't the same as what's needed for statistical inference. The residuals can be "small" while systematically different from white noise.
Members of the model selection tribe view the problem differently. Model selection techniques reflect the worry of overfitting: that if you add enough factors (e.g., n − 1 for sample size n), the fit can be made as good as desired, even if the model is inadequate for future prediction. (In our examples the factors took the form of trends or lags.) Thus, model selection techniques make you pay a penalty for the number of factors. We share this concern – it's too easy to attain fit without arriving at an adequate model. The trouble is that it remains easy to jump through the model selector's hoops and still not achieve model adequacy, in the sense of adequately capturing the systematic information in the data. The goodness-of-fit measures already assume the likelihood function, when that's what the M-S tester is probing.
Take the Akaike Information Criterion (AIC) developed by Akaike in the 1970s (Akaike 1973). (There are updated versions, but nothing in our discussion depends on this.) The best known defenders of this account in philosophy are Elliott Sober and Malcolm Forster (Sober 2008, Forster and Sober 1994). An influential text in ecology is by Burnham and Anderson (2002). Model selection begins with a family of models such as the LRM: y_t = β_0 + β_1 x_t + u_t. They ask: Do you get a better fit – smaller residuals – if you add x_t²? What about adding both x_t² and x_t³ terms? And so on. Each time you add a factor the fit improves, but Akaike kicks you in the shins and handicaps you by 1 for the additional parameter. The result is a preference ranking of models by AIC score. For the granny shoe data above, the model that AIC likes best is
y_t = β_0 + β_1 x_t + β_2 x_t² + β_3 x_t³ + u_t.
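A sketch of such an AIC ranking over polynomial models; with the simulated stand-in data the winning degree need not match the cubic model AIC picks for the actual granny-shoe series:

```python
import numpy as np
import statsmodels.api as sm

# Rank polynomial regressions of y on x by AIC, as the model selector would.
for degree in range(1, 5):
    X_poly = sm.add_constant(np.column_stack([x ** d for d in range(1, degree + 1)]))
    print(degree, sm.OLS(y, X_poly).fit().aic)   # smaller AIC = preferred
```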