
The Bell Curve: Intelligence and Class Structure in American Life


by Richard J. Herrnstein and Charles Murray



  Appendix 5

  Supplemental Material for Chapter 13

  Four issues raised in Chapter 13 are elaborated here: test bias, narrowing of the black-white difference in academic test scores, the broader argument for racial differences advanced by Philippe Rushton, and possible language bias against Latinos in the NLSY data.

  MORE ON TEST BIAS

  In Chapter 13, we reported that the scientific evidence demonstrates overwhelmingly that standardized tests of cognitive ability are not biased against blacks. Here, we elaborate on the reasoning and evidence that lead to that conclusion.

  More on External Evidence of Bias: Predictive Validity

  Everyday commentary on test bias usually starts with the observation that members of various ethnic (or socioeconomic) groups have different average scores and leaps to the assumption that a group difference is prima facie evidence of bias. But a moment’s thought should convince anyone that this is not necessarily so. A group difference is, in and of itself, evidence of test bias only if we have some reason for assuming that an unbiased test would find no average difference between the groups. What might such a reason be? We cast the answer in terms of whites and blacks, since that is the context for most charges of test bias. Inasmuch as the context also usually involves a criticism of the use of the test in selection of persons for school or job, the most pertinent reason for assuming equality in the absence of test bias would be that we have other data showing that a randomly selected black and white with the same test score have different outcomes. This is what the text refers to as external evidence of bias.

  If, for example, blacks do better in school than whites when blacks and whites with equal test scores are compared, we could say that the test is biased against blacks in academic prediction. Similarly, if they do better on the job when blacks and whites with equal test scores are compared, the test could be considered biased against blacks for predicting work performance. This way of demonstrating bias is tantamount to showing that the regression of outcomes on scores differs for the two groups. On a test biased against blacks, the regression intercept would be higher for blacks than for whites, as illustrated in the graphic below. Test scores under these conditions would underestimate, or “underpredict,” the performance outcome of blacks. A randomly selected black and white with the same IQ (shown by the vertical broken line) would not have equal outcomes; the black would outperform the white (as shown by the horizontal broken lines). The test is therefore biased against blacks. On an unbiased test, the two regression lines would coincide, because they would have the same intercept (the point at which the regression line crosses the vertical axis).

  When a test is biased because it systematically underpredicts one group’s performance
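  The intercept comparison in this graphic can be sketched numerically. The sketch below (Python, with invented numbers, not drawn from any study the appendix reviews) fits a separate least-squares line to each group; because group B's outcomes sit a constant amount above group A's at every score, the fitted lines share a slope but differ in intercept, which is the signature of underprediction for B:

```python
# Illustrative sketch of intercept-based predictive bias (invented numbers).
# Both groups share the same slope; group B's outcomes are shifted upward,
# so any given test score underpredicts group B's outcome by a constant amount.

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

scores = list(range(80, 121, 5))             # hypothetical test scores
outcome_a = [0.5 * s + 10 for s in scores]   # group A: intercept 10
outcome_b = [0.5 * s + 15 for s in scores]   # group B: intercept 15 (underpredicted)

slope_a, int_a = fit_line(scores, outcome_a)
slope_b, int_b = fit_line(scores, outcome_b)

# Same slope, higher intercept for B: at any shared score, B outperforms A.
print(round(int_b - int_a, 2))   # 5.0 -- constant underprediction for group B
```

  On real data the lines are fitted to noisy observations, so the intercept difference must be tested for statistical significance rather than read off directly.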

  But the graphic above captures only one of the many possible manifestations of predictive bias. Suppose, for example, that a test were less valid for blacks than for whites.1 In regression terms, this would translate into a smaller coefficient (the slope in these graphics), which may or may not be accompanied by a difference in the intercept. The next figure illustrates a few hypothetical possibilities.

  All three black lines have the same low coefficient; they vary only in their intercepts. The gray line, representing whites, has a higher coefficient (therefore, the line is steeper). Begin with the lowest of the three black lines. Only at the very lowest predictor scores do blacks score higher than whites on the outcome measure. As the score on the predictor increases, whites with equivalent predictor scores have higher outcome scores. Here, the test bias is against whites, not blacks. For the intermediate black line, we would pick up evidence for test bias against blacks in the low range of test scores and bias against whites in the high range. The top black line, with the highest of the three intercepts, would accord with bias against blacks throughout the range, but diminishing in magnitude the higher the score.

  When a test is biased because it is a less valid predictor of performance for one group than another
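  The crossing-line case can be made concrete with a small sketch (Python; the slopes and intercepts are invented). A flatter line for one group crosses the steeper line at a single score; below that point, the flatter-line group outperforms the other group at equal scores, and above it the relation reverses:

```python
# Two hypothetical regression lines (all numbers invented): a steeper line for
# group W and a flatter, higher-intercept line for group B, matching the
# "intermediate" case described in the text.

def predicted(score, slope, intercept):
    return slope * score + intercept

slope_w, int_w = 0.6, 2.0    # steeper line: the predictor is more valid for W
slope_b, int_b = 0.3, 32.0   # flatter line with a higher intercept for B

# The score at which the two lines cross; the direction of apparent bias
# flips from one side of this point to the other.
crossing = (int_b - int_w) / (slope_w - slope_b)
print(round(crossing, 6))    # 100.0

# Below the crossing, B outperforms W at equal scores (bias against B);
# above it, W outperforms B at equal scores (bias against W).
print(round(predicted(80, slope_b, int_b) - predicted(80, slope_w, int_w), 2))   # 6.0
print(round(predicted(120, slope_b, int_b) - predicted(120, slope_w, int_w), 2)) # -6.0
```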

  Readers will quickly grasp that test scores can predict outcomes differently for members of different groups and that such differences may justify claims of test bias. So what are the facts? Do we see anything like the first of the two graphics in the data—a clear difference in intercepts, to the disadvantage of blacks taking the test? Or is the picture cloudier—a mixture of intercept and coefficient differences, yielding one sort of bias or another in different ranges of the test scores? When questions about data come up, cloudier and murkier is usually a safe bet. So let us start with the most relevant conclusion, and one about which there is virtual unanimity among students of the subject of predictive bias in testing: No one has found statistically reliable evidence of predictive bias against blacks, of the sort illustrated in the first graphic, in large, representative samples of blacks and whites, where cognitive ability tests are the predictor variable for educational achievement or job performance. In the notes, we list some of the larger aggregations of data and comprehensive analyses substantiating this conclusion.2 We have found no modern, empirically based survey of the literature on test bias arguing that tests are predictively biased against blacks, although we have looked for them.

  When we turn to the hundreds of smaller studies that have accumulated in the literature, we find examples of varying regression coefficients, intercepts, and predictive validities. This is a fundamental reason for focusing on syntheses of the literature. Smaller or unrepresentative individual studies may occasionally find test bias because of the statistical distortions that plague them: sampling and measurement errors; errors in recording, transcribing, and computing data; restrictions of range in both the predictor and outcome measurements; and predictor or outcome scales that are less valid than they might have been.3 Given all these distorting sources of variation, lack of agreement across studies is the rule.

  But even taken down to so fine a level, the case against predictive bias against blacks remains overwhelming. As late as 1984, Arthur Jensen was able to proclaim that “I have not come across a bona fide example of the opposite finding [of a test that underpredicts black performance].”4 Jensen’s every finding regarding racial differences in IQ is routinely subjected to intense scrutiny by his critics, but no one has contradicted this one. We are not absolutely sure that our literature review has identified every study since 1984, but our search revealed no examples to counter Jensen’s generalization.5

  Insofar as the many individual studies show a pattern at all, it points to overprediction for blacks. More simply, this body of evidence suggests that IQ tests are biased in favor of blacks, not against them. The single most massive set of data bearing on this issue is the national survey of more than 645,000 schoolchildren conducted by sociologist James Coleman and his associates for their landmark examination of the American educational system in the mid-1960s. Coleman’s survey included a standardized test of verbal and nonverbal IQ, using the kinds of items that characterize the classic IQ test and are commonly thought to be culturally biased against blacks: picture vocabulary, sentence completion, analogies, and the like. The Coleman survey also included educational achievement measures of reading level and math level that are thought to be straightforward measures of what the student has learned. If IQ items are culturally biased against blacks, one would predict that a black student would do better on the achievement measures than the putative IQ measure would lead one to expect (this is the rationale behind the current popularity of steps to modify the SAT so that it focuses less on aptitude and more on measures of what has been learned). But the opposite occurred. Overall, black IQ scores overpredicted black academic achievement by .26 standard deviations.6
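  The overprediction statistic can be illustrated with a toy computation (Python; the numbers are invented and have nothing to do with Coleman's actual data). Fit one regression on the pooled sample, then take a group's mean residual, actual minus predicted achievement, in standard-deviation units of the outcome; a negative value means the predictor overpredicts that group's achievement:

```python
import statistics

# Toy illustration of measuring overprediction (all numbers invented).
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical IQ and achievement scores; the last four cases form group G,
# which achieves a bit less than the pooled regression predicts.
iq    = [85, 90, 95, 100, 105, 110, 90, 95, 100, 105]
ach   = [44, 47, 50, 53, 56, 59, 42, 44, 47, 49]
group = ["A"] * 6 + ["G"] * 4

slope, intercept = fit_line(iq, ach)
resid = [y - (slope * x + intercept) for x, y in zip(iq, ach)]
sd = statistics.pstdev(ach)

# Mean residual for group G in SD units of the outcome; a negative value
# means the predictor OVERpredicts group G's achievement.
mean_resid_g = statistics.mean(r for r, g in zip(resid, group) if g == "G")
print(round(mean_resid_g / sd, 2))
```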

  One inference that might be drawn from this finding is that black children were for some reason not taking as much from school as their ability would permit, or that black children went to worse schools than white children, or any of several other interpretations. But whatever the explanation might be, the results directly contradict the hypothesis that IQ tests give an unfairly low estimate of black academic performance.

  A second major source of data suggesting that standardized tests overpredict black performance is the SAT. Colleges commonly compare the performance of freshmen, measured by grade point average, against the expectations of their performance as predicted by SAT scores. A literature review of studies that broke down these data by ethnic group revealed that SAT scores overpredicted freshman grades for blacks in fourteen of fifteen studies, by a median of .20 standard deviation.7 In five additional studies where the ethnic classification was “minority” rather than specifically “black,” the SAT score overpredicted college performance in all five cases, by a median of .40 standard deviation.8

  For job performance, the most thorough analysis is provided by the Hartigan Report, assessing the relationship between the General Aptitude Test Battery (GATB) and job performance measures. Out of seventy-two studies that were assembled for review, the white intercept was higher than the black intercept in sixty of them—that is, the GATB overpredicted black performance in sixty out of the seventy-two studies.9 Of the twenty studies in which the intercepts were statistically significantly different (at the .01 level), the white intercept was greater than the black intercept in all twenty cases.10
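  The lopsidedness of those counts is easy to quantify. As a back-of-the-envelope check (an illustration of scale, not the Hartigan Report's own analysis), suppose an unbiased test made either group equally likely to have the higher intercept in any given study; the chance of counts this extreme is then a simple binomial tail:

```python
from math import comb

def binom_tail(n, k):
    """P(X >= k) when X ~ Binomial(n, 1/2): the chance of k or more
    'white intercept higher' studies out of n if either direction
    were equally likely."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

print(binom_tail(72, 60))  # tiny: 60 of 72 is far from a 50/50 split
print(binom_tail(20, 20))  # 1/2**20, roughly one in a million
```

  Study outcomes are not truly independent coin flips, so this conveys only the order of magnitude, not a formal significance test.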

  These findings about overprediction apply to the ordinary outcome measures of academic and job performance. But it should also be noted that “overprediction” can be a misleading concept when it is applied to outcome measures for which the predictor (IQ, in our continuing example) has very low validity. The more blacks and whites differ on average in an outcome that is not linked to the predictor, the more biased the predictor will appear to be against whites. Consider the next figure, constructed on the assumption that the predictor is nearly invalid and that the two groups differ on average in their outcome levels.

  A predictor with low validity may seem to be biased against whites if there is a substantial difference in the outcome measure

  This situation is relevant to some of the outcome measures discussed in Chapter 14, such as short-term male unemployment, where the black and white means are quite different, but IQ has little relationship to short-term unemployment for either whites or blacks. This figure was constructed assuming only that there are factors influencing outcomes that are not captured by the predictor, hence its low validity, resulting in the low slope of the parallel regression lines.11 The intercepts differ, expressing the generally higher level of performance by whites compared to blacks that is unexplained by the predictor variable. If we knew what the missing predictive factors are, we could include them in the predictor, and the intercept difference would vanish—and so would the implication that the newly constituted predictor is biased against whites. What such results seem to be telling us is, first, that IQ tests are not predictively biased against blacks but, second, that IQ tests alone do not explain the observed black-white differences in outcomes. It therefore often looks as if the IQ test is biased against whites.
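  The parallel-lines situation can be reproduced with a small sketch (Python, invented numbers): give the outcome almost no dependence on the predictor, shift one group's outcomes by a constant attributable to an unmeasured factor, and fit each group separately. The slopes come out identical and near zero, and the whole group difference lands in the intercepts:

```python
# Sketch of a nearly invalid predictor (all numbers invented): the outcome
# depends only weakly on the score, and the groups differ by a constant 8
# points due to a factor the predictor does not capture.

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

scores = [85, 90, 95, 100, 105, 110]
outcome_hi = [0.05 * s + 20 for s in scores]  # higher-mean group
outcome_lo = [0.05 * s + 12 for s in scores]  # lower-mean group

slope_hi, int_hi = fit_line(scores, outcome_hi)
slope_lo, int_lo = fit_line(scores, outcome_lo)

print(round(slope_hi, 3), round(slope_lo, 3))  # equal, near-zero slopes
print(round(int_hi - int_lo, 2))               # 8.0: the gap the predictor leaves unexplained
```

  If the unmeasured factor were added to the predictor, both groups would fall on a single line and the intercept difference would vanish, as the text notes.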

  More on Internal Evidence of Bias: Item Analysis

  Laymen are often skeptical that IQ test items could measure anything as deep as intelligence. Knowing the answers seems to them to depend less on intelligence than on having been exposed to certain kinds of cultural or historical information. It is usually a short step from here to the conclusion that the tests must be biased. Pundits of varying sorts reinforce this intuition about test item bias, claiming that the middle- and upper-class white culture infuses test items even after vigorous efforts to expunge it.

  The data confirming Spearman’s hypothesis, which we discussed at some length in Chapter 13, provide the most convincing conceptual refutation of this allegation by providing an alternative explanation that has been borne out by many studies: the items on which blacks and whites differ most widely are not those with the most esoteric cultural content, but the ones that best measure the general intelligence factor, g.12 But many other studies have directly asked whether the cultural content of items is associated with the magnitude of the black-white difference, which we review here.

  One of the earliest of the studies, a 1951 doctoral thesis at Catholic University, proceeded on the assumption that some test items are more dependent on exposure to culture than others.13 Frank McGurk, the study’s author, consequently had large numbers of independent judges rate many test items for their cultural loading. On exploratory tests, he was able to establish each item’s general difficulty, which is defined simply as the proportion of a population that gets the item wrong. He could therefore identify pairs of items, one highly loaded with cultural information and the other not highly loaded but of equal difficulty. Now, finally, the crucial evaluation could be made with a sample of black and white high school students matched for schooling and socioeconomic background. The black-white gap, he discovered, was about twice as large on items rated as low in cultural loading as on items rated as high in cultural loading. Consider, for example, a pair of equally difficult test items. The one that is culturally loaded is probably difficult because it draws on esoteric knowledge; the other item is probably difficult because it calls on complex cognitive processing—g. McGurk’s results undermined the proposition that access to esoteric knowledge was to blame for the black-white difference.
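  McGurk's pairing logic can be shown schematically (Python; the proportions below are invented, not McGurk's). Items are matched on overall difficulty but differ in rated cultural loading; the quantity of interest is whether the black-white gap in the share passing is larger on the loaded or the nonloaded member of each pair:

```python
# Schematic version of McGurk's design (invented proportions). Each entry is
# an item with its rated cultural loading and the share of each group
# answering correctly; items form pairs of equal overall difficulty.
items = [
    {"loaded": True,  "white": 0.65, "black": 0.55},  # pair 1, culture-loaded
    {"loaded": False, "white": 0.70, "black": 0.50},  # pair 1, nonloaded
    {"loaded": True,  "white": 0.45, "black": 0.35},  # pair 2, culture-loaded
    {"loaded": False, "white": 0.50, "black": 0.30},  # pair 2, nonloaded
]

def mean_gap(loaded):
    """Average white-minus-black gap for items of the given loading."""
    gaps = [it["white"] - it["black"] for it in items if it["loaded"] == loaded]
    return sum(gaps) / len(gaps)

print(round(mean_gap(True), 2))   # 0.1  gap on culture-loaded items
print(round(mean_gap(False), 2))  # 0.2  gap on nonloaded items: twice as
                                  # large, the direction McGurk reported
```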

  Another approach in the pursuit of test-item bias is based on which items blacks and whites find hard or easy. Conceptually, this is much like McGurk’s approach, except that it does not require us to have items rated by experts, a subjective procedure that some might find suspect. Instead, if the cultural influence matters and if blacks and whites have access to different cultural backgrounds, then items that pick up these cultural differences should split the two groups. Items drawing on cultural knowledge more available to whites than to blacks should be, on average, relatively easier for whites than for blacks. Items lacking this tip for whites or items with a tip for blacks should not be differentially easier for whites and may be easier for blacks.

  This idea is tested by ranking the items on a test in order of difficulty, separately for whites and for blacks. That is, the easiest item for whites is the one with the highest proportion of correct answers among whites; the next easiest item for whites is the one with the second highest proportion of correct answers among whites; and so on. Now repeat the procedure using the blacks’ proportions of correct answers. This yields two rank orders over the same items. The rank-order correlation between them is a test of the item-bias hypothesis: the larger the correlation, the less support the hypothesis finds. Alternatively, the proportions of correct responses within each group can be transformed into standard scores and then correlated with an ordinary measure of association, such as the Pearson product-moment coefficient.
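  The ranking-and-correlating procedure just described is straightforward to carry out. This sketch (Python, invented proportions) ranks six items by proportion correct within each group and computes the Spearman rank-order correlation from scratch:

```python
# Rank items by proportion correct within each group, then correlate the two
# rank orders. The proportions are invented for illustration.

def ranks(values):
    """Rank positions: 1 = highest proportion correct, i.e., the easiest item."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1))
    formula (assumes no tied ranks)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Proportion answering each of six items correctly, by group (hypothetical):
p_white = [0.92, 0.85, 0.74, 0.60, 0.43, 0.21]
p_black = [0.88, 0.79, 0.71, 0.52, 0.40, 0.24]

print(spearman(p_white, p_black))  # 1.0: identical difficulty ordering
```

  With real test data the ordering is rarely this perfect, but correlations in the .9s emerge from exactly this computation.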

  Either way, the result is clear. Relative item difficulties are essentially the same for both races (by sex). That is, blacks and whites of the same sex come close to finding the same item the easiest, the same item next easiest, all the way down to the hardest item.14 When the rank order of difficulty differs across races, the differences tend to be small and unsystematic. Rank-order correlations above .95 are not uncommon for the items on the Wechsler and Stanford-Binet tests, which are, in fact, the tests that provide most of the anecdotal material for arguing that test items are biased. Pearson correlations are often somewhat lower but typically still above .8. Moreover, when items do vary in difficulty across races, most of the variation is eliminated by taking mental age into account. Since blacks and whites of the same chronological age differ on average in mental age, allowing a compensating lag in chronological age will neutralize the contribution of mental age. Compare, say, the item difficulties for 10-year-old blacks with those for 9-year-old or 8-year-old whites. When this is done, the correlations in difficulty almost all rise into the .9 range and above.15

  Because “item bias” as ordinarily defined has failed to materialize, the concept has been extended to encompass item characteristics that are intertwined with the underlying rationale for thinking that an item measures g. For example, one researcher has found that the black-white gap is diminished for items that call for the subject to identify the one false response, compared to items requiring the subject to identify the one correct response.16 Is this a matter of bias, or a matter of how well the two types of items tap the construct called intelligence? This in turn brings us full circle to Spearman’s hypothesis discussed in Chapter 13, which offers an interpretative framework for explaining such differences.

  More on Other Potential Sources of Bias

  We turn now to one of the least precise but most commonly argued reasons for thinking that tests are biased: Tests are a sort of game, and, as in most games, it helps to have played the testing game before, it helps to get coaching, and it helps to be playing on the home field. Privileged groups get more practice and coaching than underprivileged groups, and they have a home-court advantage: the tests are given in familiar environments, administered by familiar kinds of people. On this argument, a major part of the racial difference in test scores may be attributed to these advantages. In this discussion, we begin with coaching and practice, then turn to some of the other ways in which the testing situation might influence scores.

  PRACTICE AND COACHING. For IQ tests, coaching and practice are not a significant issue because coaching and practice effects exist only under conditions that virtually never apply. To get a sizable practice effect for an IQ test, it is necessary to use subjects who have never taken an IQ-like test, administer the identical test twice, and do so quickly (preferably within a few weeks).17 If the subjects fail to meet any of those conditions, the chances of finding a practice effect are small, and the size of any effect, if one is found, will be just a few points. Coaching effects are even harder to obtain. We are unable to identify any IQ data in any study, large or small, in which the results are compromised because the IQ scores of part of the sample have been obtained after this kind of experience. That’s not the way that IQ tests have been administered anywhere to any significant sample at any time during the history of IQ testing—except to the samples used to assess practice and coaching effects, and sometimes to the subjects of intensive remedial programs such as those discussed in Chapter 17.

 
