by Filip Palda
The same logic applies to treatment and control groups. If you know they started out similarly but see a very large difference between their performance after one group gets government help, then you can use the normal distribution to give you a probability that such a difference could be generated solely by chance. If this probability is low, you can conclude it was not chance but the government program at work, just as with the coin you can conclude that tampering and not chance is responsible for the extended run of heads.
The main shortcoming of the normal distribution is that if you want to use it to assess probability you need to know some detailed facts about the true average of the population and how it spreads out, or its variance. These population averages and spreads are called parameters and they are usually costly to find because you need to gather a lot of information about the whole population. Through decades of study costing millions of dollars we know what the average and variance of IQ scores are in North America. When it comes to farm yields in Senegal we draw a blank. There is a low-rent alternative to the normal distribution that requires no knowledge of population averages and spreads, but simply of sample averages and spreads. Here is what you do.
Take the average performance of your two groups and divide their difference by an approximate measure of its spread called the estimated standard deviation. An estimated standard deviation is the square root of the average of squared differences between each individual score and the average score. The difference in the average performance of the two groups divided by the estimated standard deviation of this difference gives you a number called a t-statistic.
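As a minimal sketch of this recipe, here is how the calculation might look in Python. Neither the language nor the crop-yield figures come from the text; they are invented for illustration, and the n − 1 divisor is the usual sample convention rather than the plain average described above.

```python
import math

# Hypothetical crop yields (tons per hectare), invented for illustration.
treatment = [3.1, 3.6, 2.9, 3.4, 3.3, 3.5]
control   = [2.7, 2.9, 2.6, 3.0, 2.8, 2.5]

def mean(xs):
    return sum(xs) / len(xs)

def est_sd(xs):
    # Square root of the average squared difference between each score
    # and the average score (dividing by n - 1, the usual sample estimate).
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

# Estimated standard deviation of the difference in averages: combine
# each group's spread, scaled down by its sample size.
se_diff = math.sqrt(est_sd(treatment) ** 2 / len(treatment) +
                    est_sd(control) ** 2 / len(control))

t = (mean(treatment) - mean(control)) / se_diff
print(f"t-statistic: {t:.2f}")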
This t is basically a signal-to-noise ratio. You have experienced this sort of dilemma when listening to a fuzzy radio broadcast. Was that a voice you heard or was it just static? In trying to understand whether foreign aid increases crop yields, the “signal” is how big the difference is between the performance of the treatment and control villages. The “noise” is how spread out the results are around the average. If the ratio of signal to noise is high, then the t is high. Large t’s are very unlikely if only noise is responsible for differing crop yields. If differences are due purely to chance, then we would expect them largely to cancel each other out and we would observe very little average difference between yields. A large t means there is a strong probability that a signal really is rising above the noise.
The surprising aspect of the t-distribution is that the only information you need to calculate the probability that the t-value derived from the difference between your treatment and control groups arose entirely by chance is the number of people in your sample. You do not need to know the population average or variance. This is why the t-test is said to be free of parameters: none of the population’s unknowns have to be supplied in advance. A parameter is a value that goes into some equation to determine how inputs to that equation bleed out into results.
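To see this in action, here is a minimal sketch using Python’s scipy library (an assumption on my part; the book specifies no software), with an invented t-value. The two group sizes are the only other inputs you need.

```python
from scipy import stats

t_value = 2.4            # hypothetical t from a treatment/control comparison
n_treat, n_ctrl = 6, 6   # group sizes: all the t-distribution asks for
df = n_treat + n_ctrl - 2

# Two-sided probability of a t at least this extreme arising by chance alone.
p_value = 2 * stats.t.sf(abs(t_value), df)
print(f"p-value: {p_value:.3f}")
```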
What is the catch? Statistics, like economics, is a science of trade-offs. Because the t-distribution requires less information than the normal distribution, it also yields less information about the nature of the differences between groups. Suppose the t-value we calculate based on the difference between groups and their spreads has a one-in-ten probability of being observed if only chance were responsible for creating the difference, and not some systematic force like government aid. If we had sufficient background information to use the normal distribution, we might be able to say instead that the probability of observing such a difference by chance is one in a hundred. In other words, when there is a difference between groups generated by something other than chance, the normal will let you see that better than the t. The catch is that you need more background information on the distributions of the experimental and control groups in order to use the normal. To use the t you do not need this information. The good news, according to statisticians, is that if you are dealing with groups of more than thirty individuals, or farms, or whatever your unit of analysis is, then the t-distribution starts “converging” to, or becoming almost as good as, the normal in telling you whether a difference is due to chance or to some systematic effect.
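That convergence is easy to check numerically. A small sketch, again leaning on scipy, compares the chance of landing beyond 2 under the t-distribution, for growing sample sizes, with the same chance under the normal:

```python
from scipy import stats

# Probability of falling more than 2 standard deviations out, by chance alone.
for df in (5, 10, 30, 100):
    print(f"t with {df:>3} degrees of freedom: {stats.t.sf(2, df):.4f}")
print(f"normal distribution:           {stats.norm.sf(2):.4f}")
```

Past thirty degrees of freedom the tail probabilities sit close to the normal’s, which is what the rule of thumb promises.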
To summarize, there are rules that govern chance. By applying these rules economists can quantify their belief that a certain cause had a certain effect. Once we have assured ourselves that both control and treatment groups have similar measurable, consistent characteristics, statistics steps in and allows us to assess how much of the remaining result is due to chance and how much to the government program. If statistics tells us that the probability that chance was responsible for the difference is very small, say five per cent, then economists speak of a statistically significant result.
The message that can get lost in this discussion of statistics is that there is a logical model for establishing that one independent factor or variable, such as a government program, influences some target or dependent variable, such as crop yields or poverty. This logical construct has nothing at all to do with statistics. It simply says that you should start with two similar groups. You expose the treatment group to a stimulus, such as a government program to get the destitute back to employment. You compare that group’s average result to that of a similar group of the destitute. Any difference you note could then be due to the program because the influence of all other possible factors will cancel out due to the similarity in the two groups.
Statistics enters because any inference you draw from the comparison of averages needs to be qualified. You can make comparison groups that are similar in age, sex, education, and so on, but you can never eliminate the chance factor that will tilt individuals one way or the other. While you cannot eliminate chance, you can use statistics to tell you how probable it is that the differences you observe are due purely to chance. If, say, there is only a one-in-a-hundred chance that the differences between the control and treatment groups would be generated by chance, then you deduce with this statistical confidence that the program had an effect.
Finding similar groups
THE LOGIC OF finding causality depends on starting with similar treatment and control groups, subjecting the treatment group to, say, a government program, and subtracting the subsequent improvement in the control group’s performance from that of the treatment group to arrive at the net effect of the government program. All of that is fairly straightforward except for “similar groups”. What does that mean? We need to look at some early attempts to put the idea of similarity into practice.
During the recessions of the early 1970s and then again in the 1990s, welfare rolls swelled with chronically unemployed but able-bodied people. Previous research had shown that the longer a bout of unemployment lasted, the less chance any given person had of finding a job. Their skills might degrade, their motivation could wilt, and employers would start asking pointed questions. To pull the unemployed out of this welfare trap governments devised short-term programs that paid private employers to hire them. All the unemployed person had to do was volunteer for the program and agree to be interviewed at six-month intervals for several years afterwards to see how he or she was getting on after the end of the subsidized job. The group of volunteers was, in effect, the treatment group.
Researchers had access to data on each person pertinent to how well he or she might be expected to perform in the labor market. Age, sex, education, and length of previous bouts on welfare were among these available data. Yet more information was needed. To gauge the success of job subsidies, researchers could not simply look at the job market record of participants after the program. If many succeeded in remaining employed after the program they might have done so because the economy was improving, and not because of the experience they had gained during their subsidized spell of work.
Researchers decided to create a control group. But where to get it? The answer was to search through diverse government databases. There they found unemployed welfare dependents who had not signed up for subsidized jobs, but who were an exact match in the factors deemed to be important for job market performance. Age, sex, education, and the usual list of suspects were the criteria used to make matches. The color of socks one wore was not, presumably because that had no influence on job market success.
Because they were identical in the characteristics that mattered, this control group of non-participants would presumably react to general economic conditions in the same manner as the participants. Once you subtracted the post-program employment performance of non-participants from that of participants, you would filter out the influence of all factors other than the subsidy and chance. Other studies using this methodology, which came to be known as “quasi-experimentation”, found that government programs where people volunteered to participate had strong and undeniably positive effects.
Enthusiasm for such findings deflated after economist Robert LaLonde (1986) pierced them with a critique. He argued that the programs were showing an effect that was due not to the government program but rather to some elusive, yet very real, difference between groups. Perhaps the volunteers who comprised the treatment group were more motivated to find work than were non-participants. If you do not match treatment and control groups on motivation, then the apparent success of the program may be due not to government subsidies but rather to motivation. Incomplete matching is the same thing as lack of control, which is also the same thing as a failure to filter potentially confounding effects out of the result. In the case where people choose for themselves to participate in a program, this confounding effect is called “self-selection bias”. In practice, LaLonde found this bias so severe that it cast doubt on causal inferences drawn from all experiments where people had a say in whether or not they would be part of the treatment group. His critique was similar to that launched against Tinbergen’s regression agenda. In regression, leaving important variables out of the formula can bias your estimates of cause and effect.
Despite these critiques, economists felt they were on the right track. They just needed to prevent people from biasing the experiments. Since this bias arose from the choice to either participate or not participate, the answer seemed clear: create treatment and control groups without reference to people’s wishes. This could be done by adopting a long-established practice from biology and multi-billion-dollar pharmaceutical research, called randomization. As the name implies, it is the whim of chance, and not the determination of the individual, that decides the group, treatment or control, to which he or she is assigned.
Randomized experiments
RANDOMIZATION IS SO ridiculously simple that you can perform it in your living room. Suppose you want to test whether hanging paintings on the walls of otherwise drab hospital rooms speeds patient recovery. Go to a ward and get a list of the numbers of the hundred or so rooms they have there. Write each room number on a piece of paper and put all papers in an urn, a bucket, or even a gym bag. Put on a blindfold, dip into the container, and place the first slip you grab on a pile called “treatment”. Place the next slip on a pile called “control”. Do this a hundred times and you will have a list of which rooms will be in your treatment group and which in the control group. You have just laid the basis for a randomized experiment. What you will find after your selection is that the characteristics of participants and non-participants in the experiment tend to be very similar, provided you have more than a few dozen in each group.
People in rooms to be assigned a painting and those to be left without one have similar average ages, education, number of children, afflictions, and other factors that could influence their “performance”—in this case, the speed of recovery. The similarity arises because chance does not dictate you will pick rooms with women for the treatment group with any greater frequency than you will pick them for the control group. The same holds for other characteristics of the people being studied.
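For the curious, the whole living-room procedure fits in a few lines of Python. In this sketch, shuffling the room numbers and splitting the pile in half stands in for alternating slips between two piles, and the patient ages are invented purely to show the balance that emerges:

```python
import random

rooms = list(range(1, 101))   # a hypothetical ward with 100 rooms
random.shuffle(rooms)         # the blindfolded dip into the urn

treatment = rooms[:50]        # rooms whose walls get a painting
control = rooms[50:]          # rooms left drab

# Invented patient ages: with groups this size, averages come out close.
ages = {room: random.randint(20, 90) for room in rooms}

def avg_age(group):
    return sum(ages[r] for r in group) / len(group)

print(f"average age, treatment: {avg_age(treatment):.1f}")
print(f"average age, control:   {avg_age(control):.1f}")
```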
When people say that chance is blind, what they are really trying to express is that it is impartial (by the way, this was a true study, and it found that, with high statistical confidence, the presence of paintings on the walls sped up recovery; I have had an impressionist on my wall ever since). The reason randomization is a credible way of setting up experiments is that, being impartial, chance will allocate participants and non-participants equally, not simply on the basis of their observable characteristics but also on their unobservable characteristics.
If welfare recipients had been allocated randomly to either participate or not participate in subsidized jobs, self-selection bias might have been avoided. Chance would dictate that both treatment and control groups would have similar levels of motivation, or morale. Being able to match groups on unobservable characteristics is the main reason for using randomization to create treatment and control groups.
Here, as in the case of opting for the t-statistic instead of the normal distribution, there is a trade-off to be considered. The price you pay for matching fairly closely on unobserved characteristics is that control and treatment groups are close, but do not have exactly the same average features on the observed characteristics as was the case in quasi-experiments. The small differences that remain could bias your results. Recall that differences in characteristics other than participation in the program can influence differences between the performances of the two groups.
This is a trade-off many are willing to make. They will accept a small exaggeration, or possibly an obfuscation, of the result (bias can run both ways) due to imperfect matching, rather than be deceived by the very large bias that quasi-experiments risk when people, rather than chance, decide whether to participate, no matter how perfectly those experiments match groups on the characteristics we can observe.
A few examples
ONCE YOU UNDERSTAND the basic logic of randomized experiments you need not be an economist to tap into a rich flow of studies on the effects of public policies. All you need to understand is how to compare averages. Even to economists versed in the mysteries of econometrics this approach is a welcome relief from the nuanced and highly subjective attempts at analysis using non-experimental data to which they are accustomed, or even inured.
Consider the question of whether private schools provide better education than public schools. The average performance of students at private schools generally exceeds that of students at public schools. But is this due to a better quality of education, or simply because the students who tend to go to private schools come from richer families with more resources outside class, such as private tutoring, to improve their children’s academic performance? Simple comparisons of public and private schools tell us little about their relative quality because these comparisons lack control. You need to compare similar students in public and private schools.
As educational scholar Paul Peterson and his colleagues describe, such a comparison was accomplished in a 1997 randomized experiment by the School Choice Scholarships Foundation (SCSF) in New York City. The foundation invited parents to apply for “1,300 scholarships worth up to $1,400 annually for at least three years to children from low-income families then attending public schools. The scholarship could be applied toward the cost of attending a private school, either religious or secular. After announcing the program, SCSF received initial applications from over 20,000 students between February and late April 1997” (2003, 109). The researchers then randomly chose 6,000 names from the 20,000 who had applied: 3,000 received the scholarship and 3,000 formed the control group. Similar experiments were conducted around the same time in Washington, DC, and Michigan. Non-African-Americans saw no improvement in their test scores, but for African-Americans the result was different. According to Peterson and his colleagues, “Overall, the effects of attending a private school on student test scores are moderately large … black students who switched to private schools scored, after one year, 0.18 standard deviations higher than the students in the control group. After two and three years, the size of the effect grew to 0.28 and 0.30 standard deviations” (2003, 121). In statistics, when you rise 0.30 standard deviations above the average, you find yourself roughly in the top third of students.
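That last figure is a one-line check, assuming test scores follow a roughly normal distribution:

```python
from scipy import stats

# Share of students scoring above someone who sits 0.30 standard
# deviations above the mean, under a normal distribution of scores.
print(f"{stats.norm.sf(0.30):.2f}")  # about 0.38, roughly the top third
```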
While randomized studies of this sort can cost millions of dollars, a little imagination can lead to fascinating findings without the researcher experiencing sticker shock. Economist Joel Slemrod and his colleagues analyzed data from a Minnesota Department of Revenue experiment to enhance tax compliance. They write that, “in 1995 a group of 1724 randomly selected Minnesota taxpayers was informed by letter that the returns they were about to file would be ‘closely examined’. Compared to a control group that did not receive this letter, low and middle-income taxpayers in the treatment group on average increased tax payments compared to the previous year” (2001, 455).
These are the types of easy-to-understand studies that are responsible for the “credibility revolution” in economic data analysis. All the work goes into making sure that confounding forces are controlled and that randomness is neutralized through large sample size. The rest is just a comparison of averages to reveal whether some program had any effect.
Natural experiments
THE LOGIC OF randomized controlled experiments carries over to so-called “natural experiments”. A natural experiment is one in which some clearly understood random process determines the manner in which a cause, such as government intervention, is applied to a target population. The random aspect of who participates or when people participate helps to create similar groups in precisely the same manner as did the experimental randomization described earlier. This abstract idea is best understood through examples.
Sebastian Galiani and his colleagues studied the military draft lottery in Argentina to see if military service makes men violent even after they have left the army. Since the lottery is random, it creates nearly identical groups: young men chosen for the army and young men who do not perform military service. They put their idea of a natural experiment as follows: “In order to identify the causal effect of conscription on crime, we need to identify a variable that affects participation in conscription but does not affect crime through any other mechanisms. The draft lottery in Argentina offers an opportunity to address this question. The lottery randomly assigned eligibility of all young males” (2011, 121).