Reign of Error: The Hoax of the Privatization Movement and the Danger to America's Public Schools
• peer culture and achievement
• prior teachers and schooling, as well as other current teachers
• differential summer learning loss, which especially affects low-income children
• the specific tests used, which emphasize some kinds of learning and not others, and which rarely measure achievement that is well above or below grade level17
Value-added ratings, they emphasized, are not stable. They vary from class to class, from year to year, and from one way of measuring to another. There are different ways to calculate value added, and the same teacher may receive different ratings depending on which method is used. When students take a different test, the ratings change again.
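To make this instability concrete, here is a minimal sketch, assuming a toy simulated dataset rather than any real test data or any district's actual model: it computes "value added" in two common ways, as a raw gain score and as a regression adjustment for prior achievement, and compares the resulting teacher rankings.

```python
# Purely illustrative toy simulation -- not any state's or district's actual
# value-added model. It shows that two common ways of computing "value added"
# (a raw gain score vs. a regression adjustment for prior achievement) can
# rank the very same teachers differently on the very same data, and that
# teachers assigned higher-scoring classes tend to show smaller raw gains.
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 20, 25

true_effect = rng.normal(0, 2, n_teachers)         # unobservable "true" teacher effect
class_prior_mean = rng.normal(50, 10, n_teachers)  # students are NOT randomly assigned

prior_parts, current_parts, teacher_parts = [], [], []
for t in range(n_teachers):
    prior = rng.normal(class_prior_mean[t], 8, n_students)
    # Current score depends on prior score, the teacher, and plain noise.
    current = 0.8 * prior + 12 + true_effect[t] + rng.normal(0, 6, n_students)
    prior_parts.append(prior)
    current_parts.append(current)
    teacher_parts.append(np.full(n_students, t))

prior_all = np.concatenate(prior_parts)
current_all = np.concatenate(current_parts)
teacher_id = np.concatenate(teacher_parts)

# Method 1: each teacher's average raw gain (current minus prior score).
gain = np.array([(current_all - prior_all)[teacher_id == t].mean()
                 for t in range(n_teachers)])

# Method 2: regress current on prior scores across all students, then average
# each teacher's residuals -- "value added" relative to the fitted expectation.
slope, intercept = np.polyfit(prior_all, current_all, 1)
residual = current_all - (slope * prior_all + intercept)
adjusted = np.array([residual[teacher_id == t].mean() for t in range(n_teachers)])

top5_by_gain = set(np.argsort(-gain)[:5])
top5_by_adjusted = set(np.argsort(-adjusted)[:5])
print("Top 5 teachers by raw gain:       ", sorted(top5_by_gain))
print("Top 5 teachers by adjusted rating:", sorted(top5_by_adjusted))
print("Overlap between the two top-5 lists:", len(top5_by_gain & top5_by_adjusted), "of 5")
```

With nonrandomly assigned classes, the two top-five lists generally differ, and the raw-gain method tends to favor teachers whose classes started with lower scores, which echoes the panel's warning that ratings reflect whom a teacher is assigned as much as how well that teacher teaches.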
The report by these two professional associations found that “teachers’ value-added ratings are significantly affected by differences in the students who are assigned to them.” Students are not randomly assigned. Those who teach students who are English-language learners or who have disabilities or who are homeless or who have poor attendance might have lower value-added ratings. Also, teachers of the gifted are likely to see small value added, because their students begin with high scores. “Even when the model includes controls for prior achievement and student demographic variables, teachers are advantaged or disadvantaged based on the students they teach.” Thus, to the extent that teachers’ job evaluations and compensation are tied to value-added measures, they may feel encouraged to avoid the neediest students, the students who are going to jeopardize their reputations, their careers, and their salaries.
Advocates of value-added assessment claim that they want to improve education for the neediest students by identifying the most effective teachers. They presume that, over time, as the weakest teachers are fired, only effective teachers will remain. But given the instability of the measures and the threat to teachers’ livelihoods, value-added assessment may well harm the most vulnerable students. Current levels of inequality will deepen if teachers are incentivized to shun the students with the highest needs. Schools in high-poverty districts already have difficulty retaining staff and replacing those who leave. Who will want to teach in schools that are at risk of closing because of the students they enroll?
The very concept of “value added” assumes that it is possible to isolate the effects of a single teacher on student achievement. But, says the joint AERA-NAE panel, this is overly simplistic:
No single teacher accounts for all of a student’s learning. Prior teachers have lasting effects, for good or ill, on students’ later learning, and current teachers also interact to produce students’ knowledge and skills. For example, the essay writing a student learns through his history teacher may be credited to his English teacher, even if she assigns no writing; the math he learns in his physics class may be credited to his math teacher. Specific skills and topics taught in one year may not be tested until later, if at all. Some students receive tutoring, as well as help from well-educated parents. A teacher who works in a well-resourced school with specialist supports may appear to be more effective than one whose students don’t receive these supports.
Children are not corn or tomatoes, and no statistical methodology can successfully control for all the factors that influence changes in students’ test scores. When analyzing the growth of cornstalks, we take into account the quality of the seeds, soil, water, wind, sunlight, weather, nutrients, pests, and perhaps other factors as well as the skill of the farmer. Measuring learning is far more complex than measuring agricultural production and involves many more factors because human beings are even less predictable than plants.
The champions of value-added assessment could learn from Harvey Schmidt and Tom Jones, whose musical The Fantasticks got it right:
Plant a radish.
Get a radish.
Never any doubt.
That’s why I love vegetables;
You know what you’re about!
Plant a turnip.
Get a turnip.
Maybe you’ll get two.
That’s why I love vegetables;
You know that they’ll come through!
They’re dependable!
They’re befriendable!
They’re the best pal a parent’s ever known!
While with children,
It’s bewilderin’.
You don’t know until the seed is nearly grown
Just what you’ve sown.
Another complicating factor in the creation of value-added rankings is that they are based completely on standardized tests. But are the tests robust enough to serve as a proxy for teacher quality? The tests are not barometers or yardsticks. They are designed and constructed by humans and subject to error. Testing experts warn about measurement error, statistical error, human error, and random error. Given all the problems with standardized tests, and given the limited range of knowledge and skills that they test, can we be sure that they are truly an adequate or appropriate measure of student learning or teacher effectiveness? One can easily imagine a teacher who spends most of the year drilling her students to take the state tests. That teacher may get a high value-added rating yet be an uninspiring teacher. Do we want to honor and reward only those teachers who excel at teaching to the test? Or do we want to honor those teachers who are best at getting their students to think and ask good questions?
The cardinal rule of psychometrics is this: a test should be used only for the purpose for which it is designed. The tests are designed to measure student performance in comparison to a norm; they are not designed to measure teacher quality or teacher “performance.” Teaching is multifaceted and complex. Good teachers want students to participate in discussion and debate in the classroom; they want students to be active and engaged learners and to take the initiative in exploring more than what was assigned. Can standardized, multiple-choice tests accurately reflect teacher quality? What students have learned may be gauged more accurately by their classroom work and by their independent projects—their essays, their research papers, and other demonstrations of their learning—than by their test scores.
Certainly teachers should be evaluated, but evaluating them by the rise or fall of their students’ test scores is fraught with perverse consequences. It encourages teaching to multiple-choice tests; narrowing the curriculum only to the tested subjects; gaming the system by states and districts to inflate their scores; and cheating by desperate educators who don’t want to lose their jobs or who hope to earn a bonus. When the tests become more important than instruction, something fundamental is amiss in our thinking.
Some districts and states are trying to avoid narrowing the curriculum by expanding testing beyond reading and mathematics; they intend to test the arts, physical education, science, and everything else that is taught. They are doing this to create the data needed to evaluate all teachers. Students will be tested more so that their teachers can be evaluated. As the current national obsession with testing intensifies, we can expect to see more testing, more narrowing of the curriculum and of instruction to only what is tested, more cheating, and less attention to teaching students to think, to discuss, to consider different ways to solve problems, and to be creative.
Linda Darling-Hammond, a Stanford University professor who is one of the nation’s leading experts on the subject of preparing and evaluating teachers, lost her enthusiasm for evaluation by test scores as she saw the confusing and misleading results in such places as Tennessee, Houston, the District of Columbia, and New York City.18
Darling-Hammond concluded that the teacher ratings “largely reflect whom a teacher teaches, not how well they teach. In particular, teachers show lower gains when they have large numbers of new English-learners and students with disabilities than when they teach other students. This is true even when statistical methods are used to ‘control’ for student characteristics.”
Why punish teachers for choosing to teach the students with the greatest needs or for being assigned to a class with such students?
If the goal of teacher evaluation is to help teachers improve, this method doesn’t work. It doesn’t provide useful information to teachers or show them how to improve their practice. It just labels and ranks them in ways that teachers find demeaning and humiliating. Darling-Hammond noted that Houston used a value-added method to fire a veteran who had been the district’s teacher of the year. Another teacher in Houston said: “I teach the same way every year. [My] first year got me pats on the back. [My] second year got me kicked in the backside. And for year three, my scores were off the charts. I got a huge bonus. What did I do differently? I have no clue.”
In 2010, the Los Angeles Times commissioned its own value-added analysis, based on nothing but test scores, and published the rankings of thousands of teachers. This initiated a national controversy about the ethics of publishing teachers’ job ratings. No one claimed that instruction improved as a result.19 The flaws of value-added analysis set off another heated debate in early 2012, when the New York City Department of Education publicly released the names and ratings of thousands of teachers. Rupert Murdoch’s New York Post filed a freedom-of-information request for the ratings, which the teachers’ union opposed, citing the ratings’ inaccuracy. Mayor Michael Bloomberg contended that parents and the public had a right to know the teacher ratings. After the newspaper won the court battle with the union, the scores were released and widely published. The Department of Education warned the public about a large margin of error: On a 100-point scale, the margin of error in mathematics was 35 percentage points; the margin of error in reading was 53 points. In other words, a teacher of mathematics might be ranked as a 50 but might in fact be anywhere from the 15th percentile to the 85th percentile. In reading, the same teacher might improbably be at the -3rd percentile or the +103rd percentile, which demonstrates how useless the rankings were.20
The New York Post printed a story and photograph of the city’s “best” teacher and its “worst” teacher. The teacher who was allegedly the worst was hounded by reporters at her home, as was her father. A few days later, it was revealed that she was a teacher of new immigrant students, who left her class as they learned English. She worked in a good school, and the principal said she was an excellent teacher. What was gained by giving her a low rating and putting her name and photograph in the newspaper? She suffered public humiliation because she taught English-language learners.21
Stated as politely as possible, value-added assessment is bad science. It may even be junk science. It is inaccurate, unstable, and unreliable. It may penalize those teachers who are assigned to teach weak students and those who choose to teach children with disabilities, English-language learners, and students with behavioral problems, as well as teachers of gifted students who are already at the top.
So we circle back to the claim that is common among reformers: that three great teachers in a row will close the achievement gap. It is possible, but there is no statistical method today that can accurately predict or identify which teachers are “great” teachers. If by great, we mean teachers who awaken students’ desire to learn and kindle in them a sense of excitement about learning, scores on standardized tests do not identify those teachers. Nothing about a multiple-choice test is suited to finding the most inspiring and most dedicated teachers. In every school, students, teachers, and supervisors know who those teachers are. We need more of them. We will not get them by continuing to turn teachers into testing technicians or by judging them with inappropriate statistical models.
If by great, we mean the ability to get students to produce higher scores every time they are tested, the current value-added assessments may identify some teachers who can do this. But, to my knowledge, there is no school in which every teacher achieves this target. Claiming, as reformers do, that one day every classroom will have a teacher who can produce extraordinary test score gains for every student, no matter what his or her circumstances, is simply not leveling with the American public. No nation in the world has achieved 100 percent proficiency. And no other nation in the world evaluates its teachers by the rise or fall of their students’ test scores.
It is not even clear that this is a worthy goal.
Aside from the absence of evidence for this way of evaluating teachers, there remains the essential question of why scores on standardized tests should displace every other goal and expectation for schools: character, knowledge, citizenship, love of learning, creativity, initiative, and social skills.
CHAPTER 12
Why Merit Pay Fails
CLAIM Merit pay will improve achievement.
REALITY Merit pay has never improved achievement.
The reformers at the big foundations and the U.S. Department of Education decided that they could raise test scores and change the nature of the teaching profession by offering a type of financial incentive known as merit pay.
Like other corporate reformers, they believed that American public education is failing because of its teachers. They believed that the wrong kinds of people entered teaching, the kinds who did not graduate from elite colleges and universities and did not rank in the top third of their graduating classes. The reformers thought that the chance to earn performance pay would attract recruits with strong academic backgrounds. This, they thought, would solve the teacher-quality problem.
Reformers want education to become more like business, governed by the same principles of competition, with compensation tied to results. They see teacher tenure and teacher seniority as obstacles to achieving the flexible, results-oriented workforce that they believe is needed. They see teachers’ unions as an obstacle, because the unions defend job protections for teachers and demand a salary scale with increases related to experience and education. In the reformers’ ideal scenario, teachers would serve at the pleasure of their supervisors, as do employees in the business world. If they produce results, they win bonuses. If they do not produce results, their jobs will be on the line. Two or three years of lackluster results, and they will be fired. In some districts, such as the District of Columbia, a teacher may be fired based on a single year of poor results.
Teachers usually find this line of argument objectionable. They see themselves as professionals, not just employees. They don’t like the idea that non-educators (and most of the reformers are non-educators or have taught for only a few years) are redesigning the rules of their workplace and profession. They don’t like merit pay, because they know it will destroy the collaboration that is necessary for a healthy school climate. They are dumbfounded that the public discourse about education is fixated on blaming teachers for the ills of society. They don’t understand why so much political energy is now being expended to remove their job security and put their careers in jeopardy.
Teachers are right to feel aggrieved. The remedies now promoted as cures for the teaching profession are unlikely to have a beneficial effect; they are almost certain to make the profession less attractive to those who want to make a career of teaching. The constant criticism that has dominated reform discourse for the past few years has discouraged and demoralized teachers.
Many teachers were disheartened by No Child Left Behind, which overemphasized standardized testing. Obama’s Race to the Top has proved even more discouraging than NCLB because it directly targets teachers as the source of student success or failure. As a remedy, Race to the Top offers school districts incentives to fire the teachers in schools with low test scores. The U.S. Department of Education set aside $1 billion for merit pay in the Teacher Incentive Fund.
Merit pay is not an innovative idea. It has been tried in school districts across the nation for the past century. Richard J. Murnane and David K. Cohen surveyed the history of merit pay in the mid-1980s and concluded that it “does not provide a solution to the problem of how to motivate teachers.” In 1918, they reported, 48 percent of the school districts in the United States had some kind of merit pay plan, but few of them survived. By 1923, the proportion of districts with a merit pay plan had fallen to 33 percent, and in 1928 it was down to 18 percent. During the 1940s and 1950s, interest in merit pay declined, and by 1953 only 4 percent of cities with a population over thirty thousand offered merit pay. This could not have been because of the power of teachers’ unions, because there were few unionized teachers at the time, and where unions existed, they were poorly organized and weak. After Sputnik in 1957, there was again a flurry of interest in merit pay, and 10 percent of districts offered it. But many of these programs disappeared, and by the mid-1980s, when Murnane and Cohen wrote their article, 99 percent of the nation’s teachers were in districts that had a uniform salary schedule, based on education and experience.1
Murnane and Cohen found two types of merit pay plans. One offered bonuses to teachers if their students got higher test scores. The other offered bonuses to teachers who got superior evaluations from their principals. They described efforts to tie teachers’ pay to student test scores as the “new style” of merit pay; they called it a “piece-rate compensation system.” This method avoids the subjectivity of the “old style,” which depends on principal judgment, but the “new style” does not fit the nature of teachers’ work. Piece-rate work, they noted, is better suited to manufacturing jobs, where it is relatively easy to measure the true contribution of the individual worker to the firm’s output at low cost. So, for example, a commercial laundry can pay workers based on how many shirts they iron in an hour or a day. Quality can be determined by customer complaints.
But piece-rate compensation doesn’t work with teachers, said Murnane and Cohen. First, it encourages teachers to spend more time with the students who will respond to their coaching and to spend less time with those who will not. This problem was observed in the 1970s, when the U.S. government offered performance contracts to private firms to manage schools; evaluations revealed that the firms overlooked the students at the top, who could manage on their own, and the students at the bottom, who were the toughest challenge. Most of their efforts were concentrated on the students in the middle, who would show the biggest improvement.