by Ian Ayres
John Bergsagel, a senior oncologist at the hospital, was troubled by light brown spots on the boy’s skin that didn’t quite fit the normal symptoms of leukemia. Still, Bergsagel had lots of paperwork to get through and was tempted to rely on the blood test that clearly indicated leukemia. “Once you start down one of these clinical pathways,” Bergsagel said, “it’s very hard to step off.”
By chance, Bergsagel had recently seen an article about Isabel and had signed up to be one of the beta testers of the software. So instead of turning to the next case, Bergsagel sat down at a computer and entered the boy’s symptoms. Near the top of the “Have you considered…?” list was a rare form of leukemia that chemotherapy does not cure. Bergsagel had never heard of it before, but sure enough, it often presented with brown skin spots.
Researchers have found that about 10 percent of the time, Isabel helps doctors include a major diagnosis that they would not otherwise have considered but should have. Isabel is constantly putting itself to the test. Every week the New England Journal of Medicine includes a diagnostic puzzler in its pages. Simply cutting and pasting the patient’s case history into the input section lets Isabel produce a list of ten to thirty diagnoses. Seventy-five percent of the time these lists include what the Journal (usually via autopsy) verifies as the correct diagnosis. And manually entering the findings into more tailored input fields raises Isabel’s success rate to 96 percent. The program doesn’t pick out a single diagnosis. “Isabel is not an oracle,” Britto says. It doesn’t even assign probabilities or rank the candidates by likelihood. Still, narrowing the likely causes from 11,000 possible diseases to 30 unranked candidates is substantial progress.
I love the TV show House. But the central character, who has unsurpassed powers as a diagnostician, never does any research. He relies on his experience and Sherlockian deductive powers to pull a diagnostic rabbit out of the hat each week. House makes excellent drama, but it’s no way to run a health care system. I’ve suggested to my friend Lisa Sanders, who recommends script ideas for the series, that House should have an episode in which the protagonist vies against data-based diagnostics, à la Kasparov vs. the IBM computer. Isabel’s Dr. Joseph Britto doesn’t think it would work. “Each episode would be five or seven minutes instead of an hour,” he explains. “I could see Isabel working much better with Grey’s Anatomy or ER where they have to make a lot of decisions under a lot of time pressure.” Only in fiction does man beat the machine.
And Super Crunching is going to make the diagnostic predictions even better. At the moment these programs are still basically crunching journal articles. Isabel’s database contains tens of thousands of associations, but at the end of the day it is solely a compilation of information published in medical journal articles. A team of doctors, aided by Google-like natural-language searches, looks for published symptoms that have been associated with a particular disease and enters the results into the diagnostic database.
As it currently stands, if you go to see a doctor or are admitted to the hospital, the results of your experience have absolutely no value to our collective knowledge of medicine—save for the exceptional case in which your doctor decides to write it up for a journal or when your case happens to be included in a specialized study. From an information perspective, most of us die in vain. Nothing about our life or death helps the next generation.
The rapid digitization of medical records means that for the first time ever, doctors are going to be able to exploit the rich information embedded in our aggregate health care experience. Instead of giving an undifferentiated list of possible diagnoses, Isabel will, within a year or two, be able to give the likelihood of particular diseases conditional on your particular symptoms, patient history, and test results. Britto grows animated as he describes the possibilities. “You have someone who comes in with chest pains, sweating, palpitations, and is over fifty years old,” he says. “You as a doctor might be interested to know that in the last year, at Kaiser Permanente Mid-Atlantic, these symptoms have turned out to be much more often myocardial infarction and perhaps less commonly a dissecting aneurysm.”
With digital medical records, doctors don’t need to type in symptoms and query their computer. Isabel can automatically extract the information from the records and generate its prediction. In fact, Isabel has recently teamed with NextGen to create software with a structured yet flexible set of input fields to capture essential data. Instead of traditional record keeping, in which doctors would non-systematically dictate what in retrospect seemed relevant, NextGen collects much more systematic data from the get-go. “I don’t like saying this loudly to my colleagues,” Britto confides, “but in a sense you engineer this physician out of the role of having to enter these data. If you have structured fields, you then force a physician to go through them and therefore the data that you get are much richer than had you left him on his own to write case notes, where we tend to be very brief.”
Super Crunching these massive new databases will give doctors for the first time the chance to engage in real-time epidemiology. “Imagine,” Britto said, “Isabel might tell you that an hour ago on the fourth floor of your hospital a patient was admitted who had similar features of infection and blisters.” Some patterns are much easier to see in aggregate than from casual observation by individual participants.
Instead of relying solely on expert-filtered data, diagnosis should also be based on the experience of the millions of people who use the health care system. Indeed, database analysis might ultimately lead to better decision making about how to investigate a diagnosis. For people with your symptoms, what tests produced useful information? What questions were most helpful? We might even learn the best order in which to ask questions.
When Britto started learning how to fly an airplane back in 1999, he was struck by how much easier it was for pilots to accept flight support software. “I asked my flight instructor what he thought accounted for the difference,” Britto said. “He told me, ‘It is very simple, Joseph. Unlike pilots, doctors don’t go down with their planes.’”
This is a great line. However, I think physician resistance to evidence-based medicine has much more to do with the fact that no one likes to change the basic way that they have been operating. Ignaz Semmelweis found that out a long time ago when he had the gall to suggest that doctors should wash their hands repeatedly throughout the day. The same reaction is at play when the EBM crowd suggests that doctors should do patient-specific research about the most appropriate treatment. Many physicians have effectively ceded a large chunk of control of treatment choice to Super Crunchers. Lisa Sanders distinguishes diagnosis, which she calls “my end,” from the question of the appropriate therapy, which she says “is really in the hands of the experts.” When she says “the experts,” she means Super Crunchers, the Ph.D.s who are cranking out the statistical studies showing which treatment works best. Very soon, however, Isabel will start to invade the physicians’ end of the process. We will see the fight move to evidence-based diagnosis. Isabel Healthcare is careful to emphasize that it only provides diagnostic support. But the writing is on the wall. Structured electronic input software may soon force physicians to literally answer the computer’s questions.
The Super Crunching revolution is the rise of data-driven decision making. It’s about letting your choices be guided by the statistical predictions of regressions and randomized trials. That’s really what the EBM crowd wants. Most physicians (like just about every other decision maker we have and will encounter) still cling to the idea that diagnosis is an art where their expertise and intuition are paramount. But to a Super Cruncher, diagnosis is merely another species of prediction.
CHAPTER 5
Experts Versus Equations
The previous chapters have been awash with Super Crunching predictions. Marketing crunchers predict what products you will want to buy; randomized studies predict how you’ll respond to a prescription drug (or a website or a government policy); eHarmony predicts who you’ll want to marry.
So who’s more accurate, Super Crunchers or traditional experts? It turns out this is a question that researchers have been asking for decades. The intuitivists and clinicians almost universally argue that the variables underlying their own decision making can’t be quantified and reduced to a non-discretionary algorithm. Yet even if they’re right, it is possible to test independently whether decision rules based on statistical prediction outperform the judgments of traditional experts who rely on experience and intuition. In other words, Super Crunching can be used to adjudicate whether experts can in fact outpredict the equations generated by regressions or randomized experiments. We can step back and use Super Crunching to test its own strength.
This is just the thought that occurred to Ted Ruger, a law professor at the University of Pennsylvania, as he was sitting in a seminar back in 2001 listening to a technical Super Crunching article by two political scientists, Andrew Martin and Kevin Quinn. Martin and Quinn were presenting a paper claiming that, by using just a few variables concerning the politics of the case, they could predict how Supreme Court justices would vote.
Ted wasn’t buying it. Ted doesn’t look anything like your usual anemic academic. He has a strapping athletic build with a square chin and rugged good looks (think of a young Robert Redford with dark brown hair). As he sat in that seminar room, he didn’t like the way these political scientists were describing their results. “They actually used the nomenclature of prediction,” he told me. “I am sitting in the audience as somewhat of a skeptic.” He didn’t like the fact that all the paper had done was try to predict the past. “Like a lot of legal or political science research,” he said, “it was retrospective in nature.”
So after the seminar he went up to them with a suggestion. “In some sense, the genesis of this project was my talking to them afterwards and saying, well why don’t we run the test forward?” And as they talked, they decided to run a horse race, to create “a friendly interdisciplinary competition” to compare the accuracy of two different ways to predict the outcome of Supreme Court cases. In one corner stood the Super Crunching predictions of the political scientists and in the other stood the opinions of eighty-three legal experts. Their assignment was to predict in advance the votes of the individual justices for every case that was argued in the Supreme Court’s 2002 term. The experts were true legal luminaries, a mixture of law professors, practitioners, and pundits (collectively thirty-eight had clerked for a Supreme Court justice, thirty-three held chaired professorships, and five were current or former law school deans). While the Super Crunching algorithm made predictions for all the justices’ votes in all the cases, the experts were called upon just to predict the votes for cases in their area of expertise.
Ted didn’t think it was really a fair fight. The political scientists’ model took into account only six factors: (1) the circuit court of origin; (2) the issue area of the case; (3) the type of petitioner (e.g., the United States, an employer, etc.); (4) the type of respondent; (5) the ideological direction (liberal or conservative) of the lower court ruling; and (6) whether the petitioner argued that a law or practice is unconstitutional. “My initial sense,” he said, “was that their model was too reductionist to capture the nuances of the decision making and thus legal experts could do better.” After all, detailed knowledge of the law and past precedent should count for something.
This simple test implicates some of the most basic questions of what law is. Justice Oliver Wendell Holmes created the idea of legal positivism by announcing, “The life of the law has not been logic; it has been experience.” For Holmes, the law was nothing more than “a prediction of what judges in fact will do.” Holmes rejected the view of Harvard’s dean (and the champion of the Socratic method for legal education) Christopher Columbus Langdell that “law is a science, and that all the available materials of that science are contained in printed books.” Holmes felt that accurate prediction had a “good deal more to do” with “the felt necessities of the time, the prevalent moral and political theories, intuitions of public policy, avowed or unconscious, even the prejudices which judges share with their fellow-men.”
The dominant statistical model of political science is Holmesian in that it places almost exclusive emphasis on the judge’s prejudices, his or her personal ideological views. Political scientists often assumed these political ideologies to be fixed and neatly arrayed along a single numeric spectrum from liberal to conservative. The decision trees produced by this kind of Super Crunching algorithm are anything but nuanced. Using historical data on 628 cases previously decided by these nine justices, Martin and Quinn first looked to see when the six factors predicted that the decision would be a unanimous affirmance or reversal. Then, they used the same historic cases to find the flowchart (a conditional combination of factors) that best predicted the votes of the individual justices in non-unanimous cases. For example, consider the following flowchart that was used to forecast Justice Sandra Day O’Connor’s votes in the actual study:
SOURCE: Andrew D. Martin et al., “Competing Approaches to Predicting Supreme Court Decision Making,” 2 Perspectives on Politics 763 (2004).
This predictive flowchart is incredibly crude. The first decision point predicts that O’Connor would vote to reverse whenever the lower court decision was coded as being “liberal.” Hence, in Grutter v. Bollinger, the 2002-term case challenging the constitutionality of Michigan Law School’s affirmative action policy, the model erroneously forecasted that O’Connor would vote to reverse simply because the lower court’s decision was liberal (in upholding the law school’s affirmative action policy). With regard to “conservative” lower court decisions, the flowchart is slightly more complicated, conditioning the prediction on the circuit court origin, the type of respondent, and the subject area of the case. Still, this statistical prediction completely ignores the specific issues in the case and the past precedent of the Court. Surely legal experts with a wealth of knowledge about the specific issue could do better.
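A flowchart like O’Connor’s is nothing more than a few lines of conditional logic. The sketch below is illustrative only: the first split (reverse whenever the lower court ruled “liberal”) comes straight from the text, while the conditions on the conservative branch are hypothetical placeholders, since the chapter describes only which variables that branch used (circuit of origin, respondent type, issue area), not the actual cutoffs.

```python
def predict_oconnor(case):
    """Illustrative sketch of the Martin-Quinn-style flowchart for
    Justice O'Connor. Only the first split is taken from the text;
    the conservative-branch rules are hypothetical stand-ins."""
    if case["lower_court_direction"] == "liberal":
        # The model predicted a reversal of every "liberal"
        # lower-court decision -- hence its miss in Grutter.
        return "reverse"
    # Conservative lower-court rulings: condition on circuit of
    # origin, respondent type, and issue area (details hypothetical).
    if case["circuit"] in {"2nd", "3rd", "DC"}:
        return "affirm"
    if (case["respondent"] == "United States"
            and case["issue_area"] == "criminal procedure"):
        return "affirm"
    return "reverse"

# A Grutter-like case: the lower court's ruling was coded "liberal".
print(predict_oconnor({"lower_court_direction": "liberal"}))  # reverse
```

Crude as it is, this is the whole model for one justice: a handful of branches on coarsely coded case features, with no reference to precedent or the legal arguments.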
Notice in the statistical model that humans are still necessary to code the case. A kind of expertise is essential to say whether the lower court decision was “liberal” or “conservative.” The study shows how statistical prediction can be made compatible with and dependent upon subjective judgment. There is nothing that stops statistical decision rules from depending on subjective opinions of experts or clinicians. A rule can ask whether a nurse believes a patient looks “hinky.” Still, this is a very different kind of expertise. Instead of calling on the expert to make an ultimate prediction, the expert is asked to opine on the existence or absence of a particular feature. The human expert might have some say in the matter, but the Super Crunching equation limits and channels this discretion.
Ted’s simple idea of “running the test forward” set the stage for a dramatic test that many insiders watched with interest as it played out over the course of the Court’s term. Both the computer’s and the experts’ predictions were posted publicly on a website before each decision was announced, so people could watch the results come in as opinion after opinion was handed down.
The experts lost. For every argued case during the 2002 term, the model predicted 75 percent of the Court’s affirm/reverse results correctly, while the legal experts collectively got only 59.1 percent right. Super Crunching was particularly effective at predicting the crucial swing votes of Justices O’Connor and Kennedy. The model predicted Justice O’Connor’s vote correctly 70 percent of the time while the experts’ success rate was only 61 percent.
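Scoring such a horse race is simple arithmetic: each forecaster’s hit rate is the fraction of predicted affirm/reverse dispositions that matched the Court’s actual rulings. A minimal sketch, using made-up toy outcomes rather than the study’s actual 2002-term data:

```python
def hit_rate(predictions, outcomes):
    """Fraction of predictions that match the actual dispositions."""
    hits = sum(p == o for p, o in zip(predictions, outcomes))
    return hits / len(outcomes)

# Toy data -- NOT the 2002-term results, just an illustration.
actual  = ["affirm", "reverse", "reverse", "affirm", "reverse"]
model   = ["affirm", "reverse", "reverse", "reverse", "reverse"]
experts = ["reverse", "reverse", "affirm", "affirm", "reverse"]

print(hit_rate(model, actual))    # 0.8
print(hit_rate(experts, actual))  # 0.6
```

In the real study this same tally came out 75 percent for the model against 59.1 percent for the experts.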
How can it be that an incredibly stripped-down statistical model outpredicted not just lawyers, but experts in the field who had access to detailed information about the cases? Is this result just some statistical anomaly? Does it have something to do with idiosyncrasies or the arrogance of the legal profession? These are the central questions of this chapter. The short answer is that Ted’s test is representative of a much wider phenomenon. For decades, social scientists have been comparing the predictive accuracies of Super Crunchers and traditional experts. In study after study, there is a strong tendency for the Super Crunchers to come out on top.
Meehl’s “Disturbing Little Book”
Way back in 1954, Paul Meehl wrote a book called Clinical Versus Statistical Prediction. This slim volume created a storm of controversy among psychologists because it reported the results of about twenty other empirical studies that compared how well “clinical” experts could predict relative to simple statistical models. The studies concerned a diverse set of predictions, such as how patients with schizophrenia would respond to electroshock therapy or how prisoners would respond to parole. Meehl’s startling finding was that none of the studies suggested that experts could outpredict statistical equations.
Paul Meehl was the perfect character to start this debate. He was a towering figure in psychology who eventually became president of the American Psychological Association. He’s famous for helping to develop the MMPI (the Minnesota Multiphasic Personality Inventory), which to this day is one of the most frequently used personality tests in mental health. What really qualified Meehl to lead the man-versus-machine debate was that he cared passionately about both sides. Meehl was an experimental psychologist who thought there was value to clinical thinking. He was driven to write his book by the personal conflict between his subjective certainty that clinical experience conferred expertise, and “the disappointing findings on the reliability and validity of diagnostic judgments and prognostications by purported experts.”
Because of his book’s findings, some people inferred that he was an inveterate number cruncher. In his autobiography, Meehl tells about a party after a seminar where a group of experimental psychologists privately toasted him for giving “the clinicians a good beating.” Yet they were shocked to learn that he valued psychoanalysis and even had a painting of Freud in his office. Meehl believed the studies that showed that statisticians could make better predictions about many issues, but he also pointed to the interpretation of dreams in psychoanalysis as a “striking example of an inferential process difficult to actuarialize and objectify.” Meehl writes: