The Half-Life of Facts
There is also Charles Babbage, who designed the first mechanism for a programmable computer but who had the misfortune of living during the Industrial Revolution, when he was unable to construct his invention, mainly because it was far too expensive at the time. His Analytical Engine actually had parts that corresponded closely to the memory and processors found in modern computers.
So facts can remain hidden for a long time, whether because they are too far ahead of their time or because they come from a different discipline. But is there a way to measure this? Specifically, how often is knowledge skipped over?
A recent study in the Annals of Internal Medicine examined this phenomenon in a quantitative and rigorous fashion. Karen Robinson and Steven Goodman, of Johns Hopkins University, wanted to see how often scientists were aware of previous research11 before they conducted a clinical trial. If science properly grows by accreting information, each new study should take into account everything that has come before it. But do scientists actually do that? Based on everything I’ve mentioned so far, the answer is likely no. But then, what fraction of the time do we ignore (or simply remain unaware of) what has come before us?
Robinson and Goodman set out to see how often scientists who perform a clinical trial in a specific field cite the relevant literature when publishing their results. For example, if a clinical trial related to heart attack treatment is performed, Robinson and Goodman wanted to see how many such trials cite the earlier trials in that area. While a clinical trial needn’t cite every paper that preceded it, it should provide an overview of the relevant literature. But how to decide which papers are relevant and which ones aren’t? Rather than be accused of subjectivity, or have to gain expertise in countless specific areas, Robinson and Goodman sidestepped these problems by doing something clever: They looked at meta-analyses.
Meta-analysis is a well-known technique that can be used to extract more meaning from specific papers than could be gained from looking at each one alone. A meta-analysis combines the results of papers in a specific area in order to see if there is a consensus or if more precise results can be found. They are like the analyses of thermal conductivity for different elements mentioned in chapter 3, which use the results from lots of different articles to get a better picture of the shape of what we know about how these elements conduct heat.
Assuming the meta-analyses bring together all the relevant trials, Robinson and Goodman simply looked through all the studies examined in each meta-analysis to see how many of the studies cited in the meta-analysis were also mentioned in each of the newer studies being examined.
What they found shouldn’t be surprising. Scientists cite fewer than 25 percent of the relevant trials when writing about their own research. The more papers in the field, the smaller the fraction of previous papers that were cited in a new study. Astonishingly, no matter how many trials had been done before in that area, half the time only two or fewer studies were cited.
Not only is a small fraction of the relevant studies being cited, but there is also a systematic bias: The newer ones are far more likely to be mentioned. This shouldn’t be surprising after our discussion of citation decay and obsolescence in chapter 3. And it is hardly surprising that scientists might use the literature quite selectively, perhaps to bolster their own research. But when it comes to papers that are current, relevant, and necessary for a complete picture of the current state of a scientific question, this is unfortunate.
Imagine if we actually combined all the knowledge in a single field, and if scientists actually read all the analyses that their work was based on. What would happen to facts then? Would it make any difference?
Quite a bit, it turns out.
• • •
IN 1992, a team of scientists from the hospitals and schools12 associated with Harvard University performed a new type of analysis. These researchers, Joseph Lau and his colleagues, examined all the previously published randomized clinical trials related to heart attacks. Specifically, they looked at all trials that were related to the use of a drug called streptokinase to treat these heart attacks. Combing through the literature, they found that there were thirty-three trials between the years 1959 and 1988 that used this treatment and examined its effectiveness.
Why did they stop at 1988 instead of going all the way up to 1992? Because 1988 was the year that a very large study was published, finally showing definitively that intravenous streptokinase helped to treat heart attacks. But Lau and his colleagues did something clever.
Lau lined up the trials chronologically and examined each of their findings, one after the other. The team discovered something intriguing. Imagine you have just completed a clinical trial with your drug treatment of choice. But instead of just analyzing the results of your own trial, you combine your data with that of all the studies completed up until then, making the dataset larger and richer. If you did that, Lau and his colleagues discovered, you would have known that intravenous streptokinase was an effective treatment years before this finding was actually published. According to their research, scientists could have found a statistically significant result in 1973, rather than in 1988, and after only eight trials, if they had combined the disparate facts.
This type of analysis is known as cumulative meta-analysis. What Lau and his colleagues realized was that meta-analyses can be viewed as a ratchet rather than simply an aggregation process, with each study moving scientific knowledge a little closer to the truth. This is ultimately what science should be: an accumulation of bits of knowledge, moving ever forward, or at least sweeping away error as best we can. Lau and his colleagues simply recognized that to be serious about this idea of cumulative knowledge, you have to truly combine all that we know and see what new facts we can learn.
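To make the mechanics concrete, here is a minimal sketch, in Python, of how such cumulative pooling might work. The trial numbers are invented for illustration, not the actual streptokinase data, and the pooling uses a standard fixed-effect combination of log odds ratios rather than whatever exact method Lau’s team employed.

```python
import math

# Hypothetical trials: (deaths_treated, n_treated, deaths_control, n_control).
# These numbers are made up for illustration only.
trials = [
    (20, 150, 30, 150),
    (15, 100, 25, 100),
    (40, 300, 55, 300),
]

weighted_sum = 0.0   # running inverse-variance-weighted sum of log odds ratios
total_weight = 0.0

for i, (a, n1, c, n2) in enumerate(trials, start=1):
    b, d = n1 - a, n2 - c                    # survivors in each arm
    log_or = math.log((a * d) / (b * c))     # log odds ratio for this trial
    var = 1/a + 1/b + 1/c + 1/d              # its approximate variance
    weighted_sum += log_or / var             # fixed-effect pooling
    total_weight += 1 / var
    pooled = weighted_sum / total_weight
    z = pooled / math.sqrt(1 / total_weight)
    # An odds ratio below 1 favors the treatment; |z| > 1.96 is significant
    # at the conventional 5 percent level.
    print(f"after trial {i}: pooled OR = {math.exp(pooled):.2f}, z = {z:.2f}")
```

Run against the real trial record, a loop of this kind is what shows the evidence quietly crossing the significance threshold long before 1988.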
While Don Swanson combined papers from scientific areas that should have overlapped but didn’t, Lau and his colleagues combined papers from very similar areas that had never been combined, looking at them more carefully than they had been examined up until then. By using cumulative meta-analysis, hidden knowledge could have been revealed fifteen years earlier than it actually was and helped improve the health of countless individuals.
Modern technology is beginning to aid the development of cumulative meta-analysis, and we can now even use computational techniques to employ Swanson’s methods on a grand scale.
• • •
WE are not yet at the stage where we can loose computers upon the stores of human knowledge and have them return a week later with discoveries that would supplant those of Einstein or Newton in our scientific pantheon. But computational methods are helpful. Working in concert with people—we are still needed to sort the wheat from the chaff—these programs can connect scientific areas that ought to be speaking to one another yet haven’t. These automatic techniques help to stitch together different fields until the interconnectivity between them becomes clear.
In the fall of 2010, a team of scientists in the Netherlands published the first results of a project called CoPub Discovery. Their previous work had involved the creation of a massive database13 based on the co-occurrence of words in articles. If two papers both contain the terms p53 and oncogenesis, for example, they would be linked more strongly than two papers with no key terms in common. CoPub Discovery involved14 creating a new program that mines their database for unknown relationships between genes and diseases.
Essentially, CoPub Discovery automates the method that Don Swanson used to detect the relationship between fish oil and Raynaud’s syndrome but on a much larger scale. It can detect relationships between thousands of genes and thousands of diseases, gene pathways, and even the effectiveness of different drugs. Doing this automatically allows many possible discoveries to be detected. In addition, CoPub Discovery also has a careful system of checks designed to sift out false positives—instances where the program might say there is an association when there really isn’t.
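As a rough sketch of how co-occurrence mining can surface a link that no single paper states outright, consider the following toy example in Python. The “abstracts,” the gene names, and the apoptosis connection are all invented for illustration; CoPub Discovery’s real scoring, and its checks against false positives, are far more sophisticated.

```python
from collections import defaultdict
from itertools import combinations

# Toy corpus: each "abstract" is reduced to a set of key terms (all invented).
abstracts = [
    {"gene_A", "apoptosis"},
    {"apoptosis", "graves_disease"},
    {"gene_A", "apoptosis", "cell_cycle"},
    {"gene_B", "graves_disease"},
]

# Count how often each pair of terms appears in the same abstract.
cooccur = defaultdict(int)
for terms in abstracts:
    for t1, t2 in combinations(sorted(terms), 2):
        cooccur[(t1, t2)] += 1

def linked(t1, t2):
    return cooccur[tuple(sorted((t1, t2)))] > 0

# Swanson-style inference: gene_A and graves_disease never appear together,
# but both co-occur with "apoptosis," hinting at a hidden association.
all_terms = {t for terms in abstracts for t in terms}
bridges = [t for t in all_terms
           if linked("gene_A", t) and linked(t, "graves_disease")]
print(bridges)  # -> ['apoptosis']
```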
And it works! The program was able to find a number of exciting new associations between genes and the diseases that they may cause, ones that had never before been written about in the literature.
For example, there is a condition known as Graves’ disease, which normally causes hyperthyroidism, meaning the thyroid produces too much hormone. Symptoms include heat intolerance and eyes that stick out more prominently, yielding a somewhat bug-eyed appearance for sufferers. CoPub Discovery, when automatically plowing through the large database, found a number of genes never before implicated in Graves’ that might be involved in causing the disease. Specifically, it found a large cluster of genes related to something known as programmed cell death.
Programmed cell death is not nearly as scary as it sounds. Our bodies often require the death of individual cells in order to function correctly, and there is a set of genes in our cells tailored for this purpose. For example, during embryonic development, our hands initially have webbing between the fingers. But prior to birth the cells in the webbing are given the signal to die, which is why we are not born with webbed hands. Webbed hands and feet occur only when the signal is given incorrectly, or when these genes don’t work properly.
What CoPub Discovery computationally hypothesized is that when these programmed cell death genes don’t work properly in other ways, a cascade of effects might follow, eventually leading to the condition known as Graves’ disease. CoPub Discovery has also found relationships between drugs and diseases and determined other previously unknown effects of currently used drugs. For example, while a medicine might be used to help treat a specific condition, not all of its effects might be known. Using the CoPub Discovery engine and the concept of undiscovered public knowledge, it becomes possible to actually see what the other effects of such a drug might be.
The researchers behind CoPub Discovery did something even more impressive. Rather than simply put forth a tool and a number of computationally generated hypotheses—although this is impressive by itself—they actually tested some of the discoveries in the laboratory. They wanted to see if these pieces of newly revealed knowledge were actually true. Specifically, CoPub Discovery predicted15 that two drugs, dephostatin and damnacanthal, could be used to slow the reproduction and proliferation of a group of cells. They found that the drugs actually worked—the larger the dose of these drugs, the more the cells’ growth was inhibited. This concept is known as drug repurposing, where hidden knowledge is used to determine that medicines are useful in treating conditions or diseases entirely different from their original purposes. One of the most celebrated examples of drug repurposing is Viagra, which was originally designed to treat hypertension. While Viagra proved effective for that condition, many of the participants in the clinical trials reported a certain intriguing side effect, also making the Viagra trials one of the only cases where the pills left over at the end of the study were not returned by the participants.
There are many other examples of computational discovery that combine multiple pieces of knowledge to reach novel conclusions. From software designed to find undiscovered patterns16 in the patent literature to the numerous computerized systems devoted to drug repurposing,17 this approach is growing rapidly. In fact, within mathematics, there is even a whole field of automated theorem proving. Armed with nothing but various axioms and theorems well-known to the mathematics community, as well as a set of rules for how to logically infer one thing from another, a computer simply goes about combining axioms and other theorems in order to prove new ones.
Given enough computational power, these systems can yield quite novel results. Of course, most of the output is rather simple and pedestrian, but they can generate new and interesting18 provably true mathematical statements as well. One of the earliest examples of these was the Automated Mathematician, created by Doug Lenat in the 1970s. This program constructed regularities and equalities, with Lenat even claiming that the Automated Mathematician rediscovered (though, sadly, did not solve) a fundamental unsolved problem in number theory known as Goldbach’s Conjecture. Goldbach’s Conjecture is the elegant hypothesis that every even number greater than two can be expressed as the sum of two prime numbers. For example, 8 is 5 + 3 and 18 is 7 + 11. This type of program has provided a foundation for other automated proof systems, such as TheoryMine, briefly mentioned in chapter 2, which names a novel, computationally created19 and proved theorem after oneself or a friend, for a small price.
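For the curious, checking instances of the conjecture is easy enough to make a nice illustration. A short Python sketch like the following finds one prime decomposition for a given even number; it verifies examples, of course, rather than proving anything:

```python
def is_prime(n):
    """Trial-division primality check, fine for small numbers."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def goldbach_pair(n):
    """Return one way to write an even n > 2 as a sum of two primes."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return p, n - p

print(goldbach_pair(8))   # (3, 5)
print(goldbach_pair(18))  # (5, 13); 7 + 11 works as well
```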
TheoryMine was created by a group of researchers in the School of Informatics at the University of Edinburgh. While some people might be excited to simply have something named after themselves and ignore the details, TheoryMine will give you not only the theorem but also a capsule summary of how the theorem was proven. The theorems are all related to the properties of functions and for most people are rather opaque. Nonetheless, it’s great that a mechanism to discover a piece of hidden knowledge is available for a consumer audience.
In addition to automated discoveries, there are now even automated scientists, software capable of detecting regularities in data and making more abstract discoveries. A computer program known as Eureqa was developed by Mike Schmidt, a graduate student at Cornell University (and current president of Nutonian, Inc.), and Hod Lipson, a professor at Cornell. It does something a bit different from the other projects mentioned already: Given lots of data, it attempts to find meaning in an otherwise meaningless jumble of facts.
Eureqa takes in a vast quantity of data points. Let’s say you’re studying a bridge and trying to understand why it wobbles. Or an ecosystem, and how the relative amounts of predators and prey change over time. You dump all the data you’ve collected into Eureqa—how many predators there are on each day, as well as the quantity of prey, for example—and it attempts to find meaning.
Eureqa does this by using a technique known as evolutionary programming, which is conceptually simple but, given modern computing power, very powerful. Eureqa randomly generates a large variety of equations that could conceivably explain the relationships in the data. For example, it will create an equation that attempts to mathematically combine the inputs of your system and show how they can yield the outputs. Of course, if it starts with a random equation, the odds are very good that the equation will offer absolutely no insight into the underlying phenomena it is trying to explain. Instead of explaining the data, it will spit out gibberish.
But what is randomly generated doesn’t have to be satisfactory. Instead, a population of random equations can be evolved. Just as biological evolution can result in a solution—an organism that is well adapted to its environment—the same thing can be done with digital organisms. In this case, the equations are allowed to reproduce, mutate, swap bits of their formulas, and more. And this is all in the service of explaining the data set. The better the equations adhere to the data, the more they are allowed to reproduce.
Doing this over and over results in a population of good and fit equations, formulas that are far cries from the initial, randomly generated ones. Eureqa can even yield equations that capture findings as complex as the conservation of energy, one of the foundations of thermodynamics.
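Here is a deliberately tiny sketch, in Python, of the evolutionary idea: a population of random formulas is scored against data generated from a hidden law (y = 3x + 2 in this toy case), and the best-fitting formulas are kept and mutated. Eureqa’s actual machinery is far richer, but the loop below captures the spirit.

```python
import random

# Toy data from a "hidden" law the program does not know: y = 3*x + 2
data = [(x, 3 * x + 2) for x in range(-5, 6)]

OPS = ["+", "-", "*"]

def random_expr(depth=2):
    """Build a random expression tree over x and small integer constants."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(["x", random.randint(-5, 5)])
    return (random.choice(OPS), random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    if expr == "x":
        return x
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    a, b = evaluate(left, x), evaluate(right, x)
    return a + b if op == "+" else a - b if op == "-" else a * b

def error(expr):
    """Sum of squared errors against the data: lower is fitter."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data)

def mutate(expr):
    """Replace a random subtree with a freshly generated one."""
    if random.random() < 0.3 or not isinstance(expr, tuple):
        return random_expr()
    op, left, right = expr
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

population = [random_expr() for _ in range(200)]
for generation in range(50):
    population.sort(key=error)
    survivors = population[:50]      # the best-fitting equations reproduce
    population = survivors + [mutate(random.choice(survivors)) for _ in range(150)]

best = min(population, key=error)
print(best, "error:", error(best))
```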
In the case of these automated-discovery programs, the more knowledge we have available, the more raw materials we have for these programs. The more data, the more new facts these programs can in turn reveal.
So it’s important for us to understand how knowledge is maintained if we want to make sure we can have the maximum amount of data for these automated-discovery programs. Specifically, is most knowledge actually preserved? Or are the raw materials for hidden knowledge that we have available only a remnant of what we might truly know?
• • •
THE Middle Ages, far from being the Dark Ages, as some of us might have been taught, was a time of science and innovation. Europeans developed medical techniques and made advances in such areas as wind energy and gunpowder.
But it was also a time for preservation. That many of the texts written in ancient times, or even in the early Middle Ages, would make it to the modern era was by no means a foregone conclusion. As mentioned in the previous chapter, prior to the printing press manuscripts had to be copied by hand in order for information to spread.
I have a book on my shelf entitled The Book of Lost Books: An Incomplete History of All the Great Books You’ll Never Read. It’s a discussion by Stuart Kelly of books that have been lost to time, whose names we know or from which excerpts have been passed down, but whose full texts are unknown. There is a long history of such references, even going back to The Book of the Wars of the Lord, a lost book that the Bible itself references when quoting a short description of the location of ancient tribal boundaries in Numbers 21:14–15.
The Book of Lost Books is organized by author, and the names of those whose books we don’t have are astonishing: Alexander Pope, Gottfried Leibniz, William Shakespeare, Charles Dickens, Franz Kafka, Edward Gibbon, and many more. And, of course, there are many ancient and medieval writers. From Ovid and Menander to Ahmad ad-Daqiqi and Widsith the Wide-Traveled, we know of many writers whose works have not been preserved. And then there is the Venerable Bede.
The Venerable Bede was a Christian scholar and monk from England in the late seventh and early eighth centuries. In addition to being a very holy man in the eyes of the Catholic Church—he was made a doctor of the Church in 1899, a title indicating a person’s importance to theological and doctrinal thought, was given his title of Venerable only a few decades after his death, and was later canonized—he was also a man of history and science. In fact, he has become known as the father of English history due to his History of the English Church and People. He also wrote books about mathematics, such as De temporum ratione, which discusses how to quickly perform mathematical calculations by hand.