by Pääbo, Svante
The Sanger sequencing method relies on the sequential incorporation of the four nucleotides by a DNA polymerase, an enzyme that makes new DNA strands using old ones as templates. In such a sequencing reaction, the DNA polymerase starts its synthesis of DNA strands from a primer at a defined point in the DNA. A small fraction of each of the four nucleotides are labeled with different fluorescent dyes and chemically modified so that when the DNA polymerase builds them in, the synthesis stops. This process creates DNA strands of different lengths, each with the dye at its end indicating which nucleotide sits there. The fragments thus terminated and labeled can be separated by electrophoresis in a gel according to their size. This reveals which dye and thus nucleotide is present—for example, ten positions away from where the synthesis started, eleven positions away, twelve positions away, and so on. The best machines employed to sequence DNA—for example, by the Human Genome Project—can sequence almost a hundred pieces of DNA at a time, for stretches as long as 800 nucleotides. What Pål developed in Mathias’s lab was a method called pyrosequencing. Though still in its infancy, this method was potentially much faster and simpler than Sanger sequencing.
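The Sanger read-out described above can be sketched in a few lines. This is only an illustration: the dye colours and the toy fragments are made up for the example, and the sketch assumes exactly one terminated fragment per length.

```python
# Hypothetical dye-to-nucleotide assignment; real instruments use their own dye sets.
DYE_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}


def read_sanger(fragments):
    """Read a sequence from terminated fragments.

    Each fragment is a (length, dye) pair: the length says how far from the
    primer the synthesis stopped, the dye says which nucleotide sits at the end.
    Sorting by length mimics the separation by size in the gel.
    """
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(DYE_TO_BASE[dye] for _, dye in ordered)


# Toy fragments ending ten, eleven, and twelve positions from the primer.
toy_fragments = [(12, "red"), (10, "green"), (11, "yellow")]
print(read_sanger(toy_fragments))
```

The shortest fragment is read first, so the order of dyes along the gel gives the order of nucleotides along the strand.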
Pyrosequencing also uses a DNA polymerase to build DNA sequences, but it detects each nucleotide incorporated into the DNA not by cumbersome separation of fragments according to size but by a flash of light emitted after each nucleotide is built into the DNA chain. The trick Pål had devised was to add just one of the four nucleotides at a time to the reaction mix. For example, if an A (adenine) is added, and the strand that is being used as a template at that point carries a T (thymine, which pairs with adenine), the DNA polymerase builds the A into the growing strand, and an enzymatic system in the reaction causes a light signal to be generated. This flash of light is detected by a powerful camera and registered by a computer. If the template strand carries not a T but another base, no flash of light is generated. Pål added each of the four nucleotides consecutively, cycle after cycle. By noting the light flashes, he could read the order of nucleotides in a DNA fragment. It was a brilliant method, relying only on pumping the nucleotides and other reagents into a reaction chamber and taking pictures with a camera. Even more importantly, it could easily be automated. When Mathias told me about it, I became as enthusiastic as he was.
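The cycle of adding one nucleotide at a time and watching for a flash can be simulated as a rough sketch. The function name, the fixed A, C, G, T dispensing order, and the toy templates are assumptions made for illustration. The sketch also assumes that a run of identical bases is built in during a single addition and gives a proportionally stronger signal, a detail the text above does not go into.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pyrosequence(template, dispense_order="ACGT", max_cycles=100):
    """Simulate pyrosequencing of a template strand.

    Returns (flashes, read): flashes is a list of (nucleotide, signal) pairs,
    one per addition, where signal 0 means no flash; read is the newly
    synthesized strand reconstructed from the flashes.
    """
    position = 0
    flashes = []
    read = []
    for _ in range(max_cycles):
        for nucleotide in dispense_order:
            signal = 0
            # The polymerase builds the nucleotide in for as long as it pairs
            # with the next template base (assumed homopolymer behaviour).
            while (position < len(template)
                   and COMPLEMENT[template[position]] == nucleotide):
                signal += 1
                position += 1
            flashes.append((nucleotide, signal))
            read.append(nucleotide * signal)
            if position == len(template):
                return flashes, "".join(read)
    return flashes, "".join(read)


flashes, read = pyrosequence("TTGCA")  # toy template, made up for the example
print(flashes)
print(read)
```

Because only one kind of nucleotide is present at any moment, a flash identifies the base without any separation of fragments by size.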
A little later, Mathias asked me to serve on the scientific advisory board of a company called Pyrosequencing, which he and Pål had founded to produce a commercial instrument for performing this technique. I gladly agreed, since doing so would give me an opportunity to keep up with the development of an exciting technology that I thought might transform how we studied ancient DNA. I joined the advisory board in 2000, a year after the company had produced its first commercial instrument, which could simultaneously sequence ninety-six different DNA fragments, each isolated in a well on a plastic plate. However, from each fragment it could read only about thirty consecutive nucleotides. This was deeply unimpressive compared with contemporary machines relying on the Sanger principle, but pyrosequencing was a young technology and had not reached the limits of its possibilities. In fact, although I did not fully appreciate this at the time, it represented the beginning of a revolution, known as “second-generation sequencing,” that would fundamentally change not just our investigations of ancient DNA but many aspects of biology.
I very much wanted to try out pyrosequencing, so I asked Henrik Kaessmann to spend some time in Mathias’s lab at the Royal Institute of Technology in Stockholm. Henrik welcomed the opportunity to surprise people in Stockholm with his flawless Swedish; although he grew up in southern Germany, he speaks the language fluently thanks to his Swedish mother. He was also able to generate data from present-day human populations in Europe and Asia that would help in showing how they were related to one another. As with all new techniques, this required learning new skills and some troubleshooting, but it worked well.
In August 2003, the board of Pyrosequencing decided to license the technology to 454 Life Sciences, a US company founded by the biotech entrepreneur Jonathan Rothberg. 454 Life Sciences intended to enhance pyrosequencing with state-of-the-art fluidics. Its innovation relied on adding short synthetic pieces of DNA to the ends of DNA molecules. Single strands of DNA were then captured on beads and ingeniously amplified in little oil bubbles, allowing hundreds of thousands of different strands to be amplified separately but simultaneously in one big reaction. Then the beads were separated from one another on a plate with hundreds of thousands of wells for the pyrosequencing step. Finally (and crucially), to keep tabs on which wells were emitting flashes of light from cycle to cycle, the company used image-tracking methods borrowed from astronomers, who track millions of stars in the night sky. This allowed it to simultaneously sequence not ninety-six but two hundred thousand DNA fragments at a time!
Given that kind of power, I thought perhaps we could simply sequence the random DNA fragments we had in an extract from an ancient bone and see everything that was in there. This brute-force approach would be completely different from the PCR-based method, where we tried to fish out each piece of sequence we wanted to study. The PCR method was not only tedious but also (since we had to decide in advance what to look for) effectively blinded us to all other sequences in the extract. Although the 454 Life Sciences instruments could not sequence DNA fragments longer than 100 nucleotides, the nuclear DNA fragments we had seen in Alex’s work on the mammoths and Hendrik’s work on the ground sloth had never been longer than 100 nucleotides anyway. I longed to try out a 454 machine.
Mathias and the people working on pyrosequencing were not the only ones I talked to about new approaches. Another was Edward M. Rubin, a dynamic and ebullient genomicist who visited our lab in Leipzig in July 2005. I was eager for his advice. A professor at Lawrence Berkeley National Laboratory in Berkeley, California, and director of the US Department of Energy’s Joint Genome Institute, Eddy was certain that the way forward was to clone the DNA in bacteria, by much the same methods as I had used back in the 1980s when I was working with mummies in Uppsala. These methods, he told me, were now much more efficient than they had been back then. I agreed to try this out on cave bears, and we made extracts from two fossilized cave-bear bones we knew contained lots of bear mtDNA and sent them to Eddy’s lab in Berkeley. There the DNA molecules in the extracts were fused to carrier molecules, just as I had done back in 1984, and introduced into bacteria. When the bacteria grew, they constituted what is called a “library,” in which each bacterial colony, or “clone,” contained millions of copies of one unique DNA molecule from the cave-bear bone extract. The DNA from each such colony in the library could be isolated and sequenced, and thus “read,” just like a book in a library. Eddy’s people used traditional Sanger chemistry to sequence about 14,000 random DNA clones from two such libraries—orders of magnitude more than had been possible back in 1984. From the 14,000 clones, a total of 389, only 2.7 percent, carried DNA sequences that were similar to those found in dog DNA and, therefore, were likely to have come from the cave bear. The rest were from bacteria and fungi that had colonized the bones after the animal’s death. Although the proportion of endogenous DNA in these bone extracts was pathetically small, the result was nonetheless exciting because it showed that bones from European caves did indeed contain some nuclear DNA.
We published this result, with Eddy and his Berkeley group as the main authors, in Science in 2005.{44} In that paper we somewhat grandiosely claimed that this meant that genome sequencing from ancient remains was possible. But after the paper was published, some people in my own group considered more deeply what had been done, did some calculations, and came to a sobering conclusion. The Berkeley group had sequenced every bit of the DNA in the libraries made from the extracts we had sent, and found a total of 26,861 nucleotides from the cave-bear genome. Given that we had used a few tenths of a gram of bone for making these libraries and that the genome is composed of some 3 billion nucleotides, we would have to use more than a hundred thousand times more bone than we had already (more than ten kilos, or twenty-five pounds of bone, in other words) to arrive at even a rough overview of the cave-bear genome. Grinding up that much bone and transforming it into extracts for sequencing libraries would be feasible, if tedious in the extreme, but the massive amount of sequencing then required would be very expensive. And even if it worked, barring unforeseen technical breakthroughs it would never be possible for us to apply this brute-force approach to the truly interesting fossils for which we had only minuscule samples. Sequencing a Neanderthal genome by cloning in bacteria did not seem the way to go, at least to me. Indeed, it seemed impossible. I imagined that most of the DNA must be lost when the bacterial libraries were constructed, probably because it never entered the bacteria in the first place or because it was chewed up by enzymes inside the bacteria. Eddy, however, continued to be enthusiastic about it and suggested that the low efficiency with which the DNA sequences were produced from the DNA extracts was unusual. He argued that future tries would certainly work better and require less starting material.
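The bone estimate above is simple proportional arithmetic, and a short sketch makes it easy to check. The genome size and the 26,861 recovered nucleotides come from the text. The 0.1 gram starting amount is an assumption, taken as the low end of "a few tenths" of a gram.

```python
GENOME_SIZE = 3_000_000_000   # nucleotides, "some 3 billion" in the text
RECOVERED = 26_861            # cave-bear nucleotides found in the libraries


def fold_increase(genome_size=GENOME_SIZE, recovered=RECOVERED):
    """How many times more sequence is needed to cover the genome once."""
    return genome_size / recovered


def bone_needed_kg(bone_used_grams):
    """Bone required, in kilograms, if yield scales linearly with bone used."""
    return fold_increase() * bone_used_grams / 1000


# Assumed starting amount: 0.1 g, the low end of "a few tenths" of a gram.
print(f"fold increase: {fold_increase():,.0f}")
print(f"bone needed:   {bone_needed_kg(0.1):.1f} kg")
```

Even with the smallest plausible starting amount, the estimate comes out above ten kilos, and it only covers each position of the genome about once.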
Despite Eddy’s enthusiasm, and in addition because I was averse to relying upon only one approach, I was certain that we needed to try pyrosequencing. It seemed feasible to directly apply the 454 version of pyrosequencing to all the DNA in an extract, thus eliminating the losses caused by getting the DNA into temperamental bacteria. What’s more, Jonathan Rothberg and 454 had produced a machine that could sequence hundreds of thousands of DNA molecules in a day. But it was not easy to contact him, as he wisely eschewed easy contact to insulate himself from the crank scientists who might otherwise barrage him with demands for access to his new technology. I tried several avenues and got nowhere. Finally I talked to Gene Myers, the bioinformatics wizard who had helped the famous genomicist Craig Venter assemble the human genome in 2000. I had met Gene at a bioinformatics meeting in Brazil in 2001 and immediately liked his irreverent attitude toward any problem with which he was confronted. We had bonded over a shared interest in skiing and scuba diving. Gene was now a professor at UC Berkeley and an adviser to Rothberg’s company, so in July 2005 he was able to put me in e-mail contact with Jonathan.
Jonathan arranged a conference call with me and Michael Egholm, a Danish scientist who ran the operations at 454. When Jonathan got on the line, I began to worry. He was as energetic and intense as I had expected from an entrepreneur of his caliber, but he seemed interested in only one thing—sequencing dinosaur DNA! I was unsure how to handle this annoying predilection, since I was on record as saying that sequencing dinosaur DNA was and would remain impossible. I tried to reiterate that assertion without burning my bridges, emphasizing that there were other cool genomes one could sequence, particularly that of Neanderthals. Fortunately, Jonathan quickly became intrigued by the idea that we might use such an investigation to identify what changes made us fully human. I also convinced him and Egholm that it would be good to start with a mammoth and a cave bear.
A week later we shipped off a mammoth extract and a cave-bear extract to 454 Life Sciences. At about the same time, Richard E. (Ed) Green, a hard-working and talented bioinformatician, joined our lab from UC Berkeley, where he had just finished his PhD. He had been awarded a prestigious, well-paid fellowship from the National Science Foundation in the United States to undertake a project comparing RNA splicing in humans and great apes. Splicing is the process by which the RNA copies of genes are cut up and joined to form the messenger-RNA molecules that direct protein synthesis. The idea was that differences in how genes were spliced together might account for many of the differences between humans and chimpanzees. But just as Ed got this under way, the first data from 454 Life Sciences came in.
The people at 454 had produced DNA sequences from hundreds of thousands of pieces of DNA from the mammoth and cave-bear bones. I asked Ed to look into the first problem with the DNA sequences: separating those that came from the specimens themselves from those that came from contaminating bacteria and other organisms. It was not a trivial issue. He compared the DNA sequences from the bones to the genome sequences of the elephant and the dog, the two present-day animals most closely related to mammoths and cave bears, respectively, with available genome sequences. But the ancient DNA sequences were short and they were likely to carry errors induced by chemical modifications that had occurred over millennia. In addition, the number and identity of the bacteria and fungi in the bones were unknown. But the challenge of this ancient DNA sideline proved irresistible to Ed; soon he had forgotten all about RNA splicing. Eventually he wrote a letter to the administrator who handled his fellowship at the National Science Foundation, describing how the goals of his project had shifted. Unfortunately, the NSF lacked the vision to recognize that the Neanderthal genome was an awesome opportunity for a computational biologist; instead, it cut his fellowship. Fortunately, our budget was large enough to allow us to keep Ed.
He had in the meantime found that about 2.9 percent of the DNA extracted from the mammoth bone had actually come from the mammoth and about 3.1 percent of the DNA from the cave-bear bone had come from the cave bear. This meant that our earlier results when working with Eddy Rubin, in which only about 5 percent of the cave-bear sequences produced after cloning actually came from the cave bear, were actually pretty good. Three or 5 percent doesn’t sound like much, but in total we now had 73,172 different mammoth DNA sequences and 61,667 different cave-bear sequences. This meant that in one single experiment, for which we did not even use up the entire extract, the 454 approach had produced almost ten times as much data as we had obtained from the bacterial cloning of the DNA from the cave bear in Berkeley. This seemed like a real breakthrough to me, but the approach was not without risk. Our original, PCR-based method enabled us to repeat the experiments many times, both to make sure we got the same sequences and to detect errors in them. Our new approach let us see each sequence only once, and because both genomes were so big, we were unlikely to see another copy of the same segment in the mammoth or cave-bear genome among our sequences. As a result, we couldn’t immediately determine the extent to which chemical damage in the ancient DNA, and resultant errors in the sequence, might influence our results.
Detecting errors was not a new problem, however, and we had already made some progress. A few years earlier, in 2001, Michael Hofreiter, then a graduate student in my lab, had together with others in the group shown that the most common form of DNA damage resulting in errors in ancient DNA sequences was the loss of an amino group from the nucleotide cytosine. This occurs spontaneously in DNA whenever even small amounts of water are present. When cytosine (C) loses its amino group, it becomes uracil, a nucleotide usually found in RNA. DNA polymerases read it as a T. By comparing our mammoth and cave-bear sequences to those from the elephant and dog, we could check whether we were seeing more T’s than expected in places where the present-day animals had C’s. We did see a clear overabundance of T’s. But to our surprise we saw also a more modest increase in guanine (G) relative to adenine (A), suggesting that A’s might also lose their amino groups in ancient DNA, just as C’s did. To test this notion, we used synthetic pieces of DNA into which we introduced both C’s and A’s without their respective amino groups, to see how they would be read by the DNA polymerase that 454 Life Sciences had used to amplify the DNA in pyrosequencing. The DNA polymerase not only read C’s without amino groups as T’s but also A’s without amino groups as G’s. So we wrote in a paper that appeared in the Proceedings of the National Academy of Sciences in September 2006 that not just C’s but also A’s might lose their amino groups.{45} Rather soon, however, it would turn out that we were wrong.
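The comparison described above, looking for T's in the ancient sequences where the present-day animal has a C, amounts to tallying mismatches between aligned sequences. The following is a minimal sketch of that tally. The function name and the toy sequences are made up for illustration, and the sketch assumes the ancient and reference sequences have already been aligned position by position without gaps.

```python
from collections import Counter


def substitution_counts(aligned_pairs):
    """Tally mismatches between reference and ancient sequences.

    aligned_pairs is an iterable of (reference, ancient) strings of equal
    length. Returns a Counter keyed by (reference_base, ancient_base).
    """
    counts = Counter()
    for reference, ancient in aligned_pairs:
        if len(reference) != len(ancient):
            raise ValueError("sequences must be aligned to the same length")
        for ref_base, anc_base in zip(reference, ancient):
            if ref_base != anc_base:
                counts[(ref_base, anc_base)] += 1
    return counts


# Toy data, invented for the example: (present-day reference, ancient read).
toy_pairs = [
    ("ACCGTC", "ATCGTT"),
    ("GGCATA", "GGTATA"),
    ("CAGT", "CAAT"),
]
for (ref_base, anc_base), n in sorted(substitution_counts(toy_pairs).items()):
    print(f"{ref_base} -> {anc_base}: {n}")
```

An excess of one kind of mismatch over the others, such as C to T, is the signature that points to a specific form of chemical damage rather than to ordinary differences between species.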
In the meantime, subtle frictions had developed between my group and Eddy Rubin’s group at Berkeley. It was now clear to us in Leipzig that pyrosequencing was at least ten times more efficient than bacterial cloning. It seemed to us that the process of bacterial cloning led to great losses of DNA, probably in the step where bacteria were coaxed to take up DNA. Eddy, however, was convinced that the low efficiency seen in the cave-bear experiment was a fluke. He was characteristically enthusiastic about this in the phone conferences we had with his group. I was torn by our disagreement. It seemed, after many years of frustration, not only that it might be possible to arrive at a sequence of the entire Neanderthal genome but that there were multiple ways to do it. Yet I felt that the project would be tenable only if it required grams and not kilograms of bone, as Eddy’s technique would require. 454’s pyrosequencing seemed to fit the bill already, but eventually Eddy convinced me to give bacterial cloning another chance. So I decided to test the two approaches, bacterial cloning and direct sequencing of molecules, head to head, and to do it with the real thing: Neanderthal DNA.
We prepared two extracts from what we considered to be our best Neanderthal sample, a bone known as Vi-80. David Serre had sequenced a highly variable part of its mitochondrial DNA in 2004. In mid-October 2005, we sent off one extract to be directly sequenced by Michael Egholm and his crew at 454 Life Sciences and one extract to Eddy Rubin’s group to be cloned in bacteria and then sequenced. The extracts had been prepared by Johannes Krause, working in our clean room. It was unnerving to then send them on to the labs in Connecticut and California where they might get contaminated. Once the tests proved which method was best, we would need to establish that method in our clean room.
Meanwhile, another new graduate student, Adrian Briggs, had arrived in our group. Fresh from undergraduate studies at Oxford, Adrian was the nephew of Richard Wrangham, the well-known Harvard primatologist. Both Adrian’s family ties and his Oxbridge education had me worried that he would turn out to be snobbish and arrogant, but my fears were totally unfounded. Even better, Adrian had an amazing ability to think quantitatively about problems in a way no one else in our group did. Best of all, he never made the rest of us feel stupid, although he thought more quickly and accurately about problems than any of us. Whereas I had no more than a hunch that most of the DNA had been lost in the process of making the Berkeley cave-bear libraries, Adrian calculated that only about 0.5 percent of the cave-bear DNA we sent to Eddy Rubin’s group had actually ended up in the bacterial libraries they produced. Adrian also calculated that in order to sequence the 3 billion–plus base pairs of a cave-bear or Neanderthal genome it would be necessary to isolate and sequence about 600 million bacterial clones, a logistical impossibility even at Eddy’s Joint Genome Institute. It put my concerns about the cloning on a solid footing; obviously, the process of cloning in bacteria was nowhere near as efficient as would be needed to get the Neanderthal genome. In a rather tense phone conference in January 2006, Adrian presented these results to Eddy’s group. Eddy, however, still felt that something had gone wrong with his laboratory’s cave-bear libraries. In the meantime, work at both 454 and Eddy’s lab went forward.