by Pääbo, Svante
We were interested in seeing how often the Neanderthals carried recent, derived alleles that are also seen among present-day humans, as this would allow us to estimate when Neanderthal ancestors split from modern human ancestors. Essentially, more derived alleles shared by modern humans and Neanderthals means that the two lines diverged more recently. During the summer of 2007, Ed looked at our new data from 454 Life Sciences and he was alarmed. Just as observed by Wall and others in the smaller test data set published in 2006, Ed saw that the longer Neanderthal DNA fragments—those of more than 50 or so nucleotides—carried more derived alleles than shorter ones. This suggested that the longer fragments were more closely related to present-day human DNA than the shorter ones, a paradoxical finding that once again could have been the result of contamination.
Like many of the crises before it, this one dominated our Friday meetings. For weeks we discussed it endlessly, suggesting one possible explanation after another, none of which led us anywhere. In the end, I lost my patience and suggested that maybe we did have contamination, that maybe we should just give up and admit that we could not produce a reliable Neanderthal genome. I was at my wit’s end, feeling like crying like a child. I did not, but I think many in the group realized it was a real crisis nonetheless. Perhaps this gave them new energy. I noticed that Ed looked as though he had not slept at all for a few weeks. Finally he was able to puzzle it out.
Recall that a derived allele starts out as a mutation in a single individual—a fact that, by definition, makes derived alleles rare. Examined in aggregate, one person’s genome will show derived alleles at about 35 percent of the positions that vary whereas about 65 percent will carry ancestral alleles. Ed’s breakthrough came when he realized that this meant that when a Neanderthal DNA fragment carried a derived allele, it would differ from the human genome reference sequence 65 percent of the time and match it only 35 percent of the time. This, in turn, meant that a Neanderthal DNA fragment was more likely to match the correct position if it carried the ancestral allele! He also realized that short fragments with a difference to the human genome would more often go unrecognized by the mapping programs than longer fragments, because the longer fragments naturally had many more matching positions that allowed them to be correctly mapped even if they carried a difference or two. As a result, shorter fragments with derived alleles would more often be thrown out by the mapping program than longer ones, and short fragments would therefore incorrectly seem to carry fewer derived alleles than longer fragments. Ed had to explain this to me many times before I understood it. Even so, I did not trust my intuition and hoped that he could prove to us in some direct way that his idea was correct.
I guess Ed did not want to see me cry in the meeting, so in the end he came up with a clever experiment that proved the point. He simply took the longer DNA fragments he had mapped and cut them in half in the computer, so that they were now half as long. He then mapped them again. Like magic, the frequency with which they carried derived alleles decreased when compared to the longer ones from which they were generated (see Figure 14.1). This was because many of the fragments that carried derived alleles could not be mapped when they were shorter. Finally, we had an explanation for the pattern of apparent contamination in our data! At least some of the patterns of contamination seen in the original test data published in Nature could also now be explained. I quietly let out a sigh of relief when Ed presented his experiment. We published these insights in a highly technical paper in 2009.{55}
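The reasoning behind this experiment can be made concrete with a toy calculation. The sketch below is not the analysis from the 2009 paper. It assumes that a fragment matching the reference always maps, that a fragment with one mismatch maps with a probability that grows with its length (the function `map_prob_with_mismatch` and its linear shape are invented purely for illustration), and that 35 percent of Neanderthal fragments truly carry the derived allele (also an assumption). Only the 35/65 split for the reference sequence comes from the text above.

```python
# Toy model of the mapping bias; all names and the mapping curve are hypothetical.
REF_DERIVED = 0.35  # chance the human reference carries the derived allele (from the text)


def map_prob_with_mismatch(length, full_length=100):
    """Assumed chance that a fragment with one mismatch still maps.

    Illustrative only: rises linearly with length and reaches 1.0 at full_length.
    """
    return min(1.0, length / full_length)


def observed_derived_fraction(length, true_derived=0.35):
    """Fraction of *mapped* fragments of a given length that carry a derived allele."""
    m = map_prob_with_mismatch(length)
    # A derived fragment matches the reference 35% of the time, mismatches 65%.
    derived_mapped = true_derived * (REF_DERIVED + (1 - REF_DERIVED) * m)
    # An ancestral fragment matches 65% of the time, mismatches 35%.
    ancestral_mapped = (1 - true_derived) * ((1 - REF_DERIVED) + REF_DERIVED * m)
    return derived_mapped / (derived_mapped + ancestral_mapped)


# Mimic cutting fragments in half in the computer: 100 -> 50 -> 25 nucleotides.
for length in (100, 50, 25):
    print(length, round(observed_derived_fraction(length), 3))
```

Under these assumptions the apparent derived-allele frequency falls each time the fragments are halved, even though the underlying DNA is unchanged, which is the pattern the halving experiment revealed.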
Ed’s findings reinforced my conviction that direct assays for contamination were necessary, and our Friday discussions again and again came back to how we could measure nuclear DNA contamination. But now I was somewhat more relaxed when these discussions came up. I felt convinced that we were on the right track.
Chapter 15
From Bones to Genome
________________________________
By early 2008, the people at 454 Life Sciences in Connecticut had performed 147 runs from the nine libraries we had prepared from the Vi-33.16 bone, yielding 39 million sequence fragments. This was a lot but still not as much as I had hoped to have by this time, and certainly far too little to make it worthwhile to begin to reconstruct the nuclear genome. Nevertheless, I was keen to test the mapping algorithms, so we undertook the much less formidable task of reconstructing the mitochondrial genome. All we or anyone else had done by that point was sequence some 800 nucleotides of the most variable parts of Neanderthal mtDNA. Now we wanted to do all 16,500 nucleotides.
Ed Green began by sieving through the 39 million DNA fragments to identify those that were similar to the mtDNAs of present-day humans. He then compared these sequence fragments to one another to find where they overlapped, enabling him to build a preliminary Neanderthal mtDNA sequence. He next trolled through the 39 million sequences again, this time looking for the ones most similar to his preliminary Neanderthal mtDNA sequence that he might have missed the first time. In total, he identified 8,341 Neanderthal mtDNA sequences, averaging 69 nucleotides in length. From them he assembled a complete mtDNA molecule of 16,565 nucleotides, the longest contiguous Neanderthal mtDNA sequence ever reconstructed.
This gave me a comforting sense of having achieved something concrete, although the analysis of the Neanderthal mtDNA genome revealed nothing about Neanderthals that we had not already known. The useful insights that we did find were of a technical nature. For example, we found that the number of fragments we retrieved varied across the genome. Ed realized that this correlated with the amounts of G and C nucleotides relative to A’s and T’s in the fragments. This meant that DNA molecules rich in G and C survived better in the bone—or, perhaps, survived our extraction of DNA from the bone better. But the good news was that no parts of the mtDNA were missing. I began to feel that many of the technical problems in analyzing Neanderthal DNA fragments were now under control. We also found 133 positions where the Neanderthal mtDNA differed from all, or almost all, human mtDNAs today.{56} Before this, we had known of only three such positions in the short segment of Neanderthal mtDNA we had published in 1997. Using the 133 positions, we could now more confidently estimate the level of modern mtDNA contamination in our new data. It was 0.5 percent. We also went back and estimated the mtDNA contamination in the old test data in our 2006 Nature paper as well as the additional data generated while our Nature paper was still in press. Out of 75 mtDNA fragments, 67 were of the Neanderthal type. So there was 11 percent contamination in that library, more than we had hoped but much less than the 70 to 80 percent suggested in the Kim and Wall paper. We included all this information in a paper we submitted to Cell, the journal that published our initial Neanderthal mtDNA results in 1997. Again we stressed that a direct way to estimate contamination specifically for the nuclear genome would be better. And we renewed our discussions in the Friday meetings about how best to do this.
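The arithmetic behind the 11 percent figure is simple enough to sketch. The function below is illustrative only: the counts (75 fragments, 67 of the Neanderthal type) come from the text, while the name `contamination_estimate` and the Wilson score interval are additions not mentioned in the source, included to show roughly how uncertain an estimate from so few fragments would be.

```python
import math


def contamination_estimate(total, neanderthal_type, z=1.96):
    """Return (point estimate, lower, upper) for the contaminating fraction.

    The interval is a Wilson score interval at roughly 95% confidence (z=1.96);
    it is an assumption added for illustration, not a method from the text.
    """
    contaminant = total - neanderthal_type
    p = contaminant / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, centre - half, centre + half


p, low, high = contamination_estimate(75, 67)
print(f"contamination: {p:.1%} (approx. interval {low:.1%} to {high:.1%})")
```

The point estimate is 8 out of 75, which rounds to the 11 percent quoted above. With only 75 fragments the interval is wide, which is one reason a direct estimate from far more data was desirable.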
Once that analysis was finished, however, I became worried that the mtDNA paper had diverted our attention from the fact that the accumulation of Neanderthal DNA sequences was going slowly. We were now well into the second year of the project, and only months away from the two-year deadline we had publicly set ourselves for producing the 3 billion Neanderthal nucleotides. Being a bit overtime would be a small thing, but unfortunately I felt we were on track to make a much worse finish. As a result, lab meetings were becoming increasingly tense. I sometimes found myself becoming loud and sarcastic (which I sorely regretted later), generally because of some illogical arguments or because someone was unable to succinctly describe what he or she had done in the lab. But the deeper reason for my bad temper was my perception that the project was not moving forward. Part of the reason for the slow progress was that few extracts contained enough Neanderthal DNA to enable production of DNA libraries at the speed we had hoped, but it was also obvious that the sequencing at 454 was not going quickly. Michael Egholm certainly remained committed to the project, but 454 Life Sciences had been sold to the Swiss pharmaceutical giant Roche in March 2007. As a result, the person who handled the day-to-day sequencing at Branford left the company the following fall, and I suspected that it had become hard for Egholm and the others to devote their full attention to the Neanderthal genome. For the first time, I flirted with the idea of working with 454’s competition.
One of those competitors was David Bentley, an accomplished human geneticist I met at the Cold Spring Harbor meeting in May 2007. In 2005 he had moved from the Wellcome Trust Sanger Institute to Solexa, a new company spun out of the chemistry department of the University of Cambridge. At Solexa he oversaw the development of a DNA sequencing machine that presented the strongest competition yet for Jonathan Rothberg’s 454. Just like the 454 technique, the Solexa technique used adaptors stuck to the ends of molecules to create DNA libraries that could be amplified and sequenced. Unlike the 454 technique, however, each library molecule was amplified not in little oil droplets but by primers attached to a glass surface. So from each initial DNA strand that had landed on the glass, a little spot, or cluster, composed of millions of copies of the original library molecule would emerge. These clusters were then sequenced by the addition of a sequencing primer, a DNA polymerase, and the four bases, each labeled with a different fluorescent dye.
The first test versions of these machines were delivered to sequencing centers in 2006. They could sequence a stretch of only twenty-five nucleotides, and I had heard at the time that the machines tended to break down a lot. But the great potential advantage of this technology was that each run of the machine allowed the sequencing of not hundreds of thousands of individual DNA fragments, as on the 454 machines, but a few million, and this number could potentially increase as the machines improved. Soon, too, the read lengths became 30 nucleotides, and there was talk about an improvement that would allow each DNA fragment to be sequenced from each of its ends so that a total of 60 nucleotides could be read. This began to sound very interesting indeed for ancient DNA researchers. Others were also interested. In November 2006, Illumina, a US-based biotech company, bought Solexa. David Bentley was now the new company’s chief scientist and vice president.
At the Cold Spring Harbor meeting, I discussed our project with David. He agreed that I could send him a mammoth or Neanderthal extract to test how Illumina’s technology worked. In fact, we had already begun such a test. We’d been so eager to try this technology that a few months earlier, in February 2007, we sent one of our best mammoth DNA extracts to Jane Rogers at the Sanger Institute in Cambridge, who was in charge of their Solexa machines. We had not heard back from her yet, however, so I returned from Cold Spring Harbor with a new sense of urgency and began pestering our contacts at Sanger about the results. In early June, the data came back. We were a bit disappointed to see from the sequences that the technology seemed error prone. The company worked hard on improving this, but I also realized that the error rate could be compensated for by the very large numbers of DNA fragments the machines could sequence. In principle, we could simply sequence every DNA fragment in a library multiple times so that the errors would be easy to spot and disregard. Unfortunately, Illumina did not operate its own sequencing center, as did 454, so we would need our own machine, which, due to high demand, we only got six months later. By now it could read 70 nucleotides, but it still made many errors, which became more and more frequent further along into the sequences it read. A technical upgrade of the machine in 2008 allowed us to sequence each DNA fragment in our libraries from each of its ends. Since our Neanderthal DNA fragments were on average just 55 nucleotides long, we could therefore read each DNA sequence twice, once from each side. This meant that for most of each fragment, we had reliable sequence information.
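The idea of reading the same fragment several times so that errors stand out can be sketched as a per-position majority vote. This is a minimal illustration under strong assumptions, not the pipeline the group built: it assumes the repeated reads are already aligned and of equal length, and the function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(reads):
    """Majority vote at each position across equal-length reads of one fragment."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be aligned and of equal length")
    # zip(*reads) yields one column of bases per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


# Three reads of the same fragment, each with an error at a different position.
reads = ["ACGTAC", "ACGTTC", "AGGTAC"]
print(consensus(reads))
```

Because independent errors rarely hit the same position in the same way, the correct base usually wins the vote once a fragment has been read three or more times. With only two reads, as in sequencing from both ends, a disagreement can be detected but not resolved by voting alone.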
The person who took on the challenge of analyzing the Illumina data was Martin Kircher, who had joined Janet Kelso’s bioinformatics group as a graduate student in the summer of 2007. His boyish looks and charming smile belied what I felt was an overconfidence in himself that bordered on arrogance, perhaps inspired by his unofficial mentor Udo. This had greatly irritated me initially, but gradually I came to realize that his opinions were actually often right. I learned to appreciate his ability to quickly grasp technical issues, organize the flow of data from the sequencing machine into the cluster of computers, and give feedback to the technicians operating the machines. And he worked incredibly hard. Not only I but also Janet and everybody else came gradually to rely more and more on Martin for keeping our Illumina sequencing machine going and pushing analyses through the computers.
By early 2008 it was clear that we needed to abandon the 454 technology altogether to have a chance to finish the Neanderthal genome in a reasonable amount of time. The strong point of the 454 technology was its ability to sequence long DNA fragments, but since our DNA fragments were short, this was of no interest to us. Our goal was to sequence many short DNA fragments as quickly as possible. And in such mass production, Illumina had a real advantage over 454. But moving on from the 454 approach would not be straightforward, as Ed Green and the others had been busily building programs designed to handle 454 data. Switching to Illumina would mean revamping the data-handling procedures and merging the different versions of sequence data. The technologies were so new that off-the-shelf software to solve such problems was not available. We had to do it all ourselves.
As the summer of 2008 approached, these issues came to a head. In mid-July it would be two years since our press conference. Clearly, we would not make that deadline, but if journalists called and asked, I wanted at least to have a new timeline. We now had enough bones and DNA extracts to make sufficient sequencing libraries to arrive at 3 billion nucleotides, but the only way to sequence the entire genome as we had promised was to move to Illumina. Finally, I took a substantial amount of the money we had set aside to pay 454 for sequencing and instead ordered four more Illumina machines. With five machines working simultaneously, I thought we could do it, and if the machines were delivered promptly, we might even be able to do it by the end of the year. Again, I had to end a collaboration, and toward this end I met Michael Egholm from 454 at a meeting in Denmark. Fortunately, he understood my reasoning, but he predicted that we would regret having to deal with “error-ridden microreads,” as he disparagingly termed the shorter reads produced by Illumina machines.
In the middle of all these emotional ups and downs, I was happy to take a private respite from the Neanderthal work. On July 1, Linda and I flew off to Kona, on Hawaii. The official reason for the trip (and the explanation I gave to the lab) was that I had been invited to something called the Academy of Achievement, an annual conference at which musicians, politicians, scientists, and authors come together in a secluded and intimate environment to share ideas and experiences with one another and with a hundred graduate students from all over the world. Excited as I was to spend a few days with many famous and wise people, this was not the main reason I had come here. Linda and I had decided to use this opportunity to get married. It was something we had long postponed, mainly because I considered marriage an old-fashioned formality. We had decided to marry now in part because of practical reasons having to do with German pension plans in case I would predecease her, but we wanted to do it in private and with a slightly antic touch. We arranged a ceremony with a New Age pastor on the beach, a setting as beautiful as one might imagine. The pastor began by invoking Hawaiian spirits, blowing long tones on a conch shell to the four cardinal points of the compass. We made our private vows to each other and she pronounced us man and wife. In spite of the practical considerations that had prompted our decision to marry, I felt the ceremony manifested a deep commitment that had developed between us. My life with Linda had proved to be much richer than had my monk-like existence as a professor in Munich, particularly with our son Rune’s birth in 2005.
After the ceremony on the beach, we took off on a hike. Linda had found an area in a park on the Big Island that was both lonely and beautiful. We started out walking with our heavy backpacks in the glaring sun over moon-like lava beds until we reached the coast. There we spent four days strolling naked on pristine beaches, snorkeling with fishes and turtles, and making love on the beach and under the palm trees. When I fell asleep to the soothing swell of the Pacific Ocean and the rustling of the palm leaves overhead, the subarctic steppe where the Neanderthals had lived seemed very far away. It was the perfect interruption in an extremely intense phase of my life.
But the Hawaiian interlude was short. The week after Linda and I came back from Hawaii, I gave a talk at the World Congress of Genetics, in Berlin. I described our technical progress toward sequencing the genome and our mtDNA results. It was frustrating to not have more to say. Another speaker at the meeting was Eric Lander, one of the main thinkers and driving forces behind the public effort to sequence the human genome. I tended to find him almost intimidatingly insightful and sharp. I had often encountered Eric at Cold Spring Harbor and in Boston, where he headed the Broad Institute, a very successful research institute devoted to genomics, and I had frequently profited from his advice. After the congress, he came to Leipzig to visit our group. We had still not taken delivery of our four new Illumina machines, and with our single machine we could not generate data fast enough, since each run took two weeks plus processing time on the computers. Fortunately, Eric, a great champion of the Illumina machines, had several at the Broad Institute, and he offered to help. We were only days from our self-imposed deadline, so I took him up on his offer without hesitation. We would obviously not be done by our deadline, but if we generated the data by the end of 2008, we would at least have done it within the year we said we would.
As our two-year deadline came around, both Nature and Science started courting us to submit our Neanderthal genome paper to them. I considered doing what I had done in 1996 with the first Neanderthal mtDNA sequences: publishing in Cell, which is a more serious molecular biology journal. But there was something to be said for publishing in either of the two other prestigious journals, where everybody expected to see the work, especially the students and postdocs, who tended to think that their careers might benefit from publishing there. In June, Laura Zahn, an editor from Science, visited us to discuss the Neanderthal paper. Science is published by the American Association for the Advancement of Science, and shortly after Laura’s visit the AAAS invited me to give a plenary presentation on the Neanderthal work at its annual meeting, which would take place in Chicago on February 12–16, 2009. This provided a definitive deadline to work against, one I felt certain we would make. So I agreed to the talk, and I realized that this meant we would most likely publish in Science.