by Frank Ryan
The ambitious Celera commissioned 200 of the fastest automated sequencing machines, which would combine the speed of mass production with Venter's shotgun method, blasting the 46 chromosomes (containing some 6.4 billion nucleotides) into much smaller fragments that would be capable of decipherment by the banks of the sequencers before being reassembled to complete the whole genome. The Celera approach, as Venter now planned it, would reduce the time needed to complete the project from the ten years proposed by his rivals to seven. Meanwhile Collins, on behalf of the many scientists involved in the publicly funded Human Genome Project, argued that this technique would lead to unacceptable inaccuracy. The academics now raised new worries—that, despite Venter's reassurances, the commercial mindset would lead to unacceptable limitations in the freedom of access to the genomic data, hampering future research. Taken to extremes, some scientists feared that Celera might attempt to copyright our human genes.
This acrimony and debate still rankled between the rivals and infused the media, at the time of the twin declarations of discovery in 2001, with the Celera results published in the American flagship magazine Science and the Genome Project's results published in the British equivalent, Nature. In effect, we now had two readout versions of the same genome. While Celera made clear that they would permit free access to their findings to academic scientists, this would not extend to commercial applications and potential. After all, they had spent hundreds of millions of dollars in the exploration, and, as a commercial company, they needed to recoup their costs and make some profit out of the enterprise. Meanwhile, the publicly funded group made explicit that all of their findings were now universally available.
Some readers might feel indignant that commercial interests should intrude into something as sacred as our genetic makeup, but in fact this parrying between commercial and public interest is commonplace in biological and medical research. While it may at times be tricky to draw any hard and fast line between the two very different approaches, in practice, research into the most important of arenas—such as vaccination, antibiotics, and the treatment of cancer—has always involved an uneasy balance between opposing interests.
The breakthrough actually came about through both avenues of research, and with equal plaudits to the two opposing sides. Through the twin publications of Nature and Science, the world of science, and humanity in general, was now privileged to learn, on February 15 and 16 respectively, about the enormously complex molecular structures that lie at the genetic heart of us. The decipherment was epochal in what it promised future generations of biological and medical scientists—indeed, in what it promised human society—but it also proved to be mind-blowing in its unexpected revelations. If, as newspapers and magazines proclaimed, here was the basic genetic landscape at the core of life, that landscape was now revealed to be a vast terra incognita.
The word “breakthrough” is often misused in relation to scientific discovery, but here, indeed, was real breakthrough after breakthrough. And the breakthroughs presented a very much unprepared world of science with not one but three major surprises, each a challenging new mystery. These will become evident if we examine the pie chart of the 2001 human genome below.
I should make clear that this pie chart is a metaphor of sorts, summing up the percentage contributions of different distinct genetic elements to the genome without reference to where things are actually situated throughout the 46 chromosomes. At this stage in our knowledge most geneticists were mainly interested in genes that coded for proteins, so it is in this aspect, the part of the genome that codes for proteins, where we discover the first of the three mysteries.
Biochemists had arrived at a rough assessment that there were something like 100,000 proteins involved in the structure and functioning of the human body. Thus we anticipated that there would be the same number of protein-coding genes. What many geneticists wanted to know was exactly how many protein-coding genes there really were and where they were situated on the chromosomes. It was an almighty shock to discover that these protein-coding genes amounted to less than 2 percent of the entire genome, perhaps as little as 1.5 percent. It hardly seemed possible that this minuscule inheritance could possibly translate to the 100,000 different proteins that went into the makeup of the human body.
How had we got things so very wrong?
This modest 1.5 percent of the genome given over to protein-coding genes was found to comprise roughly 20,500 genes. For geneticists, and biologists in general, this observation was astonishing. According to Beadle and Tatum's maxim of one-gene-one-protein—universally believed up to this moment—there should have been 80,000 to 100,000 protein-coding genes. It appeared to make no sense. To confuse matters further, another of the dawning implications of the first draft of the genome was the fact that, in Craig Venter's estimation, at least 40 percent of the protein-coding genes they had discovered had no known function. “We have no idea what they do. They have not been seen in biology before.” He went on to admit: “It is incredibly humbling.”
The paltry 20,500 genes seemed downright humiliating. To put it into perspective, we had roughly ten times as many genes as the average bacterium, four times as many as a fruit fly, and just twice as many as a nematode worm. In terms of genes, we seemed hardly more complex than these humble life-forms. A related revelation was the number of genes we have in common with these simpler organisms. We now discovered that we share 2,758 of our genes with the fruit fly and 2,031 with the nematode worm; and the three of us—human, fly, and worm—have 1,523 genes in common.
Darwin had been the first to dare to imagine that all of life on Earth was intimately related, through the process of evolution that he had himself pioneered. Here at once was both the confirmation of his brilliance in the letters of the code of life, our human DNA, but also a new and astonishing incongruity.
How could science possibly explain how roughly 20,500 genes could code for the estimated 100,000 proteins?
Up to this point, we believed that the protein-coding genes, made up of long strands of DNA, were copied to their exact matches in terms of complementary messenger RNA—with the exception that one of the four bases, thymine in the DNA, was replaced by uracil in the RNA—and this matching long strand of messenger RNA was then ferried out of the nucleus and taken to the protein-manufacturing ribosomes in the cytoplasm, where it was translated, using the triplet codes, to proteins whose amino acids corresponded faithfully to the original DNA code of the gene in the nucleus. Thus the number of genes should correspond to the number of proteins.
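For readers who like to see an idea worked through, the older one-gene-one-protein picture can be sketched in a few lines of Python. This is only a toy under stated assumptions: the 18-letter "gene" is hypothetical, the codon table is a small excerpt of the real 64-entry table, and the DNA is read from its coding strand, so that the messenger RNA has the same letters with T swapped for U.

```python
# Toy illustration only: a small excerpt of the genetic code (the full table has 64 codons).
CODON_TABLE = {
    "AUG": "Met",  # methionine, also the usual start signal
    "GUG": "Val",
    "CAC": "His",
    "CUG": "Leu",
    "ACU": "Thr",
    "UAA": None,   # a stop codon: translation ends here
}


def transcribe(coding_strand: str) -> str:
    """Copy the DNA coding strand into messenger RNA, replacing thymine (T) with uracil (U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read the messenger RNA three letters at a time until a stop codon is reached."""
    amino_acids = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue is None:  # stop codon
            break
        amino_acids.append(residue)
    return amino_acids


gene = "ATGGTGCACCTGACTTAA"  # hypothetical gene, chosen to fit the small table above
mrna = transcribe(gene)
protein = translate(mrna)
print(mrna)
print("-".join(protein))
```

In this simplified view every letter of the gene ends up in the message, which is exactly why one gene was expected to yield one protein.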
The key to this enigma proved to be a startling discovery made by two scientists back in 1977.
Richard J. Roberts graduated from my own alma mater, the University of Sheffield, with a bachelor of science degree in chemistry, completing his PhD in 1965. He subsequently went to work at Cold Spring Harbor Laboratory, New York. Phillip Allen Sharp graduated at the University of Illinois with a PhD in chemistry in 1969. He also ended up working at Cold Spring Harbor. Roberts and Sharp were exploring how the genes of a virus, called adenovirus 2, were expressed as protein within the cells of tissue cultures. What they discovered was that the actual messenger RNA strand that arrived at the ribosomes ready to code for the protein was significantly shorter in terms of its nucleotide sequence than the DNA-based gene in the viral core. This told them that only a portion of the so-called protein-coding gene actually coded for the amino acid sequences of the translated protein. Something very strange must have taken place during the chain of communication between the viral gene, within the viral core, and the expression of that gene within the host cell in the tissue culture.
As with the phage research a generation earlier, the tiniest of microbes, the viruses, had opened up a window onto a more general biological truth. Roberts and Sharp had discovered what we now call “introns” and “exons” and the importance of their role in a genetic mechanism known as “splicing”—discoveries that led to their sharing the Nobel Prize in Physiology or Medicine in 1993.
What then are introns and exons? And how do they solve the puzzle of the discordance between the number of protein-coding genes and the anticipated number of proteins coded by the human genome?
Perhaps it is time we climbed back aboard our imaginary train to take a new journey into that ultramicroscopic landscape, with its astonishing twin track of alternating phosphates and deoxyribose sugars and those all-important sleepers.
We arrive at our destination in the blink of an eye to find ourselves chugging along large stretches of a chromosome. We know that within this chromosome there are distinct stretches of DNA called genes. Since this is a wonderland, with magical potential, we can wish that some forthcoming gene should show itself by glowing with a green light. With this in mind, we slow down sufficiently to see that exactly such a stretch is looming in front of us, pulsating a beautiful emerald green, which tells us that we have arrived at the beginning of a gene. We throw the engine into low gear and travel along the twin-track rails, observing that the green glow is actually coming from the sleepers. After a while, we see that the track has reverted to the normal brown of sleepers again. I must now suggest that we haven't actually come to the end of the gene. The green-glowing track we have just traversed is merely the first exon.
You are inclined to ask: “So where exactly are we now?”
“The normal stretch, with the brown sleepers, is the first intron.”
As we chug slowly along this section, we find it is, if anything, longer than the green-glowing previous section. Then it too ends abruptly as we arrive at another green-glowing section—a second exon. As we continue our journey, we count some three stretches of exons with two intervening stretches of introns. There are no further green-glowing sections. So what we have been looking at is a protein-coding portion of a gene comprising three separate exons with two introns, somewhat like spacers, in between them. It really is that simple. What Roberts and Sharp discovered is that the whole DNA of a single “gene” does not necessarily code for a single protein. The gene is actually broken down into smaller chunks, the exons, separated by intervening introns. To code for a specific protein only a particular cluster of the gene's exons will be expressed—they will be copied to messenger RNA, complete with the intervening introns, but the intervening introns will be removed from the coding sequence before the exons are then “spliced” together to fashion the final messenger RNA that will code for a protein.
It might help us to remember if we think of the exons as “exiting” the nucleus to make proteins, while the introns stay “in” the nucleus. The total number of exons in any one human gene is very variable, with an average of 8.4. So in order to make a specific protein, the genome must know how to pick out the right gene, and then, within the gene, it must be capable of choosing which exons to splice together to code for the relevant protein.
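The journey along the track can be mimicked with a short Python sketch. It is a minimal illustration rather than a model of the real cellular machinery: the gene below is hypothetical, with three exons written in upper case and two introns in lower case, and the sequences themselves are invented for the purpose.

```python
# A hypothetical gene laid out as on the train journey:
# three exons (upper case) separated by two introns (lower case).
segments = [
    ("exon", "AUGGUG"),
    ("intron", "guaaguccag"),
    ("exon", "CACCUG"),
    ("intron", "gugaguuuag"),
    ("exon", "ACUUAA"),
]


def primary_transcript(parts):
    """The first RNA copy of the gene: exons and introns together, in order."""
    return "".join(sequence for _, sequence in parts)


def splice(parts):
    """Remove the introns and join the remaining exons into the final messenger RNA."""
    return "".join(sequence for kind, sequence in parts if kind == "exon")


pre_mrna = primary_transcript(segments)
mature_mrna = splice(segments)
print(len(pre_mrna), pre_mrna)
print(len(mature_mrna), mature_mrna)
```

The spliced message is much shorter than the first copy, which is the kind of discrepancy Roberts and Sharp observed between the viral gene and the messenger RNA that reached the ribosomes.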
Take, for example, our human beta-globin, which is part of the molecule hemoglobin. We now know that hemoglobin contains a single iron atom at its core, surrounded by two alpha protein subunits and two beta protein subunits. So the protein as a whole is made up of four different parts—it's a so-called quaternary protein. Now if we look at one of the two identical beta-hemoglobin subunits, the same protein subunit that is mutated in sickle cell disease, we find that the DNA that codes for these comprises three exons with two intervening introns.
It might help at this stage to know how a gene is activated.
If we were to alight from our train and take a look at the actual stretch of DNA that codes for beta-hemoglobin, we would find, somewhere close to the start of the first exon (remember that the decoding mechanism starts at the left and moves along the DNA molecule to the right), a section of DNA known as the “promoter.” Somewhere more distant—maybe at some considerable distance—there are other stretches of DNA that act as “up-stream regulatory elements”—another office or maybe several offices full of administrative bureaucrats. The bureaucrats send a signaling wire to the promoter to say, “Time to express the gene.” Whether or not a specific gene is expressed will vary from cell to cell, tissue to tissue, organ to organ, within the human body, and so will the timing of gene expression and the amount of gene expressed. That's what the bureaucrats control. The promoter then instructs the gene to express its DNA. In the case of the beta-globin protein, the three exons, together with the two intervening introns, are converted to the matching messenger RNA, after which, and still within the nucleus, the two introns are excised and the remaining three exons are joined up together. Only now does the messenger RNA leave the nucleus and travel to the protein-manufacturing ribosomes in the cytoplasm.
The largest known human gene codes for a protein called “dystrophin,” which has 79 exons separated from one another by 78 introns. Dystrophin is important for normal muscle function. As with sickle cell disease, mutations affecting this very long protein can give rise to an inherited form of disease. For example, in Becker and Duchenne muscular dystrophies, a whole exon is usually missing. This damages the membrane surrounding the muscle fiber, resulting in impaired muscle function.
Understanding of the genetics of diseases like these can help medical scientists to work on a treatment and, perhaps in the not too distant future, work toward a genetic cure. Moreover, understanding of how exons and introns work now affords an explanation of how just 20,500 genes could possibly code for 80,000 to 100,000 proteins.
A gene that has, for example, 14 exons separated by 13 introns is likely to code for more than one protein. All that is necessary is that the regulatory mechanisms, which decide on which exons to splice together to make the messenger RNA, choose different combinations of exons. We now know that this is exactly what happens. The ability of a single gene to code for more than one protein is known as “alternative splicing.” We also know that this is ubiquitous in eukaryotic life—including all the animals, plants, fungi, and simpler forms whose genome is contained within a nucleus.
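The arithmetic behind this can be checked with a rough Python sketch. It rests on a loud simplifying assumption: that any non-empty selection of exons, kept in their original order, counts as a possible message. Real cells are far more constrained, so the figures are upper bounds on combinations and not counts of actual proteins. The exon names are placeholders.

```python
from itertools import combinations


def possible_isoforms(exons):
    """Yield every non-empty selection of exons, keeping their order along the gene."""
    for size in range(1, len(exons) + 1):
        yield from combinations(exons, size)


def max_isoform_count(n_exons: int) -> int:
    """Upper bound on selections: each exon is either kept or skipped, minus the empty case."""
    return 2 ** n_exons - 1


# A hypothetical three-exon gene, small enough to list every combination.
isoforms = list(possible_isoforms(["E1", "E2", "E3"]))
for isoform in isoforms:
    print("+".join(isoform))

# The figures quoted in the text: 20,500 genes, 80,000 to 100,000 proteins.
genes = 20_500
average_low = 80_000 / genes
average_high = 100_000 / genes
print(f"average proteins needed per gene: {average_low:.1f} to {average_high:.1f}")
print(f"upper bound for a 14-exon gene: {max_isoform_count(14)}")
```

On these numbers each gene would need to yield only about four or five proteins on average, a tiny fraction of the combinations that a many-exon gene could offer in principle.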
Now we understand why the Nobel authorities decided to award Roberts and Sharp the Nobel Prize in Physiology or Medicine in 1993. In 2005, a multi-million-pound expansion to the chemistry department of the University of Sheffield, where I once studied, was named after Richard J. Roberts.
As we have seen, the first of the major enigmas thrown up by the 2001 blueprint of the human genome had a ready solution. But the other two, the vast virus-related segments and the empty 50 percent, will take a good deal more explaining. Before we journey into these more difficult territories, we require a basic understanding of the mechanisms that are capable of changing the genomes of existing species, and in doing so, of creating new life-forms. This will require a basic understanding of evolutionary biology together with some very recent discoveries within this broad, exciting discipline.
…in a dozen years, The Origin of Species has worked as complete a revolution in biological science as the Principia did in astronomy—and it has done so, because, in the words of Helmholtz, it contains “an essentially new creative thought.”
THOMAS HENRY HUXLEY
When, in 1859, Darwin first published his theory of evolution in his book, The Origin of Species by Means of Natural Selection, it provoked a tsunami of shock throughout the civilized world. Although he made little or no reference to human evolution in this, his first book, the implications for human evolution were implicit in every thought and line. Given that there was no real understanding of how heredity worked, his thinking remains remarkably prescient today. In essence he proposed that nature selects for key characters, or “traits,” that enhance the potential for survival in the same way that breeders of domestic animals and crops had long selected for traits such as size of kernel, coat of wool, meatiness of muscle, resistance to disease or drought, and so on. The way nature did so was brutal, though: it was through attrition. Most parents, for example in animals or plants, produced far more than two offspring. Yet by and large the numbers within a species stayed roughly constant. Darwin realized that the offspring had to compete with one another for scarce resources or to avoid predators. This created fierce competition for survival; those who had a slight edge in the tooth and claw of nature were more likely to survive. If this edge was determined by heredity, the survivors would pass it on to their offspring. In time—and Darwin was well aware that this would most likely involve a gradual and incremental sum of small advantages over very long periods of time—the advantaged would be more likely both to multiply and eventually to generate descendants sufficiently different from the original parental strain as to give rise to a new species. Dilution of the hereditary advantage would be reduced if the emerging species was geographically isolated from the parental strain—for example through isolation on islands, or through separation by mountains or major rivers. In time, the new species would be sufficiently different physically, and reproductively, to breed true within its own population.
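How a slight hereditary edge can add up over many generations may be illustrated with a short Python sketch. It is not Darwin's own formulation but a standard textbook simplification, and every number in it is an assumption chosen for illustration: a trait starts in 1 percent of a population, and its carriers leave 5 percent more surviving offspring than non-carriers in each generation.

```python
def next_frequency(p: float, s: float) -> float:
    """One generation of selection: carriers leave (1 + s) offspring for every 1 left by others."""
    return p * (1 + s) / (1 + p * s)


def simulate(p0: float, s: float, generations: int) -> list[float]:
    """Return the trait's frequency at generation 0, 1, ..., generations."""
    frequencies = [p0]
    for _ in range(generations):
        frequencies.append(next_frequency(frequencies[-1], s))
    return frequencies


# Assumed starting point: a rare trait (1 percent) with a small advantage (5 percent).
history = simulate(p0=0.01, s=0.05, generations=200)
for generation in (0, 50, 100, 200):
    print(f"generation {generation:3d}: frequency {history[generation]:.3f}")
```

Under these assumptions the trait rises from rarity to near universality within a couple of hundred generations, which is the gradual, incremental accumulation the paragraph above describes. The sketch ignores chance, mating, and geographic isolation entirely.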
Natural selection was a very simple and convincing hypothesis. Darwin had observed differences in the beaks of the birds on the different Galapagos islands. Soon other naturalists—what today we call biologists—would observe and confirm his findings in animals and plants, fungi, protists (what we once called protozoa), and much simpler organisms, such as bacteria and viruses.
While many scientists were intrigued by and largely supportive of Darwin's theory, some, such as the distinguished Swiss American Jean Louis Rodolphe Agassiz, who had performed landmark work on glaciers and extinct fishes, were adamantly opposed to any theory of evolution on religious grounds. Darwin's former friend Sir Richard Owen, the renowned naturalist and founder of the Natural History Museum in London, is also presented as opposing evolutionary theory on religious grounds, but it would appear that he had his own theories about it and simply disagreed with Darwin's concept of natural selection combined with gradual change. Darwin was well aware that natural selection could only work if there were mechanisms capable of creating changes in the heredity of living organisms. To put it another way, natural selection requires hereditary variation for it to work. Some of the resistance from within science itself derived from the prevailing lack of understanding about the nature of heredity. In the subsequent opinion of Sir Julian Huxley, grandson of Thomas Henry Huxley, who championed Darwin in his lifetime, it was this lack of understanding in particular that dogged confidence in Darwinian theory as science moved toward the end of the nineteenth century. In the opening chapters of his book, Evolution: The Modern Synthesis, Julian Huxley put his finger on the heart of the problem: “The really important criticisms have fallen upon Natural Selection as an evolutionary principle and centered round the nature of inheritable variation.”