Life's Greatest Secret
Studies of genetic variation have led to radical new drug treatments that will transform the health of millions of people around the globe. At the beginning of the century, it was noticed that some families with extremely high levels of cholesterol had a particular form of a gene known as PCSK9. It then appeared that some people with very low levels of cholesterol had a mutation in this gene. In an extremely short period, drugs were developed to target the PCSK9 protein, and these should become available in 2015.53 Scientists are now trawling through data from populations around the world, looking for genetic variants that correlate with particular health conditions and which could provide an insight into new drug development. Sequencing is beginning to transform medicine.
These technical developments highlight the ingenuity of molecular biologists, engineers and computer scientists, but they have created an intriguing new problem. We now have thousands of genomes sequenced, and the rate at which they are being completed has grown exponentially, outstripping our ability to analyse them.54 In July 2011, only 36 eukaryotic genomes had been sequenced; a year later, another 140 had been added; by 2014, 5,628 eukaryotic genome sequences had been either begun or completed, and 36,000 prokaryotic genomes had been sequenced.55 At the time of writing, the largest known genome is that of the loblolly pine tree, which comes in at a whacking 22 billion base pairs – about seven times the size of the human genome.56 In contrast, the microbe Nasuia deltocephalinicola has a genome of just 112,000 base pairs – this organism is found uniquely in the guts of leafhopper insects, so that much of its metabolic work is done by its host.57 With fewer chemical reactions to process, its genome has gradually become reduced in size over the 260 million years that the microbe has been living in the insect, losing unnecessary protein-coding genes much as a parasitic animal loses unnecessary anatomical structures. Scientists have calculated that such symbionts could get by with as few as 93 protein-coding genes, which would probably fit into a genome of merely 70,000 base pairs.58
Producing a genomic sequence is now relatively simple, at least compared with the effort involved in the pioneering studies. The problem begins when you try to understand what the genome actually does. One of the main tasks when a genome has been completed is to annotate it, identifying genes and their exons and introns, and above all finding genes that have equivalents in other organisms, preferably with some kind of known function. Often the only basis for identifying the function of a gene is that its DNA sequence is similar to that of a gene in a different organism where a function has been demonstrated. This has led to a new discipline called genomics, which involves obtaining genomes and understanding their nature and evolution. It includes a new set of techniques, collectively called bioinformatics, which combine computing and population genetics to make inferences about the patterns of evolution and enable us to determine which genes have a common origin or function. Training biologists in the techniques of computer science will be an important part of twenty-first-century scientific education.
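The idea of inferring function from sequence similarity can be illustrated with a very rough sketch. It assumes the sequences have already been aligned and are of equal length, which real annotation tools do not require; the gene names and sequences are invented for illustration, and the function names are hypothetical rather than taken from any real bioinformatics library.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two pre-aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (pre-aligned)")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


def best_match(query, references):
    """Return (name, identity) for the reference most similar to the query."""
    scored = {name: percent_identity(query, seq) for name, seq in references.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]


# Made-up reference genes with a "known" function (illustrative only).
KNOWN_GENES = {
    "hypothetical_kinase": "ATGGCCAAGTTGACC",
    "hypothetical_transporter": "ATGTTTCCCGGGAAA",
}

if __name__ == "__main__":
    # A newly sequenced gene differing from the kinase at one position.
    print(best_match("ATGGCCAAGTTCACC", KNOWN_GENES))
```

A high score suggests, but does not prove, a shared function; that caveat is exactly why such annotations remain provisional until a function has been demonstrated experimentally.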
One of the most far-reaching scientific consequences of sequencing came with the work of Carl Woese, who realised in the 1960s that he could use the RNA found in ribosomes (rRNA), which is common to every organism on the planet, to study patterns of evolution. Woese began studying variation in the nucleotide sequence of the 16S rRNA subunit in a range of bacteria, and by the mid-1970s he had sequenced part of this rRNA from around thirty species – the work was extremely slow and arduous. In 1977, Woese published two papers with George Fox in which they claimed that prokaryotes – single-celled organisms with no nucleus – were not a single group with a common evolutionary history. Basing their analysis on the rRNA sequences – a far more rigorous approach than the mixture of morphological, physiological and ecological data that had previously been employed – Woese and Fox proposed to split the bacteria into two groups: the Eubacteria (or true bacteria) and the Archaebacteria.59 The data showed clearly that the Archaea, as they were later called, were no more closely related to bacteria than they were to eukaryotes, like you and me. Eventually this led Woese to propose that life evolved into what he called three domains: Bacteria, Archaea and Eukaryota.
For the past twenty years or so, this view has been widely accepted, and it appears in university-level textbooks. But it looks as though it is probably wrong. More extensive analyses of ribosomal RNA and of protein-encoding genes suggest that there are only two primary domains – Archaea and Bacteria, with the Eukaryota being formed when an Archaean microbe engulfed the bacterial ancestor of the mitochondria.60 According to this view, the Archaea and the Bacteria make up the two great branches of life, with the Eukaryota positioned as a sudden genetic bridge between them.
Studies of evolution using genetic sequences have resolved fascinating issues that have troubled biologists for decades. For example, we now have confirmation that dogs are merely domesticated wolves, and we even have competing scenarios for where and when they were first domesticated, and why.61 Meanwhile the genome of Pleurobrachia bachei, the Pacific sea gooseberry – a ctenophore, or comb jelly – suggests that it evolved its neurons and its musculature separately from the rest of the animals, indicating that these two fundamental features of animal anatomy probably evolved at least twice.62 Some of the results have been surprising: insects, it appears, are simply one form of crustacean. Despite appearances (insects have three pairs of legs and do not live in the sea; crustaceans generally have lots of appendages and are mostly marine), sequence analysis suggests that insects nestle as one very large branch in the middle of an evolutionary tree of crustaceans.63
Findings such as this, which are being published every week, are confirmation of Crick’s far-seeing prediction, made in his 1957 ‘central dogma’ lecture, that the study of amino acid sequences would reveal ‘vast amounts of evolutionary information’.64 The only thing wrong with Crick’s vision was that he was not ambitious enough: we can now directly compare DNA sequences. There is even a web site and an accompanying free app called Timetree that allows ordinary members of the public to interrogate the sequence databases and discover how long ago different groups of organisms separated, enabling you to settle those annoying evolutionary arguments such as whether a hippo is more closely related to a rhino or to a whale.65*
*
One major development that has occurred over the past few years would have been dismissed as science fiction by Crick and by virtually every other twentieth-century scientist. We can now look back deep into evolutionary time: if the preservation conditions are favourable (preferably very cold, and not acidic), and extremely careful techniques are used to avoid contamination, then reliable DNA sequences can be obtained from samples that are up to 700,000 years old – that is the age of a horse bone from the Yukon permafrost that has been analysed.66 It is quite possible that older samples will be sequenced in the future, although the prospect of extracting DNA from dinosaur fossils or from amber from the age of the dinosaurs will almost certainly remain the realm of science fiction.67
The advent of what is called palaeogenomics has led to a wave of evolutionary genetic studies of extinct organisms and above all a focus on our closest relatives, the Neanderthals – extinct members of the human lineage who lived in Europe until about 30,000 years ago. The results have been truly astounding.68 The driving force behind much of this work has been the German-based Swedish–Estonian scientist, Svante Pääbo, who for nearly three decades has pioneered the study of ancient DNA.69 In 1997, Pääbo, who is based at the Max Planck Institute for Evolutionary Anthropology in Leipzig, shocked the scientific world by sequencing mitochondrial DNA from Neanderthals; in 2010, Pääbo’s grand ambition was realised when the draft sequence of the Neanderthal nuclear genome was published, all 3 billion base pairs of it.70
The sequencing of the Neanderthal genome was a technical tour de force and provided astonishing information about human history. To everyone’s surprise, including Pääbo’s, it revealed that, somewhere along the line, Neanderthals mated with humans, and vice versa. This sexual activity – which we now think first occurred up to 58,000 years ago – produced babies who survived and left offspring.71 The genetic traces of these people can be found in the genomes of all humans in the world except Africans: the ancestors of non-African human populations left Africa and then met the Neanderthals; Africans never encountered Neanderthals. In their final phase, the Neanderthals lived in a mosaic of populations across Europe; over a period of at most 5,400 years, humans and Neanderthals overlapped and interacted.72 The genetic consequences of some of those interactions live on in us.
Around 3 per cent of the genome of non-African human populations is composed of Neanderthal genes, with some of the characters involved being those to do with skin colour and the immune response, which apparently gave our ancestors an advantage; in contrast, some Neanderthal genes caused reduced male fertility in a human genetic background.73 We are only beginning to explore this unexpected part of our past and its consequences for our understanding of what it is to be human. There are only about ninety genetic differences that lead to different amino acids in humans and Neanderthals – so at most ninety of our proteins are different.74 Whatever differences there were between us were probably based more on variation in the regulatory parts of our genomes, which control how, where and when genes are active.
The most surprising proof of the power of palaeogenomics came in 2010, when Pääbo announced the existence of an entirely new and unsuspected group of extinct human relatives, known as the Denisovans who, like the Neanderthals, interbred with humans.75 This discovery was based solely on the DNA analysis of a 40,000-year-old tiny finger bone from a young girl that had been found in a cave in Denisova in Siberia. The Denisovans branched off from the Neanderthals about 300,000 years ago but still interbred with humans, leaving their traces in today’s Polynesian populations, who may have encountered them as the ancestors of the Polynesians slowly migrated through South-East Asia. We have no idea what the Denisovans looked like (all we have is the finger bone, a tooth and a toe bone), but we know that they interbred with humans, leaving traces of their DNA in us. One of the clearest examples of natural selection in the human genome – the existence of an adaptation to living at high altitude, seen in modern Tibetan populations – turns out to have originated with the Denisovans.76 At some point, ancestors of the Tibetans mated with the Denisovans and acquired a gene that enabled their present-day descendants to survive in low oxygen conditions.77 Intriguingly, genomic comparison of modern humans, Denisovans and Neanderthals suggests that interbreeding may have also taken place in Asia and in sub-Saharan Africa with other, unknown members of our lineage.78 This is a golden age in the study of human evolution, thanks to the breakthroughs produced by the advent of ancient DNA sequencing.*
*
For the general public, even more intriguing than the spectre of our ancestors mating with Neanderthals is how those billions of letters in our DNA sequence make us both human and therefore alike, and unique and therefore different. The study of the organisation of the genetic code has led to some unexpected and mysterious discoveries, as well as a great deal of scientific argument.
In 2000, Dr Ewan Birney set up a sweepstake inviting scientists to guess how many protein-encoding genes would be identified once the human genome was sequenced. The Drosophila genome, which had just been published, contained around 13,500 genes, and entrants in the sweepstake chose numbers between 26,000 and 140,000. The number announced in 2003 was, to everyone’s surprise, around 21,000 (the prize was split three ways, although none of the entrants came within 5,000 genes of the actual figure).79 Despite many claims at the time that this number would creep up, perhaps to as high as 40,000, the currently accepted number of protein-encoding human genes is around 19,000.80 Alternative splicing may result in many different protein variants being produced, but even so, this is a counter-intuitively small figure. Around 10 per cent of our genome – twice as much as is taken up by protein-encoding genes – is composed of regulatory genes that control the activity of protein-encoding genes. In all species, biological richness resides not simply in the sheer number or variety of protein-encoding genes but above all in the way in which those genes are activated at different points in time, in different tissues and in response to different environmental stimuli.
When Jacob and Monod put forward their operon model in 1961, they initially suggested that the lac operon repressor, the gene product that affects the activity of the lac gene, was an RNA molecule. Although it soon became apparent that the lac repressor was in fact a protein, their suggestion that RNA might be involved in gene regulation was extremely prescient. At the end of the 1960s, Roy Britten and Eric Davidson argued that networks of RNA-producing genes controlled the activity of genes in different tissues at different points in development.81 Britten and Davidson suggested that most genes were what Jacob and Monod had called regulator or regulatory genes, producing RNA that would control the activity of protein-encoding structural genes. The activity of those genes is far more complex than Jacob and Monod’s repressor.
We now know that there are many different forms of RNA, often consisting of very short sequences of fewer than a couple of dozen nucleotides, that are produced as part of the complex networks that control gene expression.82 These RNA sequences bind to the DNA of the structural gene and are often produced by the complementary DNA strand in the double helix – this is called anti-sense RNA.83 A large proportion of RNA transcripts produced by mammalian genomes have an anti-sense counterpart that seems to be involved in gene regulation.84 Among the different kinds of regulatory sequences that are known to exist, promoters are sequences that are found just before the beginning of the coding gene; they allow an enzyme to begin the process of transcribing DNA into RNA – they effectively act as an ‘on’ switch.85 Promoter sequences can also be targets for transcription factors – proteins produced by regulatory genes. In eukaryotes, some regulatory stretches of DNA called enhancer regions activate the promoter; these enhancers can be thousands of bases distant from the protein-encoding part of the gene. They exert their effect when the DNA forms a loop, bringing two distant parts of the molecule into relative proximity. So not only are eukaryotic genes in pieces, their constituent parts can also be spread far outside the area containing the protein-encoding exons.
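The relationship between a gene and its anti-sense RNA, described earlier in this passage, can be sketched in a few lines. This is a simplified illustration: it assumes the textbook picture in which the anti-sense transcript is read from the opposite DNA strand and is therefore the reverse complement of the sense transcript. The example sequence is invented and the function names are hypothetical.

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Sequence of the opposite DNA strand, read in its own 5'-to-3' direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def sense_rna(coding_strand):
    """Sense transcript: same sequence as the coding strand, with U for T."""
    return coding_strand.upper().replace("T", "U")


def antisense_rna(coding_strand):
    """Anti-sense transcript: produced from the opposite (complementary) strand."""
    return reverse_complement(coding_strand).replace("T", "U")


def can_pair(rna_a, rna_b):
    """True if two RNAs are perfectly complementary when aligned antiparallel."""
    if len(rna_a) != len(rna_b):
        return False
    return all(RNA_COMPLEMENT[a] == b for a, b in zip(rna_a, reversed(rna_b)))


if __name__ == "__main__":
    gene = "ATGGCC"  # made-up fragment of a coding strand
    print(sense_rna(gene), antisense_rna(gene))
    print(can_pair(sense_rna(gene), antisense_rna(gene)))
```

Because the two transcripts are complementary, they can base-pair with each other, which is one way in which an anti-sense RNA could interfere with the expression of the corresponding gene.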
The multiple roles of nucleic acids have expanded far beyond the initial definition of a gene as the fundamental unit of inheritance and show the inadequacy of Beadle and Tatum’s 1941 suggestion that each gene encodes an enzyme. As a consequence, some philosophers and scientists have suggested that we need a new definition of ‘gene’, and have come up with various complex alternatives.86 Most biologists have ignored these suggestions, just as they passed over the argument by Pontecorvo and Lederberg in the 1950s that the term ‘gene’ was obsolete.87
In 2006, a group of scientists came up with a cumbersome definition of ‘gene’ that sought to cover most of the meanings: ‘A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions.’88 In reality, definitions such as ‘a stretch of DNA that is transcribed into RNA’, or ‘a DNA segment that contributes to phenotype/function’, seem to work in most circumstances.89 There are exceptions, but biologists are used to exceptions, which are found in every area of the study of life. The chaotic varieties of elements in our genome resist simple definitions because they have evolved over billions of years and have been continually sieved by natural selection. This explains why nucleic acids and the cellular systems that are required for them to function do not have the same strictly definable nature as the fundamental units of physics or chemistry.
*
The fact that about 5 per cent of our genome is made up of protein-encoding structural genes, whereas around 10 per cent is composed of regulatory genes producing RNA or protein-based transcription factors, raises the question of what the other 2.7 billion base pairs are there for. Long before the human genome was sequenced, it was obvious that our genome is not simply made up of protein-encoding DNA. Apart from the stretches of apparently functionless introns that break up most of our protein-encoding genes, scientists realised that there were substantial parts of genomes that have no apparent function and that represent genetic fossils, remnants of the evolutionary past.
Genes that no longer give an organism an advantage tend to accumulate mutations and eventually cease to function altogether, but their DNA ghost remains in the genome, a sign of what once was. The human genome, like all genomes, contains many of these non-functional elements, which are called pseudogenes. One of the clearest examples of how this process works can be seen in whales and dolphins (cetaceans). These aquatic mammals use their noses only to breathe every few minutes when they surface briefly; they therefore have very little use for the sense of smell that their terrestrial ancestors possessed 40 million years ago. The genomes of cetaceans contain genes that once coded for smell receptors (olfactory receptor genes) but which now no longer function because the animals spend virtually all their lives with their heads under the water, unable to smell in the air. Over millions of years these olfactory receptor genes have become riddled with mutations and are now pseudogenes – non-coding stretches of DNA that nonetheless retain certain sequences of their functional ancestors and can be identified by comparison with the intact olfactory receptor genes of terrestrial mammals.90 These pseudogenes provide evidence of evolution by natural selection: the only explanation of their presence in the genome of cetaceans is that these animals were once terrestrial and had a use for their sense of smell.
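One crude way to picture how a gene becomes a pseudogene is to look for mutations that would stop a protein from being made. The sketch below flags a coding sequence as probably disabled if it contains a stop codon before its final codon, or if its length is no longer a multiple of three, which is only a rough proxy for a frameshift. Real pseudogene detection relies on far more careful comparison with intact genes; the sequences and function names here are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(cds):
    """Split a coding sequence into complete three-letter codons."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def premature_stop_index(cds):
    """Index of the first stop codon before the final codon, or None."""
    for index, codon in enumerate(codons(cds)[:-1]):
        if codon in STOP_CODONS:
            return index
    return None


def looks_disabled(cds):
    """Rough test: incomplete reading frame or an early stop codon."""
    return len(cds) % 3 != 0 or premature_stop_index(cds) is not None


if __name__ == "__main__":
    intact = "ATGGCTTGGAAATAA"   # made-up gene: ATG GCT TGG AAA TAA
    mutated = "ATGGCTTGAAAATAA"  # one G-to-A change turns TGG into the stop TGA
    print(looks_disabled(intact), looks_disabled(mutated))
```

In this toy case a single base change is enough to truncate the protein, after which nothing prevents further mutations from accumulating in the rest of the sequence.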