The Boy Who Wasn't Short
Less spectacular, but arguably more important, was Knome’s exome service. The exome is the 1–2 per cent of the genome that codes for protein: all the exons of all the genes, plus a bit on either side of each of them. Exome sequencing has the advantage that it’s much cheaper to read the sequence of 2 per cent of the genome than the whole thing, and, because there’s not much we can interpret that isn’t in the exome, you don’t lose much diagnostic capacity by only sequencing the exome.
On 5 October 2009, a scientist called Daniel MacArthur wrote an article in Wired magazine about the launch of Knome’s US$24,500 exome service. Just five years later, MacArthur would become famous throughout the world of genetics for his leadership of the ExAC project, a collection of exome data from more than 60,000 people. ExAC, later reincarnated as an even larger database called gnomAD, has become arguably the single most important tool for interpreting the results of genetic testing.
Back in 2009, however, exome sequencing still seemed a long way from reaching the clinic. At those prices, it remained the province of rich individuals and very well-resourced research laboratories. Even when it seemed obvious that this would become a diagnostic test at some point, it was hard to imagine how soon it would happen.
The driver behind all this was technology. The Human Genome Project had relied on Sanger sequencing, amplifying and then reading the sequence of short stretches of DNA — typically a few hundred bases at a time. This works very well if your target isn’t too big. A typical gene might have 10–20 exons. Amplifying up 10–20 small stretches of DNA, then sequencing that amplified DNA, and comparing the sequence you find with the sequence you expect is a fair bit of work, but it’s achievable. It’s a bit like being presented with a large book but being asked to proofread only the chapter titles. You can even sequence the entire genome that way (proofreading the whole book). That’s how it was done first time round, after all — but you’re talking about billions of dollars and years of work. That’s true even today, if someone were mad enough to try it. Even attempting exome sequencing on an individual human using the old technology would be an impossibly daunting task: 300,000 separate stretches of DNA would need to be amplified, sequenced, and read.
It’s obvious, then, that if you want to turn large-scale sequencing into an accessible test, you have to approach the problem in a different way. Right now, there are at least half a dozen different technologies that can do this, using a variety of different chemical tricks to do the job. At their core is a single unifying concept: reading many, many individual strands of DNA all at the same time. It’s known as massively parallel sequencing, and the first to come up with a way to do it was a now-defunct company called 454 Life Sciences. The name is a bit of a mystery: there’s a rumour that the company’s original street address was at number 454, but it has also been suggested that this is the temperature (in Fahrenheit) at which money burns.
The company was founded by Jonathan Rothberg, one of the great inventors and entrepreneurs of the modern age of genetics. While still a student, Rothberg founded one of the first genomics companies, CuraGen, which was the parent company of 454. He went on to found two other important (and much better named) genomics companies, RainDance and Ion Torrent, among many others.
Rothberg’s original motivation to get genetic sequencing going faster was the experience of one of his children being seriously ill in the newborn period. Rothberg thought that his child’s doctors ought to be able to do a rapid genetic test to make sure that babies like his didn’t have a genetic condition. He can fairly be said to have achieved that goal — the lab I work in uses an Ion Torrent sequencer, called a Proton, to do rapid exome sequencing to diagnose genetic conditions in sick babies.
The 454 company had a number of notable successes in the few years before the development of newer, faster, cheaper sequencers rendered their machines obsolete. James Watson’s genome was sequenced by 454 Life Sciences; and Svante Pääbo, the evolutionary geneticist, used 454 sequencing for his first draft of the Neanderthal genome. This was the work that revealed that, in a way, Neanderthals are not completely extinct — thanks to interbreeding, most human genomes are a few per cent Neanderthal, and about a fifth of the whole Neanderthal genome lives on in the genomes of modern humans.2
[2 We’ve since learned that our genomes hold traces of other ancient breeds of humans, including the Denisovans, named for the Russian cave in which a finger bone and a tooth were discovered in 2008.]
To sequence the genome of a Neanderthal, you need to be able to work with tiny amounts of DNA — the little that has been preserved over the millennia. That DNA has been fragmented and degraded and is at risk of contamination by even minuscule amounts of modern human DNA. Sequencing such a genome is very different from working with the genome of a living creature that can keep on making more DNA to harvest.
Which brings us to one of the great unsung heroes of modern genetics — I’m referring to none other than NA12878. That may not seem like much of a name, but it’s one that’s famous among laboratory geneticists, because it’s the code for the Genome In A Bottle. In truth, there are numerous Genomes in Bottles, not just one, but NA12878 is far and away the most famous and widely used. This code refers to DNA from a woman who lived in Utah in 1980. We know that, back then, her parents were still alive and she had 11 children (six sons and five daughters). She and her parents consented to extensive use of their DNA; consent was also given for her children (we don’t know if they were old enough to consent for themselves at the time). Some of her cells were grown in the lab in a way that generated an essentially immortal line of cells, and huge quantities of DNA have been harvested from those cells.
That DNA has had its genome sequenced over … and over … and over again. Practically everything that we can know about a person’s genome is known about NA12878. As a result, she has become the gold standard. Just as all metric measurements used to trace back to a standard kilogram and a standard metre kept in sealed containers in Paris, so virtually all of the world’s genomic laboratories refer back to this one woman’s genome as their benchmark. You can buy tubes of NA12878 DNA — hence ‘Genome In A Bottle’ — to use as standard material. The lab I work in sequences her exome twice each month as a quality measure, to make sure the accuracy of our sequencing remains high. If we sequence NA12878, we know exactly what we should find, at every position in the genome; any differences from the known sequence must be errors. If Watson, Venter, and Stoicescu were the first, second, and third humans ever to have their genomes sequenced, NA12878 is undoubtedly the most sequenced human in the world, by a huge margin. When her ‘name’ comes up, I often wonder if she is still alive. It’s 40 years since that simple act of altruism, when she gave a sample of blood for a research project. If she’s still around — does she know how important she is?
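For readers who like to see the logic spelled out, the quality check described here can be sketched in a few lines of Python. This is only an illustration, not the software any particular lab uses: the function name `compare_to_truth`, the position keys, and the genotype strings are all invented for the example, and a real comparison would work from full variant files rather than small dictionaries.

```python
def compare_to_truth(truth, observed):
    """Compare a lab's genotype calls with a trusted reference set.

    Both arguments map a position, e.g. ("chr1", 100), to a genotype
    string such as "A/G". The layout is an assumption for this sketch.
    """
    shared = truth.keys() & observed.keys()
    # Positions where the lab's call disagrees with the known answer.
    mismatches = sorted(pos for pos in shared if truth[pos] != observed[pos])
    # Positions in the reference set that the lab did not call at all.
    missing = sorted(truth.keys() - observed.keys())
    matches = len(shared) - len(mismatches)
    concordance = matches / len(truth) if truth else 0.0
    return {"concordance": concordance, "mismatches": mismatches, "missing": missing}


# Invented toy data: four known positions, three of them called by the lab.
truth = {
    ("chr1", 100): "A/A",
    ("chr1", 200): "A/G",
    ("chr2", 50): "C/C",
    ("chr2", 75): "T/T",
}
observed = {
    ("chr1", 100): "A/A",
    ("chr1", 200): "A/A",  # disagrees with the known answer
    ("chr2", 50): "C/C",
}

report = compare_to_truth(truth, observed)
print(report)
```

Because the reference answer is assumed to be right at every position, each mismatch or missing call counts against the lab's run rather than against the sample.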
Over the past decade, the applications of the new sequencing technology have moved from the realm of science fiction through to established reality and now to routine clinical testing. The impact this has had on genetics has been transformative — it has been truly wonderful to be around to see it. As a clinician, I spent many years seeing children with intellectual disability or other complex medical problems that might possibly have a genetic cause. Occasionally, we would make a diagnosis based on a pattern of features. Most often, though, we would do the tests available to us, think hard, consult databases or maybe even the Dysmorphology Club (see chapter 7) … and still come up empty-handed.
As a result, there was a whole branch of the genetics medical literature devoted to empiric recurrence risks. The idea of this was to look at families in which a child was affected by a condition, then see what happened to other children born in the family — simply counting the number of affected and unaffected children to get a percentage. Then, if we saw a child with that condition, we could use those figures to estimate the likelihood that a future brother or sister would also be affected. For intellectual disability, the figures varied from study to study but clustered around the 5–10 per cent mark. A 10 per cent chance that another child will have a severe problem is about as difficult as it gets for most couples who are trying to decide whether to have another child: it’s a figure that’s not very high, but also not really low. Do you take your chances? You might be waiting a long time before you know whether your next baby also has intellectual disability.
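The counting behind an empiric recurrence risk is simple enough to write down. The sketch below is a bare-bones illustration with invented numbers, not data from any real study; the function name `empiric_recurrence_risk` and the family tallies are assumptions made for the example.

```python
def empiric_recurrence_risk(families):
    """Estimate recurrence risk by simple counting.

    `families` is a list of (affected, total) pairs, one per family,
    counting only the children born after the first affected child.
    """
    affected = sum(a for a, _ in families)
    total = sum(t for _, t in families)
    if total == 0:
        raise ValueError("no later-born children to count")
    return affected / total


# Invented tallies for seven hypothetical families.
families = [(0, 2), (1, 3), (0, 1), (0, 4), (1, 2), (0, 3), (0, 5)]

risk = empiric_recurrence_risk(families)
print(f"Empiric recurrence risk: {risk:.0%}")
```

With these made-up tallies, two of the twenty later-born children are affected. Nothing in the calculation says why a condition recurs; it only records how often it did.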
Now, our ability to make a diagnosis has dramatically improved — not because we are any better at our jobs, but because we have new and vastly improved tools to work with. If you consider children who had already had a chromosome test, we used to diagnose maybe one child in 20 with intellectual disability using the old approach. Now, for the more severely affected children at least, it’s more like 50 per cent, and there are some groups for whom we can do even better than that. Better still, it’s really common that the child’s condition is due to a ‘de novo’ change in a gene — one that wasn’t present in either parent. This is good news because it means a low chance that other children will have the same problem.
That chance isn’t zero, because of a phenomenon called mosaicism. A mosaic tiled floor has a mixture of different-coloured tiles; similarly, someone who is a mosaic for a change in a gene has a mixture of cells, some with the change and some without. As we saw in chapter 3, in one sense all of us are mosaics, because of the mistakes that happen during cell division. Usually, this has no obvious effect, except for the tiny fraction of such mistakes that lead to the development of cancer. However, a change that happens in the first few cell divisions after conception might wind up being present in a substantial fraction of a person’s cells, and can sometimes cause features of a genetic condition. These are often less severe than if the change is present in every cell, and can be confined to just one part of the body. If the condition affects the skin, it may be possible to tell that a person is a mosaic just by looking at them. In that case, we often see streaks of skin that look different, in a distinctive swirling pattern that follows the lines of Blaschko — the lines the dividing cells migrate along during the development of the embryo.3
[3 You can also think of everyone who has two X chromosomes (including most women) as being mosaic, because any variation in a gene on one of the two X chromosomes will be expressed only in the cells that have that copy of the X switched on. As a result, there are some X-linked conditions in which affected women have skin changes in a pattern that follows the lines of Blaschko. One such condition is called Goltz syndrome; another is the evocatively named incontinentia pigmenti.]
A change that happens a little later after conception might wind up being confined to just a patch of cells, somewhere in the body. If a parent has a patch of cells like this in their testicles or ovaries, they can make more than one sperm or egg that has the change, and thus have more than one child affected by the same condition, despite having no sign of the change when we test their blood. This situation is called gonadal mosaicism: a mosaic state present in the gonads (the testicles or ovaries). The chance of a second affected child being born might be quite low — if most cells in the gonad have two normal copies of the gene — or might be as high as if the change were in every cell of the parent’s body.
In practice, it is rare for more than one child in a family to be born with a genetic condition due to gonadal mosaicism — I’ve only seen it happen a handful of times — but it means we can never completely rule out another affected child, even when the first child’s problem is due to a genetic change that we can’t detect in the parents.
Nevertheless, you might wonder why, if there were genetic diagnoses waiting to be made in so many of our patients, we were so lousy at making them in the past. There are a few different reasons for this. Some of the conditions we are seeing today were only identified recently, because of the availability of exome sequencing. The pace of discovery is so fast that, if we do exome sequencing and don’t get an answer, one of the best follow-up tests we can do is just to come back and reanalyse the original data after a year or so. Surprisingly often, new discoveries made in that time mean that we can interpret data that we couldn’t understand first time round, and we can make a diagnosis.
Some conditions were thought to be very rare indeed, but have turned out to be much more common than we thought — but also more variable, so that most cases are hard to recognise. And some conditions are just extremely rare; it’s impossible for any doctor to know about all of the thousands of rare conditions out there, and the diagnostic databases we consult are incomplete and imperfect.
At the moment, most of the time, we use the new technology either for exome sequencing or to look at a more limited, targeted list of genes — the latter is called a gene panel. If you know there are only ten genes linked to a particular condition, it might not be worth your while sequencing more than 20,000. Sometimes, that’s exactly what we do: we sequence every gene and then ignore almost all of the data, analysing only the parts we are interested in (I tell the story of a time we did this in chapter 10). It’s already clear that, within a few years, when the costs come down a bit more, we will stop bothering with exome sequencing, and perhaps we will abandon panels, too. Instead, we’ll just do whole genome sequencing, which is already a bit better than exome sequencing at giving us answers, and is likely to improve further. At that point, we’ll probably also stop doing many of the chromosome tests we currently do, since the same information — in greater detail — will be available from the genome.
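The trick of sequencing every gene and then analysing only a chosen few amounts to a filter applied after the fact. Here is a minimal sketch of that idea, assuming each variant is recorded with the name of the gene it falls in; the function name `apply_virtual_panel`, the record layout, and the gene names are placeholders invented for illustration.

```python
def apply_virtual_panel(variants, panel_genes):
    """Keep only the variants that fall in genes on the chosen list."""
    panel = {gene.upper() for gene in panel_genes}
    return [v for v in variants if v["gene"].upper() in panel]


# Invented variant records; the gene names are placeholders, not real genes.
variants = [
    {"gene": "GENE_A", "position": 1200, "change": "C>T"},
    {"gene": "GENE_B", "position": 88, "change": "G>A"},
    {"gene": "GENE_C", "position": 4510, "change": "A>G"},
    {"gene": "GENE_D", "position": 73, "change": "T>C"},
]
panel_genes = ["GENE_A", "GENE_C"]

on_panel = apply_virtual_panel(variants, panel_genes)
print([v["gene"] for v in on_panel])
```

The rest of the data is ignored rather than discarded, so the same sequence can be filtered again later with a different list of genes.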
So that’s all great: everything in the garden is roses. But — you knew there was a ‘but’ coming, right? — there are some problems we have to contend with. The biggest of these is dealing with the unknown.
The paper describing James Watson’s genome, published in Nature in 2008, contained a clear signal that things were not always going to be straightforward. The researchers who studied Watson’s genome saw what seemed like an anomaly, and had a stab at explaining it. With the benefit of that powerful tool the retrospectoscope, I can tell you that their explanation was completely wrong.
The clue was this: Watson was found to be a carrier of ten different genetic changes that had been reported as disease-causing in autosomal recessive conditions. These are conditions in which an affected person has changes in both copies of a gene that is located on one of the autosomes (chromosomes 1–22, i.e. not the sex chromosomes). A person who has one normal and one faulty copy of such a gene suffers no ill effects; such a person is a carrier for the condition. It seemed very likely that if Watson carried ten that were already known to genetic medicine, he probably carried at least a few others that hadn’t yet been reported. The problem with this finding was that, by studying what happens when first cousins and other relatives have children together, the longstanding scientific best guess was that each of us is a carrier for perhaps one or two different recessive conditions. Studies in fish came to a pretty similar conclusion,4 interestingly. So — why did Watson carry maybe ten times as many recessives as expected? To say the authors of the paper attempted an explanation is perhaps a bit of a stretch: they essentially said, ‘he just happens to have a lot of these … and maybe other people do too’.
[4 Fish were caught in the wild, then allowed to breed, with brother–sister matings set up in order to see what would happen. This sort of thing is frowned upon in human genetics.]
The alternative — correct — explanation would become apparent over the next few years. Watson carried ten different genetic variants that had been reported in scientific publications as disease-causing. Today, only one of those is still thought to be disease-causing.5 The others are harmless variants that were mistakenly reported as troublemakers.
[5 ‘Disease-causing’ only when there’s also a mutation affecting the other copy of the gene. If someone is a carrier of such a variant, and their other copy of the gene is normal, they will suffer no ill effects.]
How was it possible for this to happen? The root cause of the problem is the enormous amount of variation in the human genome. If we were to compare your genome and mine, there would be three million places at which we differed from each other, and similarly each of us has millions of differences from the ‘reference’ genome. There is no one ‘normal’ human genome — if there are 7.7 billion people in the world today, there are perhaps 7.65 billion different human genomes (accounting for identical twins, who share their genomes). The ‘reference’ genome is a standard of sorts, but differences from that reference are not necessarily abnormal — in fact, almost all of the variations each of us has are harmless. Some are mildly helpful to us, some mildly harmful, and only a tiny proportion have the potential to cause a recognised genetic condition. A lot of that variation sits in between the genes, or is in a gene, but not in a place that affects the protein code for that gene. When we sequence a person’s exome, we tend to find about 40,000 variants that result in a difference in the parts of genes that code for protein. Some of those are very common, some are rare, and some are apparently unique. Even today, if we sequenced your genome, it is virtually certain that we would find numerous genetic variants that had never been seen before. The only exception to that would be if members of your family, particularly your parents, had already had sequencing done.
Suppose you do exome sequencing in someone who has a condition you think likely to be genetic, and caused by a change in a single gene. The starting point is that you have to sift through 40,000 protein-changing variants to find the one or two that you are looking for. You are looking for needles in a big stack of needles.