Episode 122 (32 min 49 sec)
The ones and zeroes of life: Computational biology and the trajectory of scale
Justin Zobel is Professor of Computational Bioinformatics at the Department of Computer Science and Software Engineering, and Program leader and Principal Research Fellow at National ICT Australia (NICTA).
Professor Zobel has research interests in bioinformatics, search engines and information retrieval, algorithms and data structures, and research methods.
Justin is also Editor-in-Chief (Asia & Australasia) of the International Journal of Information Retrieval.
Ajay Royyuru heads the Computational Biology Center at IBM Research, with research groups engaged in various projects including bioinformatics, protein science, functional genomics, systems biology, and computational neuroscience. Ajay joined IBM Research in 1998, initiating research in structural biology.
He obtained his Ph.D. in Molecular Biology from the Tata Institute of Fundamental Research, Mumbai, and his B.Sc. (Hons.) in Human Biology and M.Sc. in Biophysics from the All India Institute of Medical Sciences, New Delhi. Ajay did post-doctoral work in structural biology at Memorial Sloan-Kettering Cancer Center, New York.
Working with biologists and institutions around the world, he is engaged in research that will advance personalized, information-based medicine. Ajay leads the IBM Research team working with National Geographic Society on the Genographic Project. Ajay has authored numerous research publications and several patents in structural and computational biology.
Host: Dr Shane Huntington
Producers: Kelvin Param, Eric van Bemmel
Associate Producer: Dr Christine Bailey
Series Creators: Eric van Bemmel and Kelvin Param
Audio Engineer: Gavin Nebauer
Voiceover: Nerissa Hannink
Welcome to Up Close, the research, opinion and analysis podcast from the University of Melbourne, Australia.
I’m Shane Huntington. Thanks for joining us. The study of biological structure, specifically DNA, has led to a revolution in the way we look at living species across the world. Recent advances in techniques for measuring DNA have led to an explosion of new fields of study, along with significant data management and analysis requirements.
One important endeavour in this area is the Genographic Project, which seeks to chart historical human migratory patterns through crunching enormous amounts of genetic data, collected from people around the world.
To tell us more about the project and its considerable technical challenges, we are joined from IBM headquarters by Dr Ajay Royyuru, Senior Manager of IBM's Computational Biology Centre in the Thomas J. Watson Research Center, Yorktown Heights, New York, and by Professor Justin Zobel, School of Computer Science and Software Engineering, here at the University of Melbourne, Australia.
Welcome to Up Close, Ajay and Justin.
Thank you, Shane. Good to be with you.
Ajay, I might start with you; if you wouldn’t mind just telling us a bit about the Genographic Project. First of all, who's behind it and what's its ultimate goal?
The Genographic Project started in the year 2005. It's a project created by the National Geographic Society - the researcher who leads the project there is Dr Spencer Wells - in partnership with IBM. We also brought in the Ted Waitt Family Foundation to sponsor some of the field research work.
For the last five years the project has been gathering human genetic data from across the whole planet. The goal of the project is to use genetic information to reconstruct human migratory history. We, as a species, originated on the continent of Africa roughly 250,000 years ago. In the last 50,000 to 70,000 years the human population has spread from that part of Africa to all parts of Africa and beyond, to all continents. It is that migratory history that we want to understand and reconstruct with the use of genetic data.
It turns out that our genetics is the most detailed history book that we could ever lay our hands on, because it's kept a very detailed record of who we are as people. Now that we have the ability to get to that information through advances in DNA sequencing, that you were alluding to, it is one means by which we can use that kind of genetic information; combine that with the other information - anthropological and archaeological and so on - to, hopefully, re-tell in much greater detail the story of who we are as people and how we got to be where we are today.
Ajay, can you give us an idea of why humans actually began to migrate 50,000 to 70,000 years ago?
Human migration has probably occurred for the entire duration of mankind's existence; other species have migrated too, both before and after the origin of the human species.
There are various theories of what might actually cause migration to occur. The most prevalent, or common, way of thinking about it is the availability of food, nutrition and natural resources. That is not a given, and it is not a fixed thing over these time scales. If you look at climate change, for example, the ice age and the Glacial Maximum trapped a lot of water in the ice caps and actually changed the patterns of vegetation, and therefore the abundance of food. That is one reason why human migration has occurred, particularly into various latitudes.
The second is, actually, the change in the boundaries of land masses that has occurred due to the appearance of glaciers and Glacial Maximum. That has made some land available for migration, like, for example, the migration from the Asian continent into the Americas through Siberia and into Alaska. That's something that occurred, quite likely, in the period when it was easier to migrate - not across the ocean, but across frozen land mass, when the glaciers were at their peak.
Those are some of the example factors that have contributed to migration.
How have we gone about looking at the scientific explanations of that migration over, say, the last hundred years? We didn't have DNA evidence until half a century ago. How did we piece the story together, in the scientific sense, over that period?
Prior to genetic evidence, I think the bulk of scientific evidence consisted of archaeological evidence: artefacts that have been discovered or stumbled upon. There has been a huge amount of science done to actually date and infer time. How do you infer time in a manner that is consistent across different discoveries at different sites?
A lot of science went into that. I think that's very, very useful and powerful evidence, which the genetic information we are looking at supplements. It doesn't replace those other, non-genetic, lines of evidence; it supplements them.
Ajay, let's talk now a bit more about the specifics of how we go about using DNA, and the subsequent technologies around the discovery of DNA, to look at this particular issue. How do we actually go about using DNA to study migration? What sort of information are we looking for?
There are two aspects of the use of DNA that we need to understand. One is that DNA effectively functions as a molecular clock. By that I mean that the DNA we inherit comes from our parents, and with some frequency or periodicity new changes or variations accumulate on that DNA.
Depending on which region of the DNA you are examining, one could actually use those accumulated changes to effectively mark time; time as in a certain number of generations, because the rate of change can be calibrated by observing populations. By calibrating this molecular clock, and using some statistical models of population genetics, I can look at the total count of variations that exist between your DNA and mine and infer that a common ancestor to you and me must have existed so many thousands, or tens of thousands, of years ago, because that is the amount of genetic change that has accumulated between us.
So that's the first notion, that we can use genetic data to infer common ancestry and, through that inferred common ancestry, we can possibly arrive at some date estimate. There will be errors, of course, with these estimates, but at least it is an estimate. So that's the first notion.
The second is that, since these genetic changes are actually cumulative - that is, my descendants are going to have all my genetic changes plus new ones that they acquire - populations are, essentially, getting marked with these changes as time progresses. So two individuals who have very few genetic changes between them are closer in relationship, and closer in time, with respect to a common ancestor.
Individuals that have many more genetic changes between them would have had a common ancestry much further back in time. So this is the sort of standard phylogenetic analysis that one can do with this kind of genetic data.
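To make the counting-and-dating idea Ajay describes concrete, here is a minimal Python sketch. The mutation rate and generation time below are illustrative round numbers chosen for the example, not the calibrated values population geneticists actually use.

```python
# Molecular-clock sketch: count differences between two aligned DNA
# sequences, then turn that count into a rough time-to-common-ancestor
# estimate. Both default parameters are illustrative, not calibrated.

def pairwise_differences(seq_a: str, seq_b: str) -> int:
    """Count positions where two equal-length aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def years_to_common_ancestor(differences: int, sites: int,
                             rate_per_site_per_gen: float = 1e-8,
                             years_per_generation: float = 25.0) -> float:
    """Rough estimate: changes accumulate on both lineages since the
    common ancestor, so halve the per-site difference before dividing
    by the per-generation mutation rate."""
    subs_per_site = (differences / sites) / 2
    generations = subs_per_site / rate_per_site_per_gen
    return generations * years_per_generation

# Two toy sequences differing at 2 of 10 sites.
d = pairwise_differences("ACGTACGTAC", "ACGAACGTTC")
```

Fewer differences mean a more recent common ancestor; more differences push the common ancestor further back, which is exactly the phylogenetic ordering described above.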
The novel thinking that this field of population genetics has begun to use is: how do you root the branches of the phylogenetic tree - the branches of the family tree, if you will - back into geography? In the Genographic Project we are attempting to do that by actually looking at genetic diversity as it occurs across the whole planet.
For example, if we can identify unique branches of the tree, and populations that are, today, rooted in specific geography, then I can associate those particular branches with those individuals or population groups today. Their ancestry is what is being carried by all populations that are also marked by that same genetic change.
That's why I think it is very important, in the Genographic Project, for us to be studying populations that are rooted in geography. Indigenous populations actually provide us that geographic anchor. We work very hard to actually work with populations around the world who would qualify for that genetic isolation or geographic location information that they bring to this study.
Ajay, I might get you to give us a quick lesson in a couple of the terms that are being used in this particular project and this particular field. There are four of them I would like you to address, if you could: DNA, base pairs, nucleotides and genome. Can you just give us a quick rundown on the differences between these ideas?
DNA is the molecule that carries the genetic information. It's a long molecule that's carried in practically all cells of our body. It is the basis of genetic heritage that we have.
DNA is a double stranded molecule. It's made up of units that are strung together. Think of them as beads on a string, if you will. There are two strings that are paired up against each other. A base pair is actually a unit of the DNA on one string that is, in structure, sitting right next to another unit on the other strand. Visualise it, perhaps, as a ladder, with each rung of the ladder made up of one unit from each side. That rung of the ladder is what we refer to as the base pair, and the unit is what we call the base or the nucleotide.
The nucleotide is just the chemical name for the unit. It's a nucleotide on each of these strings, or sides of the ladder, that together make up a base pair.
Genome is the entirety of the DNA that makes up the individual. For a multi-cellular organism like a human, all the DNA in every cell of my body actually has its origins back to the one cell embryo, which had one copy of all the DNA that I now have.
That copy of the DNA we can refer to as the genome. It is that genome that is replicated as the cell makes more copies of itself. So when we say a genome we actually mean the sequence of the DNA, reading from one end to the other. It is the same in each cell of my body.
Can you give us an idea - when you look at this entire genome and you want to sequence it and you want to have a look at what's in it - how do we go about that?
The genome is a store of information. The information is actually just a sequence of bases or nucleotides, when you read the DNA from one end to the other. Any attempt to read this information is typically referred to as sequencing the DNA. There are technologies available to sequence the DNA. Assuming that you have prepared the DNA sample in a certain way, so that this molecule of DNA, or copies of it, are now available for sequencing, you will put that through a sequencing machine.
There are different technologies of sequencing that are available today. The simplest and earliest is something called Sanger sequencing, which attempts to replicate the DNA. That means given one strand of DNA it attempts to make a replica or copy of it, as dictated by the Watson-Crick base pairing. The cleverness of Sanger sequencing was to actually cause termination of this replication process with specially modified nucleotides or bases so that when a chain terminates in that manner - let's say a chain terminates on an A - that A is going to give a unique signal of some kind: let's say an optical signal.
Thereby, when you attempt to replicate a very long sequence of DNA, which will randomly terminate on As, on Ts, on Gs and on Cs, you will be able to actually read the sequence: if all the molecules that terminated at a particular position, say position 53, read as an 'A', then your genome has an A at position 53, and so on. That's the original technology for DNA sequencing.
Subsequently, there have been many new technologies invented for DNA sequencing, and many more that are continuously being worked on, including some efforts that we are doing here in IBM to build nanotechnology based efforts for sequencing which, in the future, might give us ways to sequence a single molecule of DNA. That is basically how you are reading the sequence of DNA.
About 10 years ago, when the human genome project was finished, it was a heroic effort of the international community working for about 10 or 15 years to sequence the entire human genome from end to end.
I'm Shane Huntington. My guests today are Dr Ajay Royyuru and Professor Justin Zobel. We're talking about the challenges in computational biology, here on Up Close, coming to you from the University of Melbourne, Australia.
Justin, let me turn to you now. How does the gene sequencing itself get dealt with, in terms of digital representation? I assume, when we talk about these base pairs and so forth, these are things that you can very well put into a digital representation of some type?
That's correct, Shane. To answer your question by analogy, if we think of a regular piece of written English text, whether it's a newspaper article or a novel or an entry in an encyclopaedia, at base level it's a sequence of characters in the Latin alphabet; the sequence we often refer to as a string because we have spaces and punctuation and so on.
In a standard representation there are about 256 letters that we allow for. In DNA there are four letters we need to allow for: A, C, G and T. Other than the fact that the alphabet is smaller, the principle is exactly the same: we have a string of letters that can simply be stored in a computer.
Although there are only four to choose from we're talking about a large number of letters. How many are there in the human genome in total?
It depends precisely what you count. A single copy of the human genome is roughly three billion bases. There are, of course, two copies of the human genome in each cell in the human body: one inherited from the mother, the other inherited from the father; so there's a total of six billion. Most sequencing projects consider those together, so we're looking to store approximately three billion bases per human.
Other genomes are larger. Some cereals, as in agricultural cereals, are around 100 billion. Other things are much smaller: a virus is down under 10,000; a bacterium, around five million.
When we talk about a billion base pairs what does that translate into in terms of data storage and the amount of actual data we have?
The base alphabet, being only four letters, means that we can store bases more compactly than English characters: we can store four bases for every English character. An English character takes a byte. People are used to hearing about megabytes, gigabytes and so on, a giga being a billion bytes. Roughly speaking, a gigabyte is required for a human genome.
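The storage arithmetic Justin describes - four letters, so two bits per base and four bases per byte - can be sketched like this in Python. This is a toy packer: real tools also record the sequence length and handle ambiguity codes such as N.

```python
# Pack DNA into 2 bits per base: 4 bases fit in each byte, so a
# 3-billion-base genome needs about 750 MB instead of ~3 GB at one
# character per byte. (A trailing group of fewer than 4 bases still
# occupies a whole byte; a real format records the length separately.)

ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | ENCODE[base]
        out.append(byte)
    return bytes(out)

human_genome_bases = 3_000_000_000
packed_bytes = human_genome_bases // 4   # 750,000,000 bytes, ~0.75 GB
```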
So one gigabyte per person. We're probably talking about sequencing hundreds of thousands of people eventually.
The story is probably a lot worse than that because of the limitations in sequencing technology. To explain the commonest sequencing technology today, again, by analogy, imagine printing out an encyclopaedia, a large encyclopaedia, as a ribbon; not in nicely formatted pages, but as a long string of characters, perhaps 100 million characters long; a ribbon of about 100 kilometres or so.
Now imagine that there's a fault in the printing device, and for every long sequence of correct material that is printed out, another long series of nonsense of, perhaps, 10 times longer will also get printed. So, to get my 100 kilometres of encyclopaedia I will have another 900 kilometres of nonsense intermingled with it; getting me about 1000 kilometres.
I'll now make, as part of the sequencing process, some thousands of copies or more of that encyclopaedia. I'll tear them up into little pieces, two or three centimetres long. I'll throw most of those pieces in the rubbish and take the remainder - what looks like shredded paper - and try to reconstruct the original encyclopaedia. This is going to be an incomplete process. Some parts of the encyclopaedia will be missing, some parts will be ambiguous: I won't know whether a particular two or three word fragment comes from one article or another.
So although I originally had this 100 kilometre encyclopaedia, I may end up keeping thousands of kilometres of fragments because I'm not completely sure of where in the genome it came from. So the original data that comes out of a sequencing machine to go back to DNA is typically about 100 gigabytes to represent that one gigabyte of genome.
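The reassembly step in Justin's encyclopaedia analogy - piecing overlapping fragments back into one sequence - can be illustrated with a toy greedy assembler in Python. Real assemblers use far more sophisticated methods (overlap graphs, de Bruijn graphs) and must cope with errors and repeats; this sketch only shows the core idea.

```python
# Toy greedy assembly: repeatedly merge the pair of fragments with the
# largest suffix/prefix overlap until one sequence remains.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments: list[str]) -> str:
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]

# Shredded "encyclopaedia" text reassembled from overlapping pieces.
pieces = ["the_quick_br", "ck_brown_fox", "own_fox_jumps"]
```

With real reads, coverage gaps and ambiguous placements mean the output is incomplete, which is why so much raw fragment data must be retained.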
Justin, when we look back to when the human genome project had its big success in 2000 and declared that the human genome had been sequenced, how is that different to what we're talking about here with regards to an individual person's genome being sequenced, given all of these errors and dislocations in data that you're talking about?
When we describe the human genome project we're describing a project that involved the international scientific community all working on copies of a single human reference, sequencing it through laborious methods. Ajay mentioned Sanger sequencing. Sanger sequencing produces longer fragments; roughly, say, a metre, in the analogy I was using before, rather than the few centimetres - laboriously separating out the DNA; not just randomly tearing it up into pieces, but looking at the components of the DNA, stage by stage, through manual labour.
I see different numbers for the cost of the human genome project; about $4 billion is one round number that you hear. When we talk about sequencing today we're talking about something under $10,000 that's completely automated, but doesn't have the benefit of the labour of thousands of people examining where in the genome each fragment came from.
Ajay, let me turn to you for a moment, with regards to the Genographic Project. My understanding is that some 99.9 per cent of all human DNA is basically the same. So what exactly are we looking for in order to allow us to paint this picture of human migration?
You're right, Shane; 99.9 per cent of human DNA is the same between individuals; but it is that other 0.1 per cent that does vary from one individual to the next. In there, actually, lies the information of who we are and how one individual differs from another.
Zooming in specifically on various portions of the genome, we are looking at such genetic variation on two pieces of the genome. One is a chromosome known as the Y chromosome, which actually, in humans, is carried only by males. It is what makes a male a male. A male has an X and Y chromosome, and a female has an X and X chromosome. It's that Y chromosome that marks the male ancestry of every male individual, because a male person has gotten that Y chromosome only from his father. So, genetic variations on the Y chromosome allow us to infer the paternal, or father's father's father line of ancestry.
Another piece of the genome that we use is known as the mitochondrial genome. It comes from the mitochondrion, a small organelle, or piece within every cell, that carries its own little circular DNA. The mitochondrial DNA comes to us through our mother. Males and females alike inherit mitochondrial DNA from their mother.
Therefore, looking at genetic variations on the mitochondrial genome allows us to infer our maternal, that is mother's mother's, line of ancestry. So we in the Genographic Project are actually zooming in on the 0.1 per cent of the genome that does vary from person to person. Within that 0.1 per cent we are looking at changes specifically on the Y chromosome and the mitochondrial genome, because they allow us to infer these two paternal and maternal lines of ancestry.
This is Up Close, coming to you from the University of Melbourne, Australia. I'm Shane Huntington, and my guests today are Dr Ajay Royyuru and Professor Justin Zobel. We're talking about the challenges in computational biology.
Justin, the projected demand for these personal genomes seems to be driven very strongly at the moment in terms of bringing down the cost and in being able to do this. What are the drivers that are pushing this area?
The drivers, historically, have been research investigating which of the variations that you've been discussing between one human and the next are responsible for disease or for wellness; for investigating the link between the gene - meaning that sequence of nucleotides - and everything about us, whether it's eye colour or longevity or any of the other characteristics of us as organisms.
This demand for knowledge by researchers has led to the creation of the sequencing technologies that we've been discussing. As the sequencing technologies have become cheaper - almost to the point of being commodities - many applications that were not anticipated as those technologies were invented are starting to appear.
They could be broadly put into a number of categories. One, as you say, is to identify the genome of an individual with the hope of using the information in that genome to identify, perhaps, disease risks or lifetime health risks, help people to make drug choices, diet choices and so on through their lifetime. This needs to be backed up by research data, which is also being built up progressively, but slowly, through the activities of a great many researchers in a great many institutes around the world.
As the technology matures many new applications are appearing. For example, one thing that we can obviously notice about ourselves, as humans, is that we are multi-cellular. We've got about 100 trillion cells in the human body, and those cells differ from each other. A skin cell is not like a brain cell. A brain cell is not like a liver cell. Yet, they have the same genome.
It's clearly not the case that every cell in the body behaves the same way, and it's clearly not the case that it's just the genome that gives the cell its properties, but something else as well. That something else is structure. Just as the encyclopaedia is printed as pages divided into entries - perhaps divided into volumes - the DNA is structured in different ways in different cells. Some of it is packed away. Some of it is open. There are different signals in different cells, indicating which genes should be active and which should be quiet. It's those differences between cells that give different cells their characteristics.
Sequencing technology can be used not just to say what the genome is, but which genes are active and which are quiet, and to what extent that they're active. So we can use them, for example, to measure the impact of a drug on a cell to see is that drug changing the behaviour of the cell in the way that we want, by looking at the genes that are active. It can be used to track the progression of a disease, which may be a disease that affects cellular function. Indeed, most diseases affect cellular function in some way.
On a larger scale gene sequencing can also be used to examine for foreign DNA. We recently heard Craig Venter in Melbourne say that two thirds of the cells in the human body are non-human. They are bacteria of different kinds, or other organisms, present in the human. They all have their own DNA and their own characteristics. Knowing what bacteria are there, whether they're helpful or toxic, is an important part of diagnosis of disease.
I guess a big picture way of stating this is that we've had the genomes used extensively in research now for many years. What we have just begun to do is see that translation to clinical treatment.
Justin, there has been this extraordinary explosion of data, and even fields themselves, out of this work. What sort of challenges is this posing for computational biology? I can imagine there are two areas; one being storage of this immense amount of data. More important is how we go about using it. What are the big challenges there?
A basic answer to your question is that as new data types become available that presents new mathematical challenges or analytical challenges for people whose role as scientists is to invent ways to process data. DNA is very much driving a wave of inventions in data management, as you mentioned - storage and ways of dealing with that data in reasonable time.
An obvious example is the case I gave before of that torn-up encyclopaedia. If we can produce the data for a single human genome in a week, we would like to be able to process that data within a week. We'd like to do so cheaply, without invoking supercomputers or massive resources, because for a $5,000 genome we don't want to spend $50,000 on the computer processing. So the challenge for computer scientists is to develop new data representations, and new algorithms that operate on that data, to allow, for example, the fragments from an individual human's genome to be correctly assembled into a sequence that represents that human.
Ajay, sitting where you are at IBM you must see this explosion of data that we've been talking about as something that has strongly influenced cross-disciplinary activity throughout the research community. Is that something that you've observed recently?
Yes, there is a very significant trend to bring expertise from many disciplines to this problem of how to organise the information, how to process and learn new and useful things from this exploding quantity of new information. In my own research group here at IBM we have a very multi-disciplinary team made up of researchers with backgrounds in biology, of course, but also in computer science, physics, mathematics and statistics, and some in clinical and medical sciences as well.
I think all of those need to come together, not just within organisations like ours but, even more interesting and important, is how organisations collaborate with each other to tackle problems in this space. I'm actually very encouraged by the effort that we have together between IBM and the University of Melbourne to tackle exactly these sorts of problems. That, to me, is a very telling transition point in this field, where organisations begin to work together to solve these problems.
Justin, as we move forward examining this sort of data, you mentioned before there's essentially an incredible amount of noise in this data. Computationally, how do we go about pulling out what we need and forgetting the noise and knowing the difference between the two?
The key thing we need to accumulate is reliable databases of genetic knowledge. As we sequence more and more organisms - varieties of bacteria, of crops, of humans, of viruses - these will become catalogued and curated into large databases of genomic information. These libraries are going to be the reference against which genomic information is checked.
So, to think about future medical treatment, if we want to know exactly which bacteria somebody's contaminated with, one probable treatment is that we will take a biological sample from that person - from the bowel, from the mouth - we will sequence that sample, and it will contain, probably, many thousands of bacteria.
Therefore, the sequencing process will produce DNA fragments from, potentially, many thousands of organisms. Each of those fragments can be thought of as a database query. We will run those fragments against the database, and the database will suggest for each fragment the one organism it comes from or, perhaps, the dozens of organisms it might come from, across the roughly billion fragments that we throw into that database as part of querying this one person and their one infection.
We would expect to see some candidates emerge strongly across those billion queries. So we get rid of the noise, in effect, by looking at a large sample of material and using the size of the sample to distinguish noise from signal.
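The fragment-as-query process Justin outlines can be sketched as follows. The two "reference genomes" here are made-up strings, and a real system would use a k-mer or suffix index over curated genome databases rather than a plain substring test.

```python
# Tally which reference organisms each sequenced fragment could have
# come from; strong totals stand out from ambiguous or spurious hits.
from collections import Counter

# Hypothetical miniature reference database: organism -> genome string.
REFERENCE = {
    "E. coli":   "ACGTACGGTTAGCCAGT",
    "S. aureus": "TTGACCAGTAGGCATCA",
}

def candidates(fragment: str) -> list[str]:
    """All organisms whose reference sequence contains this fragment."""
    return [name for name, genome in REFERENCE.items() if fragment in genome]

def tally(fragments: list[str]) -> Counter:
    counts = Counter()
    for frag in fragments:
        for name in candidates(frag):   # ambiguous fragments count for all
            counts[name] += 1
    return counts

# "CCAGT" occurs in both references, so it is an ambiguous fragment.
sample = ["ACGTAC", "GGTTAG", "CCAGT", "GGCATC"]
counts = tally(sample)
```

Across a billion real queries, the organisms actually present accumulate counts far above the background of ambiguous matches, which is the noise-filtering effect described above.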
Is it likely that we will be able to keep up with this onslaught of data in terms of hardware and the software that we have in place?
There's no doubt we are in a race. Right now I would say that the computer scientists are losing. We sequence today, worldwide, per second, more DNA data than was known 20 years ago in total. That volume of data is growing by a factor of something in the order of three to 10 per year. Over the last three years it's been much more than that. A factor of 80 is quoted for the year 2008. There is no doubt that we have more data than we can effectively process with current methods and current hardware.
My belief is that methods will solve the problem more than hardware. The cost of an internet search, like those that we make on search engines such as Bing or Google, has fallen by a factor of something like 100 million over 30 years. Some of that gain has been through hardware. Most of it has been through smart algorithmics. I expect something of the same kind to occur in this space.
Gentlemen, thank you for being our guests on Up Close today and giving us a better understanding of the Genographic Project.
Thank you, Shane.
Thank you, Shane; a pleasure to be with you.
Relevant links, a full transcript and more info on this episode can be found at our website http://upclose.unimelb.edu.au.
Up Close is brought to you by Marketing and Communications of the University of Melbourne, Australia. This episode was recorded on 1 December 2010. Our producers for this episode were Kelvin Param and Eric van Bemmel. Audio engineering by Gavin Nebauer. Background research for this episode was conducted by Kelvin Param and Christine Bailey. Up Close is created by Eric van Bemmel and Kelvin Param. I'm Shane Huntington. Until next time, goodbye.
You've been listening to Up Close. For more information visit http://upclose.unimelb.edu.au.
Copyright 2010, the University of Melbourne.