[ExI] Basically, DNA is a computing problem

Stefano Vaj stefano.vaj at gmail.com
Sat Mar 1 22:26:26 UTC 2008


 Basically, DNA is a computing problem

The revolution of genome sequencing has spawned a parallel revolution in
computing, as scientists in Cambridge have found

The computing resources of the Sanger Institute at Hinxton, near Cambridge,
are almost unfathomable. Three rooms are filled with walls of blade servers
and drives, and there is a fourth that is kept fallow, and for the moment
full of every sort of debris: old Sun workstations, keyboards, cases and
cases of backup tapes - even a dishwasher. But the fallow room is an
important part of the centre's preparations. Things are changing so fast
that they can have no idea what they will be required to do in a year's
time.

When Tony Cox, now the institute's head of sequencing informatics, was a
post-doctoral researcher he could sequence 200 bases of DNA in a day (human
DNA has about 3bn bases). The machines being installed today can do 1m bases
an hour. What will be installed in two years' time is anyone's guess, but
the centre is as ready as it can be.
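
For a rough sense of the scale of that leap, here is a back-of-the-envelope
comparison in Python - a sketch of my own, using only the figures quoted
above:

    # Rough comparison of the throughput figures quoted above
    # (illustrative script, not from the article).
    BASES_PER_DAY_THEN = 200            # one post-doctoral researcher
    BASES_PER_HOUR_NOW = 1_000_000      # one of the new sequencers
    HUMAN_GENOME_BASES = 3_000_000_000

    speedup = (BASES_PER_HOUR_NOW * 24) / BASES_PER_DAY_THEN
    days_per_genome = HUMAN_GENOME_BASES / (BASES_PER_HOUR_NOW * 24)

    print(f"Speed-up per machine: ~{speedup:,.0f}x")               # ~120,000x
    print(f"Days to read one genome once: ~{days_per_genome:,.0f}")  # ~125
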
Invisible revolution

Genome sequencing, which is what the centre excels at, has wrought a
revolution in biology that many people think they understand. But it has
happened alongside a largely invisible revolution, in which molecular
biology - which even 20 years ago was done in glassware inside laboratories
- is now done in silicon.

A modern sequencer itself is a fairly powerful computer. The new machines
being brought online at the Wellcome Trust Sanger Institute are robots from
waist-height upwards, where the machinery grows and then treats microscopic
specks of DNA in serried ranks so that a laser can illuminate them and a
moving camera can capture the fluorescing bases every two seconds. The lower
half of each cabinet holds the computers needed to coordinate the machinery
and do the preliminary processing of the camera pictures. At the heart of
the machine is a plate of treated glass about the size of an ordinary
microscope slide, which contains around 30m copies of 2,640 tiny fragments
of DNA, all arranged in eight lines along the glass, and all with the bases
at their tips being directly read off by a laser.
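
The base-calling step can be sketched very crudely: at each imaging cycle
the camera records a fluorescence intensity per channel for every DNA
fragment, and the brightest channel names the base. The four-channel layout
and ordering below are illustrative assumptions, not the instrument's actual
data format:

    # Toy base caller: pick the brightest of four fluorescence channels
    # at each imaging cycle (channel order A, C, G, T is an assumption).
    CHANNELS = "ACGT"

    def call_bases(intensity_cycles):
        """intensity_cycles: one (a, c, g, t) intensity tuple per cycle."""
        read = []
        for cycle in intensity_cycles:
            brightest = max(range(4), key=lambda i: cycle[i])
            read.append(CHANNELS[brightest])
        return "".join(read)

    # Three cycles of made-up intensities -> "GAT"
    print(call_bases([(0.1, 0.2, 0.9, 0.1),
                      (0.8, 0.1, 0.1, 0.2),
                      (0.1, 0.1, 0.2, 0.7)]))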

To one side is a screen which displays the results. The sequencing cabinet
pumps out 2MB of this image data every second for each two-hour run. With 27
of the new machines running full tilt, each one will produce a terabyte
every three days. Cox was astonished when he did the preliminary
calculations. "It was quite a simple back-of-the envelope calculation:
right, we've got this many machines, and they're producing this much data,
and we need to hold it for this amount of time and we sort of looked at it
and thought: oh, shit, that's 320TB!"
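
Cox's estimate can be reconstructed roughly as follows. The machine count
and per-machine output come from the figures above; the retention period is
an assumption of mine, chosen only to show how the numbers combine:

    # Rough reconstruction of the 320TB back-of-the-envelope estimate.
    MACHINES = 27
    TB_PER_MACHINE_PER_DAY = 1 / 3     # "a terabyte every three days"
    RETENTION_DAYS = 35                # assumed holding period

    daily_output_tb = MACHINES * TB_PER_MACHINE_PER_DAY    # ~9 TB/day
    storage_needed_tb = daily_output_tb * RETENTION_DAYS   # ~315 TB

    print(f"Fleet output: ~{daily_output_tb:.0f} TB/day")
    print(f"Storage for {RETENTION_DAYS} days: ~{storage_needed_tb:.0f} TB")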

Think of it as the biggest Linux swap partition in the world, since the
whole system is running on Debian Linux. The genome project uses open source
software as much as possible, and one of its major databases is run on
MySQL, although others rely on Oracle.

"History has shown," says Cox, "that when we have created - it used to be
20TB or 30TB, maybe - of sequencing data, for the longer term storage, then
you may need 10 times that in terms of real estate, and computational
process, to analyse and compare and all the things that you want to do with
it. So having produced something in the order of 100TB to 200TB of
sequential data, then the layer beyond that, the scratch space, and the
sequential analysis, and so on - to be honest, we are still teasing out what
that means, but it's not going to be small."
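
Put into numbers, Cox's rule of thumb looks something like this - the
tenfold multiplier is his, the input sizes are illustrative:

    # Sketch of the "10 times that" rule of thumb for analysis space.
    ANALYSIS_MULTIPLIER = 10

    for sequence_data_tb in (100, 200):
        scratch_tb = sequence_data_tb * ANALYSIS_MULTIPLIER
        print(f"{sequence_data_tb} TB of long-term sequence data -> "
              f"~{scratch_tb} TB of analysis and scratch space on top")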

Down in the rooms where the servers are farmed you must raise your voice to
be heard above the fans. A wall of disk drives about 3m long and 2m high
holds that 320TB of data. In the next aisle stands a similarly sized wall of
blade servers with 640 cores, though no one can remember exactly how many
CPUs are involved. "We moved into this building with about 300TB of storage
real estate, full stop," says Phil Butcher, the head of IT. "Now we have
gone up to about a petabyte and a half, and the last 320 of that was just to
put this pipeline together."

This new technology is the basis for a new kind of genomics, with really
frightening implications. The ballyhooed first draft of the Human Genome
Sequence in 2000 was a hybrid of many people's DNA; like scripture, it is
authoritative, but not accurate. Now the Sanger Institute is gearing up for
its part in a project to sequence accurately 1,000 individual human genomes,
so that all of their differences can be mapped. The idea is to identify
every single variation in human DNA that occurs in 0.5% or more of the
population sampled. This will require one of the biggest software efforts in
the world today.
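
In raw counts, and assuming 1,000 diploid donors (my arithmetic, not the
project's published design), the 0.5% threshold works out like this:

    # What "0.5% or more of the population sampled" means in raw counts.
    DONORS = 1_000
    CHROMOSOME_COPIES = DONORS * 2     # diploid: two copies per person
    MIN_FREQUENCY = 0.005              # 0.5%

    min_copies = CHROMOSOME_COPIES * MIN_FREQUENCY
    print(f"A qualifying variant appears on at least ~{min_copies:.0f} "
          f"of {CHROMOSOME_COPIES:,} sampled chromosome copies")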

Although only very rare conditions are caused by single-gene defects, almost
all common conditions are affected by a complex interplay of
factors along the genome, and the Thousand Genome Project is the first
attempt to identify the places involved in these weak interactions. This
won't be tied to any of the individual donors, who will all be anonymous.
But mapping all the places where human genomes differ is the first necessary
step towards deciding which differences are significant, and what they signify.

There are three sorts of differences between your DNA - or mine, or anyone's
- and the sequence identified in the human genome project. There are the
SNPs, where a single base change can be identified; these are often
significant, and are certainly the easiest things to spot. Beyond that are
the changes affecting tens of bases at a time: insertions and deletions
within genes. Finally, there are the changes which can affect relatively long
strings of DNA, whole genes or stretches between genes, which may be copied
or deleted in different numbers. The last of these are going to be extremely
hard to spot, since the DNA must be sequenced in fragments that may be
shorter than the duplications themselves. "It's a bit like one of those spot
the difference things," Cox says. "If you have 1,000 copies, it's very much
easier to spot the smallest differences between them."
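
A toy classifier makes the three categories concrete. The size boundaries
below are illustrative; real pipelines draw the lines differently and use
far more than length to call the large structural changes:

    # Toy classifier for the three kinds of difference described above.
    def classify_variant(ref_allele: str, alt_allele: str) -> str:
        size_change = abs(len(alt_allele) - len(ref_allele))
        if size_change == 0 and len(ref_allele) == 1:
            return "SNP"                   # single-base substitution
        if size_change < 50:
            return "small indel"           # a few bases inserted or deleted
        return "structural/copy-number variant"

    print(classify_variant("A", "G"))                # SNP
    print(classify_variant("ATG", "A"))              # small indel
    print(classify_variant("A", "A" + "GT" * 40))    # structural/copy-number variant
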
Genome me?

All of the work of identifying these changes along the 3bn bases of the
genome must be done in software and - since the changes involved are so rare
- each fragment of every genome must be sequenced between 11 and 30 times to
be sure that the differences the software finds are real and not just errors
in measurement. But there's no doubt that all this will be accomplished. The
project is a milestone towards genome-based medicine, in which individual
patients could be sequenced as a matter of course.
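
The need for that depth of coverage can be illustrated with a simple
probability sketch: the chance that independent sequencing errors all hit
the same wrong base at the same position, and so masquerade as a real
variant, collapses as reads pile up. The 1% per-base error rate and the
"wrong in at least half the reads" rule are assumptions of mine, not the
pipeline's actual thresholds:

    # Probability that sequencing errors alone mimic a variant at one position.
    from math import comb

    ERROR_RATE = 0.01                   # assumed per-base miscall rate

    def errors_mimic_variant(coverage: int) -> float:
        """P(at least half the reads show the same wrong base)."""
        threshold = (coverage + 1) // 2
        p = ERROR_RATE / 3              # error must also pick the same wrong base
        return sum(comb(coverage, k) * p**k * (1 - p)**(coverage - k)
                   for k in range(threshold, coverage + 1))

    for cov in (1, 4, 11, 30):
        print(f"{cov:>2}x coverage: ~{errors_mimic_variant(cov):.2e}")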

Once that happens, the immense volumes of data that the Sanger Institute is
gearing up to handle will become commonplace. But the project is unique in
that it must not just deal with huge volumes of data, but keep all of it
easily accessible so different parts can quickly be compared with each
other.

At this point, the old sort of science is almost entirely irrelevant. "It
now has come out of the labs and into the domain of informatics," Butcher
says. The Sanger Institute, he says, is no longer just competing for
scientists. It is about to embark on this huge Linux project just at the
time that the rest of the world has discovered how reliable and useful it
can be, so it has to compete with banks and other employers for
people who can manage huge clusters with large-scale distributed file
systems. Perhaps the threatened recession will have one useful side effect,
by freeing up programmers to work in science rather than the City.
http://www.guardian.co.uk/technology/2008/feb/28/research.computing