Writing the Book of Life

It was early spring in Cambridge, England, in 1953. James Watson and Francis Crick were frantically racing against one of the world's most renowned researchers, Linus Pauling, to determine the chemical structure of DNA. As with many scientific discoveries, everything seemed to fall into place suddenly. "The brightly shining metal plates were then immediately used to make a model in which for the first time all the DNA components were present," Watson wrote in The Double Helix: A Personal Account of the Discovery of the Structure of DNA. "In about an hour I had arranged the atoms in positions which satisfied both the X-ray data and the laws of stereochemistry."

As you read Watson's elegant description, it's easy to overlook all of the painstaking research that went into the discovery. Yet as any scientist will quickly point out, developing sound theory relies on the meticulous acquisition and analysis of data, in most cases huge amounts of it.

That's as true today as it was in 1953. But now scientists are tackling an even grander challenge than the one Watson and Crick faced. Today researchers all over the world are on a quest to unravel one of the greatest scientific mysteries of all time: the genetic code that makes each of us unique, also known as the human genome.

Two key players in this rapidly unfolding drama are the federally funded Human Genome Project and the private company Celera Genomics, and their success may be determined as much by IT as it is by science. Indeed, much of the work behind mapping the sequence of the human genome depends on developing an IT infrastructure that can acquire, analyse and store enormous amounts of data quickly and accurately.

The stakes are high. Researchers say that sequencing the human genome will revolutionize health care. Not only will scientists learn more about the origins of certain diseases and why certain people are predisposed to developing them, but pharmaceutical and biotechnology companies will be able to dramatically reduce the time and money needed to develop new drugs. The new drugs, in turn, should cause fewer side effects, a major benefit given that an estimated 100,000 patients die in US hospitals each year because of adverse reactions to their medications, according to a recent article in the Journal of the American Medical Association.

Work on sequencing the human genome began in earnest in 1990 when the US government launched the Human Genome Project. The goal was to deliver a map of the entire human genome within 15 years. The National Human Genome Research Institute, which heads up the Human Genome Project for the National Institutes of Health, now says that a working draft will be available this spring. Meanwhile, scientists in six countries, working in government organisations, universities and private corporations, are collaborating on the effort, each group tackling different pieces of the sequence.

The sheer volume of data is staggering: the human genome is made up of an estimated 3 billion base pairs. (In case you've forgotten your genetics, DNA is a double helix whose two strands are joined by pairs of the building blocks, or "bases": adenine with thymine and guanine with cytosine, written A-T and G-C.) If you find that amount of data difficult to visualise, imagine reading the base pairs aloud at a rate of three per second, without stopping. At that pace it would take you more than 30 years to recite all 3 billion pairs of letters.
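
That figure is easy to check for yourself. The following back-of-the-envelope calculation, a minimal sketch in Python, uses only the numbers quoted above:

    # Back-of-the-envelope: how long to read the genome aloud, non-stop?
    BASE_PAIRS = 3_000_000_000       # estimated size of the human genome
    RATE_PER_SECOND = 3              # reading rate assumed above

    seconds = BASE_PAIRS / RATE_PER_SECOND
    years = seconds / (60 * 60 * 24 * 365)
    print(f"Non-stop reading time: about {years:.0f} years")   # prints roughly 32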

But before scientists can begin looking at the arrangement of the base pairs within the human genome, they need pieces of the genome. So every researcher begins with the same basic series of steps. Laboratories are set up with robots to automate the daily preparation of thousands of DNA samples obtained from anonymously donated specimens of blood and semen. Once the samples are readied, the data is extracted with sophisticated machines known as sequencers. The sequencers' analog signals are then converted into digital data for processing by computers. The data is then cleaned up, compared with other known sequence data via standard search algorithms such as Blast (basic local alignment search tool) and stored in a database.
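
In outline, the flow from raw trace to stored sequence looks something like the sketch below. It is only an illustration: the function names are hypothetical stand-ins for the instrument software, quality-trimming step, Blast search and database loader that a real genome centre would use.

    # Illustrative sketch of a sequencing-lab data pipeline.
    # All function names here are hypothetical placeholders, not a real lab's API.

    def basecall(trace):
        """Convert a sequencer's raw trace into a string of A/C/G/T base calls."""
        ...

    def clean(read):
        """Trim low-quality ends and mask vector or contaminant sequence."""
        ...

    def blast_search(read, reference_db):
        """Compare the read against known sequences with a Blast-style search."""
        ...

    def store(read, metadata, archive):
        """Write the finished read plus its tracking metadata to the archive."""
        ...

    def process_sample(trace, metadata, reference_db, archive):
        read = clean(basecall(trace))              # signal -> bases -> usable read
        hits = blast_search(read, reference_db)    # compare with known sequences
        store(read, dict(metadata, hits=hits), archive)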

Then comes one of the biggest challenges: reassembling the data to create a picture of the human genome. Since a sequencer can read only about 500 bases at a time, the DNA must be assembled from overlapping fragments, in much the same way that you might try to reassemble a stack of Sunday newspapers that had been shredded into thousands of tiny pieces.
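
To see how overlaps make that reassembly possible, here is a deliberately tiny sketch in Python. It greedily merges whichever two fragments share the longest overlapping ends; it illustrates the idea only and is nothing like the industrial-strength assembly software the genome centres run.

    # Toy greedy assembler: repeatedly merge the two fragments with the longest
    # suffix-to-prefix overlap until a single sequence remains.

    def overlap(a, b):
        """Length of the longest suffix of a that matches a prefix of b."""
        best = 0
        for k in range(1, min(len(a), len(b)) + 1):
            if a[-k:] == b[:k]:
                best = k
        return best

    def assemble(fragments):
        """Greedy assembly: always merge the pair with the longest overlap."""
        frags = list(fragments)
        while len(frags) > 1:
            best_len, best_pair = -1, None
            for i, a in enumerate(frags):
                for j, b in enumerate(frags):
                    if i == j:
                        continue
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_pair = k, (i, j)
            i, j = best_pair
            merged = frags[i] + frags[j][best_len:]
            frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
            frags.append(merged)
        return frags[0]

    reads = ["GATTACAT", "ACATCGGA", "CGGATTTC"]
    print(assemble(reads))    # prints GATTACATCGGATTTC

Real assemblers must also cope with sequencing errors, repeated stretches of DNA and fragments read from the opposite strand, which is where most of the computing power goes.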

Researchers face several key challenges when it comes to working with the data. First, there's the issue of managing it. Not only must they acquire and store the data being pulled from the sequencers, but they must also track data associated with each step of the process (for example, temperature, movement from team A to team B and so on). It's not uncommon for a lab to process 80,000 samples of DNA each day; that alone translates into about 15GB of sequence data per day. And because the data has such enormous value, it is stored indefinitely. Since many of the applications needed to run a genomics center are either not available commercially or were not designed to handle the huge amounts of data required to map the human genome, IT staff must often customise publicly available software (e.g., Blast) or develop new applications in-house.
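
Those volumes compound quickly when nothing is ever deleted. A rough Python calculation, using only the 15GB-per-day figure quoted above, shows the scale of the archive problem:

    # Rough growth estimate for sequence data that is kept indefinitely.
    GB_PER_DAY = 15                          # sequence output quoted above

    tb_per_year = GB_PER_DAY * 365 / 1024    # 1 TB = 1024 GB
    print(f"About {tb_per_year:.1f} TB of new sequence data per year")

At that rate, a single year's output is on the order of the 5 terabytes of RAID storage described below.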

Another challenge involves an issue that most CIOs know all too well: getting heterogeneous applications to talk to one another. Researchers must somehow integrate the applications that ship with the sequencers with their own applications.

Scientists at MIT's Whitehead Institute for Biomedical Research are spearheading much of the federal effort to sequence the human genome. Led by Eric Lander, the Whitehead/MIT Center for Genome Research in Cambridge, Mass., has assembled an impressive array of staff and technology. The sequencing bioinformatics staff of 18 has a wide range of expertise in such fields as physics, biology, neurobiology and computer science (see "An Emerging Industry," below). Nearly all have some experience with software engineering. Meanwhile, the nine-person computer systems operations staff provides technical and user support.

The list of hardware and software is equally impressive: 123 sequencers; 17 four-processor SMP machines serving as pipeline, assembly, database and file servers; Compaq StorageWorks RAID arrays with 5 terabytes of storage; two Sybase database production environments; and a slew of custom-developed applications. The sequence data is stored in Unix flat files, while the data gathered by the Center's laboratory information management systems is stored in Sybase relational databases.

Each night the newly assembled sequence data is automatically updated and archived in-house. (Since receiving $35 million from the National Human Genome Research Institute last March, the Whitehead Institute has been scaling up its DNA sequencing from 750 million base pairs per year to 17 billion base pairs annually.) The data is also sent via the internet to GenBank, a public database maintained by the National Center for Biotechnology Information. (NCBI also develops software tools for analyzing genome data, conducts research in computational biology and distributes biomedical information.) NCBI then replicates the new data to other public databases in Europe and Japan. The emphasis is clear: make the data available as soon as possible to as many people as possible.
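
That openness is easy to see from the outside: anyone can pull a finished record back out of GenBank over the web. The sketch below does so through NCBI's Entrez E-utilities service; the interface is a later addition to NCBI's toolset and the accession number is simply an example, so treat this as an illustration of the public-access principle rather than part of the pipeline described here.

    # Fetch a public GenBank record in FASTA format via NCBI's Entrez E-utilities.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    ACCESSION = "NM_000546"   # example accession; substitute any GenBank identifier

    query = urlencode({
        "db": "nucleotide",
        "id": ACCESSION,
        "rettype": "fasta",
        "retmode": "text",
    })
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + query

    with urlopen(url) as response:
        record = response.read().decode()

    print(record[:200])       # the FASTA header and the first stretch of sequence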

Many of the challenges encountered by researchers at the Whitehead Institute may sound familiar. "We face the same issues that anyone building a large production facility faces: reliability, availability and scale," says Jill Mesirov, director for bioinformatics and research computing. "Biology is very new to production on this kind of scale, so we're constantly talking about the best ways to do this work." Since the amount of data continues to grow at a phenomenal rate, scalability is a moving target. "Our vision of what something will look like in six months often changes," says K.M. Peterson, manager of computer systems operations.

Even with all of the challenges, researchers have been able to scale up production dramatically. "Most of this would have been impossible 10 years ago," says Lauren Linton, codirector of the Whitehead/MIT Center for Genome Research. "When I was in graduate school, we had to call out the [base] letters and write them down. We currently generate about 50 million letters each day. Now the software handles, stores and sifts through all the data."

The Human Genome Project has already made significant progress. Last November, government scientists announced that they had identified, sequenced and published one-third of the human genome. Less than a month later came another spectacular announcement: for the first time ever, the DNA of an entire human chromosome had been sequenced.

Sequencing the human genome also has a corporate side. Companies such as the PE Corporation were quick to realise that pharmaceutical companies would be willing to pay handsomely for information that allows them to develop drugs more quickly. PE jumped into the arena two years ago when it convinced Dr. J. Craig Venter to leave his position as president of the nonprofit Institute for Genomic Research (TIGR) and launch yet another genomics center, this time a company called Celera Genomics. (Venter remains chairman of TIGR's board.) Celera says that it will sequence the human genome faster and more cheaply than the government's Human Genome Project. While some characterize the efforts of the two groups as a race, others dismiss the notion. In fact, the Human Genome Project and Celera have chosen different approaches to solving the problem, in terms of both science and IT.

At the heart of Celera's approach is a sequencing method that Venter pioneered. Known as whole-genome "shotgun" sequencing, the approach calls for blasting the entire genome into small pieces of DNA and then assembling the fragments into the proper order by matching the overlapping sequences at the ends of the fragments. The method is faster than the one used by the federal effort, which sequences the genome one known large fragment at a time, and it is controversial: some researchers are concerned that Celera's results will not be accurate.
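
The logic can be illustrated by extending the toy assembler sketched earlier: cut a known string into overlapping fragments, shuffle them so all positional information is lost, and check that the overlaps alone are enough to put it back together. Again, this is only a sketch; real genomes contain long repeats that can make different regions look identical, which is exactly what the sceptics worry about.

    import random

    # Toy "shotgun" run, reusing the overlap()/assemble() sketch defined above:
    # shred a known sequence into overlapping 50-letter fragments, shuffle them
    # and rebuild it from the overlaps alone.
    random.seed(0)
    genome = "".join(random.choice("ACGT") for _ in range(300))

    fragments = [genome[i:i + 50] for i in range(0, 251, 10)]   # 40-letter overlaps
    random.shuffle(fragments)

    print(assemble(fragments) == genome)    # overlaps alone recover the original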

Since opening its doors, Celera has moved quickly. Venter assembled a team of well-known researchers, including Samuel Broder, former director of the National Cancer Institute; Nobel laureate Hamilton Smith, who discovered the first Type II restriction enzyme, a class of tools now used routinely in gene cloning; and Eugene Myers, a co-developer of the aforementioned Blast algorithm. Last September, Celera announced that it had finished sequencing the genome of Drosophila melanogaster, the fruit fly. One month later, the Rockville, Md.-based company announced yet another major milestone: the sequencing and delivery of approximately 1.2 billion base pairs of human DNA to its subscribers.

Ultimately, Celera wants to become the definitive source of all genomic information by giving subscribers the tools to access and analyse its data via the internet. To meet that goal, Celera has worked closely with Compaq Computer to develop what company officials refer to as the world's second largest supercomputer facility. It has already installed more than 200 Compaq AlphaServer ES40 systems running 500MHz Alpha processors, 11 GS140 servers and 50 terabytes of StorageWorks storage. On order are an additional 28 ES40s and three WildFire systems, one with 128GB of RAM. Meanwhile, Celera has also leased 300 of PE Biosystems' Model 3700 DNA sequencers. This IT arsenal runs on a switched backbone that supports throughput of 500GB per second.

In many ways, Celera's pipeline for acquiring and processing data is similar to that of the Whitehead Institute. There are, however, several key differences. For one thing, Celera has developed its IT infrastructure with e-commerce in mind. Celera customers, for example, tap into their own databases running on Celera servers via the internet. (Each customer database is updated weekly. Three copies of the data are archived forever on tape.) Celera maintains customer data in more than 60 separate databases.

As the company broadens its subscriber base, that model is expected to change. By the end of the year, smaller customers will go to a single shared database where they will access the specific data that interests them. Digital certificates will carry key information such as identity, access level and billing details. Meanwhile, larger companies will probably continue to have their own databases maintained by Celera.

Celera has signed up four database customers so far: Pharmacia & Upjohn, Amgen, Novartis and Pfizer. The first three ponied up $5 million for five years to gain early access to Celera's database last year. The company refuses to disclose how that fee structure may change as other companies join the fold. Not-for-profit research organisations, however, will be charged less. "We have said that we will charge between $5,000 and $20,000 per year per lab for a university," says Paul Gilman, director of policy planning. (The amount per lab will be smaller for universities with more labs.) And what do customers get for these fees? In addition to the sequence data itself, they gain access to annotation information (for example, details about whether a certain gene has been seen before, whether it has been patented and so on), comparative genomics (comparisons with the fruit fly and mouse genomes), Celera's computational facilities and a wide assortment of software tools. Celera also saves customers the time and expense of accessing dozens of databases worldwide. "We are not just a sequence DNA database," says Gilman. "We provide extensive annotation and the best tools for manipulating and analyzing the data."

Because e-commerce is the order of the day, Celera faces certain business pressures. "There's so much data that performance and bandwidth are sometimes an issue," says Marshall Peterson, vice president of infrastructure technology. "Customers can either download their tools and data to our servers or they can pull the database to their site. We want to convince them to analyse the data here." Since it's often impossible to predict what customers will need in six months, flexibility is crucial. "We're seeing huge increases in the numbers and types of customers," says Peterson. "We still don't know how customers will want to access our data in the future, so we want to offer flexibility in terms of billing-whether it's by user, CPU, number of queries or other methods." Flexibility will be especially important as Celera eventually adds data from public databases such as GenBank to its stockpile of information.

Celera's business model quickly captured the interest of the business world. Wall Street welcomed the new venture with open arms, sending its stock price from $14 3/16 in May 1999 to more than $190 by the end of the year. Meanwhile, financial gurus Tom and David Gardner of the personal investment website Motley Fool told followers that they plan to invest $50,000 in the company. And according to biotechnology analyst Eric Schmidt at investment bank S.G. Cowen in New York City, the future looks even brighter. "We think that Celera will win big time because of their dream team, the facilities and their proprietary advantage that will allow them to gain market share," says Schmidt.

Industry observers are also betting on Celera. "There are many companies out there that are sequencing, but Celera is the only one that has tackled the entire genome," says Lynn Arenella, associate professor of biology at Bentley College in Waltham, Mass. Arenella, whose expertise includes the commercialisation of biomedical technologies, believes that Celera has an important edge. "Companies that have the ability to organise and interpret the data will come out on top. Craig Venter is talking about becoming the [Michael] Bloomberg of genetics. And he has always made good on his promises."

It's not clear when Celera or the Human Genome Project will be able to deliver on all of their promises. When they do, however, the results could affect the lives of everyone you know.