Published on 13-Nov-2012
"The genomics industry as a whole has realized that what we need to do would be ludicrously expensive without cluster computing." - Tim Cutts, Platform LSF Administrator, Sanger Institute
Wellcome Trust Sanger Institute
The Wellcome Trust Sanger Institute is a genome research centre set up in 1992 by the Wellcome Trust and the Medical Research Council in order to further the knowledge of genomes. It plays a substantial role in the sequencing and interpretation of the human genome to underpin research on human biology and disease.
Genetic sequencing machines produce 120 terabytes of raw data per week that need to be processed
Research on genomes of numerous species generates 100,000s of processes per week
Platform LSF has helped the Institute to run up to half a million sequence matching jobs a day
Researchers can make rapid advances in science by quickly comparing similar genomic structures
Ability to perform massive, regular updates to the genome browser
Excellent support from IBM Platform Computing allows the Institute to deal with the unique issues that any business faces in running a heterogeneous HPC infrastructure
The Sanger Institute realized that the research they were doing would be “ludicrously expensive on single machines,” according to Tim Cutts, Platform LSF Administrator, Sanger Institute. “Traditional supercomputers were not set up to deal with the sorts of problems we were facing.”
Twelve clusters, hundreds of nodes
The Institute has a total of twelve clusters, eight of which, including the three largest, run Platform LSF. It is a multi-vendor, heterogeneous Linux environment, with dual- and quad-core IBM and HP machines, as well as several SGI Altix machines in the various clusters. The architecture ranges from 32-bit systems to 64-bit Opteron and 64-bit Itanium systems. This cluster environment is a good fit for IBM® Platform™ LSF®, which was designed to work with the most complex IT environments. Platform LSF is a High Performance Computing (HPC) management software solution that intelligently schedules parallel and serial workloads. The largest cluster has 710 nodes and is used for general-purpose research by the Institute’s researchers.
Data inputs grow by orders of magnitude
HPC is more important to the Sanger Institute than ever before. The Institute recently purchased 30 new genome sequencing machines, each of which produces two orders of magnitude more data than the previous generation of sequencers. The raw data is sent to a Platform LSF cluster for processing.
The machines run third-party “sequencing pipeline” software. “They generate absolutely stupendous quantities of data. A single one of these instruments has the sequencing capacity of our entire institute three years ago,” says Cutts. “A sequencing reaction run on the machine takes around three days,” he continues.
“Each run generates around two terabytes of data, so that’s four terabytes per machine per week.” Multiply that by 30 machines and the Institute is processing 120 terabytes of raw data a week. At that volume, a robust workload management solution is vital to ensure that the data is processed as efficiently and as quickly as possible. The sequencing data is processed on a 128-node Platform LSF cluster. “This cluster handles the huge quantity of data that come off of the new DNA sequencing machines,” says Cutts.
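The weekly figure quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, the value of two runs per machine per week is an assumption inferred from the quoted "four terabytes per machine per week", and the sustained-rate calculation assumes decimal units (1 TB = 1,000,000 MB) and an evenly spread load.

```python
# Figures taken from the case study text.
TB_PER_RUN = 2    # "around two terabytes" per sequencing run
MACHINES = 30     # number of new sequencing machines

# Assumption: about two runs per machine per week, implied by the
# quoted "four terabytes per machine per week".
RUNS_PER_WEEK = 2

SECONDS_PER_WEEK = 7 * 24 * 3600


def weekly_raw_data_tb(machines=MACHINES,
                       runs_per_week=RUNS_PER_WEEK,
                       tb_per_run=TB_PER_RUN):
    """Total raw data produced per week, in terabytes."""
    return machines * runs_per_week * tb_per_run


def sustained_rate_mb_per_s(tb_per_week):
    """Average data rate in MB/s if the weekly volume arrived evenly.

    Assumes decimal units: 1 TB = 1,000,000 MB.
    """
    return tb_per_week * 1_000_000 / SECONDS_PER_WEEK


if __name__ == "__main__":
    total = weekly_raw_data_tb()
    print(f"Raw data per week: {total} TB")
    print(f"Average sustained rate: {sustained_rate_mb_per_s(total):.1f} MB/s")
```

Spread evenly over a week, 120 terabytes works out to an average of just under 200 MB per second of incoming raw data, which gives a sense of why the processing is handed to a dedicated cluster.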
Accelerating genomic research
The Sanger Institute was a key player in the Human Genome Project, delivering almost one third of the work involved. Since the mid-1990s, Platform LSF has improved workload management, researchers’ efficiency and time-to-results for the Human Genome Project by allowing the Institute to run up to half a million sequence matching jobs a day.
“The Human Genome Project used enormous, scalable compute power to make the data available throughout the project. The Project was as much an exercise in IT and systems needs as in lab science,” says Phil Butcher, Head of IT at the Sanger Institute. According to Butcher, the Institute was able to finish sequencing the human genome two years ahead of schedule partly because of its investments in flexible systems and software.
Finding causes and cures for disease
A smaller 100-node cluster, also powered by Platform LSF, handles requests submitted by external users to the Ensembl Genome Browser. Ensembl, a joint project with the European Bioinformatics Institute, amasses genome information for 20 species including chimpanzees, mice, cows, dogs and humans. The site allows researchers to compare their own gene sequences with information available from the Sanger Institute. Users can paste their information into the web site and ask it to “find something in the database that looks like it,” says Cutts.
“Let’s say you are interested in muscular dystrophy,” says Cutts. “You would be able to compare a particular gene that you know has been associated with muscular dystrophy with the equivalent gene in 20 other organisms. And then use the information to construct for example a mouse model of the disease.”
Platform LSF is crucial to this process because updating the genetic annotation database is computationally intensive. “Since the data is being constantly updated by laboratories all over the world, we recalculate the entire Ensembl database from scratch every two months on the large cluster.”
Support is the key
As for many customers of IBM® Platform Computing™, a high standard of support is critical to the Sanger Institute. “The standard of support we have had has been excellent,” says Cutts. “No software is bug free, so we need a company that is actually willing to say, ‘Oh, yeah, okay, we’ll fix that as quickly as possible.’”
The Wellcome Trust Sanger Institute does not endorse commercial products.
For more information
To learn more about IBM Platform Computing please contact your IBM marketing representative or IBM Business Partner, or visit the following website: ibm.com/platformcomputing
© Copyright IBM Corporation 2012
IBM Corporation Systems and Technology Group Route 100 Somers, NY 10589
Produced in the United States of America June 2012
IBM, the IBM logo, ibm.com, Platform Computing and Platform LSF are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.
The performance data discussed herein is presented as derived under specific operating conditions. Actual results may vary.
THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.
Actual available storage capacity may be reported for both uncompressed and compressed data and will vary and may be less than stated.