IBM Rochester Centre for Advanced Studies

Improving scalability of MD simulations through performance analysis

IBM Contact: Carlos Sosa
Research Contact: Pratul Agarwal, Oak Ridge National Laboratory

There is a considerable push from DOE to enable biomolecular simulations (molecular dynamics, MD) to run on petascale machines with more than 100,000 processors. On one side, DOE has indicated that it will commission two petascale machines, with ANL being the site for a 1,000-teraflop Blue Gene/P. On the other side, DOE will be spending $250 million to establish two Bio-energy Research Centers. Discussions at various levels are leading to the conclusion that biomolecular simulations will be an important part of both these efforts. We are currently performing benchmarking and initial testing studies of modeling the cellulose-degrading enzyme (cellulase) complex. Engineered cellulases have implications for low-cost production of ethanol by enzymatic processing of biomass. Our preliminary studies will give an idea of the performance metrics and scales achievable (how many nanoseconds can be simulated in a day) as well as the kinds of molecular systems that can be modeled on BG/L and BG/P (millions of atoms). The outcome would be some preliminary science results that would indicate the suitability of BG/L (and therefore BG/P) for biomolecular simulations of the cellulose complex. Moreover, these preliminary studies would also provide vital information on scaling and on the design of application health-monitoring and fault-tolerance strategies.
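
The "nanoseconds per day" metric mentioned above follows from the integration timestep and the wall-clock time spent per MD step. The following is a minimal Python sketch of that arithmetic, together with a simple parallel-efficiency ratio for scaling studies. The function names and all numbers in the usage note are illustrative assumptions, not measurements or tools from this project.

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000  # femtoseconds in one nanosecond


def ns_per_day(timestep_fs, wall_seconds_per_step):
    """Simulated nanoseconds per wall-clock day.

    timestep_fs: MD integration timestep in femtoseconds (assumed input).
    wall_seconds_per_step: measured wall-clock seconds per MD step.
    """
    steps_per_day = SECONDS_PER_DAY / wall_seconds_per_step
    return steps_per_day * timestep_fs / FS_PER_NS


def parallel_efficiency(base_procs, base_rate, procs, rate):
    """Achieved speedup divided by ideal speedup when scaling
    from base_procs to procs processors (1.0 means perfect scaling)."""
    return (rate / base_rate) / (procs / base_procs)
```

For example, with a hypothetical 2 fs timestep and 0.01 s per step, `ns_per_day(2.0, 0.01)` gives about 17.28 ns/day. If doubling the processor count raised a run from 10 to 15 ns/day, `parallel_efficiency(512, 10.0, 1024, 15.0)` would be 0.75.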

Molecular Simulation of RNA catalysis using multi-scale simulation

IBM Contact: Carlos Sosa
Research Contact: Prof. Darrin York, Prof. George Karypis, Dr. Tai-Sung Lee, University of Minnesota

The present project proposes to merge the interests of the IBM Life Sciences division with a multidisciplinary research effort at the University of Minnesota that combines Computational Chemistry and Biology with Bioinformatics and Computer Science. The goal of the project is to leverage the computing power of the BlueGene system for long-time molecular simulation of RNA catalysis, using recently developed multi-scale simulation models derived from a quantum chemical database for RNA catalysis.

Classification of Secreted Proteins using a Single Domain

IBM Contact: Carlos Sosa
Research Contact: Eric Klee, Mayo Clinic
Stephen Ekker, Genetics, Cell Biology and Development
Lynda Ellis, Laboratory Medicine and Pathology, University of Minnesota

Secreted proteins make up approximately 10-20% of the vertebrate proteome, control cell-cell interactions, and are major targets for drug discovery. They are more properly called CoTranslationally Translocated (CTT) sequences, since they contain N-terminal signal sequences that allow them to be translocated. Thus CTT sequence prediction has traditionally relied on N-terminal signal sequence identification. However, these techniques assume that predictions are based on full-length protein sequences, and performance decreases significantly when analyzing N-terminally incomplete sequences, such as protein sequences encoded by ESTs. Truncated signal peptides may reduce the sensitivity of these techniques, and N-terminal truncations near transmembrane domains may reduce their specificity.

We have developed csP, a technique that uses protein domain classification instead of signal sequence identification to predict secreted proteins. Human Swiss-Prot release 44 protein sequences (excluding earlier releases) were downloaded. Protein domains found in annotated CTT sequences and not found in non-CTT sequences are used as our reference. Protein domains are identified using InterProScan. We demonstrate how csP can be used in conjunction with other software packages to help identify secreted proteins. csP is available upon request.
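
The reference set described above (domains found in annotated CTT sequences and absent from non-CTT sequences) amounts to a set difference. The sketch below illustrates that idea in Python. It is not the csP implementation: the function names, the domain identifiers, and the decision rule (a query is flagged if it carries any reference domain) are assumptions made for illustration, since the source does not state how csP scores a query.

```python
def build_reference_domains(ctt_proteins, non_ctt_proteins):
    """Return domains seen in at least one CTT protein and in no non-CTT protein.

    Both arguments map a protein ID to the set of domain IDs assigned to it
    (for example by a domain annotation tool run beforehand).
    """
    ctt_domains = set().union(*ctt_proteins.values())
    non_ctt_domains = set().union(*non_ctt_proteins.values())
    return ctt_domains - non_ctt_domains


def predict_secreted(query_domains, reference_domains):
    """Assumed decision rule: flag the query if it has any reference domain."""
    return bool(set(query_domains) & reference_domains)


# Hypothetical training annotations; DOM_* are placeholder identifiers.
ctt = {"P1": {"DOM_A", "DOM_B"}, "P2": {"DOM_C"}}
non_ctt = {"P3": {"DOM_B", "DOM_D"}}
reference = build_reference_domains(ctt, non_ctt)
```

Here `reference` contains `DOM_A` and `DOM_C` only, because `DOM_B` also occurs in a non-CTT protein. Since the rule looks at domains rather than the N-terminus, it can still be applied to an N-terminally truncated sequence as long as a reference domain is detectable in the remaining fragment.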

Protein-protein Interactions

IBM Contact: Carlos Sosa
Research Contact: Dr. Ann M. Bode, Hormel Institute

This project works on the use of computer modeling as it relates to cancer development and prevention. Carcinogenesis is a multifaceted and complex process, which profoundly affects numerous genes and gene products important in the regulation of various cellular functions. A major focus of our work has been the elucidation of crucial molecular and cellular mechanisms in cancer development and prevention. Our goal is the clarification of signal transduction pathways induced by carcinogens and tumor promoters as causative factors for cancer development. In order to facilitate the development of chemopreventive and chemotherapeutic agents that specifically target molecules important in cancer development, we must know the enemy: we must understand carcinogenesis. Cancer development is an extremely complex process involving a multitude of signaling proteins (e.g., kinases), which rarely, if ever, act alone. Signal transduction is the process by which information from a stimulus outside the cell is transmitted from the cell membrane (e.g., through its receptor) into the cell and along an intracellular chain of signaling molecules to stimulate a response. Accumulating evidence suggests that the process of signal transmission is not linear but instead involves a cross-talking network of communications. In this process, signaling proteins interact extensively with one another, which results in the escalation or suppression of the signal and of the ultimate cellular response.

Protein interaction with small molecules

IBM Contact: Carlos Sosa
Research Contact: Dr. Ann M. Bode, Hormel Institute

This project works on the use of computer modeling as it relates to cancer development and prevention. The prevailing thought today is that cancer may be prevented or treated by targeting specific cancer genes, signaling proteins and transcription factors. Cancer is a multistage process, consisting of initiation, promotion and progression stages. Although each stage may be a possible target for chemopreventive agents, the promotion stage, because of its extensive length, has the most potential to be reversed. By focusing on the molecular mechanisms explaining how normal cells can undergo neoplastic transformation induced by tumor promoters, we have discovered that several specific transcription factors and proteins, such as activator protein-1 (AP-1) and c-Jun N-terminal kinases (JNKs), are critical factors in cancer development and significant targets for cancer prevention and treatment. A major goal is to identify anticancer agents that have low toxicity and fewer adverse side effects, and that may be used alone or in combination with traditional chemotherapeutic agents to prevent or treat cancer. Many dietary compounds (e.g., EGCG from green tea, resveratrol from grapes and wine) have potent anticancer activities that work through as-yet-unknown mechanisms. Over the years, we have been working to identify those mechanisms through our work with signal transduction pathways.

Effective & Efficient Whole Genome Alignment Algorithms

IBM Contact: Tim Mullins
Research Contact: George Karypis, University of Minnesota

The overall goal of this project is to develop highly effective whole-genome sequence alignment algorithms that can correctly align genomes irrespective of their evolutionary distances and can take advantage of the computational power (in terms of both CPU and memory) offered by modern distributed-memory parallel computers. The proposed research is centered around the following three specific aims.

Specific Aim 1: Parallel Whole Genome Alignment Algorithms. Study various approaches for developing highly parallel formulations of existing whole-genome alignment algorithms. These formulations will be developed in the context of the IBM Blue Gene class of distributed-memory parallel systems and will concentrate on MUMmer 3.0 and LAGAN, two of the most widely used open-source whole-genome alignment algorithms.

Specific Aim 2: Alignment in the Presence of Non-local Rearrangement Events. Develop better models for incorporating non-local rearrangement events, which are widely present in DNA sequences of distant genomes. This would lead to whole-genome alignments that have a higher specificity and sensitivity, a factor determined by how well the algorithm aligns the coding and non-coding regions.

Specific Aim 3: Improved Algorithms for Anchor Point and Seed Selection. Study the extent to which the anchor-point-based algorithms, which were recently developed . . . for protein sequence alignment, can be extended for aligning genomic sequences. These algorithms identify the anchor points by combining automatically generated sequence profiles with a greedy selection strategy and were shown to produce alignments that are superior to those obtained by the traditional dynamic-programming-based approaches.
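
To make the idea of greedy anchor selection concrete, here is a generic Python sketch. It is not the algorithm referred to in Aim 3: the profile-based scoring is omitted entirely, candidate anchors are assumed to arrive already scored as `(pos_a, pos_b, score)` tuples, and the consistency rule (a new anchor must be strictly colinear with every anchor already accepted) is an assumption chosen for illustration.

```python
def greedy_anchors(candidates):
    """Greedily pick a mutually colinear set of anchors.

    candidates: iterable of (pos_a, pos_b, score), where pos_a and pos_b are
    positions in the two sequences and score is a precomputed similarity.
    Anchors are considered from highest to lowest score; one is accepted only
    if it lies strictly before or strictly after every accepted anchor in
    both sequences at once.
    """
    chosen = []
    for a, b, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if all((a - ca) * (b - cb) > 0 for ca, cb, _ in chosen):
            chosen.append((a, b, score))
    return sorted(chosen)  # return anchors in sequence order


# Hypothetical candidates: the second crosses the first, the fourth reuses pos_a=10.
example = greedy_anchors([(10, 12, 9.0), (20, 5, 8.0), (30, 40, 7.0), (10, 50, 6.0)])
```

Because the colinearity test rejects any crossing pair, a sketch like this cannot represent the non-local rearrangements targeted by Aim 2; it only shows the greedy step. The consistency check is quadratic in the number of accepted anchors, which is acceptable for illustration but not for genome-scale input.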

Data Mining for finding connections between Disease and Medical Genomic Characteristics

IBM Contact: Fred Kulack
Research Contact: Dr. Vipin Kumar, Dr. Michael Steinbach, University of Minnesota

The first phase of the process concerns data collection and, from a computer science research perspective, the analysis of that data. The interesting research lies in the types of analysis that are appropriate given the unique characteristics of clinical and genomic data (for example, high dimensionality). The idea is that these characteristics call for a different and unique set of analysis strategies, such as the ways in which we determine associations and relationships between clinical values and/or genomic data.
In the second phase, we will apply the techniques just described to determine which new data analysis techniques, algorithms and solutions become possible because of the hardware capability that the Blue Gene supercomputer gives us. That is, what new questions can we answer when we effectively remove the time barrier on some of the calculations?

Tackling the Stochastic Simulation of Biochemical Networks with Real Computing Power

IBM Contact: Tim Mullins
Research Contact: Marc D. Riedel, Assistant Professor, ECE, Univ. of Minnesota

Collaboration between Univ. of Minnesota and the IBM-Rochester Life Sciences Group.
Area of Interest: Leveraging the computing power of the BlueGene system in computational biology.

Biochemical reactions are the lingua franca of the biological sciences. For a variety of cellular processes, from metabolic pathways to gene regulation, biologists have formulated detailed reaction models describing the molecular mechanisms involved. And yet, in spite of the fact that precise models exist, methods for quantitative analysis remain elusive. Translating a descriptive set of reactions into a computational framework has proved challenging.

Despite algorithmic improvements, computation time severely limits the application of stochastic simulation. The computational burden stems from the physical nature of the problem: in a cell, vast numbers of chemical reactions happen nearly simultaneously. Stochastic simulation tracks each and every one of these reactions. If implemented sequentially, each trajectory becomes impossibly long. The problem is clearly ripe for parallelization, and IBM Blue Gene seems to be the right platform.
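
To show what tracking every reaction event means in practice, the following Python sketch implements Gillespie's direct method, a standard stochastic simulation algorithm. The source does not name the algorithm used in this project, so the choice is an assumption, as are the function name, the data layout (each reaction is a propensity function paired with a state change), and the toy decay reaction in the usage lines.

```python
import bisect
import random
from itertools import accumulate


def gillespie_direct(state, reactions, t_end, rng=None):
    """Simulate one trajectory, firing one reaction event per iteration.

    state: dict mapping species name to molecule count.
    reactions: list of (propensity_fn, change) pairs, where propensity_fn
        takes the state and returns a non-negative rate, and change maps
        species name to the count delta applied when the reaction fires.
    Returns a list of (time, state_snapshot) tuples.
    """
    rng = rng or random.Random()
    state = dict(state)
    t = 0.0
    trajectory = [(t, dict(state))]
    while True:
        cumulative = list(accumulate(prop(state) for prop, _ in reactions))
        total = cumulative[-1] if cumulative else 0.0
        if total <= 0.0:
            break  # no reaction can fire any more
        # Waiting time to the next event is exponential with rate `total`.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Choose a reaction with probability proportional to its propensity.
        index = bisect.bisect_right(cumulative, rng.random() * total)
        for species, delta in reactions[index][1].items():
            state[species] += delta
        trajectory.append((t, dict(state)))
    return trajectory


# Toy model (hypothetical): first-order decay of species A at rate 0.5 per molecule.
decay = [(lambda s: 0.5 * s["A"], {"A": -1})]
traj = gillespie_direct({"A": 20}, decay, t_end=float("inf"), rng=random.Random(0))
```

Each loop iteration handles exactly one reaction event, which is why a single trajectory grows long when reactions are frequent. Separate trajectories are statistically independent, so one simple way to use many processors is to run an ensemble with a different seed on each; whether that is the parallelization this project intends is not stated in the source.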