Published on 22-Apr-2012
Validated on 12 Nov 2013
"There are lots of good reasons to use Revolution Analytics in an analytics appliance super computer like PureData System for Analytics.1 It’s faster and easier to program. It will speed up our computation. And, we have more data sets available to us because of how flexible and quick Revolution Analytics makes it to add and delete variables in our model." - Dr. Murali Ramanathan, University at Buffalo, State University of New York
State University of New York (SUNY) at Buffalo
PureData System for Analytics (powered by Netezza technology), PureSystems, Big Data, Data Warehouse, Smarter Planet
IBM Business Partner:
The State University of New York (SUNY) at Buffalo is home to one of the leading multiple sclerosis (MS) research centers in the world. MS is a devastating, chronic neurological disease that affects nearly one million people worldwide.
Researchers required the ability to quickly build models using a range of variable types and run them in a high-performance environment against huge data sets.
The solution helped the SUNY Buffalo researchers consolidate all reporting and analysis in one location to improve the efficiency, sophistication and impact of their research.
The SUNY researchers were able to reduce the time required to conduct analysis from 27.2 hours to 11.7 minutes.
The State University of New York (SUNY) at Buffalo is home to one of the leading multiple sclerosis (MS) research centers in the world. MS is a devastating, chronic neurological disease that affects nearly one million people worldwide. The disease causes physical and cognitive disabilities in individuals and is characterized by inflammation and neuro-degeneration of the brain and spinal cord. From the beginning, the genetics of MS were known to be complex and it was apparent that no single gene was likely causative for the disease.
Since 2007, the SUNY team has been looking at data obtained from scanned genomes of MS patients to identify genes whose variations could contribute to the risk of developing MS. New technologies now enable hundreds of thousands of genetic variations, called single nucleotide polymorphisms (SNPs), to be obtained from single samples. According to the research lead, Dr. Murali Ramanathan, a critical fact in the study of MS is that “gene products work by interacting with both other gene products and environmental factors.”
“Because of this, researchers have postulated that multiple SNPs, combined with environmental variables, would better explain the risk of developing MS. Finding these combinations is analogous to the proverbial search for a needle in a haystack,” says Dr. Ramanathan. Identifying candidate environmental factors that could be used to prevent the disease from progressing in patients was of great importance. Examples include sun exposure and vitamin D levels, Epstein-Barr virus infection and smoking.
The researchers needed to create algorithms that would efficiently identify interactions and achieve ‘parsimony’, that is, explain the data with as few variables as possible. The team was looking for the smallest combination of SNPs, environmental variables and phenotypic variables that would help explain the presence of MS.
The researchers at the State University of New York developed an approach they called AMBIENCE that is distinctive in its use of new information theoretic methods along with its versatility and scalability. The theoretic underpinnings of AMBIENCE enable the detection of both linear and non-linear dependencies in the data. The AMBIENCE algorithm is capable of conducting an efficient search of the large combinatorial space because of the unique nature of the information theoretic metrics, which allows for greedy search identification of the most promising combinations.
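The idea of scoring variable combinations with an information theoretic metric and searching greedily can be illustrated with a minimal Python sketch. This is not the AMBIENCE algorithm itself: it substitutes plain mutual information for the team's metrics, and the variable names (snp_a, snp_b, noise) and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(columns, phenotype):
    """I(X1..Xk; Y) = H(X) + H(Y) - H(X, Y), with the k columns treated jointly."""
    joint_x = list(zip(*columns))
    joint_xy = list(zip(joint_x, phenotype))
    return entropy(joint_x) + entropy(phenotype) - entropy(joint_xy)


def greedy_search(data, phenotype, max_order=3):
    """Forward selection: keep adding the variable that most raises the joint score."""
    selected, best_score = [], 0.0
    while len(selected) < max_order:
        scored = [
            (mutual_information([data[n] for n in selected + [name]], phenotype), name)
            for name in data
            if name not in selected
        ]
        if not scored:
            break
        score, name = max(scored)
        if score <= best_score + 1e-12:
            break  # nothing adds information, so stop (parsimony)
        selected.append(name)
        best_score = score
    return selected, best_score


# Hypothetical toy data: the phenotype is 1 only when both SNPs are 1.
data = {
    "snp_a": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp_b": [0, 0, 1, 1, 0, 0, 1, 1],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
phenotype = [0, 0, 0, 0, 0, 0, 1, 1]

selected, score = greedy_search(data, phenotype)
print(selected, round(score, 4))
```

In this toy case the search keeps the two interacting variables and drops the uninformative one. A greedy search of this simple form can miss interactions in which no single variable is informative on its own, which is one reason the choice of metric matters.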
The data sets used in this type of multi-variable research are very large and the analysis is computationally very demanding because the researchers are looking for significant interactions between thousands of genetic and environmental factors. There are two issues to overcome: crunching through the immense data set and building analytic models that allow the team to look beyond first-order interactions. The researchers want to see not only which single variables are significant, but also which pairs or triples of variables are significant. This requires the ability to quickly build models using a range of variable types and run them in a high-performance environment against huge data sets. It also requires the ability to include an almost limitless variety of dependent variable types.
The computational challenge in gene-environmental interaction analyses is due to a phenomenon called ‘combinatorial explosion.’ Considering that there are thousands of SNPs, the number of combinations of SNPs that have to be assessed for uncovering potential interaction becomes incredibly large.
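The growth described above can be made concrete with a short Python sketch. The SNP count of 100,000 is an assumed round figure for illustration, not a number reported by the researchers.

```python
from math import comb


def count_combinations(n_variables, max_order):
    """Number of distinct variable subsets for each interaction order 1..max_order."""
    return {k: comb(n_variables, k) for k in range(1, max_order + 1)}


N_SNPS = 100_000  # hypothetical SNP count, chosen only for illustration

counts = count_combinations(N_SNPS, 4)
for order, n in counts.items():
    print(f"order {order}: {n:.3e} combinations")
```

Under this assumption, pairs already number in the billions, and four-way combinations exceed 10^18 before any environmental or phenotypic variables are added.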
Before they could run the analysis using all of the variables presented in the data set, the researchers needed an analytic framework that would allow them to add and remove variables from the model quickly and easily, without having to write hundreds of lines of code. The sheer number of SNPs, combined with environmental variables and phenotype values, means that the number of computations necessary for data mining could reach the quintillions (a 1 followed by 18 zeros).
SUNY Buffalo is using Revolution R Enterprise in conjunction with IBM PureData System for Analytics, powered by Netezza technology, to dramatically simplify and speed up very complex analysis on large data sets.
The organization’s researchers knew they needed the level of processing only available through a high-performance computer (HPC). They also needed capabilities found in relational databases, which are not often included in HPC platforms. HPCs typically run their processes only in parallel, breaking up calculations to run simultaneously across multiple processors. In addition, HPC platforms often use field-programmable gate array (FPGA) architectures for additional speed.
PureData System for Analytics offered the features and performance the researchers needed and much more. From an analytics perspective, the SUNY Buffalo researchers were able to write a version of their software tools in Revolution R Enterprise for IBM Netezza, so that all their reporting and analysis were consolidated in one location. This removed the need to move very large amounts of data in and out of PureData System for Analytics, which would have caused further delays. They were also able to use a wider variety of data sets.
Due to the nature of SUNY Buffalo's research work, there is immense value in now being able to use a wide variety of data sets to study the interactions among a greater range of variables. The solution improved the efficiency, sophistication and impact of the research. SUNY’s team adopted R—the core of Revolution R Enterprise for IBM Netezza—because it offered the flexibility to include diverse types of variables such as categorical, discrete Poisson-dependent, or continuous normally-distributed variables by simply adding a few lines of code. In the past, the SUNY team would have to rewrite entire algorithms, requiring a great deal of staff time. Now, scientists can change the algorithm themselves by adding a new entropy function, and focus less on writing algorithms and more on the science.
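The notion of supporting a new variable type by adding an entropy function can be sketched as follows. The sketch is written in Python for illustration rather than in R, and every name in it (ENTROPY_FUNCTIONS, entropy_of and the three estimator functions) is hypothetical; it does not reproduce the team's code or any Revolution R Enterprise API.

```python
from collections import Counter
from math import e, exp, lgamma, log, pi


def categorical_entropy(values):
    """Shannon entropy (in nats) of observed category frequencies."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def normal_entropy(values):
    """Differential entropy of a fitted normal: 0.5 * ln(2 * pi * e * variance).

    Assumes the values are not all identical (variance > 0).
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return 0.5 * log(2 * pi * e * variance)


def poisson_entropy(values, tail=200):
    """Entropy (in nats) of a Poisson whose rate is the sample mean, summed over k < tail."""
    rate = sum(values) / len(values)
    if rate == 0:
        return 0.0
    total = 0.0
    for k in range(tail):
        log_p = -rate + k * log(rate) - lgamma(k + 1)
        total -= exp(log_p) * log_p
    return total


# Registry of estimators: supporting a new variable type means adding one entry.
ENTROPY_FUNCTIONS = {
    "categorical": categorical_entropy,
    "poisson": poisson_entropy,
    "normal": normal_entropy,
}


def entropy_of(values, kind):
    """Dispatch to the estimator registered for this variable type."""
    return ENTROPY_FUNCTIONS[kind](values)


print(entropy_of(["A", "B", "A", "B"], "categorical"))
print(entropy_of([-1.0, 1.0], "normal"))
print(entropy_of([0, 1, 2, 1], "poisson"))
```

With a registry of this kind, the search code that consumes entropies stays unchanged when a new variable type is introduced.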
SUNY Buffalo deployed PureData System for Analytics as its research analytics infrastructure, along with Revolution R Enterprise for IBM Netezza. Once the genetic data were assembled, the environmental and phenotype data were combined, and the algorithms were customized, the researchers could look for potential factors contributing to the risk of developing MS.
The SUNY researchers were able to:
● Use the new algorithms and add multiple variables, which was previously nearly impossible.
● Reduce the time required to conduct analysis from 27.2 hours to 11.7 minutes.
● Carry out their research with little to no database administration. (Unlike other HPC platforms or databases available, PureData System for Analytics was designed to require a minimum amount of maintenance.)
● Publish multiple articles in scientific journals, with more in process.
● Proceed with studies based on ‘vector phenotypes’—a more complex variable that will further push IBM PureData System for Analytics.
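The reported reduction in analysis time can be expressed as a speedup factor with simple arithmetic. The short Python sketch below uses only the two figures quoted above.

```python
before_minutes = 27.2 * 60  # 27.2 hours expressed in minutes
after_minutes = 11.7

speedup = before_minutes / after_minutes
print(f"about {speedup:.0f} times faster")
```

This works out to a speedup of roughly 139-fold.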
About IBM PureData System for Analytics
The IBM PureData System for Analytics, powered by Netezza technology, integrates database, server and storage into a single, easy-to-manage appliance that requires minimal setup and ongoing administration while producing faster and more consistent analytic performance. The IBM PureData System for Analytics simplifies business analytics dramatically by consolidating all analytic activity in the appliance, right where the data resides, for industry-leading performance. Visit: ibm.com/PureData to see how our family of expert integrated systems eliminates complexity at every step and helps you drive true business value for your organization.
About IBM Data Warehousing and Analytics Solutions
IBM provides the broadest and most comprehensive portfolio of data warehousing, information management and business analytic software, hardware and solutions to help customers maximize the value of their information assets and discover new insights to make better and faster decisions and optimize their business outcomes.
Products and services used
IBM products and services that were used in this case study.
IBM Netezza Analytics, IBM Netezza 1000, PureData System for Analytics (powered by Netezza technology)
© Copyright IBM Corporation 2013. IBM, the IBM logo, and ibm.com are trademarks of IBM Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml