“The IBM Spark application for the analysis of ATA data is an exciting project that will drive innovation applicable across many big data applications, and it opens the way for IBM to maintain a publicly accessible Spark application that enables data science professionals to work with real data that matters, and to see firsthand how IBM is the leader in big data analytics.”
Linda Bernardi, Chief Innovation Officer, IBM Cloud & IoT
Over the last ten years, millions of complex rows of relational data from the Allen Telescope Array (ATA) have been recorded and are now being sifted, analyzed, and visualized. The SETI Institute's mission is to explore, understand, and explain the origin and nature of life in the universe.
The IBM jStart team has joined with the SETI Institute to develop a Spark application to analyze the 168 million radio events detected by the ATA over several years. The complex nature of the data demands sophisticated mathematical models to tease out faint signals, and machine learning algorithms to separate terrestrial interference from true signals of interest. These requirements are well suited to the scalable in-memory capabilities offered by Apache Spark, especially when combined with the big data capabilities of IBM Cloud Data Services.
With over 4.5 TB of data streaming from the ATA dishes every hour, any improvement in the analytics used to identify signals of interest makes a big difference. The IBM Spark project is already doing exactly that: by analyzing the vast archives of ATA content, the team is developing new algorithms to isolate human radio frequency interference (RFI) from external signals that deserve further scrutiny.
For example, new “topographic” algorithms for signal enhancement have been applied to archived ATA signal recordings, greatly improving the ability to apply Spark machine learning to automatic signal-type classification (e.g., identifying radar interference). These approaches to the SETI analytic mission have never been tried before, and they are enabled by the capabilities of the IBM Apache Spark services and the expertise of the project team assembled to leverage them.
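The project's actual classification algorithms are not described in detail here, but the idea of automatic signal-type classification can be illustrated with a minimal nearest-centroid sketch. The feature names and training values below (drift rate, bandwidth) are hypothetical, chosen only to show the technique:

```python
import math

# Toy feature vectors: (drift rate in Hz/s, bandwidth in Hz).
# These are hypothetical illustration values, not real ATA measurements.
TRAINING = {
    "radar_interference": [(0.0, 5000.0), (0.1, 4800.0)],
    "narrowband_candidate": [(1.2, 2.0), (0.9, 1.5)],
}

def centroid(points):
    """Mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

CENTROIDS = {label: centroid(pts) for label, pts in TRAINING.items()}

def classify(signal):
    """Assign a signal to the class with the nearest centroid (Euclidean)."""
    return min(CENTROIDS, key=lambda label: math.dist(signal, CENTROIDS[label]))

print(classify((1.0, 1.8)))  # → narrowband_candidate
```

At ATA scale this kind of classification would run as a distributed Spark MLlib job over the full archive rather than as a local loop, but the labeling logic is the same.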
These advances can help surface interesting signals that have previously been ignored, and they also open the way for improvements to the real-time decisioning systems at the ATA itself, permitting better processes for capturing data of interest and alerting operators to return to a target of interest in real time.
“…what really matters is buried beneath mundane data”
Most importantly, this IBM Spark application has a very practical measure of success which is already being demonstrated: the real issue is not finding the needle in the haystack; it is eliminating all the hay to leave the potential needles behind. Eliminating our own radio frequency interference from the ATA data is a central analytic objective, and it is consistent with the demands of customers in other industries that face the same problem: what really matters is buried beneath mundane data.
The POC has been implemented using the IPython Notebook service on Apache Spark, deployed on IBM Cloud Data Services (CDS). ATA data archives have been loaded into the CDS object store in a format that facilitates signal processing and experimentation. As part of the POC development process, various IPython notebooks (“notebooks”) are being created to leverage the in-memory analytics of the Spark environment. These notebooks will create a self-documenting repository of ATA signal processing research which can be searched, referenced, shared, and improved upon in a collaborative manner.
Lessons Learned with Complex Data
Tens of millions of ATA signal events have been recorded in binary files, which in turn are linked to hundreds of millions of records in a structured database that provides additional information about the signal event, such as the exact date-time of the signal, the target coordinates, and other details. The IBM Spark project is linking these two data sets – in their entirety – for the first time. Many industry applications generate similar types of big data and face the same analytic challenges. Examples include complex telemetry from engine components in an IoT application, or seismic signal data generated while exploring for new gas deposits.
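The linkage described above, binary signal captures joined to their structured database records, can be sketched in miniature. The field names and values below are illustrative only, not the actual ATA schema; at full scale this lookup becomes a distributed Spark join rather than a dictionary access:

```python
# Hypothetical mapping from binary capture files to raw payloads.
signal_files = {
    "act1234_sig01.bin": b"\x00\x01\x02\x03",  # raw signal samples
    "act1234_sig02.bin": b"\x04\x05\x06\x07",
}

# Hypothetical structured metadata records for the same events.
db_records = [
    {"file": "act1234_sig01.bin", "obs_time": "2014-05-20T02:11:07Z",
     "ra_deg": 285.4, "dec_deg": 85.1},
    {"file": "act1234_sig02.bin", "obs_time": "2014-05-20T02:11:09Z",
     "ra_deg": 285.4, "dec_deg": 85.1},
]

# The join: each metadata record gains a handle to its raw payload.
linked = [dict(rec, payload=signal_files[rec["file"]]) for rec in db_records]

for rec in linked:
    print(rec["file"], rec["obs_time"], len(rec["payload"]), "bytes")
```

Once joined, every downstream analytic (clustering, classification, Doppler correction) can work from a single enriched record instead of reconciling two stores on the fly.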
The SETI application and ATA data are highly representative of the big data challenges that our customers face across multiple industries. The project is a great proving ground for “stress testing” the IBM Spark service platform, and for learning implementation techniques we can apply elsewhere for customer success. For example, just the challenge of storing and efficiently accessing over 25 million different binary files has demanded some innovative file management techniques.
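The article does not disclose which file management techniques jStart used, but one standard remedy for the “millions of tiny files” problem is bundling many small objects into a single archive, so object stores and Spark readers pay far fewer metadata round-trips. A stdlib-only sketch of the idea:

```python
import io
import tarfile

# Hypothetical small signal captures: 1,000 files of 64 bytes each.
captures = {f"signal_{i:06d}.bin": bytes([i % 256] * 64) for i in range(1000)}

# Pack them into one in-memory tar archive (one object instead of 1,000).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as bundle:
    for name, payload in captures.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        bundle.addfile(info, io.BytesIO(payload))

# Reading back: the archive preserves every member by name.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as bundle:
    print(len(bundle.getnames()))
```

In a Hadoop/Spark setting the same role is typically played by SequenceFiles or similar container formats, but the trade-off is identical: fewer, larger objects in exchange for an index lookup on read.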
Bigger Data and Better Algorithms
With almost 170 million signals recorded in the last ten years, it is important to find new ways to cluster signals for further analysis. The IBM Spark platform is enabling the team to do this, not only because the environment can scale to the big data challenges, but also because the interactive “notebook” user interface provides an environment that encourages exploration and experimentation.
For example, IBM researchers have applied complex algorithms to the signal data that eliminate the signal distortion created by the motion of the Earth, both its rotation and its orbit around the sun. This has allowed data scientists to use this new “corrected” signal data as a new feature for analysis and clustering, and new findings are already emerging.
One early result: “corrected” signals determined to have a source with no acceleration relative to the sun are found to cluster in an area of the sky with a declination of 85 degrees.
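The project's actual correction pipeline is not detailed here; in practice it would derive the telescope's velocity from ephemerides. But the core idea, removing the Doppler shift imposed by the Earth's motion, can be sketched with a first-order correction for a known line-of-sight velocity (the numbers below are hypothetical):

```python
C = 299_792_458.0  # speed of light, m/s

def to_rest_frame(observed_hz, v_los_mps):
    """First-order Doppler correction: recover the emitted frequency from
    an observed one, given the observer's line-of-sight velocity toward
    the source (positive = approaching, which blueshifts the signal)."""
    return observed_hz / (1.0 + v_los_mps / C)

# Hypothetical case: a tone observed at 1420.0 MHz while the telescope
# moves toward the source at the Earth's full orbital speed (~29.8 km/s).
# The correction shifts it down by roughly 141 kHz.
observed = 1.420e9
corrected = to_rest_frame(observed, 29_800.0)
print(f"{observed - corrected:.0f} Hz removed")
```

Applying such a correction to every one of tens of millions of recorded signals is exactly the kind of embarrassingly parallel workload that Spark distributes well.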
“…without IBM Spark, we would not even have known to ask”
At the moment, the cause for this spike in observations is unclear, and further investigation is underway. It will likely have a simple explanation, but the main point is that, without IBM Spark, we would not even have known to ask.
These are complex calculations that need to be applied to tens of millions of signals, and they open exciting new territory for SETI signal processing. This is a perfect fit for the distributed analytic capabilities of Spark.
jStart Sparks Innovation
The jStart team has led the SETI project with the focused goal of proving the efficacy of our new Apache Spark services to analyze very large and complex data sets in a collaborative team workspace. In order to push the boundaries of our analytic objectives, we have assembled a data science team of world experts in signal processing and astronomical data analysis, which includes data scientists and astronomers from Penn State University, NASA Space Science Division, the SETI Institute, IBM Almaden Research Lab, and the new IBM Johannesburg Research facility. The user experience expectations and analytic demands of this team, and the data they are working with, are highly representative of how we expect IBM customers in other industries to use our Spark services in a production setting.
Already, the SETI project is generating innovations and lessons learned that can be applied to other Spark applications. For example, the IBM Cloud Data Services team worked with the jStart team to create a shared object store layer for the main SETI data repository that can be reliably accessed without disruption even when the Spark services are upgraded and enhanced. The data science team is then free to use their own object store services within their Bluemix accounts to create derivative data extracts for exploration and localized analysis. This combination of a controlled and persistent data lake with localized "sandboxes" of data for collaborative experiments is a key capability that IBM can take from this project to other customer engagements.
In addition, the cross-organization collaboration is also paying dividends on how IBM can approach new opportunities. For example, the IBM researcher from the Johannesburg labs is also the lead IBM Research scientist for the new Square Kilometer Array (SKA) radio telescope project in South Africa. He has joined the SETI project specifically to learn how to best apply IBM Spark services at the IBM engagement at the SKA data processing facility.
This is the kind of leadership that emerges when IBM steps up to projects that stretch the boundaries of analytics and push for insight that matters.
Start Small, Grow Fast
Learn how the jStart team can help your business get started using our "start small, grow fast" engagement process. Today's business challenges aren't just about huge amounts of information; they're about leveraging the valuable insights and opportunities living within that data. jStart is a highly skilled team focused on providing fast, smart, and valuable business solutions leveraging the latest technologies. The team typically focuses on emerging technologies which have commercial potential within 12-18 months. This allows the team to keep ahead of the adoption curve, while being prepared for client engagements and partnerships. The team's focus includes: predictive and prescriptive analytics, cognitive computing, cloud technologies, big data, social data, and mobile platforms.