Scenarios for integrating InfoSphere Data Click and InfoSphere BigInsights

You can use InfoSphere® Data Click to move large volumes of data into a Hadoop Distributed File System (HDFS) in InfoSphere BigInsights®. You can then use InfoSphere BigInsights to discover and analyze the business insights that are hidden in that data.

Because InfoSphere Data Click simplifies data movement, you do not have to be an experienced ETL developer to integrate data. Whether you are an analyst, data scientist, or line-of-business user, you can use InfoSphere Data Click to integrate data sets of any size, even into the terabyte or petabyte range.

After the data is copied into the HDFS, you can use InfoSphere BigInsights to derive business value from complex, unstructured information. For example, you can use IBM Big SQL, the IBM SQL interface to InfoSphere BigInsights, to summarize, query, and analyze data. InfoSphere BigInsights supports various scenarios that can help organizations grow by finding value that is hidden in data and data relationships.

Scenario: Augmenting the data warehouse

An international retailer has a traditional data warehouse and wants to augment its business intelligence environment with new unstructured data sources that exist only in an HDFS. To run new analytical models, the retailer must combine the unstructured information with data from the traditional data warehouse.

To augment the data warehouse infrastructure, the retailer uses InfoSphere Data Click to copy customer and product information from the warehouse into an HDFS. The retailer then uses the BigSheets tool, which is part of InfoSphere BigInsights, to combine the warehouse data with structured, unstructured, and streaming data from other sources. Now, the retailer can filter and analyze the combined set of data.

By using the self-service integration features of InfoSphere Data Click with InfoSphere BigInsights, the retailer can move and analyze data in hours rather than days or weeks.

Scenario: Using an HDFS as a landing zone

A financial services company receives data from external and internal sources, which send the data in different ways and in various formats. To make its processes more efficient, the company wants to use an HDFS as a common storage area, or landing zone, for the data that comes from its internal partners. Unfortunately, the internal partners lack the data integration skills to send data to the HDFS, which creates a roadblock to achieving the new corporate objective.

The company decides to use InfoSphere Data Click to help the internal partners get their data into the landing zone. Line-of-business users from the marketing, finance, and human resources departments can run InfoSphere Data Click activities to copy data from their data sources to the HDFS. In the HDFS, InfoSphere BigInsights processes can complete detailed analysis on complex sets of data in hours.

The line-of-business users can run InfoSphere Data Click activities on demand, providing current and comprehensive insights into the challenges that are faced by each part of the organization. The combination of InfoSphere Data Click and InfoSphere BigInsights gives the line-of-business users a simple infrastructure and process for both moving data and running analytics on that data.

Scenario: Identifying data types to analyze in InfoSphere BigInsights

Data scientists for a large insurance company need to run exploratory statistical models across huge amounts of data. The data scientists want to use InfoSphere BigInsights to analyze large volumes of varied data, but the data is scattered across many databases and other data sources. In addition, the data scientists do not know what kinds of data they have.

The data scientists use the Information Governance Catalog to search the data sources for specific types of data. For example, suppose that some of the data sources are indexed based on meaningful columns such as department, geographical region, and product name. Because the data scientists want to run statistical models on data from the South America geographical region, they search for South America. The Information Governance Catalog shows a list of database tables and other data sources, with an explanation of why the sources are related to the search term.
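The search behavior described above can be pictured with a small, product-independent sketch. It is not the Information Governance Catalog API. The `DataSource` class, the `search_catalog` function, and the table names are hypothetical and exist only to illustrate how a search term might be matched against indexed column values and returned with a reason for each match.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """Hypothetical catalog entry: a source name plus its indexed column values."""
    name: str
    indexed_values: dict[str, list[str]]  # column name -> values indexed for that column


def search_catalog(sources: list[DataSource], term: str) -> list[tuple[str, str]]:
    """Return (source name, reason) pairs for sources with an indexed value equal to the term."""
    needle = term.casefold()
    results = []
    for source in sources:
        for column, values in source.indexed_values.items():
            if any(needle == value.casefold() for value in values):
                results.append((source.name, f"column '{column}' contains '{term}'"))
                break  # one reason per source is enough for this sketch
    return results


# Hypothetical sample data, not taken from any real catalog.
catalog = [
    DataSource("CLAIMS_2014", {"region": ["South America", "Europe"], "department": ["Auto"]}),
    DataSource("POLICY_MASTER", {"region": ["North America"], "product_name": ["Home Plus"]}),
]

matches = search_catalog(catalog, "South America")
for name, reason in matches:
    print(f"{name}: {reason}")
```

In this sketch, only the source whose indexed region column holds the search term is returned, together with a short explanation of the match, which mirrors the kind of result list that the scenario describes.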

By browsing the search results, the data scientists identify the data sources that have the data that they want to analyze and the kinds of data in those data sources. The data scientists select the data to analyze and open InfoSphere Data Click directly from the Information Governance Catalog. When InfoSphere Data Click opens, the sources that were selected in the catalog are already selected in InfoSphere Data Click. The data scientists can use InfoSphere Data Click to copy the data to an HDFS and use InfoSphere BigInsights to analyze the data.