InfoSphere DataStage

InfoSphere® DataStage® is a data integration tool that enables users to move and transform data between operational, transactional, and analytical systems.

Data transformation and movement is the process by which source data is selected, converted, and mapped to the format required by target systems. The process manipulates data to bring it into compliance with business, domain, and integrity rules, and with other data in the target environment.

InfoSphere DataStage provides direct connectivity to enterprise applications as sources or targets, ensuring that the most relevant, complete, and accurate data is integrated into your data integration project.

By using the parallel processing capabilities of multiprocessor hardware platforms, InfoSphere DataStage enables your organization to solve large-scale business problems. Large volumes of data can be processed in batch, in real time, or as a web service, depending on the needs of your project.

Data integration specialists can use the hundreds of prebuilt transformation functions to accelerate development time and simplify the process of data transformation. Transformation functions can be modified and reused, decreasing the overall cost of development and increasing your effectiveness in building, deploying, and managing your data integration infrastructure.

As part of the InfoSphere Information Server suite, InfoSphere DataStage uses the shared metadata repository to integrate with other components, including data profiling and data quality capabilities. An intuitive web-based operations console enables users to view and analyze the runtime environment, enhance productivity, and accelerate problem resolution.

Balanced Optimization

Balanced Optimization helps to improve the performance of your InfoSphere DataStage job designs that use connectors to read or write source data. You design your job and then use Balanced Optimization to redesign the job automatically according to your stated preferences.

For example, you can maximize performance by minimizing the amount of input and output (I/O) that is performed, and by balancing the processing against source, intermediate, and target environments. You can then examine the new optimized job design and save it as a new job. Your original job design remains unchanged.

You can use the Balanced Optimization features of InfoSphere DataStage to push sets of data integration processing and related data I/O into a database management system (DBMS) or into a Hadoop cluster.
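The pushdown idea behind Balanced Optimization can be illustrated with a small sketch. This is not the actual optimizer, and the table, column, and function names are hypothetical; it only shows the principle of folding a filter and an aggregation into the SQL sent to the source DBMS, so that far fewer rows cross the network into the engine.

```python
# Hedged sketch of source-side pushdown (not the actual Balanced
# Optimization engine). A job that filters and aggregates rows can
# either pull every row into the engine, or fold the filter and
# aggregate into the SQL that the DBMS executes.

def naive_plan(table, predicate, group_by, measure):
    """Engine does the work: select everything, then filter and
    aggregate in memory -- every row crosses the wire."""
    return f"SELECT * FROM {table}"

def pushed_down_plan(table, predicate, group_by, measure):
    """DBMS does the work: only aggregated rows cross the wire."""
    return (f"SELECT {group_by}, SUM({measure}) "
            f"FROM {table} WHERE {predicate} GROUP BY {group_by}")

# Hypothetical example: summarize large orders per region.
print(pushed_down_plan("orders", "amount > 100", "region", "amount"))
# → SELECT region, SUM(amount) FROM orders WHERE amount > 100 GROUP BY region
```

The same balancing choice applies in the other direction: work that the source system is poorly suited for can be left in the engine or pushed toward the target environment instead.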

Integration with Hadoop

InfoSphere DataStage includes additional components and stages that enable integration between InfoSphere Information Server and Apache Hadoop. You use these components and stages to access and interact with files on the Hadoop Distributed File System (HDFS).

Hadoop is the open source software framework that is used to reliably manage large volumes of structured and unstructured data. HDFS is a distributed, scalable, portable file system written for the Hadoop framework. This framework enables applications to work with thousands of nodes and petabytes of data in a parallel environment. Scalability and capacity can be increased by adding nodes without interruption, resulting in a cost-effective solution that can run on multiple servers.
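The scalability described above comes from how HDFS stores data: a file is split into fixed-size blocks, and each block is replicated across several nodes. The sketch below is a deliberate simplification (real HDFS placement is rack-aware and considers node load), but it shows why adding nodes adds both capacity and parallelism.

```python
# Simplified sketch of HDFS-style block placement: split a file into
# fixed-size blocks and replicate each block on several nodes.
# Real HDFS placement is rack-aware and more involved.

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size: 128 MB
REPLICATION = 3                  # HDFS default replication factor

def place_blocks(file_size, nodes):
    """Return a list of (block_index, [replica_nodes]) assignments."""
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    placement = []
    for b in range(n_blocks):
        # Round-robin the replicas over the cluster nodes.
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        placement.append((b, replicas))
    return placement

# A 1 GB file on a hypothetical 5-node cluster: 8 blocks, 3 copies each.
layout = place_blocks(1024 * 1024 * 1024, ["n1", "n2", "n3", "n4", "n5"])
print(len(layout))  # → 8
```

Because every block lives on multiple nodes, a parallel engine can read different blocks from different nodes at the same time, and the loss of one node does not lose data.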

InfoSphere DataStage provides massive scalability by running jobs on the InfoSphere Information Server parallel engine. By supporting integration with Hadoop, InfoSphere DataStage enables your organization to scale both the storage and the data integration processing required to make your Hadoop projects successful.

Big Data File stage
The Big Data File stage enables InfoSphere DataStage to exchange data with Hadoop sources so that you can include enterprise information in analytical results. These results can then be applied in other IT solutions.
Oozie Workflow Activity stage
The Oozie Workflow Activity stage enables integration between Oozie and InfoSphere DataStage. Oozie is a workflow system that you can use to manage Hadoop jobs.
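Oozie workflows themselves are defined in XML. The fragment below is a minimal sketch of a workflow definition, with hypothetical names and paths, showing the basic structure an Oozie workflow follows: a start node, actions with ok/error transitions, and kill/end nodes.

```xml
<!-- Minimal Oozie workflow sketch; names and the HDFS path are
     hypothetical placeholders, not values from this document. -->
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="cleanup"/>

    <!-- A file-system action: delete a previous output directory. -->
    <action name="cleanup">
        <fs>
            <delete path="hdfs://namenode/user/dsadm/output"/>
        </fs>
        <ok to="done"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="done"/>
</workflow-app>
```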