Introduction to InfoSphere DataStage Balanced Optimization

You can use Balanced Optimization to improve the performance of some IBM® InfoSphere® DataStage® jobs.

You can optimize parallel jobs that use Teradata, IBM DB2®, Netezza, or Oracle connectors to connect to the corresponding databases.

InfoSphere DataStage jobs provide connectivity, data manipulation functionality, and highly scalable performance. The InfoSphere DataStage visual flow-design paradigm is easy to use when designing simple-to-complex data integration jobs. Better performance might be achieved, however, if the processing load can be shared or redistributed among InfoSphere DataStage and the source or target data servers, where data servers are either databases or Hadoop clusters. You can control where the intensive work is done: in source data servers, in InfoSphere DataStage, or in target data servers.

For job designs that use connectors to read data from or write data to data sources, you can use Balanced Optimization to gain greater control over the job. You design your job as normal, then use Balanced Optimization to redesign the job automatically according to your stated preferences. This redesign process can maximize performance by minimizing the amount of input and output performed, and by balancing the processing across source, intermediate, and target environments. You can then examine the new optimized job design and save it as a new job. Your root job design remains unchanged. Balanced Optimization enables you to take advantage of the power of the databases without becoming an expert in native SQL.

The following principles can lead to better performance of parallel jobs:
Minimize I/O and data movement
Reduce the amount of source data read by the job by performing computations within the source data server. Where possible, move processing of data to the data server and avoid extracting data just to process it and write it back to the same data server.
Maximize optimization within source or target data servers
Make use of the highly developed optimizations that data servers achieve by using local indexes, statistics, and other specialized features.
Maximize parallelism
Take advantage of default InfoSphere DataStage behavior when reading from and writing to data servers: use parallel interfaces and pipe the data through the job, so that data flows from source to target without being written to intermediate destinations.

Balanced Optimization uses these principles to improve the potential performance of a job. You influence the job redesign by setting options within the tool to specify which of the principles are followed.
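
The first principle, minimizing I/O and data movement, can be illustrated outside of InfoSphere DataStage with a small sketch. The example below is not Balanced Optimization output and does not use any InfoSphere DataStage API; it assumes an in-memory SQLite database as a stand-in for a source data server, and the `orders` table, its columns, and the function names are hypothetical. It computes the same per-region totals twice: once by extracting every row and aggregating in the client, and once by letting the data server aggregate so that only the summary rows are moved.

```python
import sqlite3


def build_demo_db():
    """Create a hypothetical in-memory 'orders' table standing in for a source data server."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    rows = [("east", 10), ("east", 20), ("west", 5), ("west", 15), ("west", 30)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return conn


def totals_in_client(conn):
    """Extract every row, then aggregate outside the data server."""
    fetched = conn.execute("SELECT region, amount FROM orders").fetchall()
    totals = {}
    for region, amount in fetched:
        totals[region] = totals.get(region, 0) + amount
    # Return the result and the number of rows that left the data server.
    return totals, len(fetched)


def totals_in_server(conn):
    """Let the data server aggregate; only one row per region is transferred."""
    fetched = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"
    ).fetchall()
    return dict(fetched), len(fetched)


conn = build_demo_db()
client_totals, client_rows = totals_in_client(conn)
server_totals, server_rows = totals_in_server(conn)

print(client_totals == server_totals)  # both approaches give the same totals
print(client_rows, server_rows)        # rows moved: 5 versus 2
```

Both approaches produce identical totals, but the second moves two rows instead of five. On a real source table the difference in data movement is typically far larger, which is the kind of saving that performing computations within the source data server aims for.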

Balanced Optimization does not change or optimize machine configurations, InfoSphere DataStage configurations, or database configurations.