Parallelism basics in IBM InfoSphere Information Server

The pipeline parallelism and partition parallelism that are used in IBM® InfoSphere® Information Server support its high-performance, scalable architecture.

Data pipelining

Data pipelining is the process of pulling records from the source system and moving them through the sequence of processing functions that are defined in the data flow (the job). Because records flow through the pipeline, they can be processed without being written to disk, as Figure 1 shows.

Figure 1. Data pipelining
Concept of data pipelining

Data can be buffered in blocks so that each process is not slowed when other components are running. This approach avoids deadlocks and speeds performance by allowing both upstream and downstream processes to run concurrently.
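As a purely illustrative sketch of this concept (InfoSphere Information Server jobs are designed graphically, not coded this way), the following Python example runs three hypothetical stages as concurrent workers that pass blocks of records through bounded in-memory queues, so downstream processing starts before upstream extraction finishes and nothing is written to disk. The stage logic, block size, and record values are assumptions made only for the example.

```python
import threading
import queue

BLOCK_SIZE = 100          # records per buffered block (illustrative)
SENTINEL = None           # marks the end of the stream

def extract(out_q):
    """Source stage: pull records and push them downstream in blocks."""
    block = []
    for record in range(1, 1001):             # stand-in for a source system
        block.append(record)
        if len(block) == BLOCK_SIZE:
            out_q.put(block)                  # blocks, not whole data sets, are buffered
            block = []
    if block:
        out_q.put(block)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    """Middle stage: starts processing as soon as the first block arrives."""
    while (block := in_q.get()) is not SENTINEL:
        out_q.put([r * 2 for r in block])     # stand-in transformation
    out_q.put(SENTINEL)

def load(in_q):
    """Target stage: consumes blocks while the upstream stages are still running."""
    total = 0
    while (block := in_q.get()) is not SENTINEL:
        total += len(block)
    print(f"loaded {total} records")

q1 = queue.Queue(maxsize=4)   # bounded buffers keep one stage from racing far ahead
q2 = queue.Queue(maxsize=4)
stages = [
    threading.Thread(target=extract, args=(q1,)),
    threading.Thread(target=transform, args=(q1, q2)),
    threading.Thread(target=load, args=(q2,)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()
```

The bounded queues play the role of the buffered blocks described above: a fast upstream stage fills at most a few blocks before it waits for the downstream stage to catch up, so all three stages run concurrently without deadlocking.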

Without data pipelining, the following issues arise:
- Data must be written to disk between processing steps, which increases disk use and the need for disk management.
- A downstream process cannot start until the upstream process has finished writing all of its output, so the steps run serially rather than concurrently.
- The application runs more slowly because of the extra disk I/O.

Data partitioning

Data partitioning is an approach to parallelism that involves breaking the record set into partitions, or subsets of records. If there are no resource constraints or data-skew issues, data partitioning can provide linear increases in application performance. Figure 2 shows data that is partitioned by customer surname before it flows into the Transformer stage.

Figure 2. Data partitioning
Source data partitioned by customer last name
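To make the idea in Figure 2 concrete, here is a minimal Python sketch (the record layout and field names are hypothetical, and this is not InfoSphere code) that hash-partitions records on the customer surname, so that all records with the same surname land in the same partition and can be handled by the same Transformer instance.

```python
import hashlib

def partition_of(key, num_partitions):
    """Map a key to a partition with a stable hash (hash partitioning)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

records = [
    {"surname": "Ford", "account": 1001},
    {"surname": "Dyer", "account": 1002},
    {"surname": "Ford", "account": 1003},
    {"surname": "Patel", "account": 1004},
]

num_partitions = 3
partitions = [[] for _ in range(num_partitions)]
for rec in records:
    partitions[partition_of(rec["surname"], num_partitions)].append(rec)

# Each partition can now be handed to its own Transformer instance;
# records that share a surname are guaranteed to be in the same partition.
for i, part in enumerate(partitions):
    print(f"partition {i}: {part}")
```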

A scalable architecture should support many types of data partitioning, including the following types:
- Hash partitioning on one or more key columns
- Modulus partitioning on a numeric key
- Range partitioning
- Round-robin partitioning
- Random partitioning
- Entire (every partition receives the complete data set)
- Same (the existing partitioning is preserved)
- Database-defined partitioning, such as DB2 partitioning

InfoSphere Information Server automatically partitions data based on the type of partition that the stage requires. Typical packaged tools lack this capability and require developers to manually create data partitions, which results in costly and time-consuming rewriting of applications or the data partitions whenever the administrator wants to use more hardware capacity.
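The following hedged Python sketch illustrates the kind of decision such a framework might make on the developer's behalf; the rule, the stage description, and the record values are illustrative assumptions, not InfoSphere internals. A stage that declares a key requirement (for example, an aggregation) gets key-based hash partitioning, and a stage with no key requirement gets round-robin partitioning to balance the load.

```python
def choose_partitioning(stage):
    """Pick a partitioning method from what the stage declares it needs (illustrative rule)."""
    if stage.get("key"):                # stage needs all records with the same key together
        return ("hash", stage["key"])
    return ("round_robin", None)        # no key requirement: just balance the load

def round_robin(records, n):
    """Deal records out evenly across n partitions."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

def hash_partition(records, n, key):
    """Route each record to a partition based on its key value."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(rec[key]) % n].append(rec)
    return parts

stage = {"name": "Aggregate by surname", "key": "surname"}
records = [{"surname": s, "amount": a}
           for s, a in [("Ford", 10), ("Dyer", 20), ("Ford", 5), ("Patel", 8)]]

method, key = choose_partitioning(stage)
parts = hash_partition(records, 2, key) if method == "hash" else round_robin(records, 2)
print(f"stage '{stage['name']}' -> {method} partitioning:", parts)
```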

In a well-designed, scalable architecture, the developer does not need to be concerned about the number of partitions that will run, the ability to increase the number of partitions, or repartitioning data.

Dynamic repartitioning

In the examples shown in Figure 2 and Figure 3, data is partitioned based on customer surname, and then the data partitioning is maintained throughout the flow.

This type of partitioning is impractical for many uses. For example, a transformation might require data that is partitioned on surname, but the transformed data must then be loaded into the data warehouse partitioned on the customer account number.

Figure 3. A less practical approach to data partitioning and parallel execution
Data partitioned by customer last name throughout data flow

Dynamic data repartitioning is a more efficient and accurate approach. With dynamic data repartitioning, data is repartitioned while it moves between processes, without being written to disk, based on the requirements of the downstream process that the partitioning feeds. The InfoSphere Information Server parallel engine manages the communication between processes for dynamic repartitioning.
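As an in-memory Python sketch of the idea (not the parallel engine itself, and with hypothetical field names and a toy transformation), records are first partitioned on surname for the transformation and then, as each transformed record is produced, routed directly to the partition that the load step needs, keyed on account number, with no intermediate files.

```python
def hash_to(value, n):
    """Stable-enough routing of a key to one of n partitions (illustrative only)."""
    return sum(value.encode("utf-8")) % n if isinstance(value, str) else value % n

records = [
    {"surname": "Ford", "account": 1001, "amount": 250},
    {"surname": "Dyer", "account": 1002, "amount": 75},
    {"surname": "Ford", "account": 1003, "amount": 110},
]
n = 2

# Step 1: partition on surname so the transformation sees related customers together.
transform_parts = [[] for _ in range(n)]
for rec in records:
    transform_parts[hash_to(rec["surname"], n)].append(rec)

# Step 2: as each transformed record is emitted, repartition it on account number
# for the load into the warehouse -- no intermediate files, just re-routing in memory.
load_parts = [[] for _ in range(n)]
for part in transform_parts:
    for rec in part:
        transformed = {**rec, "amount": rec["amount"] * 1.1}   # stand-in transformation
        load_parts[hash_to(transformed["account"], n)].append(transformed)

for i, part in enumerate(load_parts):
    print(f"load partition {i}: {part}")
```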

Data is also pipelined to downstream processes when it is available, as Figure 4 shows.

Figure 4. A more practical approach to dynamic data repartitioning
Data that is dynamically repartitioned throughout the flow

Without partitioning and dynamic repartitioning, the developer must take these steps:
- Manually split the data into partitions and create a separate flow for each partition.
- Write the data to disk and read it back whenever a step requires a different partitioning scheme.
- Redesign and rewrite the job whenever the hardware configuration or the number of partitions changes.

The application will be slower, disk use and management will increase, and the design will be much more complex. The dynamic repartitioning feature of InfoSphere Information Server helps you overcome these issues.