Data sets

Inside an InfoSphere® DataStage® parallel job, data is moved around in data sets. These carry metadata with them: both column definitions and information about the configuration that was in effect when the data set was created.

For example, if a stage limits execution to a subset of the available nodes, and the data set it reads was created by a stage using all nodes, InfoSphere DataStage can detect that the data will need repartitioning.
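The idea can be illustrated with a small conceptual sketch in Python. This is not InfoSphere DataStage code and does not use any DataStage API. The names DataSetMetadata and needs_repartitioning are hypothetical, and the rule shown (compare the node set recorded in the data set with the node set of the consuming stage) is an assumed simplification of whatever check the product actually performs.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass
class DataSetMetadata:
    """Hypothetical stand-in for the metadata a data set carries."""
    columns: Dict[str, str]   # column name -> type name
    nodes: FrozenSet[str]     # nodes in effect when the data set was created


def needs_repartitioning(meta: DataSetMetadata, stage_nodes: FrozenSet[str]) -> bool:
    """Assumed rule: repartition when the consuming stage uses a different node set."""
    return meta.nodes != stage_nodes


# A data set created by a stage that ran on all four nodes.
meta = DataSetMetadata(
    columns={"id": "int32", "name": "string"},
    nodes=frozenset({"node1", "node2", "node3", "node4"}),
)

# A downstream stage constrained to a subset of the nodes.
restricted_stage = frozenset({"node1", "node2"})

print(needs_repartitioning(meta, restricted_stage))  # different node sets
print(needs_repartitioning(meta, meta.nodes))        # same node set
```

Because the configuration travels with the data set, the check needs no information beyond the data set itself and the node constraint of the consuming stage.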

If required, data sets can be landed as persistent data sets, represented by a Data Set stage (see "Data Set Stage"). This is the most efficient way of moving data between linked jobs. Persistent data sets are stored in a series of files linked by a control file. Do not attempt to manipulate these files using UNIX tools such as rm or mv; always use the tools provided with InfoSphere DataStage.

Note: The example screenshots in the individual stage descriptions often show the stage connected to a Data Set stage. This does not mean that these kinds of stages can only be connected to Data Set stages.