Sort stage

The Sort stage is a processing stage that performs more complex sort operations than those available on the Input page Partitioning tab of parallel job stage editors.

You can also use the Sort stage to insert a more explicit simple sort operation where you want to make your job easier to understand. The Sort stage has a single input link which carries the data to be sorted, and a single output link carrying the sorted data.

[Figure: a Sort stage taking data from an input data set, sorting it, and then writing it to an output data set]

You specify sorting keys as the criteria on which to perform the sort. A key is a column on which to sort the data. For example, if you had a name column, you might specify it as the sort key to produce an alphabetical list of names. The first column you specify as a key to the stage is the primary key, but you can specify additional secondary keys. If multiple rows have the same value for the primary key column, InfoSphere® DataStage® uses the secondary key columns, in the order you specified them, to sort these rows.
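The relationship between primary and secondary keys can be illustrated outside the product with a short Python sketch. This is only an analogy for the ordering rule described above, not how the stage is implemented; the column names `last_name` and `first_name` and the helper `sort_by_keys` are assumptions made up for the example.

```python
# Illustrative rows; the column names are hypothetical.
rows = [
    {"last_name": "Smith", "first_name": "Zoe"},
    {"last_name": "Jones", "first_name": "Amy"},
    {"last_name": "Smith", "first_name": "Adam"},
]

def sort_by_keys(rows, keys):
    """Sort rows on the first key, breaking ties with each later key in turn."""
    return sorted(rows, key=lambda row: tuple(row[k] for k in keys))

# "last_name" acts as the primary key, "first_name" as the secondary key.
sorted_rows = sort_by_keys(rows, ["last_name", "first_name"])
for row in sorted_rows:
    print(row["last_name"], row["first_name"])
```

The two Smith rows share a primary key value, so their relative order is decided by the secondary key alone.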

You can sort in sequential mode to sort an entire data set or in parallel mode to sort data within partitions, as shown below:

[Figure: a Sort stage sorting an entire data set sequentially, and a Sort stage sorting parallel data within partitions]

The stage uses temporary disk space when performing a sort. It looks in the following locations, in the following order, for this temporary space.

  1. Scratch disks in the disk pool named sort (you can create these pools in the configuration file).
  2. Scratch disks in the default disk pool (scratch disks are included here by default).
  3. The directory specified by the TMPDIR environment variable.
  4. The directory /tmp.
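The lookup order above can be summarized as a small Python sketch. It only models the precedence of the four locations; it does not read a real configuration file, and the function name `scratch_candidates` and its parameters are hypothetical.

```python
def scratch_candidates(sort_pool_disks, default_pool_disks, env):
    """Return candidate temporary locations in the order listed above.

    sort_pool_disks    -- scratch disks assigned to the "sort" disk pool
    default_pool_disks -- scratch disks in the default disk pool
    env                -- mapping of environment variables (for example os.environ)
    """
    candidates = list(sort_pool_disks) + list(default_pool_disks)
    tmpdir = env.get("TMPDIR")
    if tmpdir:
        candidates.append(tmpdir)
    candidates.append("/tmp")  # final fallback
    return candidates

# No "sort" pool defined, one default scratch disk, TMPDIR set (all paths hypothetical).
order = scratch_candidates([], ["/scratch0"], {"TMPDIR": "/var/tmp"})
print(order)
```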

You might perform a sort for several reasons. For example, you might want to sort a data set by a zip code column, then by last name within the zip code. Once you have sorted the data set, you can filter the data set by comparing adjacent records and removing any duplicates.
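The zip code example can be sketched in Python as follows. This is a minimal illustration of why sorting makes duplicate removal a simple comparison of adjacent records; the sample data and the helper `drop_adjacent_duplicates` are assumptions, and the code says nothing about how a job would implement the filter.

```python
# (zip_code, last_name) records; the values are invented for illustration.
records = [
    ("10001", "Lee"),
    ("02139", "Park"),
    ("10001", "Adams"),
    ("02139", "Park"),
    ("10001", "Lee"),
]

# Tuples compare element by element: zip code first, then last name.
sorted_records = sorted(records)

def drop_adjacent_duplicates(sorted_rows):
    """Keep a row only if it differs from the row kept just before it."""
    result = []
    for row in sorted_rows:
        if not result or result[-1] != row:
            result.append(row)
    return result

unique_records = drop_adjacent_duplicates(sorted_records)
print(unique_records)
```

Because equal records are adjacent after the sort, a single pass that remembers only the previous record is enough.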

However, you must be careful when processing a sorted data set: many types of processing, such as repartitioning, can destroy the sort order of the data. For example, assume you sort a data set on a system with four processing nodes and store the results in a Data Set stage. The data set therefore has four partitions. You then use that data set as input to a stage that executes on a different number of nodes, possibly because of node constraints. Unless you tell it not to, InfoSphere DataStage automatically repartitions the data set to spread it across all the nodes in the system, which can destroy the sort order of the data. You can avoid this by specifying the Same partitioning method. With Same partitioning, the stage does not repartition the input data set as it reads it, so the original partitions are preserved.
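The effect of repartitioning on sort order can be seen in a simplified Python model. The sketch assumes a plain round-robin redistribution from four partitions to three; it is not a description of the partitioning algorithms the product actually uses, and the function names are hypothetical.

```python
def is_sorted(values):
    return all(a <= b for a, b in zip(values, values[1:]))

def round_robin_repartition(partitions, target_count):
    """Deal rows out to target_count new partitions, one row at a time."""
    targets = [[] for _ in range(target_count)]
    index = 0
    for partition in partitions:
        for row in partition:
            targets[index % target_count].append(row)
            index += 1
    return targets

# Four partitions, each sorted within itself (as after a parallel sort).
partitions = [[1, 4, 7], [2, 3, 9], [5, 6, 8], [10, 11, 12]]

repartitioned = round_robin_repartition(partitions, 3)
same = [list(p) for p in partitions]  # "Same": partitions passed through unchanged

print(repartitioned)                          # some partitions are no longer sorted
print(all(is_sorted(p) for p in same))
```

In this model, leaving the partitions untouched is the only option that guarantees each partition is still sorted afterwards.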

You must also be careful when using a stage operating sequentially to process a sorted data set. A sequential stage executes on a single processing node to perform its action. When the data set has more than one partition, a sequential stage collects the data, which might also destroy the sort order of its input data set. You can overcome this if you specify the collection method as follows:

  • If the data was range partitioned before being sorted, you should use the ordered collection method to preserve the sort order of the data set. Using this collection method causes all the records from the first partition of a data set to be read first, then all records from the second partition, and so on.
  • If the data was hash partitioned before being sorted, you should use the sort merge collection method, specifying as collection keys the same keys on which the data was partitioned.
    Note: If you write a sorted data set to an RDBMS, there is no guarantee that it will be read back in the same order unless you specifically structure the SQL query to ensure this.
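The difference between the two collection methods can be sketched in Python. This is a conceptual model only: integer keys stand in for records, `key % 3` stands in for a hash function, and the function names `ordered_collect` and `sort_merge_collect` are assumptions rather than product APIs.

```python
import heapq

# Range partitioned, then sorted: every key in one partition is lower
# than every key in the next partition.
range_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Hash partitioned (modelled here as key % 3), then sorted within each partition.
hash_partitions = [[3, 6, 9], [1, 4, 7], [2, 5, 8]]

def ordered_collect(partitions):
    """Read all of the first partition, then all of the second, and so on."""
    return [row for partition in partitions for row in partition]

def sort_merge_collect(partitions):
    """Repeatedly take the smallest remaining key across all sorted partitions."""
    return list(heapq.merge(*partitions))

print(ordered_collect(range_partitions))    # globally sorted
print(sort_merge_collect(hash_partitions))  # globally sorted
print(ordered_collect(hash_partitions))     # not globally sorted
```

Ordered collection is enough for range partitioned data because the partitions do not overlap; hash partitioned data interleaves key values across partitions, so a merge on the sort keys is needed.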

By default, the stage sorts with the native InfoSphere DataStage sorter, but you can specify that it use the UNIX sort command instead.

The stage editor has three pages:

  • Stage Page. This is always present and is used to specify general information about the stage.
  • Input Page. This is where you specify details about the data sets being sorted.
  • Output Page. This is where you specify details about the sorted data being output from the stage.