Collecting

Collecting is the process of joining the multiple partitions of a single data set back together into a single partition.
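As a rough illustration of the idea only, the following Python sketch models each partition as a list of rows and joins them into one list. It is not how DataStage implements collecting; the function name `collect` and the sample data are hypothetical.

```python
from itertools import chain


def collect(partitions):
    """Join several partitions (lists of rows) into one partition.

    Hypothetical helper for illustration: it reads each partition in
    turn, from first to last, and appends its rows to a single list.
    """
    return list(chain.from_iterable(partitions))


# Three partitions of one data set (made-up sample rows).
partitions = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
single_partition = collect(partitions)
```

After collecting, every row of the data set is present in one partition, so a sequential stage sees the whole data set.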

There are various situations where you might want to do this. There might be a stage in your job that you want to run sequentially rather than in parallel, in which case you need to collect all your partitioned data at that stage so that the stage operates on the whole data set.

Similarly, at the end of a job, you might want to write all your data to a single database, in which case you need to collect it before you write it.

There might be other cases where you do not want to collect the data at all. For example, you might want to write each partition to a separate flat file.

Just as for partitioning, in many situations you can leave DataStage® to work out the best collecting method to use. There are situations, however, where you will want to explicitly specify the collecting method.

Note that most collecting methods are non-deterministic. That is, if you run the same job twice with the same data, you are unlikely to get the data collected in the same order each time. If order matters, you need to use the sorted merge collecting method.
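The difference between an order-dependent collector and a sorted merge can be sketched in Python. This is a minimal model under stated assumptions, not DataStage's implementation: the function names are hypothetical, a seeded random generator stands in for the unpredictable arrival order of rows, and each partition is assumed to be already sorted on the key.

```python
import heapq
import random


def arrival_order_collect(partitions, rng):
    """Simulate a non-deterministic collector (hypothetical model).

    Rows are taken from whichever partition happens to be "ready",
    modelled here by a random choice, so the output order can differ
    from run to run.
    """
    pending = [list(p) for p in partitions if p]
    collected = []
    while pending:
        i = rng.randrange(len(pending))
        collected.append(pending[i].pop(0))
        if not pending[i]:
            del pending[i]
    return collected


def sorted_merge_collect(partitions, key=None):
    """Merge partitions that are each already sorted on the key.

    The output order depends only on the key values, not on which
    partition delivers its rows first.
    """
    return list(heapq.merge(*partitions, key=key))


# Made-up sample data: each partition is sorted on its value.
partitions = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

run_a = arrival_order_collect(partitions, random.Random(1))
run_b = arrival_order_collect(partitions, random.Random(2))
merged = sorted_merge_collect(partitions)
```

In this model, `run_a` and `run_b` contain the same rows but their order depends on the simulated arrival order, whereas `merged` is the same for every run because it is determined by the key alone.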