Partitioning data for parallel reads for the Netezza connector

You must configure the Netezza connector to perform parallel reads; by default the connector runs sequentially. A parallel read is when the data is divided into subsets of data, and then the data is concurrently read by different processing nodes. The Netezza connector supports modulus partitioning. With modulus partitioning the rows are distributed between the processing nodes by adding a modulus expression against the special Netezza column datasliceid to the WHERE clause. For more information about the partition configuration and logical nodes, see the Parallel Job Developer's Guide.

About this task

Procedure

  1. On the job design canvas, double-click the Netezza connector icon.
  2. Select the reference link (for normal lookups only) from the Link drop-down list and then select the Lookup Type as Normal, or, select the output link that you want to configure for the parallel reads.
  3. Set Enable partitioned reads to Yes.
  4. Click OK.

Results

When this job is run, the WHERE clause in your SELECT statement is modified to return a subset of rows that are read by each processing node.

Example

If there are four processing nodes the SELECT statements in each processing node, it displays as follows:
SELECT col1, col2 FROM table WHERE mod(datasliceid,4)=0
SELECT col1, col2 FROM table WHERE mod(datasliceid,4)=1
SELECT col1, col2 FROM table WHERE mod(datasliceid,4)=2
SELECT col1, col2 FROM table WHERE mod(datasliceid,4)=3
If you use a user-defined SQL and the connector reads in parallel, it is your responsibility to ensure appropriate partitioning. To aid in that process the connector defines two place holders that are replaced with actual values at run time:
Place holder Description Value when running sequentially
[[node-count]] The total number of processing nodes. This place holder represents the level of parallelism for the Netezza connector stage and is equal to the number of processing nodes. 1
[[node-number]] The current processing node zero-based index. For example, if there are 4 processing nodes, the processing node indexes are 0,1,2,3. 0
You can use the place holders to create partitioning clauses in your SQL statements. For example:
SELECT * FROM table WHERE mod(datasliceid,[[node-count]])=[[node-number]]
[[node-count]] is replaced with the total number of processing nodes and [[node-number]] is replaced with the current processing node zero-based index.
Note: The column datasliceid is a special Netezza column specific to each table and is not automatically included in views and if needed must be explicitly added to the select list of the view query.