IBM Support

What is the difference between Upsert and Indate?

Question & Answer


Question

Common Connector stages for RDB support "Update then insert" and "Insert then update" Write mode. Sometimes the same job with the same input data behaves differently depending on the Write mode, how this can happen?

Answer

If there are duplicate rows in an array, a Common Connector job with the same input data can show different behavior for "Update then insert" (aka upsert) and "Insert then update" (aka indate).

Common Connector stages for RDB, like Oracle Connector, DB2 Connector, ODBC Connector, etc, do array processing when doing Update or Insert.

In "Update then insert" write mode, it first updates all the rows in the array in a single DB call, and then if there's any update failing row because of "no data found" error, it inserts those failed rows.

In "Insert then update" write mode, it inserts all the rows in the array in a single batch, and then if there's any insertion failing row because of "unique constraint violation" error, it submits update statement for those failed rows.

If there are 2 or more rows to insert with the same value for unique columns in an array, "Update then insert" will meet unique constraint violation error during the insertion phase, and the Common Connector job can fail depending on the "Fail on row error" property setting or the reject link setting.

For example, a parallel job such that "Sequential File --> Oracle Connector" with next configuration will fail if UPSERT write method used, but will succeed if INDATE used.

* Target table layout and initial data

    CREATE TABLE T1(
    K1 CHAR(1) NOT NULL PRIMARY KEY,
    V1 CHAR(1),
    V2 CHAR(1) CHECK (V2 IN ('X', 'Y', 'Z')));

    insert into T1 values('2', '2', 'X');
    insert into T1 values('3', '3', 'X');
    insert into T1 values('4', '4', 'X');
    insert into T1 values('5', '5', 'X');
    insert into T1 values('6', '6', 'X');
    insert into T1 values('7', '7', 'X');
    insert into T1 values('8', '8', 'X');
    insert into T1 values('9', '9', 'X');
    commit;

* Input CSV file for the Sequential file stage.
    0,0,Y
    1,1,X
    1,2,Y
    4,4,Y
    A,A,Z
    B,B,Z
    C,C,Z
    D,D,Z
    E,E,Z

* Oracle Connector stage properties setting


    - Array size : 4

    - Transaction Record count : 8

    - Fail on row error : Yes

    - K1 designated as the Key in Columns tab


* 1 node APT configuration file used.

For UPSERT, Oracle Connector submits "UPDATE T1 SET V1=:V1, V2=:V2 WHERE K1=:K1" statement for the first 4 rows in the array, and the updates for (0, 0, Y), (1, 1, X), and (1, 2, Y) rows fail because there is no matching row in T1. Then it submits " INSERT INTO T1(K1, V1, V2) VALUES(:K1, :V1, :V2)" statement for the 3 failed rows. (0, 0, Y) and (1, 1, X) are inserted successfully but (1, 2, Y) is not because the row (1, 1, X) with the same key column value 1 has been inserted just before. The job aborts because of the "ORA-00001: unique constraint violated" error for (1, 2, Y) row insertion.

For INDATE, Oracle Connector submits the insert statement for the first 4 rows, and the insertion of (1, 2, Y) and (4, 4, Y) fails with ORA-00001 error. Then it submits the update statement for the 2 failed rows, which must succeed. Next 4 rows (A, A, Z), (B, B, Z), (C, C, Z), (D, D, Z) are all inserted well, then the connector submit commit statement. The last row (E, E, Z) is processed well, so the job finishes OK.

Notes:

1. In above UPSERT example, the job will not abort if "Fail on row error" property is set to "No" or a reject link attached to the Oracle Connector stage with "SQL Error - constraint violation" reject condtion chosen. Reject link is a useful tool for finding any upsert error.

2. Obviously we can avoid above UPSERT error by using array size 1, but this is not recommended as job performance can drop drastically with array size 1.

3. Exploiting INDATE to prevent the UPSERT error for duplicate input rows is not a good idea for most cases because the final result in the target table may be unpredictable. For example, if the input data in above example were produced by a DB2 Connector stage instead of the Sequential file stage, the order of the duplicate rows could vary run by run as RDB has no record order. RDB can output the rows (1, 1, X) then (1, 2, Y) order or (1, 2, Y) then (1, 1, X) order as long as proper "order by" clause not given.

4. So, if your job fails to UPSERT due to unique constraint violation at insert phase, review your ETL logic again and make sure the data set does not contain duplicate rows.

[{"Product":{"code":"SSVSEF","label":"IBM InfoSphere DataStage"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF010","label":"HP-UX"},{"code":"PF016","label":"Linux"},{"code":"PF027","label":"Solaris"},{"code":"PF033","label":"Windows"}],"Version":"9.1;8.7;8.5;8.1;8.0.1","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
16 June 2018

UID

swg21652163