
Environment variables required to import and process very large records in a DataStage parallel job.

Troubleshooting


Problem

Occasionally, there is a need to import and process very long records. Making this work requires setting several environment variables, discussed below, and doing so has performance implications.

Symptom

Diagnosing The Problem

You would see one or more of the following errors in the job log:

Error (1):
7.5.x
##E TOIX 000061 19:43:27(001) <import,0> Consumed more than 100000 bytes looking for record delimiter; aborting
8.x or later
##E IIS-DSEE-TFRS-00061 20:40:23(001) <import,0> Consumed more than 100000 bytes looking for record delimiter; aborting

Error (2):
7.5.x
##F TFDR 000043 19:45:13(003) <import,0> Fatal Error: Virtual data set.; output of "import": the record is too big to fit in a block; the length requested is: 1980449.
8.x or later
##F IIS-DSEE-TFDR-00043 20:45:21(003) <import,0> Fatal Error: Virtual data set.; output of "import": the record is too big to fit in a block; the length requested is: 1980449, the max block length is: 131072.

Error (3):
7.5.x
##F TFDR 000043 19:48:00(004) <peek(0),0> Fatal Error: File data set, file "{0}".; output of "peek(0)": the record is too big to fit in a block; the length requested is: 1980449.
8.x or later
##F IIS-DSEE-TFDR-00043 20:47:38(007) <peek(0),0> Fatal Error: File data set, file "dedup_src_clm_error_edit_logic_s.ds".; output of "peek(0)": the record is too big to fit in a block; the length requested is: 1980449, the max block length is: 131072.

Error (4):
7.5.x, 8.x or later - segmentation or bus error signals.


In addition to these messages, if the job contains a peek operator, the output of peek is either dropped or truncated (depending on the DataStage release level) for very large messages:
7.5.x
##W TFPM 000370 19:48:49(000) <peek(0),0> <peek(0)>: Error message too long (1979961 > 130975) and has no newlines at which to break; dropping. Possibly binary data was passed to an error reporting function?
8.x or later
##W IIS-DSEE-TFPM-00377 20:50:35(001) <peek(0),0> The message's length (1974723) exceeds the limit (65475) and it can't be broken at newlines; the message is truncated.

Resolving The Problem

The following environment variables must be set to handle large records:

NOTE: When importing files with very large string or ustring fields, it is necessary to set ALL of the following variables to values large enough to accommodate the largest records in the flow.

CAUTION: Because the parallel engine must reserve buffer space for all connections between operators based on the size of the largest record, setting these variables can cause a dramatic increase in the amount of virtual memory needed by a job. These variables should only be set if absolutely necessary for the correct functioning of the job. It is better to set them at the job level rather than the project level, unless all jobs in the project process similarly large records.

For Error (1), set APT_DELIMITED_READ_SIZE and APT_MAX_DELIMITED_READ_SIZE


(i) APT_DELIMITED_READ_SIZE:
This sets the initial size of the buffer the importer uses for an incoming data stream containing delimited records. The default is 500 bytes. If this environment variable is explicitly set (the minimum legal value is 2), the read size is incremented by factors of 2, up to the value of APT_MAX_DELIMITED_READ_SIZE. For example, if APT_DELIMITED_READ_SIZE is set to 1000 bytes, the read size grows as 1000, 2000, 4000, 8000 bytes, and so on.
If APT_DELIMITED_READ_SIZE is not set, the read size is incremented by factors of 4: 500, 2000, 8000, 32000, and so on.

(ii) APT_MAX_DELIMITED_READ_SIZE:
This controls how far ahead in an incoming data stream the importer will search for a record delimiter before concluding that the delimiter is not going to be found, and giving up on the import. The default value is 100,000 bytes.


Here is a simple scenario that explains how to determine the value of APT_MAX_DELIMITED_READ_SIZE:

Suppose the record size is 50,000 bytes, APT_DELIMITED_READ_SIZE is left at its default of 500 bytes, and APT_MAX_DELIMITED_READ_SIZE is set to 100,000 bytes.
The read size increases as follows:
500 bytes [no record delimiter found; continue]
500*4 => 2,000 bytes [no record delimiter found; continue]
2,000*4 => 8,000 bytes [no record delimiter found; continue]
8,000*4 => 32,000 bytes [no record delimiter found; continue]
32,000*4 => 128,000 bytes [ERROR is thrown here, as 128,000 > 100,000 exceeds the maximum delimited read size]

Although the record size is only 50,000 bytes, which is less than the 100,000-byte value of APT_MAX_DELIMITED_READ_SIZE, the error is thrown because of the way the read size increments.

So, in this case, set APT_MAX_DELIMITED_READ_SIZE to 128,000 bytes or higher.
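The growth arithmetic above can be sketched in a few lines of Python. This is an illustration of the multiply-by-a-factor rule described in this note, not DataStage code; the function name is made up for the example:

```python
def required_max_read_size(record_size, initial=500, factor=4):
    """Return the read sizes the importer would try for a delimited record,
    and the smallest APT_MAX_DELIMITED_READ_SIZE that accommodates it."""
    size = initial
    tried = [size]
    while size < record_size:
        size *= factor          # default growth: multiply the read size by 4
        tried.append(size)
    return tried, size

# 50,000-byte record, default 500-byte initial read size:
tried, needed = required_max_read_size(50_000)
# tried == [500, 2000, 8000, 32000, 128000]; needed == 128000
```

This reproduces the sequence in the scenario: even though the record is only 50,000 bytes, APT_MAX_DELIMITED_READ_SIZE must be at least 128,000 bytes.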


For Error (2), one or more of the following environment variables will help.


(i) APT_DEFAULT_TRANSPORT_BLOCK_SIZE:
Specifies the default block size in bytes for transferring data between players. The valid range for this variable is 8192 to 268435456 bytes. If necessary, the value provided by the user is rounded up to the operating system's nearest page size.

If APT_DEFAULT_TRANSPORT_BLOCK_SIZE is not set, then the internal default 131072 is used.
This variable is provided as part of the functionality for processing records greater than 128K.

In general, to process large records, increase the transport block size to a value larger than the largest record you expect to process. Set APT_DEFAULT_TRANSPORT_BLOCK_SIZE at the job level to increase the transport block size.
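As a sketch, choosing a job-level value for APT_DEFAULT_TRANSPORT_BLOCK_SIZE from the largest expected record might look like the following. The 4096-byte page size and the helper function are assumptions for illustration; the valid range and page-rounding behavior come from the description above:

```python
PAGE_SIZE = 4096           # assumed OS page size (illustrative)
MIN_BLOCK = 8192           # documented minimum block size
MAX_BLOCK = 268_435_456    # documented maximum block size (256 MB)

def transport_block_size(largest_record):
    """Suggest a value for APT_DEFAULT_TRANSPORT_BLOCK_SIZE: at least as
    large as the biggest record, rounded up to a page boundary, and kept
    inside the documented valid range."""
    size = max(largest_record, MIN_BLOCK)
    size = -(-size // PAGE_SIZE) * PAGE_SIZE   # round up to a page multiple
    return min(size, MAX_BLOCK)

# The record length from Error (2) above:
transport_block_size(1_980_449)   # -> 1982464
```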

However, if you are processing fixed-length records you can use APT_AUTO_TRANSPORT_BLOCK_SIZE; by default you will get a block that is APT_LATENCY_COEFFICIENT (default 5) times the record size. This value must fall between APT_MIN_TRANSPORT_BLOCK_SIZE and APT_MAX_TRANSPORT_BLOCK_SIZE.

(ii) APT_AUTO_TRANSPORT_BLOCK_SIZE:
If set, the framework calculates the block size for transferring data between players according to this algorithm:

    if (recordSize * APT_LATENCY_COEFFICIENT < APT_MIN_TRANSPORT_BLOCK_SIZE)
        blockSize = minAllowedBlockSize
    else if (recordSize * APT_LATENCY_COEFFICIENT > APT_MAX_TRANSPORT_BLOCK_SIZE)
        blockSize = maxAllowedBlockSize
    else
        blockSize = recordSize * APT_LATENCY_COEFFICIENT
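Under the stated defaults (coefficient 5, minimum 8192, maximum 1048576), the algorithm can be expressed as a small Python function; this is illustrative only, not DataStage code:

```python
def auto_block_size(record_size, coefficient=5,
                    min_block=8192, max_block=1_048_576):
    """Block size chosen when APT_AUTO_TRANSPORT_BLOCK_SIZE is set
    (fixed-length records): recordSize * coefficient, clamped to the
    min/max transport block sizes."""
    return max(min_block, min(record_size * coefficient, max_block))

auto_block_size(10_000)    # -> 50000   (5 * 10000, within the allowed range)
auto_block_size(500)       # -> 8192    (clamped to the minimum block size)
auto_block_size(300_000)   # -> 1048576 (clamped to the maximum block size)
```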

(iii) APT_LATENCY_COEFFICIENT:
Specifies the number of writes to a block used for transferring data between players. This variable lets you control the latency of data flow through a step. The default value is 5. Specify a value of 0 to have each record transported immediately.
Note: Many operators have built-in latency, in which case setting this variable will not affect the latency of those operators.

(iv) APT_MIN_TRANSPORT_BLOCK_SIZE and APT_MAX_TRANSPORT_BLOCK_SIZE:
Specify the minimum and maximum allowable block sizes for transferring data between players. Defaults are 8192 and 1048576, respectively. APT_MIN_TRANSPORT_BLOCK_SIZE cannot be less than 8192; APT_MAX_TRANSPORT_BLOCK_SIZE cannot be greater than 1048576. These variables are only meaningful when used in combination with APT_LATENCY_COEFFICIENT and APT_AUTO_TRANSPORT_BLOCK_SIZE.

Note: The variables APT_MIN/MAX_TRANSPORT_BLOCK_SIZE, APT_LATENCY_COEFFICIENT and APT_AUTO_TRANSPORT_BLOCK_SIZE are used only for fixed-length records.



For Error (3), set APT_PHYSICAL_DATASET_BLOCK_SIZE

APT_PHYSICAL_DATASET_BLOCK_SIZE:
This controls the size of the disk blocks written for persistent data sets, that is, the block size used for reading and writing a Data Set stage. The default is 128 KB (131072 bytes).



For Error (4), set APT_TSORT_STRESS_BLOCKSIZE

APT_TSORT_STRESS_BLOCKSIZE:
This controls the size of the sort buffer allocated by tsort. Because this buffer is split into two halves to achieve I/O overlap, this should be set to about 2.5 times the value of the largest expected record. This means that it will have to be set to about 2.5 times APT_MAX_TRANSPORT_BLOCK_SIZE.
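As a rough sizing aid, the 2.5x rule of thumb above can be computed directly. The units follow whatever unit the record size is expressed in (the text does not state one), and the function is illustrative only:

```python
import math

def tsort_stress_blocksize(largest_record):
    """Suggested APT_TSORT_STRESS_BLOCKSIZE: about 2.5 times the largest
    expected record, because the sort buffer is split into two halves to
    achieve I/O overlap."""
    return math.ceil(largest_record * 2.5)

# The record length from the errors above:
tsort_stress_blocksize(1_980_449)   # -> 4951123
```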
Also set APT_NO_JOBMON.

APT_NO_JOBMON:
This environment variable is concerned with the Job Monitor on InfoSphere® DataStage®. Setting it turns off job monitoring entirely.

    [{"Product":{"code":"SSVSEF","label":"IBM InfoSphere DataStage"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF010","label":"HP-UX"},{"code":"PF016","label":"Linux"},{"code":"PF027","label":"Solaris"},{"code":"PF033","label":"Windows"}],"Version":"9.1.2.0;9.1.0.1;9.1;8.7.0.2;8.7.0.1;8.7;8.5.0.3;8.5.0.2;8.5.0.1;8.5;8.2.0.1;8.2.0;8.1.0.2;8.1.0.1;8.1;8.0.2;8.0.1.3;8.0.1.2;8.0.1.1;8.0.1;8.0;7.5.3;7.5.2;7.5.1;11.3.1.0;11.3","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
16 June 2018

UID

swg21660153