How can I generate stack traces for parallel jobs in DataStage 9.1?
Version 9.1 of DataStage adds a facility that generates stack traces and captures other valuable diagnostic information for parallel jobs.
The following user-defined environment variables control this feature:
APT_DUMP_STACK - Setting this to 1 enables a basic stack trace dump.
APT_DUMP_STACK_DIRECTORY - When set to a valid path, the dump files are created in the specified directory. If it is undefined or not set to a valid path, the dump files default to /tmp on Unix/Linux and %TEMP% on Windows.
Please note that the specified directory needs to exist on all systems if the parallel engine is used in an MPP or cluster configuration.
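As a minimal sketch of the directory requirement above, the following shell function checks that a candidate dump directory exists and is writable on the local system. The function name check_dump_dir is hypothetical and not part of DataStage; in an MPP or cluster configuration you would need to run an equivalent check on every node yourself.

```shell
# Hypothetical helper (not a DataStage command): verify that a directory
# intended for APT_DUMP_STACK_DIRECTORY exists and is writable locally.
check_dump_dir() {
    dump_dir="$1"
    if [ -d "$dump_dir" ] && [ -w "$dump_dir" ]; then
        echo "ok: $dump_dir"
    else
        echo "missing or not writable: $dump_dir" >&2
        return 1
    fi
}

# Example call (run once per node in an MPP or cluster configuration):
#   check_dump_dir /tmp
```

The variables themselves are still defined as user-defined environment variables for the job or project; the check only confirms that the path they point to is usable.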
After APT_DUMP_STACK is set, the parallel framework invokes the feature automatically when an unrecoverable exception occurs, such as a segmentation fault. Not all errors generate a signal that causes a stack trace.
Note: This applies to parallel jobs only. It does not apply to server or sequence jobs.
If the job is successful, no dump is created, so you can leave the variable set in order to capture a dump for an intermittent issue.
The files created will be named: px_engine_dump_YYYY_MM_DD_HH_MM_SS_PID
For example: px_engine_dump_2013_06_07_16_07_16_3228
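Because the PID is the last underscore-separated field of the file name, it can be recovered with plain shell parameter expansion. The sketch below assumes the naming pattern shown above; the function names dump_pid and list_dumps are hypothetical helpers, not DataStage tools.

```shell
# Hypothetical helper: print the PID encoded at the end of a dump file name.
dump_pid() {
    dump_name=$(basename "$1")
    case "$dump_name" in
        px_engine_dump_*_*) printf '%s\n' "${dump_name##*_}" ;;  # strip up to the last "_"
        *) return 1 ;;                                           # not a dump file name
    esac
}

# Hypothetical helper: list dump files in a directory, newest first.
# Defaults to /tmp, the Unix/Linux default location described above.
list_dumps() {
    ls -1t "${1:-/tmp}"/px_engine_dump_* 2>/dev/null
}
```

Matching the PID from the file name against the PIDs recorded in the job log is one way to tell which process produced a given dump.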
This is available on Unix/Linux and Windows. On Windows it provides information that was not previously available, since Windows has no core file from which to get a stack trace. On Unix/Linux it does not rely on a debugger being installed.
To get a trace on demand, by sending SIGABRT to a job that is deadlocked or hung, set the following additional environment variable:
- When APT_DUMP_STACK_PERIOD is set along with APT_DUMP_STACK, you can get a trace by sending a SIGABRT to a process without aborting the process or job.
- If APT_DUMP_STACK is not enabled then the handler will generate a stack trace and abort the process/job.
- When the job encounters a deadlock or hang, you need to identify the process that is hung and send it a SIGABRT. Having the PIDs and the dump score in the job log can help with this.
- Once you have identified the process, send the SIGABRT using:
kill -s ABRT <pid>
Note: APT_DUMP_STACK_PERIOD needs to be defined as a user-defined environment variable.
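The signaling step above can be wrapped in a small guard so that a mistyped PID is rejected before any signal is sent. This is only a sketch: request_stack_dump is a hypothetical name, and it uses the POSIX signal name ABRT, which standard shells accept for SIGABRT.

```shell
# Hypothetical helper: send SIGABRT to one process after basic sanity checks.
request_stack_dump() {
    target_pid="$1"
    # Reject empty or non-numeric arguments.
    case "$target_pid" in
        ''|*[!0-9]*) echo "usage: request_stack_dump <pid>" >&2; return 2 ;;
    esac
    # kill -0 sends no signal; it only tests that the process exists
    # and that we are permitted to signal it.
    if ! kill -0 "$target_pid" 2>/dev/null; then
        echo "no such process or not permitted: $target_pid" >&2
        return 1
    fi
    kill -s ABRT "$target_pid"
}
```

Whether the signaled process continues or aborts after writing its trace depends on the environment variable settings described above, so check them before signaling a job you want to keep running.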