How to diagnose a hanging DataStage Parallel job on Unix/Linux

Technote (FAQ)


Question

How can I diagnose a hanging DataStage parallel job on Unix/Linux?

Answer

Complete the following steps before the next hang:

  • Set the following environment variables at the job or project level:
    APT_PM_SHOW_PIDS=True
    APT_DUMP_SCORE=True
  • In non-production environments, or in production environments where you can compile a job, create a user-defined environment variable:
    • DS_PXDEBUG
    • Leave the default value blank at the project level
    • Set the default value to 1 at the job level
  • Create a user-defined environment variable APT_NO_PM_SIGNAL_HANDLERS and set the default value to 1 at the project level. A hang is often caused by a database client that core dumps. When this occurs, the database operators/connectors often wait indefinitely for a response that the client will never send, so the job hangs. Setting APT_NO_PM_SIGNAL_HANDLERS allows the Unix/Linux system to terminate all the processes associated with the core dump and generate a core file.

    For a core file to be generated, the system must permit core files to be created, e.g. set ulimit -c unlimited. See the following technotes for additional information:
    For AIX: How to get a stack trace for failing processes in a DataStage Parallel Job, AIX platform
    For Linux: How to get a stack trace for failing processes in a DataStage Parallel Job, Linux platforms
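
The core file limit mentioned above can be checked before the next run. The following is a minimal sketch, not an IBM-supplied tool: the function name core_limit_status is hypothetical, and it reports only the limit of the shell in which it runs. The limit that matters is the one in effect for the processes that run the job, which may differ from your interactive shell.

```shell
#!/bin/sh
# Sketch: report whether the current shell's core file size limit
# would permit a core file to be written.
# core_limit_status is a hypothetical helper name, not part of DataStage.
core_limit_status() {
    limit=$(ulimit -c)
    case "$limit" in
        unlimited) echo "ok: core file size is unlimited" ;;
        0)         echo "blocked: core file size is 0; run 'ulimit -c unlimited'" ;;
        *)         echo "limited: core file size is capped at $limit" ;;
    esac
}

core_limit_status
```

If the result is "blocked" or "limited", raise the limit in the environment that starts the job and run the check again from that same environment.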
If setting APT_NO_PM_SIGNAL_HANDLERS does not result in the job aborting and creating a core file, gather the following information the next time a hang occurs:

  1. Capture and send the output of 'ps -eaf'.
  2. Send the job log of the hanging job with full details. The detailed job log and the stack traces collected in step 5 below must come from the same job run.
  3. Send an export of the job design.
  4. If DS_PXDEBUG is set, tar and send the Debugging/<job_name> directory found under the project directory.
  5. Collect a stack trace for each pid. The pids are displayed in the job log because APT_PM_SHOW_PIDS is set. Save each stack trace to a file with the pid as part of the file name. Use either pstack/procstack or a debugger (dbx or gdb) to get the stack trace.
    • Example of using pstack/procstack (pstack on Solaris/Linux, procstack on AIX):
      pstack <pid>
      procstack <pid>

      $ pstack 14059
      #0 0xffffe402 in __kernel_vsyscall ()
      #1 0x00b1edf3 in __read_nocancel () from /lib/libpthread.so.0
      #2 0x0810bd0a in api_pipe_read ()
      #3 0x08100927 in main ()

    • Examples of using the dbx and gdb debuggers to attach to a running process:
      • Set environment variables:
        - APT_ORCHHOME
          Default is /opt/IBM/InformationServer/Server/PXEngine
        - APT_CONFIG_FILE
          Set to the configuration file listed in the job log
        - PATH=$APT_ORCHHOME/bin:$APT_ORCHHOME/osh_wrappers:$PATH
        - Set the library path:
          On AIX:
          LIBPATH=$APT_ORCHHOME/lib:.:/usr/lib:/lib:$LIBPATH
          On Linux/Solaris:
          LD_LIBRARY_PATH=$APT_ORCHHOME/lib:.:/usr/lib:/lib:$LD_LIBRARY_PATH

      • dbx -a <pid>
        At the dbx prompt, type:
        (dbx) where > dbx_<pid>.out
        (dbx) detach                     * detaches without killing the job

      • gdb -p <pid>
        At the gdb prompt, type:
        set logging file gdb_<pid>.out   * specifies the file for output
        set logging on
        thread
        where                            * displays the back trace
        detach                           * detaches without killing the job
        quit

  6. Send the ISALite Basic System Summary. This does not need to be collected during the hang. For information on downloading and running the tool, see the technote "Download the ISALite for Information Server tool".
  7. Send all files to IBM Support.
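
Steps 1 and 5 can be scripted so that every pid gets its own output file during the hang. The following is a rough sketch under stated assumptions, not an IBM-supplied script: the function names, the stack_<pid>.out naming pattern, and the hang_<timestamp> output directory are all hypothetical choices. It assumes pstack or procstack is on the PATH and that the pids are copied by hand from the job log.

```shell
#!/bin/sh
# Sketch: save a 'ps -eaf' snapshot plus one stack trace file per pid.
# All function and file names below are illustrative assumptions.

# Pick whichever stack tool is installed:
# pstack (Solaris/Linux) or procstack (AIX).
find_stack_tool() {
    for tool in pstack procstack; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Build the output file name so that the pid is part of it.
stack_file_name() {
    echo "stack_$1.out"
}

# collect_stacks <output_dir> <pid> [<pid> ...]
collect_stacks() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    ps -eaf > "$outdir/ps_eaf.out"
    tool=$(find_stack_tool) || {
        echo "neither pstack nor procstack found; use dbx or gdb instead" >&2
        return 1
    }
    for pid in "$@"; do
        "$tool" "$pid" > "$outdir/$(stack_file_name "$pid")" 2>&1
    done
}

# Run only when pids are given on the command line.
if [ "$#" -gt 0 ]; then
    collect_stacks "hang_$(date +%Y%m%d_%H%M%S)" "$@"
fi
```

Saved under a name of your choosing, it could be run as, for example, sh collect_stacks.sh 14059 14060, after which the output directory can be sent along with the other files. Run it as a user with permission to inspect the job's processes.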

Document information


More support for:

InfoSphere DataStage

Software version:

8.0.1, 8.1, 8.5, 8.7, 9.1

Operating system(s):

AIX, HP-UX, Linux, Solaris

Reference #:

1640682

Modified date:

2014-01-09
