IBM Support

How to diagnose a hanging DataStage Parallel job on AIX or Linux

Question & Answer


Question

How can I diagnose a hanging DataStage parallel job on AIX or Linux?

Answer

Determine whether the parallel job is hung:
  • If a job is hung, DataStage Director shows the job as running, but the job monitor shows no progress.
  • Check for processes at the operating system level:
    • ps -ef | grep DSD
    • The DSD.RUN process is the first process started, and it starts the other related processes. The DSD.OshMonitor process collects row count information.
    • After locating the DSD.RUN process, look for osh processes. If multiple parallel jobs are running, it is difficult to tell which osh processes belong to the hung job unless APT_PM_SHOW_PIDS is set.
    • If there are no processes, or if DSD.RUN is the only process, the job most likely aborted without updating its status. Clear the status file from Director to update the status.
    • If the job aborted, look for a core file in the project directory.
  • Before the next hang, create a user-defined environment variable APT_NO_PM_SIGNAL_HANDLERS and set its default value to 1 at the job or project level. A hang is often caused by a database client that core dumps: the database operators or connectors then wait indefinitely for a response that the client will never send, so the job hangs. Setting APT_NO_PM_SIGNAL_HANDLERS allows the Unix/Linux system to terminate all the processes associated with the core dump and to generate a core file.
  • So that setting APT_NO_PM_SIGNAL_HANDLERS can produce a core file, ensure that the system permits core files to be created, e.g. set ulimit -c unlimited. See the following technotes for additional information:
    For AIX, How to get a stack trace for failing processes in a DataStage Parallel Job, AIX platform
    For Linux, How to get a stack trace for failing processes in a DataStage Parallel Job, Linux platform
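The process check and the core file limit check described above can be sketched as two small shell helpers. This is an illustration only: the function names are hypothetical and not part of DataStage, the osh match pattern is an assumption about how osh appears in ps output, and the classification assumes the ps text covers a single job (with several jobs running, narrow it down first, for example by using the pids reported when APT_PM_SHOW_PIDS is set).

```shell
#!/bin/sh
# Sketch only: helper names below are hypothetical, not part of DataStage.

# Classify a job from captured `ps -ef` text (passed as $1).
# Prints one of: no-processes | dsd-run-only | osh-present
classify_job_state() {
    ps_text=$1
    # Count DSD.RUN lines and osh lines in the captured text.
    # The osh pattern is an assumption: "osh" as a command name or path tail.
    dsd_count=$(printf '%s\n' "$ps_text" | grep -c 'DSD\.RUN')
    osh_count=$(printf '%s\n' "$ps_text" | grep -c -E '(^|[ /])osh( |$)')
    if [ "$dsd_count" -eq 0 ] && [ "$osh_count" -eq 0 ]; then
        echo "no-processes"   # job most likely aborted without updating status
    elif [ "$osh_count" -eq 0 ]; then
        echo "dsd-run-only"   # only DSD.RUN remains; same likely conclusion
    else
        echo "osh-present"    # osh processes exist; the job may be hung
    fi
}

# Succeed if a `ulimit -c` value ($1) permits core files (anything but 0).
core_files_permitted() {
    [ "$1" != "0" ]
}
```

A possible use is `classify_job_state "$(ps -ef)"` followed by `core_files_permitted "$(ulimit -c)" || echo "core files are disabled in this shell"`. Note that the limit that matters is the one inherited by the job processes, which may differ from the limit of an interactive shell.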
If setting APT_NO_PM_SIGNAL_HANDLERS does not cause the job to abort and create a core file, complete the following steps before the next hang:
  • Set the following environment variables at the job or project level:
    APT_PM_SHOW_PIDS=True
    APT_DUMP_SCORE=True
  • On non-production environments, or on production environments where you can compile a job, create a user-defined environment variable:
    • DS_PXDEBUG
    • Leave the default value blank at the project level
    • Set the default value to 1 at the job level
  • If the hang involves a database connector, set CC_MSG_LEVEL:
    • Leave the default value blank at the project level
    • Set to 1 or 2 at the job level
    • Setting CC_MSG_LEVEL to 1 writes trace information to the log for each record, so if the job processes many records, set CC_MSG_LEVEL to 2 instead
Gather the following information the next time a hang occurs:
  1. Capture and send the output of 'ps -eaf'.
  2. Send the job log of the hanging job with full details. The detailed job log and the stack traces collected in item 5 below must be from the same job run.
  3. Send an export of the job design.
  4. If DS_PXDEBUG is set, compress and send the Debugging/<job_name> directory found under the project directory.
  5. Collect a stack trace for each pid seen. The pids are displayed in the job log because APT_PM_SHOW_PIDS is set. Save each stack trace to a file with the pid as part of the file name. Use either pstack/procstack or a debugger (dbx or gdb) to get the stack trace.
    • Example of using pstack/procstack (pstack on Linux, procstack on AIX):
      pstack <pid>
      procstack <pid>

      $ pstack 14059
      #0 0xffffe402 in __kernel_vsyscall ()
      #1 0x00b1edf3 in __read_nocancel () from /lib/libpthread.so.0
      #2 0x0810bd0a in api_pipe_read ()
      #3 0x08100927 in main ()
       
    • Here are examples using the dbx and gdb debuggers to attach to a running process:
      • Set environment variables:
        - APT_ORCHHOME
          Default is /opt/IBM/InformationServer/Server/PXEngine
        - APT_CONFIG_FILE
          Set to the configuration file listed in the job log
        - PATH=$APT_ORCHHOME/bin:$APT_ORCHHOME/osh_wrappers:$PATH
        - Set the library path:
          On AIX:
          LIBPATH=$APT_ORCHHOME/lib:.:/usr/lib:/lib:$LIBPATH
          On Linux or Solaris:
          LD_LIBRARY_PATH=$APT_ORCHHOME/lib:.:/usr/lib:/lib:$LD_LIBRARY_PATH

      • Using dbx, for each pid:
        dbx -a <pid>

        At the "dbx" prompt type:
        <dbx> where > dbx_<pid>.out
        <dbx> detach
        (detach releases the process without killing the job)
      • Using gdb, for each pid:
        gdb -p <pid> -ex "thread apply all bt" -ex "detach" -ex "quit" > gdb_<pid>.out
  6. Send the ISALite Basic System Summary. This does not need to be collected during the hang. For additional information on downloading and running the tool, see the following technote: Download the ISALite for Information Server tool
  7. Send all files to IBM Support.
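For item 5, when many pids are involved, the pstack/procstack collection could be scripted along the following lines. This is a minimal sketch under stated assumptions: the function names and the output directory are hypothetical, the pids must be supplied by hand from the job log, and only the two platforms covered above (pstack on Linux, procstack on AIX) are handled.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of DataStage.

# Map an OS name (as printed by `uname -s`) to the stack trace tool
# named in this technote. Fails for platforms not covered here.
stack_tool_for_os() {
    case "$1" in
        Linux) echo "pstack" ;;
        AIX)   echo "procstack" ;;
        *)     return 1 ;;
    esac
}

# Build a file name that carries the pid, e.g. pstack_14059.out.
trace_file_for_pid() {
    # $1 = tool name, $2 = pid
    echo "${1}_${2}.out"
}

# Collect one stack trace file per pid.
# $1 = output directory, remaining arguments = pids from the job log.
collect_stack_traces() {
    out_dir=$1
    shift
    tool=$(stack_tool_for_os "$(uname -s)") || {
        echo "No stack trace tool known for this platform" >&2
        return 1
    }
    mkdir -p "$out_dir" || return 1
    for pid in "$@"; do
        # Keep error text in the same file so a failed attach is visible.
        "$tool" "$pid" > "$out_dir/$(trace_file_for_pid "$tool" "$pid")" 2>&1
    done
}
```

For example, `collect_stack_traces ./hang_traces 14059 14060` would write one file per pid into ./hang_traces, which can then be compressed and sent with the other files. Run it as a user with permission to inspect the job processes.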

Document Information

Modified date:
12 March 2021

UID

swg21640682