IBM Support

Troubleshooting a Hung Process or Command on PowerVM Virtual I/O Server

Troubleshooting


Problem

This technote describes how to troubleshoot a hung process or command on PowerVM Virtual I/O Server (VIOS) before resorting to a forced system dump.

Environment

This applies to PowerVM Virtual I/O Server version 2.2.

Diagnosing The Problem

See NOTE 1 in step 3 to determine whether the command in question is actually hung or merely delayed.

Resolving The Problem

1. Download pdump.sh from ftp://ftp.software.ibm.com/aix/tools/debug/ and transfer it to the VIOS with ftp in binary mode as padmin (by default, you will be placed in the /home/padmin directory).

2. Log in to the VIOS as padmin and change the file permissions so the script is executable.
$ chmod 755 pdump.sh
3. Go to the root shell and find the process ID (PID) for the hung process or command.
$ oem_setup_env
# ps -ef |grep <hung_command> =>Get the PID. It is the number after the user name

In the following example, the padmin snap command is hung, and its PID is 11993246:
# ps -ef|grep snap
root 8060958 8585354 0 13:30:17 pts/2 0:00 grep snap
padmin 9830500 11993246 0 13:30:12 pts/3 0:00 /bin/ksh /usr/sbin/snap -r
padmin 11993246 9109512 0 13:30:12 pts/3 0:00 ioscli snap
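
As a rough sketch, the PID column can also be pulled out of the ps -ef listing without reading it by eye. The helper name pids_matching is hypothetical (it is not a VIOS or AIX command), and it assumes the standard ps -ef layout in which the PID is the second field:

```shell
# Print the PID (second field of `ps -ef` output) for every line that
# mentions the given string, skipping the grep/awk helper processes.
# pids_matching is a hypothetical helper name, not a VIOS command.
# Caveat: any command whose line contains "grep" or "awk" is skipped too.
pids_matching() {
    awk -v pat="$1" 'index($0, pat) && $0 !~ /grep|awk/ { print $2 }'
}

# Usage from the root shell (after oem_setup_env):
#   ps -ef | pids_matching snap
```

If more than one PID is printed, compare the parent PID column in the full ps -ef output to work out which process is the parent and which is the child.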
 

  • NOTE 1:  Sometimes a command is mistakenly considered hung when in reality it is just taking some time to complete.  This may be seen on a VIOS with a large configuration if memory is over-utilized.  Run the proctree command to determine whether the "hung" PID spawned any child processes.  If so, get the PID of the youngest child process (the last one in the tree).  In the following example, it is 7274712.

# proctree 11993246
2228366 /usr/sbin/srcmstr
9437368 /usr/sbin/inetd
10092564 telnetd -a
9109512 -rksh
11993246 ioscli snap
9830502 /bin/ksh /usr/sbin/snap -a -c
8978460 /bin/sh /usr/lib/ras/snapscripts/svCollect all
8061118 /bin/sh /usr/lib/ras/snapscripts/svCollect all
7274712 kdb -script
#

Wait a minute or so, then re-run the command.  Repeat that a few times to see whether the youngest child process (7274712, in this case) changes.
If it does not change, proceed to step 4.
If it does change, the command is more likely delayed than hung.
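
The repeated check described above could be scripted along these lines. This is only a sketch: the helper names last_tree_pid and sample_youngest are hypothetical, the default of three samples taken 60 seconds apart is an assumption based on "wait a minute or so", and the last line of the proctree output is taken to be the youngest child, as in the example above.

```shell
# Read proctree output on stdin and print the PID on the last non-empty line.
# last_tree_pid is a hypothetical helper name.
last_tree_pid() {
    awk 'NF { pid = $1 } END { if (pid != "") print pid }'
}

# Sample the youngest child of PID $1, $2 times (default 3), $3 seconds
# apart (default 60). One PID is printed per sample.
# sample_youngest is a hypothetical helper name.
sample_youngest() {
    pid=$1
    count=${2:-3}
    interval=${3:-60}
    i=0
    while [ "$i" -lt "$count" ]; do
        proctree "$pid" | last_tree_pid
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Usage from the root shell:
#   sample_youngest 11993246
```

If every sample prints the same PID, the command is probably hung. If the PID changes between samples, the command is still making progress.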

padmin commands, such as snap and backupios, are known to be delayed when the VIOS has insufficient memory resources.  Determine whether the amount of memory on the VIOS is adequate by running the VIOS Performance Advisor tool.  The tool generates a *.tar file containing the vios_advisor.xml report, which can be viewed in a browser.  If memory resources are over-utilized, the report includes a VIOS Memory Recommended Value.  If a new Recommended Value is generated, examine the report and make the necessary change before rerunning the command.  The Recommended Value is based on the VIOS workload at the time the performance data is collected, so the data must be captured while the problem is occurring.

4. If the command is indeed hung, run the pdump.sh tool against the last child process ID listed at the bottom of the proctree output (7274712, in this example).

# ./pdump.sh -d <last child PID>
This creates the output file pdump.<hung command>.<PID>.<date>.out in the current working directory.

Example:

# ./pdump.sh -d 7274712

Getting general environment data ...
Dumping process information from kdb ...

dumping process slot 2928 ...
Error getting thread list. Skip other kdb commands.

Dumping process information with proc tools ...

Dumping process information from dbx ...

dumping tid 1 ...
listing object files ...

Done.
Output file is pdump.ioscli.7274712.11Oct2018-14.24.54.out

# ls -la pdump.ioscli.7274712.11Oct2018-14.24.54.out
-rw-r--r-- 1 root staff 85269 Oct 11 14:25 pdump.ioscli.7274712.11Oct2018-14.24.54.out
#

5. Rename the file to reflect your Support Case ID and send the testcase. Example:

# mv <original_filename>.out TS<xxxxxxxxx>.<VIO_hostname>.<original_filename>.out
-rw-r--r-- 1 root staff 85269 Oct 11 14:25 TS123456789.VIOS1.pdump.ioscli.7274712.11Oct2018-14.24.54.out
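
To avoid typing the long file name twice, the new name can be assembled from its three parts. This is a minimal sketch; case_filename is a hypothetical helper name, and the case ID and host name shown are simply the values from the example above:

```shell
# Build the testcase file name: <case ID>.<VIOS hostname>.<original file name>.
# case_filename is a hypothetical helper name, not a VIOS command.
case_filename() {
    printf '%s.%s.%s\n' "$1" "$2" "$3"
}

# Usage (values taken from the example above):
#   f=pdump.ioscli.7274712.11Oct2018-14.24.54.out
#   mv "$f" "$(case_filename TS123456789 VIOS1 "$f")"
```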

6. Where to send the testcase.

Document information

More support for: Virtual I/O Server

Component: --

Software version: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.2.4

Operating system(s): AIX

Software edition: Enterprise, Express, Standard

Reference #: T1012503

Modified date: 15 October 2018

