IBM Support

Troubleshooting a Hung Process or Command on PowerVM Virtual I/O Server

Troubleshooting


Problem

This technote describes the minimum data needed to begin diagnosing a hung process or command on PowerVM Virtual I/O Server.

Environment

This document applies to PowerVM Virtual I/O Server version 3.1.x

Diagnosing The Problem

See NOTE 1 in step 3 to determine whether the command in question is hung or merely delayed.

Resolving The Problem

1. Download pdump.sh and transfer it to the VIOS as padmin using sftp (binary). By default, you are put in padmin's home directory, /home/padmin. This script captures KDB data for the hung process ID. In some instances, this data is enough to diagnose the root cause. If it does not reveal the root cause, an OS system dump may be needed for further investigation.

If you prefer to force an OS system dump now, skip steps 2-5.  Keep in mind this task may impact the VIO clients being served by the VIOS if the clients' network and disk I/O are not redundant through a second VIOS. Therefore, ensure the clients are redundant through a second VIOS before proceeding.

  • IMPORTANT: If the VIOS in question is part of a Shared Storage Pool (SSP) cluster, contact your local IBM SupportLine Representative to discuss. Do NOT force a dump.
  • To force a system dump on the VIOS, refer to technote How to force a system dump > Section LPAR using HMC > HMC GUI. 
  • After forcing the system memory dump, check that the VIOS partition boots up to the padmin login prompt, and verify that the 'sysdumpdev -L' output reports the dump completed successfully. If there is a good dump (Dump status: 0), proceed to capture VIOS snap data. Then contact your local IBM SupportLine Representative to have the dump analyzed.
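The dump check in the last bullet can be scripted. The following is a minimal sketch, assuming the 'sysdumpdev -L' output contains a line of the form "Dump status: 0" as described above. The function name dump_status_ok is hypothetical and is not part of any IBM tooling.

```shell
# Hypothetical helper: reads `sysdumpdev -L` output on stdin and succeeds
# (exit status 0) only if a "Dump status" line reports 0.
dump_status_ok() {
    awk -F: '
        /^Dump status/ { gsub(/[ \t]/, "", $2); status = $2 }
        END { if (status == "0") exit 0; exit 1 }
    '
}

# Usage from the root shell (after oem_setup_env):
#   sysdumpdev -L 2>&1 | dump_status_ok && echo "good dump"
```

The helper only inspects the status line; review the full 'sysdumpdev -L' output before sending the dump for analysis.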

2. Log in to the VIOS as padmin and change permissions.
$ chmod 755 pdump.sh

3. Go to the root shell and find the process ID (PID) for the hung process or command.
$ oem_setup_env
# ps -ef |grep <hung_command> =>Get the PID. It is the number after the user name

In the following example, the snap command is hung, and its PID is 11993246:
# ps -ef|grep snap
root 8060958 8585354 0 13:30:17 pts/2 0:00 grep snap
padmin 9830500 11993246 0 13:30:12 pts/3 0:00 /bin/ksh /usr/sbin/snap -r
padmin 11993246 9109512 0 13:30:12 pts/3 0:00 ioscli snap
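To avoid picking the grep process by mistake, the listing can be filtered. The following is a minimal sketch, assuming the standard 'ps -ef' column order (UID, PID, PPID, ...) shown above. The function name match_pids is hypothetical.

```shell
# Hypothetical helper: reads `ps -ef` output on stdin and prints
# "PID PPID" for each line containing the given string, skipping
# the grep process itself.
match_pids() {
    awk -v pat="$1" 'index($0, pat) && $0 !~ /grep/ { print $2, $3 }'
}

# Usage:
#   ps -ef | match_pids snap
```

For the example listing, this prints the pairs "9830500 11993246" and "11993246 9109512". The process whose parent is not itself a matching process (11993246 here) is the command that was started from the padmin shell.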

 

  • NOTE 1:  Sometimes a command is mistakenly considered hung when in reality it is taking some time to complete.  This can happen on a VIOS with a large configuration if memory is over-utilized.  Run the proctree command to determine whether the "hung" PID created any child processes.  If so, get the PID of the youngest child process (the last one in the tree).  In the following example, it is 7274712.

# proctree 11993246
2228366 /usr/sbin/srcmstr
9437368 /usr/sbin/inetd
10092564 telnetd -a
9109512 -rksh
11993246 ioscli snap
9830502 /bin/ksh /usr/sbin/snap -a -c
8978460 /bin/sh /usr/lib/ras/snapscripts/svCollect all
8061118 /bin/sh /usr/lib/ras/snapscripts/svCollect all
7274712 kdb -script
#


Wait a minute or so, then rerun the command.  Repeat this a few times to see whether the youngest child process (7274712, in this case) changes.
If it does not change, proceed to step 4.
If it does change, the command is more likely delayed than hung.
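The repeated proctree check can be automated. The following is a minimal sketch for a POSIX-style shell, assuming proctree prints one process per line with the PID first, as in the example above. The function names youngest_pid and watch_youngest are hypothetical.

```shell
# Hypothetical helper: reads proctree output on stdin and prints the PID
# from the last line that starts with a number (the youngest child).
youngest_pid() {
    awk '$1 ~ /^[0-9]+$/ { pid = $1 } END { if (pid != "") print pid }'
}

# Hypothetical helper: samples proctree for PID $1 a total of $2 times,
# $3 seconds apart. Returns 0 if the youngest child never changed,
# 1 as soon as it changes.
watch_youngest() {
    first=$(proctree "$1" | youngest_pid)
    i=1
    while [ "$i" -lt "$2" ]; do
        sleep "$3"
        now=$(proctree "$1" | youngest_pid)
        if [ "$now" != "$first" ]; then
            echo "youngest child changed: $first -> $now (likely delayed, not hung)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "youngest child unchanged: $first (possibly hung)"
}

# Usage: sample the tree of PID 11993246 four times, 60 seconds apart.
#   watch_youngest 11993246 4 60
```

An unchanged youngest child over several samples is only an indication of a hang; it is the same manual judgment described above, made repeatable.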

padmin commands, such as snap and backupios, can be delayed when the VIOS has insufficient memory resources.  Determine whether the amount of memory on the VIOS is adequate by running the VIOS Performance Advisor tool.  The Performance Advisor tool generates a *.tar file containing the vios_advisor.xml report, which can be viewed in a browser.  If the memory resources are over-utilized, the report generates a VIOS Memory Recommended Value.  Examine the report and, if a new Recommended Value is generated, make the necessary change before rerunning the command.  The Recommended Value is based on the VIOS workload at the time the performance data is collected.  Therefore, the data must be captured while the problem is occurring.

4. If the command is indeed hung, run the pdump.sh tool against the last child process ID listed at the bottom of the proctree output (7274712, in this example).

# ./pdump.sh <last child PID>
This creates output file pdump.<hung command>.<PID>.<date>.out in the current working directory

Example:

# ./pdump.sh -d 7274712

Getting general environment data ...
Dumping process information from kdb ...

dumping process slot 2928 ...
Error getting thread list. Skip other kdb commands.

Dumping process information with proc tools ...

Dumping process information from dbx ...

dumping tid 1 ...
listing object files ...

Done.
Output file is pdump.ioscli.11993246.11Oct2018-14.24.54.out

# ls -la pdump.ioscli.7274712.11Oct2018-14.24.54.out
-rw-r--r-- 1 root staff 85269 Oct 11 14:25 pdump.ioscli.7274712.11Oct2018-14.24.54.out
#

5. Rename the file to reflect your Support Case ID and send the testcase. Example:

# mv <original_filename>.out TS<xxxxxxxxx>.<VIO_hostname>.<original_filename>.out
-rw-r--r-- 1 root staff 85269 Oct 11 14:25 TS123456789.VIOS1.pdump.ioscli.7274712.11Oct2018-14.24.54.out
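The new name can be built from its three parts so that no stray space ends up in the mv command. The following is a minimal sketch; the function name testcase_name is hypothetical, and the case ID and host name are the sample values from the listing above.

```shell
# Hypothetical helper: joins the case ID, the VIOS host name, and the
# original file name with dots to form the testcase file name.
testcase_name() {
    printf '%s.%s.%s\n' "$1" "$2" "$3"
}

# Usage:
#   f=pdump.ioscli.7274712.11Oct2018-14.24.54.out
#   mv "$f" "$(testcase_name TS123456789 VIOS1 "$f")"
```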


6. Where to send the testcase.

Document Information

Modified date:
15 March 2024

UID

isg3T1012503