Diagnosing and resolving an lssam hang

Technote (troubleshooting)


Problem (Abstract)

The "lssam" command is hanging. How can I troubleshoot and resolve this type of problem ?

Cause

One of the IBM.RecoveryRM, IBM.GblResRM, or IBM.StorageRM daemons is not running, is busy, or is hung.

Environment

In environments prior to version 3.1.0.5, IBM.RecoveryRM was the most likely cause. From 3.1.0.5 onward, the most likely causes are related to IBM.StorageRM and IBM.GblResRM.

Diagnosing the problem

If you are interested in narrowing down the root cause of the problem, see the advice in this section; otherwise, skip ahead to the next section, "Resolving the problem".

This section is split into two parts:
1) Identify the problematic Resource Manager
2) Collecting diagnostic data



Identify the problematic Resource Manager

The first step is always to re-run lssam with the trace (-T) option ... this will write trace messages to stdout:

    lssam -T

In the output shown, the last message will likely indicate the "class" of resource being queried, and that class tells you which Resource Manager is not doing its job:

Resource Manager (RM)   Classes owned by the RM (only the ones of interest to lssam)
---------------------   -------------------------------------------------------------
IBM.RecoveryRM          IBM.ResourceGroup, IBM.Equivalency, IBM.ManagedResource
IBM.GblResRM            IBM.Application, IBM.ServiceIP
IBM.StorageRM           IBM.AgFileSystem
IBM.TestRM              IBM.Test
IBM.ConfigRM            IBM.PeerNode, IBM.NetworkInterface

Here's an example of the last few lines of 'lssam -T' output for an lssam hang caused by an unresponsive IBM.GblResRM daemon on one of the nodes:
lssam: calling lsrsrc-api /usr/sbin/rsct/bin/lsrsrc-api -Dtvrtvrtvr -s IBM.Application::"Name like '%'"::Name::OpState::ResourceType::AggregateResource::ResourceHandle::NodeNameList


In addition, you can run specific queries to double-check which Resource Managers are working:

A) To test if the problem is specific to the IBM.RecoveryRM daemon, run the following commands to see if they appear to hang:
    lsrg -m
    lsrsrc -Ab IBM.ResourceGroup

B) To test if the problem is specific to the IBM.GblResRM daemon, run the following commands:
    lsrsrc -Ab IBM.Application
    lsrsrc -Ab IBM.ServiceIP

C) To test if the problem is specific to the IBM.StorageRM daemon, run the following command:
    lsrsrc -Ab IBM.AgFileSystem

D) To test if the problem is specific to the IBM.TestRM daemon, run the following command:
    lsrsrc -Ab IBM.Test

You should repeat the above tests on the other nodes, and keep notes of the node name and the time at which you issued each command if you intend to ask IBM Support for root cause analysis. A good practice is to prefix each command with "date;", so the date/time is always displayed immediately before the ls* command is attempted.
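For example, here is a minimal sketch of running all of the test queries above with the suggested "date;" prefix. The log file name is only an illustration; if one of the queries hangs, the loop simply stops at that query, which itself identifies the suspect Resource Manager:

    # Run each test query with a timestamp in front of it; the log file
    # name below is only an illustration.
    log=/tmp/lssam_hang_tests_$(hostname).log
    for cmd in "lsrg -m" \
               "lsrsrc -Ab IBM.ResourceGroup" \
               "lsrsrc -Ab IBM.Application" \
               "lsrsrc -Ab IBM.ServiceIP" \
               "lsrsrc -Ab IBM.AgFileSystem" \
               "lsrsrc -Ab IBM.Test"
    do
        { date; echo "Running on $(hostname): $cmd"; } | tee -a "$log"
        $cmd >> "$log" 2>&1
    done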

For suspected IBM.GblResRM hangs specifically, you can use the "find_hung_rm.sh" script (attached to this document) to determine on which node IBM.GblResRM is unresponsive. The result will be one of three possibilities:
1) The test queries are successful for all but one node, thus revealing on which node IBM.GblResRM is unresponsive.
2) The test queries against all nodes time out, thus suggesting you're running the "find_hung_rm.sh" script on the actual node containing the problematic IBM.GblResRM daemon ... re-run "find_hung_rm.sh" on another node to confirm this.
3) None of the queries time out, suggesting that IBM.GblResRM is functioning fine on all nodes and is not the root cause of the "lssam" hang.
Attachment: find_hung_rm.2.0.20130207.tar
Note: there is sample output included with the "find_hung_rm.sh" script in a file called "sample.out".
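For reference, the following is a minimal sketch of the same per-node idea. It is not the attached find_hung_rm.sh script (which may be implemented differently), and it assumes passwordless ssh between the cluster nodes, a PATH on the remote side that can find the RSCT commands, and the GNU coreutils 'timeout' command (substitute an equivalent on AIX or Solaris):

    # Test each node's local IBM.GblResRM by running a local-scope query
    # (CT_MANAGEMENT_SCOPE=1) on that node and timing it out if it does
    # not respond.
    for node in $(lsrpnode | awk 'NR>1 {print $1}')
    do
        echo "$(date) : testing IBM.GblResRM on $node"
        if ssh "$node" "CT_MANAGEMENT_SCOPE=1 timeout 60 lsrsrc -Ab IBM.Application" >/dev/null 2>&1
        then
            echo "$node : query completed -- IBM.GblResRM is responding"
        else
            echo "$node : query timed out or failed -- IBM.GblResRM is suspect"
        fi
    done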


Collecting diagnostic data

For hangs where IBM.RecoveryRM is the suspected culprit, kill the hanging lssam and temporarily increase IBM.RecoveryRM's trace level by running (as root):
ctsettrace -s IBM.RecoveryRM -a "_SDK:*=255"
ctsettrace -s IBM.RecoveryRM -a "_RMF:*=255"

Then run 'lssam -T -V' ... it should hang again. At this point, kill the hanging lssam with 'kill -6 <PID>'; this should capture the details needed to point to the root cause. Run 'getsadata -all' on the master node and on the node where lssam was hung, and provide the output to IBM Support via a PMR.
Killing the master IBM.RecoveryRM daemon will re-establish the default trace settings ... IBM.RecoveryRM is respawned automatically.
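Putting the above together, a rough sketch of the IBM.RecoveryRM collection sequence might look like this. It assumes the originally hung lssam has already been killed; the output file name and the five-minute wait are arbitrary choices, not part of the product:

    # 1. Raise the IBM.RecoveryRM trace level (as root).
    ctsettrace -s IBM.RecoveryRM -a "_SDK:*=255"
    ctsettrace -s IBM.RecoveryRM -a "_RMF:*=255"
    # 2. Reproduce the hang in the background and remember the lssam PID;
    #    /tmp/lssam_TV.out is only an illustrative file name.
    lssam -T -V > /tmp/lssam_TV.out 2>&1 &
    lssam_pid=$!
    sleep 300                 # wait long enough to be confident it is hung
    # 3. Record when and where the abort is taken, then abort the hung lssam.
    date; hostname
    kill -6 "$lssam_pid"
    # 4. Collect the data for IBM Support (run this on the master node too).
    getsadata -all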

For hangs where IBM.GblResRM is the suspected culprit, use the "find_hung_rm.sh" script to figure out on which node IBM.GblResRM is not responding, then determine its PID, and finally run gstack against that PID (as root):
ps -ef | grep IBM.GblResRMd | grep -v grep | awk '{print $2}'
gstack <pid>
Save the gstack output to a file that you can provide to IBM Support later on, and note the date, time, and hostname.
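For example, one convenient (purely illustrative) way to capture the gstack output together with the date, time, and hostname:

    # Locate the IBM.GblResRMd PID and save its stack with a timestamped,
    # hostname-tagged file name (the /tmp location is only a suggestion).
    pid=$(ps -ef | grep IBM.GblResRMd | grep -v grep | awk '{print $2}')
    out=/tmp/IBM.GblResRMd_gstack_$(hostname)_$(date +%Y%m%d.%H%M%S).out
    { date; hostname; gstack "$pid"; } > "$out"
    echo "gstack output saved to $out"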
In addition, you can force a core file to be generated:
kill -6 <pid>
Run 'getsadata -all' on the master node, on the node where IBM.GblResRM was found to be unresponsive, and on the node where lssam was hung (these might all be the same node) ... provide the output to IBM Support via a PMR. Please include all of the date, time, and node name information.

Resolving the problem

Check and try the following:


1. Do you see any messages or errors of any kind? Search the IBM Support Portal for suggestions based on any error messages.

2. Is the domain online? If it is not, the lssam command will not be serviced and may even appear hung in earlier releases of TSAMP. Use 'lsrpdomain' to check the state of the domain. Use 'startrpdomain <domain_name>' to start the domain if it is found to be offline (assuming you want to bring it online).

3. Is the node (on which you are trying to run the lssam command) online? Use the "lsrpnode" command to make sure the node is online; otherwise, the Resource Managers will not be running on this node and the lssam command will not be serviced. Use 'startrpdomain <domain_name>' from the offline node if you want to bring it online.
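For example (the domain name "SA_Domain" below is only a placeholder for your own domain name):

    lsrpdomain                  # check the domain name and its OpState
    lsrpnode                    # check the OpState of each node
    startrpdomain SA_Domain     # bring the domain online if that is the intent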

4. Assuming the domain and node are online, check that the key Resource Managers are running by issuing the following commands on the local node:
lssrc -ls IBM.RecoveryRM
lssrc -ls IBM.GblResRM
lssrc -ls IBM.StorageRM

5. Is lssam really hung ... how long have you waited for a response? It is possible that you issued the lssam command while the automation engine (IBM.RecoveryRM) was performing resource validation or re-validation ... during this period, no ls* commands are serviced. You can check whether resource validation is complete with the following query:
lssrc -ls IBM.RecoveryRM | grep -i "In Config State"

    => True means complete
    => False means still initializing, so not ready to service commands like lssam

6. Have you tried running lssam on the other nodes in the cluster to determine whether the problem is specific to a single node? If the problem is specific to one node, try to identify which Resource Manager is causing the hang (see the "Diagnosing the problem" section above) and kill that local Resource Manager daemon (IBM.RecoveryRMd, IBM.GblResRMd, or IBM.StorageRMd) ... it will re-spawn automatically and hopefully clear the hang condition.

7. If the lssam hang is diagnosed to be caused by an IBM.RecoveryRM-related query, force the "master" IBM.RecoveryRM daemon to move to another node, as follows:
a) Identify the node hosting the master:
lssrc -ls IBM.RecoveryRM | grep -i master
b) On that node identified above, find the PID for IBM.RecoveryRMd:
ps -ef | grep IBM.RecoveryRMd | grep -v grep | awk '{print $2}'
c) Kill that PID
kill -6 <pid>
Check that IBM.RecoveryRM is re-spawned (new PID number).
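One quick way to confirm the re-spawn (the same check applies to IBM.GblResRM in step 8):

    lssrc -s IBM.RecoveryRM     # should show status 'active' with a new PID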

8. If the lssam hang is diagnosed to be caused by an IBM.GblResRM-related query, kill the IBM.GblResRMd daemon on the node where it was found to be unresponsive:
ps -ef | grep IBM.GblResRMd | grep -v grep | awk '{print $2}'
kill -6 <pid>
Check that IBM.GblResRM is re-spawned (new PID number).

9. Finally, have you considered installing the latest Fixpack?
http://www.ibm.com/support/docview.wss?uid=swg27039236


Document information


More support for:

Tivoli System Automation for Multiplatforms

Software version:

3.1, 3.2, 3.2.1, 3.2.2, 4.1

Operating system(s):

AIX, Linux, Solaris

Reference #:

1293701

Modified date:

2013-02-07
