
Diagnosing and resolving an lssam hang

Technote (troubleshooting)


Problem (Abstract)

The "lssam" command is hanging. How can I troubleshoot and resolve this type of problem ?

Cause

One of the Resource Manager daemons (IBM.RecoveryRM, IBM.GblResRM, IBM.StorageRM, or IBM.TestRM) is not running, is busy, or is hung.

Environment

For environments prior to version 3.1.0.5, IBM.RecoveryRM was the most likely cause.

From version 3.1.0.5 onward, the most likely causes are related to IBM.StorageRM and IBM.GblResRM.
If you are using RSCT v3.2.0.0 through v3.2.0.5, IBM.TestRM is the likely cause of the hang.


Diagnosing the problem

If you want to narrow down the root cause of the problem, follow the advice in this section; otherwise, skip ahead to the next section, "Resolving the problem".


This section is split into two parts:
Identify the problematic Resource Manager

The first step is always to re-run lssam with the trace (-T) option, which writes trace messages to stdout:

    lssam -T

In the output, the last message will likely indicate the "class" of resource being queried, which in turn identifies the Resource Manager that is not doing its job:

Resource Manager (RM)    Classes owned by the RM (only those of interest to lssam)
IBM.RecoveryRM           IBM.ResourceGroup, IBM.Equivalency, IBM.ManagedResource
IBM.GblResRM             IBM.Application, IBM.ServiceIP
IBM.StorageRM            IBM.AgFileSystem
IBM.TestRM               IBM.Test
IBM.ConfigRM             IBM.PeerNode, IBM.NetworkInterface

Here's an example of the last few lines of 'lssam -T' output for an lssam hang caused by an unresponsive IBM.GblResRM daemon on one of the nodes:
lssam: calling lsrsrc-api /usr/sbin/rsct/bin/lsrsrc-api  -Dtvrtvrtvr -s IBM.Application::"Name like '%'"::Name::OpState::ResourceType::
AggregateResource::ResourceHandle::NodeNameList

It is also necessary to identify the node on which the Resource Manager is not responding to command-line queries. This can be done with the "tsahealth" utility.



Alternatively, you can run specific queries which target individual Resource Managers:

A) To test if the problem is specific to the IBM.RecoveryRM daemon:
    lsrg -m
    lsrsrc -Ab IBM.ResourceGroup

B) To test if the problem is specific to the IBM.GblResRM daemon:
    lsrsrc -Ab IBM.Application

C) To test if the problem is specific to the IBM.StorageRM daemon:
    lsrsrc -Ab IBM.AgFileSystem

D) To test if the problem is specific to the IBM.TestRM daemon:
    lsrsrc -Ab IBM.Test

Repeat the command on each node to find out which ones hang. Add "date;" in front of each command, so the date/time is always displayed immediately before the ls* command is attempted.
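
For example, to timestamp the IBM.GblResRM query on a node:

    date; lsrsrc -Ab IBM.Application

If the command hangs, the output of 'date' records exactly when the hang began on that node.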


Collecting diagnostic data

For hangs where IBM.RecoveryRM is the suspected culprit, kill the hanging lssam and temporarily increase IBM.RecoveryRM's trace level by running (as root):
ctsettrace -s IBM.RecoveryRM -a "_SDK:*=255"
ctsettrace -s IBM.RecoveryRM -a "_RMF:*=255"

Then run 'lssam -T -V'; it should hang again. At this point, kill the hanging lssam with 'kill -6 <PID>'; this should capture the details needed to point to the root cause. Run 'getsadata -all' on the master node and on the node where lssam was hung, and provide the data to IBM Support via a PMR.
Killing the master IBM.RecoveryRM will re-establish the default trace settings; IBM.RecoveryRM is respawned automatically.
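
If you need to look up the PID of the hanging lssam before issuing 'kill -6', the same style of PID lookup used elsewhere in this document can be applied (this sketch assumes only one lssam instance is running on the node):

ps -ef | grep lssam | grep -v grep | awk '{print $2}'

Pass the resulting PID to 'kill -6'.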

For the other Resource Managers, use 'tsahealth' to determine on which node the Resource Manager is not responding, then determine its PID:
ps -ef | grep IBM.<RM>d | grep -v grep | awk '{print $2}'
where <RM> is either "GblResRM", "TestRM", or "StorageRM".

Run gstack or procstack against that PID (as root):
LINUX:  gstack <pid>
AIX:  procstack <pid>
Save the backtrace output to a file that you can provide to IBM Support later.
Note the date, the time, and the hostname.
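
As a sketch, you can embed the hostname and a timestamp in the output file name so this information is captured automatically (the /tmp path is only an example; on AIX, substitute procstack for gstack):

gstack <pid> > /tmp/backtrace_$(hostname)_$(date +%Y%m%d_%H%M%S).out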

In addition, force a core file to be generated:
kill -6 <pid>

Run 'getsadata -all' on the master node, on the node where the Resource Manager was found to be unresponsive, and on the node where lssam was hung (these might all be the same node); then provide the data to IBM Support via a PMR. Please include all the date, time, and node name information collected above.

Resolving the problem

Check and try the following:


1. Do you see any messages or errors of any kind? Search the IBM Support Portal for suggestions based on any error messages.

2. Is the domain online? If it is not, the lssam command will not be serviced and, in earlier releases of TSAMP, may even appear hung. Use 'lsrpdomain' to check the state of the domain. Use 'startrpdomain <domain_name>' to start the domain if it is offline and you want to bring it online.

3. Is the node (on which you are trying to run the lssam command) online? Use the "lsrpnode" command to make sure the node is online; otherwise the Resource Managers will not be running on this node and the lssam command will not be serviced. Use 'startrpdomain <domain_name>' from the offline node if you want to bring it online.
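
A minimal sequence of checks for steps 2 and 3, assuming a domain named "SA_Domain" (substitute your own domain name):

lsrpdomain
lsrpnode
startrpdomain SA_Domain

Both 'lsrpdomain' and 'lsrpnode' report an OpState column; the domain and the local node should show Online before lssam can be serviced.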

4. Assuming the domain and node are online, check that the key Resource Managers are running by issuing the following commands on the local node:
lssrc -ls IBM.RecoveryRM
lssrc -ls IBM.GblResRM
lssrc -ls IBM.StorageRM
lssrc -ls IBM.TestRM
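
To run all four checks in one go, a small loop such as the following can be used (a convenience sketch; the individual commands are exactly those listed above):

for rm in IBM.RecoveryRM IBM.GblResRM IBM.StorageRM IBM.TestRM; do
    echo "=== $rm ==="
    lssrc -ls $rm
done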

5. Is lssam really hung? How long have you waited for a response? It is possible that you issued the lssam command while the automation engine (IBM.RecoveryRM) was performing resource validation or re-validation; during this period, no ls* commands are serviced. You can check whether resource validation is complete with the following query:
lssrc -ls IBM.RecoveryRM | grep -i "In Config State"

    => True means complete
    => False means still initializing, so not ready to service commands like lssam
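
If validation is still in progress, you can poll the same query until it reports complete. The following loop is only a sketch; the exact wording and capitalization of the "In Config State" line can vary, so verify it manually first:

until lssrc -ls IBM.RecoveryRM | grep -i "In Config State" | grep -qi true; do
    sleep 30
done
echo "Resource validation complete"
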
6. Have you tried running lssam on the other nodes in the cluster to determine whether the problem is specific to a single node? If it is, try to identify which Resource Manager is causing the hang (see the "Diagnosing the problem" section above) and kill that local Resource Manager daemon (IBM.RecoveryRMd, IBM.GblResRMd, IBM.StorageRMd, or IBM.TestRMd); it will respawn automatically and hopefully clear the hang condition.

7. If the lssam hang is diagnosed to be because of an IBM.RecoveryRM related query, then force the "master" IBM.RecoveryRM to move to another node, as follows:
a) Identify the node hosting the master:
lssrc -ls IBM.RecoveryRM | grep -i master
b) On that node identified above, find the PID for IBM.RecoveryRMd:
ps -ef | grep IBM.RecoveryRMd | grep -v grep | awk '{print $2}'
c) Kill that PID
kill -6 <pid>
Check that IBM.RecoveryRM is re-spawned (new PID number).
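
Steps b) and c) can be combined into a single command on the master node; this is just a convenience sketch of the same commands, so double-check the PID output before killing anything:

kill -6 $(ps -ef | grep IBM.RecoveryRMd | grep -v grep | awk '{print $2}')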

8. If the lssam hang is diagnosed to be because of an IBM.GblResRM or IBM.StorageRM or IBM.TestRM related query, then kill the IBM.*RMd daemon on the node where it was found to be unresponsive. The following example is for IBM.GblResRM:
ps -ef | grep IBM.GblResRMd | grep -v grep | awk '{print $2}'
kill -6 <pid>
Check that IBM.GblResRM is re-spawned (new PID number).

9. Finally, have you considered installing the latest fix pack?
http://www.ibm.com/support/docview.wss?uid=swg27039236

Document information

More support for: Tivoli System Automation for Multiplatforms

Software version: 3.1, 3.2, 3.2.1, 3.2.2, 4.1

Operating system(s): AIX, Linux, Solaris

Reference #: 1293701

Modified date: 2013-02-07