
Diagnosing and resolving an lssam hang

Troubleshooting


Problem

The "lssam" command is hanging. How can I troubleshoot and resolve this type of problem?

Cause

One of the Resource Manager daemons (IBM.RecoveryRM, IBM.GblResRM, or IBM.StorageRM) is not running, is busy, or is hung.

Environment

For TSAMP environments at versions earlier than 3.1.0.5, IBM.RecoveryRM was the most likely cause.
After TSAMP version 3.1.0.5, the most likely causes are related to IBM.StorageRM and IBM.GblResRM.
If the environment uses RSCT v3.2.0.0 through v3.2.0.5, IBM.TestRM is likely causing the hang.

Diagnosing The Problem

If you're interested in trying to narrow down the root cause of the problem, see the advice below; otherwise, skip to the next section, "Resolving The Problem".

This section is split into two parts:
  • Identify the problematic Resource Manager
  • Collecting diagnostic data

Identify the problematic Resource Manager

The first step is to run lssam with the trace (-T) option ... this writes trace messages to stdout:
  • lssam -T

In the output, the last message indicates the "class" of resource being queried, and that tells us which Resource Manager is experiencing issues:
 
Resource Manager (RM)    Classes owned by the RM (only the ones of interest to lssam)
IBM.RecoveryRM           IBM.ResourceGroup, IBM.Equivalency, IBM.ManagedResource
IBM.GblResRM             IBM.Application, IBM.ServiceIP
IBM.StorageRM            IBM.AgFileSystem
IBM.TestRM               IBM.Test
IBM.ConfigRM             IBM.PeerNode, IBM.NetworkInterface

Here's an example of the last few lines of 'lssam -T' output for an lssam hang caused by an unresponsive IBM.GblResRM daemon on one of the nodes:
lssam: calling lsrsrc-api /usr/sbin/rsct/bin/lsrsrc-api -Dtvrtvrtvr -s IBM.Application::"Name like '%'"::Name::OpState::ResourceType::AggregateResource::ResourceHandle::NodeNameList

It is also necessary to identify the node on which the Resource Manager is not responding to command-line queries.

You can run specific queries, which target individual Resource Managers:

A) To test if the problem is specific to the IBM.RecoveryRM daemon, issue the following commands:
  • lsrg -m
  • lsrsrc -Ab IBM.ResourceGroup

B) To test if the problem is specific to the IBM.GblResRM daemon, issue the following command:
  • lsrsrc -Ab IBM.Application

C) To test if the problem is specific to the IBM.StorageRM daemon, issue the following command:
  • lsrsrc -Ab IBM.AgFileSystem

D) To test if the problem is specific to the IBM.TestRM daemon, issue the following command:
  • lsrsrc -Ab IBM.Test

Repeat the command on each node to find out which ones hang. Add "date;" in front of each command so that the date and time are always displayed immediately before the ls* command is attempted, as in the sketch below.
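As an illustration only, the per-node checks could be scripted as in the following sketch, assuming ssh access between the nodes; the node names, the class being queried, and the use of GNU timeout (Linux) are placeholders to adapt to your environment:

for node in node1 node2 node3; do
    echo "=== $node ==="
    ssh $node 'date; timeout 60 lsrsrc -Ab IBM.Application'
done

A node that prints the date but produces no class output before the timeout expires is the likely location of the unresponsive Resource Manager.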
 
Collecting diagnostic data

For hangs where IBM.RecoveryRM is the suspected culprit, kill the hanging lssam and temporarily increase IBM.RecoveryRM's trace level by running (as root):
ctsettrace -s IBM.RecoveryRM -a "_SDK:*=255";ctsettrace -s IBM.RecoveryRM -a "_RMF:*=255"

Then run 'lssam -T -V' ... it should hang again. At this point, kill the hanging lssam with 'kill -6 <PID>'; this should capture the details needed to point to the root cause. Run 'getsadata -all' on the IBM.RecoveryRM master node and on the node where lssam was hung, and provide the output to IBM Support with a case.
Note that killing the master IBM.RecoveryRM daemon resets the trace settings to their defaults ... IBM.RecoveryRM is automatically respawned.
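Putting these IBM.RecoveryRM steps together, a typical collection sequence looks like the following sketch (the <PID> is a placeholder for the process ID of the hung lssam):

ctsettrace -s IBM.RecoveryRM -a "_SDK:*=255"; ctsettrace -s IBM.RecoveryRM -a "_RMF:*=255"
lssam -T -V                          # expect this to hang again
ps -ef | grep lssam | grep -v grep   # from another terminal, note the PID of the hung lssam
kill -6 <PID>                        # abort it with SIGABRT so the diagnostic details are captured
getsadata -all                       # run on the IBM.RecoveryRM master node and on the node where lssam hung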

For the other Resource Managers, use "tsahealth" to determine the node on which the Resource Manager is not responding, then determine its PID on that node:
ps -ef | grep IBM.<RM>d | grep -v grep | awk '{print $2}'
where <RM> is either "GblResRM", "TestRM", or "StorageRM".

Run gstack or procstack against that PID (as root):
LINUX: gstack <pid>
AIX: procstack <pid>
Save the backtrace output to a file, which you can provide to IBM Support later on.
Note the date, time, and hostname.

In addition, force a core file to be generated:
kill -6 <pid>
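For instance, on a Linux node where IBM.GblResRM is the suspect, the collection might look like the following sketch (the PID 12345 and the output file name are illustrative):

date; hostname                                                  # record when and where the data was taken
ps -ef | grep IBM.GblResRMd | grep -v grep | awk '{print $2}'   # returns the PID, for example 12345
gstack 12345 > /tmp/IBM.GblResRMd.stack.txt                     # use procstack instead on AIX
kill -6 12345                                                   # force a core file via SIGABRT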

Finally, run 'getsadata -all' on the IBM.RecoveryRM master node, on the node where the Resource Manager was found to be unresponsive, and on the node where lssam was hung (these might all be the same node) ... provide the output to IBM Support with a case. Include all the date, time, and node name information collected.

Resolving The Problem

Check and try the following:

1. Do you see any messages or errors of any kind? Search the IBM Support Portal for suggestions based on any error messages.

2. Is the domain online? If it is not, the lssam command does not function and may even appear hung in earlier releases of TSAMP. Use 'lsrpdomain' to check the state of the domain, and use 'startrpdomain <domain_name>' to start the domain if it is found to be offline (assuming you want to bring it online).
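For example (the domain name is a placeholder; the OpState column is part of the standard lsrpdomain display):
lsrpdomain                    # the OpState column should show Online for the domain
startrpdomain <domain_name>   # only needed if the domain shows Offline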

3. Is the node on which you are trying to run the lssam command online? Use the "lsrpnode" command to make sure the node is online; otherwise, the Resource Managers will not be running on this node and the lssam command will not be serviced. Use 'startrpdomain <domain_name>' from the offline node if you want to bring it online.
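Similarly (the domain name is a placeholder):
lsrpnode                      # every node should show OpState Online
startrpdomain <domain_name>   # run this on the offline node to bring it back online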

4. Assuming the domain and node are online, check that the key Resource Managers are running by issuing the following commands on the local node:
lssrc -ls IBM.RecoveryRM
lssrc -ls IBM.GblResRM
lssrc -ls IBM.StorageRM
lssrc -ls IBM.TestRM

5. Is lssam really hung ... how long have you waited for a response? It is possible you issued the lssam command while the automation engine (IBM.RecoveryRM) was performing resource validation or re-validation ... during this period, no ls* commands are serviced. You can check if resource validation is complete with the following query:
lssrc -ls IBM.RecoveryRM | grep -i "In Config State"
  • True means validation is complete
  • False means the engine is still initializing, so it is not ready to service commands like lssam
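If you prefer to wait for validation to finish before retrying, a simple polling loop such as the following sketch can be used (the 30-second interval is arbitrary):

while lssrc -ls IBM.RecoveryRM | grep -i "In Config State" | grep -qi False; do
    date; echo "IBM.RecoveryRM is still validating resources ..."
    sleep 30
done
lssam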

6. Have you tried running lssam on the other nodes in the cluster to determine if the problem is specific to a single node? If it is a problem specific to one node, then try to identify which Resource Manager is causing the hang (see the "Diagnosing the problem" section above) and kill that local Resource Manager (IBM.RecoveryRMd, IBM.GblResRMd, IBM.StorageRMd, IBM.TestRMd) ... it will re-spawn automatically and hopefully clear the hang condition.

7. If the lssam hang is diagnosed to be because of an IBM.RecoveryRM related query, then force the "master" IBM.RecoveryRM daemon to move to another node, as follows:
a) Identify the node hosting the master IBM.RecoveryRM daemon:
lssrc -ls IBM.RecoveryRM | grep -i master
b) On that node identified above, find the PID for IBM.RecoveryRMd:
ps -ef | grep IBM.RecoveryRMd | grep -v grep | awk '{print $2}'
c) Kill that PID
kill -6 <pid>
Check that IBM.RecoveryRM is re-spawned (new PID number).

8. If the lssam hang is diagnosed to be because of an IBM.GblResRM or IBM.StorageRM or IBM.TestRM related query, then kill the IBM.*RMd daemon on the node where it was found to be unresponsive. The following example is for IBM.GblResRM:
ps -ef | grep IBM.GblResRMd | grep -v grep | awk '{print $2}'
kill -6 <pid>
Check that IBM.GblResRM is re-spawned (new PID number).

9. Finally, have you considered installing the latest Fixpack?
https://www.ibm.com/support/pages/node/964892


Document Information

Modified date:
01 March 2022

UID

swg21293701