IBM Support

Fixing the No RMC Connection Error

Question & Answer


Question

How do I correct the "No RMC Connection" error that appears when I attempt a DLPAR operation, an LPM operation, or other similar operations involving virtual machines in a Power Systems environment?

Cause

Resource Monitoring and Control (RMC) is a suite of applications built into AIX and available for some Enterprise Linux offerings. It requires a fixed IP configuration and secure communication over both TCP and UDP between all hosts in the RMC peer domain. This communication can break down because of reconfigurations, reinstallations from backups, network issues, or even code defects. There is no one-size-fits-all solution to any RMC problem, but some simple checks can verify the configuration and attempt to repair the trusted configuration between a peer, such as an LPAR, and its management console, such as an HMC.

Answer


Some basic commands can be run to check the status of the RMC configuration, and which commands you use depends on the RSCT version. RSCT 3.2.x.x levels are the newest and are available in the latest releases of AIX and VIOS. More common installations will have RSCT at 3.2.x.x levels or higher. The basic queries you can run to check RMC health are listed below.
Please read the CAUTION note under step 3.c.(4) first. For Shared Storage Pool, you should not use the /usr/sbin/rsct/install/bin/recfgct command and cluster services must be stopped.

1. To check RMC status on an LPAR as root (AIX or VIOS)

a. Applies to all AIX and VIOS levels

lslpp -l rsct.core.rmc ---> This fileset needs to be 3.1.0.x level or higher
/usr/sbin/rsct/bin/ctsvhbac ---> Are all IP and host IDs trusted?
lsrsrc IBM.MCP ---> Is the HMC listed as a resource?
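The fileset-level check above amounts to comparing two dotted version strings. The following is a minimal shell sketch of that comparison, assuming a level made of four numeric fields; the function name level_at_least is illustrative and is not part of AIX or RSCT.

```shell
# Sketch only: succeed (exit status 0) when dotted level $1 is at least level $2.
# Assumes both levels have four numeric fields, for example 3.2.6.0.
level_at_least() {
    awk -v have="$1" -v need="$2" 'BEGIN {
        split(have, h, ".")
        split(need, w, ".")
        # Compare field by field, numerically, from most to least significant.
        for (i = 1; i <= 4; i++) {
            if ((h[i] + 0) > (w[i] + 0)) exit 0
            if ((h[i] + 0) < (w[i] + 0)) exit 1
        }
        exit 0   # all four fields are equal
    }'
}

# Example: compare a level you read from the lslpp output against the minimum.
if level_at_least "3.2.6.0" "3.1.0.0"; then
    echo "rsct.core.rmc level is sufficient"
else
    echo "rsct.core.rmc level is too low"
fi
```

On an LPAR the installed level could be supplied by something like `lslpp -Lqc rsct.core.rmc | cut -d: -f3`, on the assumption that the colon-separated lslpp output carries the level in the third field; confirm that on your system before relying on it.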

b. Only applies if AIX 6.1 TL5 or lower is used

lslpp -l csm.client ---> This fileset needs to be installed
lsrsrc IBM.ManagementServer ---> Is HMC listed as a resource?

2. To check RMC status on Management Console (as a hmcsuperadmin user)

lspartition -dlpar ---> Is the LPAR's DCaps value non-zero?
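When many partitions are listed, picking out the zero DCaps entries by eye is error-prone. The sketch below filters saved or piped lspartition -dlpar output. It assumes each entry has a line containing "Partition:<...>" followed by a line containing "DCaps:<0x...>"; that layout, the function name find_zero_dcaps, and the sample data are assumptions for illustration, so adjust the patterns if your HMC prints something different.

```shell
# Sketch only: print the Partition:<...> details for every entry whose DCaps is zero.
# Reads lspartition -dlpar output on standard input.
find_zero_dcaps() {
    awk '
        /Partition:</ {
            # Remember the text inside Partition:<...> for the current entry.
            part = $0
            sub(/^.*Partition:</, "", part)
            sub(/>.*$/, "", part)
        }
        /DCaps:</ {
            # Extract the value inside DCaps:<...> and drop the 0x prefix.
            caps = $0
            sub(/^.*DCaps:</, "", caps)
            sub(/>.*$/, "", caps)
            sub(/^0[xX]/, "", caps)
            # Only zero digits left means no DLPAR capabilities were reported.
            if (caps ~ /^0*$/) print part
        }
    '
}

# Example with made-up sample data (hypothetical serial numbers, names, addresses):
printf '%s\n' \
    '<#0> Partition:<2*9117-MMA*10ABCDE, lpar1.example.com, 10.0.0.21>' \
    '       Active:<1>, OS:<AIX, 7.2, 7200-05-03-2148>, DCaps:<0x2c5f>, CmdCaps:<0x1b, 0x1b>, PinnedMem:<1035>' \
    '<#1> Partition:<3*9117-MMA*10ABCDE, lpar2.example.com, 10.0.0.22>' \
    '       Active:<0>, OS:<, , >, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<0>' |
    find_zero_dcaps
```

Because the HMC shell is restricted, one option is to run the command remotely from a workstation and pipe the result into the function, for example `ssh <HMC user>@<HMC IP> lspartition -dlpar | find_zero_dcaps`.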

3. If you answer no to any of the above then corrective action is required.

a. Missing file sets or fixes need to be installed.

b. If RSCT file set rsct.core.rmc is at 3.1.5.0 or 3.2.0.0 then APARs apply.

c. Fix It Commands (run as root on LPAR and Management Console)

(1) You can try one of the following commands first as a super admin user on the HMC

lspartition -dlparreset (use if HMC v7)
diagrmc --autocorrect -v (Use if HMC v8 or higher)

(2) On the LPAR try running these commands first as well

/usr/sbin/rsct/bin/rmcctrl -z
/usr/sbin/rsct/bin/rmcctrl -A
/usr/sbin/rsct/bin/rmcctrl -p

(3) Wait a few minutes, then check the status again by running the lsrsrc command listed above on the LPAR to see whether the resources for the HMC are loaded. If they are not, proceed with the next options with caution.

(4) The commands to run as root (note the CAUTION below)
/usr/sbin/rsct/install/bin/recfgct
/usr/sbin/rsct/bin/rmcctrl -p

(a) CAUTION: The command is safe to run on the HMC; however, running the recfgct command on a node in an RSCT peer domain or in a Cluster Aware AIX (CAA) environment should NOT be done without taking other precautions first. This note is not designed to cover all CAA or other RSCT cluster considerations, so if you have an application that is RSCT aware, such as PowerHA, VIOS Storage Pools, and several others, do not proceed until you have contacted support for those specific products. If you need to determine whether your system is a member of a CAA cluster, please refer to the Reliable Scalable Cluster Technology document titled "Troubleshooting the resource monitoring and control (RMC) subsystem".

https://www.ibm.com/docs/en/rsct/3.2?topic=troubleshooting-rmc-subsystem

(b) Pay particular attention to the section titled Diagnostic procedures to help learn whether your node is a member of any domain other than the Management Console's management domain.

(5) If the above does not help, you will need to request pesh passwords from IBM Support for your Management Console so you can run the recfgct and rmcctrl commands listed above.

(6) After running the above commands, it will take several minutes before the RMC connection is restored. The best way to monitor progress is to run the lspartition -dlpar command on the Management Console every few minutes and watch for the target LPAR to show up with a non-zero DCaps value.

4. Things to consider before using the above fix commands or if the reconfigure commands don't help.

a. Network issues are often overlooked or disregarded. There are some network configuration issues that might need to be addressed if the commands that reconfigure RSCT don't restore DLPAR function. Determining if there is a network problem will require additional debug steps not covered in this tech note. However, there are some common network issues that can prevent RMC communications from passing between the Management Console and the LPARs and they include the following.

(1) Firewalls blocking bidirectional RMC related traffic for UDP and TCP on port 657. A crude field test for TCP connectivity is to telnet from the LPAR to port 657 on the HMC to see if you can connect (telnet <HMC IP> 657). If the connection attempt times out, port 657 is blocked somewhere along the path. The HMC has a GUI-based firewall configuration tool, which is accessed using the Change Network Settings task. For firewall issues beyond the HMC you will need to work with your network team to open up firewall access for the RMC channel.

(2) Mix of jumbo frames and standard Ethernet frames between the Management Console and LPARs.

(3) Multiple interfaces with IP addresses on the LPARs that can route traffic to the Management Console.

(4) Network Address Translation (NAT) is not supported in HMC RMC domains. Either NAT would need to be disabled on the network, or the IP scheme of the HMC and LPAR would have to be reconfigured so that the HMC and LPAR both see the actual IP address being used by each node.

b. The above steps cover only the more common and simple issues involved in RMC communication errors. If you are unable to reestablish the RMC connection by running the suggested commands, a more detailed look at the problem is required. Data gathering tools such as pedbg on the Management Console and ctsnap on the LPARs are the next tools to use to look at the problem more closely.

5. If the basic items listed above have been checked or performed and RMC is still not working, then it is appropriate to collect additional data.

a. RMC Connection Errors Data Collection on LPAR

(1) Please check the clock setting on LPAR and management console to make sure they are in sync (use date command). Synchronizing clocks will make data analysis much easier.

(2) From LPAR collect a snap

(a) If AIX LPAR as root run
snap -gtkc

(b) If VIOS LPAR as padmin run
snap


b. Collect a ctsnap from the LPAR as root

/usr/sbin/rsct/bin/ctsnap -x runrpttr

c. Collect a pedbg from the management console as described below.

(1) If the management console is an HMC, run "pedbg -c -q 4" as user hscpe and refer to the following document for additional information if needed.

HMC Enhanced View: Collecting PEDBG from the HMC
https://www-01.ibm.com/support/docview.wss?uid=nas8N1022548

d. Rename the data files collected on the LPAR.

(1) Rename the snap file

(a) On an AIX LPAR the snap is in /tmp/ibmsupt, so as root run the following.

mv /tmp/ibmsupt/snap.pax.Z /tmp/ibmsupt/<PMR#.Branch.000>-snap.pax.Z

(b) On a VIOS LPAR the snap is in /home/padmin, so as padmin run the following.

mv /home/padmin/snap.pax.Z /home/padmin/<PMR#.Branch.000>-snap.pax.Z

(2) Rename the ctsnap file

Note: The output file for ctsnap will be in /tmp/ctsupt with a name similar to following.
ctsnap.<hostname>.<date time>.tar.gz
Renaming it requires you to list /tmp/ctsupt so you can view the current name.

ls -l /tmp/ctsupt

mv /tmp/ctsupt/<ctsnap filename> /tmp/ctsupt/<PMR#.Branch.000>-<ctsnap filename>
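Because the ctsnap file name contains the host name and a timestamp, the rename can also be done with a glob instead of typing the name by hand. This is a minimal sketch, assuming the ctsnap.<hostname>.<date time>.tar.gz naming described above; the function name prefix_ctsnap and the example prefix are illustrative only.

```shell
# Sketch only: add a prefix to every ctsnap archive in a directory.
# $1 = directory holding the ctsnap output, $2 = prefix to add to each file name.
prefix_ctsnap() {
    dir=$1
    prefix=$2
    for f in "$dir"/ctsnap.*.tar.gz; do
        [ -e "$f" ] || continue              # the glob matched nothing
        mv "$f" "$dir/${prefix}-$(basename "$f")"
    done
}

# Example (hypothetical prefix; substitute your own <PMR#.Branch.000> value):
# prefix_ctsnap /tmp/ctsupt 12345.678.000
```

Files that do not match the ctsnap pattern are left alone, so older renamed archives in the same directory are not prefixed a second time.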

e. Transmit the data files to IBM.

(1) FTP or HTTPS site is testcase.software.ibm.com

(2) User ID is anonymous and password can be your email address

(3) Directory is /toibm/aix

(4) Include the snap and ctsnap from the LPAR

(5) Include the pedbg from the HMC


Document Information

Modified date:
30 July 2022

UID

isg3T1020611