The most common reasons for failures with Dynamic Logical Partitioning

Technote (FAQ)


Question

I need to add or remove processors, memory, or I/O devices from an LPAR using the HMC. It's not working and I would like to know why.

Cause

Most likely, the cause is an RMC connection failure between the HMC and the LPAR.

Answer

Starting point for troubleshooting problems with Dynamic Logical Partitioning.

The procedures listed below apply to Power4, Power5, and Power6 HMCs.

The most common cause is an RMC connection failure between the HMC and the LPAR.

The first place to check is the HMC, using query commands at the HMC restricted shell command prompt.

# lspartition -dlpar

If there is no output at all, then there is an RMC problem affecting all LPARs attached to this particular HMC. If this happens, it is OK to close all Serviceable Events (under Service Focal Point) and reboot the HMC.

# hmcshutdown -r -t now

Once the HMC reboots, wait about 15 minutes and re-run the command.

# lspartition -dlpar

If there is still no output, open a call with technical support.

For RMC to work, port 657 (UDP and TCP) must be open in both directions between the HMC public interface and the LPARs.

Check the partition in question. For DLPAR to function, the partition must be listed in the output with the correct IP address of the LPAR. The Active value must be higher than zero, and the DCaps value must be higher than 0x0.


Example of a working lpar

<#1> Partition:<11*9117-570*10XXXX, correct_hostname.domain, correct_ip>
Active:<1>, OS:<AIX, 5.3, 5.3>, DCaps:<0x3f>, CmdCaps:<0xb, 0xb>, PinnedMem:<146>

Example of non-working lpar

<#9> Partition:<10*9117-570*10XXXX, hostname, ip>
Active:<0>, OS:<, , >, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<0>
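The criteria above (Active higher than zero, DCaps higher than 0x0) can be checked mechanically. The following is a minimal sketch, not an IBM-supplied tool: the function name check_dlpar is hypothetical, and it assumes the two-line-per-partition layout shown in the examples. Because the HMC restricted shell limits what can be run locally, it is assumed here that the output is captured from a workstation (for example over ssh) and piped into the function.

```shell
#!/bin/sh
# Sketch: flag partitions that do not look DLPAR capable.
# Reads `lspartition -dlpar` style output on stdin.
# check_dlpar is a hypothetical helper name, not an HMC command.
check_dlpar() {
    awk '
        /Partition:</ {
            # Keep the text between "Partition:<" and the next ">".
            part = $0
            sub(/.*Partition:</, "", part)
            sub(/>.*/, "", part)
        }
        /Active:</ {
            active = $0
            sub(/.*Active:</, "", active)
            sub(/>.*/, "", active)
            dcaps = $0
            sub(/.*DCaps:</, "", dcaps)
            sub(/>.*/, "", dcaps)
            # Strip "0x" and leading zeros; an empty remainder means DCaps is zero.
            sub(/^0[xX]0*/, "", dcaps)
            ok = (active + 0 > 0 && dcaps != "")
            print part " -> " (ok ? "OK" : "NOT_READY")
        }
    '
}

# Demo using the two example partitions from this document.
check_dlpar <<'EOF'
<#1> Partition:<11*9117-570*10XXXX, correct_hostname.domain, correct_ip>
Active:<1>, OS:<AIX, 5.3, 5.3>, DCaps:<0x3f>, CmdCaps:<0xb, 0xb>, PinnedMem:<146>
<#9> Partition:<10*9117-570*10XXXX, hostname, ip>
Active:<0>, OS:<, , >, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<0>
EOF
```

A partition reported as NOT_READY still needs the LPAR-side checks; the sketch only applies the two numeric criteria and does not verify that the hostname or IP is correct.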

If you see the condition in the second example above (and DLPAR is working for other LPARs on this HMC), the next step is to check the RMC status from the LPAR (AIX root access will be needed).


# lssrc -a | grep rsct
ctcas rsct inoperative
ctrmc rsct inoperative
IBM.ERRM rsct_rm inoperative
IBM.HostRM rsct_rm inoperative
IBM.ServiceRM rsct_rm inoperative
IBM.CSMAgentRM rsct_rm inoperative
IBM.DRM rsct_rm inoperative
IBM.AuditRM rsct_rm inoperative
IBM.LPRM rsct_rm inoperative

This example output shows that all the RSCT daemons are inoperative. In many cases, some daemons are active and some are missing. The key daemon for dynamic logical partitioning is IBM.DRM.
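To look at a single daemon rather than reading the whole list, the lssrc output can be filtered. This is a small sketch under the assumption that the subsystem name is the first column and the status is the last column, as in the output above; rsct_state is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report the state of one RSCT subsystem from `lssrc -a` output.
# Usage on the LPAR would be: lssrc -a | grep rsct | rsct_state IBM.DRM
# rsct_state is a hypothetical helper name.
rsct_state() {
    # $1 = subsystem name; prints its last column, or "missing" if not listed.
    awk -v name="$1" '
        $1 == name { print $NF; found = 1 }
        END        { if (!found) print "missing" }
    '
}

# Demo with a few lines from the example output in this document.
rsct_state IBM.DRM <<'EOF'
ctcas rsct inoperative
ctrmc rsct inoperative
IBM.DRM rsct_rm inoperative
EOF
```

A result of "missing" means the subsystem is not listed at all, which is a different condition from "inoperative".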

*** Update 1/13/2011 *** Beginning with csm.client 1.7.1.0, IBM.DRM only becomes active when it is needed. After running the rmcctrl or recfgct commands discussed below, IBM.DRM, if it starts successfully, will only remain active for five to ten minutes before it stops. The best method to check for a good RMC connection from the LPAR is to run "lsrsrc IBM.ManagementServer" AFTER recycling ctrmc with rmcctrl or rebuilding with recfgct. The output will return a "resource" for each HMC or other type of management server, such as a CSM server.


# lsrsrc IBM.ManagementServer
Resource Persistent Attributes for IBM.ManagementServer
resource 1:
Name = "9.3.55.192"
Hostname = "9.3.55.192"
ManagerType = "HMC"
LocalHostname = "9.3.55.166"
ClusterTM = "9078-160"
ClusterSNum = ""
ActivePeerDomain = ""
NodeNameList = {"myhost"}
resource 2:
Name = "9.3.55.193"
Hostname = "9.3.55.193"
ManagerType = "HMC"
LocalHostname = "9.3.55.166"
ClusterTM = "9078-160"
ClusterSNum = ""
ActivePeerDomain = ""
NodeNameList = {"myhost"}
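When several management servers are listed, it can help to reduce the output to one line per resource. The sketch below assumes the `Attribute = "value"` layout shown above; list_managers and has_hmc are hypothetical helper names, not RSCT commands.

```shell
#!/bin/sh
# Sketch: summarize `lsrsrc IBM.ManagementServer` output as
# "<ManagerType> <Hostname>" per resource.
# list_managers and has_hmc are hypothetical helper names.
list_managers() {
    awk '
        function emit() {
            if (seen) print type, host
            type = ""; host = ""; seen = 0
        }
        # A new "resource N:" line closes the previous resource.
        /^[ \t]*resource [0-9]+:/            { emit() }
        $1 == "Hostname"    && $2 == "="     { host = $3; gsub(/"/, "", host); seen = 1 }
        $1 == "ManagerType" && $2 == "="     { type = $3; gsub(/"/, "", type); seen = 1 }
        END                                  { emit() }
    '
}

# Succeeds (exit status 0) if at least one resource has ManagerType "HMC".
has_hmc() {
    list_managers | grep -q '^HMC '
}

# Demo with the first resource from the example output in this document.
list_managers <<'EOF'
resource 1:
Name = "9.3.55.192"
Hostname = "9.3.55.192"
ManagerType = "HMC"
LocalHostname = "9.3.55.166"
EOF
```

On the LPAR, the intended use would be `lsrsrc IBM.ManagementServer | has_hmc && echo "HMC resource present"`.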


To stop and start RMC without erasing the configuration, use the following commands.

# /usr/sbin/rsct/bin/rmcctrl -z
# /usr/sbin/rsct/bin/rmcctrl -A
# /usr/sbin/rsct/bin/rmcctrl -p

Check the daemon states.

# lssrc -a | grep rsct
Is IBM.DRM active now? If so, the problem may have been resolved.
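The stop and start sequence above can be wrapped in a small script so the three steps always run in the same order. This is only a sketch: the run helper, the recycle_rmc function, and the DRY_RUN variable are hypothetical names introduced here, and the dry-run mode exists so the sequence can be reviewed without touching RMC.

```shell
#!/bin/sh
# Sketch: recycle RMC with the three rmcctrl steps from this document.
# run, recycle_rmc and DRY_RUN are hypothetical names for illustration.
RMCCTRL=/usr/sbin/rsct/bin/rmcctrl

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recycle_rmc() {
    run "$RMCCTRL" -z &&   # stop the RMC subsystem and its resource managers
    run "$RMCCTRL" -A &&   # add the RMC subsystem back and start it
    run "$RMCCTRL" -p      # enable remote client connections
}

# Demo: show the commands without executing them.
DRY_RUN=1
recycle_rmc
```

To actually recycle RMC, the function would be called as root on the LPAR with DRY_RUN unset or set to 0.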

Go back to the HMC restricted shell command prompt

# lspartition -dlpar
The partition should now show the correct hostname and IP, Active:<1>, and a nonzero DCaps value such as 0x3f.

These values mean that the partition is capable of a DLPAR operation.

*** Other notes *** An LPAR cloned from a mksysb may still have the
RMC configuration from the mksysb source. In this case, IBM.DRM can be shown as active even when the partition is not DLPAR capable from the HMC.

*** Using the recfgct command ***

recfgct deletes the RMC database, does a discovery, and recreates the RMC configuration.

In many cases, where the LPARs have not been configured for a specific purpose, recfgct may be safe to use on those nodes. There are cases where you would not use recfgct, for example if the LPAR is a CSM Management Server or the LPAR has RMC Virtual Shared Disks (VSDs). VSDs are usually only found in very large GPFS clusters. If you are using VSDs, then these filesets would be installed on your AIX system: rsct.vsd.cmds, rsct.vsd.rvsd, rsct.vsd.vsdd, and rsct.vsd.vsdrm.


# lslpp -L | grep vsd
If there is no output, then you are not using VSDs.

The other rarely used application that can be interrupted by recfgct, but without significant consequences, is CSM, when the node is a CSM management server or a CSM client node. All AIX LPARs should have these filesets:
# lslpp -L | grep csm
csm.client 1.7.0.10 C F Cluster Systems Management
csm.core 1.7.0.10 C F Cluster Systems Management
csm.deploy 1.7.0.10 C F Cluster Systems Management
csm.diagnostics 1.7.0.10 C F Cluster Systems Management
csm.dsh 1.7.0.10 C F Cluster Systems Management Dsh
csm.gui.dcem 1.7.0.10 C F Distributed Command Execution

If you have additional filesets that start with csm, such as csm.server, csm.hpsnm, csm.ll, or csm.gpfs, then you may have an LPAR that is part of a larger CSM cluster. The csm.server fileset should only be installed on a CSM Management Server. The following are a few additional checks you can perform to see if you have a Management Server configured.


# csmconfig -L
If csmconfig is not found, this is not a CSM server.
# lsrsrc IBM.ManagementServer
This lists the resources that manage the LPAR, including the HMC
and/or a CSM server.
Look at the ManagerType field:
ManagerType = "CSM" means this is a CSM node.

If it turns out that the node is a CSM management server, then you would have to re-add all the nodes. If the system is a CSM client node, then you would need to log on to the management server and re-add the node.
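The fileset checks described above can be combined into one pass over the lslpp output. The sketch below is an illustration only: recfgct_warnings is a hypothetical helper name, the list of common CSM filesets is taken from the example output in this document, and the csm.server line in the demo is hypothetical sample data. It does not detect a CSM client node, which still requires the ManagerType check.

```shell
#!/bin/sh
# Sketch: scan `lslpp -L` output for filesets that call for caution
# before running recfgct. recfgct_warnings is a hypothetical helper name.
recfgct_warnings() {
    awk '
        BEGIN {
            # CSM filesets that this document says all AIX LPARs should have.
            n = split("csm.client csm.core csm.deploy csm.diagnostics csm.dsh csm.gui.dcem", base, " ")
            for (i = 1; i <= n; i++) common[base[i]] = 1
        }
        $1 ~ /^rsct\.vsd\./              { print "VSD fileset installed: " $1 }
        $1 ~ /^csm\./ && !($1 in common) { print "extra CSM fileset installed: " $1 }
    '
}

# Demo: two lines from this document plus one hypothetical csm.server line.
recfgct_warnings <<'EOF'
csm.client 1.7.0.10 C F Cluster Systems Management
csm.core 1.7.0.10 C F Cluster Systems Management
csm.server 1.7.0.10 C F Cluster Systems Management
EOF
```

On the LPAR this would be run as `lslpp -L | recfgct_warnings`. No output means none of the filesets named in this document were found.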

That's it for the warnings on recfgct. If you think you might be using VSDs and/or a CSM cluster, but are not sure, then please open a PMR and support can assist you in determining this.


Assuming you have no reason to be concerned about the warnings discussed above, proceed.

# /usr/sbin/rsct/install/bin/recfgct

Wait several minutes

# lssrc -a | grep rsct
If you see IBM.DRM active, then you have probably resolved the issue.

# lsrsrc IBM.ManagementServer

Check whether the output has this entry:
ManagerType = "HMC"

Try the dlpar operation again. If it fails, then you will likely need
to open a software PMR.

The other main reason for a DLPAR failure is that the LPAR has reached its minimum or maximum (for processors or memory).

Note: The partition profile does not give a true picture of the current running configuration. If the profile was edited, but the partition was not shut down to a "not activated" state and then reactivated, the profile edits have not been read.

To check the current running configuration, check the Partition Properties instead of the profile properties. You will see the minimum, maximum, and current values. You cannot add or remove processors or memory beyond these boundaries. The commands to check the running properties from the HMC restricted shell are listed here.

# lssyscfg -r sys -F name
(you need the value of name for use with the -m flag on many HMC commands)

# lshwres -r proc -m <server_name> --level lpar
(this lists the processor settings for each LPAR)
# lshwres -r proc -m <server_name> --level sys
(this lists the processor settings for the entire server)

If you are checking for memory, replace "proc" in the above commands with "mem"
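Once the minimum, current, and maximum values are known, the room left for a DLPAR operation is simple arithmetic. The sketch below shows that calculation; dlpar_headroom is a hypothetical helper name, and the three numbers are entered by hand from the running configuration. It is assumed here, and should be confirmed on your HMC, that the lpar-level lshwres output reports these values in attributes such as curr_min_procs, curr_procs, and curr_max_procs (or curr_min_mem, curr_mem, and curr_max_mem for memory).

```shell
#!/bin/sh
# Sketch: how much can be added or removed before hitting the
# running minimum or maximum. dlpar_headroom is a hypothetical helper.
# Usage: dlpar_headroom MIN CURRENT MAX
dlpar_headroom() {
    # awk is used so that fractional processing units also work.
    awk -v min="$1" -v cur="$2" -v max="$3" 'BEGIN {
        print "can add " (max - cur) ", can remove " (cur - min)
    }'
}

# Demo: an LPAR running with 2 processors, minimum 1, maximum 4.
dlpar_headroom 1 2 4
```

A result of "can add 0" or "can remove 0" means the LPAR has reached its maximum or minimum, and the operation will fail regardless of the RMC state.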

DLPAR can fail for many reasons, and it may be necessary to contact Remote Technical Support. However, the above may solve your problem.


Document information


More support for:

AIX family

Software version:

5.2, 5.3, 6.0, 6.1

Operating system(s):

AIX

Software edition:

Standard

Reference #:

T1010615

Modified date:

2009-11-13
