IBM Support

IV69760: NODE DOWN IN CAA CLUSTER DUE TO CONFIGRM MEMORY LEAK

A fix is available

APAR status

  • Closed as program error.

Error description

  • ***************************************************************
    * USERS AFFECTED:
    * Systems running rsct.core.rmc 3.2.0.0 through 3.2.0.4.
    * This includes AIX 6.1 TL9, AIX 7.1 TL3, and VIOS 2.2.3.
    * Other AIX levels can be affected if RSCT has been updated
    * independently of AIX.
    ***************************************************************
    * PROBLEM DESCRIPTION:
    * Starting in rsct.core.rmc 3.1.5.0 (and continuing into the
    * 3.2.0.0 release), a memory leak in CAA-specific code paths
    * of the IBM.ConfigRM subsystem may lead to library calls
    * failing, which can cause ConfigRM to believe the CAA domain
    * is being shut down, causing it to go through offline
    * processing of the RSCT domain, including stopping cthags.
    * That action is a critical infrastructure loss for PowerHA 7
    * or VIOS SSP, and will lead to node failure (halt in the
    * case of PowerHA, or system crash with VIOS SSP).
    *
    * The leak occurs as long as CAA is active, regardless of
    * what PowerHA or SSP is doing, and only on the node
    * operating as the ConfigRM Group leader.  The GL node
    * can be identified in "lssrc -ls IBM.ConfigRM".
    * A reboot is guaranteed to reset the situation.  Time to
    * failure after a new boot is estimated to be between 6 and 8
    * months, although no existing records of failures in the
    * field still retained the time of the last reboot, so a
    * precise deadline is not known.
    ***************************************************************
    * RECOMMENDATION:
    * The fix for RSCT 3.2.0 is available via RSCT APAR IV69760.
    * The fix for RSCT 3.2.0 will also ship with:
    * AIX 6.1 TL9 SP5, AIX 7.1 TL3 SP5, and VIOS 2.2.3.5.
    * An interim fix for RSCT 3.2.0 is available from either:
    * ftp://aix.software.ibm.com/aix/ifixes/iv69760/
    * https://aix.software.ibm.com/aix/ifixes/iv69760/
    ***************************************************************
    * NOTICE:
    * The interim fix package available in the links above is a
    * bundle including all of these fixes:
    *   IV69760 - ConfigRM memory leak (RSCT 3.2 APAR)
    *   IV69674 - TieBreaker issue causing node reboots
    *   IV71572 - On PowerHA 7, "shutdown -F" may end in panic
    *
    * It supersedes these previous fix packages:
    *   Label       Package                    Addresses
    *   ----------  -------------------------  ------------------
    *   IV66606.3   IV66606.3.150225.epkg.Z    Only IV69760
    *   IV66606.3a  IV66606.3a.150306.epkg.Z   IV69760 & IV71572
    *
    *   Note: The official APAR for the ConfigRM memory leak in
    *         RSCT 3.2 is IV69760; however, the early fix packages
    *         for RSCT 3.2 still used "IV66606" as a reference
    *         (the APAR for RSCT 3.1), because the 3.2 APAR had
    *         not yet been cloned at that time.
    *
    * If any of those fix packages are already installed and the
    * IBM.ConfigRM subsystem is active (lssrc), then no further
    * action is needed, unless the customer wishes to obtain any
    * of the additional fixes above.
    * As far as the memory leak itself goes, those older fixes
    * are fine as long as IBM.ConfigRM is able to run.
    *
    * If you are holding one of those packages but have not
    * yet installed it, you should discard it in favor of the one
    * available in the links above.
    *
    * Any customer who finds ConfigRM is not able to run with
    * their current fix package should contact IBM support for
    * assistance on replacing it, since the absence of
    * IBM.ConfigRM may cause emgr removal checks to fail.
    ***************************************************************
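
    The NOTICE above amounts to a small decision table for sites that
    already hold one of the earlier fix packages. The sketch below
    restates that table in Python as an illustration only. The function
    name, its inputs, and the returned strings are hypothetical and not
    part of any IBM tooling; how the installed labels and the
    IBM.ConfigRM state are collected is left to the reader.

```python
# Labels of the earlier interim fix packages named in the NOTICE.
SUPERSEDED_LABELS = {"IV66606.3", "IV66606.3a"}


def recommend_action(installed_labels, configrm_active):
    """Return a short recommendation that follows the NOTICE text.

    installed_labels: iterable of interim fix labels already installed
                      (hypothetical input, gathered by the caller).
    configrm_active:  True if lssrc shows IBM.ConfigRM as active.
    """
    has_old_fix = bool(SUPERSEDED_LABELS & set(installed_labels))
    if has_old_fix and configrm_active:
        # The older packages already address the leak itself.
        return "no action needed for the leak"
    if has_old_fix and not configrm_active:
        # emgr removal checks may fail without IBM.ConfigRM running.
        return "contact IBM support"
    # No earlier package installed: use the current bundle or APAR fix.
    return "install the current fix"


if __name__ == "__main__":
    print(recommend_action(["IV66606.3a"], configrm_active=True))
```

    A site that also wants the IV69674 or IV71572 fixes would still move
    to the current bundle even in the first case.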
    

Local fix

Problem summary

  • Starting in rsct.core.rmc 3.1.5.0, a slow memory leak in
    IBM.ConfigRM under CAA can lead to a cluster service
    shutdown, which leads to a node failure in both PowerHA v7
    (halt) and VIOS SSP (system panic).
    
    The leak occurs as long as CAA is active, regardless of
    what PowerHA or SSP is doing, and only on the node
    operating as the ConfigRM Group leader.  The GL node
    can be identified in "lssrc -ls IBM.ConfigRM".
    
    A reboot is guaranteed to reset the situation.  Time to
    failure after a new boot is estimated to be between 6 and 8
    months, although no existing records of failures in the
    field still retained the time of the last reboot, so a
    precise deadline is not known.
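
    As a rough aid for checking whether a node falls in the range this
    APAR lists under USERS AFFECTED (rsct.core.rmc 3.2.0.0 through
    3.2.0.4), the following minimal Python sketch compares a fileset
    level string against that range. The function names are
    hypothetical, and the level string is assumed to be supplied by the
    caller, for example copied from "lslpp -L rsct.core.rmc" output.
    Levels in the 3.1 stream are outside this check because they are
    tracked under IV66606.

```python
# Affected range for this APAR, as stated under USERS AFFECTED.
AFFECTED_MIN = (3, 2, 0, 0)
AFFECTED_MAX = (3, 2, 0, 4)


def parse_level(level):
    """Turn a dotted level such as '3.2.0.2' into a tuple of ints."""
    return tuple(int(part) for part in level.strip().split("."))


def in_affected_range(level):
    """True if the rsct.core.rmc level lies within 3.2.0.0 to 3.2.0.4."""
    return AFFECTED_MIN <= parse_level(level) <= AFFECTED_MAX


if __name__ == "__main__":
    for candidate in ("3.2.0.2", "3.2.0.5"):
        print(candidate, in_affected_range(candidate))
```

    Comparing integer tuples rather than raw strings avoids mis-ordering
    levels whose components have more than one digit.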
    

Problem conclusion

  • The leak has been addressed.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IV69760

  • Reported component name

    RSCT/RMC FOR CS

  • Reported component ID

    5765F07AP

  • Reported release

    320

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Submitted date

    2015-02-21

  • Closed date

    2015-02-21

  • Last modified date

    2017-08-02

  • APAR is sysrouted FROM one or more of the following:

    IV66606

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    RSCT/RMC FOR CS

  • Fixed component ID

    5765F07AP

Applicable component levels

  • R320 PSY U876547

       UP17/08/02 I 1000

Document Information

Modified date:
02 August 2017