IBM Support

IV66606: NODE DOWN IN CAA CLUSTER DUE TO CONFIGRM MEMORY LEAK

A fix is available

APAR status

  • Closed as program error.

Error description

  • ***************************************************************
    * USERS AFFECTED:
    * Systems running rsct.core.rmc 3.1.5.0 through 3.1.5.8.
    * This includes AIX 6.1 TL9, AIX 7.1 TL3, and VIOS 2.2.3.
    * Other AIX levels can be affected if RSCT has been updated
    * independently of AIX.
    ***************************************************************
    * PROBLEM DESCRIPTION:
    * Starting in rsct.core.rmc 3.1.5.0, a memory leak in
    * CAA-specific code paths of the IBM.ConfigRM subsystem may
    * cause library calls to fail.  ConfigRM can then conclude
    * that the CAA domain is being shut down and go through
    * offline processing at the RSCT domain layer, including
    * stopping cthags.
    * That action is a critical infrastructure loss for PowerHA 7
    * or VIOS SSP, and will lead to node failure (halt in the
    * case of PowerHA, or system crash with VIOS SSP).
    *
    * The leak occurs as long as CAA is active, regardless of
    * what PowerHA or SSP is doing, and only on the node
    * operating as the ConfigRM group leader (GL).  The GL node
    * can be identified in "lssrc -ls IBM.ConfigRM".
    * A reboot is guaranteed to reset the situation.  Time to
    * failure after a new boot is estimated to be between 6 and 8
    * months, although no existing records of failures in the
    * field still retained the time of the last reboot, so a
    * precise deadline is not known.
    ***************************************************************
    * RECOMMENDATION:
    * The fix for RSCT 3.1.5 is available via RSCT APAR IV66606.
    * An interim fix for RSCT 3.1.5 is available from either:
    * ftp://aix.software.ibm.com/aix/ifixes/iv66606/
    * https://aix.software.ibm.com/aix/ifixes/iv66606/
    ***************************************************************
    * NOTICE:
    * The interim fix package available in the links above is a
    * bundle including all of these fixes:
    *   IV66606 - ConfigRM memory leak
    *   IV69017 - TieBreaker issue causing node reboots
    *   IV41939 - Incorrect TieBreaker type in CAA migration
    *   IV71572 - On PowerHA 7, "shutdown -F" may end in panic
    *
    * It supersedes these previous fix packages:
    *   Label       Package                    Addresses
    *   ----------  -------------------------  ------------------
    *   IV66606.1   IV66606.1.150225.epkg.Z    Only IV66606
    *   IV66606.2   IV66606.2.150225.epkg.Z    Only IV66606
    *   IV66606.2a  IV66606.2a.150306.epkg.Z   IV66606 & IV71572
    *
    * If any of those fix packages are already installed and the
    * IBM.ConfigRM subsystem is active (lssrc), then no further
    * action is needed, unless the customer wishes to obtain any
    * of the additional fixes above.
    * As far as the memory leak itself goes, those older fixes
    * are fine as long as IBM.ConfigRM is able to run.
    *
    * If you are holding one of those older packages but have
    * not yet installed it, you should discard it for the one
    * available in the links above.
    *
    * Customers with the "IV66606.1" package already installed
    * on RSCT 3.1.5.0 or 3.1.5.1 will find that IBM.ConfigRM is
    * unable to run, although this will not have an immediate
    * impact on the cluster.  These customers should contact IBM
    * support for assistance on replacing this package, since
    * the absence of IBM.ConfigRM will cause emgr removal checks
    * for that epkg to fail.
    ***************************************************************
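
The guidance in the box above can be read as a small decision table. The following Python sketch is one way to encode it, not an IBM tool. The function names and the returned strings are hypothetical, the fileset level and installed interim fix label are assumed to have been collected by the administrator beforehand, and the "contact-support" result for a superseded package with an inactive IBM.ConfigRM on levels the APAR does not mention is an assumption, since the APAR only describes that situation for IV66606.1 on RSCT 3.1.5.0 or 3.1.5.1.

```python
# Affected rsct.core.rmc range as stated in the APAR (four-part levels assumed).
AFFECTED_MIN = (3, 1, 5, 0)
AFFECTED_MAX = (3, 1, 5, 8)

# Older interim fix labels that the current bundle supersedes.
SUPERSEDED_LABELS = {"IV66606.1", "IV66606.2", "IV66606.2a"}


def parse_level(level):
    """Turn a fileset level such as '3.1.5.4' into a tuple of ints."""
    return tuple(int(part) for part in level.strip().split("."))


def is_affected(rmc_level):
    """True if the rsct.core.rmc level falls in the range named by the APAR."""
    return AFFECTED_MIN <= parse_level(rmc_level) <= AFFECTED_MAX


def advise(rmc_level, installed_label=None, configrm_active=False):
    """Return a short, hypothetical action keyword for one node."""
    if not is_affected(rmc_level):
        return "not-affected"
    level = parse_level(rmc_level)
    # IV66606.1 on 3.1.5.0 or 3.1.5.1 leaves IBM.ConfigRM unable to run,
    # and emgr removal checks fail, so the APAR directs these to support.
    if installed_label == "IV66606.1" and level in {(3, 1, 5, 0), (3, 1, 5, 1)}:
        return "contact-support"
    if installed_label in SUPERSEDED_LABELS:
        if configrm_active:
            # The leak is covered; the newer bundle only adds other fixes.
            return "no-action-needed"
        # Assumption: the APAR does not cover this case explicitly.
        return "contact-support"
    # No fix installed: use the current bundle (or the APAR fix itself).
    return "install-current-bundle"
```

For example, `advise("3.1.5.3", "IV66606.2a", configrm_active=True)` yields `"no-action-needed"`, while `advise("3.1.5.3")` yields `"install-current-bundle"`.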
    

Local fix

Problem summary

  • Starting in rsct.core.rmc 3.1.5.0, a slow memory leak in
    IBM.ConfigRM under CAA can lead to a cluster service
    shutdown, which causes a node failure in both PowerHA v7
    (halt) and VIOS SSP (system panic).
    
    The leak occurs as long as CAA is active, regardless of
    what PowerHA or SSP is doing, and only on the node
    operating as the ConfigRM group leader (GL).  The GL node
    can be identified in "lssrc -ls IBM.ConfigRM".
    
    A reboot is guaranteed to reset the situation.  Time to
    failure after a new boot is estimated to be between 6 and 8
    months, although no existing records of failures in the
    field still retained the time of the last reboot, so a
    precise deadline is not known.
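
As a rough illustration of the 6 to 8 month estimate above, the following Python sketch projects a risk window from a node's last boot time. It is only arithmetic on the estimate quoted in this APAR. The function name is hypothetical, a month is assumed to be 30 days, and because the APAR states that no precise deadline is known, the result should be treated as a planning aid rather than a prediction.

```python
from datetime import datetime, timedelta

# Assumption: a month is approximated as 30 days; the APAR gives no finer unit.
DAYS_PER_MONTH = 30


def failure_window(boot_time, min_months=6, max_months=8):
    """Return (earliest, latest) estimated failure times after a boot."""
    earliest = boot_time + timedelta(days=min_months * DAYS_PER_MONTH)
    latest = boot_time + timedelta(days=max_months * DAYS_PER_MONTH)
    return earliest, latest


if __name__ == "__main__":
    # Example boot time chosen for illustration only.
    earliest, latest = failure_window(datetime(2015, 1, 1))
    print("Estimated window:", earliest.date(), "to", latest.date())
```

Only the node acting as the ConfigRM group leader accumulates the leak, so the window is meaningful for that node's uptime only.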
    

Problem conclusion

  • The leak has been addressed.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IV66606

  • Reported component name

    RSCT/RMC FOR CS

  • Reported component ID

    5765F07AP

  • Reported release

    315

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Submitted date

    2014-11-05

  • Closed date

    2015-02-21

  • Last modified date

    2016-10-20

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IV69760

Fix information

  • Fixed component name

    RSCT/RMC FOR CS

  • Fixed component ID

    5765F07AP

Applicable component levels

  • R315 PSY U874797

       UP16/10/20 I 1000

PTF to Fileset Mapping

Document Information

Modified date:
20 October 2016