IBM Support

IV80836: NODE PANIC ON POWERHA 7 DUE TO HAGS CLIENT UNRESPONSIVE

A fix is available

APAR status

  • Closed as program error.

Error description

  • ***************************************************************
    * USERS AFFECTED:
    * PowerHA SystemMirror v7 systems running RSCT 3.1.4.0 or
    * higher.  This level was shipped with AIX 6.1 TL8 and 7.1 TL2
    * in 2012, and was also available for download from Fix Central.
    * NOTE: This problem did not begin having a visible impact in
    * the field until AIX 6.1 TL9 and 7.1 TL3 (2013), although the
    * reason for this is not known.  The problem involves a timing
    * factor between the native RSCT code and AIX APIs being used
    * to obtain cluster information, so slight changes to either
    * or both layers could have increased the chances of exposure.
    ***************************************************************
    * PROBLEM DESCRIPTION:
    * CAA (Cluster-Aware AIX) system calls are blocking so that
    * the latest data across all nodes is always reported for
    * any client query.
    *
    * This is in conflict with the RSCT Group Services (hags)
    * expectation of only allowing non-blocking calls in critical
    * code paths, and several design decisions were made in hags
    * when CAA was first being developed on the assumption that
    * all calls would be non-blocking.
    *
    * Visible side-effects of this design conflict did not begin
    * to occur until cluster protection methods were added to
    * the CAA environment in RSCT 3.1.4.0, primarily the daemon
    * monitoring between critical processes in the RSCT stack.
    * Impacts now include hags mistakenly declaring IBM.ConfigRM
    * unresponsive, and RMC voting timeouts resulting in that
    * subsystem being declared unresponsive.  Either situation can
    * cause a node panic to protect application resources because
    * hags believes that the RSCT infrastructure is unreliable.
    *
    * Regardless of which subsystems are affected, the panic will
    * display the panic string "RSCT reboot caused by critical
    * resource protection - Group Services"
    ***************************************************************
    * RECOMMENDATION:
    * An interim fix for the latest AIX levels is available from:
    * https://ibm.biz/PowerHAFixes
    * (Ifixes for older levels can be requested from IBM service
    * on an as-needed basis.)
    ***************************************************************
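
The design conflict described above can be illustrated with a small model. This is only a sketch under stated assumptions: the class, the function and every timing value are hypothetical and are not taken from the RSCT or CAA code. It shows how a monitor that budgets for an immediate reply can wrongly declare a healthy client unresponsive once a blocking query sits on the reply path.

```python
from dataclasses import dataclass


@dataclass
class ClientCheck:
    """One responsiveness check of a monitored client (hypothetical model)."""
    response_timeout: float  # seconds the monitor waits for a reply
    handler_work: float      # seconds of ordinary work before replying
    blocking_query: float    # seconds spent inside a blocking cluster query


def is_declared_unresponsive(check: ClientCheck) -> bool:
    """Return True if the reply would arrive after the monitor's deadline."""
    reply_time = check.handler_work + check.blocking_query
    return reply_time > check.response_timeout


# Design assumption: cluster queries return immediately.
assumed = ClientCheck(response_timeout=5.0, handler_work=1.0, blocking_query=0.0)
# Observed behaviour: the query blocks while data is gathered from all nodes.
actual = ClientCheck(response_timeout=5.0, handler_work=1.0, blocking_query=6.0)

print(is_declared_unresponsive(assumed))  # False
print(is_declared_unresponsive(actual))   # True
```

In the second case the client is healthy but slow to answer, which corresponds to the "mistakenly declaring ... unresponsive" situation in the problem description.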
    

Local fix

Problem summary


Problem conclusion

  • Changing the nature of the CAA queries to make them
    non-blocking would be difficult and could not be accomplished
    any time soon.  Instead, Group Services client timeouts
    are all being recalculated to allow for the necessary
    CAA query timeouts.
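
The idea of recalculating a client timeout so that it allows for blocking queries can be sketched as follows. The additive formula, the function name and the numbers are assumptions made for illustration; the APAR text does not state how the shipped fix computes the new values.

```python
def recalculated_timeout(base_timeout: float,
                         caa_query_timeout: float,
                         queries_in_path: int = 1) -> float:
    """Widen a client timeout to cover blocking queries on the reply path.

    Hypothetical rule: add the worst-case duration of each blocking
    query to the timeout that was sized for non-blocking calls.
    """
    if base_timeout < 0 or caa_query_timeout < 0 or queries_in_path < 0:
        raise ValueError("timeouts and query count must be non-negative")
    return base_timeout + queries_in_path * caa_query_timeout


# A client that replies after 7 seconds because one query blocked for 6.
reply_time = 7.0
old_timeout = 5.0
new_timeout = recalculated_timeout(old_timeout, caa_query_timeout=6.0)

print(reply_time > old_timeout)  # True: declared unresponsive before the change
print(reply_time > new_timeout)  # False: tolerated after the change
```

The trade-off in this approach is that a client that has truly hung is detected later, because the monitor now waits for the full query timeout before acting.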
    

Temporary fix

  • *********
    * HIPER *
    *********
    

Comments

  • AIX 6100-09 - use RSCT APAR IV80836
    AIX 7100-03 - use RSCT APAR IV80836
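
The affected range (RSCT 3.1.4.0 or higher) and the AIX-level-to-APAR mapping listed in the Comments section can be expressed as a small lookup. This is a sketch only: the helper names are hypothetical, and the check reports whether a level falls in the affected range, not whether a fix has already been installed.

```python
from typing import Optional, Tuple

# From the Comments section of this APAR.
APAR_BY_AIX_LEVEL = {
    "6100-09": "IV80836",
    "7100-03": "IV80836",
}


def parse_level(level: str) -> Tuple[int, ...]:
    """Turn a dotted level such as '3.1.4.0' into a comparable tuple."""
    parts = [int(p) for p in level.strip().split(".")]
    parts += [0] * (4 - len(parts))  # pad short forms such as "3.1.4"
    return tuple(parts)


AFFECTED_FROM = parse_level("3.1.4.0")


def in_affected_range(rsct_level: str) -> bool:
    """True if the RSCT level is 3.1.4.0 or higher."""
    return parse_level(rsct_level) >= AFFECTED_FROM


def apar_for(aix_level: str) -> Optional[str]:
    """Return the APAR listed for an AIX level, or None if not listed here."""
    return APAR_BY_AIX_LEVEL.get(aix_level)


print(in_affected_range("3.1.2.0"))  # False
print(in_affected_range("3.1.5.3"))  # True
print(apar_for("7100-03"))           # IV80836
```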
    

APAR Information

  • APAR number

    IV80836

  • Reported component name

    RSCT/RMC FOR CS

  • Reported component ID

    5765F07AP

  • Reported release

    320

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Submitted date

    2016-01-26

  • Closed date

    2016-01-26

  • Last modified date

    2019-02-01

  • APAR is sysrouted FROM one or more of the following:

    IV64642

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    RSCT/RMC FOR CS

  • Fixed component ID

    5765F07AP

Applicable component levels

  • R320 PSY U881677

       UP19/02/01 I 1000

Document Information

Modified date:
01 February 2019