IBM Support

IJ02843: POWERHA NODE HALT DURING IP CHANGES

A fix is available


APAR status

  • Closed as program error.

Error description

  • **************************************************************
    * USERS AFFECTED:
    * Systems running PowerHA SystemMirror on the
    * AIX 7100-05 Technology Level or
    * AIX 7200-02 Technology Level with
    * rsct.basic.rte at 3.2.3.0.
    **************************************************************
    * PROBLEM DESCRIPTION:
    * An improvement in obtaining adapter state information from
    * AHAFS event responses introduced some errors in handling
    * internal tracking of monitored IP addresses.
    *
    * This can result in a core dump of the hagsd process any time
    * an IP change occurs at the OS layer.  The failure therefore
    * cannot happen while a cluster is running stably with no
    * changes occurring, but it is a risk during startup, shutdown,
    * or failover, and cannot be predicted beyond that.
    *
    * Two possible core dump stacks which have been seen as a
    * result of this issue are as follows:
    *
    *    (dbx) where
    *    SGroup.SGroup::RemoveRealProvider()(),
    *       line 4516 in SGroup.C
    *    SGroup.SGroup::RemoveRealProvider()(),
    *       line 3193 in SGroup.C
    *    SFailureProtocol::RemoveChangingMembersFromGroup()(),
    *       line 1169 in SFailureProtocol.C
    *    SFailureProtocol::ApprovedPostBroadcast()(),
    *       line 552 in SFailureProtocol.C
    *    SVProtocol::ExecutePostBroadcast()(),
    *       line 457 in SVProtocol.C
    *    SFailureProtocol::ExecutePostBroadcast()(),
    *       line 500 in SFailureProtocol.C
    *    SVProtocol::Approved()(), line 597 in SVProtocol.C
    *    SVProtocol::Execute()(), line 397 in SVProtocol.C
    *    SFailureProtocol::Execute()(),
    *       line 462 in SFailureProtocol.C
    *    SProtocolMgr::ExecuteThisProtocol()(),
    *       line 1939 in SProtocolMgr.C
    *    SGroupAdaptMbr::ExecutePendingProtocols()(),
    *       line 1237 in SGroupAdaptMbr.C
    *    SDelayedProtocol::main()(),
    *       line 106 in SDelayedProtocol.h
    *    executeEvent()(), line 306 in DelayedJob.C
    *    DelayedJob::dispatchAll()(), line 431 in DelayedJob.C
    *    DispatchControl::Dispatcher()(),
    *       line 681 in DispatchControl.C
    *    main(), line 875 in pgsd.C
    *
    *    (dbx) where
    *    FindHashLoc()(), line 131 in hash.C
    *    Hash_insert()(), line 270 in hash.C
    *    hb_caa_update_global_tbl()(),
    *       line 3073 in hb_communication.C
    *    AHAFSConfigurationHandler::
    *    update_global_table_and_construct_events()(),
    *       line 110 in CAA_AHAFSConfigurationHandler.C
    *    unnamed block in AHAFSIPChangeEventHandler::handler()(),
    *       line 188 in CAA_AHAFSIPChangeEventHandler.C
    *    AHAFSIPChangeEventHandler::handler()(),
    *       line 188 in CAA_AHAFSIPChangeEventHandler.C
    *    unnamed block in AHAFSHandler::dispatch()(),
    *       line 189 in CAA_AHAFSHandler.C
    *    AHAFSHandler::dispatch()(),
    *       line 189 in CAA_AHAFSHandler.C
    *    hb_get_event_message(),
    *       line 1078 in hb_communication.C
    *    PMRun()(), line 1436 in PMClient.C
    *    PMSocket::HandleInput()(), line 68 in PMSocket.C
    *    DispatchControl::HandleInput()(),
    *       line 1211 in DispatchControl.C
    *    DispatchControl::Dispatcher()(),
    *       line 976 in DispatchControl.C
    *    main(), line 875 in pgsd.C
    *
    **************************************************************
    * RECOMMENDATION:
    * Install APAR IJ02843.
    * Prior to fix availability, an interim fix is available from
    * either of the following locations:
    * ftp://aix.software.ibm.com/aix/ifixes/ij02843/
    * https://aix.software.ibm.com/aix/ifixes/ij02843/
    * Installation of the ifix does not require a reboot; however,
    * PowerHA must be stopped on the node before the ifix is
    * applied.  Resources must be moved to another node or taken
    * offline (the fix won't install with unmanaged resources).
    **************************************************************
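
The USERS AFFECTED criteria above can be expressed as a small check. The following is a minimal sketch, not an IBM-supplied tool: it assumes the operating system level string is obtained from `oslevel -s` and the fileset level from `lslpp -L rsct.basic.rte`, and the helper names (`technology_level`, `is_exposed`) are hypothetical. It matches only the exact levels named in this APAR and does not detect whether the interim fix or the fixing PTF is already installed.

```python
import re

# Levels copied from the USERS AFFECTED text of this APAR.
AFFECTED_TECHNOLOGY_LEVELS = {"7100-05", "7200-02"}
AFFECTED_RSCT_LEVEL = "3.2.3.0"


def technology_level(oslevel_output: str) -> str:
    """Return the release and technology level prefix (for example '7200-02')
    from a string in the format printed by `oslevel -s`."""
    match = re.match(r"(\d{4}-\d{2})", oslevel_output.strip())
    if match is None:
        raise ValueError(f"unrecognized oslevel string: {oslevel_output!r}")
    return match.group(1)


def is_exposed(oslevel_output: str, rsct_basic_level: str) -> bool:
    """Return True when both the AIX technology level and the rsct.basic.rte
    level match the levels listed in this APAR.

    This is an assumption-laden screen only: it does not check whether
    PowerHA is installed or whether a fix has already been applied.
    """
    tl = technology_level(oslevel_output)
    return (
        tl in AFFECTED_TECHNOLOGY_LEVELS
        and rsct_basic_level.strip() == AFFECTED_RSCT_LEVEL
    )


if __name__ == "__main__":
    # Illustrative input strings, not taken from a real system.
    print(is_exposed("7200-02-01-1731", "3.2.3.0"))
    print(is_exposed("7100-04-05-1720", "3.2.3.0"))
```

A node that matches should be treated as a candidate for the fix described in the RECOMMENDATION section; a node that does not match may still warrant review if its levels differ from those this sketch compares.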
    

Local fix

Problem summary

  • A flaw in the handling of monitored IP changes, introduced with
    adapter state improvements in RSCT 3.2.3.0, has led to the
    risk of a hagsd core dump in a couple of code paths.
    

Problem conclusion

  • Transition of IP lists during a monitoring change has been
    corrected.
    

Temporary fix

  •   *********
      * HIPER *
      *********
    

Comments

APAR Information

  • APAR number

    IJ02843

  • Reported component name

    RSCT FOR AIX

  • Reported component ID

    5765F07AP

  • Reported release

    323

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Submitted date

    2017-12-22

  • Closed date

    2018-02-14

  • Last modified date

    2021-09-02

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    U881668

Fix information

  • Fixed component name

    RSCT FOR AIX

  • Fixed component ID

    5765F07AP

Applicable component levels

  • R323 PSY U889770

       UP21/09/02 I 1000

PTF to Fileset Mapping


Document Information

Modified date:
03 September 2021