IBM Support

IV82494: CAA: A NODE MAY NOT SEE A REBOOTED NODE AS UP APPLIES TO AIX 7100-03

APAR status

  • Closed as program error.

Error description

  • **************************************************************
    * USERS AFFECTED:
      * Systems running the AIX 7100-03 Technology Level
      * with bos.cluster.rte below the 7.1.3.48 level.
      **************************************************************
      * PROBLEM DESCRIPTION:
      *     After reboot of one node, the CAA cluster state
      *     may be inconsistent in a cluster using multicast
      *     communication mode, if there is an issue with
      *     multicast communication, but unicast communication
      *     is working.
      *     'lscluster -m' of node1:
      *     ------------------------
      *     Calling node query for all nodes...
      *     Node query number of nodes examined: 2
      *
      *             Node name: node1
      *             Cluster shorthand id for node: 1
      *             ...
      *             State of node: UP  NODE_LOCAL
      *             ...
      *             Node name: node2
      *             Cluster shorthand id for node: 2
      *             ...
      *             State of node: DOWN
      *             ...
      *     'lscluster -m' of node2:
      *     ------------------------
      *     Calling node query for all nodes...
      *     Node query number of nodes examined: 2
      *
      *             Node name: node1
      *             Cluster shorthand id for node: 1
      *             ...
      *             State of node: UP
      *             ...
      *             Node name: node2
      *             Cluster shorthand id for node: 2
      *             ...
      *             State of node: UP  NODE_LOCAL
      *             ...
      *     In the above example, node2 was the node that was
      *     rebooted last.
      *     syslog.caa of node1 looks like:
      *     -------------------------------
      *     ...
      *     <timestamp> node1 caa:info unix: kcluster_lock.c
      *      count_active_nodes      200      num_nodes_active 2
      *      *up_node_cnt 1 db_node_cnt 1
      *     <timestamp> node1  caa:err|error unix:
      *      kcluster_clusterwide.c
      *      kcluster_clusterwide    841     clusterwide query
      *      node timeout: cmd = 0x20, from node id = 2
      *     ...
      *     <timestamp> node1 caa:err|error unix:
      *      kcluster_clusterwide.c
      *      kcluster_clusterwide    841     clusterwide query
      *      node timeout: cmd = 0x20, from node id = 2
      *     ...
      *     syslog.caa of node2 looks like:
      *     -------------------------------
      *     ...
      *     <timestamp> node2  caa:info unix: kcluster_syscalls.c
      *      _xcluster_create        2614
      *      Clusterwide locking services are starting.
      *     ...
      *     <timestamp> node2 caa:info unix: kcluster_lock.c
      *      count_active_nodes      200      num_nodes_active 2
      *      *up_node_cnt 0 db_node_cnt 1
      *     <timestamp> node2 caa:info unix: kcluster_lock.c
      *      wait_on_node_bringup    255     All nodes are active.
      *     ...
      *     <timestamp> node2  caa:info unix: kcluster_lock.c
      *      count_active_nodes      200      num_nodes_active 2
      *      *up_node_cnt 0 db_node_cnt 1
      *     <timestamp> node2  caa:info unix: kcluster_lock.c
      *      xcluster_lock   607     xcluster_lock: lock
      *      2 acquired, num_nodes_active: 2
      *     <timestamp> node2  caa:info unix: kcluster_lock.c
      *      xcluster_lock   608     xcluster_lock: nodes
      *      which responded: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0
      *     ...
      *     <timestamp> node2 caa:info cluster[2490836]:
      *      caa_config.c
      *      cl_th_sock      5317    258     Node node1
      *      is DOWN, and we are not trying to JOIN it or STOP it.
      *      Skipping.
      *     ...
      **************************************************************
      * RECOMMENDATION:
      * Install APAR IV82494.
      **************************************************************
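
The symptom described above is that the two nodes disagree about the state of the rebooted node. A minimal sketch of how that disagreement could be detected from saved 'lscluster -m' output is shown here. It assumes the output contains "Node name:" and "State of node:" lines as in the example above; the function names and the way the output is collected are hypothetical and not part of CAA.

```python
import re


def parse_node_states(lscluster_output):
    """Return {node name: state} from 'lscluster -m' text.

    Only the first word of the state is kept (for example UP or DOWN),
    so a trailing NODE_LOCAL flag is ignored.
    """
    states = {}
    current = None
    for raw in lscluster_output.splitlines():
        # Tolerate the leading '*' used in the APAR's boxed listing.
        line = raw.strip().lstrip("*").strip()
        match = re.match(r"Node name:\s*(\S+)", line)
        if match:
            current = match.group(1)
            continue
        match = re.match(r"State of node:\s*(\S+)", line)
        if match and current is not None:
            states[current] = match.group(1)
    return states


def disagreements(views):
    """views maps a reporting node to its parsed {node: state} view.

    Returns {node: {reporter: state}} for every node whose state is
    not the same in all views.
    """
    all_nodes = set()
    for view in views.values():
        all_nodes.update(view)
    result = {}
    for node in sorted(all_nodes):
        seen = {rep: view[node] for rep, view in views.items() if node in view}
        if len(set(seen.values())) > 1:
            result[node] = seen
    return result


# Sample data taken from the listings in this APAR.
NODE1_OUTPUT = """
Node name: node1
State of node: UP  NODE_LOCAL
Node name: node2
State of node: DOWN
"""
NODE2_OUTPUT = """
Node name: node1
State of node: UP
Node name: node2
State of node: UP  NODE_LOCAL
"""

views = {
    "node1": parse_node_states(NODE1_OUTPUT),
    "node2": parse_node_states(NODE2_OUTPUT),
}
conflicts = disagreements(views)
```

With the sample data, conflicts contains a single entry for node2, which node1 reports as DOWN while node2 reports itself as UP.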
    

Local fix

  • Use unicast communication mode.
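
    Whether the workaround is needed depends on the installed
    bos.cluster.rte level. The following is a minimal sketch of that
    comparison. It assumes the level string has already been obtained
    (for example from 'lslpp -L bos.cluster.rte') and that filesets
    of the 7100-03 Technology Level carry levels of the form 7.1.3.x;
    the function names are hypothetical.

```python
def parse_level(level):
    """Turn a fileset level such as '7.1.3.46' into a tuple of ints."""
    return tuple(int(part) for part in level.strip().split("."))


# First bos.cluster.rte level that contains the fix, per this APAR.
FIXED_LEVEL = parse_level("7.1.3.48")


def is_affected(installed_level):
    """True if the level is on 7.1.3.x (assumed to mean TL 7100-03)
    and below the fixed level."""
    installed = parse_level(installed_level)
    same_tl = installed[:3] == FIXED_LEVEL[:3]
    return same_tl and installed < FIXED_LEVEL
```

    The levels are compared as tuples of integers rather than as
    strings, so that 7.1.3.5 is correctly treated as lower than
    7.1.3.48.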
    

Problem summary

  •   **************************************************************
      * USERS AFFECTED:
      * Systems running the AIX 7100-03 Technology Level
      * with bos.cluster.rte below the 7.1.3.48 level.
      **************************************************************
      * PROBLEM DESCRIPTION:
      *     After reboot of one node, the CAA cluster state
      *     may be inconsistent in a cluster using multicast
      *     communication mode, if there is an issue with
      *     multicast communication, but unicast communication
      *     is working.
      **************************************************************
    

Problem conclusion

  • If a certain number of nodes is known to be heartbeating to
    the repository, do not attempt to acquire clusterwide locks
    until the same number of nodes is gossiping.
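
    The rule above can be illustrated with a small sketch. This is
    not the CAA kernel code; the function names and the idea of
    sampling the gossip count over time are assumptions made only
    for illustration.

```python
def may_acquire_clusterwide_lock(repo_heartbeating, gossiping):
    """Allow a lock attempt only when the number of gossiping nodes
    equals the number of nodes heartbeating to the repository."""
    return gossiping == repo_heartbeating


def first_safe_attempt(repo_heartbeating, gossip_samples):
    """Return the index of the first sample at which a lock attempt
    would be allowed, or None if no sample qualifies."""
    for index, gossiping in enumerate(gossip_samples):
        if may_acquire_clusterwide_lock(repo_heartbeating, gossiping):
            return index
    return None


# Two nodes heartbeat to the repository, but after the reboot only
# one node is gossiping at first, so the lock attempt is deferred.
attempt_index = first_safe_attempt(2, [1, 1, 2])
```

    In this example the first two samples are rejected and the
    attempt is allowed at the third sample, once both nodes are
    gossiping.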
    

Temporary fix

  •   *********
      * HIPER *
      *********
    

Comments

APAR Information

  • APAR number

    IV82494

  • Reported component name

    AIX V7.1

  • Reported component ID

    5765H4000

  • Reported release

    710

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2016-03-09

  • Closed date

    2016-03-10

  • Last modified date

    2016-11-09

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IV82624 IV82627 IV82651 IV82781

Fix information

  • Fixed component name

    AIX V7.1

  • Fixed component ID

    5765H4000

Applicable component levels

  • R710 PSY U869347

       UP16/06/21 I 1000

Document Information

Modified date:
19 April 2022