IBM Support

On Spectrum Scale, when any of the quorum nodes are under high load, the cluster manager may unexpectedly lose its membership from the cluster resulting in unexpected cluster manager elections

Flashes (Alerts)


Abstract

On Spectrum Scale, when any quorum node is under high load, the cluster manager may unexpectedly lose its membership in the cluster, resulting in unexpected cluster manager elections.

Content

Issue:

On Spectrum Scale, when any quorum node is under high memory load or high I/O load on its local disk, the cluster manager may unexpectedly lose its membership in the cluster, resulting in unexpected cluster manager elections.

This has been seen most often on the Elastic Storage Server (ESS) where the management server (EMS) node is overloaded, but this may also happen on non-ESS systems.

Quorum nodes that are not the current cluster manager may report an overdue lease with unsuccessful replies to their lease requests. In addition, one of those nodes may trigger an unexpected cluster manager election. Both events are recorded in the GPFS log (/var/adm/ras/mmfs.log.latest). For example:

gssio2: [I] Lease overdue with unsuccessful replies to lease requests. Probing cluster testcluster.ibm.com
gssio2: [D] Running election ...
gssio2: [I] Node 10.1.1.2 (gssio2) is now the Group Leader.
gssio3: [I] Lease overdue with unsuccessful replies to lease requests. Probing cluster testcluster.ibm.com
gssio4: [I] Lease overdue with unsuccessful replies to lease requests. Probing cluster testcluster.ibm.com

The former cluster manager node reports losing its role. For example:

gssio1: [N] Disk lease period expired 0.010 seconds ago in cluster testcluster.ibm.com. Attempting to reacquire the lease.
gssio1: [E] ccrCheckAndRespondChallenge: challenge failed ccrQVersion 498898 err 815 (Local node is no longer leader)
gssio1: [E] Lost membership in cluster testcluster.ibm.com. Unmounting file systems.
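
The symptoms above can be searched for mechanically. The following is a minimal sketch, not an IBM tool: it matches message fragments copied from the log examples in this flash, and the category names and the `classify` helper are illustrative assumptions. Exact message wording may differ between releases.

```python
import re

# Message fragments taken from the log examples in this flash.
# The category names on the left are illustrative, not GPFS terminology.
PATTERNS = {
    "lease_overdue": re.compile(r"Lease overdue with unsuccessful replies to lease requests"),
    "election": re.compile(r"Running election"),
    "new_leader": re.compile(r"is now the Group Leader"),
    "lease_expired": re.compile(r"Disk lease period expired"),
    "lost_membership": re.compile(r"Lost membership in cluster"),
}


def classify(lines):
    """Return {category: [matching lines]} for the symptoms described above."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


# Sample input: the log lines quoted in this flash.
SAMPLE_LINES = [
    "gssio2: [I] Lease overdue with unsuccessful replies to lease requests. Probing cluster testcluster.ibm.com",
    "gssio2: [D] Running election ...",
    "gssio2: [I] Node 10.1.1.2 (gssio2) is now the Group Leader.",
    "gssio3: [I] Lease overdue with unsuccessful replies to lease requests. Probing cluster testcluster.ibm.com",
    "gssio1: [N] Disk lease period expired 0.010 seconds ago in cluster testcluster.ibm.com. Attempting to reacquire the lease.",
    "gssio1: [E] Lost membership in cluster testcluster.ibm.com. Unmounting file systems.",
]

for category, matched in classify(SAMPLE_LINES).items():
    print(category, len(matched))
```

To scan a real log, pass an open file object for /var/adm/ras/mmfs.log.latest to `classify` instead of the sample list.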

A possible indication of local disk stress can be seen in the GPFS log (/var/adm/ras/). For example:

gssio1: [N] PFD store (DIO): /var/mmfs/ccr/ccr.paxos.1 took 22.6 seconds
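
Slow store messages of this form can be extracted with a short script. This is a sketch under stated assumptions: the regular expression is derived only from the single example line above, and the 5-second threshold is a hypothetical value chosen for illustration; this flash does not define what duration counts as slow.

```python
import re

# Pattern derived from the one example line in this flash.
SLOW_STORE = re.compile(
    r"PFD store \(DIO\): (?P<path>\S+) took (?P<seconds>\d+(?:\.\d+)?) seconds"
)


def slow_ccr_stores(lines, threshold_seconds=5.0):
    """Return (path, seconds) for each store that took at least threshold_seconds.

    The default threshold is an assumption for illustration only.
    """
    found = []
    for line in lines:
        match = SLOW_STORE.search(line)
        if match and float(match["seconds"]) >= threshold_seconds:
            found.append((match["path"], float(match["seconds"])))
    return found


example = ["gssio1: [N] PFD store (DIO): /var/mmfs/ccr/ccr.paxos.1 took 22.6 seconds"]
print(slow_ccr_stores(example))
```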

A possible indication of node resources (CPU/RAM) running tight may be seen in /var/log/messages:

sshd[2695]: error: fork: Cannot allocate memory
ems1 crond[2361]: (CRON) CAN'T FORK (do_command): Cannot allocate memory
ems1 crond[2361]: (CRON) CAN'T FORK (do_command): Cannot allocate memory

Corrective Action:

IBM Spectrum Scale:

IBM Spectrum Scale V5.0: Upgrade to IBM Spectrum Scale V5.0.0.2 or later
IBM Spectrum Scale V4.2: Upgrade to IBM Spectrum Scale V4.2.3.8 or later

IBM Spectrum Scale V5.0.0.2 is available at: https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=5.0.0&platform=All&function=all

IBM Spectrum Scale V4.2.3.8 is available at: https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=4.2.3&platform=All&function=all
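
Whether an installed IBM Spectrum Scale level already meets the minimum levels named above can be checked with a simple comparison. The sketch below is illustrative only: `parse_level` and `has_fix` are hypothetical helper names, the level string is assumed to be dotted numbers such as 4.2.3.7, and releases other than V4.2 and V5.0 return None because this flash does not address them.

```python
# Minimum fixed levels stated in this flash, keyed by release stream.
FIXED_LEVELS = {
    (5, 0): (5, 0, 0, 2),
    (4, 2): (4, 2, 3, 8),
}


def parse_level(text):
    """Turn a level string such as 'V4.2.3.7' or '5.0.0.2' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("Vv").split("."))


def has_fix(level_text):
    """Return True or False for V4.2 and V5.0 levels, None for other releases."""
    level = parse_level(level_text)
    fixed = FIXED_LEVELS.get(level[:2])
    if fixed is None:
        return None  # release stream not covered by this flash
    return level >= fixed


for candidate in ("4.2.3.7", "4.2.3.8", "5.0.0.1", "5.0.0.2"):
    print(candidate, has_fix(candidate))
```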

Elastic Storage Server:

Upgrade to ESS 5.3.0.1, 5.2.2.1 or later.

ESS 5.3.0.1 is available at:
https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=5.3.0&platform=All&function=all

ESS 5.2.2.1 is available at: https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=5.2.0&platform=All&function=all

Workarounds:

Until you can upgrade to IBM Spectrum Scale V5.0.0.2 or later, V4.2.3.8 or later, or ESS 5.3.0.1 or later, move the quorum designation from the quorum node under high load to another node, provided enough spare nodes are available. Use the mmchnode command with the --nonquorum and --quorum options.

Declaring the quorum node under high load as a non-quorum node should only be done if this results in an odd number of quorum nodes, e.g. from 7 to 5 or 5 to 3.
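
The workaround can be planned before any command is run. The following sketch only builds the mmchnode argument lists and checks the odd-number rule; it does not execute anything. The `plan_quorum_change` helper, the node name client1, and the order of the two commands are assumptions made for illustration, so verify the sequence against the mmchnode documentation for your release.

```python
def plan_quorum_change(quorum_nodes, loaded_node, replacement=None):
    """Return (resulting quorum nodes, mmchnode argument lists) without running anything.

    With a replacement, the quorum designation moves to that node.
    Without one, the loaded node is only dropped if an odd number of
    quorum nodes remains, as this flash requires.
    """
    if loaded_node not in quorum_nodes:
        raise ValueError(f"{loaded_node} is not a quorum node")
    remaining = [node for node in quorum_nodes if node != loaded_node]
    commands = []
    if replacement is not None:
        if replacement in quorum_nodes:
            raise ValueError(f"{replacement} is already a quorum node")
        # Assumed order: designate the replacement first, then demote the loaded node.
        commands.append(["mmchnode", "--quorum", "-N", replacement])
        remaining.append(replacement)
    elif len(remaining) % 2 == 0:
        raise ValueError("removal would leave an even number of quorum nodes")
    commands.append(["mmchnode", "--nonquorum", "-N", loaded_node])
    return remaining, commands


# Node names from the log examples in this flash; client1 is hypothetical.
current = ["ems1", "gssio1", "gssio2", "gssio3", "gssio4"]
remaining, commands = plan_quorum_change(current, "ems1", replacement="client1")
for command in commands:
    print(" ".join(command))
```

Keeping the plan separate from execution lets an administrator review the resulting quorum node list before changing the cluster.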

Follow the recommendations for quorum node selection provided here:
https://ibm.biz/BdZ3VX
https://ibm.biz/BdZ3VH

When adding Spectrum Scale nodes to ESS, follow the recommendations provided here:
https://ibm.biz/BdZ3Vr

Note: You should also determine why the node's local disk, CPU, or RAM is under stress and remediate the cause.

If you are unable to apply the latest level of service, contact IBM Service for an efix:
- For IBM Spectrum Scale V5.0, reference APAR IJ04129
- For IBM Spectrum Scale V4.2, reference APAR IJ04663

To contact IBM Service, see http://www.ibm.com/planetwide/


Document Information

Modified date:
26 September 2022

UID

ibm10713707