Managing failover outage events

Typically, a failover results from a node outage, but other system or user actions can also cause a failover.

A problem can affect only a single cluster resource group (CRG), causing a failover for that CRG but not for any other CRG.

Four categories of outages can occur within a cluster. Some of these events are true failover situations where the node is experiencing an outage, while others require investigation to determine the cause and the appropriate response. The following tables describe each category, the outage events that fall into it, and the recovery action to take.

Category 1 outages: Node outage causing failover

Node-level failover occurs, causing the following to happen:

For each CRG, the primary node is marked Inactive and made the last backup node.
The node that was the first backup becomes the new primary node.

Failovers happen in this order:

All device CRGs
All data CRGs
All application CRGs

Notes:

If a failover for any CRG detects that none of the backup nodes are active, the status of the CRG is set to Indoubt and the CRG recovery domain does not change.
If cluster resource services fails as a whole, the resources (CRGs) that it manages go through the failover process.
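The node-level failover rules above can be sketched in a few lines of Python. This is only an illustrative model, not how cluster resource services is implemented: the `Member` and `Crg` classes and the function names are hypothetical, and the sketch assumes that the backup nodes in each recovery domain are already ordered with active nodes first.

```python
from dataclasses import dataclass

@dataclass
class Member:
    node: str
    active: bool = True

@dataclass
class Crg:
    name: str
    crg_type: str            # "device", "data", or "application"
    domain: list             # index 0 is the primary; the rest are backups in order
    status: str = "Active"

# Order in which CRG types fail over, as listed above.
FAILOVER_ORDER = {"device": 0, "data": 1, "application": 2}

def fail_over(crg, failed_node):
    """Apply a failover to one CRG; return True if failed_node was its primary."""
    primary, backups = crg.domain[0], crg.domain[1:]
    if primary.node != failed_node:
        return False
    if not any(m.active for m in backups):
        # No active backup: status becomes Indoubt, recovery domain is unchanged.
        crg.status = "Indoubt"
        return True
    # Assumption: active backups are listed first, so the first backup is active.
    primary.active = False
    crg.domain = backups + [primary]   # old primary becomes the last backup
    return True

def node_outage(crgs, failed_node):
    """Fail over device CRGs first, then data CRGs, then application CRGs."""
    processed = []
    for crg in sorted(crgs, key=lambda c: FAILOVER_ORDER[c.crg_type]):
        if fail_over(crg, failed_node):
            processed.append(crg.name)
    return processed
```

Calling `node_outage(crgs, "A")` on a hypothetical list of CRGs whose primary is node A returns the CRG names in the order in which they were processed.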

Table 1. Category 1 outages: Node outage causing failover
Failover outage event
ENDTCP (IMMED or CNTRLD with a time limit) is issued.
ENDSYS (IMMED or CNTRLD) is issued.
PWRDWNSYS (IMMED or CNTRLD) is issued.
Initial program load (IPL) button is pressed while cluster resource services is active on the system.
End Cluster Node (API or command) is called on the primary node in the CRG recovery domain.
Remove Cluster Node (API or command) is called on the primary node in the CRG recovery domain.
HMC delayed power down of the partition or panel option 7 is issued.
ENDSBS QSYSWRK (IMMED or CNTRLD) is issued.

Category 2 outages: Node outage causing partition or failover

These outages cause either a partition or a failover, depending on whether advanced node failure detection is configured. The columns in Table 2 show the result for each configuration. If advanced node failure detection is configured, a failover occurs in most cases and the Category 1 outage information applies. If advanced node failure detection is not configured, a partition occurs and the following applies:

The status of the nodes not communicating by cluster messaging is set to Partition status. See Cluster partition for information about partitions.
All nodes in the cluster partition that do not have the primary node as a member of the partition will end the active cluster resource group.

Notes:

If a node actually failed but is detected only as a partition problem, and the failed node was the primary node, you lose all the data and application services on that node and no automatic failover is started.
You must either declare the node as failed or bring the node back up and start clustering on that node again. See Change partitioned nodes to failed for more information.
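As a rough illustration of the partition behavior described above, the following Python sketch shows how each partition sees the other nodes and which partitions keep a CRG active. The function names and the "Ended" label are hypothetical and used only for this example.

```python
def node_statuses(local_partition, all_nodes):
    """Node status as seen from inside local_partition.

    Nodes that cannot be reached by cluster messaging are set to Partition status.
    """
    return {
        node: "Active" if node in local_partition else "Partition"
        for node in sorted(all_nodes)
    }

def crg_state_by_partition(partitions, primary_node):
    """For one active CRG, report what happens to it in each partition.

    A partition that does not contain the primary node ends the active CRG.
    "Ended" is a label chosen for this sketch, not a documented status value.
    """
    return {
        frozenset(p): "Active" if primary_node in p else "Ended"
        for p in partitions
    }
```

For example, with partitions {A, B} and {C} and a CRG whose primary is node C, the CRG stays active only in the partition that contains C. No failover is started in the other partition.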

Table 2. Category 2 outages: Node outage causing partition
Failover outage event	No advanced node failure detection	HMC	VIOS
CEC hardware outage (CPU, for example) occurs.	partition	failover	partition or failover
Operating system software machine check occurs.	partition	failover	failover
HMC immediate power off or panel option 8 is issued.	partition	failover	failover
HMC partition restart or panel option 3 is issued.	partition	failover	failover
Power loss to the CEC occurs.	partition	partition	partition
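Table 2 can also be read as a simple lookup, which might be useful in a runbook script. The following Python sketch only restates the table rows; the dictionary and function names are hypothetical, and the keys are shortened forms of the event descriptions.

```python
# Column names as they appear in the Table 2 header.
NO_DETECTION = "No advanced node failure detection"
HMC = "HMC"
VIOS = "VIOS"

# Each row of Table 2: outage event -> result per detection configuration.
TABLE_2 = {
    "CEC hardware outage": {
        NO_DETECTION: "partition", HMC: "failover", VIOS: "partition or failover"},
    "Operating system software machine check": {
        NO_DETECTION: "partition", HMC: "failover", VIOS: "failover"},
    "HMC immediate power off or panel option 8": {
        NO_DETECTION: "partition", HMC: "failover", VIOS: "failover"},
    "HMC partition restart or panel option 3": {
        NO_DETECTION: "partition", HMC: "failover", VIOS: "failover"},
    "Power loss to the CEC": {
        NO_DETECTION: "partition", HMC: "partition", VIOS: "partition"},
}

def expected_result(event, detection=NO_DETECTION):
    """Return the Table 2 result for an event under a detection configuration."""
    return TABLE_2[event][detection]
```

Note that a power loss to the CEC results in a partition in every column, so that event always needs manual recovery.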

Note: For a system containing VIOS, a CEC hardware failure can result in either a failover or a partition, depending on the type of system and the hardware failure. For example, in a blade system, a CEC failure that prevents VIOS from running results in a partition because VIOS cannot report the failure. In the same system, if a single blade fails but VIOS continues to run, a failover results because VIOS can report the failure.

Category 3 outages: CRG fault causing failover

When a CRG fault causes a failover, the following happens:

If only a single CRG is affected, failover occurs on an individual CRG basis. This is because CRGs are independent of each other.
If someone cancels several cluster resource jobs, so that several CRGs are affected at the same time, no coordinated failover between CRGs is performed.
The primary node is marked as Inactive in each CRG and made the last backup node.
The node that was the first backup node becomes the new primary node.
If there is no active backup node, the status of the CRG is set to Indoubt and the recovery domain remains unchanged.

Table 3. Category 3 outages: CRG fault causing failover
Failover outage event
The CRG job has a software error that causes it to end abnormally.
The application exit program fails for an application CRG.

Category 4 outages: Communication outage causing partition

This category is similar to Category 2. The following occurs:

The status of the nodes not communicating by cluster messaging is set to Partition status. See Cluster partition for information about partitions.
All nodes and cluster resource services on the nodes are still operational, but not all nodes can communicate with each other.
The cluster is partitioned, but each CRG's primary node is still providing services.

The normal recovery for this partition state is to repair the communication problem that caused the cluster partition. The cluster then resolves the partition state without any additional intervention.

Note: If you want the CRGs to fail over to a new primary node, ensure that the old primary node is not using the resources before the node is marked as failed. See Change partitioned nodes to failed for more information.

Table 4. Category 4 outages: Communication outage causing partition
Failover outage event
A communications adapter, line, or router failure occurs on cluster heartbeat IP address lines.
ENDTCPIFC affects all cluster heartbeat IP addresses on a cluster node.

Outages with active CRGs

If the CRG is Active and the failing node is not the primary node, the following occurs:
- The failover updates the status of the failed recovery domain member in the CRG's recovery domain.
- If the failing node is a backup node, the list of backup nodes is reordered so that active nodes are at the beginning of the list.
If the CRG is Active and the failing node is the primary node, the actions the system performs depend on which type of outage has occurred:
- Category 1 outages: Node outage causing failover
- Category 2 outages: Node outage causing partition
- Category 3 outages: CRG fault causing failover
- Category 4 outages: Communication outage causing partition
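The handling of a non-primary outage in an active CRG can be sketched as a status update followed by a stable reorder of the backup list. This is a minimal Python illustration: the function name and the tuple layout are hypothetical, and the sketch assumes that active backups keep their relative order.

```python
def backup_outage(domain, failed_node, new_status="Inactive"):
    """Update an active CRG's recovery domain after a non-primary node fails.

    domain is a list of (node, status) tuples; index 0 is the primary node.
    new_status is "Inactive" or "Partition", depending on the outage.
    """
    if domain[0][0] == failed_node:
        raise ValueError("primary node outages are handled per outage category")
    # Update the status of the failed recovery domain member.
    updated = [(n, new_status if n == failed_node else s) for n, s in domain]
    primary, backups = updated[0], updated[1:]
    # Stable sort: active backups move to the front and keep their relative order.
    backups.sort(key=lambda member: member[1] != "Active")
    return [primary] + backups
```

For example, if node B fails in a recovery domain ordered A (primary), B, C, the result is A, C, B, with B marked Inactive.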

Outages with inactive CRGs

When there is an outage with inactive CRGs, the following occurs:

The membership status of the failed node in the cluster resource group's recovery domain is changed to either Inactive or Partition status.
The node roles are not changed, and the backup nodes are not reordered automatically.
The backup nodes are reordered in an Inactive CRG when the Start Cluster Resource Group (STRCRG) command or the Start Cluster Resource Group (QcstStartClusterResourceGroup) API is called.
Note: The Start Cluster Resource Group API will fail if the primary node is not active. You must issue the Change Cluster Resource Group (CHGCRG) command or the Change Cluster Resource Group (QcstChangeClusterResourceGroup) API to designate an active node as the primary node, and then call the Start Cluster Resource Group API again.
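The recovery sequence in the note (a start request fails, an active node is designated as primary, and the start is requested again) can be modeled as follows. This Python sketch does not call the real commands or APIs. The functions `start_crg` and `change_primary` are hypothetical stand-ins for the start and change operations, and the choice of the first active backup as the new primary is an assumption made for the example.

```python
class CrgStartError(Exception):
    """Raised when the modeled start request is rejected."""

def start_crg(crg, active_nodes):
    """Model of starting an inactive CRG: fails if the primary is not active."""
    primary, backups = crg["domain"][0], crg["domain"][1:]
    if primary not in active_nodes:
        raise CrgStartError(f"primary node {primary} is not active")
    # Backup nodes are reordered only when the CRG is started.
    backups.sort(key=lambda node: node not in active_nodes)
    crg["domain"] = [primary] + backups
    crg["status"] = "Active"

def change_primary(crg, new_primary):
    """Model of the change operation: designate new_primary as the primary node."""
    rest = [n for n in crg["domain"] if n != new_primary]
    crg["domain"] = [new_primary] + rest

def start_with_recovery(crg, active_nodes):
    """Try to start; on failure, pick an active node as primary and retry."""
    try:
        start_crg(crg, active_nodes)
    except CrgStartError:
        candidates = [n for n in crg["domain"][1:] if n in active_nodes]
        if not candidates:
            raise  # no active node can take over as primary
        change_primary(crg, candidates[0])
        start_crg(crg, active_nodes)
```

In practice, an administrator chooses which active node to designate as primary. The automatic choice here only keeps the sketch short.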