Managing failover outage events
Typically, a failover results from a node outage, but other events can also trigger one. Various system or user actions can cause failover situations.
A problem can also affect only a single cluster resource group (CRG), causing a failover for that CRG but not for any other CRG.
Four categories of outages can occur within a cluster. Some of these events are true failover situations where the node is experiencing an outage, while others require investigation to determine the cause and the appropriate response. The following tables describe each category of outage, the types of outage events that fall into that category, and the appropriate recovery action.
Category 1 outages: Node outage causing failover
- For each CRG, the primary node is marked Inactive and made the last backup node.
- The node that was the first backup becomes the new primary node.
- Failovers happen in this order:
  1. All device CRGs
  2. All data CRGs
  3. All application CRGs
- If a failover for any CRG detects that none of the backup nodes are active, the status of the CRG is set to Indoubt and the CRG recovery domain does not change.
- If cluster resource services fails entirely, the resources (CRGs) that it manages go through the failover process.
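
The role changes described above can be sketched as a small model. This is only an illustration, not IBM i code: the `Member` and `Crg` classes and the `fail_over` and `node_failover` functions are hypothetical names. The sketch assumes that the CRG types are processed in the order listed (device, data, application), that an Indoubt CRG is left completely untouched, and that the first active backup is the one promoted.

```python
from dataclasses import dataclass

# Assumed processing order, taken from the order the CRG types are listed in.
FAILOVER_ORDER = {"device": 0, "data": 1, "application": 2}


@dataclass
class Member:
    name: str
    active: bool = True


@dataclass
class Crg:
    name: str
    kind: str            # "device", "data", or "application"
    domain: list         # Members; index 0 is the primary, the rest are backups
    status: str = "Active"


def fail_over(crg):
    """Apply the role changes for one CRG whose primary node went down."""
    primary, *backups = crg.domain
    if not any(m.active for m in backups):
        # No active backup: mark the CRG Indoubt, leave the domain unchanged.
        crg.status = "Indoubt"
        return False
    primary.active = False
    # Stable sort keeps the existing backup order, with active nodes first.
    backups.sort(key=lambda m: not m.active)
    # First active backup becomes primary; old primary becomes last backup.
    crg.domain = backups + [primary]
    return True


def node_failover(crgs, failed_node):
    """Fail over every CRG whose primary is failed_node, in the assumed order."""
    processed = []
    for crg in sorted(crgs, key=lambda c: FAILOVER_ORDER[c.kind]):
        if crg.domain[0].name == failed_node:
            fail_over(crg)
            processed.append(crg.name)
    return processed
```

For a CRG with recovery domain A, B, C where node A goes down, the model yields the order B, C, A with A marked inactive. A CRG whose only backup is inactive keeps its recovery domain and moves to Indoubt status.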
Failover outage event |
---|
ENDTCP (*IMMED or *CNTRLD with a time limit) is issued. |
ENDSYS (*IMMED or *CNTRLD) is issued. |
PWRDWNSYS (*IMMED or *CNTRLD) is issued. |
Initial program load (IPL) button is pressed while cluster resource services is active on the system. |
End Cluster Node (API or command) is called on the primary node in the CRG recovery domain. |
Remove Cluster Node (API or command) is called on the primary node in the CRG recovery domain. |
HMC delayed power down of the partition or panel option 7 is issued. |
ENDSBS QSYSWRK (*IMMED or *CNTRLD) is issued. |
Category 2 outages: Node outage causing partition or failover
- The status of the nodes not communicating by cluster messaging is set to Partition status. See Cluster partition for information about partitions.
- All nodes in a cluster partition that does not contain the primary node end the active cluster resource group.
- If a node has actually failed but the failure is detected only as a partition problem, and the failed node was the primary node, you lose all the data and application services on that node and no automatic failover is started.
- You must either declare the node as failed or bring the node back up and start clustering on that node again. See Change partitioned nodes to failed for more information.
Failover outage event | No advanced node failure detection | HMC | VIOS |
---|---|---|---|
CEC hardware outage (CPU, for example) occurs. | partition | failover | partition or failover |
Operating system software machine check occurs. | partition | failover | failover |
HMC immediate power off or panel option 8 is issued. | partition | failover | failover |
HMC partition restart or panel option 3 is issued. | partition | failover | failover |
Power loss to the CEC occurs. | partition | partition | partition |
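
The table above can be encoded as a lookup, which may be convenient in a runbook or monitoring script. This is a sketch only: the `OUTCOMES` dictionary, the shortened event names, the detection keys ("none", "HMC", "VIOS"), and both function names are hypothetical. The outcomes themselves are copied from the table.

```python
# Outcome per outage event and per advanced node failure detection setup.
# A set with two entries means either outcome is possible.
OUTCOMES = {
    "CEC hardware outage": {
        "none": {"partition"}, "HMC": {"failover"},
        "VIOS": {"partition", "failover"},
    },
    "Operating system software machine check": {
        "none": {"partition"}, "HMC": {"failover"}, "VIOS": {"failover"},
    },
    "HMC immediate power off or panel option 8": {
        "none": {"partition"}, "HMC": {"failover"}, "VIOS": {"failover"},
    },
    "HMC partition restart or panel option 3": {
        "none": {"partition"}, "HMC": {"failover"}, "VIOS": {"failover"},
    },
    "Power loss to the CEC": {
        "none": {"partition"}, "HMC": {"partition"}, "VIOS": {"partition"},
    },
}


def expected_outcomes(event, detection="none"):
    """Return the set of possible outcomes for an event and detection setup."""
    return OUTCOMES[event][detection]


def may_need_manual_recovery(event, detection="none"):
    """A partition outcome starts no automatic failover, so flag it."""
    return "partition" in expected_outcomes(event, detection)
```

Calling `may_need_manual_recovery("Power loss to the CEC", "HMC")` returns `True`, reflecting that a power loss is reported as a partition regardless of the detection method.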
For a system containing VIOS, a CEC hardware failure can result in either a failover or a partition. Which one occurs depends on the type of system and the hardware failure. For example, in a blade system, a CEC failure that prevents VIOS from running results in a partition because VIOS is unable to report any failure. In the same system, if a single blade fails but VIOS continues to run, a failover results because VIOS is able to report the failure.
Category 3 outages: CRG fault causing failover
- If only a single CRG is affected, failover occurs on an individual CRG basis. This is because CRGs are independent of each other.
- If someone cancels several cluster resource jobs, so that several CRGs are affected at the same time, no coordinated failover between CRGs is performed.
- The primary node is marked as Inactive in each CRG and made the last backup node.
- The node that was the first backup node becomes the new primary node.
- If there is no active backup node, the status of the CRG is set to Indoubt and the recovery domain remains unchanged.
Failover outage event |
---|
The CRG job has a software error that causes it to end abnormally. |
Application exit program failure for an application CRG. |
Category 4 outages: Communication outage causing partition
- The status of the nodes not communicating by cluster messaging is set to Partition status. See Cluster partition for information about partitions.
- All nodes and cluster resource services on the nodes are still operational, but not all nodes can communicate with each other.
- The cluster is partitioned, but each CRG's primary node is still providing services.
Failover outage event |
---|
Communications adapter, line, or router failure on cluster heartbeat IP address lines occurs. |
ENDTCPIFC affects all cluster heartbeat IP addresses on a cluster node. |
Outages with active CRGs
- If the CRG is Active and the failing node is not the primary node, the following occurs:
  - The failover updates the status of the failed recovery domain member in the CRG's recovery domain.
  - If the failing node is a backup node, the list of backup nodes is reordered so that active nodes are at the beginning of the list.
- If the CRG is Active and the recovery domain member is the primary node, the actions the system performs depend on which type of outage has occurred.
  - Category 1 outages: Node outage causing failover
  - Category 2 outages: Node outage causing partition or failover
  - Category 3 outages: CRG fault causing failover
  - Category 4 outages: Communication outage causing partition
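
The handling of a non-primary outage in an Active CRG can be illustrated with a short sketch. It is a model only: `handle_non_primary_outage` is a hypothetical function, the recovery domain is represented as a list of `(node, status)` tuples with the primary first, and the relative order among active backups is assumed to be preserved.

```python
def handle_non_primary_outage(domain, failed_node, new_status="Inactive"):
    """Update a failed backup's status and move active backups to the front.

    domain is a list of (node_name, status) tuples; index 0 is the primary.
    Returns a new list and leaves the input unchanged.
    """
    primary, *backups = domain
    if primary[0] == failed_node:
        raise ValueError("primary outages are handled per outage category")
    # Record the new membership status of the failed node.
    backups = [
        (name, new_status if name == failed_node else status)
        for name, status in backups
    ]
    # Stable sort: active backups first, otherwise keep the existing order.
    backups.sort(key=lambda member: member[1] != "Active")
    return [primary] + backups
```

With a recovery domain of A (primary), B, and C, an outage on B gives the order A, C, B, so that C is the first backup if a later failover is needed.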
Outages with inactive CRGs
- The membership status of the failed node in the cluster resource group's recovery domain is changed to either Inactive or Partition status.
- The node roles are not changed, and the backup nodes are not reordered automatically.
- The backup nodes are reordered in an Inactive CRG when the Start Cluster Resource Group (STRCRG) command or the Start Cluster Resource Group (QcstStartClusterResourceGroup) API is called.

  Note: The Start Cluster Resource Group API will fail if the primary node is not active. You must issue the Change Cluster Resource Group (CHGCRG) command or the Change Cluster Resource Group (QcstChangeClusterResourceGroup) API to designate an active node as the primary node, and then call the Start Cluster Resource Group API again.
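
The start, change, and start-again sequence from the note can be modeled as follows. This sketch does not call the real STRCRG or CHGCRG interfaces. `start_crg`, `change_primary`, `start_with_recovery`, and `PrimaryNotActive` are hypothetical stand-ins. Picking the first active node as the new primary and keeping the other nodes in their existing order are both assumptions; in practice an administrator chooses the new primary.

```python
class PrimaryNotActive(Exception):
    """Raised by the model when a start is attempted with an inactive primary."""


def start_crg(domain):
    """Model of a start request: reorder backups, or fail on an inactive primary.

    domain is a list of (node_name, status) tuples; index 0 is the primary.
    """
    primary, *backups = domain
    if primary[1] != "Active":
        raise PrimaryNotActive(primary[0])
    # Backups are reordered at start time, active nodes first (stable sort).
    backups.sort(key=lambda member: member[1] != "Active")
    return [primary] + backups


def change_primary(domain, new_primary):
    """Model of a change request that designates new_primary as the primary."""
    chosen = next(m for m in domain if m[0] == new_primary)
    return [chosen] + [m for m in domain if m[0] != new_primary]


def start_with_recovery(domain):
    """Try to start; on failure, designate an active primary and retry once."""
    try:
        return start_crg(domain)
    except PrimaryNotActive:
        active = [name for name, status in domain if status == "Active"]
        if not active:
            raise  # no active node is available to take the primary role
        return start_crg(change_primary(domain, active[0]))
```

For a recovery domain in which A (primary) and B are Inactive and C is Active, `start_with_recovery` returns the order C, A, B.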