Managing data center failures when quorum is not enabled
If quorum is not enabled, multiple catalog servers can try to act as the primary catalog server. This situation occurs when conditions prevent a catalog server from knowing that the other catalog servers are still available, typically during a network brownout. If the brownout lasts longer than the heartbeat interval, two catalog servers can each declare themselves primary. This condition is referred to as split-brain, in which one cluster of catalog servers cannot communicate with the others.
About this task
Catalog server islanding, or a network partition, occurs when two catalog servers act as the primary catalog server concurrently. Each catalog server is unaware that the other exists, so both servers make changes to the internal data grid that tracks the state of the servers. The islanded processes remain active and might attempt to recover, because each process views the other processes in the system as the ones that failed. When two catalog servers make placement decisions concurrently, data movement and churn occur, and the data grid becomes unstable and inaccessible.
To resolve this split-brain scenario, you have several options:
- Allow WebSphere® eXtreme Scale to automatically resolve split-brain through JVM reconnection: WebSphere eXtreme Scale automatically resolves catalog server split-brain conditions that can result from network partitions, severe swapping, disk I/O delays, or garbage collection pauses. In previous releases of WebSphere eXtreme Scale, when this type of split-brain communication occurred between catalog servers, you had to use the log files to determine which catalog server to keep as the single primary, and then manually end the processes of any other catalog servers that declared themselves as the primary. That manual process is no longer required. When the two primary catalog servers reestablish communication, one of them acknowledges the other as primary and, based on the container reconnect settings for that catalog JVM, at minimum shuts down, or even restarts. As long as you set the appropriate reconnection settings, you no longer have to manually end catalog server processes. For more information, see Container server reconnect properties.
- With eXtremeIO (XIO) and the failure detection that it offers, stand-alone servers and Liberty servers can complete container reconnect without recycling the actual process.
- You can prevent the network partition scenario from occurring by enabling quorum. When quorum is enabled, a defined group of catalog servers must be available for placement decisions to occur. If the defined group is not available, placement decisions for the data grid are suspended until quorum is available again. However, enabling quorum effectively disables the automatic recovery that would otherwise occur. Messages indicate that quorum is not available and that placement operations are suspended. Then, as an administrator, you must act on the situation and resolve the issues so that placement can occur again. For more information about enabling quorum, see Configuring the quorum mechanism. If you do not have quorum enabled, a network partition can cause incompatible changes to be made to the state of the data grid. As a result, a restart of all of the container and catalog servers is required to fully recover from the network partition.
- Although it is not recommended, you can disable automatic recovery by disabling the container reconnect settings on the catalog and container servers. If you decide to manually recover the catalog service when a split-brain condition occurs and quorum is not enabled, use the following procedure. The steps outline how to determine which catalog servers declared themselves as the primary, how to use the logs to decide which catalog server to keep as the single primary, and how to end the processes of any other catalog servers that declared themselves as the primary.
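As a sketch of the quorum option above, quorum is typically enabled through the server properties file of each catalog server. The file name shown is conventional, and you should verify the property against the server properties reference for your release:

```properties
# objectGridServer.properties for each catalog server in the defined group.
# enableQuorum must be set consistently on every catalog server; with it
# enabled, placement decisions are suspended whenever quorum is lost.
enableQuorum=true
```

Remember that with quorum enabled, an administrator must intervene after a partition (for example, by resolving the network issue or overriding quorum) before placement resumes.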
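The log-inspection step of the manual procedure can be sketched with a shell one-liner. The log file paths, the sample timestamps, and the `CWOBJ8106I` activation message shown here are assumptions for illustration; check your catalog server logs for the exact message that reports primary (master) catalog service activation in your release:

```shell
# Hypothetical sample logs; real catalog server logs usually live under the
# server's logs directory (for example, <server_name>/SystemOut.log).
cat > /tmp/cat1_SystemOut.log <<'EOF'
[3/1/21 10:02:11:000 UTC] CWOBJ8106I: The master catalog service cluster activated
EOF
cat > /tmp/cat2_SystemOut.log <<'EOF'
[3/1/21 10:02:15:000 UTC] CWOBJ8106I: The master catalog service cluster activated
EOF

# Print the path of every log that contains the activation message. More than
# one match indicates that more than one catalog server declared itself
# primary, which is the split-brain condition described in this topic.
grep -l "master catalog service" /tmp/cat1_SystemOut.log /tmp/cat2_SystemOut.log
```

With the timestamps in hand, you would typically keep the catalog server that activated first as the single primary and end the processes of the others.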