Catalog server quorums

The catalog service domain is a fixed set of catalog server Java virtual machines (JVMs). For the best performance, do not configure catalog service domains to span data centers. When the quorum mechanism is enabled, all the catalog servers in the quorum must be available and communicating with each other for placement operations to occur in the data grid. While the catalog service has quorum, it responds to container server lifecycle events, such as the placement or removal of shards when a container server starts or stops. When a brownout or other failure occurs and not all members of the quorum are available, placement operations do not occur. In that case, you must override quorum so that placement operations can resume.
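
For example, you can enable the quorum mechanism by setting the enableQuorum property in the server properties file of each catalog server. The file name in this sketch is an assumption; see Configuring the quorum mechanism for the supported options:

    # catalogServer.properties (example file name; set on every catalog server)
    enableQuorum=true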

Failure classification

Single failure: The failure of one container server or catalog server in the environment is considered a single failure event. When a single failure event occurs, recovery can occur without data loss.

Double failure: When two server processes fail at the same time, it is considered a double failure event, and data loss can occur on the second failure. Because of the second failure, applications might lose write access to the data that was stored on the failed container server. To prevent double failures, you can isolate components of the data grid from each other. For more information, see Zones.

Quorum loss

If the catalog service loses quorum, it waits for quorum to be reestablished. While the catalog service does not have quorum, it ignores lifecycle events from catalog and container servers.
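
You can check whether the catalog service domain currently has quorum by using the xscmd utility. The catalog server endpoint in this sketch is a placeholder:

    xscmd -c showQuorumStatus -cep cathost.example.com:2809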

WebSphere® eXtreme Scale expects to lose quorum in the following scenarios:

  • A catalog server fails

    A catalog server that fails causes quorum to be lost. If a JVM fails, quorum can be reestablished by overriding quorum or by restarting the failed catalog server.

  • Brownout occurs

    A brownout is a temporary loss of connectivity. Brownouts are transient and clear within seconds or minutes, but they can be frequent and repeated depending on the cause. Brownouts can be caused by network partitions, long garbage collection pauses, operating system level swapping, or disk I/O problems. Quorum is the mechanism for reacting to brownouts in the catalog server that are long enough to cause heartbeat failures. While the product tries to maintain normal operation during the brownout period, a brownout is regarded as a single failure event. The failure is expected to be fixed, after which normal operation resumes with no actions necessary.

WebSphere eXtreme Scale does not lose quorum when a catalog server is stopped with the stop command or another administrative action. The system knows that the server instance stopped, which is different from a JVM failure or a brownout. The quorum requirement drops by one server, so the remaining servers still have quorum. Restarting the catalog server sets the quorum requirement back to the previous number.
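
For example, stopping a catalog server with the stopOgServer utility is an administrative action and does not cause quorum loss. The server name and endpoint in this sketch are placeholders:

    stopOgServer catalogServer01 -catalogServiceEndPoints cathost.example.com:2809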

Client behavior during quorum loss

If the client can connect to a catalog server, the client can bootstrap to the data grid whether or not the catalog service domain has quorum. The client tries to connect to any catalog server instance to obtain a route table and then interacts with the data grid. If no container failures or connectivity issues happen during the quorum loss event, clients can still fully interact with the container servers.
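
The following Java sketch shows this bootstrap sequence with the WebSphere eXtreme Scale client API. The endpoint, grid name, and map name are placeholder assumptions:

    import com.ibm.websphere.objectgrid.ClientClusterContext;
    import com.ibm.websphere.objectgrid.ObjectGrid;
    import com.ibm.websphere.objectgrid.ObjectGridManager;
    import com.ibm.websphere.objectgrid.ObjectGridManagerFactory;
    import com.ibm.websphere.objectgrid.ObjectMap;
    import com.ibm.websphere.objectgrid.Session;

    public class ClientBootstrap {
        public static void main(String[] args) throws Exception {
            ObjectGridManager manager = ObjectGridManagerFactory.getObjectGridManager();

            // Connect to any reachable catalog server to obtain a route table.
            // This step succeeds even while the catalog service domain has lost quorum.
            ClientClusterContext context =
                manager.connect("cathost.example.com:2809", null, null);

            // Obtain a client-side reference to the data grid and interact with it.
            ObjectGrid grid = manager.getObjectGrid(context, "Grid");
            Session session = grid.getSession();
            ObjectMap map = session.getMap("Map1");
            map.upsert("key1", "value1");
        }
    }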

Recovery after quorum is reestablished

If quorum is lost for any reason, a recovery protocol is run when quorum is reestablished. When the quorum loss event occurs, all heartbeating for core groups is suspended and failure reports are ignored. After quorum is reestablished, any container server failures that occurred while quorum was lost are processed, and any shards that were hosted on container servers that were reported as failed are recovered. If primary shards were lost, surviving replica shards become primary shards. If replica shards were lost, more replica shards are created.

Scenarios for overriding quorum

Quorum loss that is caused by a catalog server failure or a network brownout recovers automatically after the catalog server is restarted or the network brownout ends. When intermittent failures occur, such as network instability, you must remove the problematic catalog servers by manually ending the catalog server processes. Then, you can override quorum.

When you override quorum, the catalog service assumes that quorum is achieved with the current membership, and container server lifecycle events are processed again. When you run an override quorum command, you are informing the catalog service domain that the failed catalog servers have no chance of recovering.
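
For example, you can override quorum with the xscmd utility by targeting any surviving catalog server. The endpoint in this sketch is a placeholder:

    xscmd -c overrideQuorum -cep cathost.example.com:2809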

The following list describes some scenarios for overriding quorum. In this configuration, you have three catalog servers: A, B, and C.
  • Brownout: Brownout scenarios occur and are resolved fairly quickly. The C catalog server is isolated temporarily. The catalog service loses quorum and waits for the brownout to end. After the brownout is over, the C catalog server rejoins the catalog service domain and quorum is reestablished. Your application sees no problems during this time, and you do not need to take any administrative actions.
  • JVM process failure: The JVM for the C catalog server fails and the catalog service loses quorum. You can override quorum immediately, which restarts the processing of container server lifecycle events. Then, diagnose why the C catalog server failed and resolve any issues. When you are sure that the problem is resolved, restart the C catalog server. The C catalog server joins the catalog service domain again when it restarts. Your application sees no problems during this time.
  • Problematic or repeated brownouts: In this scenario, the A and B catalog servers are on one side of a network partition, while the C catalog server is on the other. Be careful about when you override quorum in this scenario. You do not want to override quorum just as the brownout temporarily heals, and then have the brownout occur again. If that happens, both sides of the network partition can act as primary, causing a split-brain condition.
  • Multiple failures: During a failure scenario, the C catalog server and one or more container servers are lost. Ensure that the failing servers are stopped, and then override quorum. The surviving catalog servers use the remaining container servers to run a full recovery, replacing the shards that were hosted on the failed container servers. The catalog service is now running with a full quorum of the A and B catalog servers. The application might see delays or exceptions during the interval between the start of the failure and when quorum is overridden. After quorum is overridden, the data grid recovers and normal operation resumes. If the lost container servers included both the primary and all replica shards for particular partitions, data loss occurs for those partitions.

Majority quorum

For added flexibility beyond the standard quorum support in WebSphere eXtreme Scale, a second quorum type, called majority quorum, is available. With this quorum type, WebSphere eXtreme Scale does not lose quorum while more than half of the catalog servers are still running and communicating. For example, if there are three catalog servers and one of them cannot communicate with the other two, the other two catalog servers stay in quorum and allow placement changes to occur. If the isolated catalog server rejoins the group, WebSphere eXtreme Scale tries to let it join dynamically if possible. Otherwise, the catalog server is restarted so that it can properly rejoin the catalog cluster. However, if you have four catalog servers and two of them are isolated, there is no majority and WebSphere eXtreme Scale loses quorum. Therefore, a majority quorum policy requires at least (number of catalog servers configured in the cluster)/2 + 1 communicating catalog servers, using integer division.

Majority quorum automatically resolves catalog server failures on the majority side when a brownout event affects the catalog servers. This quorum policy also greatly reduces the need to recycle the catalog servers that were partitioned when they rejoin. Even if the primary catalog server was partitioned, when it rejoins the cluster, it is merged back and only one primary remains in the cluster. To enable majority quorum, see Configuring the quorum mechanism.
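
A minimal sketch of the majority calculation, written in Java to make the integer division explicit:

    /** Minimum number of communicating catalog servers that majority quorum requires. */
    static int majorityThreshold(int configuredCatalogServers) {
        return configuredCatalogServers / 2 + 1; // integer division
    }
    // majorityThreshold(3) == 2: two servers keep quorum when one is isolated.
    // majorityThreshold(4) == 3: two isolated servers leave no majority, so quorum is lost.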

Note: If a brownout event occurs and affects container servers on the non-majority side, those container servers must be recycled when the brownout recovers. Also, if frequent and repeated brownouts are a concern in your environment, standard quorum might prove to be a better option than majority quorum. That way, you can investigate and fix the environmental issue, rather than continually moving data around during repeated container error recovery.

Deciding on a quorum policy for your environment

Use the following comparison to help you decide which quorum policy makes sense for your environment, or whether to enable quorum at all.

For each quorum policy, the comparison answers three questions: how the policy behaves during a brownout, what the motivation for usage is, and which associated risks to consider.

Standard

  • Behavior during a brownout: All configured catalog cluster members must be able to communicate with each other for placement changes to occur as a result of changes in the container server lifecycle. This policy fully removes the need to recycle catalog servers.
  • Motivation for usage: Prevent, under any circumstance, the need to recycle the grid due to state corruption caused by catalog servers that become divergent during periods when communication does not work.
  • Associated risks to consider: Manual intervention is required to initiate any container failure recovery, which means that failure placement does not occur until quorum is overridden. If a container server is lost while quorum is violated, the shards on that container server are not available until quorum is overridden. If a client cannot reach a primary shard because the shard is lost or because of network issues, the client cannot update data. In cases where replicaReadEnabled is set to true, the client cannot access the data either, until quorum is overridden or reestablished.

Majority (Version 8.6 and later)

  • Behavior during a brownout: More than 50% of configured catalog cluster members must be able to communicate with each other for placement changes to occur. This policy greatly reduces the need to recycle catalog servers.
  • Motivation for usage: A compromise that allows for autonomic container failure recovery when the communication problem is somewhat contained and affects only a subset of the catalog cluster. Clients are more likely to have access to data on the accessible subset of the grid during brownouts.
  • Associated risks to consider: Manual intervention is still advisable for more catastrophic communication issues; it might be better to pause the churn caused by container failure recovery until the environment is fixed. With more catastrophic communication issues, some catalog servers might still be recycled automatically.

None (no quorum)

  • Behavior during a brownout: When communication is broken between multiple catalog servers, each catalog server can potentially promote itself as the primary. The result is a conflict on where various partitions can reside, and the conflicts can also lead to inconsistent copies of customer data. The catalog cluster state can also become corrupted in such scenarios, resulting in the need to recycle the catalog servers. However, the recycle can happen automatically, depending on whether the container reconnect settings are enabled or disabled.
  • Motivation for usage: No manual intervention is required; WebSphere eXtreme Scale recycles or shuts down catalog servers automatically.
  • Associated risks to consider: Catastrophic, repeated, insidious communication errors can lead to repeated data movement as a result of constant failure recovery. In the worst cases, most of the grid must be recycled to achieve consistency after the environmental issues are fixed.