Managing data center failures when quorum is not enabled

If quorum is not enabled, a situation can occur in which multiple catalog servers try to act as the primary catalog server. This situation occurs because of conditions that prevent the catalog servers from knowing that other catalog servers are available, typically in a brownout scenario. If the brownout lasts longer than the heartbeat interval, two primary catalog servers can emerge. This condition is referred to as split-brain, in which one cluster of catalog servers cannot communicate with the other.

About this task

Catalog server islanding, or a network partition, can occur when two catalog servers act as the primary catalog server concurrently. These catalog servers do not know that the other catalog server exists, and therefore both servers make changes to the internal data grid that tracks the state of the servers. The islanded processes are active and might be attempting to recover. These processes view the other processes in the system as the ones that have failed. When two catalog servers make placement decisions concurrently, data movement and churn occur. The data grid becomes unstable and inaccessible. To resolve this split-brain scenario, you have several options:
  • [Version 8.6.0.5 and later] Allow WebSphere® eXtreme Scale to automatically resolve split-brain by using JVM reconnection: WebSphere eXtreme Scale now automatically resolves catalog server split-brain that can result from network partitions, severe swapping, disk I/O, or garbage collection. In previous releases of WebSphere eXtreme Scale, when this type of split-brain condition occurred between catalog servers, you had to use the log files to determine which catalog server to keep as the single primary. You also had to manually end the processes of any other remaining catalog servers that declared themselves as the primary. You no longer need to go through that process. When the two primary catalog servers reestablish communication, one of them acknowledges the other catalog server as the primary and, at minimum, shuts down or even restarts, based on the container reconnect settings for that particular catalog JVM. As long as you have set the appropriate reconnection settings, you no longer have to manually end catalog processes. For more information, see Container server reconnect properties.
  • [Version 8.6.0.6 and later] With XIO and the failure detection that it offers, stand-alone servers and Liberty servers can complete container reconnection without having to recycle the actual process.
  • You can prevent the network partition scenario from occurring by enabling quorum. When quorum is enabled, a defined group of catalog servers must be available for placement decisions to occur. If the defined group is not available, placement decisions for the data grid are suspended until quorum is available again. However, enabling quorum effectively disables the automatic recovery that otherwise occurs. Messages indicate that quorum is not available and placement operations are suspended. Then, as an administrator, you must act on the situation to resolve the issues so that placement can occur again. For more information about enabling quorum, see Configuring the quorum mechanism; a minimal configuration sketch also follows this list. If you do not have quorum enabled, a network partition can cause incompatible changes to be made to the state of the data grid. As a result, a restart of all of the container and catalog servers is required to fully recover from the network partition.
  • Although it is not recommended, you can disable automatic recovery by disabling the container reconnect settings on the catalog and container servers. If you decide to manually recover the catalog service when a split-brain condition occurs and quorum is not enabled, use the following procedure. The steps outline how to determine which catalog servers declared themselves as the primary, how to use the logs to choose which catalog server to keep as the single primary, and how to end the processes of any other catalog servers that declared themselves as the primary.
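For example, quorum is typically enabled by setting a property in the catalog server properties file. The file name and the property shown here are a minimal sketch that assumes your catalog servers are started with a server properties file; verify the details in Configuring the quorum mechanism:
  # Catalog server properties file for each catalog server (file name is an example)
  enableQuorum=true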

Procedure

  1. Remove catalog servers that are having heartbeating failures.
    1. Examine the log files for each catalog server.
      To find the messages in the logs, you can either search the SystemOut*.log files for each catalog server, or you can use the message center to filter the logs for all of the connected catalog servers; a search sketch follows the messages below. [Version 8.6 and later] For more information about searching the logs with the message center, see Viewing health event notifications in the message center. The following messages indicate that multiple catalog servers have declared themselves as the primary catalog server:
      CWOBJ1506E: More than one primary replication group member exists in this group ({1}). Only one primary can be active. ({0})
      CWOBJ0002W: ObjectGrid component, {1}, is ignoring an unexpected exception: Multiple primaries.  Network partition!
      CWOBJ8106I: The master catalog service cluster activated with cluster {0}
      The following messages indicate that the processor is overloaded, and as a result heartbeating might be stopped or the JVM might be hung:
      HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 118 seconds.
      DCSV0004W: DCS Stack POC_Core_Group1 at Member POCGRID1SDCCell\POCGRID1_Cell1_SDC32Managed2\POCGRID1SDC32_cell1_Container_Server1: Did not receive adequate CPU time slice. Last known CPU usage time at 08:24:40:748 PST. Inactivity duration was 138 seconds.
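      For example, on Linux or UNIX systems, you might search the logs for these message IDs with commands like the following sketch, in which the log directory path is a placeholder for your own installation layout:
      # List the catalog server logs that contain split-brain messages (path is a placeholder).
      grep -l -e CWOBJ1506E -e CWOBJ0002W -e CWOBJ8106I <catalog_server_log_directory>/SystemOut*.log
      # Check the same logs for processor starvation warnings.
      grep -e HMGR0152W -e DCSV0004W <catalog_server_log_directory>/SystemOut*.log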
    2. Declare one primary catalog server by manually ending the processes for the other catalog servers.

      To choose a primary catalog server, look for the CWOBJ1506E messages. If a catalog server exists in your configuration that has not logged this message, choose that catalog server as the primary. If all of the catalog servers logged this message, look at the time stamps and choose the server that logged the message most recently as the primary. Manually stop the processes of the other catalog servers that declared themselves as the primary but that you chose not to keep. End the processes with a command that is appropriate for your operating system, such as the kill command.
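      For example, on Linux or UNIX systems, the following sketch locates and ends a catalog server process; the server name is an example and assumes that the name was passed on the Java command line when the server was started:
      # Find the process ID of the catalog server JVM that you want to end (server name is an example).
      ps -ef | grep catalogServer02 | grep -v grep
      # End that process by its process ID; escalate to kill -9 only if normal termination fails.
      kill <pid>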

    3. On the servers where you stopped catalog server processes, resolve any garbage collection, operating system, hardware, or networking issues. Garbage collection information is in the verbose garbage collection log file for the JVM.
  2. Remove container servers that are having communication problems.
    1. Search for the following messages in the log files on the primary catalog server.
      These messages indicate which container servers are having communication issues; a search sketch follows the messages below.
      • CWOBJ7211I: As a result of a heartbeat (view heartbeat type) from leader {0} for core group {1} with member list {2}, the server {3} is being removed from the core group view.
      • CWOBJ7213W: The core group {0} received a heart beat notification from the server {1} with revision {2} and a current view listing {3} and previous listing {4} - such a combination indicates a partitioned core group.
      • CWOBJ7214I: While processing a container heart beat for the core group {0}, a difference between the defined set and view was detected. However, since the previous and current views are the same, {1}, this difference can be ignored.
      • CWOBJ7205I: Server, {0}, sent a membership change notice that is rejected because this member was removed from the core group.
      The following messages indicate that the processor is overloaded, and as a result heartbeating might be stopped or the JVM might be frozen:
      HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 118 seconds.
      DCSV0004W: DCS Stack POC_Core_Group1 at Member POCGRID1SDCCell\POCGRID1_Cell1_SDC32Managed2\POCGRID1SDC32_cell1_Container_Server1: Did not receive adequate CPU time slice. Last known CPU usage time at 08:24:40:748 PST. Inactivity duration was 138 seconds.
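      For example, you might watch the primary catalog server log for new core group messages with a command like the following sketch, in which the log path is a placeholder:
      # Follow the primary catalog server log and show only the core group messages (path is a placeholder).
      tail -f <primary_catalog_log_directory>/SystemOut.log | grep CWOBJ72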
       
    2. End processes for problematic container servers.
      End the processes with a command that is appropriate for your operating system, such as the kill command.
    3. On the servers where you stopped container server processes, resolve any garbage collection, operating system, hardware, or networking issues. Garbage collection information is in the verbose garbage collection log file for the JVM.
    4. Continue to monitor the catalog server log files for the messages. If the messages no longer appear, you have successfully ended all of the container servers with heartbeating problems.
      Scan the primary catalog server logs for a few iterations of the heartbeat interval. For example, if your heartbeat interval is 30 seconds, wait 90 seconds. Determine whether the CWOBJ72* log messages have stopped.
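      For example, the following sketch counts the CWOBJ72* messages once per heartbeat interval for three iterations; the 30-second interval and the log path are assumptions about your environment:
      # If the count stops increasing across iterations, the heartbeating problems have stopped.
      for i in 1 2 3; do
          sleep 30
          grep -c CWOBJ72 <primary_catalog_log_directory>/SystemOut.log
      done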
  3. Initiate recovery actions.
    1. Suspend heartbeating in the environment.
      Run the xscmd -c suspend -t heartbeat command.
      Attention: You must have a fix that contains APAR PM95826 applied to use this command.
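      For example, assuming the same catalog service endpoint that is shown in the other command examples in this topic:
      xscmd -c suspend -t heartbeat -cep hostname:port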
    2. Continue to monitor the logs to verify that the CWOBJ72* messages that are related to container server stability have stopped.
    3. After you have confirmed that the container servers are stable, run the xscmd -c triggerPlacement command.
      Running this command initiates shard placement so that the remaining system can service requests.
    4. Monitor the catalog server logs for indications that placement has begun.
      The CWOBJ7501 messages indicate that placement has started and routes are coming into the catalog server. This process can take from 30 seconds to a minute, depending on the health and size of your system.
      CWOBJ7501I: The following partitions listed by the form
            grid:mapSet:partitionId:gridEpoch:partitionEpoch just had their routing entries updated:
            {0}.
      The following messages indicate that work is being successfully routed between the catalog server and the container servers: [Version 8.6 and later]
      CWOBJ7507I: The placement of workId:grid:mapSet:partition {0} has been sent to container
            {1}.
      [Version 8.6 and later]
      CWOBJ7503I: The placement work intended for container {0} for
            workId:grid:mapSet:partition:shardId {1} was acknowledged by the container as successful.
      When the CWOBJ7501 messages stop, you can move on to the validation actions.
      The following message indicates that an error has occurred with placement of work on the container server:
      CWOBJ7504E: The placement work intended for container {0} for
            workId:grid:mapSet:partition:shardId {1} has encountered a failure {2}.
  4. Validate that placement behavior is back to normal and that recovery was successful.
    1. Run the xscmd -c showPlacement command every 15 seconds for a minute; a command sketch follows this list.
      • Confirm that placement is stable and that no changes are occurring.
      • If you see the primary shard for a partition on more than one container server, you must restart all but one of the container servers that report that primary shard to purge any stale data. This situation can occur after more than one catalog server declared itself as the primary. To determine which container server to leave running, look at the container server logs for the server that has the following message with the most recent time stamp:
        CWOBJ1511I: {0} ({1}) is open for business.
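      The following sketch runs the command every 15 seconds for a minute; the endpoint, data grid, and map set values are the same example values that are used elsewhere in this topic and are assumptions about your configuration:
      # Compare the output of each iteration; placement is stable when the output stops changing.
      for i in 1 2 3 4; do
          xscmd -c showPlacement -cep hostname:port -g myGrid -ms myMapSet
          sleep 15
      done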
    2. When you have confirmed that placement is stable, resume heartbeating.
      Run the xscmd -c resume -t heartbeat command.
      Attention: You must have a fix that contains APAR PM95826 applied to use this command.
    3. Run the xscmd -c routetable command. This command displays the current route table by simulating a new client connection to the data grid. It also validates the route table by confirming that all container servers recognize their role in the route table, such as which type of shard they host for each partition.
      xscmd -c routetable -cep hostname:port -g myGrid 
    4. Run the xscmd -c showMapSizes command to track that data is flowing to the data grid as expected. Verify that key distribution is uniform over the shards in the data grid. If some container servers have more keys than others, it is likely that the hash function on the key objects has a poor distribution.
      xscmd -c showMapSizes -cep hostname:port -g myGrid -ms myMapSet
    5. If you are running with multi-master replication, run the xscmd -c showLinkedPrimaries command to list every primary shard.
      This command displays the remote catalog service domain and the remote container server to which each primary shard is connected. If the remote link is functional, the status is displayed as online.
    6. Run the xscmd -c revisions command.
      If any revisions come back in the list, the primary and replica pairs are not completely replicated. Depending on your load, some incomplete replication is normal. However, if the difference between the revision numbers of a primary and its replica grows over time, replication might not be working or might be struggling to keep up with your load. Run this command multiple times to watch for trends.
    7. Run the xscmd -c listCoreGroups command to display a list of all the core groups for the catalog server.
      xscmd -c listCoreGroups -cep hostname:port