Managing data center failures when quorum is enabled

When the data center enters a failure scenario, analyze the failure and, based on that analysis, consider overriding quorum so that container server life cycle events are no longer ignored. You can use the xscmd utility to query quorum status and to run quorum tasks, such as overriding quorum.

Before you begin

  • Configure the quorum mechanism with the same setting on all of your catalog servers. See Configuring the quorum mechanism for more information.
  • Quorum is the minimum number of catalog servers that are necessary to conduct placement operations for the data grid. Unless you configure a lower number, quorum is the full set of catalog servers. WebSphere® eXtreme Scale expects to lose quorum for the following reasons:
    • Catalog service JVM member fails
    • Network brownout
    • Data center loss
    • Garbage collection
    • Disk I/O
    • Severe swapping

    The following message indicates that quorum has been lost. Look for this message in your catalog service logs.

    CWOBJ1254W: The catalog service is waiting for quorum.
  • [Version 8.6.0.6 and later]With the move from the high availability (HA) manager to the failure detection system that is available through XIO, you get more details about whether a failure is a blackout or a brownout. In particular, if a socket connection is lost, a process has failed; a blackout occurs and is noted in the log file with a CWOBJ7217 message for the primary catalog server. If clients cannot communicate with containers but the catalog servers have not lost socket connections to those containers, or if containers are not checking in with the catalog cluster even though the catalog has not lost a socket connection, the failure is more likely a brownout. If the catalog ultimately tries to connect to the container directly and that attempt fails, a CWOBJ7218 message is displayed in the logs for the catalog server. Also, even with XIO, the catalog servers still use the HA manager, so the catalog core groups remain relevant. However, the container core groups are ignored. A sample log search for these messages follows this list.
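    For example, you can search the catalog server logs for the quorum and failure detection messages that are described in this list. The following commands are a minimal sketch that assumes a UNIX-like system and a hypothetical log directory of /opt/wxs/logs/catalogServer1; adjust the paths for your installation:

    # Check whether quorum was lost
    grep CWOBJ1254W /opt/wxs/logs/catalogServer1/SystemOut*.log
    # Check for blackout (CWOBJ7217) or brownout (CWOBJ7218) indications from the primary catalog server
    grep -E "CWOBJ7217|CWOBJ7218" /opt/wxs/logs/catalogServer1/SystemOut*.log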

About this task

Override quorum in a data center failure scenario only. When you override quorum, you can run the override against any surviving catalog server instance. All surviving catalog servers are notified when one of them is told to override quorum.

To resolve communication issues, search the logs for your catalog servers for specific messages to determine and fix the problem. You must also look at your environment to determine whether any existing issues need to be resolved. To find the messages in the logs, you can either search the SystemOut*.log files for each catalog server, or you can use the message center to filter the logs for all the connected catalog servers. [Version 8.6 and later]For more information about searching the logs with the message center, see Viewing health event notifications in the message center.

Procedure

  1. Query quorum status with the xscmd utility.
    xscmd -c showQuorumStatus -cep cathost:2809
    Use this command to display the quorum status of a catalog service instance.

    [Version 8.6 and later]You can optionally use the -to or --timeout option on your command to reduce the timeout value so that you avoid waiting for operating system or other network timeouts during a network brownout or system loss. The default timeout value is 30 seconds. An example with a reduced timeout follows the list of quorum outcomes.

    The command displays the current quorum status of each catalog server. The quorum column displays one of the following outcomes:
    • TRUE: The server has quorum enabled and the system is working normally. Quorum is met.
    • FALSE: The server has quorum enabled, but quorum is lost. The catalog servers do not allow changes to the catalog service domain.
    • UNAVAILABLE: The server cannot be contacted. It is either not running or there is a network problem and the server cannot be reached.
    • DISABLED: The server does not have quorum enabled.
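    For example, to check quorum status with a shorter timeout during a suspected network brownout, you can run a command like the following sketch. The cathost host name and port 2809 are the placeholder values from the earlier example, and the 10-second timeout is an arbitrary value that you can adjust:

    xscmd -c showQuorumStatus -cep cathost:2809 -to 10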
  2. Remove catalog servers that are having heartbeating failures.
    1. Examine the log files for each catalog server.
      To find the message in the logs, you can either search the SystemOut*.log files for each catalog server, or you can use the message center to filter the logs for all the connected catalog servers. [Version 8.6 and later]For more information about searching the logs with the message center, see Viewing health event notifications in the message center. The following messages indicate that multiple catalog servers have declared themselves as the primary catalog server:
      CWOBJ1506E: More than one primary replication group member exists in this group ({1}). Only one primary can be active. ({0})
      CWOBJ0002W: ObjectGrid component, {1}, is ignoring an unexpected exception: Multiple primaries.  Network partition!
      CWOBJ8106I: The master catalog service cluster activated with cluster {0}
      The following messages indicate that the processor is overloaded, and as a result heartbeating might be stopped or the JVM might be hung:
      HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 118 seconds.
      DCSV0004W: DCS Stack POC_Core_Group1 at Member POCGRID1SDCCell\POCGRID1_Cell1_SDC32Managed2\POCGRID1SDC32_cell1_Container_Server1: Did not receive adequate CPU time slice. Last known CPU usage time at 08:24:40:748 PST. Inactivity duration was 138 seconds.
    2. Declare one primary catalog server by manually ending the processes for the other catalog servers.

      To choose a primary catalog server, look for the CWOBJ1506E messages. If a catalog server exists in your configuration that has not logged this message, choose that catalog server as the primary. If all the catalog servers logged this message, look at the time stamps and choose the server that logged the message most recently as the primary. Manually stop the processes that are associated with the catalog servers that you chose not to use as the primary. End the processes with the command that is appropriate for your operating system, such as the kill command; see the sketch at the end of this step.

    3. On the servers where you stopped catalog server processes, resolve any garbage collection, operating system, hardware, or networking issues. Garbage collection information is in the log file where the JVM writes its garbage collection output.
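    The following commands are a minimal sketch of ending a catalog server process on a UNIX-like system. The catalogServer2 value is a hypothetical server name; adjust the search pattern and the process ID for your environment:

    # Find the JVM process for the catalog server that you want to stop
    ps -ef | grep catalogServer2
    # End that process by its process ID
    kill <pid>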
  3. Remove container servers that are having communication problems.
    1. Search for the following messages in the log files on the primary catalog server.
      These messages indicate which container servers are having communication issues:
      • CWOBJ7211I: As a result of a heartbeat (view heartbeat type) from leader {0} for core group {1} with member list {2}, the server {3} is being removed from the core group view.
      • CWOBJ7213W: The core group {0} received a heart beat notification from the server {1} with revision {2} and a current view listing {3} and previous listing {4} - such a combination indicates a partitioned core group.
      • CWOBJ7214I: While processing a container heart beat for the core group {0}, a difference between the defined set and view was detected. However, since the previous and current views are the same, {1}, this difference can be ignored.
      • CWOBJ7205I: Server, {0}, sent a membership change notice that is rejected because this member was removed from the core group.
      The following messages indicate that the processor is overloaded, and as a result heartbeating might be stopped or the JVM might be frozen:
      HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 118 seconds.
      DCSV0004W: DCS Stack POC_Core_Group1 at Member POCGRID1SDCCell\POCGRID1_Cell1_SDC32Managed2\POCGRID1SDC32_cell1_Container_Server1: Did not receive adequate CPU time slice. Last known CPU usage time at 08:24:40:748 PST. Inactivity duration was 138 seconds.
       
    2. End processes for problematic container servers.
      End the processes with the command that is appropriate for your operating system, with a command such as the kill command.
    3. On the servers where you stopped container server processes, resolve any garbage collection, operating system, hardware, or networking issues. Garbage collection information is in the log file where the JVM writes its garbage collection output.
    4. Continue to monitor the catalog server log files for these messages. If you no longer see the messages, you have successfully ended all of the container servers that had heartbeating problems.
      Scan the primary catalog server logs for a few iterations of the heartbeat interval. For example, if your heartbeat interval is 30 seconds, wait 90 seconds. Determine whether the CWOBJ72* log messages have stopped, for example with a log search like the one that follows.
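      For example, the following log search is a minimal sketch that assumes a UNIX-like system and a hypothetical log directory of /opt/wxs/logs/catalogServer1 for the primary catalog server; adjust the path for your installation:

      # Show the most recent heartbeat-related messages, if any remain
      grep CWOBJ72 /opt/wxs/logs/catalogServer1/SystemOut*.log | tail -n 20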
  4. Override quorum with the xscmd utility.
    xscmd -c overrideQuorum -cep hostname:port
    Running this command forces the surviving catalog servers to re-establish a quorum.
  5. Run the xscmd -c triggerPlacement command.
    Running this command initiates failure recovery so that the remaining system can service requests.
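    For example, the following command uses the placeholder catalog service endpoint, data grid, and map set names that appear in the other steps of this task; substitute the values for your environment:

    xscmd -c triggerPlacement -cep hostname:port -g myGrid -ms myMapSet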
  6. Validate that recovery was successful. A consolidated example script follows these validation steps.
    1. Run the xscmd -c showPrimaryCatalogServer command to verify that a single catalog server reports as the primary catalog server.
    2. Run the xscmd -c showPlacement command every 15 seconds for a minute. Confirm placement is stable and that no changes are occurring.
    3. Run the xscmd -c routetable command. This command displays the current route table by simulating a new client connection to the data grid. It also validates the route table by confirming that all container servers recognize their role in the route table, such as which type of shard they host for which partition.
      xscmd -c routetable -cep hostname:port -g myGrid 
    4. Run the xscmd -c showMapSizes command to confirm that data is flowing to the data grid as expected. Verify that the key distribution is uniform over the shards in the data grid. If some container servers have many more keys than others, the hash function on your key objects likely has a poor distribution.
      xscmd -c showMapSizes -cep hostname:port -g myGrid -ms myMapSet
    5. Run the xscmd -c revisions command.
      If any revisions come back in the list, the primary and replica pairs are not completely replicated. Depending on your load, some incomplete replication is normal. However, if the difference between the revision numbers of a primary and replica grows over time, replication might not be working or might be struggling to keep up with your load. Run this command multiple times to watch for trends.
    6. Run the xscmd -c listCoreGroups command to display a list of all the core groups for the catalog server.
      xscmd -c listCoreGroups -cep hostname:port
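    The following script is a minimal sketch that strings the validation commands together so that you can watch for trends over time. The hostname:port, myGrid, and myMapSet values are the placeholders that are used in the previous steps, the loop counts and sleep intervals are arbitrary, and depending on your configuration some of the commands might also need the -g or -ms options:

    #!/bin/sh
    CEP=hostname:port

    # Verify that a single primary catalog server is reported
    xscmd -c showPrimaryCatalogServer -cep $CEP

    # Sample placement every 15 seconds for a minute and confirm that it is stable
    for i in 1 2 3 4; do
      xscmd -c showPlacement -cep $CEP
      sleep 15
    done

    # Validate the route table for the data grid
    xscmd -c routetable -cep $CEP -g myGrid

    # Check that the key distribution is roughly uniform across shards
    xscmd -c showMapSizes -cep $CEP -g myGrid -ms myMapSet

    # Run the revisions check twice to see whether any differences are growing
    xscmd -c revisions -cep $CEP
    sleep 30
    xscmd -c revisions -cep $CEP

    # List the core groups for the catalog server
    xscmd -c listCoreGroups -cep $CEP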