When the data center enters a failure scenario, consider
overriding quorum, based on your analysis of the failure, so that
container server life cycle events are not ignored. You can
use the xscmd utility to run quorum tasks, such as querying
the quorum status and overriding quorum.
About this task
Override quorum in a data center failure scenario only. When
you override quorum, any surviving catalog server instance can be
used. All survivors are notified when one instance is told to override quorum.
To resolve the communication issues that arise in such a failure, search the logs
of your catalog servers for certain messages to determine and fix the problem. You
also must look at your environment to determine whether any existing issues
need to be resolved. To find the messages in the logs, you can either
search the SystemOut*.log files for each catalog
server, or you can use the message center to filter the logs for all
the connected catalog servers. For more information
about searching the logs with the message center, see Viewing health event notifications in the message center.
Procedure
- Query quorum status with the xscmd utility.
xscmd -c showQuorumStatus -cep cathost:2809
Use this command to display the quorum status of a catalog service instance.
You can optionally use the -to or --timeout option
on your command to reduce the timeout value, which avoids waiting for operating
system or other network timeouts during a network brownout
or system loss. The default timeout value is 30 seconds. A sample
command with a reduced timeout follows the list of outcomes below.
The
command displays the current quorum status of each catalog server.
The quorum column displays one of the following outcomes:
- TRUE: The server has quorum enabled
and the system is working normally. Quorum is met.
- FALSE: The server has quorum enabled,
but quorum is lost. The catalog servers do not allow changes to the
catalog service domain.
- UNAVAILABLE: The server cannot be
contacted. It is either not running or there is a network problem
and the server cannot be reached.
- DISABLED: The server does not have
quorum enabled.
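For example, to reduce the timeout to 10 seconds, you might run a command like the following (the host name, port, and timeout value are placeholders for your environment):
xscmd -c showQuorumStatus -cep cathost:2809 -to 10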
- Remove catalog servers that are having
heartbeating failures.
- Examine the log files for each catalog server.
To find the messages in the logs, you can either search the
SystemOut*.log files for each catalog server,
or you can use the message center to filter the logs for all the connected
catalog servers. For more information about
searching the logs with the message center, see Viewing health event notifications in the message center. The following messages
indicate that multiple catalog servers have declared themselves as
the primary catalog server (a sample search command follows this list):
CWOBJ1506E: More than one primary replication group member exists in this group ({1}). Only one primary can be active. ({0})
CWOBJ0002W: ObjectGrid component, {1}, is ignoring an unexpected exception: Multiple primaries. Network partition!
CWOBJ8106I: The master catalog service cluster activated with cluster {0}
The
following messages indicate that the processor is overloaded, and
as a result heartbeating might be stopped or the JVM might be hung:
HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 118 seconds.
DCSV0004W: DCS Stack POC_Core_Group1 at Member POCGRID1SDCCell\POCGRID1_Cell1_SDC32Managed2\POCGRID1SDC32_cell1_Container_Server1:
Did not receive adequate CPU time slice. Last known CPU usage time at 08:24:40:748 PST. Inactivity duration was 138 seconds.
- Declare one primary catalog server by manually ending
the processes for the other catalog servers.
To choose a primary catalog server, look for
the CWOBJ1506E messages. If a catalog server exists
in your configuration that has not logged this message, choose that
catalog server as the primary. If all the catalog servers logged
this message, look at the time stamps and choose the server that
logged the message most recently as the primary. Manually stop the
processes that are associated with the catalog servers that
you have chosen not to use as the primary. End those processes with a command
that is appropriate for your operating system, such as the kill command;
see the example after this list.
- On the servers where you stopped catalog server processes,
resolve any garbage collection, operating system, hardware, or networking
issues. Garbage collection information is in the verbose garbage
collection log for each JVM.
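For example, on a UNIX or Linux system, you might identify the self-declared primaries and end the processes that you chose not to keep with commands like the following. The log file names follow the SystemOut*.log convention from this task, and the process ID is a hypothetical placeholder:
grep CWOBJ1506E SystemOut*.log
kill 34215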
- Remove container servers that are
having communication problems.
- Search for the following messages in the log files on
the primary catalog server; a sample search command follows this
list. These messages indicate which
container servers are having communication issues:
- CWOBJ7211I: As a result of a heartbeat (view heartbeat
type) from leader {0} for core group {1} with member list {2}, the
server {3} is being removed from the core group view.
- CWOBJ7213W: The core group {0} received a heart beat notification
from the server {1} with revision {2} and a current view listing
{3} and previous listing {4} - such a combination indicates a partitioned
core group.
- CWOBJ7214I: While processing a container heart beat for
the core group {0}, a difference between the defined set and view
was detected. However, since the previous and current views are the
same, {1}, this difference can be ignored.
- CWOBJ7205I: Server, {0}, sent a membership change notice
that is rejected because this member was removed from the core group.
The following messages indicate that the processor is overloaded,
and as a result heartbeating might be stopped or the JVM might be
frozen:
HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 118 seconds.
DCSV0004W: DCS Stack POC_Core_Group1 at Member POCGRID1SDCCell\POCGRID1_Cell1_SDC32Managed2\POCGRID1SDC32_cell1_Container_Server1: Did not receive adequate CPU time slice. Last known CPU usage time at 08:24:40:748 PST. Inactivity duration was 138 seconds.
- End processes for problematic container servers.
End those processes with a command that is appropriate for your
operating system, such as the kill command.
- On the servers where you stopped container server processes,
resolve any garbage collection, operating system, hardware, or networking
issues. Garbage collection information is in the verbose garbage
collection log for each JVM.
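For example, you might scan the primary catalog server logs for any of these messages with a command like the following (the log file names follow the SystemOut*.log convention from this task):
grep CWOBJ72 SystemOut*.log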
- Continue to monitor the catalog server log files for
the messages. If you no longer see these messages, you have
successfully ended all of the container servers with heartbeating
problems; see the example after this step.
Scan the primary catalog server logs for a few
iterations of the heartbeat interval. For example, if your heartbeat
interval is 30 seconds, wait 90 seconds. Determine if the CWOBJ72* log
messages have stopped.
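For example, on a UNIX or Linux system, you can watch the primary catalog server log for new occurrences of these messages (the log file name is a placeholder):
tail -f SystemOut.log | grep CWOBJ72
If no new matches appear for a few heartbeat intervals, the container servers with heartbeating problems have been removed.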
- Override quorum with the xscmd utility.
xscmd -c overrideQuorum -cep hostname:port
Running this command forces the surviving
catalog servers to re-establish a quorum.
- Run the xscmd -c triggerPlacement command.
Running this command initiates failure recovery so that the
remaining system can service requests.
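A sample command, following the same endpoint conventions as the other xscmd commands in this task (the host name, port, grid name, and map set name are placeholders):
xscmd -c triggerPlacement -cep hostname:port -g myGrid -ms myMapSet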
- Validate that recovery was successful.
- Run the xscmd
-c showPrimaryCatalogServer command to verify that a single
catalog server reports as the primary catalog server.
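For example (the host name and port are placeholders):
xscmd -c showPrimaryCatalogServer -cep hostname:port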
- Run the xscmd -c showPlacement command
every 15 seconds for a minute. Confirm that placement is stable and
that no changes are occurring.
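For example, the following sketch runs the command four times at 15-second intervals on a UNIX or Linux system (the host name, port, grid name, and map set name are placeholders):
for i in 1 2 3 4; do xscmd -c showPlacement -cep hostname:port -g myGrid -ms myMapSet; sleep 15; done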
- Run the xscmd -c routetable command.
This command displays the current route table by simulating a new
client connection to the data grid. It also validates the route table
by confirming that all container servers recognize their role
in the route table, such as which type of shard they host for each partition.
xscmd -c routetable -cep hostname:port -g myGrid
- Run the xscmd -c showMapSizes command
to track that data is flowing to the data grid as expected. Verify
that key distribution is uniform over the shards. If some
container servers have more keys than others, the hash function
on the key objects likely has a poor distribution.
xscmd -c showMapSizes -cep hostname:port -g myGrid -ms myMapSet
- Run the xscmd -c revisions command.
If any revisions come back in the list, the primary and replica
pairs are not completely replicated. Depending on your load, incomplete
replication is okay. However, if the difference between the
revision numbers of a primary and its replica grows over time,
replication might not be working, or it might be struggling to keep
up with your load. Run this command multiple times to watch for trends.
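For example (the host name and port are placeholders):
xscmd -c revisions -cep hostname:port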
- Run the xscmd -c listCoreGroups command
to display a list of all the core groups for the catalog server.
xscmd -c listCoreGroups -cep hostname:port