Troubleshooting high availability

Use this information to troubleshoot high availability.

Procedure

  • Problem: Multiple catalog servers try to act as the primary catalog server.

    Cause: When quorum is not enabled and a brownout occurs, multiple catalog servers might declare themselves as the primary catalog server. As a result, a split brain condition occurs in the environment.

    Diagnosis: The following messages indicate that multiple catalog servers have declared themselves as the primary catalog server:
    CWOBJ1506E: More than one primary replication group member exists in this group ({1}). Only one primary can be active. ({0})
    CWOBJ0002W: ObjectGrid component, {1}, is ignoring an unexpected exception: Multiple primaries. Network partition!
    CWOBJ8106I: The master catalog service cluster activated with cluster {0}

    Solution: When process-to-process communication between the catalog servers is interrupted, determine which catalog servers declared themselves as the primary catalog server, and use the logs to decide which of those servers to keep as the single primary. End the processes of the remaining catalog servers that declared themselves as the primary. For more information, see Managing data center failures when quorum is not enabled.

  • Problem: Quorum is lost among the catalog servers.

    Cause: Quorum is lost when enough catalog servers fail or can no longer communicate with each other, for example during a brownout or other network partition in the data center. While quorum is lost, the catalog service ignores container server life cycle events.

    Diagnosis: The following message indicates that quorum is lost:
    CWOBJ1254W: The catalog service is waiting for quorum.

    Solution: Based on your analysis of the failure scenario, temporarily override quorum so that container server life cycle events are processed again. Determine which catalog servers are having communication issues, and remove those servers from the configuration. Then, reenable quorum on the remaining catalog servers. For more information, see Managing data center failures when quorum is enabled.

  • Problem: Data loss might occur during a container server administrative shutdown.

    Cause: During container server shutdown, primary shards are moved off the container server that is stopping by promoting replica shards on other container servers to primary shards. At the same time, replica shards are moved off the stopping container server by creating new replica shards on running container servers. If the shards are not successfully moved off the stopping container server within 1 minute, the container server is shut down anyway, and any shards that remain on it are lost.

    Diagnosis: The following messages in the SystemOut.log file indicate potential data loss:
    CWOBJ1122W: A timeout occurred while waiting for shards to be moved off server. The following shards are remaining: {0}
    CWOBJ1129W: Some of the shards were not removed before the container terminate completed on {0} container. Shards left: {1}

    Solution: The data in the data grid is lost when you see these messages. You must reload the data into the data grid.
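
    How you reload the data depends on how the data grid is normally populated, for example through a configured loader or a client preload program. The following sketch assumes a simple client preload program that reinserts records through the ObjectGrid client API; the grid name Grid, the map name Map1, the catalog service endpoint cathost.example.com:2809, and the sample records are placeholders for illustration only.

    import com.ibm.websphere.objectgrid.ClientClusterContext;
    import com.ibm.websphere.objectgrid.ObjectGrid;
    import com.ibm.websphere.objectgrid.ObjectGridManager;
    import com.ibm.websphere.objectgrid.ObjectGridManagerFactory;
    import com.ibm.websphere.objectgrid.ObjectMap;
    import com.ibm.websphere.objectgrid.Session;

    public class GridReload {
        public static void main(String[] args) throws Exception {
            // Connect to the catalog service and obtain a client-side reference to the data grid.
            ObjectGridManager manager = ObjectGridManagerFactory.getObjectGridManager();
            ClientClusterContext context = manager.connect("cathost.example.com:2809", null, null);
            ObjectGrid grid = manager.getObjectGrid(context, "Grid");

            // Reinsert the lost records in a single transaction.
            Session session = grid.getSession();
            ObjectMap map = session.getMap("Map1");
            session.begin();
            map.insert("key1", "value1"); // placeholder records; reload from your system of record
            map.insert("key2", "value2");
            session.commit();
        }
    }

    In a production environment, the data grid is more commonly repopulated through its configured loader or the application's normal preload process rather than by hand-written inserts.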