Troubleshooting multiple data center configurations

Use this information to troubleshoot multiple data center configurations, including linking between catalog service domains.

Before you begin

You must use the xscmd utility to troubleshoot your multiple data center configurations. For more information, see Administering with the xscmd utility.

Procedure

  • [Version 8.6 and later] Problem: You must determine whether data replication is synchronized across container servers and catalog service domains.

    Solution: Run the xscmd -c showReplicationState or xscmd -c showDomainReplicationState command. These commands display information about the status of replication in the environment. For more information, see Monitoring with the xscmd utility.

  • [Version 8.6 and later] Problem: You must check which catalog service domains are linked to your local catalog service domain.

    Solution: Run the xscmd -c showLinkedDomains command. This command lists the foreign catalog service domains that are linked to the local catalog service domain.

  • [Version 8.6 and later] Problem: You want to detect configuration problems with the links between your primary shards and foreign catalog service domains without reading through the entire output of the xscmd -c showLinkedPrimaries command.

    Solution: Use the -hc or --linkHealthCheck option with this command. For example, run xscmd -c showLinkedPrimaries -hc or xscmd -c showLinkedPrimaries --linkHealthCheck. The command verifies that the primary shards have the appropriate number of catalog service domain links and lists any primary shards that have the wrong number of links. If all primary shards are linked correctly (for example, if your domain is linked to one other domain, each primary shard is expected to have one link), a message indicates that they are linked:

    CWXSI0092I: All primary shards for {0} data grid and {1} map set have the correct number of links to foreign primary shards.

    If you discover problems, try some of the following possible solutions:
    • Review your network and firewall settings to ensure that the servers that are hosting container servers in the domains can communicate with each other.
    • Review the SystemOut and FFDC logs for the primary shards with the incorrect links for more specific error messages.
    • Close and re-establish the link between the domains.
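The health check described above can be scripted. The following is a minimal sketch, not a definitive implementation. It assumes that an executable named xscmd is on the PATH and that no additional connection options are required; both are assumptions, and the helper names run_health_check and links_healthy are hypothetical. The check only looks for the CWXSI0092I message ID in the command output.

```python
import subprocess

# Message ID that the health check prints when every primary shard has the
# expected number of links (taken from the text above).
ALL_LINKED_ID = "CWXSI0092I"


def run_health_check(xscmd="xscmd"):
    """Run the link health check and return its standard output.

    Assumption: `xscmd` resolves to the xscmd utility on this system and
    needs no extra connection options.
    """
    result = subprocess.run(
        [xscmd, "-c", "showLinkedPrimaries", "-hc"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def links_healthy(xscmd_output):
    """Return True if the output contains the all-linked message ID."""
    return ALL_LINKED_ID in xscmd_output
```

A caller might pass the result of run_health_check() to links_healthy() and review the network, logs, and links as described above only when the result is False.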
  • Problem: Data is missing in one or more catalog service domains. For example, you run the xscmd -c establishLink command, but when you compare the data in each linked catalog service domain, for example with the xscmd -c showMapSizes command, the data looks different.

    Solution: You can troubleshoot this problem with the xscmd -c showLinkedPrimaries command. This command prints each primary shard and the foreign primary shards to which it is linked.

    In the described scenario, you might discover from running the xscmd -c showLinkedPrimaries command that the first catalog service domain primary shards are linked to the second catalog service domain primary shards, but the second catalog service domain does not have links to the first catalog service domain. You might consider rerunning the xscmd -c establishLink command from the second catalog service domain to the first catalog service domain.
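The one-directional link in this scenario can be expressed as a simple symmetry check. The sketch below is illustrative only: it assumes that you have already collected the observed links as (local domain, foreign domain) pairs, for example by reading the showLinkedPrimaries output in each domain. The pair representation and the function name missing_reverse_links are hypothetical and are not part of the product.

```python
def missing_reverse_links(links):
    """Return the reverse links that are absent from the observed set.

    `links` is a set of (local_domain, foreign_domain) pairs. For every
    observed link A -> B, the link B -> A is expected to exist as well.
    """
    return sorted(
        (foreign, local)
        for (local, foreign) in links
        if (foreign, local) not in links
    )
```

For the scenario above, an input of {("domain1", "domain2")} yields [("domain2", "domain1")], which suggests establishing the link from the second domain to the first.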

  • Problem: The catalog service domains are not replicating data. The output of the showMapSizes or showDomainReplicationState command does not match between the catalog service domains as expected. The showLinkedPrimaries command shows links in the recovery state instead of the online state.

    Diagnosis: Investigate the multi-master links between the primary shards in the recovery state. The recovery state indicates that WebSphere eXtreme Scale cannot successfully replicate between the primary shards in each catalog service domain. When a primary shard encounters an exception, it goes into an auto-recovery state and sends a ping to the foreign primary shard. If the ping is successful, replication starts again. If the ping fails, the primary shard sleeps and pings again in the future. Each primary shard is responsible for maintaining replication with its foreign primary in the foreign domain. For example, the primary shard for partition 1 in domain 1 replicates directly with the primary shard for partition 1 in domain 2.

    1. Review the output for the command showLinkedPrimaries and locate a shard in recovery state. Example output:
      CWXSI0068I: Executing command: showLinkedPrimaries
      CWXSI0091I: Verifying the primary shards have the correct number of links to foreign primary shards.
      
      *** Displaying results for inventory data grid and aSet map set. Expected number of online links: 1.
      
      *** Listing Primary Shards with the incorrect number of links for local domain: 
      domain1, Container: server0_C-0, Server: server0, 
      Host: myHost.rchland.ibm.com ***
      
      Grid Name Map Set Name Partition Domain  Container      Status   
      --------- ------------ --------- ------  ---------      -------  
      inventory aSet         0         domain2 server20_C-1  recovery
      inventory aSet         1         domain2 server20_C-1  recovery
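Locating the shards in recovery state can be automated by parsing output like the preceding example. This sketch assumes the column layout shown in the example (six whitespace-separated fields per data row, with a numeric partition ID in the third column) and grid and map set names that contain no spaces. The function name shards_in_recovery is hypothetical.

```python
def shards_in_recovery(output):
    """Return one dict per link whose Status column is 'recovery'."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly six fields; header and separator rows are
        # skipped because their third field is not a number.
        if len(fields) != 6 or not fields[2].isdigit():
            continue
        grid, map_set, partition, domain, container, status = fields
        if status.lower() == "recovery":
            rows.append(
                {
                    # Same format as the shard identification string in the logs.
                    "shard_id": f"{grid}:{map_set}:{partition}",
                    "foreign_domain": domain,
                    "foreign_container": container,
                }
            )
    return rows
```

The shard_id value uses the objectGridName:mapSetName:partitionID format, so it can be used directly as a search string in the server logs.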
      
    2. Review the SystemOut or JVM logs and FFDC of a link in recovery state.
      In the showLinkedPrimaries example that is provided, take note of the first entry, that is partition 0, for the grid inventory and map set aSet. The local primary shard for partition 0 runs on server0 and the foreign primary shard for partition 0 runs on server20. To find out more information about the link, locate the SystemOut or JVM log file for server0. Search the file for the inventory grid for partition 0. To aid in the search, the shard identification string is formatted as objectGridName:mapSetName:partitionID in the log. In this case, the shard identification string is inventory:aSet:0. You should search for several messages in the CWOBJ1500-CWOBJ1599 range. The relevant messages for this showLinkedPrimaries example include CWOBJ1511I, CWOBJ1542I, CWOBJ1550W and CWOBJ1551I.

      Example log messages:

      ReplicatedPar I   CWOBJ1511I: inventory:aSet:0 (primary) is open for business. 
      PrimaryShardI I   CWOBJ1542I: Primary inventory:aSet:0 started or continued replicating 
      from foreign primary (domain2:server20_C-1). Replicating for maps: [movie, book] 
      PrimaryShardI W   CWOBJ1550W: 
      The primary (inventory:aSet:0) shard received exceptions while replicating from the primary shard on 
      the domain2:server20_C-1 primary container. 
      The primary shard continues to poll the primary shard. 
      Exception received: org.omg.CORBA.NO_RESPONSE: Request 180 timed out vmcid: IBM minor code:B01 completed: Maybe 	
      at com.ibm.rmi.iiop.Connection.getCallStream(Connection.java:2339) 	
      at com.ibm.rmi.iiop.Connection.send(Connection.java:2266) 	
      at com.ibm.rmi.iiop.ClientRequestImpl.invoke(ClientRequestImpl.java:330) 	
      at com.ibm.rmi.corba.ClientDelegate.invoke(ClientDelegate.java:445) 	
      at com.ibm.CORBA.iiop.ClientDelegate.invoke(ClientDelegate.java:1193) 	
      at com.ibm.rmi.corba.ClientDelegate.invoke(ClientDelegate.java:800) 	
      at com.ibm.CORBA.iiop.ClientDelegate.invoke(ClientDelegate.java:1223) 	
      at org.omg.CORBA.portable.ObjectImpl._invoke(ObjectImpl.java:484) 	
      at com.ibm.ws.objectgrid.partition._IDLPrimaryShardStub.queryRevision(_IDLPrimaryShardStub.java:420) 	
      at com.ibm.ws.objectgrid.partition.IDLPrimaryShardWrapperImpl.queryRevision(IDLPrimaryShardWrapperImpl.java:96) 	
      at com.ibm.ws.objectgrid.replication.PrimaryShardImpl$RevisionQueryHandler.run(PrimaryShardImpl.java:4209) 	
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) 	
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) 	
      at com.ibm.ws.objectgrid.thread.XSThreadPool$Worker.run(XSThreadPool.java:309) 
      CWOBJ1551I: Primary inventory:aSet:0 successfully recovered and replicated after several exceptions from the primary on domain2:server20_C-1. 

      [Version 8.6 and later] When a multi-master replication (MMR) link is in recovery state, look for CWOBJ1550W messages. These messages contain the exception received during replication. If the primary shard automatically recovers, a CWOBJ1551I message occurs.
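The log search in this step can be sketched as a small filter. The sketch assumes that each log message occupies a single line in the file (the example above is wrapped for display) and that message IDs in the CWOBJ1500-CWOBJ1599 range end with one uppercase severity letter. The function name mmr_messages_for_shard is hypothetical.

```python
import re

# Message IDs in the CWOBJ1500-CWOBJ1599 range, followed by a severity letter.
MMR_MESSAGE = re.compile(r"CWOBJ15\d{2}[A-Z]")


def mmr_messages_for_shard(log_lines, shard_id):
    """Return the CWOBJ15xx message IDs logged for one shard, in log order."""
    # The negative lookahead keeps inventory:aSet:1 from matching inventory:aSet:10.
    shard = re.compile(re.escape(shard_id) + r"(?!\d)")
    hits = []
    for line in log_lines:
        message = MMR_MESSAGE.search(line)
        if message and shard.search(line):
            hits.append(message.group(0))
    return hits
```

A CWOBJ1550W entry without a later CWOBJ1551I entry for the same shard suggests that the link has not yet recovered.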
    3. [Version 8.6 and later] Review the SystemOut or JVM logs and FFDC of a link in recovery state on the foreign domain side.
      It is important to review the foreign primary side as well to see whether there are companion messages. If an org.omg.CORBA.NO_RESPONSE or com.ibm.ws.xsspi.xio.exception.MessageTimeOutException exception occurs, then general network issues, hung threads, database problems, or other exceptions that prevent a timely response to the caller might be the cause of the problem. To review the foreign primary side, return to the showLinkedPrimaries command output and find the server name from the foreign domain. In the provided example, the foreign primary is running on server20 in domain2. Search for the same shard identification string, inventory:aSet:0, in the SystemOut or JVM logs and the FFDC. Also, look for CWOBJ7853W messages that indicate hung threads. You should also look for HMGR0152W messages that indicate processor starvation that can prevent the server from operating efficiently. In this example, searching through the FFDC revealed database exceptions.

      Example FFDC:
      key = java.lang.reflect.InvocationTargetException com.ibm.ws.xs.osgi.service.BackingMapServiceHandler.invoke 90 
      Exception = java.lang.reflect.InvocationTargetException 
      Source = com.ibm.ws.xs.osgi.service.BackingMapServiceHandler.invoke probeid = 90 
      Stack Dump = java.lang.reflect.InvocationTargetException 	
      at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source) 	
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) 
      at java.lang.reflect.Method.invoke(Method.java:611) 	
      at com.ibm.ws.xs.osgi.service.XSServiceHandler.invoke(XSServiceHandler.java:87) 	
      at com.ibm.ws.xs.osgi.service.BackingMapServiceHandler.invoke(BackingMapServiceHandler.java:74) 	
      at $Proxy39.batchUpdate(Unknown Source) 	
      at com.ibm.ws.objectgrid.map.BaseMap.applyCacheLoader(BaseMap.java:1410) 	
      at com.ibm.ws.objectgrid.ObjectMapImpl$CacheLoaderApplyPrivilegedAction.run(ObjectMapImpl.java:2189) 	
      at java.security.AccessController.doPrivileged(AccessController.java:251) 	
      at com.ibm.ws.objectgrid.ObjectMapImpl.internalFlush(ObjectMapImpl.java:1684) 	
      at com.ibm.ws.objectgrid.SessionImpl.internalFlush(SessionImpl.java:2770) 	
      at com.ibm.ws.objectgrid.SessionImpl.commit(SessionImpl.java:1566) 	
      at com.ibm.ws.objectgrid.ObjectGridImpl.applyRevision(ObjectGridImpl.java:5923) 	
      at com.ibm.ws.objectgrid.replication.PrimaryShardImpl$RevisionQueryHandler.run(PrimaryShardImpl.java:4138) 	
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:897) 	
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919) 	
      at java.lang.Thread.run(Thread.java:736) 
      Caused by: com.ibm.websphere.objectgrid.plugins.LoaderException: ... 
      Caused by: java.sql.BatchUpdateException: ORA-01013: user requested cancel of current operation ...
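The review of the foreign side can be partly scripted. The following sketch extracts the exception class names from the "Caused by:" lines of an FFDC entry and flags the two warning IDs named in this step. It assumes that each "Caused by:" entry starts its own line, as in the example; the function names root_causes and warning_hints are hypothetical.

```python
# Warning IDs and their meaning, as described in this step.
WARNING_IDS = {
    "CWOBJ7853W": "hung threads",
    "HMGR0152W": "processor starvation",
}


def root_causes(ffdc_text):
    """Return the exception class names from 'Caused by:' lines, in order."""
    causes = []
    prefix = "Caused by:"
    for line in ffdc_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            # The class name is the text between the prefix and the next colon.
            causes.append(stripped[len(prefix):].split(":", 1)[0].strip())
    return causes


def warning_hints(log_text):
    """Return the meanings of any known warning IDs found in the log text."""
    return sorted(meaning for msg_id, meaning in WARNING_IDS.items() if msg_id in log_text)
```

For the example FFDC, the last entry returned by root_causes is java.sql.BatchUpdateException, which points to the database as the place to investigate.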
    [Version 8.6 and later] Exceptions and workarounds
    • (ORB) org.omg.CORBA.NO_RESPONSE
    • (XIO) com.ibm.ws.xsspi.xio.exception.MessageTimeOutException
    These exceptions indicate that the transport layer could not determine whether a successful connection was made. Alternatively, the connection might have been successful, but a response did not arrive within the configured timeout.

    Consider checking for the following issues:

    • Are there any network problems that prevent connections? For example, the network might be intermittently down, a firewall might block ports, or the DNS service might have intermittent problems. The ORB or XIO port must be open between the two containers that are replicating data in a multi-master environment. The primary shards on the container servers connect directly to each other.
    • Are there CWOBJ messages that indicate hung threads on the remote container, such as CWOBJ7853W? If the domain uses a database, search for database-related exceptions on the container servers, such as com.ibm.websphere.objectgrid.plugins.LoaderException or java.sql.BatchUpdateException, and resolve the database problem.
    • (XIO) com.ibm.ws.xsspi.xio.exception.ConnectionRefusedException
    • (ORB) org.omg.CORBA.TRANSIENT
    • (ORB) org.omg.CORBA.COMM_FAILURE
    These exceptions indicate that the remote server cannot be contacted and that the JVM process is gone. This condition is normally temporary: the remote primary shard fails over to a new location, and the links are updated.
    If the link does not recover, then consider the following steps:
    1. Check whether either domain has quorum enabled and whether the system is out of quorum. Issue the showQuorumStatus command. For more information, see Managing data center failures when quorum is enabled.
      1. If the domain is out of quorum, placement changes are not done.
      2. If the link does not recover and quorum is not the issue, check if the foreign primary is placed in a new location.
    2. Review the showPlacement and routetable command output for the foreign primary shard.
      1. If the foreign primary is not placed or marked as "not reachable" in the routetable output, then run the triggerPlacement command in the foreign domain.
      2. If the foreign primary shard is placed and reachable on a new container server, then run triggerPlacement on the local domain.
    • (ORB) org.omg.CORBA.OBJECT_NOT_EXIST
    • (XIO) com.ibm.ws.xsspi.xio.exception.ActorNotFoundException
    • (XIO) com.ibm.ws.xsspi.xio.exception.InvalidXIORefException
    These exceptions indicate that the remote server can be contacted, but the foreign primary shard was not found. This condition is normally temporary: the remote primary shard fails over to a new location, and the links are updated.
    If the link does not recover, consider the following steps:
    1. Check whether either domain has quorum enabled and whether the system is out of quorum. Issue the showQuorumStatus command. For more information, see Managing data center failures when quorum is enabled.
      1. If the domain is out of quorum, placement changes are not done.
      2. If the link does not recover and quorum is not the issue, check if the foreign primary is placed in a new location.
    2. Check to see whether the foreign primary is placed in a new location. Review the showPlacement and routetable command output for the foreign primary shard.
      1. If the foreign primary is not placed or marked as "not reachable" in the route table output, issue the triggerPlacement command in the foreign domain.
      2. If the foreign primary shard is placed and reachable on a new container server, run triggerPlacement on the local domain.
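The exception groups above can be summarized as a lookup that maps an exception found in a log line to the kind of problem it suggests. This is a sketch for triage scripts only: the group labels (timeout, server_unreachable, shard_not_found) and the function name classify_exception are hypothetical, while the exception class names are the ones listed above.

```python
# Exception class name -> hypothetical label for the group it belongs to.
EXCEPTION_GROUPS = {
    # No response within the configured timeout, or connection state unknown.
    "org.omg.CORBA.NO_RESPONSE": "timeout",
    "com.ibm.ws.xsspi.xio.exception.MessageTimeOutException": "timeout",
    # Remote server cannot be contacted.
    "com.ibm.ws.xsspi.xio.exception.ConnectionRefusedException": "server_unreachable",
    "org.omg.CORBA.TRANSIENT": "server_unreachable",
    "org.omg.CORBA.COMM_FAILURE": "server_unreachable",
    # Remote server reachable, but the foreign primary shard was not found.
    "org.omg.CORBA.OBJECT_NOT_EXIST": "shard_not_found",
    "com.ibm.ws.xsspi.xio.exception.ActorNotFoundException": "shard_not_found",
    "com.ibm.ws.xsspi.xio.exception.InvalidXIORefException": "shard_not_found",
}


def classify_exception(log_line):
    """Return the group label for the first known exception in the line, or None."""
    for class_name, group in EXCEPTION_GROUPS.items():
        if class_name in log_line:
            return group
    return None
```

A timeout result points to the network, hung thread, and database checks, while the other two groups point to the quorum and placement steps.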
  • [Version 8.6.0.6 and later] Problem: The multi-master replication link was dismissed, but the foreign domain or collective could not be contacted. The link is displayed in the DISMISSING_LINK state in the monitoring console or in the output of the xscmd -c showLinkedDomains -v command. The foreign domain or collective cannot be restarted or contacted to resolve the dismiss link request. The link stays in the DISMISSING_LINK state because the local domain keeps retrying the connection to the foreign domain to complete the dismissal request.

    Solution: Run the xscmd -c dismissLink command with the -force option. The command attempts to dismiss the link with the foreign domain once and then cleans up the local domain.