You can configure the amount of time between system checks for failed servers with the
heartbeat interval setting. This setting applies to catalog servers only.
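As a sketch, a stand-alone catalog server properties file might set the heartbeat interval as follows. The file name, the heartBeatFrequencyLevel property name, and the value range are assumptions based on typical eXtreme Scale configurations; verify them against the server properties reference for your release:

```properties
# objectGridServer.properties for a catalog server (hypothetical file name).
# heartBeatFrequencyLevel controls how aggressively failed servers are
# detected (assumed values):
#   -1 = aggressive: faster failure detection, more network traffic
#    0 = typical: the default balance
#    1 = relaxed: slower failure detection, less overhead
heartBeatFrequencyLevel=0
```

A more aggressive level shortens failover time at the cost of extra heartbeat traffic and a higher risk of false failure detections under load.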
About this task
Configuring failover depends on the type of environment that you are using. In a stand-alone environment, you can configure failover from the command line. In a WebSphere® Application Server Network Deployment environment, you must configure failover in the WebSphere Application Server Network Deployment administrative console.
With the failover management that is available with XIO, the high availability (HA) manager still runs and core groups still exist, but they are ignored. When you use WebSphere Application Server, you can disable the HA manager and Distribution and Consistency Services (DCS) on container servers.
However, the HA manager and DCS must remain enabled on the catalog servers.
Tip: If you disable the HA manager and DCS on container servers, make sure that your catalog servers and container servers are not in the same core group.
Otherwise, the catalog servers, which still run the HA manager and DCS, mistakenly detect the container servers as failed.
With this model, the first layer of failure detection is the loss of an XIO socket connection. No tuning is
required here. When the socket connection is lost, failure recovery for the container server
commences. In addition, when clients have difficulty communicating with container servers, they not only ask
for new routes, but also tell the catalog server which containers did not respond in time. The
client-side XIO request timeout, XIO connection timeout, and
WebSphere eXtreme Scale request retry timeout together dictate when the client times
out and reports the problem to the catalog server. Container servers also periodically check in with
any of the catalog servers, and replica catalog servers report these container checks to the primary catalog server.
When clients report issues, or when containers fail to check in with the catalog servers, the catalog
server submits a call to the container to get a response. If no response occurs, then failure
recovery for the container server commences. The XIO request timeout and XIO connection timeout
determine how long the catalog server gives the container server to respond.
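For example, a client properties file might shorten the client-side timeouts so that clients report unresponsive containers to the catalog server sooner. The property names and units below (xioTimeout, xioRequestTimeout, requestRetryTimeout) are assumptions based on common eXtreme Scale client configuration; check the client properties reference for your release before you use them:

```properties
# objectGridClient.properties (hypothetical values).

# XIO connection timeout, in seconds (assumed unit).
xioTimeout=30

# XIO request timeout, in milliseconds (assumed unit).
xioRequestTimeout=30000

# Total time, in milliseconds, that a request is retried before the
# client reports a failure.
requestRetryTimeout=45000
```

Shorter values speed up failure reporting but increase the chance that a slow, healthy container is reported as failed.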
Table 1. Layers of failure detection

Detection layer   | Description
Socket connection | No tuning required. When the socket connection is lost, failure recovery for the container server commences.
Client report     | Request, connection, and request retry timeout values determine when the client times out and tells the catalog server that a problem occurred.
Container check   | Containers periodically check in with any of the catalog servers, which submit a call to containers to get a response. No response means that failure recovery commences. XIO request and connection timeout values determine how long the catalog gives the container to respond.
What to do next
When you modify these settings to provide short failover times, be aware of some system-tuning
issues. First, Java™ is not a real-time environment. Threads can be delayed if the JVM experiences
long garbage collection pauses, or if the machine that hosts the JVM is heavily loaded (by the JVM
itself or by other processes that run on the machine). If threads are delayed, heartbeats might not
be sent on time; in the worst case, they might be delayed by the entire required failover time,
which causes false failure detections. Tune and size the system to ensure that false failure
detections do not happen in production. Adequate load testing is the best way to verify this
behavior.
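One way to check whether garbage collection pauses are long enough to delay heartbeats is to run the server JVM with verbose garbage collection logging enabled. The -verbose:gc flag is a standard JVM option; the launch command itself is illustrative only:

```shell
# Log each garbage collection event so that long pauses, which can delay
# heartbeats, are visible in the server output.
# The class path and main class are hypothetical placeholders.
java -verbose:gc -cp app.jar com.example.Server
```

During load testing, compare the longest reported pause times against your configured failover window to confirm that garbage collection cannot trigger false failure detections.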
Note: The current version of eXtreme Scale
supports WebSphere Real Time.