Tuning the heartbeat interval setting for failover detection

You can configure the amount of time between system checks for failed servers with the heartbeat interval setting. This setting applies to catalog servers only.

About this task

Configuring failover depends on the type of environment you are using. If you are using a stand-alone environment, you can configure failover with the command line. If you are using a WebSphere® Application Server Network Deployment environment, you must configure failover in the WebSphere Application Server Network Deployment administrative console.
[Version 8.6.0.6 and later]With the failover management available with XIO, the high availability (HA) manager is still running, and core groups exist, but they are ignored. You can disable the HA manager and Distribution and Consistency Services (DCS) on containers when you use WebSphere Application Server. However, the HA manager and DCS must remain enabled on the catalog servers.
Tip: If you disable the HA manager and DCS on containers, make sure that your catalog and container servers are not in the same core group. Otherwise, the catalog servers, which are still running the HA manager and DCS, will mistakenly assume that the containers are down.
[Version 8.6.0.6 and later]With this model, the first layer of failure detection is to identify lost XIO socket connections. No tuning is required here. When the socket connection is lost, failure recovery for the container server commences. When clients have difficulty communicating with containers, in addition to asking for new routes, they tell the catalog which containers did not respond in time. The client-side XIO request timeout, XIO connection timeout, and WebSphere eXtreme Scale request retry timeout all dictate when the client times out and tells the catalog server that a problem occurred. Also, containers periodically check in with any of the catalog servers, and replica catalogs report container check-ins to the primary catalog. Whether clients report problems or containers fail to check in, the catalog server submits a call to the container to get a response. If no response occurs, failure recovery for the container server commences. How long the catalog gives the container server to respond is determined by the XIO request timeout and XIO connection timeout.
Table 1. Layers of failure detection
Detection Layer Description
Socket connection No tuning required. When the socket connection is lost, failure recovery for the container server commences.
Client report Request, connection, and request retry timeout values determine when the client times out and tells the catalog server that a problem occurred.
Container check Containers periodically check with any of the catalog servers, which submit a call to containers to get a response. No response means that failure recovery continues. XIO request and connection timeout values determine how long the catalog gives the container to respond.
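The client report layer is driven by timeout values in the client properties file. As a minimal sketch, the fragment below writes a client properties file that sets the request retry timeout; the value shown is illustrative only, and you should confirm the property name and units for your product version:

```shell
# Illustrative sketch: set the client-side request retry timeout, which
# feeds the "client report" detection layer. The 10000 ms value is an
# example only; tune it for your environment.
cat > objectGridClient.properties <<'EOF'
requestRetryTimeout=10000
EOF
grep requestRetryTimeout objectGridClient.properties
```

A shorter retry timeout makes clients report unresponsive containers to the catalog sooner, at the cost of more retries during transient slowdowns.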

Procedure

  • Configure failover for stand-alone environments.
    [Version 8.6.0.6 and later]Note:

    With XIO, the -heartbeat parameter is still used for failure detection among catalog servers, and this parameter has the same behavior as with the ORB transport. With XIO and the containers, instead of controlling how often containers in an HA core group check in with each other, the -heartbeat parameter controls how often the containers check in with the catalogs. The same time values apply for the different levels of failover that are available when the containers check in with the catalogs. If the parameter is not specified, a default value of 30 seconds is used for the container to check in with the catalogs.

    Otherwise, the following information about HA core group activity applies only to environments that use the ORB transport.
    • With the -heartbeat parameter in the startOgServer [Version 8.6 and later]or startXsServer script when you start the catalog server.
    • With the heartBeatFrequencyLevel property in the server properties file for the catalog server.

    Use one of the following values:

    Table 2. Valid heartbeat values. Values from -1 for aggressive heartbeat to 1 for relaxed heartbeat specify how often a server failover is detected.
    Value Action Description
    -1 Aggressive Specifies an aggressive heartbeat level. With this value, failures are detected more quickly, but more processor and network resources are used. This level is more sensitive to missing heartbeats when the server is busy. Failovers are typically detected within 5 seconds.
    [Version 8.6.0.2 and later]-10 Semi-aggressive Failovers are typically detected within 15 seconds.
    0 Typical (default) Specifies a heartbeat level at a typical rate. With this value, failover detection occurs at a reasonable rate without overusing resources. Failovers are typically detected within 30 seconds.
    [Version 8.6.0.2 and later]10 Semi-relaxed Failovers are typically detected within 90 seconds.
    1 Relaxed Specifies a relaxed heartbeat level. With this value, a decreased heartbeat frequency increases the time to detect failures, but also decreases processor and network use. Failovers are typically detected within 180 seconds.

    An aggressive heartbeat interval can be useful when the processes and network are stable. If the network or processes are not optimally configured, heartbeats might be missed, which can result in a false failure detection.
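As a sketch of both configuration options, the commands below set an aggressive heartbeat on the command line and in the server properties file. The server name, host, and ports are placeholders, and the start command is commented out because it requires a product installation:

```shell
# Option 1: pass -heartbeat when starting the catalog server (placeholder
# server name, host, and ports; requires a product installation to run):
# ./startXsServer.sh cs1 -catalogServiceEndPoints cs1:host1:6600:6601 -heartbeat -1

# Option 2: set the equivalent property in the catalog server properties file.
cat > objectGridServer.properties <<'EOF'
heartBeatFrequencyLevel=-1
EOF
grep heartBeatFrequencyLevel objectGridServer.properties
```

With either option, a value of -1 trades higher processor and network use for failover detection within about 5 seconds, as shown in Table 2.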

  • Configure failover for WebSphere Application Server environments.
    [Version 8.6.0.6 and later]Note: With the high availability management available with XIO, you can disable the high availability (HA) manager and Distribution and Consistency Services (DCS) on containers when you use WebSphere Application Server. However, the HA manager and DCS must remain enabled on the catalog servers.

    You can configure WebSphere Application Server Network Deployment Version 7.0 and later to allow WebSphere eXtreme Scale to fail over very quickly. The default failover time for hard failures is approximately 200 seconds. A hard failure is a physical computer or server crash, a network cable disconnection, or an operating system error. Soft failures, such as process crashes, typically fail over in less than one second because the operating system on the server that hosts the process closes the network sockets of the dead process automatically.

    Core group heartbeat configuration
    [Version 8.6.0.6 and later]Note: Core groups still exist, but they are ignored because high availability management is available with XIO; there is no longer a dependency on WebSphere Application Server to inherit failover characteristics from the core group settings. Therefore, the following information applies to environments that use the ORB transport only.

    WebSphere eXtreme Scale running in a WebSphere Application Server process inherits the failover characteristics from the core group settings of the application server. The following sections describe how to configure the core group heartbeat settings for different versions of WebSphere Application Server Network Deployment:

    • Update the core group settings for WebSphere Application Server Network Deployment Version 7.0

      WebSphere Application Server Network Deployment Version 7.0 provides two core group settings that can be adjusted to increase or decrease failover detection:

      • Heartbeat transmission period. The default is 30000 milliseconds.
      • Heartbeat timeout period. The default is 180000 milliseconds.

      For more details on how to change these settings, see the WebSphere Application Server Network Deployment Information Center: Discovery and failure detection settings.

      Use the following settings to achieve a 1500 ms failure detection time for WebSphere Application Server Network Deployment Version 7 servers:

      • Set the heartbeat transmission period to 750 milliseconds.
      • Set the heartbeat timeout period to 1500 milliseconds.
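These two values follow a simple ratio (a rule of thumb inferred from the example above, not an official formula): the timeout matches the target detection time, and the transmission period is half of it, so roughly two heartbeats are missed before a server is declared failed. A quick sketch:

```shell
# Sketch: derive core group heartbeat settings from a target detection time.
# The ratio (timeout = target, transmission = target / 2) mirrors the
# 750 ms / 1500 ms example above and is a rule of thumb only.
target_ms=1500
timeout_ms=$target_ms
transmission_ms=$(( target_ms / 2 ))
echo "heartbeat transmission period: ${transmission_ms} ms"
echo "heartbeat timeout period: ${timeout_ms} ms"
```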

What to do next

When these settings are modified to provide short failover times, be aware of some system-tuning issues. First, Java™ is not a real-time environment. Threads can be delayed if the JVM experiences long garbage collection times, or if the machine that hosts the JVM is heavily loaded (by the JVM itself or by other processes running on the machine). If threads are delayed, heartbeats might not be sent on time; in the worst case, they might be delayed by the required failover time, which causes false failure detections. The system must be tuned and sized to ensure that false failure detections do not happen in production. Adequate load testing is the best way to ensure this.

Note: The current version of eXtreme Scale supports WebSphere Real Time.