Configuring the default Failure Detection Protocol for a core group

The default Failure Detection Protocol monitors the core group network connections that the default Discovery Protocol establishes, and notifies the default Discovery Protocol if a connection failure occurs.

Before you begin

  • Understand the concepts that are described in the topic Core group discovery and failure detection protocols.
  • Check your operating system settings that are relevant to TCP/IP socket closing events.
  • Determine your failure detection goals and which settings must change to accomplish these goals.

    The value that you specify for the Heartbeat timeout period should equal the value of the Heartbeat transmission period property multiplied by the value of the Number of missed consecutive heartbeats property.

    • The heartbeat transmission period specifies the frequency at which a core group member sends a heartbeat packet over every established connection. The default value for the heartbeat transmission period is 30 seconds.
    • The heartbeat timeout period specifies the failure detection time. If no packets are received during the specified time period, a failure is declared. The default value for the heartbeat timeout period is 180 seconds.
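
    The relationship between these values can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function name is hypothetical, and the missed-heartbeat count of 6 is inferred from the documented defaults (a 30-second transmission period and a 180-second timeout), not quoted from the product.

```python
def heartbeat_timeout_seconds(transmission_period_s, missed_heartbeats):
    """Return the timeout implied by a transmission period and a missed-heartbeat count.

    Hypothetical helper for planning purposes; it is not part of any product API.
    """
    if transmission_period_s <= 0 or missed_heartbeats <= 0:
        raise ValueError("both values must be positive")
    return transmission_period_s * missed_heartbeats


# Documented defaults: 30-second transmission period, 180-second timeout.
# The count of 6 is inferred from those two defaults (180 / 30).
default_timeout = heartbeat_timeout_seconds(30, 6)
print(default_timeout)  # 180
```

    For example, to detect a failure in roughly 30 seconds while tolerating three missed heartbeats, a 10-second transmission period would satisfy the same relationship.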

About this task

You might want to perform this task if:
  • You want to change the failover characteristics of your system.
  • Your core groups are large and analysis indicates that excessive CPU time is spent monitoring heartbeats.

The heartbeat transmission period and heartbeat timeout period are configurable. Use the administrative console or the wsadmin tool to adjust these settings if the default values are not appropriate for your environment, unless you are running in a mixed cell environment that includes core groups that contain a mixture of Version 7.0 and Version 6.x processes.

Mixed-version environment: If you are running in a mixed cell environment, and you have core groups that contain a mixture of Version 7.0 and Version 6.x processes, you must continue to use the IBM_CS_FD_PERIOD_SECS and IBM_CS_FD_CONSECUTIVE_MISSED core group custom properties to adjust these settings. To specify these custom properties:
  1. In the administrative console, click Servers > Core Groups > Core group settings > core_group_name. Then, in the Additional Properties section, click Custom properties > New.
  2. In the Name field, specify either IBM_CS_FD_PERIOD_SECS or IBM_CS_FD_CONSECUTIVE_MISSED, and then specify a new value for that property in the Value field.

    The IBM_CS_FD_PERIOD_SECS custom property specifies how frequently the Failure Detection Protocol checks the core group network connections that the discovery protocol establishes.

    The IBM_CS_FD_CONSECUTIVE_MISSED custom property specifies the number of consecutive heartbeats that a member can miss before communication with that member is discontinued.

Remember, when you use the administrative console or the wsadmin tool to configure the Failure Detection Protocol, you configure the heartbeat transmission period and the heartbeat timeout period. However, if you use the custom properties to configure the Failure Detection Protocol, you configure the heartbeat transmission period and the number of missed consecutive heartbeats.
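
Because the two configuration paths use different pairs of values, it can help to translate one pair into the other before you make a change. The following Python sketch shows one way to derive the custom property values from the console values. The function name is hypothetical, and the sketch assumes that IBM_CS_FD_PERIOD_SECS takes a whole number of seconds and that the timeout is an exact multiple of the transmission period.

```python
def to_custom_properties(transmission_ms, timeout_ms):
    """Translate console settings (milliseconds) into custom property values.

    Hypothetical helper. Assumes IBM_CS_FD_PERIOD_SECS is a whole number of
    seconds and that the timeout is an exact multiple of the period.
    """
    if transmission_ms <= 0 or timeout_ms <= 0:
        raise ValueError("both values must be positive")
    if transmission_ms % 1000 != 0:
        raise ValueError("transmission period must be a whole number of seconds")
    if timeout_ms % transmission_ms != 0:
        raise ValueError("timeout must be a multiple of the transmission period")
    return {
        # Heartbeat transmission period, converted from milliseconds to seconds.
        "IBM_CS_FD_PERIOD_SECS": transmission_ms // 1000,
        # Missed heartbeats tolerated = timeout / transmission period.
        "IBM_CS_FD_CONSECUTIVE_MISSED": timeout_ms // transmission_ms,
    }


# Console defaults from this topic: 30000 ms period, 180000 ms timeout.
print(to_custom_properties(30000, 180000))
```

The checks reject combinations that the custom properties cannot express exactly under these assumptions, such as a timeout that is not a multiple of the transmission period.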

To use the administrative console to change the settings for the default Failure Detection Protocol, complete the following steps.

Procedure

  1. In the administrative console, click Servers > Core Groups > Core group settings > core_group_name.
  2. Then, in the Additional Properties section, click Discovery and failure detection.
    The Use the default protocol providers option must be selected. If this option is not selected, do not perform any more of the steps in this task.
  3. Specify, in milliseconds, a new value for the Heartbeat transmission period property.

    The default value for this property is 30000 milliseconds, which equals 30 seconds.

  4. Specify, in milliseconds, a new value for the Heartbeat timeout period property.

    The default value for this property is 180000 milliseconds, which equals 180 seconds.

  5. Click OK and then click Review.
  6. Select Synchronize changes with nodes, and then click Save.
  7. Restart all of the members of the core group.

Results

After the servers restart, the core group members all run with the new Failure Detection Protocol settings.