IBM Support

Understanding IBM web server Plug-in failover in a clustered environment

Troubleshooting


Problem

How does failover work in the IBM web server Plug-in?

Environment

Starting with WebSphere Application Server version 8.5.5, the IBM web server Plug-in has an Intelligent Management feature that uses On Demand Configuration (ODC) information to route requests dynamically, like an On Demand Router (ODR). The information in this technote is relevant for Plug-in version 8.5.5 ONLY when the Intelligent Management feature is turned OFF.

Resolving The Problem

This document explains how IBM web server Plug-in failover works. It also describes tuning parameters and suggestions that help the IBM web server Plug-in fail over effectively and in a timely manner.

Note: The following information is written specifically for the IBM HTTP Server (IHS). However, it generally applies to other web servers that currently support the IBM web server Plug-in (for example: IIS, SunOne, Domino®, and so on).



Failover
  • Background
    In clustered IBM WebSphere Application Server environments, the IBM web server Plug-in can provide failover when it is no longer able to send requests to a particular cluster member. By default, the IBM web server Plug-in marks a cluster member "down" and fails over client requests to another cluster member that can still receive connections under the following conditions:
    • The IBM web server Plug-in is unable to establish a connection to a cluster member's Application Server transport.
    • The IBM web server Plug-in detects a newly connected socket that was prematurely closed by a cluster member during an active read or write.
    Several configurable settings in the plugin-cfg.xml file affect how quickly the IBM web server Plug-in marks a cluster member down and fails over to another cluster member.

  • ConnectTimeout
    The ConnectTimeout attribute of a Server element enables the IBM web server Plug-in to perform non-blocking connections with a back-end cluster member. Non-blocking connections are beneficial when the IBM web server Plug-in is unable to contact the destination to determine if the port is available or unavailable for a particular cluster member.

    <Server CloneID="10k66djk2" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="1000" MaxConnections="0" Name="Server1_WebSphere_Appserver" WaitForContinue="false">
        <Transport Hostname="server1.domain.com" Port="9091" Protocol="http"/>
    </Server>

    If ConnectTimeout is set to "0", the IBM web server Plug-in performs a blocking connect, in which it waits until the socket is connected or an operating system TCP timeout occurs (as long as 2 minutes, depending on the platform). Using a value of "0" is not recommended.
    A value greater than "0" specifies the number of seconds you want the IBM web server Plug-in to wait for a successful connection. If a connection does not occur after that time interval, the IBM web server Plug-in temporarily marks that cluster member "down" and fails over to one of the other members defined in the cluster. A ConnectTimeout value of "5" seconds is usually recommended.
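
The ConnectTimeout behavior described above can be modeled with a short Python sketch. This is only an illustration under stated assumptions, not the Plug-in's implementation: the function names (socket_timeout, tcp_connect, connect_with_failover) are hypothetical, and members are simply tried in list order rather than by the configured load-balancing policy.

```python
import socket


def socket_timeout(connect_timeout):
    """Map a ConnectTimeout value to a socket timeout.

    0 models a blocking connect (only the operating system TCP timeout
    applies); a positive value is the number of seconds to wait.
    """
    return None if connect_timeout == 0 else float(connect_timeout)


def tcp_connect(host, port, connect_timeout):
    """Open a TCP connection, giving up after connect_timeout seconds."""
    return socket.create_connection(
        (host, port), timeout=socket_timeout(connect_timeout)
    )


def connect_with_failover(members, connect_timeout, connect=tcp_connect):
    """Try each (host, port) in order and return (connection, marked_down).

    A member whose connect attempt fails or times out is recorded in
    marked_down, and the next member is tried (hypothetical ordering).
    """
    marked_down = []
    for host, port in members:
        try:
            return connect(host, port, connect_timeout), marked_down
        except OSError:  # covers refused connections and timeouts
            marked_down.append((host, port))
    return None, marked_down
```

The connect argument exists only so that the logic can be exercised without a network; with the default, a member that does not answer within ConnectTimeout seconds is skipped instead of blocking the request until the operating system gives up.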

  • ServerIOTimeout
    The ServerIOTimeout attribute of a Server element enables the IBM web server Plug-in to set a timeout value, in seconds, for sending requests to and reading responses from a cluster member. If ServerIOTimeout is set to "0", the IBM web server Plug-in uses blocking I/O to write requests to, and read responses from, the cluster member until the TCP connection times out. Using "0" for ServerIOTimeout is not recommended. It is much better to choose a value based on the responsiveness of the application. For example, if you set ServerIOTimeout="120", like this:

    <Server CloneID="10k66djk2" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="1000" MaxConnections="0" Name="Server1_WebSphere_Appserver" WaitForContinue="false">
        <Transport Hostname="server1.domain.com" Port="9091" Protocol="http"/>
    </Server>

    the IBM web server Plug-in will wait 120 seconds (2 minutes) before timing out the TCP connection. This allows adequate time for the application to respond to each request.

    When selecting a value for this attribute, remember that a cluster member might sometimes take a couple of minutes to process a request. Setting the ServerIOTimeout attribute too low could cause the IBM web server Plug-in to time out prematurely.

    The ServerIOTimeout value can be either positive or negative. With a positive value, the Plug-in does not mark the server down when the ServerIOTimeout expires. With a negative value, the Plug-in marks the server down when the ServerIOTimeout expires. If your application uses the HttpSession object, session affinity is in play, so a negative ServerIOTimeout value is the better choice: because the server that just timed out is marked down, the retry goes to a different application server in the cluster instead of back to the same server.
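
The sign convention can be sketched in Python as follows. This is an illustrative model, not the Plug-in's code: the function names are hypothetical, and treating the absolute value of a negative setting as the timeout length is an assumption.

```python
def interpret_server_io_timeout(value):
    """Return (timeout_seconds, mark_down_on_timeout) for a ServerIOTimeout value.

    0        -> blocking I/O, no Plug-in timeout at all
    positive -> time out after `value` seconds, server is NOT marked down
    negative -> time out after abs(value) seconds (assumed), server IS marked down
    """
    if value == 0:
        return None, False
    return abs(value), value < 0


def retry_candidates(members, timed_out, server_io_timeout):
    """List the members a retry may be sent to after `timed_out` hit the timeout.

    With a negative ServerIOTimeout the timed-out member is marked down and
    excluded; with a positive value it stays eligible, so session affinity
    could send the retry back to it.
    """
    _, mark_down = interpret_server_io_timeout(server_io_timeout)
    if mark_down:
        return [m for m in members if m != timed_out]
    return list(members)
```

For example, with members Server1 and Server2 and a timeout on Server1, a setting of -120 leaves only Server2 as a retry target, while 120 leaves both.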

  • RetryInterval
    An integer specifying the length of time that should elapse from the time that a server is marked down to the time that the IBM web server Plug-in will retry a connection to that server. The default is 60 seconds.

    This setting is specified in the ServerCluster element. An example of this in the plugin-cfg.xml file is as follows:

    <ServerCluster CloneSeparatorChange="false" LoadBalance="Round Robin"
    Name="Server_WebSphere_Cluster" PostSizeLimit="10000000" RemoveSpecialHeaders="true" RetryInterval="120">

    This would mean that if a cluster member were marked down, the IBM web server Plug-in would not try to use that server again for at least 120 seconds.

    There is no way to recommend one specific value; the value chosen depends on your environment. For example, if you have numerous cluster members, and one cluster member being unavailable does not affect the performance of your application, then you can safely set the RetryInterval to a higher value to allow that server more time to recover before the web server Plug-in will try to use it again.

    Alternatively, if your optimum load was calculated assuming that all cluster members are available, or if you have only a few cluster members, you will want them to be retried more often to maintain the load.

    Also, take into consideration the time it takes to restart your server. If a server takes a long time to boot up and load applications, then you will need a longer retry interval.
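
The effect of RetryInterval can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not the Plug-in's implementation: the ClusterMember class and eligible_members function are hypothetical, and time is passed in explicitly as seconds.

```python
class ClusterMember:
    """Tracks when a cluster member was marked down (None means it is up)."""

    def __init__(self, name):
        self.name = name
        self.marked_down_at = None

    def mark_down(self, now):
        self.marked_down_at = now

    def is_eligible(self, now, retry_interval):
        """True if the member is up, or if at least retry_interval seconds
        have elapsed since it was marked down."""
        if self.marked_down_at is None:
            return True
        return now - self.marked_down_at >= retry_interval


def eligible_members(members, now, retry_interval):
    """Names of the members that may receive a request at time `now`."""
    return [m.name for m in members if m.is_eligible(now, retry_interval)]
```

With RetryInterval="120", a member marked down at second 1000 is skipped at second 1060 and becomes a candidate again at second 1120.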

  • PrimaryServers versus BackupServers
    The IBM web server Plug-in can be configured for failover by using the PrimaryServers and BackupServers elements in the plugin-cfg.xml configuration file. When any of the primary servers are available, the web server Plug-in load balances and fails over using ONLY the primary servers. If none of the primary servers are available, the web server Plug-in forwards requests to an available backup server. When requests are directed to the BackupServers group, all requests go to a single server, which allows for exception handling when all primary servers are unavailable. There is no load balancing between backup servers.

    For example, in the following configuration, the Plug-in load balances only between Server1_WebSphere_Appserver and Server2_WebSphere_Appserver, which are defined in the PrimaryServers element. If both Server1_WebSphere_Appserver and Server2_WebSphere_Appserver become unavailable and are marked down, the IBM web server Plug-in fails over and starts sending requests to Server3_WebSphere_Appserver, which is defined in the BackupServers element.

    <ServerCluster CloneSeparatorChange="false" LoadBalance="Round Robin"
    Name="Server_WebSphere_Cluster" PostSizeLimit="10000000" RemoveSpecialHeaders="true" RetryInterval="120">

    <Server CloneID="10k66djk2" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="1000" MaxConnections="0" Name="Server1_WebSphere_Appserver" WaitForContinue="false">
    <Transport Hostname="server1.domain.com" Port="9091" Protocol="http"/>
    </Server>

    <Server CloneID="10k67eta9" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="999" MaxConnections="0" Name="Server2_WebSphere_Appserver" WaitForContinue="false">
    <Transport Hostname="server2.domain.com" Port="9091" Protocol="http"/>
    </Server>

    <Server CloneID="10k68xtw10" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="998" MaxConnections="0" Name="Server3_WebSphere_Appserver" WaitForContinue="false">
    <Transport Hostname="server3.domain.com" Port="9091" Protocol="http"/>
    </Server>

    <PrimaryServers>
    <Server Name="Server1_WebSphere_Appserver"/>
    <Server Name="Server2_WebSphere_Appserver"/>
    </PrimaryServers>
    <BackupServers>
    <Server Name="Server3_WebSphere_Appserver"/>
    </BackupServers>
    </ServerCluster>
    When at least one of the PrimaryServers becomes available, the web server Plug-in will stop using the backup server, and will resume using ONLY the primary servers.
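
The primary and backup selection rule can be summarized in a short Python sketch. It is an illustration only, not the Plug-in's implementation: the candidate_servers function and the is_available callback are hypothetical, and choosing the first available backup in listed order is an assumption.

```python
def candidate_servers(primary, backup, is_available):
    """Return the servers a request may be routed to.

    If any primary server is available, only the available primaries are
    returned (load balancing happens among them). Otherwise a single
    available backup server is returned, with no load balancing.
    """
    up_primaries = [s for s in primary if is_available(s)]
    if up_primaries:
        return up_primaries
    for server in backup:  # assumed: backups are tried in listed order
        if is_available(server):
            return [server]
    return []


# Usage with the server names from the configuration above.
primary = ["Server1_WebSphere_Appserver", "Server2_WebSphere_Appserver"]
backup = ["Server3_WebSphere_Appserver"]
marked_down = set(primary)  # both primary servers are marked down
print(candidate_servers(primary, backup, lambda s: s not in marked_down))
```

As soon as one primary server leaves the marked_down set, the function returns only primary servers again, which mirrors the return to primary-only routing described above.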



Document Information

Modified date:
15 June 2018

UID

swg21219808