IBM Support

DB2 configured with TSA in an HA environment may take a long time to fail over if DB2 is abnormally shut down on the primary

Troubleshooting


Problem

When DB2 is configured in an HA environment with TSA as the cluster manager and DB2 on the primary machine is abnormally shut down, failover to the standby machine may take longer than expected.

Symptom

In the system log (for example, /var/log/messages on Linux), you may see messages similar to the following from the start script. The script tries to start DB2 roughly every 30 seconds and every attempt fails, until TSA interrupts the script because the startCommandTimeout period has expired.

Sep 6 20:06:25 host2 db2V10_start.ksh[80004]: Entered /usr/sbin/rsct/sapolicies/db2/db2V10_start.ksh, db2inst1, 0
Sep 6 20:06:27 host2 db2V10_start.ksh[80004]: *************** Attempted to start partition db2inst1,0 and failed - rc=2, rcg=1
Sep 6 20:06:27 host2 db2V10_start.ksh[80004]: Killing db2 processes : db2inst1, 0
Sep 6 20:06:27 host2 db2V10_start.ksh[80004]: Removing IPCs

Sep 6 20:06:59 host2 db2V10_start.ksh[80004]: *************** Attempted to start partition db2inst1,0 and failed - rc=2, rcg=1
Sep 6 20:06:59 host2 db2V10_start.ksh[80004]: Killing db2 processes : db2inst1, 0
Sep 6 20:06:59 host2 db2V10_start.ksh[80004]: Removing IPCs

...<snippets>

Sep 6 20:20:58 host2 db2V10_start.ksh[80004]: *************** Attempted to start partition db2inst1,0 and failed - rc=2, rcg=1
Sep 6 20:20:58 host2 db2V10_start.ksh[80004]: Killing db2 processes : db2inst1, 0
Sep 6 20:20:58 host2 db2V10_start.ksh[80004]: Removing IPCs

...<command timed out>


Sep 6 20:21:25 host2 GblResRM[38741]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,Application.C,10.0.0.1,6584 :::GBLRESRM_COMMAND_TIMEOUT
Sep 6 20:21:25 host2 IBM.Application start/stop command timed out.
Sep 6 20:21:25 host2 Resource name
Sep 6 20:21:25 host2 db2_db2inst1_0-rs

At the same time, you may also see related db2 start error messages in db2diag.log.
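
The symptom above can be confirmed quickly by counting the failed start attempts and checking for the timeout entry in the system log. The following is a minimal shell sketch, not part of the DB2 or TSA tooling. The function names count_failed_starts and start_timed_out are hypothetical, and the match strings are taken from the sample messages above, so they may need adjusting for other versions.

```shell
# Count the failed start attempts logged by the DB2 start script.
# $1: path to the system log, for example /var/log/messages on Linux.
count_failed_starts() {
    grep -c 'Attempted to start partition .* and failed' "$1"
}

# Succeed (exit status 0) if the log contains the RSCT command timeout entry.
start_timed_out() {
    grep -q 'GBLRESRM_COMMAND_TIMEOUT' "$1"
}
```

For example, on Linux: count_failed_starts /var/log/messages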

Cause

When DB2 is abnormally shut down on the primary (for example, DB2 crashes or is killed), TSA first tries to bring it back online by calling the start script specified in the StartCommand attribute of the DB2 TSA resource. If TSA cannot bring DB2 back online on the primary machine within a certain period, it then fails over and starts DB2 on the standby machine.

However, the DB2 TSA start script (for example, db2V97_start.ksh, db2V10_start.ksh, or db2V105_start.ksh) attempts to start DB2 repeatedly in a while loop, sleeping 30 seconds between attempts, until DB2 starts successfully. If DB2 cannot be started within the TSA startCommandTimeout period, TSA times out the start script and only then fails DB2 over to the standby machine. In this case, the failover does not begin until the full startCommandTimeout period has elapsed, which might be longer than expected.
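
As a rough illustration of the arithmetic, the number of failed attempts before failover begins is approximately the timeout divided by the time each attempt takes. The sketch below uses values read off the sample log in the Symptom section: about 900 seconds between the "Entered" message and the GBLRESRM_COMMAND_TIMEOUT entry, and about 32 seconds between failed attempts. These are estimates from that one log, not documented defaults, and max_start_attempts is a hypothetical helper name.

```shell
# Rough upper bound on start attempts before TSA interrupts the start script.
# $1: timeout in seconds
# $2: seconds per attempt (the 30-second sleep plus the time the start takes)
max_start_attempts() {
    echo $(( $1 / $2 ))
}

# Estimates taken from the sample log (assumptions, not product defaults).
max_start_attempts 900 32   # prints 28
```

With these numbers, the primary keeps retrying for about 15 minutes before the standby is asked to take over.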

Diagnosing The Problem

To solve the issue:

1) First, fix whatever is preventing DB2 from starting, as indicated by the messages in db2diag.log.

2) Alternatively, as a workaround, you can reduce the TSA startCommandTimeout so that DB2 fails over to the standby machine sooner.
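
For the workaround in step 2, the timeout is an attribute of the IBM.Application resource and can typically be viewed with lsrsrc and changed with chrsrc, run as root on a cluster node. The sketch below assumes the attribute is named StartCommandTimeout, reuses the resource name db2_db2inst1_0-rs from the sample log, and uses 300 seconds purely as an illustrative value. The set_start_timeout function and its DRY_RUN switch are hypothetical helpers, so verify the resource name and attribute on your own cluster before changing anything.

```shell
# Change the start timeout of a TSA resource, or print the command when
# DRY_RUN=1 is set so it can be reviewed first.
# $1: resource name (for example db2_db2inst1_0-rs)
# $2: new timeout in whole seconds
set_start_timeout() {
    resource=$1
    timeout=$2
    case $timeout in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "chrsrc -s \"Name = '$resource'\" IBM.Application StartCommandTimeout=$timeout"
    else
        chrsrc -s "Name = '$resource'" IBM.Application StartCommandTimeout="$timeout"
    fi
}
```

To review the current value first, a command such as lsrsrc -s "Name = 'db2_db2inst1_0-rs'" IBM.Application Name StartCommandTimeout can be used. Running DRY_RUN=1 set_start_timeout db2_db2inst1_0-rs 300 prints the chrsrc command without executing it. Choose a value that still leaves enough time for a normal DB2 start on the primary.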


Document Information

Modified date:
16 June 2018

UID

swg21685935