IBM Support

IC75196: A CONTROLLED TAKEOVER IN A TSA / HADR ENV MAY BE FOLLOWED BY AN IMMEDIATE FAILBACK IF THERE IS LATENCY IN CHANGES TO RESOURCES

APAR status

  • Closed as program error.

Error description

  • In a TSA / HADR environment, the state of the database is
    monitored by scripts located under
    /usr/sbin/rsct/sapolicies/db2/.  During a user-initiated HADR
    takeover, DB2 issues requests to TSA to create, lock, and
    unlock resources.  If those operations are not propagated
    quickly enough, the monitor scripts may report the database as
    offline on both servers at a moment when no locks or flags are
    in place.  If that happens, TSA issues a second takeover, by
    force, which results in an immediate failback.  A manual
    reintegration may then be required.
    
    Note: this issue affects only controlled (non-forced)
    takeovers, and it is expected to occur only in rare cases.
    
    Node 1 syslogs:
    
    ## The database is reporting as online (return code 1):
    Jan 25 12:23:42 server-a user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[3735742]:
    Returning 1 : db2inst1 db2inst1 SAMPLE
    Jan 25 12:24:04 server-a user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[3735792]:
    Returning 1 : db2inst1 db2inst1 SAMPLE
    
    ## The takeover was issued several seconds earlier but has not
    yet finished; the monitor script runs and reports offline
    (return code 2):
    Jan 25 12:24:26 server-a user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[983346]:
    Returning 2 : db2inst1 db2inst1 SAMPLE
    Jan 25 12:24:48 server-a user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2752700]:
    Returning 2 : db2inst1 db2inst1 SAMPLE
    Jan 25 12:25:10 server-a user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[3866888]:
    Returning 2 : db2inst1 db2inst1 SAMPLE
    Jan 25 12:25:32 server-a user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2228742]:
    Returning 2 : db2inst1 db2inst1 SAMPLE
    
    Node 2 syslogs:
    
    ## This node was the standby, so reporting offline is expected:
    Jan 25 12:23:45 server-b user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2294478]:
    Returning 2 : db2inst1 db2inst1 SAMPLE
    Jan 25 12:24:07 server-b user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[1508116]:
    Returning 2 : db2inst1 db2inst1 SAMPLE
    
    ## The takeover starts and the resources are modified:
    Jan 25 12:24:07 server-b user:debug root[2687334]: Entering
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg lock
    Jan 25 12:24:13 server-b user:debug root[2294526]: Exiting
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg lock: 1
    Jan 25 12:24:23 server-b user:debug root[1769768]: Entering
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg unlock
    Jan 25 12:24:23 server-b user:debug root[3211824]: Exiting
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg unlock: 0
    Jan 25 12:24:26 server-b user:debug root[2687360]: Entering
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg lock
    
    ## The monitor on node 1 ran at this point (12:24:26) and
    returned offline.  The last status reported for node 2
    (12:24:07) is also offline, and the resources are still being
    modified:
    
    Jan 25 12:24:29 server-b user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2294596]:
    Returning 2 : db2inst1 db2inst1 SAMPLE
    Jan 25 12:24:32 server-b user:debug root[3473642]: Exiting
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg lock: 1
    Jan 25 12:24:32 server-b user:debug root[3146076]: Entering
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg unlock
    Jan 25 12:24:32 server-b user:notice
    /usr/sbin/rsct/sapolicies/db2/hadrV95_start.ksh[3211836]:
    Entering : db2inst1 db2inst1 SAMPLE
    Jan 25 12:24:32 server-b user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_start.ksh[2949514]: su -
    db2inst1 -c db2gcf -t 3600 -u -i db2inst1 -i db2inst1 -h SAMPLE
    -L
    Jan 25 12:24:32 server-b user:debug root[2163396]: Exiting
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg unlock: 0
    Jan 25 12:24:33 server-b user:notice
    /usr/sbin/rsct/sapolicies/db2/hadrV95_start.ksh[2163400]:
    Returning 0 : db2inst1 db2inst1 SAMPLE
    Jan 25 12:24:33 server-b user:debug root[3211870]: Entering
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg lock
    Jan 25 12:24:34 server-b user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2687424]:
    Returning 1 : db2inst1 db2inst1 SAMPLE
    Jan 25 12:24:39 server-b user:debug root[2687428]: Exiting
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg lock: 1
    Jan 25 12:24:40 server-b user:debug root[3211884]: Entering
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg unlock
    Jan 25 12:24:40 server-b user:debug root[2753468]: Exiting
    /usr/sbin/rsct/sapolicies/db2/lockreqprocessed
    db2_db2inst1_db2inst1_SAMPLE-rg unlock: 0
    Jan 25 12:24:56 server-b user:debug
    /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[6553680]:
    Returning 1 : db2inst1 db2inst1 SAMPLE
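
The condition described above can be checked after the fact from syslog. The following Python sketch is a hypothetical diagnostic aid, not an IBM or TSA tool. It assumes one syslog entry per line (the excerpts above are wrapped for display), matches only the hadrV95_monitor.ksh "Returning <rc>" entries, treats return code 1 as online and 2 as offline as annotated above, and reports every timestamp at which the latest monitor result on both servers is offline. It does not model the TSA lock or flag state. The names MONITOR_RE, parse_monitor_events, both_offline_times and SAMPLE_LINES are illustrative, and the year is an assumption because syslog timestamps omit it.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Return codes as annotated in the syslog excerpts above.
ONLINE, OFFLINE = 1, 2

# Hypothetical pattern for one unwrapped hadrV95_monitor.ksh syslog entry.
MONITOR_RE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) \S+ "
    r"\S*hadrV95_monitor\.ksh\[\d+\]: Returning (?P<rc>\d+) :"
)

Event = Tuple[datetime, str, int]


def parse_monitor_events(lines: Iterable[str], year: int) -> List[Event]:
    """Extract (timestamp, host, return code) from monitor entries, sorted by time."""
    events = []
    for line in lines:
        m = MONITOR_RE.match(line)
        if m:
            # Syslog omits the year, so the caller supplies it.
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((ts, m["host"], int(m["rc"])))
    return sorted(events)


def both_offline_times(events: List[Event], hosts: List[str]) -> List[datetime]:
    """Timestamps at which the most recent monitor result for every host is OFFLINE."""
    latest = {}
    hits = []
    for ts, host, rc in events:
        latest[host] = rc
        if all(latest.get(h) == OFFLINE for h in hosts):
            hits.append(ts)
    return hits


# Monitor entries from the excerpts above, joined back onto single lines.
_P = "/usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh"
SAMPLE_LINES = [
    f"Jan 25 12:24:04 server-a user:debug {_P}[3735792]: Returning 1 : db2inst1 db2inst1 SAMPLE",
    f"Jan 25 12:24:07 server-b user:debug {_P}[1508116]: Returning 2 : db2inst1 db2inst1 SAMPLE",
    f"Jan 25 12:24:26 server-a user:debug {_P}[983346]: Returning 2 : db2inst1 db2inst1 SAMPLE",
    f"Jan 25 12:24:29 server-b user:debug {_P}[2294596]: Returning 2 : db2inst1 db2inst1 SAMPLE",
    f"Jan 25 12:24:34 server-b user:debug {_P}[2687424]: Returning 1 : db2inst1 db2inst1 SAMPLE",
    f"Jan 25 12:24:48 server-a user:debug {_P}[2752700]: Returning 2 : db2inst1 db2inst1 SAMPLE",
    f"Jan 25 12:24:56 server-b user:debug {_P}[6553680]: Returning 1 : db2inst1 db2inst1 SAMPLE",
]

if __name__ == "__main__":
    events = parse_monitor_events(SAMPLE_LINES, year=2011)  # year is assumed
    for ts in both_offline_times(events, ["server-a", "server-b"]):
        print(ts.time())
```

With the sample entries, the sketch prints 12:24:26 and 12:24:29, the two monitor runs during which both servers last reported offline, which matches the annotation on the node 2 log.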
    

Local fix

  • Verify the OpState of the current TSA resources and the actual
    DB2 HADR roles.  If manual reintegration is required, issue
    "db2 start hadr on db <dbname> as standby" on the server that
    is to become the standby.
    
    If the issue is readily reproducible, there may be an
    underlying latency problem.  Resolving it reduces the chance
    that the monitor scripts run in the middle of resource
    management.
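
The reintegration decision in the local fix can be expressed as a small helper. This is a hypothetical Python sketch, not part of DB2 or TSA: it assumes the operator has already collected the HADR role on each server and the server on which TSA reports the HADR resource as Online (gathering them is outside the sketch), and it treats any other server not in the STANDBY role as needing reintegration. The function name and input format are illustrative; it only builds command strings and runs nothing.

```python
from typing import Dict, List, Optional, Tuple


def reintegration_commands(
    roles: Dict[str, Optional[str]], tsa_online_host: str, dbname: str
) -> List[Tuple[str, str]]:
    """Return (host, command) pairs for hosts that need manual reintegration.

    roles: HADR role observed on each host, e.g. "PRIMARY" or "STANDBY",
           or None if HADR is not active there (hypothetical input format).
    tsa_online_host: host on which TSA reports the HADR resource as Online.
    """
    online_role = roles.get(tsa_online_host)
    if online_role != "PRIMARY":
        # TSA and DB2 disagree about which server is primary; stop and
        # leave the decision to an operator instead of guessing.
        raise ValueError(
            f"TSA reports {tsa_online_host} online, but its HADR role is {online_role!r}"
        )
    commands = []
    for host, role in sorted(roles.items()):
        if host != tsa_online_host and role != "STANDBY":
            # To be run on this host to make it the standby again.
            commands.append((host, f"db2 start hadr on db {dbname} as standby"))
    return commands


if __name__ == "__main__":
    observed = {"server-a": "PRIMARY", "server-b": None}
    for host, cmd in reintegration_commands(observed, "server-a", "SAMPLE"):
        print(f"{host}: {cmd}")
```

Raising an error when TSA and the HADR roles disagree is a deliberate choice in this sketch: that mismatch is exactly the state the local fix asks you to verify by hand.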
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * Users of TSAMP HA solutions.                                 *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See Error description above.                                 *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Upgrade to DB2 Version 9.7 Fix Pack 5.                       *
    ****************************************************************
    

Problem conclusion

  • First Fixed in Version 9.7 Fix Pack 5.
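
To check whether a DB2 9.7 server already includes this fix, its level can be compared with Fix Pack 5 (9.7.0.5). The following Python sketch is illustrative only: it assumes a level string in the "DB2 v9.7.0.5" style that db2level reports among its informational tokens, and it answers only for the 9.7 release, because this APAR states a fix level only for Version 9.7. The function name is hypothetical.

```python
import re
from typing import Optional

# First fix level for this APAR on the 9.7 stream: Version 9.7 Fix Pack 5.
FIRST_FIXED_97 = (9, 7, 0, 5)


def includes_ic75196_fix(level: str) -> Optional[bool]:
    """True/False for a 9.7 level string, None for other releases."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", level)
    if m is None:
        raise ValueError(f"unrecognized DB2 level string: {level!r}")
    version = tuple(int(part) for part in m.groups())
    if version[:2] != FIRST_FIXED_97[:2]:
        return None  # this APAR does not state a fix level for other releases
    # Tuples compare element by element, so (9, 7, 0, 4) < (9, 7, 0, 5).
    return version >= FIRST_FIXED_97


if __name__ == "__main__":
    for level in ("DB2 v9.7.0.4", "DB2 v9.7.0.5", "DB2 v9.5.0.8"):
        print(level, includes_ic75196_fix(level))
```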
    

Temporary fix

Comments

APAR Information

  • APAR number

    IC75196

  • Reported component name

    DB2 FOR LUW

  • Reported component ID

    DB2FORLUW

  • Reported release

    970

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2011-03-23

  • Closed date

    2011-12-23

  • Last modified date

    2011-12-23

  • APAR is sysrouted FROM one or more of the following:

    IC74257

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    DB2 FOR LUW

  • Fixed component ID

    DB2FORLUW

Applicable component levels

  • R970 PSN

       UP



Document information

More support for: DB2 for Linux, UNIX and Windows

Software version: 9.7

Reference #: IC75196

Modified date: 23 December 2011