IBM Support

IC71861: DB2 HADR PAIR CAN HANG WHILE PROCESSING AN INFORMATIONAL LOG RECORD ON STANDBY

APAR status

  • Closed as program error.

Error description

  • A DB2 HADR pair can hang showing connect status "Congested" in
    the db2pd -hadr output:
    
    Database Partition 0 -- Database SAMPLE -- Active --
    
    HADR Information:
    Role    State                SyncMode HeartBeatsMissed   LogGapRunAvg (bytes)
    Primary Peer                 Nearsync 0                  991669
    
    ConnectStatus ConnectTime                           Timeout
    Congested     Wed Sep  8 20:31:26 2010 (1283970686) 120
    
    The output on the standby will show that the buffer is 100% full.
    
    The problem occurs while an informational log record is being
    processed on the STANDBY system.
    
    Note: The 'Congested' state is only an external symptom. A
    'Congested' state does not always indicate a hang.
    
    A typical db2redom stack in this situation looks like this:
    
    Thread 51 (Thread 0x2aaac17fe940 (LWP 12900)):
    #0  0x000000333a4d517a in semtimedop () from /lib64/libc.so.6
    #1  0x00002aaaabca8d8b in sqloWaitEDUWaitPost () from /home/inst01/sqllib/lib64/libdb2e.so.1
    #2  0x00002aaaad25ed66 in sqlprWaitDuringPRec(sqeAgent*, SQLO_EDUWAITPOST*) () from /home/inst01/sqllib/lib64/libdb2e.so.1
    #3  0x00002aaaad25c6c6 in sqlpPRecReadLog(sqeAgent*, SQLP_ACB*, SQLP_DBCB*) () from /home/inst01/sqllib/lib64/libdb2e.so.1
    #4  0x00002aaaad24e388 in sqlpParallelRecovery(sqeAgent*, sqlca*) () from /home/inst01/sqllib/lib64/libdb2e.so.1
    #5  0x00002aaaac5ec2b4 in sqleSubCoordProcessRequest(sqeAgent*) () from /home/inst01/sqllib/lib64/libdb2e.so.1
    #6  0x00002aaaab8d3d8e in sqeAgent::RunEDU() () from /home/inst01/sqllib/lib64/libdb2e.so.1
    #7  0x00002aaaabf7af94 in sqzEDUObj::EDUDriver() () from /home/inst01/sqllib/lib64/libdb2e.so.1
    #8  0x00002aaaabf7aeeb in sqlzRunEDU(char*, unsigned int) () from /home/inst01/sqllib/lib64/libdb2e.so.1
    #9  0x00002aaaabcf6d62 in sqloEDUEntry () from /home/inst01/sqllib/lib64/libdb2e.so.1
    #10 0x000000333b00673d in start_thread () from /lib64/libpthread.so.0
    #11 0x000000333a4d3d1d in clone () from /lib64/libc.so.6
    
    A normal idle db2redom stack shows this call chain:
    
    sqlpPRecReadLog -> sqlpshrScanNext -> sqlorest (etc.)
    
    In the hang, the call chain is instead:
    
    sqlpPRecReadLog -> sqlprWaitDuringPRec -> sqloWaitEDUWaitPost
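
The two symptoms described above can be checked with a small script. This is only a sketch: the function names hadr_connect_status and classify_redo_stack are hypothetical helpers, not DB2 tools. It assumes the columnar db2pd -hadr layout shown in this APAR and uses only the frame names quoted here. A single matching stack is a hint rather than proof of a hang, so compare several samples taken over time.

```shell
#!/bin/sh
# Hypothetical helpers (not part of DB2); both read their input from stdin.

# Print the ConnectStatus value from `db2pd -hadr` output.
# Assumes the columnar layout shown in this APAR: a header line that
# starts with "ConnectStatus", followed by a line holding the values.
hadr_connect_status() {
    awk '$1 == "ConnectStatus" { if ((getline) > 0) print $1; exit }'
}

# Classify a db2redom stack using only the frames named in this APAR.
#   possible-hang : sqlpPRecReadLog, sqlprWaitDuringPRec and
#                   sqloWaitEDUWaitPost are all present
#   idle          : sqlpPRecReadLog and sqlpshrScanNext are present
#   unknown       : anything else
classify_redo_stack() {
    stack=$(cat)
    if printf '%s\n' "$stack" | grep -q 'sqlpPRecReadLog' &&
       printf '%s\n' "$stack" | grep -q 'sqlprWaitDuringPRec' &&
       printf '%s\n' "$stack" | grep -q 'sqloWaitEDUWaitPost'; then
        echo "possible-hang"
    elif printf '%s\n' "$stack" | grep -q 'sqlpPRecReadLog' &&
         printf '%s\n' "$stack" | grep -q 'sqlpshrScanNext'; then
        echo "idle"
    else
        echo "unknown"
    fi
}
```

For example, `db2pd -db SAMPLE -hadr | hadr_connect_status` would print Congested for the output shown above, and a saved db2redom stack can be piped into classify_redo_stack. Because 'Congested' alone does not always indicate a hang, treat the stack result as the stronger signal.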
    

Local fix

  • The fewer redo workers you have, the more likely you are to hit
    this problem.
    You can use DB2BPVARS to configure the number of redo workers
    as described below.
    
    Step 1: Set DB2BPVARS to point to the file that contains the new
    value:
    
    db2set DB2BPVARS=/home/userid/bpvars.txt   (you can use
    whatever filename you want)
    
    Step 2: Add one line to this file. NOTE: the value '5' means 4
    workers plus a master. To try 6 (or 8) workers, set this
    value to 7 (or 9).
    
    PREC_NUM_AGENTS=5
    
    so the file looks like this:
    
    $ cat /home/userid/bpvars.txt
    PREC_NUM_AGENTS=5
    
    NOTE: the database must be recycled for this value to take
    effect.
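
The two steps above can be sketched as a small shell function. This is an illustration under stated assumptions: write_bpvars is a hypothetical helper name, and it overwrites the target file, so it only suits a file that holds nothing but this one setting. It applies the "workers plus one master" rule from Step 2 and prints the db2set command for review instead of running it.

```shell
#!/bin/sh
# write_bpvars FILE WORKERS   (hypothetical helper, not a DB2 command)
# Writes PREC_NUM_AGENTS = WORKERS + 1 (the extra agent is the master)
# to FILE, replacing any existing content, then prints the db2set
# command that would point DB2BPVARS at FILE.
write_bpvars() {
    file=$1
    workers=$2
    # Accept only positive integers without leading zeros.
    case $workers in
        ''|*[!0-9]*|0*)
            echo "workers must be a positive integer" >&2
            return 1 ;;
    esac
    agents=$((workers + 1))
    printf 'PREC_NUM_AGENTS=%s\n' "$agents" > "$file" || return 1
    # Printed, not executed, so the change can be reviewed first.
    echo "db2set DB2BPVARS=$file"
}
```

For example, `write_bpvars /home/userid/bpvars.txt 4` writes PREC_NUM_AGENTS=5, matching the file shown above. The database still has to be recycled afterwards for the value to take effect.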
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * ALL                                                          *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See Error description above.                                 *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Upgrade to DB2 Version 9.7 Fix Pack 4.                       *
    ****************************************************************
    

Problem conclusion

  • First fixed in DB2 Version 9.7 Fix Pack 4.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IC71861

  • Reported component name

    DB2 FOR LUW

  • Reported component ID

    DB2FORLUW

  • Reported release

    970

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2010-10-13

  • Closed date

    2011-05-03

  • Last modified date

    2011-05-03

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    DB2 FOR LUW

  • Fixed component ID

    DB2FORLUW

Applicable component levels

  • R970 PSY

       UP



Document information

More support for: DB2 for Linux, UNIX and Windows

Software version: 9.7

Reference #: IC71861

Modified date: 03 May 2011