IBM Support

IC75211: IN DPF, CONNECT OR CONNECT RESET HANGS DUE TO MISSING REPLY AFTER NODE FAILURE.

APAR status

  • Closed as program error.

Error description

  • During a failed connect (which performs an implicit connect reset)
    or during connect reset processing in a multi-node (DPF)
    environment, a node failure can cause an expected reply from a
    remote node to the connect reset to be missed. The coordinator
    agent then hangs with the following stack:
    
    sqloWaitEDUWaitPost
    WaitRecvReady
    ReceiveBuffer
    getNextBuffer
    sqlkd_rcv_buffer
    sqlkd_rcv_get_next_buffer
    sqlkd_rcv_init
    sqlkdReceiveReply
    sqleReceiveAndMergeReplies
    sqlkdInterrupt
    sqleDssStopUsing
    ForwardStopRequest
    AppStopUsing
    sqlesrspWrp
    sqleUCagentConnectReset
    sqljsCleanup
    sqljsDrdaAsInnerDriver
    sqljsDrdaAsDriver
    RunEDU
    
    An entry similar to the following is written to db2diag.log on
    the coordinator node:
    
    2011-03-02-04.15.40.706078+540 I601932A472        LEVEL: Error
    PID     : 4841666              TID  : 4885        PROC : db2sysc 1
    INSTANCE: db2inst              NODE : 001         DB   : P64816
    APPHDL  : 1-51                 APPID: *N1.dpfv971.110301191344
    AUTHID  : DB2INST
    EDUID   : 4885                 EDUNAME: db2agent (sample) 1
    FUNCTION: DB2 UDB, buffer dist serv, sqlkdReceiveReply, probe:10
    RETCODE : ZRC=0x81590016=-2124873706=SQLKF_NODE_FAILED "Node Recovery"
    
    
    Another indication of this hang is one or more subagents, working
    on the coordinator's stop-using request, stuck in log term sync
    (sqlpTermLogSync) on a non-coordinator node with this call stack:
    
    sqloWaitEDUWaitPost
    WaitRecvReady
    ReceiveBuffer
    getNextBuffer
    sqlkd_rcv_buffer
    sqlkd_rcv_get_next_buffer
    sqlkd_rcv_init
    sqlkdReceiveReply
    sqlpLSrequestor
    sqlpPerformTermLogSync
    sqlpTermLogSync
    sqlpterm
    CleanDB
    TermDbConnect
    AppStopUsing
    sqleSubAgentStopUsing
    sqleSubRequestRouter
    
    As a result of the hang, connection attempts to the node fail
    with SQL1229N.
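
The two call stacks above can be checked for mechanically. The following is a minimal sketch, not an IBM tool: it assumes the stack has already been reduced to one function name per line, as shown in this APAR (real stack trace files also carry addresses and offsets, which would need to be stripped first). The helper names contains_in_order and classify_stack are hypothetical, and the choice of which frames count as distinguishing is an assumption based only on the stacks listed here.

```python
# Distinguishing frames, innermost first, taken from the two call stacks
# listed in this APAR. Which frames are "distinguishing" is an assumption.
COORD_FRAMES = (
    "sqlkdReceiveReply",
    "sqleReceiveAndMergeReplies",
    "sqlkdInterrupt",
    "sqleDssStopUsing",
    "sqleUCagentConnectReset",
)
SUBAGENT_FRAMES = (
    "sqlkdReceiveReply",
    "sqlpLSrequestor",
    "sqlpPerformTermLogSync",
    "sqlpTermLogSync",
    "sqleSubAgentStopUsing",
)


def contains_in_order(frames, wanted):
    """Return True if every name in `wanted` occurs in `frames`, in order."""
    remaining = iter(frames)
    # Each any() call consumes the shared iterator up to the next match,
    # so later names must appear after earlier ones.
    return all(any(frame == name for frame in remaining) for name in wanted)


def classify_stack(text):
    """Classify a stack (one function name per line) against both signatures.

    Returns "coordinator", "subagent", or None when neither matches.
    """
    frames = [line.strip() for line in text.splitlines() if line.strip()]
    if contains_in_order(frames, COORD_FRAMES):
        return "coordinator"
    if contains_in_order(frames, SUBAGENT_FRAMES):
        return "subagent"
    return None
```

A match only indicates that a stack resembles the ones in this APAR. An agent can pass through these frames briefly during normal processing, so the same stack should be seen in repeated samples before it is treated as a hang.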
    

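The RETCODE line in the sample db2diag.log entry shows the same return code three ways: hexadecimal, signed decimal, and symbolic name. The sketch below illustrates that the first two agree when the hexadecimal value is read as a signed 32-bit integer, which holds for the values in the sample entry. The function names zrc_to_signed and parse_retcode are hypothetical, and the regular expression assumes the exact "ZRC=hex=decimal=name" layout shown above.

```python
import re

# Matches the layout seen in the sample entry:
# ZRC=0x81590016=-2124873706=SQLKF_NODE_FAILED
RETCODE_PATTERN = re.compile(r"ZRC=(0x[0-9A-Fa-f]{8})=(-?\d+)=(\w+)")


def zrc_to_signed(hex_text):
    """Interpret a 32-bit hexadecimal code as a signed (two's complement) integer."""
    value = int(hex_text, 16)
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit value")
    # Values with the top bit set represent negative numbers.
    return value - 0x100000000 if value & 0x80000000 else value


def parse_retcode(line):
    """Extract (hex, signed decimal, symbolic name) from a RETCODE line.

    Returns None if the line does not contain a ZRC field, and raises
    ValueError if the hex and decimal fields disagree.
    """
    match = RETCODE_PATTERN.search(line)
    if match is None:
        return None
    hex_text, decimal_text, name = match.groups()
    if zrc_to_signed(hex_text) != int(decimal_text):
        raise ValueError("hex and decimal ZRC fields disagree")
    return hex_text, int(decimal_text), name
```

For the sample entry, 0x81590016 is 2170093590 unsigned; subtracting 2^32 (4294967296) gives -2124873706, the decimal value printed in the log.
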
Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * Users using DPF environment                                  *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * During a failed connect (which performs an implicit connect  *
    * reset) or connect reset processing in a multi-node           *
    * environment, a node failure can cause an expected reply from *
    * a remote node to be missed, leaving the coordinator agent    *
    * hung. The call stacks and the db2diag.log entry are listed   *
    * under "Error description" above. Connection attempts to the  *
    * affected node then fail with SQL1229N.                       *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Upgrade to Version 9.5 FixPack 8.                            *
    ****************************************************************
    

Problem conclusion

  • Problem was first fixed in DB2 UDB Version 9.5 FixPack 8.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IC75211

  • Reported component name

    DB2 FOR LUW

  • Reported component ID

    DB2FORLUW

  • Reported release

    950

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2011-03-23

  • Closed date

    2011-06-30

  • Last modified date

    2011-06-30

  • APAR is sysrouted FROM one or more of the following:

    IC74901

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    DB2 FOR LUW

  • Fixed component ID

    DB2FORLUW

Applicable component levels

  • R950 PSN UP



Document information

More support for: DB2 for Linux, UNIX and Windows

Software version: 9.5

Reference #: IC75211

Modified date: 30 June 2011