IC94740: LOGICAL LOGS ON HDR SECONDARY CAN LOSE BACKED UP FLAG LEADING TO PING TIMEOUTS AND INABILITY FOR SERVERS TO RECONNECT AND RESYNC

Subscribe to this APAR

By subscribing, you receive periodic emails alerting you to the status of the APAR, along with a link to the fix after it becomes available. You can track this item individually or track all items by product.

Notify me when this APAR changes.

Notify me when an APAR for this component changes.

APAR status

  • Closed as program error.

Error description

  • If the B flag on some of the logical logs on the secondary get
    lost as part of bouncing the secondary, when the logs wrap and
    these logs need to be reused, this will lead to a ping timeout.
    So first you would see something like this in your secondary
    MSGPATH file:
    
    15:49:42  DR: ping timeout
    15:49:47  DR: Received connection request from remote server
    when DR is not Off
                  [Local  type: Secondary, Current state: FAILED]
                  [Remote type: Primary]
    
    15:49:59  DR: Received connection request from remote server
    when DR is not Off
                  [Local  type: Secondary, Current state: FAILED]
                  [Remote type: Primary]
    
    So you see the ping timeout error, which is then followed by
    attempts to reconnect by the primary, which are rejected because
    the secondary is failing to shutdown HDR properly.   When you
    look at the state of the secondary some items to key in on would
    be the following:
    
    1) onstat -l (from the secondary) would show the next log that
    the server would be trying to use would only have the U flag as
    follows
    7000000203c6860  15       U---C--  372      1:30263
    1250     1250   100.00
    7000000203c68c8  16       U------  342      1:31513
    1250     1250   100.00
    7000000203c6930  17       U------  343      1:43707
    1250     1250   100.00
    
    Now on the primary, the onstat -l output for the 2 logs in
    question would look like this:
    7000000203c6860  15       U------  372      1:30263
    1250     1250   100.00
    7000000203c68c8  16       U---C--  373      1:31513
    1250      310    24.80
    7000000203c6930  17       U-B----  343      1:43707
    1250     1250   100.00
    
    So you can see on the primary, it was able to switch into log 16
    (logid 373) and the only way it could do that is if it had been
    backed up (like log 17 shows), but on the secondary both log 16
    and 17 show only the "U" flag.
    
    2) The state of the logrecover thread on the secondary would be
    in a nearly constant running state, and it's stack would look
    something like this:
    
    mt_yield
    lgr_set_logf
    dologrecvr
    startup
    
    or
    
    get_logfile_by_id
    freeonelog_by_id
    lgr_set_logf
    dologrecv
    startup
    
    The important portion is having lgr_set_logf in the stack.
    
    Lastly, this appears to happen because the secondary server was
    bounced and there would have been a transaction open prior to
    when ontape -a was run on the primary and the logical recovery
    message on the secondary servers MSGPATH file would look
    something like this:
    
    15:38:44  DR: Failure recovery from disk in progress ...
    15:38:44  Logical Recovery Started.
    15:38:44  23 recovery worker threads will be started.
    15:38:44  Start Logical Recovery - Start Log 342, End Log ?
    15:38:44  Starting Log Position - 357 0x107018
    15:38:45  Logical Log 342 Complete, timestamp: 0xe068ea.
    15:38:46  Logical Log 343 Complete, timestamp: 0xe068ea.
    15:38:46  Logical Log 344 Complete, timestamp: 0xe068ea.
    15:38:46  Logical Log 345 Complete, timestamp: 0xe068ea.
    15:38:47  Logical Log 346 Complete, timestamp: 0xe068ea.
    15:38:47  Logical Log 347 Complete, timestamp: 0xe068ea.
    15:38:47  Logical Log 348 Complete, timestamp: 0xe068ea.
    15:38:48  Logical Log 349 Complete, timestamp: 0xe068ea.
    15:38:48  Logical Log 350 Complete, timestamp: 0xe068ea.
    15:38:48  Logical Log 351 Complete, timestamp: 0xe068ea.
    
    So from the above the server starts recovery in 342 (the start
    of the old transaction) but recovery  starting log position
    doesn't start until
    357 which is a more recent checkpoint.  In this case, the log
    back up on the primary would have happened some time after the
    log 342 was used and the transaction was opened but the logs
    would have to be backed up prior to the last checkpoint.
    
    If the secondary is bounced at this point, when it comes up, it
    will start receiving log files, but again it will hit a point
    where it appears to hang and again if you look at the onstat -l
    output, you will see the current log is 100% full and the next
    log file will show U and not backed up (so no B), and the stack
    trace of the logrecover thread will be the same as previously
    mentioned.
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * Users of HDR environment.                                    *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See Error Description                                        *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Update to IDS-11.70.xC8                                      *
    ****************************************************************
    

Problem conclusion

  • Problem Fixed In IDS-11.70.xC8
    

Temporary fix

Comments

APAR Information

  • APAR number

    IC94740

  • Reported component name

    INFORMIX SERVER

  • Reported component ID

    5725A3900

  • Reported release

    B70

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2013-08-08

  • Closed date

    2014-02-26

  • Last modified date

    2014-02-26

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    INFORMIX SERVER

  • Fixed component ID

    5725A3900

Applicable component levels

  • RB70 PSY

       UP



Rate this page:

(0 users)Average rating

Add comments

Document information


More support for:

Informix Servers

Software version:

B70

Reference #:

IC94740

Modified date:

2014-02-26

Translate my page

Machine Translation

Content navigation