IBM Support

IC98387: SECOND ERROR ON PPRC VOLUME AFTER A SUSPEND BY EVENT WHILE SERVER IS SHUTDOWN WILL CAUSE PAIR TO INCORRECTLY BE MARKED.

Fixes are available

Refresh Pack 5.2.2 (June 2014) for Tivoli Storage Productivity Center
Fix Pack 5.1.1.5 (July 2014) for Tivoli Storage Productivity Center
Refresh Pack 5.2.3 (August 2014) for Tivoli Storage Productivity Center
Fix Pack 5.2.4 (November 2014) for Tivoli Storage Productivity Center
Fix Pack 5.2.4.1 (December 2014) for Tivoli Storage Productivity Center
Refresh Pack 5.2.5 (March 2015) for Tivoli Storage Productivity Center (withdrawn)
Fix Pack 5.1.1.6 (March 2015) for Tivoli Storage Productivity Center
Fix Pack 5.2.5.1 (April 2015) for Tivoli Storage Productivity Center (withdrawn)
Refresh Pack 5.2.6 (June 2015) for Tivoli Storage Productivity Center
Refresh Pack 5.2.7 (August 2015) for Tivoli Storage Productivity Center
Fix Pack 5.1.1.9 (October 2015) for Tivoli Storage Productivity Center
IBM Spectrum Control V5.2.8 (December 2015)
IBM Spectrum Control V5.2.9 (February 2016)
IBM Spectrum Control V5.2.10 (May 2016)
IBM Spectrum Control V5.2.10.1 (July 2016)
IBM Spectrum Control V5.2.11 (August 2016)
Fix Pack 5.1.1.12 (October 2016) for Tivoli Storage Productivity Center
Fix Pack 5.1.1.13 (February 2017) for Tivoli Storage Productivity Center
Fix Pack 5.1.1.14 (June 2017) for Tivoli Storage Productivity Center
Fix Pack 5.1.1.15 (Sept 2017) for Tivoli Storage Productivity Center
IBM Spectrum Control V5.2.12 (November 2016)
IBM Spectrum Control V5.2.13 (March 2017)
IBM Spectrum Control V5.2.14 (May 2017)
IBM Spectrum Control V5.2.15 (August 2017)
IBM Spectrum Control V5.2.15.2 (November 2017)
IBM Spectrum Control V5.2.16 (March 2018)
IBM Spectrum Control V5.2.17 (May 2018)
Fix Pack 5.1.1.8 (July 2015) for Tivoli Storage Productivity Center
Fix Pack 5.2.7.1 (February 2016) for Tivoli Storage Productivity Center

APAR status

  • Closed as fixed if next.

Error description

  • 1. While the TPC-R server is stopped or has connectivity issues,
       if an error occurs on a PPRC volume, the affected volume is
       marked with SUSPEND status and as non-recoverable when the
       TPC-R server is started or reconnected.
    
    2. If another error then occurs on a PPRC volume (freeze trigger),
       TPC-R detects it and performs the FREEZE as expected, but
       TPC-R marks the previously suspended volume as recoverable.
       The pair is marked with error IWNR2052E, indicating
       that there may be a problem.
    
    This happens because, after the Freeze operation, TPC-R
    queries the status of the secondaries and makes a best guess to
    determine consistency.
    In this case the secondary may or may not be consistent,
    because TPC-R cannot query whether the long busy is
    still occurring or, more importantly, whether data has been
    written since the thaw.
    At this point the pair is marked as recoverable if the hardware
    reports that the state is Full Duplex.
    
    With the fix for this APAR, the pair is marked as
    non-recoverable if TPC-R cannot determine consistency, instead
    of being marked as recoverable.
    
    Circumvention:
    Use the Hardened Freeze option with TPC-R to ensure consistency
    and correct status while the TPC-R server is inactive or
    disconnected.
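
The change in behavior described above can be illustrated with a minimal Python sketch. This is not TPC-R code: the names PairState, is_recoverable, consistency_known and apply_fix are hypothetical, and the consistency_known flag merely stands in for whatever internal knowledge TPC-R has that it missed a suspend event while it was stopped or disconnected.

```python
from enum import Enum


class PairState(Enum):
    """Hypothetical subset of the states the hardware can report for a target."""
    FULL_DUPLEX = "Full Duplex"
    SUSPENDED = "Suspended"


def is_recoverable(target_state, consistency_known, apply_fix=True):
    """Decide whether a pair is flagged recoverable after a freeze.

    consistency_known is False when consistency cannot be determined,
    for example because the server was down during an earlier suspend.
    apply_fix=False models the best-guess behavior before this APAR.
    """
    if apply_fix and not consistency_known:
        # Fixed behavior: err on the side of caution.
        return False
    # Best guess: a Full Duplex target is assumed to be consistent.
    return target_state is PairState.FULL_DUPLEX
```

Under these assumptions, a Full Duplex target whose consistency cannot be determined is reported as recoverable by the old logic (apply_fix=False) and as non-recoverable by the fixed logic.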
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * Customers managing DR could experience this if the           *
    * Replication server is brought down and problems occur on the *
    * copy relationships during this period.                       *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * | fix pack | 5.1.1-TIV-TPC-FP0005 - target 2Q 2014 |         *
    * | release  | 5.2.1-TIV-TPC-FP0000 - target 1Q 2014 |         *
    *                                                              *
    * http://www-01.ibm.com/support/docview.wss?&uid=swg21320822   *
    *                                                              *
    * The target dates for future fix packs do not represent a     *
    * formal commitment by IBM. The dates are subject to change    *
    * without notice.                                              *
    *                                                              *
    *                                                              *
    *                                                              *
    * 1. Suppose that TPC-R is stopped or has connectivity issues. *
    *    If an error then occurs on a PPRC volume, the affected    *
    *    volume is in SUSPEND status and non-recoverable when      *
    *    TPC-R becomes active again.                               *
    *                                                              *
    *                                                              *
    * 2. If another error now occurs on a PPRC volume (freeze      *
    *    trigger), TPC-R detects it and performs the FREEZE as     *
    *    expected, but TPC-R marks the previous volume as          *
    *    recoverable. The volume should stay in the                *
    *    'non-recoverable' status it had prior to the FREEZE.      *
    *                                                              *
    *                                                              *
    * This could confuse the customer, who might then decide to    *
    * move to the secondary copy while it is not in a consistent   *
    * state.                                                       *
    *                                                              *
    *                                                              *
    *                                                              *
    * To a degree TPC-R is working as expected; however, the       *
    * product should err on the side of caution if TPC-R is shut   *
    * down at the time of the error:                               *
    *                                                              *
    * TPC-R is behaving as expected. Because TPC-R was shut down   *
    * during the suspend event, there is no way that we can        *
    * guarantee consistency. When TPC-R comes back up, it does     *
    * detect the suspend event and reacts as well as possible.     *
    * On startup, the session knows about the suspended volume:    *
    * Session Name = Session_NAME                                  *
    * Session State = Prepared                                     *
    * State Description = No Description Provided                  *
    * Session Status = Severe                                      *
    * Description =                                                *
    * Is Recoverable? = false                                      *
    * Is Shadowing? = true                                         *
    * Total # copysets = 786                                       *
    * Were Errors Found = true                                     *
    * Production Host = H1                                         *
    * Production Host with Mode = H1                               *
    * Copy Rules:                                                  *
    *    Name = Metro Mirror Failover/Failback                     *
    *    Type = MM                                                 *
    *    Number of Volumes for copytype = 2                        *
    *    HWTypes in session = ESS |                                *
    * Sites:                                                       *
    * Site 1:                                                      *
    * Sant Cugat                                                   *
    * Site 2:                                                      *
    * Cerdanyola CD1                                               *
    * Status Messages on session Session_NAME:                     *
    * Sequences in session Session_NAME:                           *
    * Sequence Name = H1-H2                                        *
    * isRecoverable = false                                        *
    * isShadowing = true                                           *
    *       # Exceptions = 0                                       *
    *       # Shadowing = 785                                      *
    *       # Recoverable = 785                                    *
    *       # in HW CG = 0                                         *
    *   direction = true                                           *
    *       Timestamp = n/a                                        *
    * Progress = CopyProgress: total=132224400, copied=132224399,  *
    * progress=99, timeEstimate=null                               *
    *       # of Pairs = 786                                       *
    *         Base Copy Type = MM                                  *
    * Pair State Counts:                                           *
    * State Name: Suspended  # volumes in this state: 1 *<--- note *
    * volume is suspended*                                         *
    * State Name: Prepared  # volumes in this state: 785           *
    *                                                              *
    * The pair that was tampered with has return code 8, and the   *
    * others have return code 10 (suspended to maintain            *
    * consistency):                                                *
    * 2013-11-21 15:20:08.837+0100 CSMSEC-2F RepMgr D              *
    * com.ibm.csm.server.hw.ElementCatalogEventNotifier$EventManag *
    * erEventNotifier sendBulkEvent TRACE: (HWL-EVENT) Sending 131 *
    * normal events:                                               *
    * ET=1:PAIR:RC=2:DS8000:2107.CT171:VOL:0001:DS8000:2107.KH321: *
    * VOL:0001:SUSPENDED (8) :numberOutOfSync=0::TS=1385043608821  *
    * ET=1:PAIR:RC=2:DS8000:2107.CT171:VOL:0002:DS8000:2107.KH321: *
    * VOL:0002:SUSPENDED (10) :numberOutOfSync=0::TS=1385043608821 *
    *                                                              *
    * TPC-R then thaws the sequence:                               *
    * 2013-11-21 15:20:09.196+0100 CSMSEC-30 RepMgr I              *
    * com.ibm.csm.server.session.SessionMgr === ZOS_MM_HOST_SITE0  *
    * === runOperation for session ZOS_MM_HOST_SITE0 KEY EVENT:    *
    * ********************  Running CmdAction thaw to sequence     *
    * H1-H2                                                        *
    *                                                              *
    * TPC-R then marks the session as suspended by a freeze        *
    * operation:                                                   *
    * 2013-11-21 15:20:12.353+0100 CSMSEC-30 KeyEventLog D         *
    * StateMgr changeSessionState   KEY EVENT: SESSION:            *
    * ZOS_MM_HOST_SITE0 -- STATE:         Suspended -- H1          *
    * *** STATE DESCRIPTION:  No Description Provided              *
    * *** CAUSED BY OPERATION OR EVENT:  FREEZE_OP                 *
    *                                                              *
    * A separate thread is then started that checks for            *
    * consistency on the target. Unfortunately, the hardware       *
    * provides no query for whether the pair is consistent.        *
    * TPC-R cannot query whether the long busy is still            *
    * occurring or, more importantly, whether the volume has       *
    * been updated since its thaw. Therefore TPC-R checks the      *
    * state of the target volume, which comes back as duplex;      *
    * according to the algorithm that means it is consistent, so   *
    * TPC-R marks the volume as consistent:                        *
    * 2013-11-21 15:20:12.400+0100 CSMSRV-1CC RepMgr D >           *
    * com.ibm.csm.server.session.policy.actions.CheckMMConsistency *
    * $CheckThread CheckThread() Entry, parm 1 = check thread      *
    * created for driver                                           *
    * ESS:2107.CT171:37920:0::ESS:2107.KH321:5408:0                *
    * 2013-11-21 15:20:18.790+0100 CSMSRV-1CC RepMgr D > StateMgr  *
    * setRecoverable Entry, parm 1 = boolean: true                 *
    *                                                              *
    * HOWEVER, that pair is still marked with an error, which      *
    * signals to the user that something is wrong:                 *
    * source: DS8000:2107.CT171:VOL:0001, target:                  *
    * DS8000:2107.KH321:VOL:0001, db state: Suspended, db pend     *
    * state: Suspended, hw event state: SUSPENDED, hw reason code: *
    * 8, hw source state: 2, hw target state: 2                    *
    * Suspended |  SOURCE ID: DS8000:2107.CT171:VOL:0001 |  SOURCE *
    * NICKNAME: S0F007 |  TARGET ID: DS8000:2107.KH321:VOL:0001 |  *
    * TARGET NICKNAME: S0F007 |  RECOVERABLE: true |  SHADOWING:   *
    * false |  LAST RESULT MSG ID: IWNR2052E                       *
    *                                                              *
    * If the customer is running on z/OS and wants to be able to   *
    * manage situations where TPC-R is down, they will need to     *
    * look into using the Hardened Freeze option:                  *
    * http://www.redbooks.ibm.com/redpieces/pdfs/sg247563.pdf      *
    *                                                              *
    * Enable Hardened Freeze                                       *
    *                                                              *
    * Use this option to let z/OS Input/Output Supervisor (IOS)    *
    * manage freeze operations for the volumes in the session,     *
    * which prevents Tivoli Storage Productivity Center for        *
    * Replication from freezing the volumes and possibly freezing  *
    * itself. We recommend that you use this option if you put     *
    * system volumes, like SYSRES and page data sets, into the     *
    * Copy Sets of a Metro Mirror session.                         *
    *                                                              *
    * Customers will need the following prerequisites to           *
    * implement this function:                                     *
    *                                                              *
    * • z/OS at 1.13 level, with APAR OA37632 installed;           *
    * • z/OS address spaces Basic HyperSwap Management (HSIB)      *
    *   and Basic HyperSwap API (HSIBAPI) must be active, even if  *
    *   you are not going to exploit BHS                           *
    *                                                              *
    * Hardened Freeze puts the configuration in a paged (freeze    *
    * safe) area within IOS to ensure consistency.                 *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Always use a High Availability standby server and issue a    *
    * takeover in the event that the active server requires a      *
    * shutdown for maintenance or outage.                          *
    ****************************************************************
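
The hardware event lines quoted in the trace above can be picked apart with a short script, for example to find the pair whose code differs from the rest. The following Python sketch is an illustration only: the field layout is inferred from the two sample lines in this APAR, the reading of TS as epoch milliseconds is an assumption (it agrees with the trace timestamp), and parse_pair_event is a hypothetical helper, not part of TPC-R. The sample event is rejoined onto one line because the box above wraps it.

```python
import re
from datetime import datetime, timedelta, timezone

# Field layout inferred from the sample events in this APAR (an assumption).
PAIR_EVENT_RE = re.compile(
    r"ET=(?P<event_type>\d+):PAIR:RC=(?P<rc>\d+):"
    r"(?P<source>[^:]+:[^:]+:VOL:[0-9A-Fa-f]+):"
    r"(?P<target>[^:]+:[^:]+:VOL:[0-9A-Fa-f]+):"
    r"(?P<state>[A-Z_]+) \((?P<reason>\d+)\) :"
    r"numberOutOfSync=(?P<out_of_sync>\d+)::TS=(?P<ts>\d+)"
)

SAMPLE = (
    "ET=1:PAIR:RC=2:DS8000:2107.CT171:VOL:0001:DS8000:2107.KH321:"
    "VOL:0001:SUSPENDED (8) :numberOutOfSync=0::TS=1385043608821"
)


def parse_pair_event(line):
    """Return a dict of fields for one pair event, or None if no match."""
    match = PAIR_EVENT_RE.search(line)
    if match is None:
        return None
    millis = int(match.group("ts"))
    # Integer arithmetic avoids float rounding of the millisecond part.
    timestamp = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    timestamp += timedelta(milliseconds=millis % 1000)
    return {
        "source": match.group("source"),
        "target": match.group("target"),
        "state": match.group("state"),
        "reason": int(match.group("reason")),
        "out_of_sync": int(match.group("out_of_sync")),
        "timestamp": timestamp,
    }


if __name__ == "__main__":
    event = parse_pair_event(SAMPLE)
    print(event["source"], event["state"], event["reason"])
```

Filtering parsed events on the code in parentheses would separate the pair that suspended first (8 in the sample) from the pairs suspended to maintain consistency (10 in the sample).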
    

Problem conclusion

Temporary fix

Comments

APAR Information

  • APAR number

    IC98387

  • Reported component name

    TPC

  • Reported component ID

    5608TPC00

  • Reported release

    511

  • Status

    CLOSED FIN

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2013-12-18

  • Closed date

    2014-02-17

  • Last modified date

    2014-07-29

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

Applicable component levels

  • R511 PSY

       UP

  • R520 PSY

       UP

Document Information

Modified date:
23 March 2022