z/OS MVS Setting Up a Sysplex


Planning the failure detection interval and operator notification interval

SA23-1399-00

Planning the failure detection interval: On each system in the sysplex, MVS™ periodically updates its own status and monitors the status of the other systems in the sysplex. A status update missing condition occurs when a system in the sysplex does not update its status information within a certain time interval. This time interval is the failure detection interval. The effective failure detection interval is the larger of the user-specified failure detection interval and the spin failure detection interval:
  • The user-specified failure detection interval comes from the following sources:
    • The INTERVAL keyword in COUPLExx (explicitly or by default)
    • The SETXCF COUPLE,INTERVAL command
    • The value set by cluster management instrumentation software
    When INTERVAL is omitted, its default value equals the spin failure detection interval.
  • The spin failure detection interval is derived from the excessive spin parameters specified in the EXSPATxx parmlib member. The value is computed as follows, where N is the number of excessive spin recovery actions, +1 indicates the implicit SPIN action, and SpinTime is the excessive spin loop timeout interval:
    spinfdi = (N+1)*SpinTime + 5
    When the EXSPATxx default specifications are used, the default spin failure detection interval is either 45 seconds or 165 seconds, depending on the configuration. See z/OS MVS Initialization and Tuning Reference for more details.
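
As a worked illustration of the formula above, the following Python sketch computes the spin failure detection interval. The function name `spin_fdi` is hypothetical, and the sample inputs (three recovery actions with a 10-second or a 40-second spin timeout) are assumptions chosen only because they reproduce the 45-second and 165-second defaults quoted above. They are not values stated in this topic.

```python
def spin_fdi(recovery_actions: int, spin_time: int) -> int:
    """Return spinfdi = (N+1)*SpinTime + 5, in seconds.

    recovery_actions -- N, the number of excessive spin recovery actions
    spin_time        -- SpinTime, the excessive spin loop timeout in seconds
    """
    if recovery_actions < 0 or spin_time <= 0:
        raise ValueError("N must be >= 0 and SpinTime must be > 0")
    # The +1 accounts for the implicit SPIN action.
    return (recovery_actions + 1) * spin_time + 5


# Assumed inputs, picked to match the two defaults mentioned in the text.
print(spin_fdi(3, 10))  # 45
print(spin_fdi(3, 40))  # 165
```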

Exception: You can enable the USERINTERVAL option in the COUPLExx parmlib member or with the SETXCF FUNCTIONS command to force the user-specified INTERVAL value to be the effective failure detection interval, even when it is smaller than the spin failure detection interval.
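
The selection rule, including the USERINTERVAL exception, can be summarized in a small Python sketch. This models only the rules described in this topic and is not how XCF is implemented; the function and parameter names are hypothetical, and the sample values are illustrative.

```python
from typing import Optional


def effective_fdi(user_interval: Optional[int],
                  spin_fdi: int,
                  userinterval_enabled: bool = False) -> int:
    """Return the effective failure detection interval, in seconds.

    user_interval        -- user-specified INTERVAL, or None when omitted
    spin_fdi             -- spin failure detection interval
    userinterval_enabled -- True when the USERINTERVAL option is enabled
    """
    if user_interval is None:
        # An omitted INTERVAL defaults to the spin failure detection interval.
        return spin_fdi
    if userinterval_enabled:
        # USERINTERVAL forces the user value, even if it is the smaller one.
        return user_interval
    # Otherwise the larger of the two values wins.
    return max(user_interval, spin_fdi)


print(effective_fdi(25, 45))        # 45: spin interval is larger
print(effective_fdi(85, 45))        # 85: user interval is larger
print(effective_fdi(25, 45, True))  # 25: forced by USERINTERVAL
print(effective_fdi(None, 165))     # 165: INTERVAL omitted
```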

Planning the operator notification interval: The operator notification interval is the amount of time between when a system stops updating its status and when another system issues message IXC402D to inform the operator of the condition. You specify the operator notification interval on the OPNOTIFY keyword in COUPLExx.

After IPL, you can modify the failure detection and operator notification intervals using the SETXCF COUPLE command.

The OPNOTIFY value can be specified as an absolute value or as a relative value.
  • When an absolute value is specified, the effective OPNOTIFY value used by the system is the greater of the specified OPNOTIFY value and the effective failure detection interval.
  • When a relative value is specified, the effective OPNOTIFY value used by the system is the sum of the effective failure detection interval and the specified relative OPNOTIFY value.
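
The two cases above can be illustrated with a short Python sketch. The function name, the `relative` flag, and the sample numbers are hypothetical and serve only to show the arithmetic; they do not reflect COUPLExx syntax.

```python
def effective_opnotify(value: int, effective_fdi: int,
                       relative: bool = False) -> int:
    """Return the effective operator notification interval, in seconds.

    value         -- the specified OPNOTIFY value
    effective_fdi -- the effective failure detection interval
    relative      -- True when the OPNOTIFY value is a relative value
    """
    if relative:
        # Relative: added on top of the effective failure detection interval.
        return effective_fdi + value
    # Absolute: never smaller than the effective failure detection interval.
    return max(value, effective_fdi)


print(effective_opnotify(40, 45))                # 45
print(effective_opnotify(90, 45))                # 90
print(effective_opnotify(3, 45, relative=True))  # 48
```

In both cases the result is at least as large as the effective failure detection interval.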

Considerations: The operator notification interval must be equal to or greater than the failure detection interval. For many installations, the default values for the failure detection interval and the operator notification interval are reasonable.

If you are not using the default failure detection interval, you need to evaluate the tradeoff between removing a failing system immediately to prevent disruption to your sysplex and tolerating a period of non-productivity so that the system has a chance to recover.

If you specify these intervals, consider the following:
  • If SFM is active in the sysplex, and ISOLATETIME is specified, then the operator is not prompted. However, if SFM fails to isolate the failing system, the operator is still prompted.

    If SFM or XCFPOLxx is active and RESETTIME or DEACTTIME is specified, then the operator notification time runs in parallel to the RESETTIME or the DEACTTIME.

    Otherwise, to ensure that the operator receives timely notification of a failure, specify an operator notification interval that is the same as or only slightly greater than the failure detection interval.

  • XCF signaling uses the failure detection interval of the inbound (receiving) system to determine when an outbound signaling path is inoperative. If signals remain queued for transfer longer than the receiving system's failure detection interval, the path is restarted.





Copyright IBM Corporation 1990, 2014