When work on one system in a sysplex cannot complete because the
system fails, other systems in the sysplex remain available to recover
the work and continue processing the workload. The goals of failure
management in a sysplex are to minimize the impact that a failing
system might have on the sysplex workload so that work can continue,
and to do this with little or no operator intervention. You
should be aware that in some cases sysplex delays may occur while
other systems attempt to recover work from a failing system. See Lock structure considerations for examples of sysplex delays that may occur.
The actions MVS™ takes in failure situations are determined by the
information specified through the COUPLExx parmlib member, the SFM
policy, the XCFPOLxx parmlib member, the automatic restart management
policy, the use of the system status detection (SSD) partitioning
protocol, and system defaults.
- COUPLExx Parmlib Member
From COUPLExx, MVS obtains basic failure-related
information, such as when to consider a system to have failed and
when to notify the operator of the failure. (COUPLExx parmlib specifications
might be different for each system, depending on its workload, processor
capacity, or other factors.)
- SFM Policy
If all systems in a sysplex are running OS/390® or MVS/ESA SP Version 5, you can use the sysplex
failure management (SFM) policy to define how MVS is to handle system failures, signaling connectivity
failures, or PR/SM™ reconfiguration
actions. Although you can use SFM in a sysplex without a coupling
facility, to take advantage of the full range of failure management
capabilities that SFM offers, a coupling facility must be configured
in the sysplex.
SFM makes use of some information specified in COUPLExx and includes
all the function available through XCFPOLxx.
The SFM policy can also be used in conjunction with the REBUILDPERCENT
specification in the CFRM policy to determine whether MVS should
initiate a structure rebuild when loss of connectivity to a coupling
facility occurs.
- XCFPOLxx Parmlib Member
In a multisystem
sysplex on a processor with the PR/SM feature,
XCFPOLxx functions can provide some of the same capabilities as those
provided by the SFM policy. XCFPOLxx functions are also referred
to as the XCF PR/SM policy.
- Automatic Restart Management Policy
Use the automatic
restart management policy to specify how batch jobs and started tasks
that are registered as elements of automatic restart management should
be restarted. The policy can specify different actions to be taken
when a system fails, and when an element fails. Automatic restart
management uses the IXC_WORK_RESTART exit, the IXC_ELEM_RESTART
exit, the event exit, and the IXCARM macro parameters, in conjunction
with the automatic restart management policy (the specified values
and the defaults) when determining how to restart elements.
- System Status Detection (SSD) Partitioning Protocol Using BCPii
XCF uses the SSD partitioning protocol
and BCPii services to enhance and expedite sysplex partitioning processing
of systems in the sysplex. With BCPii services, XCF can automatically
detect when a system in the sysplex has become demised. Then XCF can
initiate partitioning the demised system immediately, bypassing the
failure detection interval and the cleanup interval and avoiding the
need for system fencing and manual operator intervention. A system
image is considered demised when XCF determines that the system is
removable from the sysplex without further delay. The system might
encounter one of the following conditions:
- The system enters a non-restartable disabled wait state.
- The system experiences a LOAD operation.
- The system experiences a RESET or other equivalent action
(such as system reset, checkstop, or power-down).
- System Default Status Update Missing (SUM) Action
If no active SFM policy or PR/SM policy is defined, the default SUM action is
used in response to a status update missing condition. Before z/OS® V1R11, the default SUM action
is to prompt the operator when a system is detected to be status update
missing. As of z/OS V1R11,
if a system is in the status update missing condition and is not sending
any XCF signals, the system is isolated immediately using the
fencing services through the coupling facility.
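The demised conditions and the default SUM action described above can be summarized as a small decision model. The following Python sketch is illustrative only: it is not an MVS or XCF interface, and every name in it (`SumAction`, `DEMISED_CONDITIONS`, `is_demised`, `default_sum_action`, and the parameter names) is hypothetical. It encodes only the cases stated in the text and returns `None` where the text does not say what happens.

```python
from enum import Enum
from typing import Optional


class SumAction(Enum):
    """Possible outcomes in this illustrative model (names are hypothetical)."""
    POLICY_DEFINED = "follow the active SFM or PR/SM policy"
    PROMPT_OPERATOR = "prompt the operator"
    ISOLATE = "isolate the system using coupling facility fencing services"


# Conditions the text lists for a system image that XCF considers demised.
# The labels are invented for this sketch; they are not XCF identifiers.
DEMISED_CONDITIONS = frozenset({
    "non_restartable_disabled_wait",
    "load",
    "reset",  # covers equivalents: system reset, checkstop, power-down
})


def is_demised(condition: str) -> bool:
    """Return True if the condition is one the text lists as demised."""
    return condition in DEMISED_CONDITIONS


def default_sum_action(
    policy_active: bool,
    at_least_v1r11: bool,
    sending_xcf_signals: bool,
) -> Optional[SumAction]:
    """Model the response to a status update missing (SUM) condition.

    Returns None for the one combination the text does not describe.
    """
    if policy_active:
        # An active SFM or PR/SM policy takes precedence over the default.
        return SumAction.POLICY_DEFINED
    if not at_least_v1r11:
        # Before z/OS V1R11 the default is to prompt the operator.
        return SumAction.PROMPT_OPERATOR
    if not sending_xcf_signals:
        # As of z/OS V1R11: SUM and no XCF signals means immediate isolation.
        return SumAction.ISOLATE
    # SUM but still sending XCF signals: not covered by the text above.
    return None
```

For example, `default_sum_action(policy_active=False, at_least_v1r11=True, sending_xcf_signals=False)` yields `SumAction.ISOLATE`, which corresponds to the z/OS V1R11 default described above.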
Other sysplex and couple data set failures, such as those caused
by a power failure, might require operator intervention. See Handling concurrent system and couple data set failures.