The Parallel Sysplex® solution satisfies a major customer requirement for continuous
24-hour-a-day, 7-day-a-week availability, while providing techniques for achieving
simplified systems management consistent with this requirement. Some of the
features of the Parallel Sysplex solution that contribute to increased availability
also help to eliminate some systems management tasks. Examples include:
- Workload management (WLM) component
- The workload management (WLM) component of z/OS® provides sysplex-wide workload management
capabilities based on installation-specified performance goals and the business
importance of the workloads. WLM tries to attain the performance goals through
dynamic resource distribution. WLM provides the Parallel Sysplex cluster with
the intelligence to determine where work needs to be processed and in what
priority. The priority is based on the customer's business goals and is managed
by sysplex technology.
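As a rough illustration of goal-based management, the sketch below picks which workload should receive resources next. It is a toy model under stated assumptions, not WLM's actual algorithm: the `Workload` class, the `next_to_help` function, the importance scale (1 = most important), and the performance index computed as actual response time divided by goal are all simplifications introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload description; not a real WLM data structure."""
    name: str
    importance: int           # assumed scale: 1 = most important
    goal_response_s: float    # installation-specified response time goal
    actual_response_s: float  # measured response time

    @property
    def performance_index(self) -> float:
        # Assumed metric: a value above 1.0 means the goal is being missed.
        return self.actual_response_s / self.goal_response_s


def next_to_help(workloads: Iterable[Workload]) -> Optional[Workload]:
    """Pick the most important workload that is missing its goal.

    Ties on importance go to the workload missing its goal by the
    widest margin. Returns None when every workload meets its goal.
    """
    missing = [w for w in workloads if w.performance_index > 1.0]
    if not missing:
        return None
    return min(missing, key=lambda w: (w.importance, -w.performance_index))


workloads = [
    Workload("batch", importance=3, goal_response_s=60.0, actual_response_s=180.0),
    Workload("online", importance=1, goal_response_s=0.5, actual_response_s=0.6),
    Workload("query", importance=2, goal_response_s=2.0, actual_response_s=1.0),
]
chosen = next_to_help(workloads)
print(chosen.name if chosen else "all goals met")  # prints "online"
```

Here the batch workload misses its goal by the widest margin, but the online workload is chosen because business importance is considered first.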
- Sysplex Failure Manager (SFM)
- The Sysplex Failure Management policy allows the installation to specify
failure detection intervals and recovery actions to be initiated in the event
of the failure of a system in the sysplex.
Without SFM, when one of the systems in the Parallel Sysplex fails, the operator
is notified and prompted to take some recovery action. The operator may choose to
partition the non-responding system from the Parallel Sysplex, or to take some
action to try to recover the system. This period of operator intervention might
tie up critical system resources required by the remaining active systems. With
Sysplex Failure Manager, the installation codes a policy that defines the recovery
actions to be initiated automatically when specific types of problems are detected.
Such actions include fencing off the failed image so that it cannot access shared
resources, deactivating a logical partition, and acquiring central storage and
expanded storage.
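The difference between operator-driven and policy-driven recovery can be sketched as follows. This is a minimal model, not SFM policy syntax: the `RecoveryPolicy` class, the `choose_recovery` function, and the failure-type and action strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

PROMPT_OPERATOR = "prompt operator"


@dataclass(frozen=True)
class RecoveryPolicy:
    """Hypothetical stand-in for an installation-coded recovery policy."""
    detection_interval_s: int  # silence tolerated before a system counts as failed
    actions: Dict[str, str] = field(default_factory=dict)  # failure type -> action


def is_failed(seconds_silent: int, policy: RecoveryPolicy) -> bool:
    """A system is treated as failed once it exceeds the detection interval."""
    return seconds_silent >= policy.detection_interval_s


def choose_recovery(failure_type: str, policy: Optional[RecoveryPolicy] = None) -> str:
    """Return the recovery action for a detected failure.

    Without a policy, or for a failure type the policy does not cover,
    recovery waits on the operator.
    """
    if policy is None:
        return PROMPT_OPERATOR
    return policy.actions.get(failure_type, PROMPT_OPERATOR)


# Illustrative policy; the keys and values are assumptions, not SFM keywords.
policy = RecoveryPolicy(
    detection_interval_s=85,
    actions={
        "system unresponsive": "fence off failed image",
        "partition must be reused": "deactivate logical partition",
    },
)

print(choose_recovery("system unresponsive"))          # prints "prompt operator"
print(choose_recovery("system unresponsive", policy))  # prints "fence off failed image"
```

With the policy in place, the action is selected without waiting for a person, which is the period during which shared resources would otherwise stay tied up.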
- Automatic Restart Manager (ARM)
- Automatic Restart Manager enables fast recovery of subsystems that might
hold critical resources at the time of failure. If other instances of the
subsystem in the Parallel Sysplex need any of these critical
resources, fast recovery will make these resources available more quickly.
Although automation packages are already used to restart a subsystem and release
the resources it holds, ARM can be activated closer to the time of failure.
ARM reduces operator intervention in the following areas:
- Detection of the failure of a critical job or started task
- Automatic restart after a started task or job failure
After an abend
of a job or started task, the job or started task can be restarted with specific
conditions, such as overriding the original JCL or specifying job dependencies,
without relying on the operator.
- Automatic redistribution of work to an appropriate system following a
system failure
This removes the time-consuming step of human evaluation
of the most appropriate target system for restarting work.
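The two restart cases above (restart in place after an abend, restart elsewhere after a system failure) can be sketched as a small decision function. This is an illustrative model only: `Element`, `plan_restart`, the command strings, and the use of free capacity as the selection criterion are assumptions made for this sketch, not ARM interfaces.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Element:
    """Hypothetical description of a job or started task to be restarted."""
    name: str
    system: str                            # system the element was running on
    start_command: str                     # how it was originally started
    restart_command: Optional[str] = None  # optional override used on restart


def plan_restart(
    element: Element,
    free_capacity: Dict[str, int],
    failed_system: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (target system, command) for restarting an element.

    If the element's own system is still up, restart in place.
    If that system failed, pick the surviving system with the most
    free capacity (an assumed criterion for this sketch).
    """
    command = element.restart_command or element.start_command
    if failed_system != element.system:
        return element.system, command
    survivors = {s: c for s, c in free_capacity.items() if s != failed_system}
    if not survivors:
        raise RuntimeError("no surviving system available for restart")
    return max(survivors, key=survivors.get), command


subsys = Element("SUBSYS1", "SYSA", "START SUBSYS1", "START SUBSYS1 RECOVERY")
capacity = {"SYSA": 10, "SYSB": 40, "SYSC": 25}

print(plan_restart(subsys, capacity))                        # ('SYSA', 'START SUBSYS1 RECOVERY')
print(plan_restart(subsys, capacity, failed_system="SYSA"))  # ('SYSB', 'START SUBSYS1 RECOVERY')
```

In both cases the target and the restart command are determined without operator evaluation.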
- Cloning and symbolics
- Cloning refers to replicating the hardware and software configurations
across the different physical servers in the Parallel Sysplex. That is,
an application that is going to take advantage of parallel processing might
have identical instances running on all images in the Parallel Sysplex. The hardware
and software supporting these applications could also be configured identically
on all systems in the Parallel Sysplex to reduce the amount of
work required to define and support the environment.
Symmetry, that is, configuring the systems identically, allows new systems to be
introduced and enables automatic workload distribution in the event of failure
or when an individual system is scheduled for maintenance. It also reduces the
amount of work required by the system programmer in setting up the environment.
Note that symmetry does not prevent systems from having unique configuration
requirements, such as the asymmetric attachment of printers and communications
controllers, or asymmetric workloads that do not lend themselves to the parallel
environment.
System symbolics are used to help manage cloning. z/OS provides support for
substitution values in startup parameters, JCL, system commands, and started tasks.
These values can be used in parameter and procedure specifications to allow unique
substitution when dynamically forming a resource name.
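To make the substitution idea concrete, the sketch below resolves one shared template into a different resource name on each system. It is a simplified model, not the z/OS substitution routine: the `substitute` function and the symbol values are hypothetical, and only the basic `&NAME.` form (an ampersand, a name, and an optional terminating period) is handled.

```python
import re
from typing import Dict

# Matches &NAME with an optional terminating period, for example "&SYSNAME."
_SYMBOL = re.compile(r"&([A-Z][A-Z0-9_]*)\.?")


def substitute(text: str, symbols: Dict[str, str]) -> str:
    """Replace each known &NAME. in text with its per-system value.

    Symbols that are not defined are left unchanged (an assumption
    made for this sketch).
    """
    def replace(match: "re.Match[str]") -> str:
        return symbols.get(match.group(1), match.group(0))

    return _SYMBOL.sub(replace, text)


# Illustrative per-system values; each cloned system defines its own.
SYSA = {"SYSNAME": "SYSA", "SYSCLONE": "SA", "SYSPLEX": "PLEX1"}
SYSB = {"SYSNAME": "SYSB", "SYSCLONE": "SB", "SYSPLEX": "PLEX1"}

template = "SYS1.&SYSNAME..LOGREC"  # first period ends the symbol, second is literal

print(substitute(template, SYSA))  # prints "SYS1.SYSA.LOGREC"
print(substitute(template, SYSB))  # prints "SYS1.SYSB.LOGREC"
```

Because the template is identical on every system, the same parameter or procedure definition can be shared by all clones while still producing unique resource names.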
- zSeries® resource sharing
- A number of base z/OS components use the shared storage of the IBM® coupling facility
as a medium for sharing component information for the purpose of multisystem
resource management. This exploitation, called IBM zSeries Resource Sharing, enables
sharing of physical resources such as files, tape drives, consoles, and catalogs,
with improvements in cost and performance and with simplified systems management.
This is not to be confused with Parallel Sysplex data sharing by the database
subsystems. zSeries Resource Sharing delivers immediate value even for customers
who are not using data sharing, through native system exploitation delivered with
the base z/OS software stack.

One of the goals of the Parallel Sysplex solution is to simplify systems management
by reducing the complexity of managing, operating, and servicing a Parallel Sysplex,
without requiring an increase in the number of support staff and without reducing
availability.