[z/OS]

Peer restart and recovery

The goal of every system is to have as little downtime as possible. Sometimes, however, system failures are inevitable. For example, a system failure might occur because the power unexpectedly goes out in your main system. When a system failure occurs, a restart action you can take is to restart on a peer system in the sysplex. This type of restart uses the peer restart and recovery function. Starting a server on a system to which it was not configured implicitly places it into peer restart and recovery mode.

Deprecated feature: Peer restart and recovery (PRR) functionality is deprecated. You should use the integrated high availability support for the transaction service subcomponent, instead of Peer Restart and Recovery for transaction recovery.

When you experience a main system failure that results in InDoubt transactions with unknown outcomes, you need to obtain those intended transactional outcomes (ideally correctly) before the data can be utilized again. Peer restart and recovery provides an automated means of accomplishing this by restarting the controller on a peer system so that the "locks" that block the data can be dropped and the outcomes determined. This is in contrast to how a system usually handles a failure by automatically rolling back.

If a failure occurs, automatic restart management:
  • Can restart the product and related servers on the same system, or
  • Can use the peer restart and recovery function to restart related servers on an alternate system in the cell.

    The server is not a recoverable resource manager. It is a recoverable communication manager. It has no recoverable locks of its own and it does not need to manage locks nor manage lock states in a log. It just needs to make sure that both callers and callees are connected in each of the communications sessions of a distributed transaction.

Peer restart and recovery restarts the controller on another system and goes through the transaction restart and recovery process so that we can assign outcomes to transactions that were in progress at the time of failure. During this transaction restart and recovery process, data might be temporarily inaccessible until the recovery process is complete. The restart and recovery process does not result in lost data.

Resource managers, such as DB2®, that were being accessed at the time of failure may hold locks that are scoped to a transaction UR (unit of recovery). Once an outcome has been assigned to a UR, the resource managers will, generally, drop those locks.