Path recovery

When you define your I/O configuration, many devices share common hardware components (such as channels, channel cards, switches, control unit ports, control unit adapters, and fiber-optic links). For example, all devices for a specific control unit definition share hardware components because they share channels and control unit ports. Therefore, when a hardware-related error occurs on a channel path, multiple devices are affected.

When an error occurs on a channel path, the system performs path recovery: it issues one or more recovery-related I/Os to test whether the channel path is still usable. If path recovery determines that the channel path is no longer usable, the path is removed (varied offline) from the affected device. Otherwise, the channel path remains online to the device.

Path recovery is typically performed one device at a time. When an error occurs on one device, only that device is processed; errors on other devices are processed independently, even if the devices share common hardware components. This can affect application performance because the application is delayed while the system performs path recovery and then retries the original I/O request. If the application uses multiple devices that share a failing or malfunctioning hardware component, each of those devices encounters its own error and goes through its own recovery, so the delays accumulate.
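The device-at-a-time behavior described above can be illustrated with a small conceptual model. This is only a sketch and not the system's actual implementation: the `Device` class, the `recover_path` and `run_io` functions, and the channel path IDs "40" and "41" are all hypothetical names chosen for illustration. It shows that when several devices share one failing channel path, each device triggers its own recovery pass.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical model of a device and the channel paths online to it."""
    name: str
    paths: set


def recover_path(device, chpid, path_usable):
    """Model one recovery pass for one device.

    If the path is found unusable, it is removed (varied offline) from
    this device only. Returns True when the path stays online.
    """
    if not path_usable:
        device.paths.discard(chpid)
    return path_usable


def run_io(devices, failing_chpid):
    """Drive I/O to each device in turn and count the recovery passes.

    Each device hits the error independently, so each one pays for its
    own recovery, even though the failing path is shared.
    """
    recoveries = 0
    for dev in devices:
        if failing_chpid in dev.paths:
            recover_path(dev, failing_chpid, path_usable=False)
            recoveries += 1
    return recoveries


# Four devices that share channel paths "40" and "41" (assumed values).
devices = [Device(f"DEV{i}", {"40", "41"}) for i in range(4)]
count = run_io(devices, "40")
```

In this model, `count` equals the number of devices that had the failing path online, which is why an application that spreads its I/O across many devices behind the same hardware component sees repeated delays.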

Additionally, certain types of path errors can be intermittent: an error occurs, but path recovery is successful, so the path is not removed from the device. This also affects performance because applications can encounter the same error repeatedly. If this occurs, you might need to manually remove the bad path or paths from the affected devices to stop the errors.

Specify PATH_SCOPE=CU to enable path recovery for all devices that are attached to the control unit, or PATH_SCOPE=DEVICE to enable it on a device-by-device basis. The default is PATH_SCOPE=DEVICE.
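The difference between the two PATH_SCOPE values can be sketched as a choice of which devices a single recovery action covers. The following is a conceptual illustration only, not the system's logic: the `PathScope` enum, the `devices_in_scope` function, and the control unit and device names are hypothetical. Only the CU and DEVICE values and the DEVICE default come from the text above.

```python
from enum import Enum


class PathScope(Enum):
    """The two PATH_SCOPE values named in the text."""
    DEVICE = "DEVICE"
    CU = "CU"


def devices_in_scope(failing_device, control_units, scope=PathScope.DEVICE):
    """Return the devices that one path-recovery action applies to.

    control_units maps a control unit name to the list of devices
    attached to it (a hypothetical structure for this sketch).
    """
    if scope is PathScope.DEVICE:
        # Device-by-device: only the device that saw the error.
        return [failing_device]
    # Control unit scope: every device attached to the same control unit.
    for attached in control_units.values():
        if failing_device in attached:
            return list(attached)
    # Device not found under any control unit: fall back to the device itself.
    return [failing_device]


# Assumed example configuration.
control_units = {"CU0": ["D0", "D1", "D2"], "CU1": ["D3"]}
by_device = devices_in_scope("D1", control_units)
by_cu = devices_in_scope("D1", control_units, PathScope.CU)
```

With the default scope, `by_device` contains only the device that reported the error. With CU scope, `by_cu` contains every device attached to the same control unit, so one recovery action stands in for the separate per-device recoveries that would otherwise follow.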

Used together with the PATH_THRESHOLD and PATH_INTERVAL options, PATH_SCOPE=CU reduces the elapsed time that the system takes to recover from channel path-related errors. It also helps prevent the system performance problems that can occur when a significant amount of time is spent in repetitive channel path error recovery.