Preparing for disk failures

Because your data is spread across your disks, it is important that you consider how to protect your data if one of those disks fails. Disk protection helps ensure the availability of data stored on the disks.

Disk storage is the storage that is either internal to your system or is attached to it. This disk space, together with your system's main memory, is regarded by your system as one large storage area. When you save a file, you do not assign it to a storage location; instead, the system places the file in the location that ensures the best performance. It might spread the data in the file across multiple disk units. When you add more records to the file, the system assigns additional space on one or more disk units. This way of addressing storage is known as single-level storage.

In addition to internal disk storage, you can also use IBM® System Storage DS® products to attach a large volume of external disk units. These storage products provide enhanced disk protection, the ability to copy data quickly and efficiently to other storage servers, and the capability of assigning multiple paths to the same data so that the failure of a single connection does not cut off access to that data. For additional information about IBM System Storage DS products and to determine whether this solution is right for you, see Enterprise disk storage.

Device parity protection

Device parity protection allows your system to continue to operate when a disk fails or is damaged. When you use device parity protection, the disk input/output adapter (IOA) calculates and saves a parity value for each bit of data. The IOA computes the parity value from the data at the same location on each of the other disk units in the device parity set. When a disk failure occurs, the data can be reconstructed by using the parity value and the values of the bits in the same locations on the other disks. Your system continues to run while the data is being reconstructed.
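The reconstruction described above can be illustrated with a minimal sketch. It assumes the parity value is a bitwise exclusive OR (XOR) across the disk units, which is the usual technique for single-parity sets; the text does not state how the IOA computes parity internally. The function name xor_blocks and the sample byte values are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per disk unit in the parity set.
data = [b"\x0f\xa0\x33", b"\xf0\x0a\x55", b"\x3c\xc3\x99"]

# Assumption: parity is the XOR of the data at the same location on each disk.
parity = xor_blocks(data)

# Simulate losing the second disk, then rebuild its contents from the
# parity value and the data on the surviving disks.
survivors = [data[0], data[2]]
rebuilt = xor_blocks(survivors + [parity])

assert rebuilt == data[1]
```

Because XOR is its own inverse, combining the parity block with every surviving block yields exactly the missing block, which is why a single failure in the set can be tolerated.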

IBM i supports two types of device parity protection:

RAID 5

With RAID 5, the system can continue to operate if one disk fails in a parity set. If more than one disk fails, data will be lost and you must restore the data for the entire system (or only the affected disk pool) from the backup media. Logically, the capacity of one disk is dedicated to storing parity data in a parity set consisting of 3 to 18 disk units.

RAID 6

With RAID 6, the system can continue to operate if one or two disks fail in a parity set. If more than two disk units fail, you must restore the data for the entire system (or only the affected disk pool) from the backup media. Logically, the capacity of two disk units is dedicated to storing parity data in a parity set consisting of 4 to 18 disk units.
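The capacity rules for the two parity types can be worked through with a short calculation. This is a sketch only: it assumes every disk unit in the parity set has the same size, and the names RAID_RULES and usable_capacity_gb are hypothetical. The set-size limits and parity overhead are taken from the descriptions above.

```python
# Parity set size limits and parity overhead as stated in the text above.
RAID_RULES = {
    "RAID 5": {"min_disks": 3, "max_disks": 18, "parity_disks": 1},
    "RAID 6": {"min_disks": 4, "max_disks": 18, "parity_disks": 2},
}


def usable_capacity_gb(level, disk_count, disk_size_gb):
    """Return the logical capacity left for data in one parity set.

    Assumption: all disk units in the set have the same size.
    """
    rule = RAID_RULES[level]
    if not rule["min_disks"] <= disk_count <= rule["max_disks"]:
        raise ValueError(
            f"{level} parity sets need {rule['min_disks']} to "
            f"{rule['max_disks']} disk units, got {disk_count}"
        )
    return (disk_count - rule["parity_disks"]) * disk_size_gb


# Eight hypothetical 100 GB disk units under each parity type.
raid5_gb = usable_capacity_gb("RAID 5", 8, 100)
raid6_gb = usable_capacity_gb("RAID 6", 8, 100)
```

With these sample values, RAID 5 leaves the capacity of seven disk units for data and RAID 6 leaves the capacity of six, which is the cost of tolerating a second disk failure.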

Write cache and auxiliary write cache IOA

When the system sends a write operation, the data is first written to the write cache on the disk IOA and then later written to the disk. If the IOA experiences a failure, the data in the cache might be lost and cause an extended outage to recover the system.

The auxiliary write cache is an additional IOA that has a one-to-one relationship with a disk IOA. The auxiliary write cache protects against extended outages due to the failure of a disk IOA or its cache by providing a copy of the write cache which can be recovered following the repair of the disk IOA. This avoids a potential system reload and gets the system back online as soon as the disk IOA is replaced and the recovery procedure completes. However, the auxiliary write cache is not a failover device and cannot keep the system operational if the disk IOA (or its cache) fails.

Hot-spare disks

A disk designated as a hot-spare disk is used when another disk that is part of a parity set on the same IOA fails. The hot-spare disk joins the parity set, and the IOA starts rebuilding the data for this disk without user intervention. Because the rebuild operation occurs without having to wait for a new disk to be installed, the time that the parity set is exposed is greatly reduced. See Hot spare protection for additional information.

Mirrored protection

Disk mirroring is recommended to provide the best system availability and the maximum protection against disk-related component failures. Data is protected because the system keeps two copies of the data on two separate disk units. When a disk-related component fails, the system can continue to operate without interruption by using the mirrored copy of the data until the failed component is repaired.
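The behavior described above can be modeled with a small sketch. It is a simplified illustration, not a description of how the system implements mirroring: the class MirroredPair, its methods, and the use of dictionaries to stand in for disk units are all hypothetical.

```python
class MirroredPair:
    """Toy model of two disk units that hold identical copies of the data."""

    def __init__(self):
        self.units = [{}, {}]        # one dictionary per disk unit
        self.active = [True, True]   # False once a unit has failed

    def write(self, key, value):
        # Every write goes to each unit that is still active.
        for unit, ok in zip(self.units, self.active):
            if ok:
                unit[key] = value

    def read(self, key):
        # Read from the first unit that is still active.
        for unit, ok in zip(self.units, self.active):
            if ok:
                return unit[key]
        raise IOError("both mirrored units have failed")

    def fail(self, index):
        self.active[index] = False

    def repair(self, index):
        # Synchronize the repaired unit from the surviving copy.
        other = 1 - index
        if not self.active[other]:
            raise IOError("no surviving copy to synchronize from")
        self.units[index] = dict(self.units[other])
        self.active[index] = True


pair = MirroredPair()
pair.write("record-1", "original")
pair.fail(0)                          # first unit fails
value_after_failure = pair.read("record-1")
pair.write("record-2", "written while degraded")
pair.repair(0)                        # failed unit is repaired and resynchronized
```

In this model, reads and writes keep working after one unit fails, and the repaired unit receives the data written while it was unavailable.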

Different levels of mirrored protection are possible, depending on what hardware is duplicated. The level of mirrored protection determines whether the system keeps running when different levels of hardware fail. To understand these different levels of protection, see Determining the level of mirrored protection that you want.

You can duplicate the following disk-related hardware:

  • Disk unit
  • Disk controllers
  • I/O bus unit
  • I/O adapter
  • I/O processors
  • A bus
  • Expansion towers
  • High-speed link (HSL) ring

Hot-spare disks

A disk designated as a hot-spare disk is used when another disk that is mirror-protected fails. A hot-spare disk unit is stored on the system as a non-configured disk. When a disk failure occurs, the system exchanges the hot-spare disk unit with the failed disk unit. The exchange of a mirrored subunit with the hot-spare disk unit does not occur until mirrored protection has been suspended for 5 minutes and the replacement disk has been formatted. After the exchange occurs, the system synchronizes the data on the new disk unit. See Hot spare protection for additional information.

Independent disk pools

With independent disk pools (also called independent auxiliary storage pools), you can prevent certain unplanned outages because the data on them is isolated from the rest of your system. If an independent disk pool fails, your system can continue to operate on data in other disk pools. Combined with different levels of disk protection, independent disk pools provide more control in isolating the effect of a disk-related failure as well as better prevention and recovery techniques.

Cross-site mirroring

There are different varieties of cross-site mirroring. Geographic mirroring is a function that keeps two identical copies of an independent disk pool at two sites to provide high availability and disaster recovery. The copy owned by the primary node is the production copy and the copy owned by a backup node at the other site is the mirror copy. User operations and applications access the independent disk pool on the primary node that owns the production copy. Geographic mirroring is a sub-function of cross-site mirroring (XSM), which is a part of IBM i Option 41, High Available Switchable Resources.

Metro Mirror and Global Mirror are a combination of IBM System Storage DS copy services technology and IBM i clustering technology. The definition is similar to that of geographic mirroring; however, the System Storage® technology does the replication instead of IBM i.

Multipath disk units

You can define up to eight connections from each logical unit number (LUN) created on the IBM System Storage DS products to the input/output processors (IOPs) on the system. Assigning multiple paths to the same data allows the data to be accessed even if some of the other connections to the data fail. Each connection for a multipath disk unit functions independently. Several connections provide availability by allowing disk storage to be used even if a single path fails.
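The idea of independent paths to the same LUN can be sketched as follows. This is an illustrative model only: the class MultipathUnit, its methods, and the path names are hypothetical, and the selection rule (use the first working path) is an assumption rather than the system's actual path-selection behavior. The limit of eight connections comes from the text above.

```python
MAX_PATHS = 8  # up to eight connections per LUN, as stated above


class MultipathUnit:
    """Toy model of one LUN reachable over several independent paths."""

    def __init__(self, path_names):
        if not 1 <= len(path_names) <= MAX_PATHS:
            raise ValueError(f"a LUN can have 1 to {MAX_PATHS} paths")
        # Each path is tracked independently of the others.
        self.path_ok = {name: True for name in path_names}

    def mark_failed(self, name):
        self.path_ok[name] = False

    def pick_path(self):
        # Assumption: use the first path that is still working.
        for name, ok in self.path_ok.items():
            if ok:
                return name
        raise ConnectionError("no working path to the disk unit")


unit = MultipathUnit(["path-A", "path-B", "path-C"])
first_choice = unit.pick_path()
unit.mark_failed("path-A")            # one connection fails
after_failure = unit.pick_path()      # data is still reachable
```

The disk unit becomes unreachable in this model only when every defined path has failed.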