Data resilience

You can use several technologies to address the data resilience requirements that are described in the “Benefits of High Availability” section. The data resilience technologies are described in the following sections. Multiple technologies can be used in combination to further strengthen your data resiliency.

Logical replication

Logical replication is a widely deployed multisystem data resiliency topology for high availability (HA) in the IBM® i space. It is typically deployed through a product that is provided by a high availability independent software vendor (ISV). Replication is performed in software at the object level: changes to an object (for example, a file, member, data area, or program) are replicated to a backup copy. For all journaled objects, the replication is near real time, or in real time when synchronous remote journaling is used. Typically, if an object such as a file is journaled, replication is handled at the record level. For objects that are not journaled, such as user spaces, replication is typically handled at the object level. In this case, the entire object is replicated after each set of changes to the object is complete.

Most logical replication solutions allow for additional features beyond object replication. For example, you can achieve additional auditing capabilities, observe the replication status in real time, automatically add newly created objects to those being replicated, and replicate only a subset of objects in a given library or directory.

To build an efficient and reliable multisystem HA solution using logical replication, synchronous remote journaling as a transport mechanism is preferable. With remote journaling, IBM i continuously moves the newly arriving data in the journal receiver to the backup server journal receiver. At this point, a software solution is employed to “replay” these journal updates, placing them into the object on the backup server. After this environment is established, there are two separate yet identical objects, one on the primary server and one on the backup server.
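
The replay step can be pictured with a small model. The following is a minimal sketch, not the interface of any replication product: the JournalEntry and BackupReplayer names, the "PUT" and "DELETE" operations, and the sample file and keys are all hypothetical. It assumes that each journal entry carries a sequence number and that the replay process applies entries in sequence order to the backup copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalEntry:
    """One change captured on the primary (hypothetical layout)."""
    seq: int          # sequence number assigned on the primary
    file: str         # name of the journaled object
    key: str          # record key within the object
    op: str           # "PUT" or "DELETE" (illustrative operations)
    value: str = ""


class BackupReplayer:
    """Applies journal entries, in sequence order, to the backup copy."""

    def __init__(self):
        self.files = {}         # file name -> {record key: value}
        self.last_applied = 0   # highest sequence number replayed so far

    def replay(self, entries):
        for entry in sorted(entries, key=lambda e: e.seq):
            if entry.seq <= self.last_applied:
                continue  # already applied, so a retried batch is harmless
            records = self.files.setdefault(entry.file, {})
            if entry.op == "PUT":
                records[entry.key] = entry.value
            elif entry.op == "DELETE":
                records.pop(entry.key, None)
            else:
                raise ValueError(f"unknown operation: {entry.op}")
            self.last_applied = entry.seq


# Entries as they might arrive in the backup server journal receiver.
primary_journal = [
    JournalEntry(1, "ORDERS", "1001", "PUT", "widget x 2"),
    JournalEntry(2, "ORDERS", "1002", "PUT", "gadget x 1"),
    JournalEntry(3, "ORDERS", "1001", "DELETE"),
]

backup = BackupReplayer()
backup.replay(primary_journal)
print(backup.files)  # {'ORDERS': {'1002': 'gadget x 1'}}
```

Tracking the last applied sequence number is a design choice of this sketch. It lets the same batch be replayed twice without changing the backup copy, which is convenient when a transmission is retried.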

With this solution in place, you can rapidly activate your production environment on the backup server by doing a role-swap operation.

A key advantage of this solution category is that the backup database file is live. That is, it can be accessed in real time for backup operations or for other read-only application types such as building reports. In addition, that normally means minimal recovery is needed when switching over to the backup copy.

The challenge with this solution category is the complexity that can be involved with setting up and maintaining the environment. One of the fundamental challenges lies in strictly policing modification of the live copies of objects that reside on the backup server. Failure to enforce such discipline can lead to instances in which users and programmers make changes against the live copy so that it no longer matches the production copy. If this happens, the primary and the backup versions of your files are no longer identical.

Another challenge that is associated with this approach is that objects that are not journaled must go through a checkpoint, be saved, and then be sent separately to the backup server. Therefore, the granularity of the real-time nature of the process might be limited to the granularity of the largest object being replicated for a given operation.

For example, a program updates a record residing within a journaled file. As part of the same operation, it also updates an object, such as a user space, that is not journaled. The backup copy becomes completely consistent only when the user space is entirely replicated to the backup system. Practically speaking, if the primary system fails and the user space object is not yet fully replicated, a manual recovery process is required to reconcile the state of the non-journaled user space to match the last valid operation whose data was completely replicated.

Logical replication solutions can typically cover all types of outages, depending on the implementation. Recovery point objective (RPO) can be 0 if the distance between systems allows for synchronous remote journaling and all replicated objects are journaled. Using asynchronous remote journaling and having objects that must be replicated from the audit journal increases the RPO.

Another possible challenge that is associated with this approach lies in the latency of the replication process. This refers to the amount of lag time between the time at which changes are made on the source system and the time at which those changes become available on the backup system. Synchronous remote journaling can mitigate this to a large extent. Regardless of the transmission mechanism that is used, you must adequately project your transmission volume and size your communication lines and speeds properly to help ensure that your environment can manage replication volumes when they reach their peak. In a high volume environment, replay backlog and latency might be an issue on the target side even if your transmission facilities are properly sized.
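
The sizing arithmetic can be sketched as follows. This is an illustrative calculation only: the function names, the 75% target utilization, and the sample rates are assumptions chosen for the example, not measured or recommended values. It assumes that journal volume is expressed in megabytes per second and line speed in megabits per second.

```python
def required_link_mbps(peak_journal_mb_per_s, target_utilization=0.7):
    """Line speed (megabits/s) needed to carry the peak journal volume."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # 8 bits per byte, with headroom so the line is not run at 100%.
    return peak_journal_mb_per_s * 8 / target_utilization


def replay_backlog_mb(arrival_mb_per_s, replay_mb_per_s, peak_seconds):
    """Journal data still unapplied on the target when the peak ends."""
    return max(0.0, (arrival_mb_per_s - replay_mb_per_s) * peak_seconds)


def drain_seconds(backlog_mb, replay_mb_per_s, offpeak_arrival_mb_per_s):
    """Time to work off a backlog once arrivals drop below the replay rate."""
    spare = replay_mb_per_s - offpeak_arrival_mb_per_s
    if spare <= 0:
        raise ValueError("replay rate must exceed the off-peak arrival rate")
    return backlog_mb / spare


# Hypothetical workload: 12 MB/s of journal data during a 30-minute peak,
# a target that can replay 10 MB/s, and 4 MB/s of arrivals after the peak.
print(required_link_mbps(12.0, target_utilization=0.75))  # 128.0
backlog = replay_backlog_mb(12.0, 10.0, peak_seconds=1800)
print(backlog)                                            # 3600.0
print(drain_seconds(backlog, 10.0, 4.0))                  # 600.0
```

In this example the line is sized correctly for the peak, yet the target still falls 3600 MB behind because replay is slower than arrival. The backup copy is therefore about 10 minutes late in catching up after the peak ends.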

Hardware replication

Hardware replication is done at the operating system or disk level instead of at the object level. An advantage of these technologies over logical replication is that the replication is done at a lower level, and when done synchronously, there is a guarantee that both copies of the data are identical. The disadvantage of the technology is that the data is only accessible from one copy, and the second copy cannot be used during active replication.

Within hardware replication, there are two categories: independent auxiliary storage pool (IASP) replication and full system replication. IBM PowerHA® SystemMirror® for i delivers several hardware replication technologies based on independent auxiliary storage pools, or IASPs. An IASP is a set of disk units that can be configured separately from a specific host system and can be independently varied on or off. An IASP is used to segregate application data from the operating system. Thus, the application data can be replicated by using hardware replication while not replicating the operating system. The IBM i implementation of IASPs supports both directory objects (such as the integrated file system (IFS)) and library objects (such as database files).

While migrating the application data into the IASP is a separate step in setting up the environment, there are several advantages to replicating only the data and not the operating system. Planned and unplanned switches to the backup system are faster than if the entire system is replicated. The backup system contains a separate copy of the operating system and can be used for other work while it is also used as a backup system for production. These technologies can also be used for planned operating system upgrades, because there are two copies of the operating system.

If migrating the application data into an IASP is not feasible, it is also possible to use hardware replication at the system level, typically called full system replication. Geographic mirroring, which is an IBM i replication technology, can be used in an IBM i hosted environment to replicate a production system. The replication technologies that are provided by the IBM storage systems can also be used to replicate an entire system. While easier to set up initially, full system replication requires more bandwidth than IASP-based replication. Full system replication is considered more of a disaster recovery technology than a high availability technology, because there is only one production environment and it must be IPLed on another physical system for a planned or unplanned outage. Tools and service agreements are available from IBM Lab Services to help automate and customize a full system replication environment.

Switchable device

A switchable device is a collection of hardware resources such as disk units, communication adapters, and tape devices that can be switched from one system to another. For data resilience, the disk units can be configured into a special class of auxiliary storage pool (ASP) that is independent of a particular host system. The practical outcome of this architecture is that switching an independent disk pool from one system to another involves less processing time than a full initial program load (IPL). The IBM i implementation of independent disk pools supports both directory objects (such as the integrated file system (IFS)) and library objects (such as database files). This is commonly referred to as switched disks.

The benefit of using independent disk pools for data resiliency lies in their operational simplicity. The single copy of data is always current, meaning there is no other copy with which to synchronize. No in-flight data, such as data that is transmitted asynchronously, can be lost, and there is minimal performance overhead. Role swapping or switching is relatively straightforward, although you might need to account for the time that is required to vary on the independent disk pool.

Another key benefit of using independent disk pools is the absence of transmission latency, which can affect any replication-based technology. The major effort that is associated with this solution involves setting up the direct-access storage device (DASD) configuration, the data, and the application structure. Making an independent disk pool switchable is relatively simple.

Limitations are also associated with the independent disk pool solution. First, there is only one logical copy of the data in the independent disk pool. This can be a single point of failure, although the data should be protected by using RAID 5, RAID 6, RAID 10, or mirroring. The data cannot be concurrently accessed from both hosts, so operations such as read access or backup to tape cannot be done from the backup system. Certain object types, such as configuration objects, cannot be stored in an independent disk pool. You need another mechanism, such as periodic save and restore operations, a cluster administrative domain, or logical replication, to ensure that these objects are appropriately maintained.

Another limitation involves hardware-associated restrictions. An example would be outages that are associated with certain hardware upgrades. The independent disk pool cannot be brought online to an earlier system. With this in mind, up-front system environment design and analysis are essential.

Switched logical unit (LUN)

Switched logical units allow data that is stored in an independent disk pool, built from logical units that are created in an IBM System Storage® unit, to be switched between systems, providing high availability.

A switched logical unit is an independent disk pool that is controlled by a device cluster resource group and can be switched between nodes within a cluster. When switched logical units are combined with IBM i cluster technology, you can create a simple and cost-effective high availability solution for planned and some unplanned outages.

The device cluster resource group (CRG) controls the independent disk pool, which can be switched automatically in the case of an unplanned outage or switched manually with a switchover.
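
The difference between a manual switchover and an automatic failover can be illustrated with a toy model. This is a sketch under stated assumptions, not the cluster resource services interface: the DeviceCRG class, its methods, and the node names SYSA, SYSB, and SYSC are hypothetical. It assumes that the CRG keeps an ordered list of nodes with the current primary first, and that the first available backup takes over the independent disk pool.

```python
class DeviceCRG:
    """Toy model of a device CRG that owns one independent disk pool."""

    def __init__(self, nodes):
        if len(nodes) < 2:
            raise ValueError("need a primary and at least one backup node")
        self.order = list(nodes)  # order[0] currently owns the disk pool
        self.failed = set()       # nodes known to be unavailable

    @property
    def primary(self):
        return self.order[0]

    def _first_active_backup(self):
        for node in self.order[1:]:
            if node not in self.failed:
                return node
        raise RuntimeError("no active backup node available")

    def _move_to(self, new_primary):
        # Nodes ahead of the new primary move to the end of the order.
        idx = self.order.index(new_primary)
        self.order = self.order[idx:] + self.order[:idx]

    def switchover(self):
        """Planned, operator-initiated move to the first active backup."""
        self._move_to(self._first_active_backup())
        return self.primary

    def failover(self, failed_node):
        """Unplanned outage: switch only if the failed node was the primary."""
        self.failed.add(failed_node)
        if failed_node == self.primary:
            self._move_to(self._first_active_backup())
        return self.primary


crg = DeviceCRG(["SYSA", "SYSB", "SYSC"])
print(crg.switchover())      # SYSB (planned switch)
print(crg.failover("SYSB"))  # SYSC (primary failed, next backup takes over)
print(crg.failover("SYSA"))  # SYSC (a backup failed, no switch needed)
```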

A group of systems in a cluster can take advantage of the switchover capability to move access to the switched logical unit pool from system to system. A switchable logical unit must be in an IBM System Storage unit that is connected through a storage area network. Switched logical units operate similarly to switched disks, but hardware is not switched between logical partitions. When the independent disk pool is switched, the logical units within the IBM System Storage unit are reassigned from one logical partition to another.

Geographic Mirroring

Geographic mirroring is a function of the IBM i operating system. All the data that is placed in the production copy of the independent disk pool is mirrored to a second independent disk pool on a second, perhaps remote system.

The benefits of this solution are essentially the same as the basic switchable device solution with the added advantage of providing disaster recovery to a second copy at increased distance. The biggest benefit continues to be operational simplicity. The switching operations are essentially the same as that of the switchable device solution, except that you switch to the mirror copy of the independent disk pool, making this a straightforward HA solution to deploy and operate. As in the switchable device solution, objects not in the independent disk pool must be handled by some other mechanism and the independent disk pool cannot be brought online to an earlier system. Geographic mirroring also provides real-time replication support for hosted integrated environments such as Microsoft Windows and Linux. This is not generally possible through journal-based logical replication.

Because geographic mirroring is implemented as a function of IBM i, a potential limitation of a geographic mirroring solution is its performance impact in certain workload environments.

When running input/output (I/O) intensive batch jobs, some performance degradation on the primary system is possible. Also, be aware of the increased central processing unit (CPU) overhead that is required to support geographic mirroring. In addition, the backup copy of the independent disk pool cannot be accessed while data synchronization is in progress. For example, if you want to back up to tape from the geographically mirrored copy, you must quiesce operations on the source system and detach the mirrored copy. Then you must vary on the detached copy of the independent disk pool on the backup system, perform the backup procedure, and then reattach the independent disk pool to the original production host. Synchronization of the data that was changed while the independent disk pool was detached is then performed. While you perform the backup and while synchronization is occurring, your HA solution is running exposed, meaning there is no up-to-date second data set. Using source and target side tracking minimizes this exposure.
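
One way to see why tracking shortens the exposure is to model the detach and reattach cycle. The sketch below is a simplification built on an assumption: that tracking means remembering which pages changed on the production copy while the mirror copy was detached, so that only those pages are copied at reattach. The MirroredPool class and its methods are hypothetical, and changes made on the detached copy (the target side) are not modeled.

```python
class MirroredPool:
    """Toy model of a mirrored disk pool with change tracking."""

    def __init__(self, pages):
        self.production = dict(pages)
        self.mirror = dict(pages)
        self.attached = True
        self.changed = set()  # pages written on production while detached

    def write(self, page, data):
        self.production[page] = data
        if self.attached:
            self.mirror[page] = data   # mirrored as part of the write
        else:
            self.changed.add(page)     # remembered for the later resync

    def detach(self):
        """Detach the mirror copy, for example to run a backup from it."""
        self.attached = False

    def reattach(self):
        """Resynchronize only the tracked pages; return how many were copied."""
        copied = len(self.changed)
        for page in self.changed:
            self.mirror[page] = self.production[page]
        self.changed.clear()
        self.attached = True
        return copied


pool = MirroredPool({page: "v0" for page in range(1000)})
pool.detach()
pool.write(3, "a")
pool.write(7, "b")
pool.write(3, "c")       # same page again, still tracked once
print(pool.reattach())   # 2 pages copied instead of all 1000
print(pool.mirror == pool.production)  # True
```

Without the tracked set, the only safe way to bring the mirror copy up to date in this model would be to copy all 1000 pages, which lengthens the time that the solution runs exposed.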

Metro Mirror

Metro Mirror is a function of the IBM System Storage Server. The data that is stored in independent disk pools is on disk units that are in the System Storage Server. This solution involves replication at the hardware level to a second storage server by using IBM System Storage Copy Services. An independent disk pool is the basic unit of storage for the System Storage Peer-to-Peer Remote Copy (PPRC) function. PPRC provides replication of the independent disk pool to another System Storage Server. IBM i provides a set of functions to combine PPRC, independent disk pools, and IBM i cluster resource services for coordinated switchover and failover processing through a device cluster resource group (CRG).

You can combine this solution with other System Storage based copy services functions, including FlashCopy®, for save window reduction.

Metro Mirror data transfer is done synchronously. As with any solution that uses synchronous communications, you must be aware of the distance limitations and bandwidth requirements that are associated with transmission times.
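
The effect of distance on a synchronous write can be estimated with simple arithmetic. The following sketch rests on stated assumptions: light travels roughly 200 km per millisecond in optical fiber, each write needs one round trip between the storage servers, and the function names, the 0.5 ms local write time, and the sample distances are illustrative values rather than product characteristics.

```python
FIBER_KM_PER_MS = 200.0  # approximate speed of light in optical fiber


def sync_write_penalty_ms(distance_km, round_trips=1, equipment_ms=0.0):
    """Extra time per write spent waiting for the remote acknowledgment."""
    travel_ms = round_trips * 2 * distance_km / FIBER_KM_PER_MS
    return travel_ms + equipment_ms


def max_serial_writes_per_s(local_write_ms, distance_km):
    """Upper bound for one stream of writes issued strictly one at a time."""
    total_ms = local_write_ms + sync_write_penalty_ms(distance_km)
    return 1000.0 / total_ms


print(sync_write_penalty_ms(100))  # 1.0 ms added at 100 km
print(sync_write_penalty_ms(300))  # 3.0 ms added at 300 km
print(round(max_serial_writes_per_s(0.5, 100)))  # single-stream ceiling at 100 km
```

Because the penalty grows linearly with distance and applies to every write, it is the transmission time, not only the bandwidth, that limits how far apart synchronously mirrored storage servers can be placed.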

Global Mirror

Global Mirror uses the same base technology as Metro Mirror except the transmission of data is done in an asynchronous manner and FlashCopy to a third set of disks is required to maintain data consistency. Because this data transmission is asynchronous, there is no limit to how geographically dispersed the System Storage servers can be from each other.

DS8000 Full System HyperSwap

DS8000® Full System HyperSwap® is a single system solution that is a function of the IBM System Storage Server. The data that is stored on the system is on disk units that are in the IBM System Storage Server. The logical partition has access to two IBM System Storage Servers that are using the IBM System Storage Copy Services Peer-to-Peer Remote Copy (PPRC) Metro Mirror function. IBM i provides the ability for the system to switch between the DS8000 servers for planned and unplanned storage side outages without losing access to the data during the switch.

Because HyperSwap uses the DS8000 Metro Mirror function, data transfer is done synchronously. As with any solution that uses synchronous communications, you must be aware of the distance limitations and bandwidth requirements that are associated with transmission times.

See the IBM PowerHA SystemMirror for i wiki for a graphic on Full System HyperSwap.

DS8000 HyperSwap with IASP

DS8000 HyperSwap with independent auxiliary storage pools (IASPs) is a function of PowerHA and the IBM System Storage Server. PowerHA provides the ability to combine HyperSwap technology with the PowerHA LUN level switching technology for not only storage level availability, but also server and partition level availability, making HyperSwap a complete high availability solution. You can configure the system ASPs and any user ASPs with HyperSwap, so that the ASPs reside in one DS8000 and are mirrored to a second DS8000.