Geographically dispersed DB2® pureScale® clusters (GDPC) provide high availability and disaster recovery, allowing the cluster to continue processing work when a cluster member, a system, or an entire site goes down.
Geographically dispersed DB2 pureScale clusters (GDPC) can automatically and transparently recover from the same hardware or software failures as a single-site DB2 pureScale cluster. In addition, because a GDPC spans multiple physical sites, a GDPC can also automatically and transparently recover from hardware failures that traditionally affect an entire site, for example, localized power outages or localized network disruptions.
The estimated time for a GDPC to recover from software faults is comparable to the recovery time for software faults in a single-site DB2 pureScale cluster. As with non-dispersed pureScale clusters, if SCSI-3 PR is not being used, there is a slightly longer impact to the workload for hardware failures that affect an entire system. Recovery time is dependent on many factors, such as the number of file systems, file size, and frequency of writes to the files.
Care must be taken to ensure that sufficient space is available for critical file systems such as /var and /tmp because a lack of space on these file systems might affect the operation of the cluster services.
For a single system failure, any members on that system are restarted in restart-light mode, either on other systems at the same site or on systems at the other site. Note that no preference is given to restarting the member in restart-light mode on another system at the same site. Although this might be the intuitive expectation, there is no benefit in terms of overall failure recovery time: the restarting member must communicate with members and CFs from both sites equally, so the same member failover logic is used. After a primary CF system failure, the primary CF role fails over to the secondary CF at the other site.
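To see where each member and CF is currently running after such a failure, the db2instance command can be issued from any host in the instance (hostA1 is simply one of the hosts in this example configuration):

db2inst1@hostA1> db2instance -list

A member that was restarted in restart-light mode typically shows a CURRENT_HOST that differs from its HOME_HOST, and remains in the WAITING_FOR_FAILBACK state until its home system becomes available again.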
db2inst1@hostA1>db2instance -list -sharedfs
There is currently an alert for the shared file system filesystem_name in the data-sharing instance. Critical data resides on disks that are suspended or being deleted.
There is currently an alert for the shared file system filesystem_name in the data-sharing instance. The file system is not properly replicated. Run the db2cluster command: db2cluster -cfs -rebalance filesystem_name.
db2inst1@hostA1> /home/db2inst1/sqllib/bin/db2cluster -cfs -rebalance -filesystem filesystem_name
The db2cluster -rebalance command is a very I/O-intensive operation and can have a significant impact on the running workload, so it is typically performed when the workload is at its lowest. However, this must be balanced against the requirement to re-enable full file system replication as soon as possible so that future storage, system, or site failures can be sustained.
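Once the rebalance completes, the same db2instance command that surfaced the alert can be rerun to confirm that the shared file system alert has cleared:

db2inst1@hostA1> db2instance -list -sharedfs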
Some disk I/O accesses can return to normal while others are still delayed, waiting for a specific disk to return an error. After all disks in a storage replica have been marked as failed, file system I/O times return to normal because GPFS stops replicating data writes to the failed disks. Note that even though the GDPC remains operational during this entire period, after some disks or an entire storage replica has failed, only a single copy of the file system data is available, which leaves the GDPC exposed to a single point of failure until the problem has been resolved and replication has been restarted.
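To identify which disks GPFS has marked as failed or suspended, the mmlsdisk command can be run against the affected file system (filesystem_name is a placeholder for the actual file system name):

root@hostA1> /usr/lpp/mmfs/bin/mmlsdisk filesystem_name -e

The -e option restricts the output to disks that do not have a status of ready and an availability of up, which makes it easy to see which replica is affected.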
As mentioned earlier, the storage failure recovery time depends on the storage controller's configuration, in particular how quickly the storage controller returns an error up to GPFS so that GPFS can mark the affected disks as inaccessible. By default, some storage controllers are configured either to retry indefinitely on errors or to delay reporting errors up the I/O stack for a lengthy period of time, sometimes even long enough to allow the storage controller to reboot. Although this is usually desirable when only one replica of storage is available (it avoids returning a file system error if the error is possibly recoverable at the storage layer), it increases the storage failure recovery time significantly. In some cases it makes the storage layer seem unresponsive, which might be enough to cause the rest of the cluster to assume that all members and CFs are also unresponsive, leading Tivoli® System Automation MP (Tivoli SA MP) to stop and restart them, which is undesirable. With GDPC, because there is a second replica of the data and a key requirement is automatic and transparent recovery from a wide variety of failures, including storage failures, the storage controller failure detection time should be reduced. A good starting point is to set the storage failure detection time to 20 seconds; the exact mechanism for doing this depends on the type of storage and storage controller being used. For an example of how to update the failure detection time for the AIX® MPIO multipath device driver, see Configuring the cluster for high availability.
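As an illustration only, with attribute names that vary by device driver and storage type, the AIX MPIO rw_timeout attribute of an hdisk used by the shared file systems could be inspected and lowered as follows (hdisk2 is a hypothetical device name; consult the storage vendor documentation before changing timeout values):

root@hostA1> lsattr -El hdisk2 -a rw_timeout
root@hostA1> chdev -l hdisk2 -a rw_timeout=20 -P

The -P flag records the change in the ODM so that it takes effect the next time the device is reconfigured or the system is restarted.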
root@hostA1>/usr/lpp/mmfs/bin/mmnsddiscover -a -N hostA1,hostA2,hostA3
root@hostA1>/usr/lpp/mmfs/bin/mmchdisk filesystem start -d gpfs_disk_identifier -N hostA1,hostA2,hostA3
root@hostA1>/usr/lpp/mmfs/bin/mmfsck filesystem -o
Note that the tiebreaker is not specified. To confirm that the disk has moved to the up state, use the mmlsdisk command.
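For example, the following command lists every disk in the file system along with its status and availability; after a successful mmchdisk start, the previously failed disks should again show a status of ready and an availability of up:

root@hostA1>/usr/lpp/mmfs/bin/mmlsdisk filesystem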
Consider a scenario where either site A or site B experiences a total failure, such as a localized power outage, and is expected to eventually come back online. This type of failure is handled automatically and transparently by GDPC. Systems on the surviving site independently perform restart-light member crash recovery, in parallel, for each of the members from the failed site. All members that were configured on the failed site remain in restart-light mode on guest systems at the surviving site until the members' home systems on the failed site have been recovered; that is, if only one member system on the failed site recovers, only the member configured on that system fails back to its home system. If the failed site contained the primary CF, the primary CF role automatically fails over to the secondary CF located at the surviving site.

During recovery, there is a period of time in which all write transactions are paused. Read transactions might be paused as well, depending on whether the data being read is already cached by the member and whether it is separate from data that was being updated at the time of the site failure; data that is not already cached by the member must be fetched from the CF, which is delayed until recovery is complete. The length of time that transactions are paused depends mainly on the time required for GPFS to perform file system recovery. File system recovery time is primarily influenced by the number of file systems as well as the frequency and size of file system write requests around the time of failure, so workloads with a higher ratio of updates might experience longer file system recovery times.
It is important that all the hosts on the surviving site, as well as host T, remain online; otherwise, quorum cannot be reached (to maintain majority quorum, access to all hosts on the surviving site plus the tiebreaker host is needed).
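To verify this, the GPFS node states can be checked from any surviving host; all hosts on the surviving site plus the tiebreaker host T should be reported as active:

root@hostA1> /usr/lpp/mmfs/bin/mmgetstate -a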
Consider a scenario where all connectivity is lost between site A and site B (for example, the dark fiber between sites is compromised, a switch fails, or an InfiniBand extender fails). To reduce the chance of this type of failure, redundant connectivity between site A and site B is a best practice.
If one site loses all connectivity with the other site and also loses connectivity to the tiebreaker site, this form of connectivity failure is identical to a site failure. The site that can still communicate with the tiebreaker site becomes the surviving site. Until connectivity is restored, all DB2 members from the systems at the failed site are restarted in restart-light mode on hosts at the surviving site, and the primary CF role is moved to the surviving site, if necessary.
root@hostA1> /usr/lpp/mmfs/bin/mmchmgr -c primary_cf_system
Because the location of the GPFS cluster manager can change, especially after a node reboot, it is monitored to ensure that it remains on the same site as the primary CF. If, instead of a connectivity loss between sites A and B, all connectivity with the tiebreaker site is lost from both sites, the tiebreaker host T is expelled from the cluster. Because no DB2 member or CF runs on host T, there is no immediate functional impact on the GDPC instance. However, in the event of a subsequent site failure, quorum is lost.
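To check which node currently holds the GPFS cluster manager role, and so confirm that it is on the same site as the primary CF, the mmlsmgr command can be used:

root@hostA1> /usr/lpp/mmfs/bin/mmlsmgr -c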