DB2 10.5 for Linux, UNIX, and Windows

Scenario: Deploying HADR in a DB2 pureScale environment

This scenario describes planning, configuring, and deploying a high availability disaster recovery (HADR) setup for an online travel service called ExampleFlightsExpress (EFE), which is currently using the DB2® pureScale® Feature. All of these steps can be completed without any downtime.

Background

EFE chose to use the DB2 pureScale Feature for two reasons:
  • Continuous availability. Downtime is fatal in the online retail business, where customers access services 24x7.
  • Scalability. Because customer demand ebbs and flows depending on the time of year, EFE must be able to add capacity easily and without taking an outage.
EFE does not have a complete disaster recovery plan. EFE’s setup is resilient unless there is a widespread outage that brings down the whole DB2 pureScale cluster. To address this shortcoming, EFE is going to use HADR, which is supported with the DB2 pureScale Feature. In a DB2 pureScale environment, HADR has a few limitations, such as no support for reads on standby and no support for SYNC or NEARSYNC synchronization modes, but those are acceptable because EFE wants HADR only for disaster recovery.

Planning

EFE is going to use the company's head office (site A) as the location for the HADR primary cluster and a regional office (site B), which is 500 miles (800 km) away, as the location for the standby cluster. The two sites are connected by a WAN. Other details about the environment are as follows:
  • Database name: hadr_db
  • Instance owner on all hosts: db2inst
  • TCP port that is used for HADR primary-standby communication: 4000
  • TCP port that is used for SQL client/server communication: 8000
  • Hosts for cluster caching facilities (with IDs 128 and 129) and members (with IDs 0, 1, 2, and 3) on the primary: cfp0, cfp1, p0, p1, p2, and p3
  • Hosts for cluster caching facilities and members on the standby: cfs0, cfs1, s0, s1, s2, and s3
Preferred replay members
Only one member on the standby performs all the replay for the logs that are generated on all the members on the primary. Therefore, EFE's DBA determines which member hosts in the primary and standby clusters have the most memory and CPU resources and designates those members as the preferred replay members. It is necessary to do this planning even for the primary because that designated member performs all the replay if the primary database fails and is reintegrated as the standby. On the primary, this is p0 and on the standby, it is s0; in both cases, member 0 is the resident member on those hosts.
Synchronization mode
EFE's DBA must choose between ASYNC (the default) and SUPERASYNC for the synchronization mode. To do this, the DBA analyzes the WAN and determines that network throughput is 300 Mbit/second and that the round-trip time is 80 ms. Next, the DBA measures the logging rate, which is 20 MB/second at the cluster level. The network throughput is sufficient to support the logging rate and allow peak logging to reach 37 MB/second, so ASYNC is a suitable mode. If the throughput were closer to the logging rate, SUPERASYNC would be a better choice because it would allow the primary to get far ahead of the standby during peak transaction time.
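The DBA's capacity check can be sketched as a quick shell calculation. The figures below are the ones from the analysis above (300 Mbit/second of network throughput, 20 MB/second logging rate); the script itself is only an illustration of the arithmetic, not a DB2 tool:

```shell
# Sketch of the capacity check, using the figures from the analysis above.
network_mbit=300          # measured WAN throughput, in megabits/second
logging_mbs=20            # measured cluster-level logging rate, in MB/second

# 8 bits per byte: 300 Mbit/s works out to 37.5 MB/s of capacity at best
capacity_mbs=$(awk "BEGIN { printf \"%.1f\", $network_mbit / 8 }")

echo "network capacity: ${capacity_mbs} MB/s, logging rate: ${logging_mbs} MB/s"
```

Because 37.5 MB/second comfortably exceeds the 20 MB/second logging rate, ASYNC is workable; the headroom above the steady rate is what allows logging to peak near 37 MB/second.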
Scaling considerations
Because EFE tends to add temporary members during peak times of the year, EFE must decide how to scale out the standby, because the member topology must be the same across the HADR pair. To avoid additional hardware costs, EFE decides that when it deploys additional members on the primary, it will place multiple members on the existing host machines in the standby cluster rather than add hosts. This approach would likely result in a less powerful database if the standby must take over as the new primary, but the savings are worth this potential drawback.

Configuring HADR

The DBA performs the following steps:
  1. The DBA takes a backup of the intended primary database, hadr_db:
    db2 BACKUP DB hadr_db TO backup_dir
  2. The DBA restores the backup onto the intended standby cluster by issuing the following command:
    db2 RESTORE DB hadr_db FROM backup_dir
  3. On the primary, the DBA sets the cluster-level HADR parameters that specify the standby cluster and the synchronization mode. Particularly important is the hadr_target_list parameter, which lists the remote members. Only one remote member needs to be listed in hadr_target_list; DB2 retrieves the other members' addresses after the initial contact with the listed member. However, providing multiple addresses avoids a single point of failure: if only one member were listed and that member were down, the clusters could not connect to each other. The DBA issues the following command:
    db2 "UPDATE DB CFG FOR hadr_db USING
         HADR_TARGET_LIST  {s0:4000|s1:4000|s2:4000|s3:4000}
         HADR_REMOTE_HOST  {s0:4000|s1:4000|s2:4000|s3:4000}
         HADR_REMOTE_INST  db2inst
         HADR_SYNCMODE     async"
    Because there is only one standby, the hadr_remote_host parameter specifies the same group of addresses as the hadr_target_list parameter.
  4. The DBA sets the member-level HADR parameters on the primary, which identify the address and port for each member:
    • For member 0:
      db2 "UPDATE DB CFG FOR hadr_db MEMBER 0 USING
           HADR_LOCAL_HOST   p0
           HADR_LOCAL_SVC    4000"
    • For member 1:
      db2 "UPDATE DB CFG FOR hadr_db MEMBER 1 USING
           HADR_LOCAL_HOST   p1
           HADR_LOCAL_SVC    4000"
    • For member 2:
      db2 "UPDATE DB CFG FOR hadr_db MEMBER 2 USING
           HADR_LOCAL_HOST   p2
           HADR_LOCAL_SVC    4000"
    • For member 3:
      db2 "UPDATE DB CFG FOR hadr_db MEMBER 3 USING
           HADR_LOCAL_HOST   p3
           HADR_LOCAL_SVC    4000"
  5. On the standby, the DBA sets the cluster-level HADR parameters that specify the primary cluster and the synchronization mode:
    db2 "UPDATE DB CFG FOR hadr_db USING
         HADR_TARGET_LIST  {p0:4000|p1:4000|p2:4000|p3:4000}
         HADR_REMOTE_HOST  {p0:4000|p1:4000|p2:4000|p3:4000}
         HADR_REMOTE_INST  db2inst
         HADR_SYNCMODE     async"
  6. The DBA sets the member-level HADR parameters on the standby, which identify the address and port for each member:
    • For member 0:
      db2 "UPDATE DB CFG FOR hadr_db MEMBER 0 USING
           HADR_LOCAL_HOST   s0
           HADR_LOCAL_SVC    4000"
    • For member 1:
      db2 "UPDATE DB CFG FOR hadr_db MEMBER 1 USING
           HADR_LOCAL_HOST   s1
           HADR_LOCAL_SVC    4000"
    • For member 2:
      db2 "UPDATE DB CFG FOR hadr_db MEMBER 2 USING
           HADR_LOCAL_HOST   s2
           HADR_LOCAL_SVC    4000"
    • For member 3:
      db2 "UPDATE DB CFG FOR hadr_db MEMBER 3 USING
           HADR_LOCAL_HOST   s3
           HADR_LOCAL_SVC    4000"
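Because the member-level updates in steps 4 and 6 differ only in the member ID and the host name, they could be scripted. A minimal sketch for the standby side that only prints each command (a dry run; the printed lines could then be fed to the db2 CLP):

```shell
# Dry run: generate the member-level UPDATE DB CFG command for each
# standby member. Member IDs 0-3 map to hosts s0-s3, as in the topology above.
cmds=$(for member in 0 1 2 3; do
  echo "UPDATE DB CFG FOR hadr_db MEMBER ${member} USING HADR_LOCAL_HOST s${member} HADR_LOCAL_SVC 4000"
done)
echo "$cmds"
```

The same loop with hosts p0-p3 covers the primary side in step 4.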

Starting HADR

As with other HADR environments, the standby database must be started first. Because the member that the START HADR command is issued from is designated the preferred replay member, the DBA issues the following commands:
  • From member 0 on the standby:
    db2 START HADR ON DB hadr_db AS STANDBY
  • From member 0 on the primary:
    db2 START HADR ON DB hadr_db AS PRIMARY
To determine that HADR is up and running, the DBA calls the MON_GET_HADR table function from the primary:
select LOG_STREAM_ID, PRIMARY_MEMBER, STANDBY_MEMBER, HADR_STATE
from table (mon_get_hadr(-2))

  LOG_STREAM_ID PRIMARY_MEMBER STANDBY_MEMBER HADR_STATE            
  ------------- -------------- -------------- -----------------------
  0             0              0              PEER
  1             1              0              PEER
  2             2              0              PEER
  3             3              0              PEER
The DBA confirms that standby member 0, the preferred replay member, is indeed the current replay member by looking at the STANDBY_MEMBER field. Every log stream reports the same member on the standby because all the members on the primary are connected to that standby member.
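The "single replay member" property can also be checked mechanically: every row of the MON_GET_HADR output must report the same STANDBY_MEMBER value. A sketch that applies that check to the sample rows shown above (the rows are embedded here for illustration; in practice, the db2 CLP output of the query would be piped through the same filter):

```shell
# Count the distinct STANDBY_MEMBER values (third column) in the
# MON_GET_HADR rows shown above; a count of 1 means a single replay member.
distinct=$(printf '%s\n' \
  "0 0 0 PEER" \
  "1 1 0 PEER" \
  "2 2 0 PEER" \
  "3 3 0 PEER" | awk '{print $3}' | sort -u | wc -l | tr -d ' ')
echo "distinct replay members: ${distinct}"
```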

Role switch

The DBA has to perform a role switch; that is, the current standby will take over the primary role, and the current primary will take over the standby role. This switch allows maintenance that requires a shutdown of the cluster to be performed at site A. The procedure takes place during a time of low usage to minimize the impact on applications that are currently connected to the database.

  1. The DBA ensures that none of the members on the primary are in an abnormal state:
    SELECT ID,  
           varchar(STATE,21) AS STATE, 
           varchar(HOME_HOST,10) AS HOME_HOST, 
           varchar(CURRENT_HOST,10) AS CUR_HOST, 
           ALERT 
    FROM SYSIBMADM.DB2_MEMBER
    
    ID     STATE                 HOME_HOST  CUR_HOST   ALERT
    ------ --------------------- ---------- ---------- --------
    0      STARTED               p0         p0         NO
    1      STARTED               p1         p1         NO
    2      STARTED               p2         p2         NO
    3      STARTED               p3         p3         NO
    
      4 record(s) selected.
  2. The DBA ensures that all of the log streams are in PEER state:
    select LOG_STREAM_ID, PRIMARY_MEMBER, STANDBY_MEMBER, HADR_STATE
    from table (mon_get_hadr(-2))
    
      LOG_STREAM_ID PRIMARY_MEMBER STANDBY_MEMBER HADR_STATE            
      ------------- -------------- -------------- -----------------------
      0             0              0              PEER
      1             1              0              PEER
      2             2              0              PEER
      3             3              0              PEER
  3. At site B, the DBA issues the TAKEOVER HADR command on member 0:
    TAKEOVER HADR ON DB hadr_db
    After the command completes, member 0 on the new standby (the preferred replay member) is chosen as the replay member, and the database is shut down on the other members of the standby cluster. On the new primary, the database is activated only on member 0; on the other members, it is activated by a client connection or when the DBA explicitly issues the ACTIVATE DATABASE command on each of them. Automatic client reroute sends any new clients to site B.
  4. At site A, the DBA deactivates the database on the standby (this keeps the database in its role as an HADR standby):
    DEACTIVATE DATABASE hadr_db
  5. At site A, the DBA stops DB2 on the standby:
    db2stop
  6. At site A, the DBA performs the required maintenance.
  7. At site A, the DBA starts DB2 on the standby:
    db2start
  8. At site A, the DBA activates the database on the standby:
    ACTIVATE DATABASE hadr_db
    The database is activated as an HADR primary with one replay member.
  9. To revert to the original setup, the DBA issues the TAKEOVER HADR command on member 0 at site A:
    TAKEOVER HADR ON DB hadr_db

Failover

The DBA has to perform a failover; that is, an unexpected outage at site A requires that the standby at site B take over the primary role. An important difference for HADR in a DB2 pureScale environment is that there is no support for using IBM® Tivoli® System Automation for Multiplatforms (SA MP) to automate the failover (SA MP is already being used to ensure high availability within each DB2 pureScale cluster). In any case, in this scenario the DBA wants manual control over this kind of response to an outage.

  1. The DBA performs a forced takeover from the standby database at site B.
    TAKEOVER HADR ON DB hadr_db BY FORCE
    The standby sends a disabling message to shut down the primary. After stopping log shipping and retrieval, the standby completes the replay of any logs in its log path. Finally, the standby becomes the new primary.
  2. The DBA issues the db2pd command on the new primary to ensure that it has taken over the primary role.
    db2pd -hadr -db hadr_db
  3. After addressing the cause of the outage and getting site A up and running again, the DBA attempts to reintegrate the old primary as a standby.
    START HADR ON DB hadr_db AS STANDBY
  4. The DBA verifies that site A is now the standby by checking the HADR_CONNECT_STATUS and HADR_STATE fields to ensure that they show that the database is connected and in either peer or remote catchup state.
    db2pd -hadr -db hadr_db
    Unfortunately, the log streams of the databases at the two sites have diverged, so the database is showing as disconnected. The DBA looks at the db2diag.log file of one of the members on the old primary and sees a message indicating that the database on site A cannot be made consistent with the new primary database.
  5. The DBA has to drop the database and reinitialize it as an HADR standby at site A.
    1. Drop the database:
DROP DATABASE hadr_db
    2. Take a backup of the database at site B.
BACKUP DATABASE hadr_db ONLINE TO backup_dir
    3. Restore the backup image to site A.
      db2 RESTORE DB hadr_db FROM backup_dir
    4. Set the cluster-level and member-level configuration parameters on the database at site A.
      db2 "UPDATE DB CFG FOR hadr_db USING
           HADR_TARGET_LIST  {s0:4000|s1:4000|s2:4000|s3:4000}
           HADR_REMOTE_HOST  {s0:4000|s1:4000|s2:4000|s3:4000}
           HADR_REMOTE_INST  db2inst
           HADR_SYNCMODE     async"
      • For member 0:
        db2 "UPDATE DB CFG FOR hadr_db MEMBER 0 USING
             HADR_LOCAL_HOST   p0
             HADR_LOCAL_SVC    4000"
      • For member 1:
        db2 "UPDATE DB CFG FOR hadr_db MEMBER 1 USING
             HADR_LOCAL_HOST   p1
             HADR_LOCAL_SVC    4000"
      • For member 2:
        db2 "UPDATE DB CFG FOR hadr_db MEMBER 2 USING
             HADR_LOCAL_HOST   p2
             HADR_LOCAL_SVC    4000"
      • For member 3:
        db2 "UPDATE DB CFG FOR hadr_db MEMBER 3 USING
             HADR_LOCAL_HOST   p3
             HADR_LOCAL_SVC    4000"
  6. The DBA wants to designate member 0 as the preferred replay member, so the DBA issues the START HADR command from member 0 at site A:
    db2 START HADR ON DB hadr_db AS STANDBY
  7. The DBA verifies that site A is now the standby by checking the HADR_CONNECT_STATUS and HADR_STATE fields to ensure that they show that the database is connected and is catching up to the primary.
    db2pd -hadr -db hadr_db
  8. To revert to the original setup, the DBA can perform a role switch as described in the previous section.