When an HADR standby database takes over as the primary
database in a DB2® pureScale® environment,
the takeover differs in several important ways from HADR takeover in other environments.
With HADR, there are two types of takeover: role
switch and failover. Role switch, sometimes called
graceful takeover or non-forced takeover, can be performed only when
the primary is available, and it swaps the roles of the primary and the standby.
Failover, or forced takeover, can be performed when the primary is
not available. It is commonly used when the primary fails, to make
the standby the new primary. In a forced takeover, the old primary keeps
its primary role, but the standby sends it a message that disables it.
Both types of takeover are supported in a DB2 pureScale environment,
and both can be issued from any of the standby database members,
not just the current replay member. However, after the standby completes
the transition to the primary role, the database is started only on
the member that served as the replay member before the takeover. The
database can be started on the other members by issuing the ACTIVATE
DATABASE command or implicitly through a client connection.
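The activation behavior described above can be pictured with a small Python model. It is only a sketch. The `StandbyCluster` class, its fields, and its methods are hypothetical and do not correspond to any DB2 API. The `activate()` method simply stands in for the ACTIVATE DATABASE command or for an implicit activation through a client connection.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StandbyCluster:
    """Hypothetical model of a pureScale HADR standby cluster (illustration only)."""

    members: set[int]
    replay_member: int
    role: str = "STANDBY"
    active: set[int] = field(default_factory=set)

    def takeover(self, issued_from: int) -> None:
        # A takeover can be issued from any standby member, not only the replay member.
        if issued_from not in self.members:
            raise ValueError(f"member {issued_from} is not in this cluster")
        self.role = "PRIMARY"
        # After the role change, the database is started only on the former
        # replay member, regardless of which member issued the takeover.
        self.active = {self.replay_member}

    def activate(self, member: int) -> None:
        # Stands in for ACTIVATE DATABASE or an implicit activation by a client connection.
        if self.role != "PRIMARY":
            raise RuntimeError("the cluster is not in the primary role")
        if member not in self.members:
            raise ValueError(f"member {member} is not in this cluster")
        self.active.add(member)


cluster = StandbyCluster(members={0, 1, 2}, replay_member=1)
cluster.takeover(issued_from=2)  # issued from a non-replay member
print(sorted(cluster.active))  # [1]: only the former replay member is active
cluster.activate(0)
print(sorted(cluster.active))  # [0, 1]
```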
Role switch
After a role switch, which is initiated by issuing the TAKEOVER HADR command
from any standby member, the standby cluster becomes the primary cluster
and vice versa. Role switch helps ensure that no data is lost between
the old primary and the new primary. You can initiate a role switch only
if crash recovery is not occurring on the primary cluster (including
member crash recovery that is pending or in progress) and one of the
following conditions is true:
- All the log streams are in peer or assisted remote catchup state.
- All the log streams are in remote catchup or assisted remote catchup
state, and the synchronization mode is SUPERASYNC.
Before you initiate a role switch in remote catchup or assisted
remote catchup state, check the log gap between the primary and standby
log streams. A large gap can result in a long takeover time because
all of the logs in that gap must be shipped and replayed first.
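The role-switch preconditions can be expressed as a small Python predicate. This is a sketch that rests on several assumptions. The `StreamState` enumeration is a simplified, hypothetical set of states, not the values that DB2 monitoring interfaces report. The synchronization mode is passed as a plain string. The check also ignores the log gap: a large gap still lengthens the takeover even when the predicate returns True.

```python
from __future__ import annotations

from enum import Enum


class StreamState(Enum):
    # Simplified, hypothetical set of HADR log stream states (illustration only).
    PEER = "peer"
    REMOTE_CATCHUP = "remote catchup"
    ASSISTED_REMOTE_CATCHUP = "assisted remote catchup"
    LOCAL_CATCHUP = "local catchup"


PEER_OR_ASSISTED = {StreamState.PEER, StreamState.ASSISTED_REMOTE_CATCHUP}
CATCHUP_STATES = {StreamState.REMOTE_CATCHUP, StreamState.ASSISTED_REMOTE_CATCHUP}


def can_role_switch(crash_recovery_active: bool, states: list[StreamState], sync_mode: str) -> bool:
    """Return True if a role switch (non-forced takeover) may be initiated."""
    if not states:
        raise ValueError("a cluster has at least one log stream")
    # Crash recovery on the primary, including pending or in-progress
    # member crash recovery, always blocks a role switch.
    if crash_recovery_active:
        return False
    # First alternative: every log stream is in peer or assisted remote catchup state.
    if all(s in PEER_OR_ASSISTED for s in states):
        return True
    # Second alternative: every log stream is in (assisted) remote catchup state under SUPERASYNC.
    return sync_mode == "SUPERASYNC" and all(s in CATCHUP_STATES for s in states)


print(can_role_switch(False, [StreamState.PEER, StreamState.ASSISTED_REMOTE_CATCHUP], "SYNC"))  # True
print(can_role_switch(False, [StreamState.REMOTE_CATCHUP], "ASYNC"))  # False
```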
During a role switch, the following steps occur on the primary:
- New connections are rejected on all members, any open transactions
are rolled back, and all remaining logs are shipped to the standby.
- The primary cluster's database role changes to standby.
- A member that has a direct connection to the standby is chosen
as the replay member, with preference given to the preferred replay
member (that is, the member that HADR was started from).
- Log receiving and replay start on the replay member.
- The database is shut down on the other members of the cluster.
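The replay-member choice in the list above can be sketched in Python. The `Member` type and the `choose_replay_member()` function are hypothetical. The text does not say which connected member DB2 picks when the preferred replay member is not directly connected. To keep the example deterministic, this sketch assumes it picks the lowest connected member ID.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Member:
    member_id: int
    connected_to_standby: bool  # True if the member has a direct connection to the other cluster


def choose_replay_member(members: list[Member], preferred_id: int) -> Optional[int]:
    """Pick the replay member for a cluster that has just taken the standby role."""
    # Only members with a direct connection to the other cluster qualify.
    candidates = [m for m in members if m.connected_to_standby]
    if not candidates:
        return None  # no member qualifies in this model
    # Preference goes to the preferred replay member (the member HADR was started from).
    if any(m.member_id == preferred_id for m in candidates):
        return preferred_id
    # Assumed tie-break for illustration only: the lowest connected member ID.
    return min(m.member_id for m in candidates)


members = [Member(0, True), Member(1, False), Member(2, True)]
print(choose_replay_member(members, preferred_id=2))  # 2: preferred and connected
print(choose_replay_member(members, preferred_id=1))  # 0: preferred member is not connected
```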
The following steps occur on the standby:
- Log receiving is stopped on the replay member after the end of
the logs is reached on each log stream, which helps ensure that no data is lost.
- The replay member finishes replaying all received logs.
- After it is confirmed that the primary cluster is now in the standby
role, the replay member changes the standby cluster's role to primary.
- The database is opened for client connections, but it is activated
only on the member that was previously the standby replay member.
Failover
After a failover, which is initiated by issuing the TAKEOVER HADR command
with the BY FORCE option from any standby member,
the standby cluster becomes the primary cluster. The old primary cluster
is sent a disabling message, but its role is not changed. Any member
on the primary that receives this message disables the whole primary
cluster. By initiating a failover, you accept the risk of data loss
in exchange for having a working database. You cannot
initiate a failover if the databases are in local catchup state.
Note: Unlike in previous releases, you can now initiate a failover even if log
archive retrieval is in progress.
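The failover precondition can also be sketched as a Python predicate, which makes the contrast with the role-switch conditions easier to see. The `StreamState` enumeration is a simplified, hypothetical set of states. The text says only that "the databases are in local catchup state", so treating a single log stream in local catchup as blocking the whole failover is an interpretive assumption.

```python
from __future__ import annotations

from enum import Enum


class StreamState(Enum):
    # Simplified, hypothetical set of HADR log stream states (illustration only).
    PEER = "peer"
    REMOTE_CATCHUP = "remote catchup"
    ASSISTED_REMOTE_CATCHUP = "assisted remote catchup"
    LOCAL_CATCHUP = "local catchup"


def can_failover(states: list[StreamState], archive_retrieval_in_progress: bool = False) -> bool:
    """Return True if a failover (TAKEOVER HADR with BY FORCE) may be initiated."""
    if not states:
        raise ValueError("a cluster has at least one log stream")
    # Log archive retrieval in progress no longer blocks a failover, so the
    # flag is accepted but deliberately ignored.
    _ = archive_retrieval_in_progress
    # Assumption: any log stream in local catchup state blocks the failover.
    return all(s is not StreamState.LOCAL_CATCHUP for s in states)


print(can_failover([StreamState.REMOTE_CATCHUP], archive_retrieval_in_progress=True))  # True
print(can_failover([StreamState.PEER, StreamState.LOCAL_CATCHUP]))  # False
```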
During a failover, the following steps occur on the primary (assuming it
is online and connected to the standby):
- After the primary receives the disabling message, the database is shut
down and log writing is stopped.
The following steps occur on the standby, all of which are
carried out from the replay member:
- A disabling message is sent to the primary, if it is connected.
- Log shipping and log retrieval are stopped, which entails a risk
of data loss.
- The replay member finishes replaying all received logs (that is,
the logs that are stored in the log path).
- Any open transactions are rolled back.
- The replay member changes the standby cluster's role to primary.
- The database is opened for client connections, but it is activated
only on the member that was previously the standby replay member.
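A failover stops log shipping and log retrieval and replays only the logs already in the standby's log path. Any log data that the primary wrote beyond that point is therefore lost. The following Python sketch measures that loss per log stream. The stream IDs and integer log positions are hypothetical stand-ins, not DB2 log sequence numbers. If the primary has failed, its end position may not be known at all. Treat the sketch as an illustration of the trade-off, not as a measurement tool.

```python
from __future__ import annotations


def logs_at_risk(primary_end: dict[int, int], standby_end: dict[int, int]) -> dict[int, int]:
    """Per log stream, the amount of log data written on the primary but not
    present in the standby's log path when the failover starts.

    Positions are hypothetical integer offsets, one per log stream.
    """
    return {
        stream: max(0, end - standby_end.get(stream, 0))
        for stream, end in primary_end.items()
    }


# Stream 0 is fully shipped; stream 1 is 80 units behind.
at_risk = logs_at_risk({0: 1000, 1: 500}, {0: 1000, 1: 420})
print(at_risk)  # {0: 0, 1: 80}
print(sum(at_risk.values()) == 0)  # False: this failover would lose data
```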
You can reintegrate the old primary as a new standby only
if its log streams did not diverge from the new primary's log streams.
Before you can start HADR on the old primary, the database must be offline on all of
the old primary's members; the cluster caching facilities, however, can stay online.
If any members are online, kill them instead of issuing the DEACTIVATE
DATABASE command on them.
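The two reintegration requirements can be combined in a Python check. This sketch uses an assumed, simplified model of divergence. A log stream of the old primary counts as diverged if it holds log data beyond the position at which the new primary took over, because from that point the new primary writes different log records. The function name, its parameters, and the integer log positions are hypothetical. Cluster caching facilities are deliberately not modeled, because they can stay online.

```python
from __future__ import annotations


def ready_to_reintegrate(
    old_primary_end: dict[int, int],
    takeover_end: dict[int, int],
    online_members: set[int],
) -> bool:
    """Return True if the old primary can be started as a new HADR standby."""
    # Assumed divergence model: on some stream, the old primary has log data
    # past the position at which the new primary took over.
    diverged = any(
        end > takeover_end.get(stream, 0) for stream, end in old_primary_end.items()
    )
    # The database must be offline on every old-primary member.
    # CFs may stay online, so they are not part of this check.
    return not diverged and not online_members


print(ready_to_reintegrate({0: 1000, 1: 500}, {0: 1000, 1: 500}, set()))  # True
print(ready_to_reintegrate({0: 1000, 1: 500}, {0: 1000, 1: 420}, set()))  # False: stream 1 diverged
print(ready_to_reintegrate({0: 1000}, {0: 1000}, {2}))  # False: member 2 still online
```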