Client sessions may become blocked on the Primary during a data transmission issue between the Primary and the HDR Secondary
When there is a data transmission issue between the Primary and the HDR Secondary (usually caused by a network problem), client applications working with the Primary may become blocked and appear to hang, even if data replication is configured to be asynchronous (DRINTERVAL > 0).
Once the ping timeout is written to the online.log file of the Primary/Secondary instance (see the sample output below), user sessions return to normal work.
11:27:57 DR: ping timeout
11:27:57 DR: Receive error
11:27:57 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr =
: Network connection is broken.
11:27:57 DR_ERR set to -1
11:27:59 DR: Turned off on primary server
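The messages in the sample output above can be located in a copy of online.log with a short script. The following is a minimal sketch, not a supported tool: it assumes each log line starts with an HH:MM:SS timestamp as in the sample, and the function name find_dr_events is hypothetical.

```python
import re

# Message texts taken from the sample online.log excerpt above.
PING_TIMEOUT = "DR: ping timeout"
DR_OFF = "DR: Turned off on primary server"

# Assumption: every relevant line begins with an HH:MM:SS timestamp.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+(.*)$")


def find_dr_events(log_text):
    """Return (timestamp, message) pairs for ping timeout and DR shutdown lines."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # continuation lines or lines without a timestamp
        timestamp, message = match.groups()
        if message.startswith(PING_TIMEOUT) or message.startswith(DR_OFF):
            events.append((timestamp, message))
    return events


sample = """\
11:27:57 DR: ping timeout
11:27:57 DR: Receive error
11:27:57 DR_ERR set to -1
11:27:59 DR: Turned off on primary server
"""

for ts, msg in find_dr_events(sample):
    print(ts, msg)
```

Comparing the timestamp of the first reported event with the time at which sessions started to hang gives a rough idea of how long the blockage lasted.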
When data replication is established, the Primary and the Secondary regularly exchange ping messages. If the ping acknowledgment is not received before DRTIMEOUT elapses, the server re-sends the ping message three more times, then reports a ping timeout and turns off the DR subsystem. As a result, the time span between the first ping and the "DR: ping timeout" message can be as large as (DRTIMEOUT x 4).
For example, if DRTIMEOUT is set to 180 seconds, it can take up to 12 minutes (180 x 4 = 720 seconds) before DR is turned off.
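The upper bound described above is simple arithmetic and can be checked for any candidate DRTIMEOUT value. The sketch below only restates the (DRTIMEOUT x 4) rule from this document; the function name is hypothetical.

```python
def worst_case_dr_shutdown_seconds(drtimeout_seconds, attempts=4):
    """Upper bound on the time between the first ping and "DR: ping timeout".

    attempts = 4 covers the initial ping plus the three re-sends.
    """
    return drtimeout_seconds * attempts


seconds = worst_case_dr_shutdown_seconds(180)
print(f"{seconds} s = {seconds / 60:.0f} min")  # 720 s = 12 min
```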
Scenario #1: with asynchronous replication, transactions do not wait for an acknowledgment from the HDR Secondary once the logical log record has been placed in the DR buffer. However, when there is a transmission failure, the DR buffer may fill up quickly (how quickly depends on the DRTIMEOUT value, the LOGBUFF value, and the activity on the instance). Until DR is turned off, a user session has to wait until the DR buffer has enough space for the logical log record.
Scenario #2: in addition to the above, a checkpoint can be requested on the Primary between the first ping failure and the time when the "DR: ping timeout" message is reported. Checkpoints are synchronous between the Primary and the Secondary regardless of the DRINTERVAL value. Once a checkpoint is requested, it prevents any thread from entering a critical section. The instance remains blocked until the checkpoint acknowledgment is received from the Secondary or until DR is turned off.
Diagnosing the problem
For scenario #1, check whether the corresponding user thread shows a stack similar to the following:
Stack for thread: 73 sqlexec
For scenario #2, check the 'onstat -g ath' output and see whether the user threads have the "cond wait cp" status.
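The check for scenario #2 can be sketched as a small filter over saved 'onstat -g ath' output. This is a minimal illustration under stated assumptions: the function name is hypothetical, the sample lines are invented and use a simplified layout rather than real onstat output, and the whitespace between "cond wait" and "cp" is matched loosely because column spacing may vary.

```python
import re

# Match the "cond wait cp" status, tolerating variable column spacing.
CP_WAIT_RE = re.compile(r"cond wait\s+cp\b")


def threads_waiting_on_checkpoint(ath_output):
    """Return the lines of 'onstat -g ath' output that show a checkpoint wait."""
    return [line for line in ath_output.splitlines() if CP_WAIT_RE.search(line)]


# Hypothetical, simplified lines for illustration only (not real onstat output).
sample = """\
 73  sqlexec  cond wait  cp
 74  sqlexec  running
 75  sqlexec  cond wait  other
"""

blocked = threads_waiting_on_checkpoint(sample)
print(len(blocked), "thread(s) waiting on checkpoint")
```

If many sqlexec threads are reported while the online.log does not yet show "DR: ping timeout", the instance is probably waiting for the checkpoint acknowledgment from the Secondary.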
Resolving the problem
To resolve the problem, it may be necessary to:
1) Fix any problems that can cause data transmission issues between the Primary and the HDR Secondary (e.g. increase network reliability and throughput).
2) Decrease the value of the DRTIMEOUT configuration parameter.
Note: increasing LOGBUFF may also help to reduce the blockage time; however, a large logical log buffer may result in data loss if the Primary fails.
Software version: 11.5, 11.70, 12.1
Operating system(s): AIX, HP-UX, Linux, OS X, Solaris, Windows
Software edition: Enterprise, Growth, Ultimate, Workgroup
Reference #: 1643957
Modified date: 26 June 2014