IBM Support

IBM MQ Multi-Instance: is there a way for the IBM MQ code to expedite the failover?

Question & Answer


Question

You notice that it takes several minutes for a standby instance to become active when the power cord is unplugged from the host running the active instance of an IBM MQ multi-instance queue manager that uses NFS V4. Is there anything that can be done or fine-tuned in IBM MQ to expedite the failover process?

Cause

The coordination between the active instance and the standby instance is done using file locks on the NFS V4 file system. NFS V4 uses a feature called "lease based file locking", which is critical for MQ Multi-Instance queue managers.

Note: NFS V3 does NOT have this feature, and thus it cannot be used with MQ multi-instance queue managers.
There are 3 files that are locked for these purposes:
    Master : Held in EXCLUSIVE mode by the MQ Execution Controller (EC) of the active instance.
    Active : Held in SHARED mode by all QM processes (excluding those which are just internal applications), plus fastpath applications.
    Standby : Held in EXCLUSIVE mode by the EC of a standby instance.

Assuming that the shared files are located in: /mqexport/701/data/QMMI1
Then these are the 3 files used during the locking:
$ ls -ls active master standby
4 -rw-rw-rw- 1 mqm mqm 28 2010-01-18 15:19 active
4 -rw-rw-rw- 1 mqm mqm 28 2010-01-18 15:19 master
4 -rw-rw-rw- 1 mqm mqm 30 2010-01-18 15:23 standby
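The difference between EXCLUSIVE and SHARED mode can be illustrated with ordinary POSIX advisory file locks. The following is a minimal Python sketch and not MQ's actual implementation: it assumes a POSIX system, and the helper names `try_lock` and `other_process_can_lock` are hypothetical, introduced only for this illustration.

```python
import fcntl
import os


def try_lock(path, exclusive):
    """Try to lock `path` without blocking.

    Returns the open file (which keeps the lock) or None if the lock
    is refused because another process holds a conflicting lock.
    """
    handle = open(path, "a+")  # read + write access, creates the file if missing
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.lockf(handle, mode | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def other_process_can_lock(path, exclusive):
    """Fork a child process that attempts the lock; report whether it succeeded.

    A second process is needed because POSIX locks only conflict
    between different processes.
    """
    pid = os.fork()
    if pid == 0:
        got = try_lock(path, exclusive)
        os._exit(0 if got is not None else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0
```

In this sketch, while one process holds the exclusive lock on a file such as master, `other_process_can_lock` returns False for both modes. A shared lock on a file such as active still lets other processes take shared locks, but refuses an exclusive one.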
A standby instance:
- EC enters its main processing loop and does NOT complete its startup processing
- EC polls the file locks held by the active instance every 2 seconds
- it is responsive to requests to end ("endmqm -x")
- it is responsive to requests by applications trying to connect, but rejects them
- Once the file locks are released by the active instance, the EC continues with the startup processing to become the active instance
When the active instance ends gracefully (such as when using "endmqm -s"), the MQ code releases the file locks. When the standby instance determines that the locks were released, it acquires the locks and completes the startup processing.
When the active instance does not end gracefully (such as after a power failure), the MQ code does NOT release the file locks.
In that case the locks need to be released by the NFS server code, which should eventually detect that the application that was using (leasing) these file locks is no longer active. It is up to the NFS server code to "break the lease" for that defunct application and release the locks. When the standby instance then determines that the locks were released, it acquires the locks and completes the startup processing.
+ The queue manager does NOT interact directly with the File System.
It is worth explaining how a lock request travels through the layers between MQ and the file system:
- The MQ standby instance asks the Operating System (OS) to get a lock on a file.
- In turn, the OS interacts with the appropriate components to request the lock. For example, if the file resides in an NFS mounted directory, then the OS asks the NFS client to lock the file.
- In turn, the NFS client interacts with the network layer to talk to the remote NFS server and request the lock.
- In turn, the remote NFS server interacts with the file system.
- The file system reports that the file is still locked and passes that information to the NFS server.
- In turn, the NFS server passes the information that the file is still locked to the NFS client.
- In turn, the NFS client passes the information to the OS.
- In turn, the OS passes the information to the MQ queue manager.
- Because the queue manager could not get the lock for the file, it continues to be a standby instance.
- The queue manager waits for a few seconds and tries again to get the lock.
The above sequence is repeated many times, until the NFS server breaks the lease on the lock (many seconds later) once the lease period has run out (the timing is NOT under the control of MQ). From that point on, the sequence runs as follows:
- The MQ standby instance asks the Operating System (OS) to get a lock on a file.
- In turn, the OS interacts with the appropriate components to request the lock. For example, if the file resides in an NFS mounted directory, then the OS asks the NFS client to lock the file.
- In turn, the NFS client interacts with the network layer to talk to the remote NFS server and request the lock.
- In turn, the remote NFS server interacts with the file system.
- The file system reports that the file is now free, locks it, and passes the result to the NFS server.
- In turn, the NFS server passes the information that the file is now locked as requested to the NFS client.
- In turn, the NFS client passes the information to the OS.
- In turn, the OS passes the information to the MQ queue manager.
- The MQ queue manager, which is running as the standby, sees that it was able to get the lock on the file and restarts the queue manager to become the new active instance.
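The retry cycle described above can be sketched as a simple polling loop. This is a hedged illustration in Python, not the code of the MQ Execution Controller: the function names `try_exclusive_lock` and `standby_loop` are hypothetical, the 2-second default mirrors the polling interval mentioned earlier, and the lock call is a plain POSIX advisory lock that the OS forwards to whatever file system the path resides on.

```python
import fcntl
import time


def try_exclusive_lock(path):
    """One attempt to take an exclusive lock; returns the open file or None."""
    handle = open(path, "a+")
    try:
        fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # The lock is still held (or still leased) by another process.
        handle.close()
        return None
    return handle


def standby_loop(master_path, poll_seconds=2.0,
                 try_lock=try_exclusive_lock, sleep=time.sleep):
    """Poll until the lock can be taken.

    Returns (handle, failed_attempts). While the lock is refused, the
    caller stays in the standby role; once it is granted, the caller
    can continue its startup to become the active instance.
    """
    failed_attempts = 0
    while True:
        handle = try_lock(master_path)
        if handle is not None:
            return handle, failed_attempts
        failed_attempts += 1
        sleep(poll_seconds)
```

Note that nothing in this loop can shorten the wait: each attempt only reports what the lower layers say, so the time spent in standby is dictated by when the lock is released on the server side.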

Answer

In summary, there is not much that the IBM MQ code can do to expedite a failover after an abrupt failure (such as unplugging the power cord) of the host running the active instance of a queue manager.
To expedite the failover of a multi-instance queue manager, so that the standby instance completes its startup sequence and becomes the active instance sooner, the system administrator of the NFS server needs to fine-tune the NFS server daemon so that it detects more quickly which locks can be released because the application that leased them is no longer active.
+++ AIX: example of fine tuning attribute for the chnfs command
https://www.ibm.com/docs/en/aix/7.2?topic=c-chnfs-command
Online manual: AIX / 7.2 / chnfs Command
Purpose of chnfs
Changes the configuration of the system to invoke a specified number of nfsd daemons or to change NFS global configuration values.
Details on the attribute relevant to this article:
-L v4_lease_time
Specifies the lease time that the state manager uses when granting a lock to a client. This flag sets the NFS Version 4 lease time in seconds. The lease time also affects the length of the grace period, the time when a client is deemed dead or expired, and the duration of time that a client has before getting timed out. The valid range is from 10 to 600 seconds. The default value is 120 seconds. This flag is valid only for NFS Version 4.
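To get a feel for how the lease time relates to the failover delay, the following Python sketch computes a rough upper bound on how long the standby might wait for the locks. The formula is an assumption made for illustration only and is not a documented MQ or AIX formula: it supposes that the NFS server releases the locks at most one lease period after the failed host stops renewing them, and that the standby then notices within one 2-second polling interval. Queue manager startup time after the lock is acquired is not included. The function name `worst_case_lock_wait` is hypothetical.

```python
V4_LEASE_MIN = 10       # seconds, lower bound of the valid range quoted above
V4_LEASE_MAX = 600      # seconds, upper bound of the valid range quoted above
V4_LEASE_DEFAULT = 120  # seconds, default quoted above
STANDBY_POLL_SECONDS = 2  # standby polling interval described in this article


def worst_case_lock_wait(lease_seconds=V4_LEASE_DEFAULT,
                         poll_seconds=STANDBY_POLL_SECONDS):
    """Rough estimate (ASSUMPTION, see lead-in) of the lock wait in seconds."""
    if not V4_LEASE_MIN <= lease_seconds <= V4_LEASE_MAX:
        raise ValueError(
            f"lease time must be between {V4_LEASE_MIN} and {V4_LEASE_MAX} seconds"
        )
    return lease_seconds + poll_seconds


print(worst_case_lock_wait())    # default lease time -> 122
print(worst_case_lock_wait(30))  # shorter lease time -> 32
```

Under these assumptions, lowering the value given to the -L flag shortens the wait proportionally. Any change should be validated by the NFS server administrator, since the lease time also affects the grace period and client timeouts as described above.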

Product Synonym

WebSphere MQ WMQ

Document Information

Modified date:
27 October 2021

UID

swg21421805