Verifying shared file system behavior

Run amqmfsck to check whether a shared file system on UNIX and IBM® i systems meets the requirements for storing the queue manager data of a multi-instance queue manager. Run the IBM MQ MQI client sample program amqsfhac in parallel with amqmfsck to demonstrate that a queue manager maintains message integrity during a failure.

Before you begin

You need a server with networked storage, and two other servers connected to it that have IBM MQ installed. You must have administrator (root) authority to configure the file system, and be an IBM MQ Administrator to run amqmfsck.

About this task

Requirements for shared file systems describes the file system requirements for using a shared file system with multi-instance queue managers. The IBM MQ technote Testing and support statement for IBM MQ multi-instance queue managers lists the shared file systems that IBM has already tested. The procedure in this task describes how to test a file system to help you assess whether an unlisted file system maintains data integrity.

Failover of a multi-instance queue manager can be triggered by hardware or software failures, including networking problems that prevent the queue manager from writing to its data or log files. You are mainly interested in causing failures on the file server, but you must also cause the IBM MQ servers to fail, to test that any locks are successfully released. To be confident in a shared file system, test all of the following failures, and any other failures that are specific to your environment:

  1. Shutting down the operating system on the file server including syncing the disks.
  2. Halting the operating system on the file server without syncing the disks.
  3. Pressing the reset button on each of the servers.
  4. Pulling the network cable out of each of the servers.
  5. Pulling the power cable out of each of the servers.
  6. Switching off each of the servers.

Create the directory on the networked storage that you are going to use to share queue manager data and logs. The directory owner must be an IBM MQ Administrator, or in other words, a member of the mqm group on UNIX. The user who runs the tests must have IBM MQ Administrator authority.
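
As a minimal sketch of the directory preparation described above, the following shell function creates the data directory and assigns it to an owner and group. The function name prepare_qmdata_dir, the /shared/qmdata path in the example, and the ug+rwx permission mode are assumptions for illustration. Changing ownership to mqm normally requires root authority on the file server, and your file system might need different settings.

```shell
#!/bin/sh
# Sketch only: create a directory for queue manager data and hand it to
# the IBM MQ administrative user and group. The function name and the
# permission mode are illustrative assumptions, not a documented procedure.
prepare_qmdata_dir() {
    dir=$1            # directory to create, for example /shared/qmdata
    owner=${2:-mqm}   # owning user; mqm is the usual IBM MQ user on UNIX
    group=${3:-mqm}   # owning group; its members are IBM MQ Administrators

    mkdir -p "$dir" || return 1
    chown "$owner:$group" "$dir" || return 1
    # Let the owner and the group read, write, and search the directory.
    chmod ug+rwx "$dir" || return 1
}

# Example (run as root on the file server):
#   prepare_qmdata_dir /shared/qmdata
```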

Use the example of exporting and mounting a file system in Create a multi-instance queue manager on Linux® or Mirrored journal configuration on an ASP using ADDMQMJRN to help you through configuring the file system. Different file systems require different configuration steps. Read the file system documentation.

Procedure

In each of the checks, cause all the failures in the previous list while the file system checker is running. If you intend to run amqsfhac at the same time as amqmfsck, perform the task Running amqsfhac to test message integrity in parallel with this task.

  1. Mount the exported directory on the two IBM MQ servers.

    On the file system server, create a shared directory, shared, and a subdirectory, qmdata, to hold the data for multi-instance queue managers. For an example of setting up a shared directory for multi-instance queue managers on Linux, see Example in Create a multi-instance queue manager on Linux.

  2. Check basic file system behavior.

    On one IBM MQ server, run the file system checker with no parameters.

    Figure 1. On IBM MQ server 1
    
    amqmfsck /shared/qmdata
    
  3. Check concurrently writing to the same directory from both IBM MQ servers.

    On both IBM MQ servers, run the file system checker at the same time with the -c option.

    Figure 2. On IBM MQ server 1
    
    amqmfsck -c /shared/qmdata
    
    Figure 3. On IBM MQ server 2
    
    amqmfsck -c /shared/qmdata
    
  4. Check waiting for and releasing locks on both IBM MQ servers.

    On both IBM MQ servers run the file system checker at the same time with the -w option.

    Figure 4. On IBM MQ server 1
    
    amqmfsck -w /shared/qmdata
    
    Figure 5. On IBM MQ server 2
    
    amqmfsck -w /shared/qmdata
    
  5. Check for data integrity.
    1. Format the test file.

      Create a large file in the directory being tested. The file is formatted so that the subsequent phases can complete successfully. The file must be large enough that there is sufficient time to interrupt the second phase to simulate the failover. Try the default value of 262144 pages (1 GB). The program automatically reduces this default on slow file systems so that formatting completes in about 60 seconds.

      Figure 6. On IBM MQ server 1
      
      amqmfsck -f /shared/qmdata
      
      The server responds with the following messages:
      
      Formatting test file for data integrity test.
      
      
      Test file formatted with 262144 pages of data.
      
    2. Write data into the test file using the file system checker while causing a failure.

      Run the test program on two servers at the same time. Start the test program on the server which is going to experience the failure, then start the test program on the server that is going to survive the failure. Cause the failure you are investigating.

      The first test program stops with an error message. The second test program obtains the lock on the test file and writes data into the test file starting where the first test program left off. Let the second test program run to completion.

      Table 1. Running the data integrity check on two servers at the same time

      On IBM MQ server 1:

      amqmfsck -a /shared/qmdata
      Please start this program on a second machine
      with the same parameters.
      File lock acquired.
      Start a second copy of this program
      with the same parameters on another server.
      Writing data into test file.
      To increase the effectiveness of the test,
      interrupt the writing by ending the process,
      temporarily breaking the network connection
      to the networked storage,
      rebooting the server or turning off the power.

      On IBM MQ server 2:

      amqmfsck -a /shared/qmdata
      Waiting for lock...
      Waiting for lock...
      Waiting for lock...
      Waiting for lock...
      Waiting for lock...
      Waiting for lock...

      On IBM MQ server 1:

      Turn the power off here.

      On IBM MQ server 2:

      File lock acquired.
      Reading test file
      Checking the integrity of the data read.
      Appending data into the test file
      after data already found.
      The test file is full of data.
      It is ready to be inspected for data integrity.

      The timing of the test depends on the behavior of the file system. For example, it typically takes 30 - 90 seconds for a file system to release the file locks obtained by the first program following a power outage. If you have too little time to introduce the failure before the first test program has filled the file, use the -x option of amqmfsck to delete the test file. Try the test from the start with a larger test file.

    3. Verify the integrity of the data in the test file.
      Figure 7. On IBM MQ server 2
      
      amqmfsck -i /shared/qmdata
      
      The server responds with the following messages:
      
      File lock acquired
      
      
      Reading test file checking the integrity of the data read.
      
      
      The data read was consistent.
      
      
      The tests on the directory completed successfully.
      
  6. Delete the test files.
    Figure 8. On IBM MQ server 2
    
    amqmfsck -x /shared/qmdata
    
    The server responds with the message:

    
    Test files deleted.
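
Step 5 describes the default test file as 262144 pages (1 GB), which implies 4 KB pages. If you need a larger test file after deleting the first one, the arithmetic can be sketched as follows. The helper name pages_for_mb is hypothetical, and the -p option shown in the comments for passing a page count to amqmfsck -f is an assumption that this topic does not cover; check the amqmfsck command reference before relying on it.

```shell
#!/bin/sh
# Sketch only: convert a test file size in megabytes into a page count,
# assuming 4 KB pages (262144 pages = 1 GB, as stated in step 5).
# The helper name is a hypothetical one chosen for this example.
pages_for_mb() {
    megabytes=$1
    # 1 MB = 1024 KB, and each page holds 4 KB, so 256 pages per MB.
    echo $(( $megabytes * 1024 / 4 ))
}

# Example: page count for a 2 GB (2048 MB) test file.
#   pages_for_mb 2048
# Assumed usage with the checker (verify -p in the command reference):
#   amqmfsck -x /shared/qmdata
#   amqmfsck -f -p "$(pages_for_mb 2048)" /shared/qmdata
```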
    

Results

The program returns an exit code of zero if the tests complete successfully, and non-zero otherwise.
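
Because the outcome is reported through the exit code, a script can stop as soon as a check fails. The following is a minimal sketch of that idea; run_fs_check is a hypothetical helper name, and the PASS and FAIL lines are produced by the wrapper, not by amqmfsck.

```shell
#!/bin/sh
# Sketch only: run one amqmfsck check and report the outcome from its
# exit code (zero means the tests completed successfully).
# run_fs_check is a hypothetical helper name; all arguments are passed
# to amqmfsck unchanged.
run_fs_check() {
    if amqmfsck "$@"; then
        echo "PASS: amqmfsck $*"
        return 0
    else
        status=$?
        echo "FAIL: amqmfsck $* (exit code $status)" >&2
        return "$status"
    fi
}

# Example: stop at the first failing check.
#   run_fs_check /shared/qmdata && run_fs_check -f /shared/qmdata
```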

Examples

The first set of three examples shows the command producing minimal output.

Successful test of basic file locking on one server

> amqmfsck /shared/qmdata
The tests on the directory completed successfully.

Failed test of basic file locking on one server

> amqmfsck /shared/qmdata
AMQ6245: Error Calling 'write()[2]' on file '/shared/qmdata/amqmfsck.lck' error '2'.

Successful test of locking on two servers

Table 2. Successful locking on two servers

On IBM MQ server 1:

> amqmfsck -w /shared/qmdata
Please start this program on a second
machine with the same parameters.
Lock acquired.
Press Return
or terminate the program to release the lock.

On IBM MQ server 2:

> amqmfsck -w /shared/qmdata
Waiting for lock...

On IBM MQ server 1:

[ Return pressed ]
Lock released.

On IBM MQ server 2:

Lock acquired.
The tests on the directory completed successfully

The second set of three examples shows the same commands using verbose mode.

Successful test of basic file locking on one server

> amqmfsck -v /shared/qmdata
System call: stat("/shared/qmdata")
System call: fd = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fchmod(fd, 0666)
System call: fstat(fd)
System call: fcntl(fd, F_SETLK, F_WRLCK)
System call: write(fd)
System call: close(fd)
System call: fd = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fd, F_SETLK, F_WRLCK)
System call: close(fd)
System call: fd1 = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fd1, F_SETLK, F_RDLCK)
System call: fd2 = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fd2, F_SETLK, F_RDLCK)
System call: close(fd2)
System call: write(fd1)
System call: close(fd1)
The tests on the directory completed successfully.

Failed test of basic file locking on one server

> amqmfsck -v /shared/qmdata
System call: stat("/shared/qmdata")
System call: fd = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fchmod(fd, 0666)
System call: fstat(fd)
System call: fcntl(fd, F_SETLK, F_WRLCK)
System call: write(fd)
System call: close(fd)
System call: fd = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fd, F_SETLK, F_WRLCK)
System call: close(fd)
System call: fd = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fd, F_SETLK, F_RDLCK)
System call: fdSameFile = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fdSameFile, F_SETLK, F_RDLCK)
System call: close(fdSameFile)
System call: write(fd)
AMQxxxx: Error calling 'write()[2]' on file '/shared/qmdata/amqmfsck.lck', errno 2
(Permission denied).

Successful test of locking on two servers

Table 3. Successful locking on two servers - verbose mode

On IBM MQ server 1:

> amqmfsck -wv /shared/qmdata
Calling 'stat("/shared/qmdata")'
Calling 'fd = open("/shared/qmdata/amqmfsck.lkw",
O_EXCL | O_CREAT | O_RDWR, 0666)'
Calling 'fchmod(fd, 0666)'
Calling 'fstat(fd)'
Please start this program on a second
machine with the same parameters.
Calling 'fcntl(fd, F_SETLK, F_WRLCK)'
Lock acquired.
Press Return
or terminate the program to release the lock.

On IBM MQ server 2:

> amqmfsck -wv /shared/qmdata
Calling 'stat("/shared/qmdata")'
Calling 'fd = open("/shared/qmdata/amqmfsck.lkw",
O_EXCL | O_CREAT | O_RDWR, 0666)'
Calling 'fd = open("/shared/qmdata/amqmfsck.lkw",
O_RDWR, 0666)'
Calling 'fcntl(fd, F_SETLK, F_WRLCK)'
Waiting for lock...

On IBM MQ server 1:

[ Return pressed ]
Calling 'close(fd)'
Lock released.

On IBM MQ server 2:

Calling 'fcntl(fd, F_SETLK, F_WRLCK)'
Lock acquired.
The tests on the directory completed successfully

Running amqsfhac to test message integrity

amqsfhac checks that a queue manager using networked storage maintains data integrity following a failure.

Before you begin

You require four servers for this test: two for the multi-instance queue manager, one for the file system, and one for running amqsfhac as an IBM MQ MQI client application.

Follow step 1 in Procedure to set up the file system for a multi-instance queue manager.

About this task

Procedure

  1. Create a multi-instance queue manager, QM1, on another server, using the file system that you created in step 1 of Procedure.
  2. Start the queue manager on both servers, making it highly available.
    On server 1:
    
    strmqm -x QM1
    
    On server 2:
    
    strmqm -x QM1
    
  3. Set up the client connection to run amqsfhac.
    1. Use the procedure in Verifying a client installation to set up a client connection, or the example scripts in Reconnectable client samples.
    2. Modify the client channel to have two IP addresses, corresponding to the two servers running QM1.
      In the example script, modify:
      
      DEFINE CHANNEL(CHANNEL1) CHLTYPE(CLNTCONN) TRPTYPE(TCP) +
      CONNAME('LOCALHOST(2345)') QMNAME(QM1) REPLACE
      
      To:
      
      DEFINE CHANNEL(CHANNEL1) CHLTYPE(CLNTCONN) TRPTYPE(TCP) +
      CONNAME('server1(2345),server2(2345)') QMNAME(QM1) REPLACE
      
      Where server1 and server2 are the host names of the two servers, and 2345 is the port that the channel listener is listening on. The listener port usually defaults to 1414, so you can use 1414 with the default listener configuration.
  4. Create two local queues on QM1 for the test.
    Run the following MQSC script:
    
    DEFINE QLOCAL(TARGETQ) REPLACE
    DEFINE QLOCAL(SIDEQ) REPLACE
    
  5. Test the configuration with amqsfhac.
    
    amqsfhac QM1 TARGETQ SIDEQ 2 2 2
    
  6. Test message integrity while you are testing file system integrity.
    
    amqsfhac QM1 TARGETQ SIDEQ 10 20 0
    

    If you stop the active queue manager instance, amqsfhac reconnects to the other queue manager instance once it has become active. Restart the stopped queue manager instance so that you can reverse the failure in your next test. You will probably need to increase the number of iterations, based on experimentation with your environment, so that the test program runs long enough for the failover to occur.
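
The commands in steps 5 and 6 pass six positional arguments to amqsfhac. Judging by the labels that the program prints when it starts (qmname, qname, sidename, transize, iterations, verbose), they can be wrapped as in the following sketch so that the iteration count is easy to raise. The function name run_fhac, the order of its own parameters, and its default values are hypothetical choices for this example.

```shell
#!/bin/sh
# Sketch only: wrap amqsfhac so that its positional arguments are named.
# The names mirror the labels in the program's start-up output (qmname,
# qname, sidename, transize, iterations, verbose). run_fhac and its
# default values are hypothetical choices for this example.
run_fhac() {
    qmname=${1:-QM1}
    iterations=${2:-20}   # raise this if failover takes longer than the run
    transize=${3:-10}
    verbose=${4:-0}
    amqsfhac "$qmname" TARGETQ SIDEQ "$transize" "$iterations" "$verbose"
}

# Example: a longer run that leaves more time to trigger a failover.
#   run_fhac QM1 200
```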

Results

An example of running amqsfhac in step 6 is shown in Figure 9. The test is a success.

If the test detected a problem, the output would report the failure. In some test runs, the resolution of MQRC_CALL_INTERRUPTED might report Resolving to backed out instead of Resolving to committed. It makes no difference to the result. The outcome depends on whether the write to disk was committed by the networked file storage before or after the failure took place.

Figure 9. Output from a successful run of amqsfhac

Sample AMQSFHAC start
qmname = QM1
qname = TARGETQ
sidename = SIDEQ
transize = 10
iterations = 20
verbose = 0
Iteration 0
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Resolving MQRC_CALL_INTERRUPTED
MQGET browse side tranid=14 pSideinfo->tranid=14
Resolving to committed
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Sample AMQSFHAC end
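
If you capture the output of a run, a script can confirm that it matches the pattern of the successful run in Figure 9. The following is a minimal sketch; the function name check_fhac_output is hypothetical, and because this topic does not show the output of a failed run, the check looks only for the success pattern (every iteration reported, followed by the end marker).

```shell
#!/bin/sh
# Sketch only: check captured amqsfhac output against the pattern of a
# successful run shown in Figure 9. The function name is hypothetical.
# It reads the output on standard input and succeeds only if every
# iteration from 0 to iterations-1 was reported and the end marker
# was printed.
check_fhac_output() {
    expected=$1   # number of iterations requested on the command line
    awk -v n="$expected" '
        /^Iteration [0-9]+$/      { seen[$2] = 1 }
        /^Sample AMQSFHAC end$/   { ended = 1 }
        END {
            for (i = 0; i < n; i++)
                if (!(i in seen)) exit 1
            exit(ended ? 0 : 1)
        }'
}

# Example:
#   amqsfhac QM1 TARGETQ SIDEQ 10 20 0 | tee fhac.log
#   check_fhac_output 20 < fhac.log && echo "run looks complete"
```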