Troubleshooting the network lock manager

The following tips can help you resolve some of the network lock manager problems you encounter.

If you receive a message on a client similar to:

clnttcp_create: RPC: Remote System error - Connection refused
rpc.statd:cannot talk to statd at {server}

then the client believes that another machine needs to be told that it might have to take recovery measures. When a machine restarts, or when the rpc.lockd and rpc.statd daemons are stopped and restarted, machine names are moved from /var/statmon/sm to /var/statmon/sm.bak, and the rpc.statd daemon tries to tell the machine named by each entry in /var/statmon/sm.bak that recovery procedures are needed.
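
To see which machines the rpc.statd daemon still intends to notify, you can list the entries in the sm.bak directory. The following is a minimal sketch, assuming each entry is a file named after a machine (as the rm command later in this topic implies). The function name pending_hosts and its optional directory argument are illustrative and are not part of the system.

```shell
# pending_hosts [DIR]: print one pending machine name per line.
# DIR defaults to the sm.bak directory named in this topic.
pending_hosts() {
    dir=${1:-/var/statmon/sm.bak}
    [ -d "$dir" ] || return 0          # no directory: nothing is pending
    for entry in "$dir"/*; do
        [ -e "$entry" ] || continue    # skip the unexpanded pattern when DIR is empty
        basename "$entry"
    done
}
```

An empty result suggests that no notifications are outstanding on this machine.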

If the rpc.statd daemon can reach the machine, its entry in /var/statmon/sm.bak is removed. If the rpc.statd daemon cannot reach the machine, it keeps trying at regular intervals, and each time the machine fails to respond, the timeout generates the message shown earlier. In the interest of locking integrity, the daemon continues to try; however, the retries can degrade locking performance. How you handle the message depends on whether the target machine is only temporarily unresponsive or has been semi-permanently taken out of production. To eliminate the message:

  1. Verify that the rpc.statd and rpc.lockd daemons on the server are running by following the instructions in Getting the current status of the NFS daemons. (The status of these two daemons should be active.)
  2. If these daemons are not running, start the rpc.statd and rpc.lockd daemons on the server by following the instructions in Starting the NFS daemons.
    Note: Sequence is important. Always start the statd daemon first.

    After you restart the daemons, remember that there is a grace period. During this time, the rpc.lockd daemon accepts reclaim requests from clients that previously held locks with the server, so a new lock might not be granted immediately after the daemons start.
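
The check-then-start steps above can be sketched as a small script. It assumes the AIX System Resource Controller commands lssrc and startsrc, subsystems named rpc.statd and rpc.lockd, and lssrc -s output consisting of a header line followed by a status line whose last column is the status. The function names and the LSSRC and STARTSRC override variables are illustrative. The referenced topics remain the authoritative instructions.

```shell
# is_active SUBSYSTEM: succeed when the subsystem's status column reads "active".
# LSSRC can be set to a substitute command for a dry run (illustrative hook).
is_active() {
    ${LSSRC:-lssrc} -s "$1" |
        awk -v name="$1" '$1 == name && $NF == "active" { found = 1 }
                          END { exit !found }'
}

# ensure_lock_daemons: start whichever daemon is not active, statd first.
ensure_lock_daemons() {
    for daemon in rpc.statd rpc.lockd; do
        is_active "$daemon" || ${STARTSRC:-startsrc} -s "$daemon" || return 1
    done
}
```

Run ensure_lock_daemons as root on the server, then allow for the grace period before retrying a lock request.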

Alternatively, eliminate the message as follows:

  1. Stop the rpc.statd and rpc.lockd daemons on the client by following the instructions in Stopping the NFS daemons.
  2. On the client, remove the target machine's entry from the /var/statmon/sm.bak directory by entering:
    rm /var/statmon/sm.bak/TargetMachineName

    This action keeps the target machine from being aware that it might need to participate in locking recovery. Use it only when you have determined that the target machine is not running any applications that participate in network locking with the affected machine.

  3. Start the rpc.statd and rpc.lockd daemons on the client by following the instructions in Starting the NFS daemons.
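
The three steps above can be combined into one function, shown here as a sketch rather than a supported tool. It assumes the AIX stopsrc and startsrc commands and the subsystem names rpc.statd and rpc.lockd. The function name forget_target, the STOPSRC and STARTSRC override variables, and the order in which the daemons are stopped are assumptions; the text fixes only the start order.

```shell
# forget_target HOST [DIR]: stop the lock daemons, remove HOST's entry from
# the sm.bak directory, then start the daemons again (statd first).
forget_target() {
    host=$1
    dir=${2:-/var/statmon/sm.bak}
    case $host in
        ''|*/*|.|..)                       # refuse anything that is not a plain name
            echo "invalid machine name: $host" >&2
            return 2 ;;
    esac
    ${STOPSRC:-stopsrc} -s rpc.lockd       # stop order is an assumption
    ${STOPSRC:-stopsrc} -s rpc.statd
    rm -f "$dir/$host"
    ${STARTSRC:-startsrc} -s rpc.statd &&  # always start statd first
        ${STARTSRC:-startsrc} -s rpc.lockd
}
```

For example, forget_target TargetMachineName removes only that machine's entry and leaves the other entries for rpc.statd to process.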

If you are unable to obtain a lock from a client, do the following:

  1. Use the ping command to verify that the client and server can reach and recognize each other. If both machines are running and the network is intact, check the host names listed in the /etc/hosts file on each machine. Host names must match exactly between server and client for the machines to recognize each other. If a name server is used for host name resolution, make sure that its host information is exactly the same as that in the /etc/hosts file.
  2. Verify that the rpc.lockd and rpc.statd daemons are running on both the client and the server by following the instructions in Getting the current status of the NFS daemons. The status of these two daemons should be active.
  3. If they are not active, start the rpc.statd and rpc.lockd daemons by following the instructions in Starting the NFS daemons.
  4. If they are active, you might need to reset them on both clients and servers. To do this, stop all the applications that are requesting locks.
  5. Next, stop the rpc.statd and rpc.lockd daemons on both the client and the server by following the instructions in Stopping the NFS daemons.
  6. Now, restart the rpc.statd and rpc.lockd daemons, first on the server and then on the client, by following the instructions in Starting the NFS daemons.
    Note: Sequence is important. Always start the statd daemon first.
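
For the host name check in step 1, it can help to print the hosts-file lines that list a given name on each machine and compare the results. The following sketch assumes the usual hosts-file layout of an address followed by one or more names, with # starting a comment. The function name hosts_entries is illustrative, and the hosts file to inspect is passed as an argument.

```shell
# hosts_entries NAME FILE: print each line of FILE that lists NAME as an
# exact, case-sensitive host name or alias (comments are stripped first).
hosts_entries() {
    awk -v name="$1" '
        { sub(/[ \t]*#.*/, "") }           # drop comments and the blanks before them
        {
            for (i = 2; i <= NF; i++)      # field 1 is the address
                if ($i == name) { print; break }
        }
    ' "$2"
}
```

Run it with the same name on the client and on the server. Output that differs in spelling, case, or domain suffix points to a host name mismatch.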

If this procedure does not resolve the locking problem, run the rpc.lockd daemon in debug mode by doing the following:

  1. Stop the rpc.statd and rpc.lockd daemons on both the client and the server by following the instructions in Stopping the NFS daemons.
  2. Start the rpc.statd daemon on the client and server by following the instructions in Starting the NFS daemons.
  3. Start the rpc.lockd daemon on the client and server by typing:
    /usr/sbin/rpc.lockd -d1
    When invoked with the -d1 flag, the rpc.lockd daemon writes diagnostic messages to syslog. At first, there will be a number of messages dealing with the grace period; wait for the grace period to time out. After the grace period has timed out on both the server and any clients, run the application that is having lock problems and verify that a lock request is transmitted from client to server and server to client.
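
The debug-mode steps can be run as one sequence on each machine. This is a sketch, assuming the AIX stopsrc and startsrc commands and the subsystem names rpc.statd and rpc.lockd. The function name start_lockd_debug, the STOPSRC, STARTSRC, and LOCKD override variables, and the stop order are assumptions; only the rpc.lockd -d1 invocation comes from the text.

```shell
# start_lockd_debug: stop both lock daemons, start statd, then start
# lockd with the -d1 flag so that it writes diagnostics to syslog.
start_lockd_debug() {
    ${STOPSRC:-stopsrc} -s rpc.lockd                  # stop order is an assumption
    ${STOPSRC:-stopsrc} -s rpc.statd
    ${STARTSRC:-startsrc} -s rpc.statd || return 1    # statd must be up first
    ${LOCKD:-/usr/sbin/rpc.lockd} -d1
}
```

Run it as root on the server and on the client, then wait for the grace period to end before reproducing the lock problem.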

You can restrict the range of IP port numbers that the NFS client uses to communicate with the NFS server by setting the NFS_PORT_RANGE variable in the /etc/environment file.
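
As an illustration, the variable assignment might look like the following line. The udp[...]:tcp[...] form is the syntax as recalled from the AIX documentation, and the port numbers are arbitrary examples; confirm both against the documentation for your release before relying on them.

```
NFS_PORT_RANGE=udp[4000-5000]:tcp[6000-7000]
```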