IBM Support

Diagnosing and resolving ANR8779E with error 16/170 EBUSY failures on drive OPEN

Troubleshooting


Problem

Diagnosing and resolving ANR8779E with error number 16 / 170 (EBUSY) reported by the IBM Spectrum Protect server during drive OPEN operations.

Symptom


ANR8779E Unable to open drive /dev/rmtX, error number = 16
ANR8779E Unable to open drive mtx.y.z.n error number=170

Cause

What does message ANR8779E with error number 16/170 (EBUSY) on OPEN mean?
This is a return code passed to IBM Spectrum Protect (formerly Tivoli Storage Manager) by the operating system via the device driver (IBMtape or TSMtape) when attempting to open a device special file. On Unix operating systems, errno 16 (EBUSY) means that the file was busy. On Windows operating systems, the error number is 170, but the meaning is the same. This error indicates that the operating system could not satisfy IBM Spectrum Protect's request to open the device because it was in use elsewhere. For device special files, this means that the device has an open reservation on it through a different HBA, which typically means a separate host (but could be the same host if multiple HBAs are in use).
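
As a minimal sketch of the mapping described above, the following Python function interprets the error number quoted in ANR8779E. The function name and the platform labels are assumptions made for illustration and are not part of any IBM tooling; only the values 16 and 170 come from this document.

```python
# Illustrative only: maps the error numbers quoted in ANR8779E to their meaning.
# The function name and platform labels are hypothetical, not an IBM API.
EBUSY_CODES = {
    "unix": 16,      # errno EBUSY on Unix operating systems
    "windows": 170,  # equivalent "busy" error number on Windows
}


def describe_open_error(platform: str, error_number: int) -> str:
    """Return a short explanation for an ANR8779E error number."""
    expected = EBUSY_CODES.get(platform.lower())
    if expected is None:
        raise ValueError(f"unknown platform: {platform!r}")
    if error_number == expected:
        return "EBUSY: device is in use or reserved by another host or HBA"
    return f"error number {error_number} is not EBUSY on {platform}"
```

Any other error number on the same message points to a different failure and is outside the scope of this document.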

What sequence of events occurs during a library sharing tape volume mount?
The Library Manager is responsible for drive allocation and assignments. Once a drive is allocated, the owning host (Library Client/STA) ensures exclusive ownership by "reserving" the drive, and the Library Manager maintains the drive inventory and owners.

The following is an example of the sequence of events that take place for a single drive assignment in a library sharing environment:

    1. A library client (LC) or storage agent (STA) requires a tape mount resource.
    2. The LC/STA requests a drive from the library manager (LM).
    3. The LM locates an available drive, reserves the drive, mounts a tape, and verifies the tape label.
    4. The LM then releases the drive reservation and passes the drive to the requesting LC/STA for use.
    5. The LC/STA reserves the drive, verifies the tape label for itself, and begins read/write operations.
    6. When the LC/STA is finished writing/reading, it verifies the label, releases the reservation, and passes the drive back to the LM.
    7. The LM reserves the drive, verifies the label, dismounts the volume (this may take time, depending on dismount delay), and then releases the drive reservation.

Generally, an EBUSY error is returned any time a reservation is held by a host that should not hold it, or by a host that is behaving outside of the intended design of a library sharing environment.
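
The handoff above can be pictured with a toy model. This is a sketch under stated assumptions: the `Drive` class, its methods, and `mount_cycle` are hypothetical names invented for illustration, and the model ignores mounting, label verification, and timing. It only shows that each step has exactly one reservation holder, and that a host reserving out of turn produces EBUSY for the others.

```python
# Toy model of the drive handoff described above. Class and method names are
# hypothetical; this is not how the server implements reservations.
import errno


class Drive:
    def __init__(self, name: str) -> None:
        self.name = name
        self.reserved_by = None  # host currently holding the reservation

    def reserve(self, host: str) -> None:
        # A second host trying to reserve a held drive gets EBUSY.
        if self.reserved_by not in (None, host):
            raise OSError(errno.EBUSY, f"{self.name} is reserved by {self.reserved_by}")
        self.reserved_by = host

    def release(self, host: str) -> None:
        # Only the holder can release its own reservation.
        if self.reserved_by == host:
            self.reserved_by = None


def mount_cycle(drive: Drive, lm: str, client: str) -> list:
    """Walk one drive through steps 3 to 7 and record each holder."""
    holders = []
    drive.reserve(lm)        # step 3: LM reserves, mounts, verifies label
    holders.append(drive.reserved_by)
    drive.release(lm)        # step 4: LM releases and hands the drive over
    drive.reserve(client)    # step 5: LC/STA reserves and performs I/O
    holders.append(drive.reserved_by)
    drive.release(client)    # step 6: LC/STA releases and returns the drive
    drive.reserve(lm)        # step 7: LM reserves, dismounts, then releases
    holders.append(drive.reserved_by)
    drive.release(lm)
    return holders
```

In this model, if a client still holds the reservation when the library manager calls `reserve`, the call raises `OSError` with `errno.EBUSY`, which corresponds to the condition reported as ANR8779E.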

Known Tivoli Storage Manager causes of error number 16/170 (EBUSY):


Excessive device heartbeats cause premature drive reclamation

    Under somewhat rare conditions, a Tivoli Storage Manager library manager can become overloaded with drive "in-use" confirmation heartbeats from library clients and/or storage agents. This behavior opens a window in which the library manager may not be able to process the heartbeats in a timely manner. Because the library manager is not able to process the heartbeats, it believes the drives are no longer in use and attempts to reclaim them for use by other processes. An indicator of this problem may be the following warning:

    ANR8925W Drive <drive name> in library <library name> has not been confirmed for use by server <server name> for over <number> seconds. Drive will be reclaimed for use by others.

    If the drive in question is not being used by the library client or storage agent, then this warning is expected and normal, because there may have been an uncorrectable problem with the hardware or software involved and the library manager should reclaim the drive. If the drive in question is actively being used by the library client or storage agent, this warning message and the actions it triggers are incorrect.

    When the library manager attempts to reclaim the drive, it may report EBUSY (ANR8779E) errors for this drive because the library client is still holding the reservation. Eventually, the library manager should be able to free the drive and the EBUSY errors will stop.
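
    To check whether this scenario applies, it can help to pull the ANR8925W warnings out of an exported activity log. The following is a minimal sketch: the regular expression is built from the message text quoted above, and it assumes that drive, library, and server names contain no whitespace. The function name is hypothetical.

```python
import re

# Pattern built from the ANR8925W message text quoted above.
# Assumption: drive, library, and server names contain no whitespace.
ANR8925W = re.compile(
    r"ANR8925W Drive (?P<drive>\S+) in library (?P<library>\S+) has not been "
    r"confirmed for use by server (?P<server>\S+) for over (?P<seconds>\d+) seconds"
)


def find_reclaim_warnings(log_lines):
    """Return (drive, library, server, seconds) for each ANR8925W line."""
    hits = []
    for line in log_lines:
        match = ANR8925W.search(line)
        if match:
            hits.append((match["drive"], match["library"], match["server"],
                         int(match["seconds"])))
    return hits
```

    Drives that appear in this list and were actively in use at the time are candidates for the premature reclamation problem.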

    The premature drive reclamation problem and potential solutions are documented by the following APARs:

    IC54647 - This is a short term solution for V5 servers. A new option, LIBSHRTIMEOUT was introduced to try and mitigate the issue.
    IC55068 - This is the long term solution available only in V6+.
    IC63637 - This is an update to the long term solution.

    Recommendations: There is a design limitation in V5 that prevents implementation of the long term solution documented in IC55068 within the V5 code stream. All library managers should be upgraded to the latest V6 or newer level if possible. Review the library sharing compatibility matrix to determine the best V6 or newer release for your library sharing environment.

    If the library manager cannot be upgraded to a V6+ level to get the long term fix, the short term fix is to increase the heartbeat timeout using the library manager server option "LIBSHRTIMEOUT". The default is 15 minutes, and the maximum is 60 minutes. If increasing the timeout to 60 minutes does not resolve the problem (which is possible in complex and busy environments), the only other option in V5 is to reduce the amount of drive/library activity occurring.

Drive error recovery routine causes temporary drive reservations

    In response to an abnormal CHECK CONDITION received during a SCSI release of a drive when using the Tivoli Storage Manager device driver, a temporary reservation can be placed on a drive that prevents other hosts from using it. This routine has been changed to prevent the reservation after a CHECK CONDITION is received.

    The following APAR addresses this issue: IC85107

    Recommendation: Upgrade any storage agents, library managers, and library clients to levels 5.5.7, 6.2.5.0, 6.3.3.0, 7.1.0.0 or higher to obtain the fix. The Tivoli Storage Manager device driver must also be upgraded to obtain the fix.
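
    Fix levels such as these are listed per release stream, so a plain "is my version higher" comparison can mislead. The sketch below shows one way to check an installed level against the list. The comparison rule is an assumption made for illustration (compare within the same version.release stream, treat streams newer than every listed one as fixed, and treat unlisted older streams as not fixed); confirm the actual fix status against the APAR text. The function names are hypothetical.

```python
# Fix levels for IC85107 as listed in the recommendation above.
IC85107_LEVELS = ["5.5.7", "6.2.5.0", "6.3.3.0", "7.1.0.0"]


def parse_level(text: str) -> tuple:
    """Turn '6.3.3.0' into (6, 3, 3, 0), padding short levels with zeros."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (4 - len(parts)))


def has_fix(installed: str, fix_levels) -> bool:
    """Assumed rule: compare against the fix level of the same release stream
    (version.release); streams newer than every listed one count as fixed,
    and unlisted older streams count as not fixed."""
    level = parse_level(installed)
    fixes = sorted(parse_level(f) for f in fix_levels)
    for fix in fixes:
        if fix[:2] == level[:2]:
            return level >= fix
    return level[:2] > fixes[-1][:2]
```

    For example, `has_fix("6.2.4.0", IC85107_LEVELS)` evaluates to False under this rule, while `has_fix("6.3.4.0", IC85107_LEVELS)` evaluates to True.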

Network communication failures can orphan drive reservations


    If a library manager cannot confirm that a library client is still using a drive due to a network communication failure between the two hosts, Tivoli Storage Manager can fail to properly recover the drive, which can lead to a reservation conflict. Errors ANR0454E, ANR8926W, ANR8213W, and ANR9778E are good indicators of a communication problem. Although the root cause of this issue is outside the control of Tivoli Storage Manager, APAR IC97434 provides improvements that allow for better error recovery.

    Recommendation: Prevent the communications failures by investigating the root cause at the network layer. Upgrade any storage agents, library managers, and library clients to 6.2.6.0, 6.3.5.0, 7.1.1.0 or higher to obtain the error recovery improvements for IC97434.

Tape devices report reservation conflict after Windows cluster failover


    After a successful Windows cluster failover, drives may become inaccessible and report ANR8779E with error number 170 (EBUSY). These errors can be reported on library managers, library clients, or storage agents. This problem applies only to non-IBM devices in Windows clustered environments, and is documented by APAR IC89826.

    Recommendation: Upgrade any storage agents, library managers, and library clients to levels 6.3.4.0 or 7.1.0.0 or higher to obtain the fix.



Known external error number 16/170 (EBUSY) causes:

IBM AIX CFGMGR leaves drives in an inaccessible state

    AIX defect IV05718 documents an issue on AIX 6.1 systems that can leave tape drives in an inaccessible state after running cfgmgr, which is known to manifest itself as an EBUSY situation on drive open operations within Tivoli Storage Manager.

    Recommendation: Review the AIX APAR and apply fixing maintenance to prevent this issue.

IBM manufactured HBA microcode defects

    Any IBM manufactured HBAs should have the following microcode/firmware levels applied (depending on the model number):

    df1000fd-0002.271304
    df1000fd-0002.271310
    df1000fe-0002.271315

    Recommendation: Apply current microcode/firmware levels to any IBM manufactured HBA to avoid potential defects.

SAN monitoring utilities

    SAN status/health monitoring utilities such as SanSurfer and HBAExplorer have been known to place reserves on devices. There have been several reported defects in older versions of these utilities that can place reserves on devices.

    It has also been reported that HP DDMI (Discovery and Dependency Mapping Inventory) may place reserves on drives during scans. This utility is part of the HP OpenView suite. There is no known fix for this behavior at this time, so this application should be removed or the collection for SAN attached devices should be disabled.

    Recommendation: Upgrade any SAN monitoring utilities to current levels to avoid known defects that can place reserves. Alternatively, cease using these utilities, especially when the library environment is active or in use.

Tapeutil/ITDT usage and/or scripting

    Utilities that interact with tape devices, such as tapeutil and ITDT, can place reserves on devices. Some customers implement scripts to monitor drive/library status using tapeutil/ITDT. If any administrator or script is using such a utility to gather information on library/drive status, it can place reserves on drives.

    Recommendation: Consult with your administration/operator teams to determine if any such scripts are being used to collect device information. These scripts should be disabled permanently. Also confirm whether tapeutil/ITDT is being used, and disable use of these utilities when the library environment is active/in use.

Tapeutil/ITDT defect

    Tapeutil/ITDT contains a defect in which the application can leave a reserve on a device even if it was properly closed before exiting the application. As such, any other application attempting to reserve a device that tapeutil/ITDT still holds an orphaned lock against will receive a device busy error.

    Recommendation: Tapeutil has been superseded by ITDT (IBM Tape Diagnostic Tool). Upgrade to and use only the most current available version of ITDT to avoid known defects. ITDT can be downloaded from Fix Central.

Windows 64-bit HBA API

    Environments using 64-bit storage agents may be exposed to a timing condition within the implementation and use of the Windows 64-bit HBA API. This timing condition can cause device busy errors. The Tivoli Storage Manager storage agent code for Windows storage agents has been altered to mitigate this problem by introducing retry logic. This work was introduced via APAR IC61104, included in the following Tivoli Storage Manager levels: 5.4.5.1, 5.4.6, 5.5.3, 6.1.2, 6.2.0, 6.3.0, and 7.1.0.

    Recommendation: Upgrade all Tivoli Storage Manager servers and storage agents to the level 5.4.5.1, 5.4.6, 5.5.3, 6.1.2, 6.2.0, 6.3.0, 7.1.0 or greater.

IBMtape device driver

    There have been numerous defects in the IBM tape (IBMtape/Atape/lin_tape) device driver code that can cause reservation conflicts.

    Recommendation: Upgrade all IBMtape/Atape/lin_tape device drivers to current levels. These device drivers can be downloaded from Fix Central.

Drive "retain_reserve" set to "yes"

    Review each drive's configuration using the "lsattr -El rmtxx" command on AIX platforms (where xx is the number of the drive). The "retain_reserve" setting returned should be set to "no". If it is set to "yes", the drive can retain the reservation, which can cause reservation conflicts when the drive is given to another host for LAN-free activity. The "chdev" command can be used to change the attributes. Please contact the system administrator and/or AIX support if assistance is required.

    Recommendation: Disable the retain_reserve option on any involved AIX systems.
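
    When many drives need checking, the lsattr output described above can be parsed rather than read by eye. The following is a minimal sketch that assumes the usual column layout of attribute name followed by value; the sample text is illustrative and not captured from a real system, and the function name is hypothetical.

```python
# Sample text imitating `lsattr -El` output for a tape drive; values are
# illustrative only and were not captured from a real system.
SAMPLE_LSATTR = """\
block_size      1024  Block Size (0=Variable Length)  True
retain_reserve  yes   Retain Reservation              True
"""


def retain_reserve_enabled(lsattr_output: str) -> bool:
    """Return True if the retain_reserve attribute is set to 'yes'."""
    for line in lsattr_output.splitlines():
        fields = line.split()
        # Assumed layout: attribute name in column 1, value in column 2.
        if len(fields) >= 2 and fields[0] == "retain_reserve":
            return fields[1].lower() == "yes"
    raise KeyError("retain_reserve attribute not found")
```

    A result of True for any drive used in library sharing means that the drive should be reconfigured as recommended above.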

HP-UX Ignite-UX system management utility


    The HP-UX Ignite-UX system management utility can incorrectly obtain device reservations on tape drives. This can occur on any system managed by Ignite-UX that has access to the SAN attached drives, regardless of whether or not the system is actually using the drives.

    Recommendation: Remove Ignite-UX from any systems that have access to SAN attached tape drives and/or libraries.

ProtecTIER VTL can fail to respond to a drive inquiry within 30 seconds

    Under some circumstances, ProtecTIER VTL devices can fail to respond to a drive inquiry request from an application within 30 seconds. This can lead to stale or orphaned drive reservations. This issue has been resolved in ProtecTIER version 3.3.4.

    Recommendation: Upgrade ProtecTIER to version 3.3.4 or higher.

Resolving The Problem

General diagnosis and resolution strategy:

The following steps should be completed before engaging IBM support:

1. Review the Tivoli Storage Manager activity log, operating system logs (errpt, messages file, event log), and SAN/device logs to completely define the problem:


    * Are there any patterns? Do the device reservation conflicts always occur at the same time? Do they always occur against the same devices (drives)?
    * Do the errors happen on a library manager, or library client, or storage agent? Do they always involve a specific host, drive, operation?
    * Are there any library sharing communications failures?
    * Are there any hardware or SAN errors around the time of the reservation conflict?
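
    One way to look for the patterns asked about above is to tally the ANR8779E failures per drive from an exported activity log. The sketch below is illustrative: the regular expression is built from the two message forms shown in the Symptom section, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches both forms shown in the Symptom section:
#   "... drive /dev/rmtX, error number = 16"
#   "... drive mtx.y.z.n error number=170"
ANR8779E = re.compile(
    r"ANR8779E Unable to open drive (?P<drive>[^\s,]+),? "
    r"error number\s*=\s*(?P<errno>\d+)"
)
EBUSY_NUMBERS = {16, 170}  # EBUSY on Unix and on Windows


def count_ebusy_by_drive(log_lines) -> Counter:
    """Count ANR8779E EBUSY failures per drive."""
    counts = Counter()
    for line in log_lines:
        match = ANR8779E.search(line)
        if match and int(match["errno"]) in EBUSY_NUMBERS:
            counts[match["drive"]] += 1
    return counts
```

    A drive that dominates the tally suggests a device-specific or path-specific cause, while failures spread evenly across drives point toward a host, utility, or communication problem.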

2. Review and validate the entire library sharing environment configuration:

    * Validate that all library definitions are correct (QUERY LIBRARY F=D).
    * Validate that all drive definitions are correct (QUERY DRIVE F=D).
    * Validate that all path definitions are correct (QUERY PATH F=D).
    * Validate that all device WWNs, serial numbers, and device special files are correct for every host that has paths defined (Library Managers, Library Clients, and Storage Agents).

    Note that the VALIDATE LANFREE command can be leveraged to confirm working storage agent communications.

3. Review ALL of the above "Known causes" and apply any fixes applicable to your environment. In summary, the following must be completed:

    * Upgrade all Tivoli Storage Manager library clients, library managers, and storage agents to currently available code. Do not forget to upgrade the Tivoli Storage Manager device driver if it is in use.
    * Upgrade all IBM tape device drivers to current.
    * Upgrade all Tapeutil/ITDT implementations to current across the entire environment.
    * Upgrade any IBM manufactured HBA microcode/firmware to current.
    * Upgrade any SAN monitoring software to current.
    * Remove HP-UX Ignite-UX if in use.

4. Consider implementing the other recommendations, suggestions and/or best practices referenced below where applicable. For example:

    * Enable Tivoli Storage Manager's SANDISCOVERY function on any applicable hosts to automatically correct any device pathing inconsistencies.
    * Enable persistent binding on all HBAs to prevent potential pathing issues.
    * Analyze the SAN fabric and introduce proper zoning and/or LUN masking if possible.

5. Use ITDT to determine the host where the conflict is being held to further isolate the problem. Only a host that owns the reservation will be able to access the drive using ITDT. Hosts that do not own the reservation will fail to access the drive.

6. If none of the above steps help to identify or resolve the issue, collect the data in the following MustGather document before contacting IBM Tivoli Storage Manager support for further assistance: MustGather: Collecting Data for Tivoli Storage Manager: Library sharing


Best practices for successful Library Sharing in a Tivoli Storage Manager environment:

Enable Persistent Binding
    Persistent binding should be enabled at the HBA layer on all involved machines in the LAN-free environment. This reduces device and path churn and can reduce failures.

Enable the RESETDRIVES option
    Consider enabling the RESETDRIVES parameter on the library definition, if possible. This option can allow the device driver to attempt(!) to break a reservation. It is important to note that if the Persistent Reservation option is enabled on the HBA, RESETDRIVES cannot send a LUN reset to break a reservation.

    More information can be located in the following TechNote:

    http://www-01.ibm.com/support/docview.wss?uid=swg21249613

Enable the SANDISCOVERY option
    Enable SANDISCOVERY to rediscover devices that have disappeared from the SAN. This can often self-correct pathing issues. Note that SANDISCOVERY is a function of the HBA, so the HBA must be capable of accepting SANDISCOVERY requests from Tivoli Storage Manager.

Leverage the SANREFRESHTIME option
    Leverage the Tivoli Storage Manager SANREFRESHTIME option if available.

    Significantly increasing the value may reduce reservation conflicts if the SAN is not healthy such that devices are regularly disappearing and re-appearing on the SAN.

    Further SANREFRESHTIME recommendations can be found here:

    http://www-01.ibm.com/support/docview.wss?uid=swg21312774


General recommendations and suggestions:

Recovery Recommendations
    If a drive is stuck in a reserved state, power cycling it at the physical hardware level can often free the reservation, even if the holder is not known. This can provide temporary relief. Recycling the entire library and/or the systems using the library is an alternative option, but additional care must be taken not to do this while the library or drives are actively being used.

Problem Prevention Recommendations
    Consider implementing fencing within the SAN zoning and/or via LUN masking to prevent hosts that don't need to access the devices within the environment from accessing them. This can help reduce the number of potential offending hosts.

Reservation Host Identification Suggestions
    Using the AIX errpt:

    Newer levels of AIX (6.1+) can report a reservation conflict within the AIX error report (errpt). Within the errpt, you might find a record similar to the following, suggesting a reservation conflict:

    ---------------------------------------------------------------------------
    LABEL:          RESERVE_CONFLICT
    IDENTIFIER:     BF05CF18

    Date/Time:       Sun Apr  6 11:38:52 MDT 2014
    Sequence Number: 54807
    Machine Id:      1234567ABCDE
    Node Id:         adsm
    Class:           H
    Type:            INFO
    WPAR:            Global
    Resource Name:   rmt0           
    Resource Class:  tape
    Resource Type:   3592
    Location:        U78C5.001.DQD01JD-P2-C3-T1-W51234567ABCDE-L0

    VPD:            
            Manufacturer................IBM    
            Machine Type and Model......03592E05        
            Serial Number...............1234567ABCDE
            Device Specific.(FW)........1EC7
            Loadable Microcode Level....A1700D5C

    Description
    RESERVATION CONFLICT

    Probable Causes
    TAPE DRIVE

    Failure Causes
    TAPE DRIVE

            Recommended Actions
            CLEAR RESERVING HOST RESERVATION

    Detail Data
    ADDITIONAL INFORMATION
    Reserving host key 0600000035730978 WWPN 11000024FF62A922
    ---------------------------------------------------------------------------

    This tells us that from server "adsm", the drive "rmt0" cannot be accessed because a host with WWPN 11000024FF62A922 holds a reservation on the drive. The host matching this WWPN should be investigated to determine why it has the drive reserved.
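
    Extracting the WWPN from many errpt records and matching it to a host can be scripted. The following is a minimal sketch: the regular expression is built from the detail line in the record above, while the function name and the WWPN-to-host inventory passed in are hypothetical and would have to come from your own SAN documentation.

```python
import re

# Matches the "Reserving host key ... WWPN ..." detail line shown above.
RESERVING_HOST = re.compile(
    r"Reserving host key (?P<key>[0-9A-Fa-f]+) WWPN (?P<wwpn>[0-9A-Fa-f]+)"
)


def find_reserving_host(errpt_text: str, wwpn_to_host: dict):
    """Return (wwpn, host name or None) for the first reserving host found.

    wwpn_to_host is a hypothetical inventory mapping upper-case WWPNs to
    host names; None is returned if the text has no reserving-host line.
    """
    match = RESERVING_HOST.search(errpt_text)
    if match is None:
        return None
    wwpn = match["wwpn"].upper()
    return wwpn, wwpn_to_host.get(wwpn)
```

    A WWPN that is missing from the inventory is itself a useful finding, because it may belong to a host that should not be zoned to the drive at all.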

    Using the DataDomain (DDOS) interface:
    If you are using a DataDomain VTL to emulate a tape library device for use with Tivoli Storage Manager, you can use the DDOS command-line utilities to identify the host holding the reservation for a particular drive. The DDOS command to do this is "scsitarget persistent-reservation show list". Refer to the DDOS user guide and command reference for additional information on this command.


Other Notes:
Users may also experience a perceived EBUSY error during other drive operations besides the open, including during drive writes, reads, dismounts, and closes. Related TechNotes and defects:

EBUSY on WRITE:
Diagnosing and resolving ANR8311E with error 16/170 EBUSY failures on drive WRITE

EBUSY on CLOSE/PREEMPTABORT:
IT02504: THE RETURN CODE FOR CLOSE DEVICE OPERATION IS NOT CHECKED, DISMOUNT MAY FAIL WITH ANR8311E FOR PREEMPTABORT OPERATION


Product Synonym

ITSM ADSM TSM IBM SPECTRUM PROTECT

Document Information

Modified date:
17 June 2018

UID

swg21579521