IBM Support

Tivoli Storage Manager for Virtual Environments: Data Protection for VMware - Overview of data mover snapshot failures

Troubleshooting


Problem

The VMware vCenter fails to create or delete the requested data mover snapshot during backup. An ANS9365E error message is often returned.

Symptom


The virtual machine backup fails with the following error in the error log:

ANS9365E VMware vStorage API error for virtual machine 'VM1'.
   TSM function name : visdkCreateVmSnapshotMoRef
   TSM file          : vmvisdk.cpp (4752)
   API return code   : 67
   API error message : Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.
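When many virtual machines fail in one backup window, it can help to pull the ANS9365E blocks out of the error log programmatically. The following is a minimal sketch, assuming the block layout matches the sample above (a header line naming the virtual machine, followed by "label : value" lines). The function name parse_ans9365e and the dictionary keys are illustrative, not part of any IBM tool.

```python
import re

# Field labels as they appear in the ANS9365E block shown above,
# mapped to hypothetical dictionary keys chosen for this sketch.
_FIELDS = {
    "TSM function name": "function",
    "TSM file": "file",
    "API return code": "return_code",
    "API error message": "message",
}


def parse_ans9365e(text):
    """Return one dict per ANS9365E block found in the given log text."""
    entries = []
    current = None
    for line in text.splitlines():
        header = re.search(r"ANS9365E .* virtual machine '([^']*)'", line)
        if header:
            # A new block starts; remember which virtual machine it names.
            current = {"vm": header.group(1)}
            entries.append(current)
            continue
        if current is None or ":" not in line:
            continue
        # Split on the first colon only, so colons inside the value survive.
        label, _, value = line.partition(":")
        key = _FIELDS.get(label.strip())
        if key:
            current[key] = value.strip()
    return entries
```

Grouping the returned entries by the "message" value is one way to see whether the failures share a single cause, such as the quiesce time limit.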

Diagnosing The Problem


Overview

The Data Protection for VMware data mover uses the VMware vStorage APIs for Data Protection (VADP http://kb.vmware.com/kb/1021175) to create and delete virtual machine snapshots. The vCenter/ESXi hosts are responsible for the handling of the snapshot create and delete tasks. For the backup task the snapshot create request is synchronous, meaning that the data mover issues the create request to the vCenter with the quiesced flag enabled and the memory flag disabled. The data mover must wait for the snapshot task to complete before continuing with the virtual machine backup.

The delete request is asynchronous, meaning that the data mover issues the delete request to the vCenter, confirms that it was received but does not wait for the task to complete (sometimes this can take as long as 20 minutes, depending on the vCenter load and the size of the VMDK redo logs that need consolidating). In this scenario, the data mover moves to the next virtual machine in the backup.

For all operations, the data mover must access the vCenter using the proper level of permissions.
See the manual: http://www.ibm.com/support/knowledgecenter/SS8TDQ_7.1.6/ve.inst/c_ve_reqt_install_permissions.html
and the IBM Technote: http://www-01.ibm.com/support/docview.wss?uid=swg27047438

This table provides details on VMware quiesced snapshots.

Guest operating system | Driver type used | Quiescing type used
Windows XP 32-bit, Windows 2000 32-bit | Sync driver | File-system consistent quiescing
Windows Vista 32-bit/64-bit, Windows 7 32-bit/64-bit | VMware VSS component | File-system consistent quiescing
Windows 2003 32-bit/64-bit | VMware VSS component | Application-consistent quiescing
Windows 2008 32-bit/64-bit, Windows 2008 R2, Windows 2012 | VMware VSS component | Application-consistent quiescing (see notes)
Linux or other guest operating systems | No in-guest snapshot driver | Crash-consistent quiescing

Notes for the Windows 2008, Windows 2008 R2, and Windows 2012 row:
    • UUID attribute must be enabled.
    • Must have only SCSI VMDKs.
    • No dynamic disks.

VMware provides a useful Knowledge Base (KB) article, with a video, that explains the snapshot task. See VMware KB http://kb.vmware.com/kb/1015180

Additional information can also be found in the VMware vSphere 5.5 Documentation. For example, the “Snapshot Limitations to Manage Virtual Machines” topic is available on-line at http://pubs.vmware.com/vsphere-55/index.jsp?topic=%2Fcom.vmware.vsphere.vm_admin.doc%2FGUID-53F65726-A23B-4CF0-A7D5-48E584B88613.html

Important notes:
  • All target virtual machines should have the latest version of VMware Tools installed.
  • By default, the data mover requests only quiesced snapshots. This request requires the Volume Shadow Copy Service (VSS) to be installed and working properly. (The VMware VSS component is installed with VMware Tools.)
  • The data mover does not request memory dump snapshots, because this request would negate the quiesced request.
  • Snapshots generate redo logs or delta disks for each VMDK connected to the virtual machine. The redo logs hold all the VMDK I/O while a virtual machine is being backed up.
  • The size of the redo logs is directly dependent on the amount of I/O being done to the VMDK and the time it takes to complete the backup.
  • Snapshot delete and consolidation is directly affected by the size of the redo logs.
  • Virtual machine performance can degrade while the snapshot consolidation is processing.
  • No preexisting snapshots should exist on the virtual machines being managed by a data mover. If this requirement is not met, Change Block Tracking does not function correctly.
  • VMware does not support snapshots of raw disks, physical RDM disks, or with PCI vSphere direct path I/O devices.
  • Starting with TSM for VE version 7.1.2, a new option, INCLUDE.VMSNAPSHOTATTEMPTS, is provided. For VMware backup operations, this option determines the total number of snapshot attempts to make for a VMware virtual machine whose backup fails because of snapshot failures. See http://www-01.ibm.com/support/knowledgecenter/SSGSG7_7.1.2/com.ibm.itsm.client.doc/r_opt_includevmsnapshotattempts.html
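The relationship between redo log size, write I/O, and backup duration noted in the list above can be made concrete with a back-of-the-envelope estimate. The sketch below is an assumption-laden upper bound, not a VMware or IBM formula: it treats every write as landing on a block not yet present in the delta disk, so real growth is normally lower, and a delta disk cannot grow much beyond the size of its base VMDK. The function name and parameters are hypothetical.

```python
def redo_log_upper_bound_gib(write_mib_per_s, backup_minutes):
    """Rough upper bound, in GiB, on one VMDK's redo log growth.

    write_mib_per_s: average guest write rate to the VMDK during backup.
    backup_minutes:  how long the snapshot stays open (the backup duration).
    Assumption: no block is rewritten, so every write adds to the redo log.
    """
    seconds = backup_minutes * 60
    return write_mib_per_s * seconds / 1024


# Example: a guest writing 5 MiB/s during a 45-minute backup.
estimate = redo_log_upper_bound_gib(5, 45)
print("Redo log could grow to about %.1f GiB" % estimate)
```

Comparing such an estimate with the free space in the datastore gives a first indication of whether consolidation is likely to run into space or duration problems.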

Snapshot create issues

The most common problem during the snapshot create task is that the Volume Shadow Copy Service (VSS) times out while quiescing the virtual machine's I/O. For example, the following message is displayed: “Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.” In this scenario, the problem might be too much I/O on the target virtual machine, or a corrupt VSS. Uninstalling and then reinstalling VMware Tools often resolves a corrupt VSS problem.

The I/O problem is a more complex issue and does not have one clear solution. IBM has several technotes and VMware has many KB articles related to this issue. Links to technotes frequently used to solve these types of I/O problems are provided below.

The first log that should be examined is the virtual machine's “vmware.log” in the datastore for the virtual machine. The next logs to examine are the virtual machine's operating system logs. For Windows systems, see the “Windows Event” logs. For Linux systems, see the “/var/log/messages” log file.

A good test is to take a manual virtual machine snapshot using the vSphere Client Snapshot Manager. The virtual machine should be running, the "Snapshot the virtual machine's memory" option should be UNCHECKED, and the "Quiesce guest file system" option should be CHECKED. Repeating this test a few times in a row is recommended to see whether the error occurs.
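The repeated manual test can also be scripted. The following is a minimal sketch, assuming the vm argument is a pyVmomi vim.VirtualMachine object obtained from an existing vCenter connection (connection setup is omitted). It relies on the CreateSnapshot_Task and RemoveSnapshot_Task methods of the vSphere API and on task states comparing equal to the strings "queued", "running", "success", and "error". The helper names, snapshot names, and timeout values are illustrative.

```python
import time


def wait_for_task(task, poll_seconds=2.0, timeout_seconds=1800.0):
    """Poll a vSphere task until it leaves the queued/running states."""
    deadline = time.monotonic() + timeout_seconds
    while task.info.state in ("queued", "running"):
        if time.monotonic() > deadline:
            raise TimeoutError("snapshot task did not finish in time")
        time.sleep(poll_seconds)
    return task.info


def repeat_quiesced_snapshot(vm, attempts=3, poll_seconds=2.0):
    """Create and remove a quiesced snapshot several times in a row.

    Returns a list of (attempt, succeeded, error_text) tuples.
    """
    results = []
    for attempt in range(1, attempts + 1):
        task = vm.CreateSnapshot_Task(
            name="quiesce-test-%d" % attempt,
            description="manual quiesce test",
            memory=False,   # memory option UNCHECKED
            quiesce=True,   # "Quiesce guest file system" CHECKED
        )
        info = wait_for_task(task, poll_seconds)
        if info.state == "success":
            results.append((attempt, True, ""))
            # Remove the test snapshot so none is left behind on the VM.
            removal = info.result.RemoveSnapshot_Task(removeChildren=False)
            wait_for_task(removal, poll_seconds)
        else:
            results.append((attempt, False, str(info.error)))
    return results
```

An intermittent failure in the returned list points to an I/O load problem more than to a broken VSS installation, which would be expected to fail on every attempt.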

For reference, see the following IBM technotes:
http://www.ibm.com/support/docview.wss?uid=swg21585387
http://www.ibm.com/support/docview.wss?uid=swg21568757

To confirm that the quiesce failed, examine the target virtual machine's “vmware.log” file. In this log file, look for the following type of entry:

20T20:31:54.778Z| vmx| Vix: [298549 vigorCommands.c:511]:
VigorSnapshotManager_Take: takeOptions=8,clientFlags=0, displayName='Test Quiesce'
......
20T20:32:02.972Z| vcpu-1| Msg_Post: Warning
2013-01-20T20:32:02.972Z| vcpu-1| [msg.snapshot.quiesce.vmerr] The guest OS has reported an error during quiescing.
2013-01-20T20:32:02.972Z| vcpu-1| --> The error code was: 4
2013-01-20T20:32:02.972Z| vcpu-1| --> The error message was: Quiesce aborted.
......
20T20:32:04.047Z| vcpu-0| SnapshotVMXTakeSnapshotComplete done with snapshot 'Test Quiesce': 11
2013-01-20T20:32:04.047Z| vcpu-0| SnapshotVMXTakeSnapshotComplete: Snapshot 11 failed: Failed to quiesce the virtual machine. (40).

For reference, see the following VMware KB articles:
http://kb.vmware.com/kb/1018194
http://kb.vmware.com/kb/1031298
http://kb.vmware.com/kb/1007696
http://kb.vmware.com/kb/1009073
http://kb.vmware.com/kb/2068653

In some cases the disk I/O might be so heavy that custom pre-freeze and post-thaw scripts are necessary to allow the snapshot to complete. See VMware KBs http://kb.vmware.com/kb/5962168, http://kb.vmware.com/kb/1006671 and http://kb.vmware.com/kb/2044169

Finally, a KB article about how to disable application-consistent quiescing (recommended only for testing) is available on-line at http://kb.vmware.com/kb/1028881

Snapshot delete issues

The snapshot delete and consolidation tasks are also subject to failures, but these failures often go
unnoticed because the data mover snapshot delete request is asynchronous and it does not wait for the
results. Sometimes snapshot delete failures can be seen in the vSphere Client Task Panel. In other cases
the failures are silent.

A common delete problem is a failure to consolidate the snapshot redo logs into the base VMDK. This
problem can be caused by a locked file in the datastore that prevents the consolidation process or
insufficient disk space in the datastore.

Using the vSphere Client Snapshot Manager to create and delete a snapshot often succeeds because it's
completed quickly and the virtual machine does not have time to grow the VMDK redo logs. In this
scenario, the consolidation process has nothing to do. A more accurate test is to wait the time it takes to
back up the target virtual machine, then issue the snapshot delete.
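The more accurate test described above can be sketched as a script that holds the snapshot open before deleting it and then times the delete. As before, this assumes vm is a pyVmomi vim.VirtualMachine object from an existing vCenter connection and that task states compare equal to plain strings. The function name, snapshot name, and the injectable sleep parameter are illustrative, and a production script would also want a timeout on the polling loop.

```python
import time


def timed_snapshot_delete(vm, hold_seconds, poll_seconds=5.0, sleep=time.sleep):
    """Hold a quiesced snapshot open, then time its removal.

    Returns (removal_succeeded, removal_seconds).
    """
    def wait(task):
        # Poll until the task is no longer queued or running.
        while task.info.state in ("queued", "running"):
            sleep(poll_seconds)
        return task.info

    created = wait(vm.CreateSnapshot_Task(
        name="consolidation-test",
        description="held open to let the redo logs grow",
        memory=False,
        quiesce=True,
    ))
    if created.state != "success":
        raise RuntimeError("snapshot create failed: %s" % created.error)

    # Keep the snapshot for roughly the duration of a real backup.
    sleep(hold_seconds)

    started = time.monotonic()
    removed = wait(created.result.RemoveSnapshot_Task(removeChildren=False))
    elapsed = time.monotonic() - started
    return removed.state == "success", elapsed
```

Running this against a busy virtual machine with hold_seconds set to the usual backup duration exercises the consolidation path that a quick create and delete in the Snapshot Manager skips.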

The “vmware.log” might contain lines like the following sample:

vmx| ConsolidateOnlineCB: nextState = 2 uid 3
vmx| Foundry operation failed with system error: Device or resource busy (16), translated to 5
vmx| ConsolidateOnlineCB: Done with consolidate
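Because delete failures are often silent, checking “vmware.log” for consolidation errors after a backup window can surface them. The sketch below collects the system error text logged between a ConsolidateOnlineCB entry and the matching "Done with consolidate" entry. The markers come from the sample above only and may differ between ESXi versions. The function name is hypothetical.

```python
import re

# Captures the error description up to the "(errno)" part, if present.
_SYSTEM_ERROR = re.compile(
    r"Foundry operation failed with system error:\s*([^(]+)")


def consolidation_errors(lines):
    """Return system error descriptions logged during online consolidation."""
    errors = []
    in_consolidate = False
    for line in lines:
        if "ConsolidateOnlineCB: Done with consolidate" in line:
            in_consolidate = False
        elif "ConsolidateOnlineCB:" in line:
            in_consolidate = True
        elif in_consolidate:
            match = _SYSTEM_ERROR.search(line)
            if match:
                errors.append(match.group(1).strip())
    return errors
```

A result such as "Device or resource busy" is consistent with the locked-file cause described above, which can then be followed up with the VMware KB articles listed in this section.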


VMware provides a KB article that describes the snapshot consolidation process. This article is
available on-line at http://kb.vmware.com/kb/2003638. The following VMware KB articles
provide further information:
http://kb.vmware.com/kb/2017072
http://kb.vmware.com/kb/2007245
http://kb.vmware.com/kb/2040846
http://kb.vmware.com/kb/2053758

Virtual machines residing on NFS storage can become unresponsive during the snapshot delete task.
If the data mover is installed on a virtual machine and the transport being used is HotAdd, the problem
can be more significant. For this problem, see the following VMware KB article:
http://kb.vmware.com/kb/2053758.


Document Information

Modified date:
17 June 2018

UID

swg21678788