IBM Support

Resource waiter aborts during client backup

Troubleshooting


Problem

Client backups to a disk storage pool may fail due to resource waiter abort: ANR0538I A resource waiter has been aborted.

Cause

This behavior may be seen in environments where clients backup to a cache-enabled primary disk storage pool and the data is subsequently migrated to a dedup-enabled FILE storage pool. When there is an insufficient amount of storage space available in the disk storage pool to accommodate the client backup, the server will begin to reclaim storage space by deleting cached files. Resource contention may occur if an attempt is made to delete a cached file that is being accessed in the dedup-enabled FILE storage pool. For example, if a file in the dedup-enabled FILE storage pool is being accessed by a space reclamation or identify duplicates process, the corresponding cached version of the same file cannot be deleted until the transaction in the FILE storage pool has completed and the lock on the file has been released. If the requested lock cannot be acquired in the amount of time specified by the RESOURCETIMEOUT parameter, then the client backup session will be terminated.

Diagnosing The Problem

Prior to the client backup session being terminated due to the resource waiter abort, the output from the QUERY SESSION command will likely show the session in a Run state, though no data transfer is occurring:

    Sess Number: 2,159,868
    Comm. Method: Tcp/Ip
    Sess State: Run
    Wait Time: 0 S
    Bytes Sent: 3.2 K
    Bytes Recvd: 1.1 K
    Sess Type: Node
    Platform: AIX
    Client Name: Test

The output from the SHOW SESSIONS command allows us to determine the thread id (2330167) corresponding with this session (2159868):

    Session 2159868: Type=Node, Id=Test
    Platform=AIX, NodeId=637, Owner=
    SessType=4, Index=2, TermReason=0
    threadId=2330167
    ProxyByAgent False
    RecvWaitTime=0.000 (samples=0)
    Backup Objects ( bytes ) Inserted: 0 ( 0 )
    Backup Objects ( bytes ) Restored: 0 ( 0 )
    Archive Objects ( bytes ) Inserted: 0 ( 0 )
    Archive Objects ( bytes ) Retrieved: 0 ( 0 )
    Last Verb ( BackInsNormEnhanced ), Last Verb State ( Recv )
    Global id reports 0 mount points in use
    Write MP count 0, read MP count 0 keepUsedMP = No.

The SHOW RESQ command can then be issued to confirm whether resource contention is occurring on the Tivoli Storage Manager server. The output from the SHOW RESQ command may show the client backup session thread waiting to acquire a resource, as seen below:

    ==========================================================
    This is a local waiter.
    status=resWaiting waitTime=52 minutes
    waiter Type=unknown (0) timeout value=60 minutes
    txnSeqNo=0:261644895 resourceName=(2382444736) lengthLen=10
    type=46001 and nameSpace=0
    waiter thead id is 2330167 condition=-1161073272
    mutex=24440888 abortFunc=d60b40
    ==========================================================

From this output we can see that thread 2330167 has been waiting 52 minutes to acquire a lock (lock type 46001) on a resource (2382444736). The output from the SHOW LOCKS command provides additional information about the lock and the transaction holding this lock:

    LockDesc: Type=46001(bf aggregate (superbitfile) id), NameSpace=0, SummMode=xLock, Key='2382444736'
    Holder: (bfcreate.c:3032 Thread 2329514) Tsn=0:261047415, Mode=xLock
    Waiter: (dferase.c:817 Thread 2330167) Tsn=0:261644895, Mode=xLock

In the output above we can see that lock type 46001 is an aggregate, or superbitfile, lock and the resource name of '2382444736' represents the aggregate id. This lock is currently held by transaction 0:261047415 which is associated with thread 2329514. The SHOW THREADS command shows that thread 2329514 is associated with a space reclamation process:

    Thread 2329514, Parent 2329498: AfRclmVolumeThread, Storage 137592108, AllocCnt 214 HighWaterAmt 137592529
    tid=46913811421504, ptid=46913029925184, det=1, zomb=0, join=0, result=0, sess=0
    lwp=28797

We know that the client backup session is attempting to acquire a lock on a bitfile, but the bitfile is currently locked by the space reclamation process. We now need to review the call stack associated with the client backup thread to confirm that the lock contention is occurring because the server is attempting to delete cached data. The SHOW THREADS command can be issued to display the call stacks for the various threads, but on some platforms it may be necessary to issue the pstack/procstack command on the operating system to obtain this information. In either case, the call stack associated with the client backup thread should resemble the following:


    Thread 20 (Thread 0x2aaacf1af940 (LWP 31231)):
    #0 0x000000392be0b019 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    #1 0x0000000000e5f789 in pkWaitConditionTracked ()
    #2 0x0000000000d61780 in tmLockTracked ()
    #3 0x000000000066aeed in BfLockAggregateTracked ()
    #4 0x00000000006aca93 in DfReclaimCached ()
    #5 0x00000000006a819f in DfAllocSpace ()
    #6 0x00000000006a99b4 in DfCreate ()
    #7 0x0000000000662e08 in bfCreate ()
    #8 0x0000000000b82e04 in CreateBitfile ()
    #9 0x0000000000b8496d in SmDoBackInsNormEnhanced ()
    #10 0x0000000000ba6150 in SmNodeSession ()
    #11 0x0000000000b647d6 in HandleNodeSession ()
    #12 0x0000000000b6757b in smExecuteSession ()
    #13 0x0000000000e7119d in psSessionThread ()
    #14 0x0000000000e5e01a in StartThread ()
    #15 0x000000392be0683d in start_thread () from /lib64/libpthread.so.0
    #16 0x000000392b2d503d in clone () from /lib64/libc.so.6

This call stack indicates that the client session is attempting to write data to the disk storage pool (DfCreate), but the server must first free up additional space in the storage pool by deleting cached data (DfReclaimCached).

Resolving The Problem

This problem behavior is most likely to affect large objects, as the larger the object size, the greater the amount of time required to reclaim or deduplicate the object. As such, the resource timeouts occurring as a result of this condition can likely be avoided by performing either of the following actions:


    a) disabling of caching in the disk storage pool

    b) specifying a MAXSIZE value for the disk storage pool to prevent large files from being backed up directly to disk

[{"Product":{"code":"SSGSG7","label":"Tivoli Storage Manager"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Component":"Server","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"6.3;7.1","Edition":"","Line of Business":{"code":"LOB26","label":"Storage"}}]

Product Synonym

tsm

Document Information

Modified date:
17 June 2018

UID

swg21666632