Direct links to fixes
8.1.1.100-IBM-SPOC-WindowsX64
8.1.1.100-IBM-SPOC-Linuxx86_64
8.1.1.100-IBM-SPOC-Linuxs390x
8.1.1.100-IBM-SPOC-AIX
8.1.1.000-IBM-SPSRV-WindowsX64
8.1.1.000-IBM-SPSRV-Linuxs390x
8.1.1.000-IBM-SPSRV-AIX
8.1.1.000-IBM-SPSRV-Linuxx86_64
7.1.7.100-TIV-TSMSRV-WIN
7.1.7.100-TIV-TSMSRV-SolarisSPARC
7.1.7.100-TIV-TSMSRV-Linuxx86_64
7.1.7.100-TIV-TSMSRV-Linuxs390x
7.1.7.100-TIV-TSMSRV-Linuxppc64
7.1.7.100-TIV-TSMSRV-HP-UX
7.1.7.100-TIV-TSMSRV-AIX
IBM Spectrum Protect Server V8.1 Fix Pack 1 (V8.1.1) Downloads
IBM Spectrum Protect Server V7.1.7.X interim fix downloads
IBM Spectrum Protect Server V7.1 Fix Pack 8 (7.1.8.000) Downloads
APAR status
Closed as program error.
Error description
On a target replication server, a hang condition can occur due to Node Replication session cleanup activity while other target container operations are running. This hang/deadlock prevents new sessions from starting, including sessions for administrator commands submitted to the server.

IBM Spectrum Protect Versions Affected: 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7 and above

Customer/L2 Diagnostics:
Get AIX procstack or UNIX pstack output for the dsmserv process ID. One thread shows the sdCancelSession entry in the call stack, and the other threads have pkAcquireMutexTracked in the call stack after BeginSession, smGetRemSrvConStatus, sdCancelSession, or smRemoveSessMountCount entries.
Wait 10 minutes and gather the procstack output again. If the stack with smKillSessionNumber is still there, you are probably affected by this APAR.
To positively identify the hang, obtain a core file produced with a kill -11 on the hung dsmserv process.

The following program stack is indicative of the deadlock.

---------- tid# wwww (pthread ID: xxxxx) ----------
0x09000000004c983c _global_lock_common(??, ??, ??) + 0x4bc
0x09000000004d7104 _mutex_lock(??, ??, ??) + 0x164
0x0000000100007554 pkAcquireMutexTracked(??, ??, ??) + 0x94
0x0000000100854804 sdCancelSession(??, ??) + 0x64
0x000000010027eea0 smKillSessionNumber(??, ??, ??) + 0x700
0x000000010034326c SmReplServerSession(??) + 0x17ec
0x0000000100415770 DoReplServer(??, ??) + 0x3f0
0x000000010040b384 smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) + 0x1784
0x00000001003e02fc psSessionThread(??) + 0x59c
0x000000010000d670 StartThread(0x0) + 0xb0
0x09000000004cae10 _pthread_body(??) + 0xf0

This is one example stack of container processing that is part of the deadlock. There might be other stacks that cause the deadlock:

---------- tid# yyyyyy (pthread ID: zzzzz) ----------
0x09000000004ec260 _cond_wait_global(??, ??, ??) + 0x4e0
0x09000000004ecdf4 _cond_wait(??, ??, ??) + 0x34
0x09000000004edadc pthread_cond_wait(??, ??) + 0x19c
0x0000000100008b90 pkWaitConditionTracked(??, ??, ??, ??, ??) + 0xb0
0x000000010026fe80 WaitForLock(??, ??, ??, ??, ??, ??, ??, ??) + 0x860
0x000000010026e50c tmLockTracked(??, ??, ??, ??, ??, ??, ??, ??) + 0xb2c
0x000000010085aa68 SdLockContainerIdTracked(??, ??, ??, ??, ??) + 0x68
0x0000000100874098 SdUpdateContainerUtil(??, ??, ??, ??, ??, ??, ??) + 0x118
0x0000000100845d78 PrepareCntrAlloc(??) + 0x318
0x0000000100844638 sdPrepareTxn(??, ??, ??) + 0xf8
0x0000000100046c20 CollectVotes(??) + 0xc0
0x00000001000461c8 tmEndX(??, ??, ??) + 0x168
0x0000000100046a4c tmEndWithStreamMsg(??, ??, ??, ??) + 0x4c
0x0000000100a38f3c SdWriteCompletion(??) + 0x1dc
0x0000000100a33a18 SdFlushCQControls(??) + 0x258
0x0000000100a3580c SdCQSinkThread(??) + 0xacc
0x000000010000d670 StartThread(0x0) + 0xb0
0x09000000004cae10 _pthread_body(??) + 0xf0

Because of the deadlock, other session threads hang, which prevents administrator commands from being issued to the server. Depending on your workload, the hung threads might show any of the following call stacks.

---------- tid# wwwwww (pthread ID: xxxxx) ----------
0x09000000004c983c _global_lock_common(??, ??, ??) + 0x4bc
0x09000000004d7104 _mutex_lock(??, ??, ??) + 0x164
0x0000000100007554 pkAcquireMutexTracked(??, ??, ??) + 0x94
0x000000010040d0f8 BeginSession() + 0x58
0x0000000100409e04 smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) + 0x204
0x00000001003e02fc psSessionThread(??) + 0x59c
0x000000010000d670 StartThread(0x0) + 0xb0
0x09000000004cae10 _pthread_body(??) + 0xf0

---------- tid# wwwww (pthread ID: xxxxx) ----------
0x09000000004c983c _global_lock_common(??, ??, ??) + 0x4bc
0x09000000004d7104 _mutex_lock(??, ??, ??) + 0x164
0x0000000100007554 pkAcquireMutexTracked(??, ??, ??) + 0x94
0x000000010027fc78 smGetSessSeqNum(??) + 0x58
0x0000000100dec4f8 CsRunCmdThread(??) + 0x218
0x000000010000d670 StartThread(0x0) + 0xb0
0x09000000004cae10 _pthread_body(??) + 0xf0

Other threads that require the session mutex might also be seen:

---------- tid# wwwwww (pthread ID: xxxx) ----------
0x09000000004c983c _global_lock_common(??, ??, ??) + 0x4bc
0x09000000004d7104 _mutex_lock(??, ??, ??) + 0x164
0x0000000100007554 pkAcquireMutexTracked(??, ??, ??) + 0x94
0x00000001006e95ec smGetRemSrvConStatus(??, ??, ??) + 0x8c
0x00000001000c9334 BuildUpdateL2Grids(0x740a0b112a350ec0, 0x740a0b112a350000, 0x900000000040e4c, 0x117b1cb68, 0x0, 0x9001000a0091110, 0x11cbb00c0, 0x0) + 0x9f4
0x00000001000a4ce0 StatusMonitorGridsThread(??) + 0x9c0
0x000000010000d670 StartThread(0x0) + 0xb0
0x09000000004cae10 _pthread_body(??) + 0xf0

---------- tid# wwwww (pthread ID: wwwww) ----------
0x09000000004c983c _global_lock_common(??, ??, ??) + 0x4bc
0x09000000004d7104 _mutex_lock(??, ??, ??) + 0x164
0x0000000100007554 pkAcquireMutexTracked(??, ??, ??) + 0x94
0x0000000100281218 smRemoveSessMountCount(??) + 0x78
0x000000010040dbb8 EndSession(0x169ed8288) + 0xd8
0x000000010040ae78 smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) + 0x1278
0x00000001003e02fc psSessionThread(??) + 0x59c
0x000000010000d670 StartThread(0x0) + 0xb0
0x09000000004cae10 _pthread_body(??) + 0xf0

Initial Impact: High

Additional Keywords: hung deadlock container pool session
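The two-sample check described in the diagnostics can be scripted. The following Python sketch is not an IBM tool; it assumes the procstack output has the same layout as the stacks shown in this APAR (a "tid#" header line followed by one frame per line), and the function names parse_stacks and persistent_cancel_threads are hypothetical. It reports the thread IDs whose stack contains smKillSessionNumber in both samples. Matching on an unchanged thread ID across the two samples is an assumption added here; the APAR only says that the stack should still be present.

```python
import re

# Assumed layout, taken from the stacks quoted in this APAR:
#   ---------- tid# 1234 (pthread ID: 258) ----------
#   0x0000000100854804 sdCancelSession(??, ??) + 0x64
THREAD_HEADER = re.compile(r"^-+\s*tid#\s*(\S+)")
FRAME = re.compile(r"^0x[0-9a-fA-F]+\s+([A-Za-z_]\w*)\(")


def parse_stacks(text):
    """Map each thread ID to its list of function names, innermost first."""
    stacks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = THREAD_HEADER.match(line)
        if header:
            current = stacks.setdefault(header.group(1), [])
            continue
        frame = FRAME.match(line)
        if frame and current is not None:
            current.append(frame.group(1))
    return stacks


def persistent_cancel_threads(first_sample, second_sample):
    """Thread IDs that show smKillSessionNumber in both samples."""
    def cancelling(sample):
        return {tid for tid, frames in parse_stacks(sample).items()
                if "smKillSessionNumber" in frames}
    return sorted(cancelling(first_sample) & cancelling(second_sample))
```

A non-empty result from two samples taken 10 minutes apart suggests, but does not prove, that the server is hitting this APAR; the core file remains the definitive evidence.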
Local fix
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
* All Tivoli Storage Manager server users.                     *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See error description.                                       *
****************************************************************
* RECOMMENDATION:                                              *
* Apply fixing level when available. This problem is currently *
* projected to be fixed in levels 7.1.7.100, 7.1.8 and 8.1.1.  *
* Note that this is subject to change at the discretion of     *
* IBM.                                                         *
****************************************************************
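As a rough aid for checking a server level against the levels named in this APAR, the following Python sketch compares a V.R.M.F level string with the affected and fixing levels. It is an illustration only: the helper names are hypothetical, and the treatment of 8.1.0.x as affected is an assumption drawn from the phrase "7.1.7 and above" together with the 8.1.1 fixing level.

```python
def parse_level(level):
    """Turn a level such as '7.1.7.100' into a 4-part tuple of integers."""
    parts = [int(p) for p in level.split(".")]
    parts += [0] * (4 - len(parts))  # '7.1.8' is treated as 7.1.8.0
    return tuple(parts[:4])


# Half-open ranges [first affected level, first fixed level).
AFFECTED_RANGES = [
    ((7, 1, 3, 0), (7, 1, 7, 100)),  # 7.1.3 up to, not including, 7.1.7.100
    ((8, 1, 0, 0), (8, 1, 1, 0)),    # assumption: 8.1.0.x is affected
]


def is_affected(level):
    """True if the level falls inside one of the assumed affected ranges."""
    version = parse_level(level)
    return any(low <= version < high for low, high in AFFECTED_RANGES)
```

For example, is_affected("7.1.7.0") returns True, while is_affected("7.1.7.100") and is_affected("8.1.1.000") return False.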
Problem conclusion
This problem was fixed. Affected platforms: AIX, Solaris, Linux, and Windows.
Temporary fix
Comments
APAR Information
APAR number
IT17609
Reported component name
TSM SERVER
Reported component ID
5698ISMSV
Reported release
71A
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2016-10-21
Closed date
2016-11-16
Last modified date
2016-12-07
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
TSM SERVER
Fixed component ID
5698ISMSV
Applicable component levels
Document Information
Modified date:
07 December 2016