IBM Support

IT17609: NODE REPLICATION TO A TARGET REPLICATION SERVER'S CONTAINER STORAGE POOL MIGHT CAUSE A SERVER HANG.

APAR status

  • Closed as program error.

Error description

  • On a target replication server, a hang condition can occur due
    to Node Replication session cleanup activity while other target
    container operations are running.  This deadlock prevents new
    sessions from starting, including sessions for administrator
    commands submitted to the server.
    
    IBM Spectrum Protect Versions Affected: 7.1.3, 7.1.4, 7.1.5,
    7.1.6, 7.1.7 and above
    
    Customer/L2 Diagnostics:
    Run AIX procstack or UNIX pstack against the dsmserv process ID.
    One thread shows the sdCancelSession entry in its call stack,
    and the other threads have pkAcquireMutexTracked in the call
    stack after BeginSession, smGetRemSrvConStatus, sdCancelSession,
    or smRemoveSessMountCount entries.  Wait 10 minutes and gather
    the procstack output again.  If the stack with
    smKillSessionNumber is still there, you are probably affected by
    this APAR.  To identify the hang conclusively, obtain a core
    file produced with kill -11 on the hung dsmserv process.
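
    The two-snapshot check described above can be sketched in Python.
    This is a minimal illustration, not an IBM tool: the helper names
    (parse_threads, find_suspects, still_hung) are hypothetical, the
    parser assumes the procstack layout shown in this APAR, and it
    assumes a thread keeps the same tid between the two snapshots.

```python
import re

# Thread header, e.g. "---------- tid# 1234 (pthread ID:  258) ----------"
THREAD_HEADER = re.compile(r"tid#\s*(\S+)\s*\(pthread ID:")
# Frame line, e.g. "0x0000000100854804  sdCancelSession(??, ??) + 0x64"
FRAME = re.compile(r"^\s*0x[0-9a-fA-F]+\s+([A-Za-z_]\w*)\(")

# Callers of pkAcquireMutexTracked seen in the hung-session stacks of this APAR.
WAITER_CALLERS = {
    "beginsession",
    "smgetsessseqnum",
    "smgetremsrvconstatus",
    "smremovesessmountcount",
}


def parse_threads(text):
    """Return {tid: [function names, innermost frame first]}."""
    threads = {}
    current = None
    for line in text.splitlines():
        header = THREAD_HEADER.search(line)
        if header:
            current = threads.setdefault(header.group(1), [])
            continue
        frame = FRAME.match(line)
        # Wrapped continuation lines such as "??) + 0x1784" do not match FRAME.
        if frame and current is not None:
            current.append(frame.group(1))
    return threads


def find_suspects(threads):
    """Split threads into cleanup candidates and session-mutex waiters."""
    cleanup, waiters = [], []
    for tid, frames in threads.items():
        names = [name.lower() for name in frames]
        if "pkacquiremutextracked" not in names:
            continue
        caller_index = names.index("pkacquiremutextracked") + 1
        caller = names[caller_index] if caller_index < len(names) else ""
        if caller == "sdcancelsession" and "smkillsessionnumber" in names:
            cleanup.append(tid)
        elif caller in WAITER_CALLERS:
            waiters.append(tid)
    return cleanup, waiters


def still_hung(first_snapshot, second_snapshot):
    """True if the same cleanup thread appears in both snapshots."""
    first, _ = find_suspects(parse_threads(first_snapshot))
    second, _ = find_suspects(parse_threads(second_snapshot))
    return bool(set(first) & set(second))
```

    To use it, save the two procstack outputs taken 10 minutes apart
    to files and pass their contents to still_hung.  A True result
    only suggests a match with this APAR; a core file remains the
    conclusive check.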
    
    The following program stack is indicative of the deadlock.
    ---------- tid# wwww (pthread ID:  xxxxx) ----------
    0x09000000004c983c  _global_lock_common(??, ??, ??) + 0x4bc
    0x09000000004d7104  _mutex_lock(??, ??, ??) + 0x164
    0x0000000100007554  pkAcquireMutexTracked(??, ??, ??) + 0x94
    0x0000000100854804  sdCancelSession(??, ??) + 0x64
    0x000000010027eea0  smKillSessionNumber(??, ??, ??) + 0x700
    0x000000010034326c  SmReplServerSession(??) + 0x17ec
    0x0000000100415770  DoReplServer(??, ??) + 0x3f0
    0x000000010040b384  smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) + 0x1784
    0x00000001003e02fc  psSessionThread(??) + 0x59c
    0x000000010000d670  StartThread(0x0) + 0xb0
    0x09000000004cae10  _pthread_body(??) + 0xf0
    
    This is one example stack of container processing that is part
    of the deadlock.  There might be other stacks that cause the
    deadlock:
    
    ---------- tid# yyyyyy (pthread ID:  zzzzz) ----------
    0x09000000004ec260  _cond_wait_global(??, ??, ??) + 0x4e0
    0x09000000004ecdf4  _cond_wait(??, ??, ??) + 0x34
    0x09000000004edadc  pthread_cond_wait(??, ??) + 0x19c
    0x0000000100008b90  pkWaitConditionTracked(??, ??, ??, ??, ??) + 0xb0
    0x000000010026fe80  WaitForLock(??, ??, ??, ??, ??, ??, ??, ??) + 0x860
    0x000000010026e50c  tmLockTracked(??, ??, ??, ??, ??, ??, ??, ??) + 0xb2c
    0x000000010085aa68  SdLockContainerIdTracked(??, ??, ??, ??, ??) + 0x68
    0x0000000100874098  SdUpdateContainerUtil(??, ??, ??, ??, ??, ??, ??) + 0x118
    0x0000000100845d78  PrepareCntrAlloc(??) + 0x318
    0x0000000100844638  sdPrepareTxn(??, ??, ??) + 0xf8
    0x0000000100046c20  CollectVotes(??) + 0xc0
    0x00000001000461c8  tmEndX(??, ??, ??) + 0x168
    0x0000000100046a4c  tmEndWithStreamMsg(??, ??, ??, ??) + 0x4c
    0x0000000100a38f3c  SdWriteCompletion(??) + 0x1dc
    0x0000000100a33a18  SdFlushCQControls(??) + 0x258
    0x0000000100a3580c  SdCQSinkThread(??) + 0xacc
    0x000000010000d670  StartThread(0x0) + 0xb0
    0x09000000004cae10  _pthread_body(??) + 0xf0
    
    Because of the deadlock, other session threads hang, which
    prevents administrator commands from being issued to the server.
    Depending on your workload, the hung threads might show any of
    the following call stacks.
    
    ---------- tid# wwwwww (pthread ID:  xxxxx ) ----------
    0x09000000004c983c  _global_lock_common(??, ??, ??) + 0x4bc
    0x09000000004d7104  _mutex_lock(??, ??, ??) + 0x164
    0x0000000100007554  pkAcquireMutexTracked(??, ??, ??) + 0x94
    0x000000010040d0f8  BeginSession() + 0x58
    0x0000000100409e04  smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) + 0x204
    0x00000001003e02fc  psSessionThread(??) + 0x59c
    0x000000010000d670  StartThread(0x0) + 0xb0
    0x09000000004cae10  _pthread_body(??) + 0xf0
    
    ---------- tid# wwwww (pthread ID:  xxxxx) ----------
    0x09000000004c983c  _global_lock_common(??, ??, ??) + 0x4bc
    0x09000000004d7104  _mutex_lock(??, ??, ??) + 0x164
    0x0000000100007554  pkAcquireMutexTracked(??, ??, ??) + 0x94
    0x000000010027fc78  smGetSessSeqNum(??) + 0x58
    0x0000000100dec4f8  CsRunCmdThread(??) + 0x218
    0x000000010000d670  StartThread(0x0) + 0xb0
    0x09000000004cae10  _pthread_body(??) + 0xf0
    
    Other threads that require the session mutex might also be seen:
    ---------- tid# wwwwww (pthread ID:  xxxx) ----------
    0x09000000004c983c  _global_lock_common(??, ??, ??) + 0x4bc
    0x09000000004d7104  _mutex_lock(??, ??, ??) + 0x164
    0x0000000100007554  pkAcquireMutexTracked(??, ??, ??) + 0x94
    0x00000001006e95ec  smGetRemSrvConStatus(??, ??, ??) + 0x8c
    0x00000001000c9334  BuildUpdateL2Grids(0x740a0b112a350ec0, 0x740a0b112a350000, 0x900000000040e4c, 0x117b1cb68, 0x0, 0x9001000a0091110, 0x11cbb00c0, 0x0) + 0x9f4
    0x00000001000a4ce0  StatusMonitorGridsThread(??) + 0x9c0
    0x000000010000d670  StartThread(0x0) + 0xb0
    0x09000000004cae10  _pthread_body(??) + 0xf0
    
    ---------- tid# wwwww (pthread ID: wwwww) ----------
    0x09000000004c983c  _global_lock_common(??, ??, ??) + 0x4bc
    0x09000000004d7104  _mutex_lock(??, ??, ??) + 0x164
    0x0000000100007554  pkAcquireMutexTracked(??, ??, ??) + 0x94
    0x0000000100281218  smRemoveSessMountCount(??) + 0x78
    0x000000010040dbb8  EndSession(0x169ed8288) + 0xd8
    0x000000010040ae78  smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) + 0x1278
    0x00000001003e02fc  psSessionThread(??) + 0x59c
    0x000000010000d670  StartThread(0x0) + 0xb0
    0x09000000004cae10  _pthread_body(??) + 0xf0
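
    The hung-thread stacks above share one pattern:
    pkAcquireMutexTracked with a different calling function directly
    beneath it.  As a rough aid for reading a large procstack output,
    the following Python sketch tallies those callers.  The function
    names split_stacks and waiter_callers are hypothetical, and the
    parser assumes the stack layout shown in this APAR.

```python
import re
from collections import Counter

# Frame line, e.g. "0x000000010040d0f8  BeginSession() + 0x58"
FRAME = re.compile(r"^\s*0x[0-9a-fA-F]+\s+([A-Za-z_]\w*)\(")


def split_stacks(text):
    """Return one list of function names per thread, innermost frame first."""
    stacks = []
    for line in text.splitlines():
        if "tid#" in line:
            # A thread header starts a new stack.
            stacks.append([])
            continue
        frame = FRAME.match(line)
        # Wrapped argument lines do not match FRAME and are skipped.
        if frame and stacks:
            stacks[-1].append(frame.group(1))
    return stacks


def waiter_callers(text):
    """Count threads per function that called pkAcquireMutexTracked."""
    counts = Counter()
    for frames in split_stacks(text):
        if "pkAcquireMutexTracked" in frames:
            caller_index = frames.index("pkAcquireMutexTracked") + 1
            if caller_index < len(frames):
                counts[frames[caller_index]] += 1
    return counts
```

    Calling waiter_callers(text).most_common() lists the callers by
    thread count.  Many threads blocked under BeginSession alongside
    one thread blocked under sdCancelSession is consistent with the
    stacks in this APAR, but is not proof of it.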
    
    Initial Impact: High
    Additional Keywords: hung deadlock container pool session
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * All Tivoli Storage Manager server users.                     *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See error description.                                       *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available. This problem is currently *
    * projected to be fixed in levels 7.1.7.100, 7.1.8 and 8.1.1.  *
    * Note that this is subject to change at the discretion of     *
    * IBM.                                                         *
    ****************************************************************
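
    The affected and projected fix levels can be compared against a
    server level with a short Python sketch.  It is illustrative
    only: parse_level and likely_affected are hypothetical helpers,
    the level is assumed to be a dotted string such as 7.1.7.100 with
    missing parts treated as 0, and the fix levels are the projected
    ones stated above, which IBM might change.

```python
def parse_level(text):
    """Turn a dotted level such as '7.1.7.100' into a 4-part tuple."""
    parts = [int(part) for part in text.strip().split(".")]
    parts += [0] * (4 - len(parts))  # assumption: missing parts mean 0
    return tuple(parts[:4])


def likely_affected(level_text):
    """Apply the levels stated in this APAR to one server level."""
    level = parse_level(level_text)
    if level < (7, 1, 3, 0):
        # Below the first affected version listed in the APAR.
        return False
    if level[:2] == (7, 1):
        # Projected fix levels in the 7.1 stream: 7.1.7.100 and 7.1.8.
        return level < (7, 1, 7, 100)
    # Projected fix level for later streams: 8.1.1.
    return level < (8, 1, 1, 0)
```

    For example, likely_affected("7.1.7.0") is True and
    likely_affected("7.1.7.100") is False under these assumptions.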
    

Problem conclusion

  • This problem was fixed.
    Affected platforms: AIX, Solaris, Linux, and Windows.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT17609

  • Reported component name

    TSM SERVER

  • Reported component ID

    5698ISMSV

  • Reported release

    71A

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2016-10-21

  • Closed date

    2016-11-16

  • Last modified date

    2016-12-07

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    TSM SERVER

  • Fixed component ID

    5698ISMSV

Applicable component levels

  • R71A PSY

       UP

  • R71H PSY

       UP

  • R71L PSY

       UP

  • R71S PSY

       UP

  • R71W PSY

       UP



Document information

More support for: Tivoli Storage Manager

Software version: 7.1.3

Reference #: IT17609

Modified date: 07 December 2016