IT17970: CANCEL NODE REPLICATION SESSION MAY CAUSE TARGET SERVER HANG IN SDFREEWRITECONTROL

APAR status

  • Closed as program error.

Error description

  • After canceling Node Replication on the source server, the
    target server may be left with many orphaned Node Replication
    sessions. Canceling these orphaned sessions can cause the
    target instance to hang.
    
    IBM Spectrum Protect versions affected: 7.1.3, 7.1.4, 7.1.5,
    7.1.6, and 7.1.7.
    
    Collect the servermon.pl script data before and during the
    cancellation of the orphaned Node Replication sessions. If that
    data was not collected, gather the AIX procstack (or UNIX
    pstack) output for the dsmserv process ID, wait 10 minutes,
    and gather the output again. Then produce a core file with a
    kill -11 on the hung dsmserv process.
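    The collection steps above can be sketched as a small shell
    helper. This is illustrative only, not an IBM-provided script;
    it assumes an AIX or UNIX host where procstack or pstack is
    available and the dsmserv process ID is known.
    
    ```shell
    # Illustrative sketch of the data-collection steps; not an
    # IBM-provided script. Tries procstack (AIX) first, then
    # falls back to pstack (other UNIX platforms).
    collect_dsmserv_diag() {
        pid=$1                            # PID of the hung dsmserv process

        # First stack snapshot.
        procstack "$pid" > "stack.$pid.1.out" 2>/dev/null ||
            pstack "$pid" > "stack.$pid.1.out"

        sleep 600                         # wait 10 minutes

        # Second snapshot, to see which threads have not moved.
        procstack "$pid" > "stack.$pid.2.out" 2>/dev/null ||
            pstack "$pid" > "stack.$pid.2.out"

        # Force a core file from the hung server process.
        kill -11 "$pid"
    }
    ```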
    
    Key stack threads:
    in pth_cond._cond_wait_global at 0x9000000004ec260 ($t5533)
    0x9000000004ec260 (_cond_wait_global+0x4e0) e8410028 ld r2,0x28(r1)
    pth_cond._cond_wait_global(??, ??, ??) at 0x9000000004ec260
    pth_cond._cond_wait(??, ??, ??) at 0x9000000004ecdf4
    pth_cond.pthread_cond_wait(??, ??) at 0x9000000004edadc
    pkmon.pkWaitConditionTracked(??, ??, ??, ??, ??) at 0x100008f10
    sdprodcon.SdFreeWriteControl(??) at 0x1009c8e18
    sdutil.sdEndSession(??) at 0x100977450
    smrepl.SmReplServerSession(??) at 0x100358618
    smexec.DoReplServer(??, ??) at 0x100579ad8
    smexec.smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) at 0x10056ef64
    tcpcomm.psSessionThread(??) at 0x100545930
    pkthread.StartThread(0x0) at 0x10000da90
    
    and
    
    _cond_wait_global(??, ??, ??) at 0x9000000004ec260
    _cond_wait(??, ??, ??) at 0x9000000004ecdf4
    pthread_cond_wait(??, ??) at 0x9000000004edadc
    pkWaitConditionTracked(??, ??, ??, ??, ??) at 0x100008f10
    SdFreeWriteControl(??) at 0x1009c8e18
    sdEndSession(??) at 0x100977450
    SmReplServerSession(??) at 0x100358618
    DoReplServer(??, ??) at 0x100579ad8
    smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) at 0x10056ef64
    psSessionThread(??) at 0x100545930
    StartThread(0x0) at 0x10000da90
    
    There may be many threads with this stack:
    0x9000000004ec260 (_cond_wait_global+0x4e0) e8410028 ld r2,0x28(r1)
    pth_cond._cond_wait_global(??, ??, ??) at 0x9000000004ec260
    pth_cond._cond_wait(??, ??, ??) at 0x9000000004ecdf4
    pth_cond.pthread_cond_wait(??, ??) at 0x9000000004edadc
    pkmon.pkWaitConditionTracked(??, ??, ??, ??, ??) at 0x100008f10
    queue.DequeueVarQueue(??, ??, ??, ??, ??) at 0x100357820
    prodcons.ProdConsGetWork(??, ??) at 0x10093dd80
    prodcons.PcConsumerThread(??) at 0x10093d278
    
    and many threads are "running" in pkDelayThread(), waiting for
    the timer to expire.
    
    Example:
    in pth_spinlock._global_lock_common at 0x9000000004c983c ($t6404)
    pth_spinlock._global_lock_common(??, ??, ??) at 0x9000000004c983c
    in pth_spinlock._global_lock_common at 0x9000000004c983c ($t6406)
    pth_spinlock._global_lock_common(??, ??, ??) at 0x9000000004c983c
    in pth_spinlock._global_lock_common at 0x9000000004c983c ($t6409)
    pth_spinlock._global_lock_common(??, ??, ??) at 0x9000000004c983c
    
    Some threads are in BeginSession(), which needs the SMV->mutex
    to proceed. Example:
    _global_lock_common(??, ??, ??) at 0x9000000004c983c
    _mutex_lock(??, ??, ??) at 0x9000000004d7104
    pkAcquireMutexTracked(??, ??, ??) at 0x1000078d4
    BeginSession() at 0x100571014
    smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) at 0x10056d988
    psSessionThread(??) at 0x100545930
    StartThread(0x0) at 0x10000da90
    
    The deadlocked threads are in smLockSessMutexTracked and
    smLockSessMutex, attempting to acquire the same mutex.
    
    
    Example:
    in pth_spinlock._global_lock_common at 0x9000000004c983c ($t6404)
    0x9000000004c983c (_global_lock_common+0x4bc) e8410028 ld r2,0x28(r1)
    pth_spinlock._global_lock_common(??, ??, ??) at 0x9000000004c983c
    pth_mutex._mutex_lock(??, ??, ??) at 0x9000000004d7104
    pkmon.pkAcquireMutexTracked(??, ??, ??) at 0x1000078d4
    smutil.smLockSessMutexTracked(??, ??, ??) at 0x1002aac10
    smcancel.CancelSessionNum(??, ??) at 0x10028a04c
    smcancel.smCancelSession(??) at 0x1002893f8
    admcmd.AdmCommandLocal(??, ??, ??, ??, ??) at 0x1007af284
    admcmd.admCommand(??, ??, ??, ??, ??) at 0x1007acd40
    smadmin.SmAdminCommandThread(??) at 0x1008e0c10
    pkthread.StartThread(0x0) at 0x10000da90
    
    _global_lock_common(??, ??, ??) at 0x9000000004c983c
    _mutex_lock(??, ??, ??) at 0x9000000004d7104
    pkAcquireMutexTracked(??, ??, ??) at 0x1000078d4
    smLockSessMutexTracked(??, ??, ??) at 0x1002aac10
    CancelSessionNum(??, ??) at 0x10028a04c
    smCancelSession(??) at 0x1002893f8
    AdmCommandLocal(??, ??, ??, ??, ??) at 0x1007af284
    admCommand(??, ??, ??, ??, ??) at 0x1007acd40
    SmAdminCommandThread(??) at 0x1008e0c10
    StartThread(0x0) at 0x10000da90
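    The hang pattern shown in these stacks, one thread waiting on
    a condition variable while other threads block trying to
    acquire a mutex it holds, can be illustrated with a minimal,
    hypothetical sketch. The names and structure below are
    illustrative only and are not TSM server code.
    
    ```python
    import threading
    import time

    # Hypothetical illustration of the hang pattern (not TSM server
    # code): one thread holds the session mutex and waits on a
    # condition variable that is never signaled, so a cancel thread
    # blocks trying to take the same mutex.

    sess_mutex = threading.Lock()        # stands in for the session mutex
    write_ctrl = threading.Condition()   # separate condvar, never notified
    result = {}

    def session_thread():
        # Analogous to sdEndSession -> SdFreeWriteControl: hold the
        # session mutex, then wait for a write-control signal that
        # never arrives. The real server waits indefinitely; this
        # demo times out so it terminates.
        sess_mutex.acquire()
        with write_ctrl:
            write_ctrl.wait(timeout=2.0)
        sess_mutex.release()

    def cancel_thread():
        # Analogous to CANCEL SESSION -> smLockSessMutexTracked: try
        # to take the session mutex, which the waiting session thread
        # still holds.
        time.sleep(0.3)
        got = sess_mutex.acquire(timeout=0.5)
        result["cancel_blocked"] = not got
        if got:
            sess_mutex.release()

    a = threading.Thread(target=session_thread)
    b = threading.Thread(target=cancel_thread)
    a.start(); b.start()
    a.join(); b.join()
    print("cancel thread blocked:", result["cancel_blocked"])
    ```
    
    Because the session thread never receives the signal it waits
    for, the cancel thread's acquire times out in the sketch; in
    the real server it would block indefinitely, which matches the
    CancelSessionNum stacks above.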
    
    Initial Impact: High
    Additional Keywords: hung deadlock
    

Local fix

  • Do not cancel target replication sessions.
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * All Tivoli Storage Manager server users.                     *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See error description.                                       *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available. This problem is currently *
    * projected to be fixed in levels 7.1.7.100, 7.1.8, and 8.1.1. *
    * Note that this is subject to change at the discretion of     *
    * IBM.                                                         *
    ****************************************************************
    

Problem conclusion

  • This problem was fixed.
    Affected platforms: AIX, Solaris, Linux, and Windows.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT17970

  • Reported component name

    TSM SERVER

  • Reported component ID

    5698ISMSV

  • Reported release

    71A

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2016-11-15

  • Closed date

    2016-12-12

  • Last modified date

    2016-12-12

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    TSM SERVER

  • Fixed component ID

    5698ISMSV

Applicable component levels

  • R71A PSY

       UP

  • R71H PSY

       UP

  • R71L PSY

       UP

  • R71S PSY

       UP

  • R71W PSY

       UP

  • R81A PSY

       UP

  • R81L PSY

       UP

  • R81W PSY

       UP



Document information

More support for: Tivoli Storage Manager

Software version: 7.1.3

Reference #: IT17970

Modified date: 12 December 2016