IT23192: IBM SPECTRUM PROTECT SERVER MAY HANG IN A REPLICATION ENVIRONMENT WAITING FOR A MUTEX

APAR status

  • Closed as program error.

Error description

  • In a replication environment, the source server can enter a
    hang or wait condition.
    Commands such as 'QUERY PROCESS' or 'select from processes' do
    not return any output.
    'QUERY SESSION' shows the replication sessions in status
    IdleW for a long time.
    
    In the IBM Spectrum Protect server monitoring data, the
    show.txt output shows an admin session that is waiting to
    acquire a mutex:
    
    Thread 7286, Parent 7258: SmAdminCommandThread, Storage 108500,
    AllocCnt 101 HighWaterAmt 434840
     tid=e676, ptid=d85a, det=0, zomb=0, join=0, result=0, sess=0,
    procToken=0, sessToken=3725
      Stack trace:
        0x090000000056683c _global_lock_common
        0x0900000000574108 _mutex_lock
        0x0000000100007ed8 pkAcquireMutexTracked
        0x0000000100281084 NrQueryCounts
        0x00000001000d0c9c procQueryProcess
        0x0000000101154044 AdmQueryProcess
        0x0000000100636fc0 AdmCommandLocal
        0x0000000100634978 admCommand
        0x000000010064e448 PreFlushDataForSQL
        0x000000010064d494 IPRA.$ScrubCmdInput
        0x0000000100645a84 IPRA.$PreProcessQuery
        0x0000000100648f94 AdmSQLExecute
        0x0000000100636fc0 AdmCommandLocal
        0x0000000100634978 admCommand
        0x000000010099c7d4 SmAdminCommandThread
        0x000000010000e300 StartThread
      Holding mutex PROCV->mutex (0x11141ffd8), acquired at
    process.c(1152)
      Holding mutex descP->tableMutex (0x130e1f8f8), acquired at
    output.c(1935)
      Acquiring mutex ctlP->fsArrayMutex (0x11f2add98) at
    nrmain.c(13409)    <===  the thread is waiting to acquire the mutex here
     Thread context:
       COMMAND: QUERY PROCESS
       COMMMETHOD: SSL
       THREAD_TYPE: SESSION
       SESSION_TYPE: ADMIN
       ADMIN_NAME: AAAA
    
    The replication sessions are also waiting for the same mutex,
    for example:
    
     Thread 475, Parent 468: NrReplicateFilespace, Storage 9010964,
    AllocCnt 545528 HighWaterAmt 9083686
     tid=43db, ptid=34d4, det=0, zomb=0, join=0, result=0, sess=115,
    procToken=2, sessToken=106
      Stack trace:
        0x090000000056683c _global_lock_common
        0x0900000000574108 _mutex_lock
        0x0000000100007ed8 pkAcquireMutexTracked
        0x0000000100289050 NrReplicateFilespace
        0x000000010083d440 PcConsumerThread
        0x000000010000e300 StartThread
      Acquiring mutex ctlP->fsArrayMutex (0x11f2add98) at
    nrmain.c(5162)   <===  the thread is waiting to acquire the mutex here
     Thread context:
       COMMAND: REPLICATE NODE
       SCHEDULE_TYPE: ADMIN
       SCHEDULE_NAME: REPLICATE_ALL_NODE_INITIAL
       PROCESS_NUMBER: 2
       PROCESS_DESC: Replicate Node
       THREAD_TYPE: PROCESS
       SCHEDULED: YES
    
    One replication thread holds this mutex:
    
    Thread 468, Parent 466: NrReplicationThread, Storage 6550965781,
    AllocCnt 126580 HighWaterAmt 6794766785
     tid=34d4, ptid=32d2, det=1, zomb=0, join=0, result=0, sess=177,
    procToken=2, sessToken=106
      Stack trace:
        0x0900000000589260 _cond_wait_global
        0x0900000000589df8 _cond_wait
        0x090000000058aae0 pthread_cond_wait
        0x00000001000095b4 pkWaitConditionTracked
        0x00000001002c63d8 EnqueueVarQueue
        0x00000001008401a4 ProdConsPutWork
        0x00000001002bd450 IPRA.$MakeTapeBatch
        0x000000010028ece8 IPRA.$ProcessFsCompletion
        0x0000000100283adc NrReplicationThread
        0x000000010000e300 StartThread
      Holding mutex ctlP->fsArrayMutex (0x11f2add98), acquired at
    nrmain.c(3375)  <===  this thread holds the mutex
      Awaiting cond newQueue->notFull (0x113f47ff0), using mutex
    newQueue->mutex (0x11f601df8), at queue.c(1743)
     Thread context:
       COMMAND: REPLICATE NODE
       SCHEDULE_TYPE: ADMIN
       SCHEDULE_NAME: REPLICATE_ALL_NODE_INITIAL
       THREAD_TYPE: PROCESS
       PROCESS_DESC: Replicate Node
       PROCESS_NUMBER: 2
       SCHEDULED: YES
    
    
    Customer/L2 Diagnostics:
    If the target replication server does not have enough mount
    points or volumes to satisfy all of the sessions storing data on
    the target server, the source server may hang.
    
    The hang is caused by a thread that holds the mutex for
    processes while it waits for a mutex for a specific process.
    Other threads, which hold other resources, then wait for the
    process mutex that the first thread holds.
    
    
    IBM Spectrum Protect Server Version Affected:
    Version 7.1.x and above on all platforms
    
    
    Initial Impact:
    High
    
    Additional Keywords:
    TSM server Spectrum Protect hang freeze replication repl process
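
The thread entries shown above can be checked mechanically. The following is a minimal sketch, not an IBM tool: it assumes the show.txt output uses the "Thread N, Parent M: Name", "Holding mutex ..." and "Acquiring mutex ..." line shapes quoted in this APAR. The function name find_mutex_contention and the abbreviated SAMPLE text are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict

# Line shapes taken from the thread entries quoted in this APAR.
THREAD_RE = re.compile(r"Thread (\d+), Parent (\d+): (\w+)")
MUTEX_RE = re.compile(r"(Holding|Acquiring) mutex (\S+) \((0x[0-9a-fA-F]+)\)")


def find_mutex_contention(text):
    """Return, per mutex address with waiters, who holds it and who wants it."""
    holders = defaultdict(list)
    waiters = defaultdict(list)
    current = None  # (thread id, thread name) of the entry being read
    for line in text.splitlines():
        m = THREAD_RE.search(line)
        if m:
            current = (int(m.group(1)), m.group(3))
            continue
        m = MUTEX_RE.search(line)
        if m and current is not None:
            state, _name, addr = m.groups()
            (holders if state == "Holding" else waiters)[addr].append(current)
    # Only mutexes that at least one thread is waiting for are of interest.
    return {
        addr: {"held_by": holders.get(addr, []), "wanted_by": wanted}
        for addr, wanted in waiters.items()
    }


# Abbreviated copy of the three thread entries from this APAR.
SAMPLE = """\
Thread 7286, Parent 7258: SmAdminCommandThread, Storage 108500,
  Holding mutex PROCV->mutex (0x11141ffd8), acquired at
process.c(1152)
  Acquiring mutex ctlP->fsArrayMutex (0x11f2add98) at
nrmain.c(13409)
Thread 475, Parent 468: NrReplicateFilespace, Storage 9010964,
  Acquiring mutex ctlP->fsArrayMutex (0x11f2add98) at
nrmain.c(5162)
Thread 468, Parent 466: NrReplicationThread, Storage 6550965781,
  Holding mutex ctlP->fsArrayMutex (0x11f2add98), acquired at
nrmain.c(3375)
"""

report = find_mutex_contention(SAMPLE)
for addr, info in report.items():
    print(addr, "held by", info["held_by"], "wanted by", info["wanted_by"])
```

For the sample, the report contains a single entry for 0x11f2add98: thread 468 holds it while threads 7286 and 475 wait for it, which matches the analysis in the error description.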
    

Local fix

  • Resolving the media wait condition on the target server
    resolves the hang on the source server.
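
The pattern behind the hang and the local fix can be illustrated with a small model. This is a sketch under stated assumptions, not the server code: a lock stands in for ctlP->fsArrayMutex, a bounded queue stands in for the work queue whose notFull condition the holding thread awaits, and the queue is assumed to stay full because the sessions to the target server are stalled in media wait. All names (simulate, replication_thread, query_process) are hypothetical.

```python
import queue
import threading


def simulate():
    """Model one thread blocking on a full queue while it holds a shared lock."""
    fs_array_mutex = threading.Lock()      # stands in for ctlP->fsArrayMutex
    work_queue = queue.Queue(maxsize=1)    # bounded queue, as with notFull
    work_queue.put("batch-0")              # already full: consumer is stalled
    holding = threading.Event()
    done = threading.Event()

    def replication_thread():
        # Problem pattern: block on the full queue while holding the lock.
        with fs_array_mutex:
            holding.set()
            work_queue.put("batch-1")      # waits until a slot becomes free
        done.set()

    def query_process(timeout):
        # Stand-in for QUERY PROCESS, which needs the same lock.
        got = fs_array_mutex.acquire(timeout=timeout)
        if got:
            fs_array_mutex.release()
        return got

    threading.Thread(target=replication_thread, daemon=True).start()
    holding.wait()                         # the lock is now held

    answered_during_stall = query_process(0.2)

    # Model the media wait being resolved: the consumer drains one item.
    work_queue.get()
    done.wait(timeout=2.0)
    answered_after_fix = query_process(2.0)
    return answered_during_stall, answered_after_fix


during_stall, after_fix = simulate()
print("answered during stall:", during_stall)
print("answered after consumer resumed:", after_fix)
```

In this model the query cannot obtain the lock while the consumer is stalled, and succeeds once the consumer frees a queue slot. The timeouts exist only so that the sketch terminates; the traces in this APAR show plain blocking waits.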
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * All Spectrum Protect server users.                           *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See error description.                                       *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available. This problem is currently *
    * projected to be fixed in levels 7.1.9 and 8.1.5. Note that   *
    * this is subject to change at the discretion of IBM.          *
    ****************************************************************
    

Problem conclusion

  • This problem was fixed.
    Affected platforms: AIX, HP-UX, Solaris, Linux and Windows.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT23192

  • Reported component name

    TSM SERVER

  • Reported component ID

    5698ISMSV

  • Reported release

    81A

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-11-23

  • Closed date

    2018-02-13

  • Last modified date

    2018-02-13

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    TSM SERVER

  • Fixed component ID

    5698ISMSV

Applicable component levels

  • R81A PSY

       UP

  • R81L PSY

       UP

  • R81W PSY

       UP



Document information

More support for: Tivoli Storage Manager

Software version: 81A

Reference #: IT23192

Modified date: 13 February 2018