
Tivoli Storage Manager server performance analysis - Disk I/O

Troubleshooting


Problem

This document provides a step-by-step analysis of Tivoli Storage Manager performance when writing to disk storage pools.

Symptom

In this example, the Tivoli Storage Manager server shows performance degradation for all clients that back up or archive directly to a DISK storage pool.

Diagnosing The Problem


To isolate the performance bottleneck that occurs during client backup or archive, you must capture performance trace data from both the Tivoli Storage Manager server and the backup-archive client.

Capture client performance data

To capture the client performance data, add the following option to the client options file (dsm.opt):
testflag instrument:detail
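For reference, a minimal Windows dsm.opt with the instrumentation flag added might look like the following; the node name and server address shown here are placeholders, not values taken from this example:

NODENAME          TESTNODE
TCPSERVERADDRESS  tsmserver.example.com
TESTFLAG          instrument:detail

Remove the testflag option again after the data is collected.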

Let the backup or archive operation run to completion. At the end of the backup or archive, dsminstr.report.p* files are produced. These files are needed for performance analysis.

Capture server performance data

To capture the server performance data, run the Perl script to collect server monitoring data. The script produces a series of command outputs that are captured at a regular interval. Run this script while the client backup or archive operation is in progress.
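For example, assuming the IBM-provided monitoring script was downloaded as servermonV6.pl (the exact script name and its options depend on the version supplied by IBM support), it might be started from a command prompt on the server system and left running for the duration of the client operation:

perl servermonV6.pl

The script collects the output files that are examined below, including show.txt and instr.txt.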

Examine the client performance data

Examine the dsminstr.report.p* files that are produced after the client backup or archive operation completes. Review the client statistics logged at the completion of the operation. For example:

Total number of objects inspected:      140,659
Total number of objects archived:       140,642
Total number of objects updated:              0
Total number of journal objects:          0
Total number of objects rebound:              0
Total number of objects deleted:              0
Total number of objects expired:              0
Total number of objects failed:               1
Total number of bytes transferred:        90.90 GB
LanFree data bytes:                           0  B
Data transfer time:                    6,799.86 sec
Network data transfer rate:           14,018.15 KB/sec
Aggregate data transfer rate:         10,496.96 KB/sec
Total number of bytes pre-compress:  97,568,079,307
Total number of bytes post-compress: 97,568,079,307
Objects compressed by:                        0%
Total data reduction ratio:                0,00%
Elapsed processing time:               02:31:20
Average file size:                    677.47 KB

This log shows that a total elapsed time of 2 hours, 31 minutes, and 20 seconds was needed to complete the archive. Of this elapsed time, 6,799 seconds (1 hour, 53 minutes, and 19 seconds) were spent transferring the data to the server. Review the client instrumentation statistics in the same file. For example:

Detailed Instrumentation statistics for
Thread: 8836  Elapsed time =  9080,856 sec
Section                  Actual(sec) Average(msec) Frequency used
------------------------------------------------------------------
Compute                    7,956      0,0  3194844
BeginTxn Verb              0,000      0,0     2052
Transaction               89,812     43,8     2052
File I/O                1415,186      0,4  3335491
Data Verb               6796,311      2,1  3194844
Confirm Verb               3,884     86,3       45
EndTxn Verb              764,073    372,4     2052
Thread Wait                2,371    474,2        5
Other                      1,264      0,0        0
------------------------------------------------------------------
Detailed Instrumentation statistics for
Thread: 9184  Elapsed time =  9080,856 sec
Section                  Actual(sec) Average(msec) Frequency used
------------------------------------------------------------------
Process Dirs             174,067      9,9    17589
Solve Tree               221,786      1,6   140658
Sleep                      0,062     62,4        1
Thread Wait             8670,433    801,0    10824
Other                     14,508      0,0        0

This data also shows that most of the time (6796 seconds) is spent transferring the data (Data Verb) to the server. These client statistics indicate a problem either with the network or with the Tivoli Storage Manager server.
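As a rough cross-check, the transfer rates reported in the archive summary can be reproduced from these figures (values rounded):

Network data transfer rate   = bytes transferred / data transfer time
                             = 90.90 GB / 6,799.86 sec  = about 14,000 KB/sec
Aggregate data transfer rate = bytes transferred / elapsed processing time
                             = 90.90 GB / 9,080 sec     = about 10,500 KB/sec

Both values match the rates in the archive summary, confirming that the data transfer (Data Verb) phase dominates the elapsed time.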

Examine the server performance data

Examine the show.txt output that was collected on the server by the Perl script mentioned above. Identify the session numbers that are used by the client backup or archive operation. For example:

ANS8000I Server command: 'Q SE F=D'        

  Sess  Comm.   Sess     Wait    Bytes    Bytes  Sess     Platform  Client  
Number  Method  State    Time     Sent    Recvd  Type               Name    
------  ------  ------ ------  -------  -------  -------  -------- ---------
    22  Tcp/Ip  IdleW   20 S    90.3 K    3.5 M  Node     WinNT     TESTNODE
    23  Tcp/Ip  Run      0 S    49.0 K   74.3 G  Node     WinNT     TESTNODE  

The output shows that sessions 22 and 23 are used by the client TESTNODE. The session of interest in this output is session 23, which is the session that is sending the data to the server (note the large Bytes Recvd value).
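The same session information can also be displayed directly from an administrative client while the operation is running. For example, assuming an administrator ID of admin (the ID and password here are placeholders):

dsmadmc -id=admin -password=xxxxx "query session format=detailed"

Look for the session with a large and growing Bytes Recvd value; that is the data session for the client.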

Next, examine the show.txt output again and locate the thread number for session 23 in the "show threads" output. For example:

Thread 49, ID 6664 (0x1a08): SessionThread
Parent=38, result=0, joining=0, detached=1, zombie=0, session=23
Waiting for Cond 70710696 (&sessP->wrBufEmptyCond) using mutex 0 (na). Waiting from

Stack trace:
0000000076D3742A NtWaitForMultipleObjects()+a
000000007662A313 FlsSetValue()+1e3
0000000076620BC1 WaitForMultipleObjects()+11
000007FEF7650323 pkWaitCondition()+a3
000007FEF7E2015E ssSetExclRclmVols()+40e
000007FEF7E2B8F0 ssStore()+1070
000007FEF78A948A DfCreate()+36a
000007FEF784273A BfReconstruct()+3c7a
000007FEF7843BD2 bfCreate()+522
000007FEF7CEA042 SmDoPSQryExt()+2842
000007FEF7CECBC4 SmDoArchInsNorm()+a44
000007FEF7CF4C60 SmNodeSession()+3f70
000007FEF7D6522C SmSendAuthResult()+21cc
000007FEF7D6DF19 smExecuteSession()+1e09
000007FEF7FBF5B9 tcpFlush()+479
000007FEF7646654 startThread()+124
00000000733B1D9F endthreadex()+43
00000000733B1E3B endthreadex()+df
000000007662A54D BaseThreadInitThunk()+d
0000000076D16861 RtlUserThreadStart()+21

This output shows that session 23 uses thread 49, which is the thread that receives the data from the client. Next, examine the show.txt output again and locate any child threads of thread 49. For example:

Thread 50, ID 3928 (0x0f58): SsAuxSinkThread
Parent=49, result=0, joining=0, detached=1, zombie=0, session=0

Stack trace:
0000000076D36EFA ZwWriteFile()+a
000000007663ADD4 WriteFile()+84
000007FEF764C4E3 bdSync()+303
000007FEF764CE6C bdBatch()+13c
000007FEF7E39287 DsReleaseBlocks()+bc7
000007FEF7E3D494 DsWriteSeg()+5f4
000007FEF7E2797B SsIsValidFrame()+421b
000007FEF7E28351 ssValidateSfFrames()+571
000007FEF7646654 startThread()+124
00000000733B1D9F endthreadex()+43
00000000733B1E3B endthreadex()+df
000000007662A54D BaseThreadInitThunk()+d
0000000076D16861 RtlUserThreadStart()+21

This output shows that thread 50 was started by thread 49 (Parent=49). Thread 50 is the thread that writes the data to the disk storage pool.
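The thread listings above come from the server "show threads" diagnostic output that the Perl script captures in show.txt. If needed, the same output can be generated manually from an administrative client while the session is active, for example with:

show threads

Note that SHOW commands are unsupported diagnostic commands and should normally be used only as directed by IBM support.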

Next, examine the performance statistics logged in the instr.txt file. Examine the statistics for the related threads (threads 49 and 50 in this example). For example:

Thread 49 SessionThread (Win Thread ID 6664)  15:27:23.297-->15:47:31.883
Operation       Count  Tottime  Avgtime  Mintime  Maxtime InstTput  Total KB
----------------------------------------------------------------------------
Disk Write         59    0.609    0.010    0.001    0.017    387.5       236
Network Recv   1495555  173.251    0.000    0.000    0.284  74030.4  12825840
Network Send      668    0.006    0.000    0.000    0.001   1666.7        10
DB2 Fetch Prep     30    0.005    0.000    0.000    0.001
DB2 Fetch Exec    738    0.120    0.000    0.000    0.014
DB2 MFtch Exec     18    0.019    0.001    0.000    0.009
DB2 Inser Exec   2267   33.784    0.015    0.000    1.066
DB2 Delet Exec    714    0.057    0.000    0.000    0.001
DB2 Updat Exec     15    0.000    0.000    0.000    0.000
DB2 Fetch         738    0.009    0.000    0.000    0.001
DB2 MFetch       1985    0.085    0.000    0.000    0.001
DB2 Commit        354    7.816    0.022    0.002    0.104
DB2 Reg Prep      251    0.022    0.000    0.000    0.001
DB2 Reg Exec    14814    1.437    0.000    0.000    0.333
Thread Wait     51269  989.886    0.019    0.000    1.804
Unknown                  1.480
----------------------------------------------------------------------------
Total                 1208.586                             10612.5  12826086

In this example, the performance data was captured for an interval of 20 minutes (15:27:23.297-->15:47:31.883). In this 20-minute interval, 12825840 KB (Total KB) were received over the network (Network Recv) by the Tivoli Storage Manager server. Transferring that data took only 173 seconds (Tottime), which indicates a good network data transfer rate.

Then, examine the child thread that writes to the DISK pool (thread 50 in this example). For example:

Thread 50 SsAuxSinkThread (Win Thread ID 3928)  15:27:23.297-->15:47:31.883
Operation       Count  Tottime  Avgtime  Mintime  Maxtime InstTput  Total KB
----------------------------------------------------------------------------
Disk Write      66681 1055.171    0.016    0.002    1.776  12154.5  12825024
DB2 Inser Exec  20627    0.978    0.000    0.000    0.010
Thread Wait     50935  151.768    0.003    0.000    2.297
Unknown                  0.669
----------------------------------------------------------------------------
Total                 1208.586                             10611.6  12825024

In this example, the performance data was captured for the same 20-minute interval (15:27:23.297-->15:47:31.883). In this interval, 12825024 KB (Total KB) were written (Disk Write) to a DISK storage pool on the Tivoli Storage Manager server. Writing that data to disk took 1055 seconds (Tottime), which indicates poor write performance to disk.

In conclusion, in this 20-minute interval, it took less than 3 minutes (173 seconds) to transfer about 12 GB of data over the network, but more than 17 minutes (1055 seconds) to write the same data to disk.
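The difference is clear when the throughput of the two threads is compared directly (values taken from the tables above, rounded):

Network receive rate (thread 49): 12825840 KB / 173 sec  = about 74,000 KB/sec
Disk write rate      (thread 50): 12825024 KB / 1055 sec = about 12,150 KB/sec

The server takes data off the network roughly six times faster than it can write it to the disk pool, so the session spends most of its time waiting for disk I/O to complete.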

In this example, the disk write performance is the performance bottleneck.

Resolving The Problem

In this example, the client was configured to archive data to a DISK pool with volumes defined on the D: and E: drives in Windows. The D: and E: drives were Windows volumes on a RAID 5 array behind an IBM ServeRAID-MR10i adapter. In this case, the RAID adapter write cache was disabled, which caused the poor disk write performance.

Review the following document for information about how to re-enable the cache:


http://www.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5078572

In this example, because the environment was protected by an uninterruptible power supply (UPS), the following MegaRAID Command Line Interface (MegaCLI) commands were used to re-enable the write cache:

MegaCli -LDSetProp -WB -Lall -aALL
MegaCli -LDSetProp -CachedBadBBU -Lall -aALL
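The first command sets the write policy of the logical drives to write-back (-WB); the second keeps write-back caching enabled even when the battery backup unit is unavailable (CachedBadBBU), which is acceptable here only because the disks are protected by a UPS. After the change, the current cache policy can be confirmed with a command similar to the following (exact MegaCLI syntax can vary by adapter and MegaCLI version):

MegaCli -LDGetProp -Cache -Lall -aALL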


After the cache was re-enabled, write performance to the disk pool was more than twice as fast as before. Consult your hardware support vendor for more information about enabling the adapter cache.

[{"Product":{"code":"SSGSG7","label":"Tivoli Storage Manager"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Component":"Server","Platform":[{"code":"PF033","label":"Windows"}],"Version":"All Supported Versions","Edition":"","Line of Business":{"code":"LOB26","label":"Storage"}}]

Document Information

Modified date:
17 June 2018

UID

swg21689148