Troubleshooting
Problem
This document provides a step-by-step analysis of Tivoli Storage Manager performance when writing to FILE storage pools.
Symptom
In this example, the Tivoli Storage Manager server shows performance degradation when a client archives data to a FILE storage pool using a GPFS filesystem for storage.
Diagnosing The Problem
To isolate the performance bottleneck that occurs during client backup or archive, you must capture performance trace data from both the Tivoli Storage Manager server and the backup-archive client.
Capture client performance data
To capture the client performance data, add the following option to the dsm.opt file:
testflag instrument:detail
Let the backup or archive operation run to completion. At the end of the backup or archive, dsminstr.report.p* files are produced. These files are needed for performance analysis.
Capture server performance data
To capture the server performance data, run the server monitoring Perl script, which captures a series of command outputs at a regular interval. Run this script while the client backup or archive operation is in progress.
Examine the client performance data
Examine the dsminstr.report.p* files produced after the client backup or archive operation completes. Review the client statistics logged at the completion of the backup or archive operation. For example:
Total number of objects inspected: 16
Total number of objects archived: 8
Total number of objects updated: 0
Total number of journal objects: 0
Total number of objects rebound: 0
Total number of objects deleted: 2
Total number of objects expired: 0
Total number of objects failed: 0
Total number of bytes transferred: 316.60 GB
LanFree data bytes: 0 B
Data transfer time: 34,339.88 sec
Network data transfer rate: 9,667.54 KB/sec
Aggregate data transfer rate: 9,491.69 KB/sec
Total number of bytes pre-compress: 263,679,119,802
Total number of bytes post-compress: 263,679,119,802
Objects compressed by: 0%
Total data reduction ratio: 72.90%
Elapsed processing time: 09:42:56
Average file size: 30.69 GB
This log shows that a total elapsed time of 9 hours, 42 minutes, and 56 seconds was needed to complete the archive. Of this elapsed time, 34,339 seconds (9 hours, 32 minutes, and 19 seconds) were spent transferring the data to the server. Review the client instrumentation statistics in the same file. For example:
Section Actual(sec) Average(msec) Frequency used
------------------------------------------------------------------
Data Verb 34780.549 107.2 324525
Sleep 0.601 150.2 4
Thread Wait 176.449 199.8 883
Other 4.576 0.0 0
This data also shows that most of the time (34,780 seconds) is spent transferring the data (Data Verb) to the server. These client statistics indicate a problem with either the network or the Tivoli Storage Manager server.
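As a quick sanity check, the Data Verb time can be compared against the total elapsed processing time. The following is an illustrative sketch (not part of the Tivoli Storage Manager tooling) using the figures quoted above:

```python
# Figures taken from the client statistics and instrumentation output above
elapsed_sec = 9 * 3600 + 42 * 60 + 56   # Elapsed processing time: 09:42:56
data_verb_sec = 34780.549               # Data Verb: time spent sending data

fraction = data_verb_sec / elapsed_sec
print(f"{fraction:.1%} of the elapsed time was spent in Data Verb")
```

With these numbers, roughly 99% of the elapsed time is spent in Data Verb, which is why the investigation moves to the network and the server.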
Examine the server performance data
Examine the show.txt output that was collected on the server by the Perl script mentioned above. Identify the session numbers used by the client backup or archive operation. For example:
ANS8000I Server command: 'Q SE F=D'
Sess Comm. Sess Wait Bytes Bytes Sess Platform Client Name
Number Method State Time Sent Recvd Type
------ ------ ----- ---- ----- ------- ---- -------- --------------
42,505 Tcp/Ip Run 0 S 934 114.4 G Node Proxynode GPFS (MYNODE)
The output shows that session 42505 is used by node GPFS, through a proxy session established by the MYNODE client.
Next, examine the show.txt output again and locate the thread number for session 42505 in the "show threads" output. For example:
Thread 60278, Parent 60: psSessionThread, Storage 10275237, AllocCnt 297 HighWaterAmt 10275237
tid=140734181132032, ptid=140736717911808, det=1, zomb=0, join=0, result=0, sess=42505
Awaiting cond sessP->wrBufEmptyCond (0x0x7fffed18db20), using mutex sessP->mutex (0x0x7fff48961ce8), at sstrans.c(11353)
lwp=7229
Thread 60279, Parent 60278: SsAuxSinkThread, Storage 445479, AllocCnt 286 HighWaterAmt 488949
tid=140736721331968, ptid=140734181132032, det=1, zomb=0, join=0, result=0, sess=0
Awaiting cond agentP->doneCond (0x0x7fffed18d0a0), using mutex agentP->mutex (0x0x7ffef00479e8), at pvr.c(13650)
lwp=7230
Thread 60410, Parent 60279: AgentThread, Storage 5486, AllocCnt 18 HighWaterAmt 50095
tid=140734184675072, ptid=140736721331968, det=1, zomb=0, join=0, result=0, sess=0
Holding mutex agentP->mutex (0x0x7ffef00479e8), acquired at pvr.c(13724)
lwp=7784
In this example, session 42505 uses thread 60278 (psSessionThread); this is the thread that receives the data from the client. Thread 60410 (AgentThread, a child of thread 60279, which is in turn a child of 60278) is the thread that writes data to the FILE storage pool.
Next, examine the performance statistics logged in the instr.txt file for the related threads (threads 60278 and 60410 in this example). For example:
Thread 60278 psSessionThread parent=60 11:07:05.551-->11:28:05.381
Operation Count Tottime Avgtime Mintime Maxtime InstTput Total KB
----------------------------------------------------------------------------
Network Recv 494997 6.114 0.000 0.000 0.133 1355404.1 8287357
Network Send 28 0.000 0.000 0.000 0.000 0
Thread Wait 32383 1253.223 0.039 0.000 0.370
Unknown 0.491
----------------------------------------------------------------------------
Total 1259.829 6578.2 8287357
In this example, the performance data was captured over a 20-minute interval (11:07:05.551-->11:28:05.381). In that interval, 8,287,357 KB (Total KB) was transferred over the network (Network Recv) to the Tivoli Storage Manager server, and only about 6 seconds (Tottime) were spent receiving it. This indicates a good network data transfer rate.
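The network receive rate implied by these figures can be verified with a short calculation. This is an illustrative sketch using the Total KB and Tottime values from the report above:

```python
# Figures from the psSessionThread (60278) instrumentation above
network_kb = 8287357    # Total KB received over the network in the interval
network_sec = 6.114     # Tottime for the Network Recv operation

rate_kb_s = network_kb / network_sec
print(f"Network receive rate: ~{rate_kb_s:,.0f} KB/sec")
```

The result (roughly 1.35 million KB/sec) is consistent with the InstTput column in the report, confirming that the network is not the bottleneck.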
Then, examine the child thread that writes to the FILE storage pool (thread 60410 in this example). For example:
Thread 60410 AgentThread parent=60279 11:07:05.551-->11:28:05.381
Operation Count Tottime Avgtime Mintime Maxtime InstTput Total KB
----------------------------------------------------------------------------
Disk Write 32383 1255.114 0.039 0.015 0.370 6604.8 8289792
Thread Wait 32382 4.434 0.000 0.000 0.092
Unknown 0.281
----------------------------------------------------------------------------
Total 1259.829 6580.1 8289792
In this example, the performance data was captured over the same 20-minute interval (11:07:05.551-->11:28:05.381). In that interval, 8,289,792 KB (Total KB) was written to the FILE storage pool volume (Disk Write) on the Tivoli Storage Manager server. It took 1,255 seconds (Tottime) to write the data to disk, an instantaneous throughput of only about 6,600 KB/sec (InstTput). This shows poor write performance to disk.
In conclusion, in this 20-minute interval, it took about 6 seconds to transfer roughly 8 GB of data over the network, but almost the full 20 minutes (1,255 seconds) to write that data to disk.
In this example, the disk write performance is the performance bottleneck.
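The gap between the two rates can be quantified with a short calculation. This is an illustrative sketch using the figures from the two instrumentation reports above:

```python
# Figures from the two instrumentation reports above
network_kb, network_sec = 8287357, 6.114    # Network Recv on thread 60278
disk_kb, disk_sec = 8289792, 1255.114       # Disk Write on thread 60410

net_rate = network_kb / network_sec         # KB/sec received from the client
disk_rate = disk_kb / disk_sec              # KB/sec written to the FILE pool
print(f"Network: {net_rate:,.0f} KB/sec, Disk: {disk_rate:,.0f} KB/sec")
print(f"Disk write is ~{net_rate / disk_rate:.0f}x slower than the network")
```

With these numbers, the disk write path is roughly 200 times slower than the network path, which points the resolution at the storage pool I/O configuration rather than the client or network.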
Resolving The Problem
In this example, the client archived data to a FILE storage pool with volumes defined on a GPFS file system, and the disk write performance was improved with the "DIRECTIO NO" server option. Add the following option to the dsmserv.opt file of the Tivoli Storage Manager server instance:
DIRECTIO NO
Stop and restart the Tivoli Storage Manager server instance after adding the option.
Document Information
Modified date:
17 June 2018
UID
swg21696171