Tivoli Storage Manager server performance analysis - GPFS FILE I/O

Troubleshooting


Problem

This document provides a step-by-step analysis of Tivoli Storage Manager performance when writing to FILE storage pools.

Symptom

In this example, the Tivoli Storage Manager server shows performance degradation when a client archives data to a FILE storage pool using a GPFS filesystem for storage.

Diagnosing The Problem

To isolate the performance bottleneck that occurs during client backup or archive, you must capture performance trace data from both the Tivoli Storage Manager server and the backup-archive client.

Capture client performance data

To capture the client performance data, add the following option to the dsm.opt file:
testflag instrument:detail

Let the backup or archive operation run to completion. When the operation finishes, the client produces dsminstr.report.p* files, which are needed for the performance analysis.

Capture server performance data

To capture the server performance data, run the server monitoring Perl script that IBM Support provides. The script captures a series of server command outputs at a regular interval. Run this script while the client backup or archive operation is in progress.
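
If the monitoring script is not at hand, similar data can be gathered with a simple collection loop around the dsmadmc administrative client. The following is a minimal sketch only, assuming dsmadmc is configured and that ADMIN/PASSWORD are valid administrator credentials (both hypothetical here); the IBM monitoring script remains the supported tool and collects many more commands:

#!/usr/bin/env python3
"""Minimal sketch: periodically capture server command output into show.txt.
Hypothetical helper; the IBM monitoring Perl script is the supported tool."""
import subprocess
import time

COMMANDS = ["query session format=detailed", "show threads"]
INTERVAL_SECONDS = 60      # capture interval
ITERATIONS = 20            # roughly a 20-minute collection window

with open("show.txt", "a") as out:
    for _ in range(ITERATIONS):
        for cmd in COMMANDS:
            # -id, -password, and -dataonly are standard dsmadmc options.
            result = subprocess.run(
                ["dsmadmc", "-id=ADMIN", "-password=PASSWORD",
                 "-dataonly=yes", cmd],
                capture_output=True, text=True)
            out.write(f"--- {time.ctime()} | {cmd} ---\n")
            out.write(result.stdout)
        out.flush()
        time.sleep(INTERVAL_SECONDS)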

Examine the client performance data

Examine the dsminstr.report.p* files that are produced when the client backup or archive operation completes. Review the client statistics logged at the end of the operation. For example:

Total number of objects inspected:           16
Total number of objects archived:             8
Total number of objects updated:              0
Total number of journal objects:              0
Total number of objects rebound:              0
Total number of objects deleted:              2
Total number of objects expired:              0
Total number of objects failed:               0
Total number of bytes transferred:       316.60 GB
LanFree data bytes:                        0  B
Data transfer time:                   34,339.88 sec
Network data transfer rate:            9,667.54 KB/sec
Aggregate data transfer rate:          9,491.69 KB/sec
Total number of bytes pre-compress:  263,679,119,802
Total number of bytes post-compress: 263,679,119,802
Objects compressed by:                        0%
Total data reduction ratio:               72.90%
Elapsed processing time:               09:42:56
Average file size:                     30.69 GB

This log shows that a total elapsed time of 9 hours, 42 minutes, and 56 seconds was needed to complete the archive. Of this elapsed time, 34,339 seconds (9 hours, 32 minutes, and 19 seconds) were spent transferring the data to the server. Next, review the client instrumentation statistics in the same file. For example:

Section        Actual(sec)  Average(msec)  Frequency used
---------------------------------------------------------
Data Verb        34780.549          107.2          324525
Sleep                0.601          150.2               4
Thread Wait        176.449          199.8             883
Other                4.576            0.0               0

This data also shows that most of the time (34,780 seconds) is spent transferring data to the server (Data Verb). These client statistics indicate a problem either with the network or with the Tivoli Storage Manager server, so the server-side data must be examined next.
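
A quick calculation confirms how dominant the transfer time is. The following sketch (illustrative only) uses the figures from the two reports above to compute the share of elapsed time spent in Data Verb and the effective transfer rate:

# Worked check using the figures reported above.
elapsed_s = 9 * 3600 + 42 * 60 + 56    # 09:42:56 -> 34976 seconds
data_verb_s = 34780.549                # time spent in Data Verb
transferred_gb = 316.60                # total bytes transferred (rounded)

share = data_verb_s / elapsed_s
rate_kb_s = transferred_gb * 1024 * 1024 / data_verb_s

print(f"Data Verb share of elapsed time: {share:.1%}")      # ~99.4%
print(f"Effective transfer rate: {rate_kb_s:,.0f} KB/sec")  # ~9,545 KB/sec

The computed rate is close to the reported network data transfer rate of 9,667.54 KB/sec; the small difference comes from the rounded GB figure and the protocol overhead included in Data Verb.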

Examine the server performance data

Examine the show.txt output that was collected on the server by the monitoring Perl script. Identify the session numbers that are used by the client backup or archive operation. For example:

ANS8000I Server command: 'Q SE F=D'

  Sess  Comm.   Sess  Wait Bytes    Bytes  Sess Platform Client Name  
Number  Method  State Time  Sent    Recvd  Type                        
------  ------  ----- ---- -----  -------  ---- -------- --------------
42,505  Tcp/Ip  Run   0 S    934  114.4 G  Node Proxynode GPFS (MYNODE)

The output shows that session 42505 is used by the node GPFS; it is a proxy session performed by the MYNODE client.
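
When show.txt contains many sessions, the relevant session numbers can be picked out programmatically. The following is a minimal sketch (a hypothetical helper, not part of the IBM tooling) that extracts session numbers for a given node name from the 'Q SE F=D' rows:

import re

def sessions_for_node(show_txt, node):
    """Yield session numbers from rows that mention the given node name."""
    # Matches data rows such as:
    # 42,505  Tcp/Ip  Run   0 S    934  114.4 G  Node Proxynode GPFS (MYNODE)
    row = re.compile(r"^\s*([\d,]+)\s+\S+.*\b" + re.escape(node) + r"\b")
    for line in show_txt.splitlines():
        m = row.match(line)
        if m:
            yield int(m.group(1).replace(",", ""))

with open("show.txt") as f:
    print(list(sessions_for_node(f.read(), "GPFS")))   # -> [42505]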

Next, examine the show.txt output again and locate the thread number for session 42505 in the "show threads" output. For example:

Thread 60278, Parent 60: psSessionThread, Storage 10275237, AllocCnt 297 HighWaterAmt 10275237
tid=140734181132032, ptid=140736717911808, det=1, zomb=0, join=0, result=0, sess=42505
Awaiting cond sessP->wrBufEmptyCond (0x0x7fffed18db20), using mutex sessP->mutex (0x0x7fff48961ce8), at sstrans.c(11353)
lwp=7229

Thread 60279, Parent 60278: SsAuxSinkThread, Storage 445479, AllocCnt 286 HighWaterAmt 488949
tid=140736721331968, ptid=140734181132032, det=1, zomb=0, join=0, result=0, sess=0
Awaiting cond agentP->doneCond (0x0x7fffed18d0a0), using mutex agentP->mutex (0x0x7ffef00479e8), at pvr.c(13650)
lwp=7230

Thread 60410, Parent 60279: AgentThread, Storage 5486, AllocCnt 18 HighWaterAmt 50095
tid=140734184675072, ptid=140736721331968, det=1, zomb=0, join=0, result=0, sess=0
Holding mutex agentP->mutex (0x0x7ffef00479e8), acquired at pvr.c(13724)
lwp=7784

This output shows that session 42505 (sess=42505) runs on thread 60278, the psSessionThread that receives the data from the client. Thread 60279 (SsAuxSinkThread) is a child of thread 60278, and thread 60410 (AgentThread) is in turn a child of thread 60279; thread 60410 is the thread that writes the data to the FILE storage pool.
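
Tracing the parent-child chain by hand is error-prone in a large "show threads" dump. The following parsing sketch (again a hypothetical helper, not part of the IBM tooling) finds the thread that owns a given session and lists its descendant threads:

import re

# Matches header lines such as:
# Thread 60278, Parent 60: psSessionThread, Storage 10275237, ...
HEADER = re.compile(r"Thread (\d+), Parent (\d+): (\w+)")
SESS = re.compile(r"\bsess=(\d+)")

def thread_tree_for_session(show_txt, session):
    """Return (session_thread, descendants) parsed from 'show threads' output."""
    threads = {}          # thread id -> (parent id, thread name)
    sess_thread = None
    current = None
    for line in show_txt.splitlines():
        m = HEADER.search(line)
        if m:
            current = int(m.group(1))
            threads[current] = (int(m.group(2)), m.group(3))
            continue
        s = SESS.search(line)
        if s and current is not None and int(s.group(1)) == session:
            sess_thread = current
    # Walk the table to collect all descendants of the session thread.
    descendants, frontier = [], {sess_thread}
    changed = True
    while changed:
        changed = False
        for tid, (parent, name) in threads.items():
            if parent in frontier and tid not in frontier:
                frontier.add(tid)
                descendants.append((tid, name))
                changed = True
    return sess_thread, descendants

# Usage: with the dump above, session 42505 resolves to thread 60278,
# with descendants 60279 (SsAuxSinkThread) and 60410 (AgentThread).
with open("show.txt") as f:
    print(thread_tree_for_session(f.read(), 42505))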

Next, examine the performance statistics logged in the instr.txt file. Review the statistics for the related threads (threads 60278 and 60410 in this example). For example:

Thread 60278 psSessionThread parent=60  11:07:05.551-->11:28:05.381
Operation       Count  Tottime  Avgtime  Mintime  Maxtime InstTput  Total KB
----------------------------------------------------------------------------
Network Recv   494997    6.114    0.000    0.000    0.133 1355404.1   8287357
Network Send       28    0.000    0.000    0.000    0.000                  0
Thread Wait     32383 1253.223    0.039    0.000    0.370
Unknown                  0.491
----------------------------------------------------------------------------
Total                 1259.829                              6578.2   8287357

In this example, the performance data was captured over a 20-minute interval (11:07:05.551-->11:28:05.381). In this interval, 8287357 KB (Total KB) were received over the network (Network Recv) by the Tivoli Storage Manager server, and only 6.114 seconds (Tottime) were spent receiving that data. This indicates a good network data transfer rate.

Then, examine the child thread that writes to the FILE storage pool (thread 60410 in this example). For example:

Thread 60410 AgentThread parent=60279  11:07:05.551-->11:28:05.381
Operation       Count  Tottime  Avgtime  Mintime  Maxtime InstTput  Total KB
----------------------------------------------------------------------------
Disk Write      32383 1255.114    0.039    0.015    0.370   6604.8   8289792
Thread Wait     32382    4.434    0.000    0.000    0.092
Unknown                  0.281
----------------------------------------------------------------------------
Total                 1259.829                              6580.1   8289792

The same 20-minute interval (11:07:05.551-->11:28:05.381) applies here. In that interval, 8289792 KB (Total KB) were written to the FILE storage pool volume (Disk Write) on the Tivoli Storage Manager server, and 1255 seconds (Tottime) were spent writing the data to disk. This indicates poor disk write performance.

In conclusion, within the same 20-minute interval, transferring about 8 GB of data over the network took only 6 seconds, while writing that same data to disk took nearly the entire interval (1255 seconds).

In this example, the disk write performance is the performance bottleneck.
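
The instrumentation numbers translate into a stark throughput gap. A short calculation (illustrative only, using the Tottime and Total KB values above) makes the comparison explicit:

# Throughput comparison from the two thread reports above.
network_kb, network_s = 8287357, 6.114   # Network Recv (thread 60278)
disk_kb, disk_s = 8289792, 1255.114      # Disk Write  (thread 60410)

net_rate = network_kb / network_s / 1024   # MB/sec
disk_rate = disk_kb / disk_s / 1024        # MB/sec

print(f"Network receive rate: {net_rate:,.0f} MB/sec")        # ~1,324 MB/sec
print(f"Disk write rate: {disk_rate:,.1f} MB/sec")            # ~6.4 MB/sec
print(f"Disk write is ~{net_rate / disk_rate:,.0f}x slower")  # ~205x

These figures match the InstTput column of the reports (1355404.1 KB/sec for Network Recv versus 6604.8 KB/sec for Disk Write).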

Resolving The Problem

In this example, the client archived data to a FILE storage pool with volumes defined on a GPFS file system. The disk write performance was improved with the "DIRECTIO NO" server option, which causes the server to use buffered I/O rather than direct I/O when writing to FILE volumes. Add the following option to the dsmserv.opt file of the Tivoli Storage Manager server instance:

DIRECTIO NO

Stop and restart the Tivoli Storage Manager server instance after adding the option.

[{"Product":{"code":"SSGSG7","label":"Tivoli Storage Manager"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Component":"Server","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF016","label":"Linux"}],"Version":"All Supported Versions","Edition":"","Line of Business":{"code":"LOB26","label":"Storage"}}]

Document Information

Modified date:
17 June 2018

UID

swg21696171