DB2 10.5 for Linux, UNIX, and Windows

Monitoring and tuning workload management dispatcher performance

You can monitor and tune workload management dispatcher performance with the table functions and monitoring elements provided with the DB2® database manager. Details are provided here.

Introduction

To monitor and tune the performance of the workload management dispatcher and achieve the best results, you need the right tools. The DB2 database manager provides table functions and monitoring elements that help you monitor the performance of the dispatcher. After analyzing the collected monitoring data as described here, you can tune dispatcher performance by adjusting the dispatcher concurrency level, or by redistributing CPU entitlements through service class CPU shares and CPU limits.

The following sections describe the types of workloads to consider, which differ in how they are best monitored to deliver the appropriate data for analysis, and the performance measures that are most suitable for each type of workload.

Types of workloads

When measuring performance to tune your dispatcher configuration for the best possible workload performance, there are two types of workloads to consider: batch and transactional. Each type of workload has characteristic measures of performance that are best suited to determining how well your system is performing under that type of workload. Use the performance measures that best characterize the workload type your system is experiencing.

Batch

A batch workload has one or more applications connecting to the database, with each application submitting activity after activity or transaction after transaction without pause. The most important measure of performance for this workload is how quickly the entire set of activities or transactions completes, and the processing speed of the database manager is the main determinant of that completion time.

Transactional

A transactional workload has a user at a terminal who submits an activity or transaction to the database, waits for a response, analyzes the response, and decides whether to submit a follow-up activity or transaction. For this type of workload, the most important measure of performance is how quickly the user gets back an individual result. The main determinant of that response time is how quickly the database manager can process a single activity or transaction for each individual user on the system. How quickly the database manager can process all the activities or transactions from a user over a given period is not the relevant metric, because that rate depends more on user behavior than on the performance of the database manager.

Performance measures

You can use the following performance measures to ascertain how well your system is performing under a particular type of workload.

Average throughput

Average throughput is the average number of service completions per unit time. If the service is a transaction or unit of work (UOW), then the average UOW throughput is the number of unit of work completions per unit time. It is usually presented as transactions per second or transactions per minute. Average throughput is a useful measure of system performance when the type of work being measured is a batch workload.

Average activity throughput is the average number of activity completions per unit time. On a system with mostly long-running units of work that contain many individual activities, measuring activity throughput tracks the progress of the workload more easily than measuring UOW throughput.
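As an arithmetic illustration, average throughput over a sampling interval is simply the change in a completion counter divided by the interval length. The following Python sketch uses an assumed function name and illustrative sample values; it is not a DB2 interface:

```python
# Illustrative sketch: average throughput over a sampling interval.
# The function name and the sample values are assumptions for this
# example, not actual DB2 monitor elements.

def average_throughput(completions_start, completions_end, interval_seconds):
    """Average number of completions per second over the interval."""
    return (completions_end - completions_start) / interval_seconds

# Example: 1800 UOWs completed over a 60-second interval
# gives an average UOW throughput of 30 transactions per second.
uow_per_sec = average_throughput(0, 1800, 60)
```

The same calculation applies to activity throughput; only the counter being sampled changes.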

Average response time

Average response time is the average amount of time it takes to get a single service completion from the time the service was requested. If the service is a transaction or unit of work (UOW), then the average UOW response time is the average amount of time it takes for a UOW to complete from the time it was requested. Average response time is a useful measure of system performance when the type of work being measured is a transactional workload. The closest approximation to average UOW response time is the uow_lifetime_avg statistic available from the MON_SAMPLE_SERVICE_CLASS_METRICS and MON_SAMPLE_WORKLOAD_METRICS table functions, the WLM_GET_SERVICE_SUBCLASS_STATS and WLM_GET_WORKLOAD_STATS table functions, and the event_scstats and event_wlstats event monitor logical data groups reported in the WLM statistics event monitor. A more sophisticated form of UOW lifetime information is available in the UowLifetime histogram, also available in the event monitor.

Average activity response time is the average amount of time it takes for a single activity to return its result from the time the activity was started. The closest approximation to average activity response time is the coord_act_lifetime_avg statistic available from the WLM_GET_SERVICE_SUBCLASS_STATS and WLM_GET_WORKLOAD_STATS table functions, and the event_scstats and event_wlstats event monitor logical data groups. This number is measured at each member and is reset when a member is deactivated or the WLM_COLLECT_STATS procedure is called. It is an approximation because one type of activity, a cursor activity, can return some results before it finishes, and relies on the user to finish reading the result set and close the cursor before the activity is considered complete. A more sophisticated form of activity lifetime is available in the CoordActLifetime histogram, also available in the event monitor.
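Conceptually, an average lifetime statistic such as uow_lifetime_avg is the mean of the individual request-to-completion times observed in the interval. A minimal Python sketch, with the function name and lifetimes as illustrative assumptions:

```python
# Illustrative sketch: average response time as the mean of individual
# lifetimes. The function name and sample values are assumptions for
# this example, not actual DB2 monitor elements.

def average_response_time(lifetimes_ms):
    """Average time from request to completion, in milliseconds."""
    return sum(lifetimes_ms) / len(lifetimes_ms)

# Five UOWs with lifetimes of 120, 95, 210, 80, and 145 ms
# average out to 130 ms.
avg_ms = average_response_time([120, 95, 210, 80, 145])
```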

CPU utilization

Another metric that is useful when tuning the workload management dispatcher, regardless of the type of workload, is CPU utilization: the fraction of the time that the CPU resources are busy on the host or LPAR. CPU utilization is the metric that the workload management dispatcher uses to allocate CPU resources to any one service class, and also the metric that you can use to verify that your dispatcher configuration is working the way you intended. You can measure CPU utilization over the same intervals as the uow_throughput, uow_lifetime_avg, and act_throughput monitor elements by using the MON_SAMPLE_SERVICE_CLASS_METRICS and MON_SAMPLE_WORKLOAD_METRICS table functions, the WLM_GET_SERVICE_SUBCLASS_STATS and WLM_GET_WORKLOAD_STATS table functions, and the event_scstats and event_wlstats event monitor logical data groups collected and reported by the WLM statistics event monitor.
Note: If CPU utilization measurements for the service classes that you created are not as expected, check for workloads that are running under the default user and maintenance service classes, because those workloads were not explicitly assigned to your service classes. Each of these default service classes is assigned 1000 hard CPU shares when CPU shares are first enabled, so forgetting to account for workloads running under them can explain unexpected CPU utilization measurements.

The CPU utilization reported through the table functions and event monitors reflects only the CPU resources consumed by work executing in the user and maintenance service classes. Work that is not handled by the dispatcher is not counted towards CPU utilization.

Work that is not handled by the workload management dispatcher includes:
  • Work performed by applications or middleware products, other than the DB2 database manager, that perform a portion of their work outside of the DB2 database manager
  • Work performed by entities executing in the DB2 system service class
  • Work performed by other DB2 instances
  • Non-DB2 database manager work performed in fenced mode processes (FMPs) such as fenced stored procedures
  • Non-DB2 database manager work performed in trusted routines

To obtain the CPU utilization for these other consumers of CPU resources, you must use operating-system-level (OS-level) monitoring, such as that provided by OS workload managers.
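As defined above, CPU utilization is the fraction of available CPU time that was busy over an interval. A short Python sketch of the arithmetic, with an assumed function name and illustrative values:

```python
# Illustrative sketch: CPU utilization as busy CPU time divided by
# available CPU time. The function name and sample values are
# assumptions for this example, not actual DB2 monitor elements.

def cpu_utilization(busy_cpu_seconds, interval_seconds, num_cpus):
    """Fraction of available CPU time that was busy on the host or LPAR."""
    return busy_cpu_seconds / (interval_seconds * num_cpus)

# Example: 96 CPU-seconds consumed over a 60-second interval on a
# 4-CPU host is a utilization of 0.4, that is, 40%.
util = cpu_utilization(96, 60, 4)
```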

CPU velocity

CPU velocity is a statistic that indicates whether there is contention for a resource and the degree of that contention. When all access to a resource is mutually exclusive and multiple requestors want to access it at the same time, there must be some form of queuing for access, or requestors must be turned away. When a queue is allowed to form, the time taken for a requestor to obtain and then finish using the resource can exceed the time spent simply using it. Velocity is the ratio of the time spent simply using the resource to the total time spent both waiting for and using the resource, measured on a scale of zero to 100%. When there is heavy contention for a resource, velocity sinks towards zero. When there is no contention, there is no queue time, and CPU velocity reaches its maximum value of 100%.

When the workload management dispatcher is enabled, you can measure CPU velocity using the MON_SAMPLE_SERVICE_CLASS_METRICS and MON_SAMPLE_WORKLOAD_METRICS table functions, the WLM_GET_WORKLOAD_STATS and WLM_GET_SERVICE_SUBCLASS_STATS table functions, and the event_scstats and event_wlstats event monitor logical data groups collected and reported by the WLM statistics event monitor. A low CPU velocity value indicates that contention exists for the CPU resources of the host or LPAR and indicates that the workload management dispatcher can be effective in shifting CPU resources towards high-priority service classes and away from low-priority service classes. A high CPU velocity indicates that the workload management dispatcher will have a limited effect on improving workload performance, since every request for CPU resources is already being serviced without any delay.
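The velocity ratio described above can be sketched in a few lines of Python; the function name and sample times are illustrative assumptions, not DB2 monitor elements:

```python
# Illustrative sketch: CPU velocity as the ratio of time spent using
# the CPU to total time spent waiting for and using it, on a 0-100%
# scale. The function name and sample values are assumptions for this
# example, not actual DB2 monitor elements.

def cpu_velocity(cpu_time, cpu_wait_time):
    """CPU velocity as a percentage from 0 to 100."""
    return 100.0 * cpu_time / (cpu_time + cpu_wait_time)

# No contention: no wait time, so velocity is at its maximum of 100%.
no_contention = cpu_velocity(50, 0)
# Heavy contention: waits dominate, so velocity sinks towards zero.
heavy_contention = cpu_velocity(10, 90)
```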