About ten metrics of system performance provide a good basic set to use in an ongoing operational monitoring effort. This starter set includes the following:
TOTAL_APP_COMMITS
This provides an excellent base-level measurement of system activity.

100 * (POOL_DATA_L_READS - POOL_DATA_P_READS) / POOL_DATA_L_READS
100 * (POOL_INDEX_L_READS - POOL_INDEX_P_READS) / POOL_INDEX_L_READS
100 * (POOL_TEMP_DATA_L_READS - POOL_TEMP_DATA_P_READS) / POOL_TEMP_DATA_L_READS
100 * (POOL_TEMP_INDEX_L_READS - POOL_TEMP_INDEX_P_READS) / POOL_TEMP_INDEX_L_READS
Buffer pool hit ratios are
one of the most fundamental metrics, and give an important overall
measure of how effectively the system is exploiting memory to avoid
disk I/O. Hit ratios of 80-85% or better for data and 90-95% or better
for indexes are generally considered good for an OLTP environment,
and of course these ratios can be calculated for individual buffer
pools using data from the buffer pool snapshot.

Although these metrics are generally useful, for systems such as data warehouses that frequently perform large table scans, data hit ratios are often irretrievably low, because data is read into the buffer pool and then not used again before being evicted to make room for other data.
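As a minimal sketch, the hit-ratio formula above can be computed as follows; the counter values are hypothetical and this is plain arithmetic, not a DB2 API:

```python
def hit_ratio(logical_reads, physical_reads):
    """Buffer pool hit ratio as a percentage: the fraction of logical
    reads satisfied from the buffer pool without physical disk I/O."""
    if logical_reads == 0:
        return None  # no activity; the ratio is undefined
    return 100.0 * (logical_reads - physical_reads) / logical_reads

# Hypothetical snapshot counters (illustrative values only)
pool_data_l_reads, pool_data_p_reads = 1_000_000, 120_000
print(hit_ratio(pool_data_l_reads, pool_data_p_reads))  # prints 88.0
```

The guard for zero logical reads matters in practice: an idle or newly activated buffer pool would otherwise cause a division-by-zero error.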
(POOL_DATA_P_READS + POOL_INDEX_P_READS +
POOL_TEMP_DATA_P_READS + POOL_TEMP_INDEX_P_READS)
/ TOTAL_APP_COMMITS
(POOL_DATA_WRITES + POOL_INDEX_WRITES)
/ TOTAL_APP_COMMITS
These metrics are closely related
to buffer pool hit ratios, but have a slightly different purpose.
Although you can consider target values for hit ratios, there are
no possible targets for reads and writes per transaction. Why bother
with these calculations? Because disk I/O is such a major factor in
database performance, it is useful to have multiple ways of looking
at it. As well, these calculations include writes, whereas hit ratios
only deal with reads. Lastly, in isolation, it is difficult to know,
for example, whether a 94% index hit ratio is worth trying to improve.
If there are only 100 logical index reads per hour, and 94 of them
are in the buffer pool, working to keep those last 6 from turning
into physical reads is not a good use of time. However, if a 94% index
hit ratio were accompanied by a statistic that each transaction did
twenty physical reads (which could be further broken down by data
and index, regular and temporary), the buffer pool hit ratios might
well deserve some investigation.The metrics are not just physical reads and writes, but are normalized per transaction. This trend is followed through many of the metrics. The purpose is to decouple metrics from the length of time data was collected, and from whether the system was very busy or less busy at that time. In general, this helps ensure that similar values for metrics are obtained, regardless of how and when monitoring data is collected. Some amount of consistency in the timing and duration of data collection is a good thing; however, normalization reduces it from being critical to being a good idea.
ROWS_READ / ROWS_RETURNED
This calculation
gives an indication of the average number of rows that are read from
database tables in order to find the rows that qualify. Low numbers
are an indication of efficiency in locating data, and generally show
that indexes are being used effectively. For example, this number
can be very high in the case where the system does many table scans,
and millions of rows need to be inspected to determine if they qualify
for the result set. On the other hand, this statistic can be very
low in the case of access to a table through a fully-qualified unique
index. Index-only access plans (where no rows need to be read from
the table) do not cause ROWS_READ to increase.

In an OLTP environment, this metric is generally no higher than 2 or 3, indicating that most access is through indexes instead of table scans. This metric is a simple way to monitor plan stability over time; an unexpected increase is often an indication that an index is no longer being used and should be investigated.
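A small sketch of this calculation, using hypothetical counter values (not a DB2 API):

```python
def read_efficiency(rows_read, rows_returned):
    """Average number of rows read from tables per row returned.
    Values near 1 suggest efficient, index-driven access; large values
    suggest scans are inspecting many rows that do not qualify."""
    if rows_returned == 0:
        return None  # nothing returned; the ratio is undefined
    return rows_read / rows_returned

# Index-driven OLTP access pattern (illustrative values only)
print(read_efficiency(2_500, 1_000))   # prints 2.5
# Scan-heavy pattern: millions of rows read to return a handful
print(read_efficiency(5_000_000, 50))  # prints 100000.0
```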
TOTAL_SORT_TIME / TOTAL_APP_COMMITS
This
is an efficient way to handle sort statistics, because any extra overhead
due to spilled sorts automatically gets included here. That said,
you might also want to collect TOTAL_SORTS and SORT_OVERFLOWS for
ease of analysis, especially if your system has a history of sorting
issues.

1000 * LOCK_WAIT_TIME / TOTAL_APP_COMMITS
Excessive
lock wait time often translates into poor response time, so it is
important to monitor. The value is normalized to one thousand transactions
because lock wait time on a single transaction is typically quite
low. Scaling up to one thousand transactions simply provides measurements
that are easier to handle.

1000 * (DEADLOCKS + LOCK_TIMEOUTS) / TOTAL_APP_COMMITS
Although
deadlocks are comparatively rare in most production systems, lock
timeouts can be more common. The application usually has to handle
them in a similar way: re-executing the transaction from the beginning.
Monitoring the rate at which this happens helps avoid the case where
many deadlocks or lock timeouts drive significant extra load on the
system without the DBA being aware.

1000 * POOL_DRTY_PG_STEAL_CLNS / TOTAL_APP_COMMITS
A "dirty
steal" is the least preferred way to trigger buffer pool cleaning.
Essentially, the processing of an SQL statement that is in need of
a new buffer pool page is interrupted while updates on the victim
page are written to disk. If dirty steals are allowed to happen frequently,
they can have a significant impact on throughput and response time.

1000 * PKG_CACHE_INSERTS / TOTAL_APP_COMMITS
Package
cache insertions are part of normal execution of the system; however,
in large numbers, they can represent a significant consumer of CPU
time. In many well-designed systems, after the system is running at
steady-state, very few package cache inserts occur, because the system
is using or reusing static SQL or previously prepared dynamic SQL
statements. In systems with a high traffic of ad hoc dynamic SQL statements,
SQL compilation and package cache inserts are unavoidable. However,
this metric is intended to watch for a third type of situation, one
in which applications unintentionally cause package cache churn by
not reusing prepared statements, or by not using parameter markers
in their frequently executed SQL.

LOG_WRITE_TIME / TOTAL_APP_COMMITS
The transaction log has significant
potential to be a system bottleneck, whether due to high levels of
activity, or to improper configuration, or other causes. By monitoring
log activity, you can detect problems both from the DB2® side (meaning an increase in number of log
requests driven by the application) and from the system side (often
due to a decrease in log subsystem performance caused by hardware
or configuration problems).

FCM_SENDS_TOTAL, FCM_RECVS_TOTAL
These give
the rate of flow of data between different partitions in the cluster,
and in particular, whether the flow is balanced. Significant differences
in the numbers of buffers received from different partitions might
indicate a skew in the amount of data that has been hashed to each
partition.

Almost all of the individual monitoring element values mentioned above are reported on a per-partition basis.
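Pulling the pieces together, one possible sketch evaluates the whole starter set from a dictionary of counter deltas taken between two monitoring snapshots. The dictionary keys follow the monitor element names used above; the function itself is illustrative and not part of DB2:

```python
def starter_set(d):
    """Evaluate the starter-set metrics from a dict of counter deltas
    taken between two monitoring snapshots (illustrative helper)."""
    commits = d["TOTAL_APP_COMMITS"]

    def hit_pct(l, p):
        # Buffer pool hit ratio as a percentage; None when no activity.
        return 100.0 * (d[l] - d[p]) / d[l] if d[l] else None

    def per_tx(n):
        # Normalize a counter by the number of committed transactions.
        return n / commits if commits else None

    return {
        "data_hit_pct": hit_pct("POOL_DATA_L_READS", "POOL_DATA_P_READS"),
        "index_hit_pct": hit_pct("POOL_INDEX_L_READS", "POOL_INDEX_P_READS"),
        "phys_reads_per_tx": per_tx(d["POOL_DATA_P_READS"]
                                    + d["POOL_INDEX_P_READS"]
                                    + d["POOL_TEMP_DATA_P_READS"]
                                    + d["POOL_TEMP_INDEX_P_READS"]),
        "writes_per_tx": per_tx(d["POOL_DATA_WRITES"]
                                + d["POOL_INDEX_WRITES"]),
        "rows_read_per_returned": (d["ROWS_READ"] / d["ROWS_RETURNED"]
                                   if d["ROWS_RETURNED"] else None),
        "sort_time_per_tx": per_tx(d["TOTAL_SORT_TIME"]),
        "lock_wait_per_1k_tx": per_tx(1000 * d["LOCK_WAIT_TIME"]),
        "lock_failures_per_1k_tx": per_tx(1000 * (d["DEADLOCKS"]
                                                  + d["LOCK_TIMEOUTS"])),
        "dirty_steals_per_1k_tx": per_tx(1000 * d["POOL_DRTY_PG_STEAL_CLNS"]),
        "pkg_cache_ins_per_1k_tx": per_tx(1000 * d["PKG_CACHE_INSERTS"]),
        "log_write_time_per_tx": per_tx(d["LOG_WRITE_TIME"]),
    }
```

Computing the whole set from one pair of snapshots keeps the metrics mutually consistent, so that, for example, a hit-ratio change can be weighed against the physical reads per transaction observed over the same interval.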