DB2 Version 9.7 for Linux, UNIX, and Windows

Catalog statistics

When the query compiler optimizes query plans, its decisions are heavily influenced by statistical information about the size of the database tables, indexes, and statistical views. This information is stored in system catalog tables.

The optimizer also uses information about the distribution of data in specific columns of tables, indexes, and statistical views if these columns are used to select rows or to join tables. The optimizer uses this information to estimate the costs of alternative access plans for each query.

Statistical information about the cluster ratio of indexes, the number of leaf pages in indexes, the number of table rows that overflow their original pages, and the number of filled and empty pages in a table can also be collected. You can use this information to decide when to reorganize tables or indexes.

Table statistics in a partitioned database environment are collected only for that portion of the table that resides on the database partition on which the utility is running, or for the first database partition in the database partition group that contains the table. Information about statistical views is collected for all database partitions.

Statistics that are updated by the runstats utility

Catalog statistics are updated by the runstats utility, which can be started by issuing the RUNSTATS command, calling the ADMIN_CMD procedure, or calling the db2Runstats API. Updates can be initiated either manually or automatically.

For collecting catalog statistics, you can use the task assistant available in IBM® Data Studio Version 3.1 or later. Task assistants can guide you through the process of setting options, reviewing the automatically generated commands to perform the task, and running these commands. For more details, see Administering databases with task assistants.

Statistics about declared temporary tables are not stored in the system catalog, but are stored in memory structures that represent the catalog information for declared temporary tables. It is possible (and in some cases, it might be useful) to perform runstats on a declared temporary table.

The runstats utility collects the following information about tables and indexes:

The number of pages that contain rows
The number of pages that are in use
The number of rows in the table (the cardinality)
The number of rows that overflow
For multidimensional clustering (MDC) tables, the number of blocks that contain data
For partitioned tables, the degree of data clustering within a single data partition
Data distribution statistics, which are used by the optimizer to estimate efficient access plans for tables and statistical views whose data is not evenly distributed and whose columns have a significant number of duplicate values
Detailed index statistics, which are used by the optimizer to determine how efficient it is to access table data through an index
Subelement statistics for LIKE predicates, especially those that search for patterns within strings (for example, LIKE %disk%), are also used by the optimizer

The runstats utility collects the following statistics for each data partition in a table. These statistics are only used for determining whether a partition needs to be reorganized:

The number of pages that contain rows
The number of pages that are in use
The number of rows in the table (the cardinality)
The number of rows that overflow
For MDC tables, the number of blocks that contain data

Distribution statistics are not collected:

When the num_freqvalues and num_quantiles database configuration parameters are set to 0
When the distribution of data is known, such as when each data value is unique
When the column contains a LONG, LOB, or structured data type
For row types in sub-tables (the table-level statistics NPAGES, FPAGES, and OVERFLOW are not collected)
If quantile distributions are requested, but there is only one non-null value in the column
For extended indexes or declared temporary tables

The runstats utility collects the following information about each column in a table or statistical view, and the first column in an index key:

The cardinality of the column
The average length of the column (the average space, in bytes, that is required when the column is stored in database memory or in a temporary table)
The second highest value in the column
The second lowest value in the column
The number of null values in the column

For columns that contain large object (LOB) or LONG data types, the runstats utility collects only the average length of the column and the number of null values in the column. The average length of the column represents the length of the data descriptor, except when LOB data is located inline on the data page. The average amount of space that is required to store the column on disk might be different than the value of this statistic.

The runstats utility collects the following information about each XML column:

The number of NULL XML documents
The number of non-NULL XML documents
The number of distinct paths
The sum of the node count for each distinct path
The sum of the document count for each distinct path
The k pairs of (path, node count) with the largest node count
The k pairs of (path, document count) with the largest document count
The k triples of (path, value, node count) with the largest node count
The k triples of (path, value, document count) with the largest document count
For each distinct path that leads to a text or attribute value:
- The number of distinct values that this path can take
- The highest value
- The lowest value
- The number of text or attribute nodes
- The number of documents that contain the text or attribute nodes

Each row in an XML column stores an XML document. The node count for a path or path-value pair refers to the number of nodes that are reachable by that path or path-value pair. The document count for a path or path-value pair refers to the number of documents that contain that path or path-value pair.

For DB2® V9.7 Fix Pack 1 and later releases, the following apply to the collection of distribution statistics on an XML column:

Distribution statistics are collected for each index over XML data specified on an XML column.
The runstats utility must collect both distribution statistics and table statistics to collect distribution statistics for an index over XML data. Table statistics must be gathered in order for distribution statistics to be collected since XML distribution statistics are stored with table statistics.
Collecting only index statistics, or collecting index statistics during index creation, will not collect distribution statistics for an index over XML data.

As the default, the runstats utility collects a maximum of 250 quantiles for distribution statistics for each index over XML data. The maximum number of quantiles for a column can be specified when executing the runstats utility.
Distribution statistics are collected for indexes over XML data of type VARCHAR, DOUBLE, TIMESTAMP, and DATE. XML distribution statistics are not collected for indexes over XML data of type VARCHAR HASHED.
XML distribution statistics are collected when automatic table runstats operations are performed.
XML distribution statistics are not created when loading data with the STATISTICS option.
XML distribution statistics are not collected for partitioned indexes over XML data defined on a partitioned table.

The runstats utility collects the following information about column groups:

A timestamp-based name for the column group
The cardinality of the column group

The runstats utility collects the following information about indexes:

The number of index entries (the index cardinality)
The number of leaf pages
The number of index levels
The degree of clustering of the table data to the index
The degree of clustering of the index keys with regard to data partitions
The ratio of leaf pages located on disk in index key order to the number of pages in the range of pages occupied by the index
The number of distinct values in the first column of the index
The number of distinct values in the first two, three, and four columns of the index
The number of distinct values in all columns of the index
The number of leaf pages located on disk in index key order, with few or no large gaps between them
The average leaf key size, without include columns
The average leaf key size, with include columns
The number of pages on which all record identifiers (RIDs) are marked deleted
The number of RIDs that are marked deleted on pages where not all RIDs are marked deleted

If you request detailed index statistics, additional information about the degree of clustering of the table data to the index, and the page fetch estimates for different buffer sizes, is collected.

For a partitioned index, these statistics are representative of a single index partition, with the exception of the distinct values in the first column of the index; the first two, three, and four columns of the index; and in all columns of the index. Per-index partition statistics are also collected for the purpose of determining whether an index partition needs to be reorganized.

Statistics collection invalidates cached dynamic statements that reference tables for which statistics have been collected. This is done so that cached dynamic statements can be re-optimized with the latest statistics.