When the query compiler optimizes query plans, its decisions
are heavily influenced by statistical information about the size of
the database tables, indexes, and statistical views. This information
is stored in system catalog tables.
The optimizer also uses information about the distribution of data
in specific columns of tables, indexes, and statistical views if these
columns are used to select rows or to join tables. The optimizer uses
this information to estimate the costs of alternative access plans
for each query.
Statistical information about the cluster ratio of indexes, the
number of leaf pages in indexes, the number of table rows that overflow
their original pages, and the number of filled and empty pages in
a table can also be collected. You can use this information to decide
when to reorganize tables or indexes.
Table statistics in a partitioned database environment are collected
only for that portion of the table that resides on the database partition
on which the utility is running, or for the first database partition
in the database partition group that contains the table. Information
about statistical views is collected for all database partitions.
Statistics that are updated by the runstats utility
Catalog
statistics are updated by the runstats utility, which can be started
by issuing the RUNSTATS command, calling the ADMIN_CMD
procedure, or calling the db2Runstats API. Updates
can be initiated either manually or automatically.
For collecting catalog statistics, you can
use the task assistant available in IBM® Data Studio Version
3.1 or later. Task assistants can guide you through the process of
setting options, reviewing the automatically generated commands to
perform the task, and running these commands. For more details, see Administering databases with task assistants.
Statistics about declared temporary tables are not stored
in the system catalog, but are stored in memory structures that represent
the catalog information for declared temporary tables. It is possible
(and in some cases, it might be useful) to perform runstats on a declared
temporary table.
The runstats utility collects the following
information about tables and indexes:
- The number of pages that contain rows
- The number of pages that are in use
- The number of rows in the table (the cardinality)
- The number of rows that overflow
- For multidimensional clustering (MDC) tables, the number of blocks
that contain data
- For partitioned tables, the degree of data clustering within a
single data partition
- Data distribution statistics, which are used by the optimizer
to estimate efficient access plans for tables and statistical views
whose data is not evenly distributed and whose columns have a significant
number of duplicate values
- Detailed index statistics, which are used by the optimizer to
determine how efficient it is to access table data through an index
- Subelement statistics for LIKE predicates, especially those that
search for patterns within strings (for example, LIKE %disk%),
are also used by the optimizer
The runstats utility collects the following statistics
for each data partition in a table. These statistics are only used
for determining whether a partition needs to be reorganized:
- The number of pages that contain rows
- The number of pages that are in use
- The number of rows in the table (the cardinality)
- The number of rows that overflow
- For MDC tables, the number of blocks that contain data
Distribution statistics are not collected:
- When the num_freqvalues and num_quantiles database
configuration parameters are set to 0
- When the distribution of data is known, such as when each data
value is unique
- When the column contains a LONG, LOB, or structured
data type
- For row types in sub-tables (the table-level statistics NPAGES,
FPAGES, and OVERFLOW are not collected)
- If quantile distributions are requested, but there is only one
non-null value in the column
- For extended indexes or declared temporary tables
The runstats utility collects the following information
about each column in a table or statistical view, and the first column
in an index key:
- The cardinality of the column
- The average length of the column (the average space, in bytes,
that is required when the column is stored in database memory or in
a temporary table)
- The second highest value in the column
- The second lowest value in the column
- The number of null values in the column
For columns that contain large object (LOB)
or LONG data types, the runstats utility collects only the average
length of the column and the number of null values in the column.
The average length of the column represents the length of the data
descriptor, except when LOB data is located inline on the data page.
The average amount of space that is required to store the column on
disk might be different than the value of this statistic.
The
runstats utility collects the following information about each XML
column:
- The number of NULL XML documents
- The number of non-NULL XML documents
- The number of distinct paths
- The sum of the node count for each distinct path
- The sum of the document count for each distinct path
- The k pairs of (path, node count) with the
largest node count
- The k pairs of (path, document count) with
the largest document count
- The k triples of (path, value, node count)
with the largest node count
- The k triples of (path, value, document count)
with the largest document count
- For each distinct path that leads to a text or attribute value:
- The number of distinct values that this path can take
- The highest value
- The lowest value
- The number of text or attribute nodes
- The number of documents that contain the text or attribute nodes
Each row in an XML column stores an XML document. The node count
for a path or path-value pair refers to the number of nodes that are
reachable by that path or path-value pair. The document count for
a path or path-value pair refers to the number of documents that contain
that path or path-value pair.
For DB2® V9.7
Fix Pack 1 and later releases, the following apply to the collection
of distribution statistics on an XML column:
- Distribution statistics are collected for each index over XML
data specified on an XML column.
- The runstats utility must collect both distribution statistics
and table statistics to collect distribution statistics for an index
over XML data. Table statistics must be gathered in order for distribution
statistics to be collected since XML distribution statistics are stored
with table statistics.
Collecting only index statistics, or collecting
index statistics during index creation, will not collect distribution
statistics for an index over XML data.
As the default, the runstats
utility collects a maximum of 250 quantiles for distribution statistics
for each index over XML data. The maximum number of quantiles for
a column can be specified when executing the runstats utility.
- Distribution statistics are collected for indexes over XML data
of type VARCHAR, DOUBLE, TIMESTAMP, and DATE. XML distribution statistics
are not collected for indexes over XML data of type VARCHAR HASHED.
- XML distribution statistics are collected when automatic table
runstats operations are performed.
- XML distribution statistics are not created when loading data
with the STATISTICS option.
- XML distribution statistics are not collected for partitioned
indexes over XML data defined on a partitioned table.
The runstats utility collects the
following information about column groups:
- A timestamp-based name for the column group
- The cardinality of the column group
The runstats utility collects the following information
about indexes:
- The number of index entries (the index cardinality)
- The number of leaf pages
- The number of index levels
- The degree of clustering of the table data to the index
- The degree of clustering of the index keys with regard to data
partitions
- The ratio of leaf pages located on disk in index key order to
the number of pages in the range of pages occupied by the index
- The number of distinct values in the first column of the index
- The number of distinct values in the first two, three, and four
columns of the index
- The number of distinct values in all columns of the index
- The number of leaf pages located on disk in index key order, with
few or no large gaps between them
- The average leaf key size, without include columns
- The average leaf key size, with include columns
- The number of pages on which all record identifiers (RIDs) are
marked deleted
- The number of RIDs that are marked deleted on pages where not
all RIDs are marked deleted
If you request detailed index statistics, additional information
about the degree of clustering of the table data to the index, and
the page fetch estimates for different buffer sizes, is collected.
For
a partitioned index, these statistics are representative of a single
index partition, with the exception of the distinct values in the
first column of the index; the first two, three, and four columns
of the index; and in all columns of the index. Per-index partition
statistics are also collected for the purpose of determining whether
an index partition needs to be reorganized.
Statistics
collection invalidates cached dynamic statements that reference tables
for which statistics have been collected. This is done so that cached
dynamic statements can be re-optimized with the latest statistics.