IBM Support

DB2 might choose a sub-optimal query execution plan when distribution statistics are collected on string data columns with a common prefix of more than 32 bytes.

Technote (troubleshooting)


This document applies only to the following language version(s):

English

Problem(Abstract)

The RUNSTATS utility can collect two types of distribution statistics: quantile statistics and frequent value statistics. Quantile statistics provide information about how data values are distributed in relation to other values. Frequent value statistics provide information about a column and the data value with the highest number of duplicates, the value with the second highest number of duplicates, etc.

When you collect distribution statistics on columns containing string data with a common prefix of more than 32 bytes, the statistics can be inaccurate as a result of design limitations in the RUNSTATS utility. If you compile a query with equality or range predicates on this column, this inaccuracy can result in a sub-optimal query execution plan.

Example:
In the following example, table T1 contains a column of type CHAR(40) with the following data:

C1
----------------------------------------
test.test.test.test.test.test.value1
test.test.test.test.test.test.value2
test.test.test.test.test.test.value3
test.test.test.test.test.test.value3
test.test.test.test.test.test.value3

where the first 35 bytes are the same in each value. If you collect distribution statistics on this column, you will see the following data in the SYSSTAT.COLDIST catalog view

COLVALUE VALCOUNT TYPE SEQNO
---------------------------------------- -------------------- ---- ------
'test.test.test.test.test.test.val' 5 F 1
'test.test.test.test.test.test.val' 0 Q 1
'test.test.test.test.test.test.val' 5 Q 2

and for a query referencing the column in a predicate, such as

SELECT C1 FROM T1 WHERE C1 < 'test.test.test.test.test.test.value2'

the optimizer will estimate all rows in the table will be returned, which can be confirmed by collecting an EXPLAIN of the query.

Resolving the problem

You can solve this issue by disabling the collection of distribution statistics on the columns affected, while enabling the collection of distribution statistics on all other columns. The following is one example of a RUNSTATS command that disables collection of distribution statistics on one column, C1, while enabling it for all other columns in the table:
RUNSTATS ON TABLE SCHEMA.T1 WITH DISTRIBUTION ON ALL COLUMNS AND COLUMNS(C1 NUM_FREQVALUES 0 NUM_QUANTILES 0) AND DETAILED INDEXES ALL

Related information

RUNSTATS command
Distribution statistics

Document information

More support for: DB2 for Linux, UNIX and Windows
Database Objects/Config - Database

Software version: 9.7, 9.8, 10.1, 10.5

Operating system(s): AIX, Linux, Solaris, Windows

Reference #: 1668238

Modified date: 18 November 2015