In a data warehouse, fact table information often changes quite dynamically, whereas dimension table data is static. This means that dimension attribute data might be positively or negatively correlated with fact table attribute data.
Traditional base table statistics currently available to the optimizer do not allow it to discern relationships across tables. Column and table distribution statistics on statistical views (and MQTs) can be used to give the optimizer the necessary information to correct these types of cardinality estimation errors.
select sum(f.sales_price), d2.year
from product d1, period d2, daily_sales f
where d1.prodkey = f.prodkey
and d2.perkey = f.perkey
and d1.item_desc = 'golf club'
and d2.month = 'JUL'
group by d2.year
A star join query execution plan can be an excellent choice for this query, provided that the optimizer can determine whether the semi-join involving PRODUCT and DAILY_SALES, or the semi-join involving PERIOD and DAILY_SALES, is the most selective. To generate an efficient star join plan, the optimizer must be able to choose the most selective semi-join for the outer leg of the index ANDing operation.
Data warehouses often contain records for products that are no longer on store shelves. This can cause the distribution of PRODUCT columns after the join to appear dramatically different than their distribution before the join. Because the optimizer, for lack of better information, will determine the selectivity of local predicates based solely on base table statistics, the optimizer might become overly optimistic regarding the selectivity of the predicate item_desc = 'golf club'
For example, if golf clubs historically represent 1% of the products manufactured, but now account for 20% of sales, the optimizer would likely overestimate the selectivity of item_desc = 'golf club', because there are no statistics describing the distribution of item_desc after the join. And if sales in all twelve months are equally likely, the selectivity of the predicate month = 'JUL' would be around 8%, and thus the error in estimating the selectivity of the predicate item_desc = 'golf club' would mistakenly cause the optimizer to perform the seemingly more selective semi-join between PRODUCT and DAILY_SALES as the outer leg of the star join plan's index ANDing operation.
The following example provides a step-by-step illustration of how to set up statistical views to solve this type of problem.
Column | storekey | store_number | city | state | district | ... |
---|---|---|---|---|---|---|
Attribute | integer |
char(2) | char(20) | char(5) | char(14) | ... |
Column | custkey | name | address | age | gender | ... |
---|---|---|---|---|---|---|
Attribute | integer |
char(30) | char(40) | smallint | char(1) | ... |
Column | prodkey | category | item_desc | price | cost | ... |
---|---|---|---|---|---|---|
Attribute | integer |
integer | char(30) | decimal(11) | decimal(11) | ... |
Column | promokey | promotype | promodesc | promovalue | ... |
---|---|---|---|---|---|
Attribute | integer |
integer | char(30) | decimal(5) | ... |
Column | perkey | calendar_date | month | period | year | ... |
---|---|---|---|---|---|---|
Attribute | integer |
date | char(3) | smallint | smallint | ... |
Column | storekey | custkey | prodkey | promokey | perkey | sales_price | ... |
---|---|---|---|---|---|---|---|
Attribute | integer | integer | integer | integer | integer | decimal(11) | ... |
Suppose the company managers want to determine whether or not consumers will buy a product again if they are offered a discount on a return visit. Moreover, suppose this study is done only for store '01', which has 18 locations nationwide. Table 7 shows information about the different categories of promotion that are available.
promotype | promodesc | COUNT (promotype) | percentage of total |
---|---|---|---|
1 | Return customers | 1 | 2.86% |
2 | Coupon | 15 | 42.86% |
3 | Advertisement | 5 | 14.29% |
4 | Manager's special | 3 | 8.57% |
5 | Overstocked items | 4 | 11.43% |
6 | End aisle display | 7 | 20.00% |
select count(*)
from store d1, promotion d2, daily_sales f
where d1.storekey = f.storekey
and d2.promokey = f.promokey
and d1.store_number = '01'
and d2.promotype = 1
6.15567e+06
IXAND
( 8)
/------------------+------------------\
2.15448e+07 2.15448e+08
NLJOIN NLJOIN
( 9) ( 13)
/---------+--------\ /---------+--------\
1 2.15448e+07 18 1.19694e+07
FETCH IXSCAN FETCH IXSCAN
( 10) ( 12) ( 14) ( 16)
/---+---\ | /---+---\ |
35 35 7.54069e+08 18 63 7.54069e+08
IXSCAN TABLE: DB2DBA INDEX: DB2DBA IXSCAN TABLE: DB2DBA INDEX: DB2DBA
( 11) PROMOTION PROMO_FK_IDX ( 15) STORE STORE_FK_IDX
| |
35 63
INDEX: DB2DBA INDEX: DB2DBA
PROMOTION_PK_IDX STOREX1
Before Join | After Join | |||
---|---|---|---|---|
Predicate | count | percentage of rows qualified | count | percentage of rows qualified |
store_number = '01' | 18 | 28.57% | 2.15448e+08 | 28.57% |
promotype = 1 | 1 | 2.86% | 2.15448e+07 | 2.86% |
Because the probability of promotype = 1 is less than that of store_number = '01', the optimizer chooses the semi-join between PROMOTION and DAILY_SALES as the outer leg of the star join plan's index ANDing operation. This leads to an estimated count of approximately 6 155 670 products sold using promotion type 1 - an incorrect cardinality estimate that is off by a factor of 2.09 (12 889 514 ÷ 6 155 670 ≈ 2.09).
What causes the optimizer to only estimate half of the actual number of records satisfying the two predicates? Store '01' represents about 28.57% of all the stores. What if other stores had more sales than store '01' (less than 28.57%)? Or what if store '01' actually sold most of the product (more than 28.57%)? Likewise, the 2.86% of products sold using promotion type 1 shown in Table 8 can be misleading. The actual percentage in DAILY_SALES could very well be a different figure than the projected one.
create view sv_store_dailysales as
(select s.*
from store s, daily_sales ds
where s.storekey = ds.storekey)
create view sv_promotion_dailysales as
(select p.*
from promotion.p, daily_sales ds
where p.promokey = ds.promokey)
alter view sv_store_dailysales enable query optimization
alter view sv_promotion_dailysales enable query optimization
runstats on table db2dba.sv_store_dailysales with distribution
runstats on table db2dba.sv_promotion_dailysales with distribution
1.04627e+07
IXAND
( 8)
/------------------+------------------\
6.99152e+07 1.12845e+08
NLJOIN NLJOIN
( 9) ( 13)
/---------+--------\ /---------+--------\
18 3.88418e+06 1 1.12845e+08
FETCH IXSCAN FETCH IXSCAN
( 10) ( 12) ( 14) ( 16)
/---+---\ | /---+---\ |
18 63 7.54069e+08 35 35 7.54069e+08
IXSCAN TABLE:DB2DBA INDEX: DB2DBA IXSCAN TABLE: DB2DBA INDEX: DB2DBA DB2DBA
( 11) STORE STORE_FK_IDX ( 15) PROMOTION PROMO_FK_IDX
| |
63 35
INDEX: DB2DBA INDEX: DB2DBA
STOREX1 PROMOTION_PK_IDX
Before Join | After Join (no statistical views) | After Join (with statistical views) | ||||
---|---|---|---|---|---|---|
Predicate | count | percentage of rows qualified | count | percentage of rows qualified | count | percentage of rows qualified |
store_number = '01' | 18 | 28.57% | 2.15448e+08 | 28.57% | 6.99152e+07 | 9.27% |
promotype = 1 | 1 | 2.86% | 2.15448e+07 | 2.86% | 1.12845e+08 | 14.96% |
Note that this time, the semi-join between STORE and DAILY_SALES is performed on the outer leg of the index ANDing plan. This is because the two statistical views essentially tell the optimizer that the predicate store_number = '01' will filter more rows than promotype = 1. This time, the optimizer estimates that there are approximately 10 462 700 products sold. This estimate is off by a factor of 1.23 (12 889 514 ÷ 10 462 700 ≈ 1.23), which is a significant improvement over the estimate without statistical views (in Table 8).