When the optimizer chooses an access plan, it considers
the performance impact of sorting data. Sorting occurs when no index
satisfies the requested ordering of fetched rows. Sorting might also
occur when the optimizer determines that a sort is less expensive
than an index scan.
The optimizer handles sorted data in one of the following ways:
- It pipes the results of the sort when the query executes
- It lets the database manager handle the sort internally
Piped and non-piped sorts
If the final sorted
list of data can be read in a single sequential pass, the results
can be piped. Piping is quicker than non-piped ways of
communicating the results of a sort. The optimizer chooses to pipe
the results of a sort whenever possible.
Whether or not a sort
is piped, the sort time depends on a number of factors, including
the number of rows to be sorted, the key size and the row width. If
the rows to be sorted occupy more than the space that is available
in the sort heap, several sort passes are performed. A subset of the
entire set of rows is sorted during each pass. Each pass is stored
in a temporary table in the buffer pool. If there is not enough space
in the buffer pool, pages from this temporary table might be written
to disk. When all of the sort passes are complete, these sorted subsets
are merged into a single sorted set of rows. If the sort is piped,
the rows are passed directly to Relational Data Services (RDS) as
they are merged. (RDS is the DB2® component
that processes requests to access or manipulate the contents of a
database.)
Group and sort pushdown operators
In some
cases, the optimizer can choose to push down a sort or aggregation
operation to Data Management Services (DMS) from RDS. (DMS is the DB2 component that controls creating,
removing, maintaining, and accessing the tables and table data in
a database.) Pushing down these operations improves performance by
allowing DMS to pass data directly to a sort or aggregation routine.
Without this pushdown, DMS first passes this data to RDS, which then
interfaces with the sort or aggregation routines. For example, the
following query benefits from this type of optimization:
select workdept, avg(salary) as avg_dept_salary
from employee
group by workdept
Group operations in sorts
When sorting produces
the order that is required for a GROUP BY operation, the optimizer
can perform some or all of the GROUP BY aggregations while doing the
sort. This is advantageous if the number of rows in each group is
large. It is even more advantageous if doing some of the grouping
during the sort reduces or eliminates the need for the sort to spill
to disk.
Aggregation during sorting requires one or more of
the following three stages of aggregation to ensure that proper results
are returned.
- The first stage of aggregation, partial aggregation,
calculates the aggregate values until the sort heap is filled. During
partial aggregation, non-aggregated data is taken in and partial aggregates
are produced. If the sort heap is filled, the rest of the data spills
to disk, including all of the partial aggregations that have been
calculated in the current sort heap. After the sort heap is reset,
new aggregations are started.
- The second stage of aggregation, intermediate aggregation,
takes all of the spilled sort runs and aggregates further on the grouping
keys. The aggregation cannot be completed, because the grouping key
columns are a subset of the distribution key columns. Intermediate
aggregation uses existing partial aggregates to produce new partial
aggregates. This stage does not always occur. It is used for both
intrapartition and interpartition parallelism. In intrapartition parallelism,
the grouping is finished when a global grouping key is available.
In interpartition parallelism, this occurs when the grouping key is
a subset of the distribution key dividing groups across database partitions,
and thus requires redistribution to complete the aggregation. A similar
case exists in intrapartition parallelism, when each agent finishes
merging its spilled sort runs before reducing to a single agent to
complete the aggregation.
- The last stage of aggregation, final aggregation,
uses all of the partial aggregates and produces final aggregates.
This step always takes place in a GROUP BY operator. Sorting cannot
perform complete aggregation, because it cannot be guaranteed that
the sort will not split. Complete aggregation takes in non-aggregated
data and produces final aggregates. This method of aggregation is
usually used to group data that is already in the correct order.