DB2 Version 9.7 for Linux, UNIX, and Windows

Improving the performance of SQL procedures

Overview of how DB2 compiles SQL PL and inline SQL PL

Before discussing how to improve the performance of SQL procedures, it is worth reviewing how DB2® compiles them when the CREATE PROCEDURE statement is executed.

When an SQL procedure is created, DB2 separates the SQL queries in the procedure body from the procedural logic. To maximize performance, the SQL queries are statically compiled into sections in a package. For a statically compiled query, a section consists mainly of the access plan selected by the DB2 optimizer for that query. A package is a collection of sections. For more information on packages and sections, please refer to the DB2 SQL Reference. The procedural logic is compiled into a dynamically linked library.

During the execution of a procedure, every time control flows from the procedural logic to an SQL statement, there is a "context switch" between the DLL and the DB2 engine. As of DB2 Version 8.1, SQL procedures run in "unfenced mode"; that is, they run in the same address space as the DB2 engine. Therefore, the context switch referred to here is not a full context switch at the operating system level, but rather a change of layer within DB2. Reducing the number of context switches in procedures that are invoked very often, such as procedures in an OLTP application, or that process large numbers of rows, such as procedures that perform data cleansing, can have a noticeable impact on their performance.

Whereas an SQL procedure containing SQL PL is implemented by statically compiling its individual SQL queries into sections in a package, an inline SQL PL function is implemented, as the name suggests, by inlining the body of the function into the query that uses it. Queries in SQL functions are compiled together, as if the function body were a single query. The compilation occurs every time a statement that uses the function is compiled. Unlike what happens in SQL procedures, procedural statements in SQL functions are not executed in a different layer than dataflow statements. Therefore, there is no context switch every time control flows from a procedural to a dataflow statement or vice versa.

If there are no side-effects in your logic, use an SQL function instead

Because of the difference in compilation between SQL PL in procedures and inline SQL PL in functions, it is reasonable to expect that a piece of procedural code will execute faster in a function than in a procedure if it only queries SQL data and makes no data modifications; that is, if it has no side-effects on the data in the database or external to the database.

That is only good news if all the statements that you need to execute are supported in SQL functions. SQL functions cannot contain SQL statements that modify the database, and only a subset of SQL PL is available in the inline SQL PL of functions. For example, you cannot execute CALL statements, declare cursors, or return result sets in SQL functions.

Here is an example of an SQL procedure containing SQL PL that was a good candidate for conversion to an SQL function to maximize performance:

  CREATE PROCEDURE GetPrice (IN Vendor CHAR(20),
                             IN Pid INT, OUT price DECIMAL(10,3))
  LANGUAGE SQL
  BEGIN
    IF Vendor = 'Vendor 1'
      THEN SET price = (SELECT ProdPrice 
                            FROM V1Table 
                            WHERE Id = Pid);
    ELSEIF Vendor = 'Vendor 2'
      THEN SET price = (SELECT Price FROM V2Table
                            WHERE Pid = GetPrice.Pid);
    END IF;
  END

Here is the rewritten SQL function:

  CREATE FUNCTION GetPrice (Vendor CHAR(20), Pid INT)  
  RETURNS  DECIMAL(10,3)
  LANGUAGE SQL
  BEGIN
    DECLARE price DECIMAL(10,3);
    IF Vendor = 'Vendor 1'
      THEN SET price = (SELECT ProdPrice 
                               FROM V1Table 
                               WHERE Id = Pid);
    ELSEIF Vendor = 'Vendor 2'
      THEN SET price = (SELECT Price FROM V2Table
                               WHERE Pid = GetPrice.Pid);
    END IF;
    RETURN price;
  END

Remember that invoking a function is different from invoking a procedure. To invoke the function, use the VALUES statement, or invoke the function anywhere an expression is valid, such as in a SELECT or SET statement. Any of the following are valid ways of invoking the new function:

  VALUES (GetPrice('IBM', 324))
  
  SELECT VName FROM Vendors WHERE GetPrice(Vname, Pid) < 10
  
  SET price = GetPrice(Vname, Pid)  

Avoid multiple statements in an SQL PL procedure when just one is sufficient

Although it is generally a good idea to write concise SQL, it is very easy to forget to do so in practice. For example, the following SQL statements:

  INSERT INTO tab_comp VALUES (item1, price1, qty1);
  INSERT INTO tab_comp VALUES (item2, price2, qty2);
  INSERT INTO tab_comp VALUES (item3, price3, qty3);

can be rewritten as a single statement:

  INSERT INTO tab_comp VALUES (item1, price1, qty1),
                              (item2, price2, qty2),
                              (item3, price3, qty3);

The multi-row insert requires roughly one third of the time needed to execute the three original statements. In isolation, this improvement might seem negligible, but if the code fragment is executed repeatedly, for example in a loop or in a trigger body, the improvement can be significant.

Similarly, a sequence of SET statements like:

  SET A = expr1;
  SET B = expr2;
  SET C = expr3;

can be written as a single VALUES statement:

  VALUES expr1, expr2, expr3 INTO A, B, C;

This transformation preserves the semantics of the original sequence only if there are no dependencies between any two of the statements. To illustrate this, consider:

  SET A = monthly_avg * 12;
  SET B = (A / 2) * correction_factor;

Converting the previous two statements to:

  VALUES (monthly_avg * 12, (A / 2) * correction_factor) INTO A, B;

does not preserve the original semantics because the expressions before the INTO keyword are evaluated 'in parallel'. This means that the value assigned to B is not based on the value assigned to A, which was the intended semantics of the original statements.
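If reducing the number of statements is still desirable, the dependency can be removed by substituting the first expression into the second, at the cost of repeating it. This is a sketch that assumes the expressions have no side-effects:

  VALUES (monthly_avg * 12, ((monthly_avg * 12) / 2) * correction_factor)
    INTO A, B;

Here both expressions refer only to the original variables, so evaluating them "in parallel" yields the same values as the original two SET statements.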

Reduce multiple SQL statements to a single SQL expression

Like other programming languages, the SQL language provides two types of conditional constructs: procedural (IF and CASE statements) and functional (CASE expressions). In most circumstances where either type can be used to express a computation, using one or the other is a matter of taste. However, logic written using CASE expressions is not only more compact, but also more efficient than logic written using CASE or IF statements.

Consider the following fragment of SQL PL code:

  IF (Price <= MaxPrice) THEN
    INSERT INTO tab_comp(Id, Val) VALUES(Oid, Price);
  ELSE
    INSERT INTO tab_comp(Id, Val) VALUES(Oid, MaxPrice);
  END IF;

The condition in the IF clause is only being used to decide what value is inserted in the tab_comp.Val column. To avoid the context switch between the procedural and the dataflow layers, the same logic can be expressed as a single INSERT with a CASE expression:

  INSERT INTO tab_comp(Id, Val)
         VALUES(Oid,
              CASE
                 WHEN (Price <= MaxPrice) THEN Price
                 ELSE MaxPrice
              END);

It's worth noting that CASE expressions can be used in any context where a scalar value is expected. In particular, they can be used on the right-hand side of assignments. For example:

  IF (Name IS NOT NULL) THEN
    SET ProdName = Name;
  ELSEIF (NameStr IS NOT NULL) THEN
    SET ProdName = NameStr;
  ELSE
    SET ProdName = DefaultName;
  END IF;

can be rewritten as:

  SET ProdName = (CASE
                    WHEN (Name IS NOT NULL) THEN Name
                    WHEN (NameStr IS NOT NULL) THEN NameStr
                    ELSE  DefaultName
                  END);

In fact, this particular example admits an even better solution:

  SET ProdName = COALESCE(Name, NameStr, DefaultName);

Don't underestimate the benefit of taking the time to analyze and consider rewriting your SQL. The performance benefits will pay you back many times over for the time invested in analyzing and rewriting your procedure.

Exploit the set-at-a-time semantics of SQL

Procedural constructs such as loops, assignments, and cursors allow us to express computations that would not be possible to express using SQL DML statements alone. But when we have procedural statements at our disposal, there is a risk that we could turn to them even when the computation at hand can, in fact, be expressed using just SQL DML statements. As mentioned earlier, the performance of a procedural computation can be orders of magnitude slower than the performance of an equivalent computation expressed using DML statements. Consider the following fragment of code:

  DECLARE cur1 CURSOR FOR SELECT col1, col2 FROM tab_comp;
  OPEN cur1;
  FETCH cur1 INTO v1, v2;
  WHILE SQLCODE <> 100 DO
    IF (v1 > 20) THEN
      INSERT INTO tab_sel VALUES (20, v2);
    ELSE
      INSERT INTO tab_sel VALUES (v1, v2);
    END IF;
    FETCH cur1 INTO v1, v2;
  END WHILE;

To begin with, the loop body can be improved by applying the transformation discussed in the previous section, "Reduce multiple SQL statements to a single SQL expression":

  DECLARE cur1 CURSOR FOR SELECT col1, col2 FROM tab_comp;
  OPEN cur1;
  FETCH cur1 INTO v1, v2;
  WHILE SQLCODE <> 100 DO
    INSERT INTO tab_sel VALUES (CASE
                                  WHEN v1 > 20 THEN 20
                                  ELSE v1
                                END, v2);
    FETCH cur1 INTO v1, v2;
  END WHILE;

But upon closer inspection, the whole block of code can be written as an INSERT with a sub-SELECT:

  INSERT INTO tab_sel (SELECT (CASE
                                 WHEN col1 > 20 THEN 20
                                 ELSE col1
                               END),
                               col2
                       FROM tab_comp);

In the original formulation, there was a context switch between the procedural and the dataflow layers for each row fetched by the cursor. In the last formulation, there is no context switch at all, and the optimizer has a chance to globally optimize the full computation.

On the other hand, this dramatic simplification would not have been possible if each of the INSERT statements targeted a different table, as shown below:

  DECLARE cur1 CURSOR FOR SELECT col1, col2 FROM tab_comp;
  OPEN cur1;
  FETCH cur1 INTO v1, v2;
  WHILE SQLCODE <> 100 DO
    IF (v1 > 20) THEN
      INSERT INTO tab_default VALUES (20, v2);
    ELSE
      INSERT INTO tab_sel VALUES (v1, v2);
    END IF;
    FETCH cur1 INTO v1, v2;
  END WHILE;

However, the set-at-a-time nature of SQL can also be exploited here:

  INSERT INTO tab_sel (SELECT col1, col2
                       FROM tab_comp
                       WHERE col1 <= 20);
  INSERT INTO tab_default (SELECT col1, col2
                           FROM tab_comp
                           WHERE col1 > 20);

When looking at improving the performance of existing procedural logic, any time spent in eliminating cursor loops will likely pay off.

Keep the DB2 optimizer informed

When a procedure is created, its individual SQL queries are compiled into sections in a package. The DB2 optimizer chooses an execution plan for a query based, among other things, on table statistics (for example, table sizes or the relative frequency of data values in a column) and on the indexes available at the time the query is compiled. When tables undergo significant changes, it may be a good idea to have DB2 collect statistics on those tables again. And when statistics are updated or new indexes are created, it may also be a good idea to rebind the packages associated with SQL procedures that use the tables, so that DB2 can create plans that exploit the latest statistics and indexes.

Table statistics can be updated using the RUNSTATS command. To rebind the package associated with an SQL procedure, you can use the REBIND_ROUTINE_PACKAGE built-in procedure, available since DB2 Version 8.1. For example, the following command rebinds the package for procedure MYSCHEMA.MYPROC:

  CALL SYSPROC.REBIND_ROUTINE_PACKAGE('P', 'MYSCHEMA.MYPROC', 'ANY')

where 'P' indicates that the package corresponds to a procedure and 'ANY' indicates that any of the functions and types in the SQL path are considered for function and type resolution. See the Command Reference entry for the REBIND command for more details.
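The statistics themselves can be refreshed with the RUNSTATS command before rebinding. For example (the table name here is illustrative), the following collects both distribution statistics and statistics on all indexes of a table:

  RUNSTATS ON TABLE MYSCHEMA.MYTABLE WITH DISTRIBUTION AND INDEXES ALL

After RUNSTATS completes, rebinding the procedure's package as shown above lets the optimizer pick plans based on the fresh statistics.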

Use arrays

You can use arrays to efficiently pass collections of data between applications and stored procedures and to store and manipulate transient collections of data within SQL procedures without having to use relational tables. Operators on arrays available within SQL procedures allow for the efficient storage and retrieval of data. Applications that create arrays of moderate size will experience significantly better performance than applications that create very large arrays (on the scale of multiple megabytes), because the entire array is stored in main memory. See the Related links section for additional information.
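As a minimal sketch of what this can look like in DB2 9.7 (the type, procedure, and parameter names below are illustrative, not part of any product API), an array type is created once and can then be declared, indexed, and iterated over inside a procedure:

  CREATE TYPE IntArray AS INTEGER ARRAY[100]

  CREATE PROCEDURE SumList (IN numbers IntArray, OUT total INTEGER)
  LANGUAGE SQL
  BEGIN
    DECLARE i INTEGER DEFAULT 1;
    SET total = 0;
    -- CARDINALITY returns the number of elements currently in the array
    WHILE i <= CARDINALITY(numbers) DO
      SET total = total + numbers[i];
      SET i = i + 1;
    END WHILE;
  END

Because the array lives entirely in memory, a loop like this avoids the cost of creating and querying a temporary table for small, transient collections.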