IBM Support

Db2 Machine Learning Optimizer Technology Preview

News


Abstract

Machine Learning Optimizer Technology Preview in Db2 11.5.

Content

Introduction

Accurate cardinality estimates are fundamental to cost-based optimizers, such as the Db2 optimizer. Cardinality estimation is a process where the optimizer traditionally uses statistics to determine the size of intermediate query results such as the output cardinality of an operator in an access plan that applies one or more predicates. In the absence of multi-column statistics, the optimizer will default to assuming independence between multiple predicates. For example, the PROJNAME and DEPTNO columns in the PROJECT table from the SAMPLE database are not independent. If statistics are not collected on the combination of these two columns, cardinality estimates may be inaccurate for a pair of predicates such as:

WHERE PROJNAME = ‘APPLICATIONS SUPPORT’ AND DEPTNO = ‘E21’
(‘E21’ corresponds to DEPTNAME = ’SOFTWARE SUPPORT’ in the DEPARTMENT table)

A technology preview of the Db2 Machine Learning Optimizer (also referred to as ML Optimizer) is available in Db2 11.5. This unsupported version of the feature should only be used in non-production environments.

The ML Optimizer feature applies novel machine learning techniques in order for the Db2 optimizer to better estimate predicate filtering, which is key to good cardinality estimation. These techniques also eliminate the need for custom tuning of statistics collection options by the user.

When the ML Optimizer feature is enabled, a machine learning model can be built for any desired tables. Either an invocation of a specific Python script, an explicit RUNSTATS command, or the automatic statistics collection facility can be used to trigger the building of a machine learning model. The model can then be used to predict the estimated cardinalities for operators that apply multiple predicates.

Requirements for use

  • Python (Python versions and requirements are listed in the ‘Setup process’ section below)
  • Db2 11.5 for Linux on AMD64 or Intel EM64T systems (x64)
  • Db2 instance is configured to allow TCP/IP connections for local connections

Setup Process

The ML Optimizer feature requires Python 3 as well as several numerical and machine learning packages. Python and the required packages must be installed in the same location on each of your machines (if you are using Db2 with the Data Partitioning Feature on multiple physical machines). You may install Python wherever you prefer, however the instructions below assume it is being installed in the default instance owner’s home directory (/home/db2inst1). Python 3.6.3 was validated to work with the ML Optimizer feature, and is the recommended Python version to use.

Python Installation Steps

1) Download the Python-3.6.3.tgz file from https://www.python.org/ftp/python/3.6.3/ into the instance owner’s home directory, extract it using the command gtar -xzf Python-3.6.3.tgz, and change directory to the $HOME/Python-3.6.3 directory.

2) Run the following commands to build and install Python:

./configure --prefix=$HOME/python
make
make test
make install

3) Configure PIP for Python package management. Download https://bootstrap.pypa.io/get-pip.py into the instance owner’s $HOME directory and run:

$HOME/python/bin/python3 get-pip.py

4) Install the packages required by the ML Optimizer feature:

$HOME/python/bin/python3 -m pip install numpy
$HOME/python/bin/python3 -m pip install scipy
$HOME/python/bin/python3 -m pip install sklearn
$HOME/python/bin/python3 -m pip install xgboost==0.82
$HOME/python/bin/python3 -m pip install ibm_db
$HOME/python/bin/python3 -m pip install more-itertools

5) Validate that the XGBoost library can be loaded successfully on your system by running the following:

ldd $HOME/python/xgboost/libxgboost.so

Db2 Installation Steps

1) Enable the ML Optimizer feature for your Db2 instance by running the following command as the Db2 owning instance userid (I.e. db2inst1):

db2set DB2_ML_OPT=”ENABLE:YES;PYTHON_EXEC:$HOME/python/bin/python3”

After setting this registry variable, you must deactivate and re-activate your database for the new setting (and the ML Optimizer feature) to take effect.

2) Create a new db2cli.ini file that the ML Optimizer feature scripts will use to make database connections while training models. Open the following file (creating this file if it does not already exist) with your favourite editor `$HOME/python/lib/python3.6/site-packages/clidriver/cfg/db2cli.ini`, and add the following contents from one of the following two options, depending on whether IPC or TCP/IP connections are configured:

Option a): Assuming IPC connections are configured, your database name is BLU, and your instance owning userid is db2inst1:

[BLU]
Database=BLU
Instance=db2inst1
Protocol=IPC

Option b): Assuming TCP/IP connections are configured, your database name is BLU, the TCP/IP port used by your Db2 instance is 50000, your instance owning userid is db2inst1, and the db2inst1 password is “my password”:

[BLU]
Database=BLU
Protocol=TCPIP
Port=50000
uid=db2inst1
password=my password

Assuming you have followed both of these installation topics, you are now ready to use the ML Optimizer feature. See the “Usage” topic for more information.

Disabling and Turning off the Machine Learning Optimizer Feature

There are two ways to prevent the ML Optimizer feature from being used once it has been turned on: disabling it temporarily, and turning it off entirely. Once the ML Optimizer feature has been disabled or turned off, the traditional Db2 optimizer cardinality estimates are once again used for queries.

The ML Optimizer feature can be disabled immediately by running the following command:

db2set -im DB2_SELECTIVITY=”ML_PRED_SEL OFF”

Disabling the ML Optimizer feature is useful if you are planning to resume using the feature at a later time and do not want to deactivate and re-activate your database.

The ML Optimizer feature can be turned off entirely by running the following command:

db2set DB2_ML_OPT=”ENABLE:NO”

After setting this registry variable, you must deactivate and re-activate your database for the new setting to take effect. If models have already been trained and you wish to remove them, see the “Removing Models” subtopic for more information.

Usage

Training Models

When a machine learning model is trained for a table, the model and metadata are stored in the SYSTOOLS.TABLE_MODELS table, which is automatically created the first time a model is trained. The model itself is stored in the sqllib/tmp directory, with the naming convention of ‘SCHEMA_NAME.TABLE_NAME.1’. Manual modifications to the SYSTOOLS.TABLE_MODELS table is not recommended, and the models stored in sqllib/tmp should not be modified.

The only required user interaction with the Db2 ML Optimizer feature is to choose which tables should be used to train a machine learning model, and to choose which columns from those tables are typically used by query predicates from analytical queries issued against those tables.

Once you have determined which tables you think may benefit from improved cardinality estimates from the ML Optimizer feature, and selected up to ten columns from those tables commonly used in analytical query predicates, it is time to train machine learning models for those tables. Assuming your database name is MYDB, your table is in the schema MYSCHEMA, and is named MYTABLE, and the columns you have selected to train the model on are named C1, C2, and C3, you can train a model using the following command:

$HOME/python/bin/python3 $HOME/sqllib/bin/MLOptimizerEnableTable MYDB MYSCHEMA MYTABLE C1 C2 C3

Once this command completes successfully, you should see a message like the following:

Connection to MYDB established.
Successfully trained model.

After a model has been trained, it will automatically be considered during Db2 optimizer access plan selections for queries issued against that table, and so can automatically benefit from improved cardinality estimates.

Retraining/Creating New Models

In some cases, a model on a table may need to be retrained because there may have been significant changes to the data in the table. When a model is retrained, it is trained using a new representative sampling of rows through Db2 RUNSTATS.

Models are retrained using the automatic statistics collection facility or can be retrained manually. There are two ways to manually retrain a model: using the MLOptimizerEnableTable Python script (also known as the “enable script”) or using RUNSTATS. To retrain an existing model using the enable script, simply resubmit the original command used to train that model. To retrain an existing model using RUNSTATS, submit a RUNSTATS command on the associated table.

Sometimes it may be helpful to create and train a new model to capture different columns than the ones the original model was trained on (for example, if the focus for a model is to be used on a new set of queries). To create and train a new model, resubmit the original command using the enable script with the new set of columns. Note that only one model is allowed per table as mentioned in the “Restrictions/Limitations” topic, so a previously existing model on the same table will no longer be used for predictions.

For example, to create and train a new model for the table named MYTABLE using columns C2, C3, and C4, you can use the following command:

$HOME/python/bin/python3 $HOME/sqllib/bin/MLOptimizerEnableTable MYDB MYSCHEMA MYTABLE C2 C3 C4

This new model will replace the current model for MYTABLE which was trained on columns C1, C2, and C3.

Removing Models

To remove a model and its metadata entirely, it must first be removed from the SYSTOOLS.TABLE_MODELS table and then from memory. The model itself can then safely be removed from the sqllib/tmp directory. To remove the model and metadata from the SYSTOOLS.TABLE_MODELS table, a DELETE statement can be issued that specifies the table the model was trained on. To remove the model and metadata from memory, the database the model resides in needs to be re-activated. When a model no longer exists, the traditional Db2 Optimizer cardinality estimates are once again used for that table.

For example, to remove the model and metadata for the table named MYTABLE while connected to a database named MYDB, the following command can be used to remove the SYSTOOLS.TABLE_MODELS entries for that model:

DELETE FROM SYSTOOLS.TABLE_MODELS WHERE SCHEMANAME = 'MYSCHEMA' AND TABLENAME = 'MYTABLE';

Once all connections are terminated, the database can be re-activated using these commands to remove the model from memory:

DEACTIVATE DATABASE MYDB;
ACTIVATE DATABASE MYDB;

Finally, the model ‘MYSCHEMA.MYTABLE.1’ can be safely removed from the sqllib/tmp directory.

See the “Advanced Configuration Information” topic for more information on the SYSTOOLS.TABLE_MODELS table.

Troubleshooting

If any errors are encountered, there are two logs to check that may have more information about the error: the db2diag.log and the db2optstats log.

For example, if the enable script experiences an error, there will likely be a corresponding entry for that error in the db2diag.log (located in the directory specified by your DIAGPATH database manager configuration setting) that could provide additional information about the error. Any db2diag.log entries that contain “DB2 UDB, Machine Learning” as part of the FUNCTION header are entries created by the ML Optimizer feature.

If an error is encountered while using manual or automatic RUNSTATS to retrain models or for model discovery (explained in the “Advanced Configuration Information” topic), db2diag.log entries may not contain enough information about the RUNSTATS command that failed. In a case like this, the failing RUNSTATS command and related model information may be found in the db2optstats log (located in your DIAGPATH/events directory). Db2optstats log entries containing “LEARNED CARDINALITY MODEL” as part of the DISCOVER event type are entries created by the ML Optimizer feature when a discovered model is attempting to be created. If such an entry is found or a model is being retrained, there should also be entries containing “TABLE AND INDEX STATS” as part of the COLLECT event type that contain the RUNSTATS command being attempted.

If you experience any error that appears to be a problem with the ML Optimizer feature (for example, experiencing SQL1224N or SQL0901N errors, hangs, etc. only while the feature is enabled), please send as much of the following information as you can to the e-mail listed in the “Feedback” topic:

  • Exported data from the SYSTOOLS.TABLE_MODELS table. The SYSTOOLS.TABLE_MODELS table does not contain any table data; it contains information about the tables on which models are trained on (including the schema and table name, as well as the training columns), metadata about the models, and a binary representation of the models. This table can be exported using the following Db2 command:
EXPORT TO mloptTableModels.ixf OF IXF LOBS TO mloptLobs/ LOBFILE mloptLobExport MODIFIED BY LOBSINFILE SELECT * FROM SYSTOOLS.TABLE_MODELS
  • The output of db2pd -db <database name> -mlopt
  • The db2diag.log and db2optstats logs
  • db2support <output_path> -d <database name> -cl 0

Example

This example uses the SAMPLE database. An access plan of the query with default statistics on the SALES table will be compared to an access plan using a learned model.

First, statistics are collected on the SALES table:

RUNSTATS ON TABLE MYSCHEMA.SALES ON ALL COLUMNS WITH DISTRIBUTION ON ALL COLUMNS AND SAMPLED DETAILED INDEXES ALL;

Consider the following query on the SALES table:

SELECT SALES_DATE, SALES_PERSON, REGION, SALES FROM MYSCHEMA.SALES WHERE SALES_PERSON = 'LUCCHESSI' AND REGION = 'Manitoba' AND SALES = 1;

This query returns the following three records:

SALES_DATE SALES_PERSON    REGION          SALES
---------- --------------- --------------- -----------
03/30/2006 LUCCHESSI       Manitoba                  1
03/31/2006 LUCCHESSI       Manitoba                  1
04/01/2006 LUCCHESSI       Manitoba                  1
  3 record(s) selected.

The captured access plan for this query is shown below. The cardinality estimate of the base table is shown above the schema and table name. The output from the table scan operator (annotated as TBSCAN and is where the predicates are applied for this query) is shown above the operator and denotes the operator’s cardinality estimate:

      0.706722
       TBSCAN
       (   3)
       539.221
         192
         |
         41
CO-TABLE: MYSCHEMA
        SALES
         Q1

The cardinality estimate is 0.706722 after applying the predicates which doesn’t match the actual result of 3.

A model is trained for the SALES table on the three predicate columns:

$HOME/python/bin/python3 $HOME/sqllib/bin/MLOptimizerEnableTable SAMPLE MYSCHEMA SALES SALES_PERSON,REGION,SALES

When the query is executed again, the newly trained model will be used to predict the cardinality, producing the following access plan:

       3.08214
       TBSCAN
       (   3)
       539.221
         192
         |
         41
CO-TABLE: MYSCHEMA
        SALES
         Q1

Using the model results in a cardinality estimate of 3.08214 which is much closer to the actual result of 3.

Restrictions/Limitations

The ML Optimizer feature has some restrictions and limitations that are important to be aware of while using it:

  • Models can only be used on equality, range, and between predicates on individual tables and derived query blocks. This also applies to OR predicate subterms that qualify these requirements. Predicates not supported by the ML optimizer will be evaluated by the traditional Db2 optimizer. For example, the following predicates in blue are supported (join predicates, IN predicates and predicates with expressions around the columns are not supported in this technology preview):
SELECT * FROM T1, T2 WHERE
T1.C0 = T2.C0 AND
T1.C6 IN (5, 3, 205) AND
MOD(T1.C4, 10) = 1 AND
T1.C1 = ‘abc’ AND
T1.C2 BETWEEN 5 AND 10 AND
T2.C2 <= 120 AND T2.C3 = 20 AND
(T1.C4 > 5 AND T1.C5 < 20 OR T1.C4 < 2 AND T1.C5 > 100);
  • It is recommended to avoid concurrently training and using models. Doing so may cause database errors that require the database to be re-activated.
  • Models can only be trained for permanent base tables. Other table object variations including temporary tables, views, etc. are not supported.
  • Models are not isolated to the database where the model training originated. This is important to keep in mind when attempting to train/re-train models on tables with the same naming across databases.
  • The default maximum number of models that can exist in each database is 100. Retraining an existing model will not increase the count of existing models. See the “Advanced Configuration Information” topic for information on overriding this default (using DB2_ML_OPT_MAX_MODELS).
  • Only one model can be trained per table at any given time. Any subsequent training attempts on a table with an existing model will retrain that model or replace it with a new one. See the “Retraining/Creating New Models” subtopic for more information.
  • The MLOptimizerEnableTable Python script only allows up to 10 columns to be specified.
  • The MLOptimizerEnableTable Python script does not support delimited schema and table names that contain lowercase characters or spacing.
  • Commands that modify or alter a table (such as ALTER, RENAME, and DROP COLUMN commands) will not update an existing model on that table. Similarly, dropping a table will not cascade to remove a model associated with that table. When altering a table, it is recommended to train a new model (which will replace any existing models) or to remove the model entirely if the table is dropped or renamed. For further information about how to remove a model, see the “Removing Models” subtopic.
  • Model discovery is not limited to the number of training columns it can use. This means that in some cases, model discovery may be too aggressive in terms of the training columns chosen.
  • Models will only be successfully trained for tables with more than 2 rows.
  • Training may fail if a model attempting to be trained includes the leading column of an index.
  • Models cannot be trained on columns of the following types: Boolean, Graphic, Large Object, XML, Anchored, User-Defined

Advanced Configuration Information

None of the information mentioned in this topic is required for basic usage of the ML Optimizer feature. For basic usage information, see the “Setup Process” and “Usage” topics.

For those looking to explore additional capabilities or seeking more information regarding the ML Optimizer feature, this section will describe additional topics of interest.

MLOptimizerEnableTable Script

The MLOptimizerEnableTable Python script (also known as the “enable script”) is the main entry point for training a model on a table. The enable script takes as input a database, schema, and table name as well as columns from the provided table to train a model. These columns can be referred to as the “training columns”. If no training columns are provided to the enable script, up to ten columns from the provided table will be chosen automatically as the training columns.

Once the enable script verifies the inputs, it creates and interacts with a table called TABLE_MODELS to process the new model. The enable script then indirectly invokes the MLCardinalityTrain script to perform the model training. Finally, the enable script interacts with the TABLE_MODELS table once again to ensure that the model has successfully been trained. For more information about the TABLE_MODELS table, see the “SYSTOOLS.TABLE_MODELS table” subtopic.

If at any point the enable script fails to complete, the generic message below is output to the user along with a more specific message pertaining to the error encountered:

ERROR: Failed to train model. Please check for any ERROR or WARNING messages above, or in the Db2 diagnostic log file ('db2diag.log').

For more information about encountered errors, see the “Troubleshooting” subtopic.

SYSTOOLS.TABLE_MODELS table

The SYSTOOLS.TABLE_MODELS table is used to store and manage model metadata whenever a model is trained or retrained. Each entry in this table corresponds to information for a unique model that was created via the enable script or model discovery. Each entry contains the columns SCHEMANAME and TABLENAME to indicate the schema and name of the table that particular model was trained on. Other than what is mentioned in the “Removing Models” subtopic, it is not recommended to attempt to make manual modifications to the SYSTOOLS.TABLE_MODELS table as it can cause changes to the model configuration or database errors to occur.

You may notice that there are sometimes two entries per table in SYSTOOLS.TABLE_MODELS, one with VERSIONID = 1 and one with VERSIONID = 0. You can assume that entries with VERSIONID = 1 contain the active and current model metadata for those particular tables.

In some cases, it may be helpful to verify that training has successfully completed on a particular model (for example, when the output of the enable script was not kept). To verify this for a particular table, the ISTRAINED column can be queried. If the value is ‘1’, the model was successfully trained. If the value is ‘0’, the model was not successfully trained (in which case, refer to the “Troubleshooting” subtopic). For example, to verify that the model for the MYSCHEMA.MYTABLE table has successfully been trained, the following query can be used:

SELECT ISTRAINED FROM SYSTOOLS.TABLE_MODELS WHERE SCHEMANAME = 'MYSCHEMA' AND TABLENAME='MYTABLE' AND VERSIONID=1;

The enable script invokes a similar query to verify that the current model being created was successfully trained.

Model Discovery

Model discovery is a method of automatic model training that does not require invocation of the enable script. By using model discovery, the ML Optimizer feature automatically chooses training columns it thinks would best represent the “discovered” correlations present in the representative sampling of rows.

When model discovery is enabled, any manual or automatic statistics collection will automatically attempt to create a discovered model on the provided table if a model doesn’t already exist. Consequently, models created via the enable script will always take precedence over discovered models, and subsequent statistics collection will attempt to retrain existing models. Therefore, if a model exists on a particular table that you are interested in using a discovered model for, the existing model must first be removed. See the “Removing Models” subtopic for more details on removing existing models.

To allow the ML Optimizer feature to attempt to use model discovery, the following command can be used:

db2set DB2_STATISTICS=ML_DISCOVER:ON

After setting this registry variable, you must re-activate the database for the new setting to take effect. If the SYSTOOLS.TABLE_MODELS table has not yet been created (it is automatically created in the enable script), it must be created before model discovery can be used:

CREATE TABLESPACE SYSTOOLSPACE IN IBMCATGROUP MANAGED BY AUTOMATIC STORAGE EXTENTSIZE 4;
CREATE USER TEMPORARY TABLESPACE SYSTOOLSTMPSPACE IN IBMCATGROUP MANAGED BY AUTOMATIC STORAGE EXTENTSIZE 4;
CREATE TABLE SYSTOOLS.TABLE_MODELS(VERSIONID INTEGER NOT NULL,
                                   MODELID INTEGER,
                                   SCHEMANAME VARCHAR(128 OCTETS) NOT NULL,
                                   TABLENAME VARCHAR(128 OCTETS) NOT NULL,
                                   COLIDS VARCHAR(256 OCTETS),
                                   MODEL BLOB(2G),
                                   TIMEOFCREATION TIMESTAMP WITH DEFAULT CURRENT TIMESTAMP,
                                   ISTRAINED BOOLEAN WITH DEFAULT FALSE,
                                   METAMODEL BLOB(2G),
                                   PRIMARY KEY (VERSIONID, SCHEMANAME, TABLENAME) ENFORCED)
                                   IN SYSTOOLSPACE ORGANIZE BY ROW;

Note that model discovery will not occur if a statistics profile on the table is already set.

Environment variable DB2_ML_OPT_MAX_MODELS

DB2_ML_OPT_MAX_MODELS is an environment variable that is read by the enable script to override the default maximum number of models that can exist in each database (the default is set to 100). Only integers are valid to use with this environment variable.

For example, export DB2_ML_OPT_MAX_MODELS=200 will allow the maximum number of models to increase to 200.

This limit is only enforced for models trained by the enable script, and not for models trained using model discovery. However, models trained using model discovery will count towards this limit when the enable script is run.

Registry variables

The ML Optimizer feature introduces a new registry variable called DB2_ML_OPT as well as new settings for the DB2_SELECTIVITY and DB2_STATISTICS registry variables. An overview is provided below.

DB2_ML_OPT:

  • Allows the usage and configuration of model training and inference (predictions) for tables. Accepted settings for DB2_ML_OPT are of the form: `<Option> <colon> <value>` with multiple settings delimited with a semi-colon `;`. The entire registry setting should be surrounded in double quotes to ensure the shell does not interpret the `;` as the end-of-statement token.
  • Available settings:
    • ENABLE:<ON | OFF>
      • ON: Enable the creation and training of models as well as the usage of models for predictions.
      • OFF: Disable the creation, training, and usage of models (default).
    • INFERENCE_TIMEOUT_SEC:<integer>
      • The maximum number of seconds allowed for the model to spend on predicting cardinalities before timing out. If set to -1, no timeouts will occur. The default value if not specified is 10.
    • TRAINING_TIMEOUT_SEC:<integer>
      • The maximum number of seconds allowed for the model to spend on training. If set to -1, no timeouts will occur. The default value if not specified is 600.
    • SCRIPTDIR:<directory path>
      • The directory that will be searched for the ML Optimizer Python scripts (MLOptimizerEnableTable, MLCardinalityTrain, MLCardinalityPredict). The default path if not specified is `DFTDBPATH/sqllib/bin/`.
    • TMPDIR:<directory path>
      • The directory to store temporary files required for model training. The default path if not specified is `DFTDBPATH/sqllib/tmp/`.
    • PYTHON_EXEC:<Python executable path>
      • The Python executable to be used for training and inference. The default path if not specified is `/usr/bin/python`.

DB2_SELECTIVITY:

  • A new setting has been added to this registry variable to disable the use of model predictions.
  • Available setting:
    • ML_PRED_SEL OFF
      • OFF: If set, the optimizer will not use model predictions.

DB2_STATISTICS:

  • A new setting has been added to this registry variable to allow the enablement of model discovery.
  • Available setting:
    • ML_DISCOVER:<ON | OFF>
      • ON: Enable the ML Optimizer feature to attempt to automatically discover and train a model for tables during RUNSTATS. For more information, see the “Model Discovery” subtopic.
      • OFF: Disable the automatic discovery and training of models during RUNSTATS (default). Note that this only applies to tables with no model. Existing models will still continue to be re-trained via RUNSTATS according to the “Retraining/Creating New Models” subtopic.

Feedback

Although the ML Optimizer feature is not supported (since it is a Technology Preview), we would like to hear how your experience has been while using it! Please feel free to send any feedback including questions and concerns to calisto@ca.ibm.com.


Document information

More support for: DB2 for Linux, UNIX and Windows

Component: Query Optimization

Software version: 11.5

Operating system(s): Linux

Reference #: 0957375

Modified date: 10 July 2019