Creating the IBM InfoSphere QualityStage Match Designer database

The IBM® InfoSphere® QualityStage® Match Designer is a component of InfoSphere QualityStage that is used to design and test match specifications. Match specifications consist of match passes that identify duplicate entities in one or more files.

If you intend to develop match specifications, use a new or existing database that is dedicated to storing the results of match test passes. Do not use the metadata repository or the analysis database to store these results.

Database requirements

The InfoSphere Information Server installation program does not create the Match Designer results database. You create the database before or after the installation, as long as the database is configured and accessible when you use the Match Designer. You can create the database on a computer where the client or engine tier is installed or on any computer that is accessible to both of these tiers. You must configure the database to receive the type of data that is processed in the Match Designer. For example, you must configure the database to receive double-byte data if the Match Designer processes Asian data.

Note: IBM InfoSphere Information Server does not support clustered configurations of the IBM InfoSphere Information Analyzer database or the IBM InfoSphere QualityStage Match Designer database.

The following table lists the supported databases and describes the configuration requirements:

Table 1. Configuration requirements by database
Database Configuration requirements

All databases

Configure as an empty database. The Match Designer creates the table structure that it requires when it stores the match pass results in the database.

IBM DB2® database system

Configure the following settings:

  • Set Default Buffer Pool and Table Space Page Size to 32K.
  • Set Code Set to UTF-8.

Oracle database system

Configure the following settings:

  • Set Character Set for the data being processed. For example, you can use AL32UTF8.
  • Set National Character Set to UTF8 or AL16UTF16.
Note: If you use the IBM Wire Protocol driver, select the Enable N-CHAR Support option.

Microsoft SQL Server database system

No special settings are required.

Database space requirements

After you create the database, use automatic storage management to allocate database space as needed.

To estimate initial space requirements and plan for the appropriate hardware, you need to make some assumptions about the type and volume of matching that you will do. Consider the following factors:

  • Estimated number of match specifications
  • Average number of match passes per match specification
  • Average number of input data records per match specification
  • Average number of bytes per input data record
  • Average number of input frequency records per match specification

When you calculate the number of bytes in a Match Designer input record, assume that VarChar columns contain maximum width values. Also, keep in mind that most Match Designer input records will be preprocessed through an InfoSphere QualityStage Standardization Stage. This stage will add 500 bytes to a source record or 1000 bytes if the source record is double-byte.

When calculating frequency record volume, consider whether the data values for the source columns that participate in your matches typically have high or low rates of duplication. For data that is highly unique, your frequency record count will be similar to your data record count. For data that has a high rate of duplication, your frequency record count will be significantly less than your data record count.

Review the factors described in Table 2. Then, use the factors in the following formula to estimate how many megabytes of space the Match Designer results database is likely to require.

(number of match specifications x (match size + (match size x 10%) + frequency size) x replication factor) / 1,000,000
Table 2. Space requirement factors and calculations
Factor Description Calculation

Match specification count

Match specifications define and test criteria for matching data.

Approximate number of match specifications you expect to retain in your database

Match pass count

Match passes define the column-level matching criteria applied during a single pass through match specification input.

Average number of match passes per match specification (used in the Match Size calculation)

Data record count

Data records define the content and format of match specification input.

Average number of input data records per match specification (used in the Match Size calculation)

Data record length

Data records define the content and format of match specification input.

When calculating data record length, assume sizing scenarios that require the most space.

Average number of bytes per match specification input data record (used in the Match Size calculation)

Match size (in bytes)

Match size aggregates the match pass and data record count and length factors into the estimated space requirements for the data component of a single match specification.

(Match pass count) x (data record count) x (data record length)

Frequency record count

Frequency records indicate how often a particular value appears in a particular column.

For frequency record count, consider high or low rates of duplication for the data values in source columns.

Average number of frequency records per match specification

Frequency size (in bytes)

Frequency size estimates the space requirement for the frequency component of a single match specification.

(Frequency record count per match) x 360

Replication factor

Replication factors account for significant differences in space requirements of different match types:

  • One-source matches take input from one source, then group and match the data.
  • Two-source one-to-one matches compare a record on the data source to a record on the reference source.
  • Two-source many-to-one matches can match any reference source record with many data source records.

Select the replication factor for the match type you expect to run most often:

  • For one-source matches, use a factor of 1
  • For two-source one-to-one matches, use a factor of 2
  • For two-source many-to-one matches, use a factor of 5