Creating the IBM InfoSphere QualityStage Match Designer database
The IBM® InfoSphere® QualityStage® Match Designer is a component of InfoSphere QualityStage that is used to design and test match specifications. Match specifications consist of match passes that identify duplicate entities in one or more files.
If you intend to develop match specifications, use a new or existing database that is dedicated to storing the results of match test passes. Do not use the metadata repository or the analysis database to store these results.
Database requirements
The InfoSphere Information Server installation program does not create the Match Designer results database. You create the database before or after the installation, as long as the database is configured and accessible when you use the Match Designer. You can create the database on a computer where the client or engine tier is installed or on any computer that is accessible to both of these tiers. You must configure the database to receive the type of data that is processed in the Match Designer. For example, you must configure the database to receive double-byte data if the Match Designer processes Asian data.
The following table lists the supported databases and describes the configuration requirements:
Database | Configuration requirements |
---|---|
All databases |
Configure as an empty database. The Match Designer creates the table structure that it requires when it stores the match pass results in the database. |
IBM DB2® database system |
Configure the following settings:
|
Oracle database system |
Configure the following settings:
Note: If you use the IBM Wire
Protocol driver, select the Enable N-CHAR Support option.
|
Microsoft SQL Server database system |
No special settings are required. |
Database space requirements
After you create the database, use automatic storage management to allocate database space as needed.
To estimate initial space requirements and plan for the appropriate hardware, you need to make some assumptions about the type and volume of matching that you will do. Consider the following factors:
- Estimated number of match specifications
- Average number of match passes per match specification
- Average number of input data records per match specification
- Average number of bytes per input data record
- Average number of input frequency records per match specification
When you calculate the number of bytes in a Match Designer input record, assume that VarChar columns contain maximum width values. Also, keep in mind that most Match Designer input records will be preprocessed through an InfoSphere QualityStage Standardization Stage. This stage will add 500 bytes to a source record or 1000 bytes if the source record is double-byte.
When calculating frequency record volume, consider whether the data values for the source columns that participate in your matches typically have high or low rates of duplication. For data that is highly unique, your frequency record count will be similar to your data record count. For data that has a high rate of duplication, your frequency record count will be significantly less than your data record count.
Review the factors described in Table 2. Then, use the factors in the following formula to estimate how many megabytes of space the Match Designer results database is likely to require.
(number of match specifications x (match
size + (match size x 10%) + frequency size) x replication factor)
/ 1,000,000
Factor | Description | Calculation |
---|---|---|
Match specification count |
Match specifications define and test criteria for matching data. |
Approximate number of match specifications you expect to retain in your database |
Match pass count |
Match passes define the column-level matching criteria applied during a single pass through match specification input. |
Average number of match passes per match specification (used in the Match Size calculation) |
Data record count |
Data records define the content and format of match specification input. |
Average number of input data records per match specification (used in the Match Size calculation) |
Data record length |
Data records define the content and format of match specification input. When calculating data record length, assume sizing scenarios that require the most space. |
Average number of bytes per match specification input data record (used in the Match Size calculation) |
Match size (in bytes) |
Match size aggregates the match pass and data record count and length factors into the estimated space requirements for the data component of a single match specification. |
(Match pass count) x (data record count) x (data record length) |
Frequency record count |
Frequency records indicate how often a particular value appears in a particular column. For frequency record count, consider high or low rates of duplication for the data values in source columns. |
Average number of frequency records per match specification |
Frequency size (in bytes) |
Frequency size estimates the space requirement for the frequency component of a single match specification. |
(Frequency record count per match) x 360 |
Replication factor |
Replication factors account for significant differences in space requirements of different match types:
|
Select the replication factor for the match type you expect to run most often:
|