Tutorial: Enhancing a product rule set in the Standardization Rules Designer

In this tutorial, you will use the IBM® InfoSphere® QualityStage® Standardization Rules Designer to enhance a rule set that standardizes product data. When you standardize data, you implement data quality standards that normalize data values and prepare data for uses such as matching and reporting.

In this tutorial, you will use data from the fictional Sample Outdoor Company, which sells and distributes products to third-party retailer stores and consumers. For the last several years, the company has steadily grown into a worldwide operation, selling their line of products to retailers in nearly every part of the world.

The fictional Sample Outdoor Company recently acquired several new product lines. The company wants to integrate the data for these product lines into its current database, but the new data contains new types of information and is formatted inconsistently. The Sample Outdoor Company can use the IBM InfoSphere QualityStage Standardization Rules Designer to enhance the rule set that standardizes this type of data.

After the rule set is enhanced, the company will apply the rule set to a Standardize stage in a standardization job. When the standardization job is run, input data is standardized according to the logic that is specified in the enhanced rule set.

This tutorial guides you through some of the common tasks that you might complete when you enhance a rule set in the Standardization Rules Designer. The following steps illustrate the sequence of actions in the tutorial:
  1. In module 1, you identify a rule set that requires enhancement and open a revision for that rule set in the Standardization Rules Designer. You also import sample data to use in the Standardization Rules Designer.
  2. In module 2, you categorize the parts of the data. You add classification definitions that assign new values to existing classes and add a custom class for a new type of data. Figure 1 shows how each value in an example record for a fictional Sample Outdoor Company product can be assigned to a class.
    Figure 1. Each value in an example record is assigned to a class.
    Each value in an example record is assigned to a custom or default class.
  3. In module 3, you add a lookup table that converts alphabetic information about product colors to numeric color codes. Figure 2 shows part of the lookup table that is added in the Standardization Rules Designer.
    Figure 2. The lookup table converts an alphabetic color to a numeric color code.
    A lookup table that converts alphabetic colors to numeric color codes.
  4. In the first lesson of module 4, you modify a rule that was added in the Standardization Rules Designer. The rule is handling data incorrectly for some of the new product brands. Figure 3 shows the output with the current rule and the output after the rule is modified according to the data cleansing requirements of the fictional Sample Outdoor Company.
    Figure 3. The current rule does not meet the data cleansing requirements of the fictional Sample Outdoor company. The modified rule duplicates the product brand name in the product name.
    The current rule sends values from the input record to the Product Brand, Product Name, Size Type, Product Color, and Product Type output columns. The modified rule sends values to the same output column, but also duplicates the product brand in the product name.
  5. In the second and third lessons of module 4, you identify the most common pattern that is unhandled and add a rule to handle data that matches the pattern. Figure 4 shows how an example record for a fictional Sample Outdoor Company product is handled by the new rule.
    Figure 4. A new rule for the most common unhandled pattern uses the lookup table to convert color to color code. The rule also adds the product size values to the output column for size types.
    The rule uses a lookup table to convert product color to color code.
  6. In the fourth lesson of module 4, you create a rule that handles two distinct values that are concatenated in the input data. For example, if the input data contains the value 195cm, you can create a rule that splits the value into the values 195 and cm and places them in the appropriate output columns. Figure 5 shows how an example record for a fictional Sample Outdoor Company product is handled by this rule.
    Figure 5. The new rule splits a concatenated value into two distinct values and places them in the appropriate output columns. The rule also adds the product type values to the output column for product types.
    The rule splits concatenated value 195cm in the input data into the values 195 and cm.

Learning objectives

By completing the modules, you will learn about the concepts and tasks for enhancing rule sets:

Time required

Before you begin the tutorial, you must set up your environment. The time that is required for setup depends on your current environment.

The modules each take 20 - 60 minutes to complete.

System requirements

The following components and applications must be installed on your system.

Prerequisites

Before you begin this tutorial, you must understand data quality concepts such as standardization, classification, and rules for data cleansing. Knowledge about InfoSphere DataStage and QualityStage concepts such as jobs, stages, and reports might be helpful, but is not required.

Notices: The Sample Outdoor Company, GO Sales, any variation of the Great Outdoors name, and Planning Sample, depict fictitious business operations with sample data used to develop sample applications for IBM and IBM customers. These fictitious records include sample data for sales transactions, product distribution, finance, and human resources. Any resemblance to actual names, addresses, contact numbers, or transaction values, is coincidental. Other sample files may contain fictional data manually or machine generated, factual data compiled from academic or public sources, or data used with permission of the copyright holder, for use as sample data to develop sample applications. Product names referenced may be the trademarks of their respective owners. Unauthorized duplication is prohibited.