Predefined data quality rule definitions

The predefined data quality rule definitions are available in the Published Rules folder of the Data Quality workspace.

If you upgrade to Version 9.1, you must import the predefined data quality rule definitions from the \IBM\InformationServer\Clients\Samples\Information Analyzer directory on the client tier computer before you can use them. If you are using Version 8.7, you must first import them from IBM developerWorks. For more information, see Using pre-built rule definitions with IBM® InfoSphere® Information Analyzer.

Names of the predefined data rule definitions use the following conventions:

Used as design-time accelerators, templates, and models

When viewed from the perspective of designing data rules, the predefined data rule definitions can serve several purposes: as educational examples, as accelerators to assess your data quality, as templates, or as models for development.

You can use predefined data quality rule definitions, either in InfoSphere Information Analyzer jobs or through the Data Rules stage, in the following ways:
  • To reduce the effort in identifying data quality issues in many common information domains and conditions. Some common information domains are keys, national identifiers, dates, country codes, and email addresses. Some common conditions are completeness checks, valid values, range checks, aggregated totals, and equations.

    You can use the predefined data rule definitions as they are to test or assess your data sources and to generate data rules, so that you can start a detailed data quality assessment sooner.

  • As templates. You can copy and modify the predefined data rule definitions, and customize them for your specific data conditions.
  • As reference models. They can serve as examples of specific functions or conditions in use that can guide you as you design and develop unique rules for your environment.
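
The common conditions named in the list above (completeness checks, valid values, and range checks) can be sketched in plain Python to show the logic that such rule definitions express. This sketch is not InfoSphere Information Analyzer rule syntax; the function names, the sample record, and the allowed values are hypothetical and chosen only for illustration.

```python
from datetime import date


def is_complete(value):
    """Completeness check: the value is present and not blank."""
    return value is not None and str(value).strip() != ""


def in_valid_values(value, allowed):
    """Valid values check: the value is one of an allowed set."""
    return value in allowed


def in_range(value, low, high):
    """Range check: the value lies between low and high, inclusive."""
    return value is not None and low <= value <= high


# Hypothetical record with a blank email address.
record = {
    "country_code": "US",
    "email": "  ",
    "order_date": date(2024, 3, 15),
}

results = {
    "email_complete": is_complete(record["email"]),
    "country_valid": in_valid_values(record["country_code"], {"US", "CA", "MX"}),
    "date_in_range": in_range(
        record["order_date"], date(2024, 1, 1), date(2024, 12, 31)
    ),
}
print(results)
```

In this sketch only the completeness check fails, because the email value contains nothing but whitespace.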

Deployed in data quality analysis

Data rule definitions can be deployed at different points in the process of quality validation and monitoring. These points include direct analysis of data quality, use in InfoSphere DataStage® and QualityStage®, and use in other IBM InfoSphere Information Analyzer projects.

As with all data rule definitions, the predefined data rule definitions can be:
  • Used to generate executable data rules for quality validation.
  • Copied to serve as a template for your own rule definitions.
  • Included in rule set definitions and executable rule sets to validate multiple conditions together. You can combine as many predefined rule definitions as necessary to evaluate all the fields in a record, including multiple instances of the same rule definition. Any rule set definition that you create can contain the predefined rule definitions and your own rule definitions in any combination.
  • Published for users in other projects.
  • Exported for deployment in other IBM InfoSphere Information Analyzer environments.

    For example, if you work in a development environment with test data to ensure your data rules work correctly, you might then need to export those data rules to a production environment for ongoing quality monitoring.
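
The idea of a rule set that reuses one rule definition for several fields can be illustrated with a minimal Python sketch. It models a rule definition as a reusable check and a rule as that check bound to a column. The class names, rule names, and columns are hypothetical; this is not how InfoSphere Information Analyzer stores or runs rule sets.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RuleDefinition:
    """A reusable condition that is not yet tied to any column."""
    name: str
    check: Callable[[object], bool]


@dataclass(frozen=True)
class BoundRule:
    """A rule definition bound to one column of a record."""
    definition: RuleDefinition
    column: str

    def passes(self, record):
        return self.definition.check(record.get(self.column))


# Hypothetical rule definitions.
field_exists = RuleDefinition(
    "FieldExists", lambda v: v is not None and str(v).strip() != ""
)
two_letter_code = RuleDefinition(
    "TwoLetterCode",
    lambda v: isinstance(v, str) and len(v) == 2 and v.isalpha() and v.isupper(),
)

# The same definition (FieldExists) appears twice, bound to different columns.
rule_set = [
    BoundRule(field_exists, "customer_id"),
    BoundRule(field_exists, "last_name"),
    BoundRule(two_letter_code, "country"),
]


def failed_rules(record, rules):
    """Return a label for every bound rule that the record fails."""
    return [f"{r.definition.name}({r.column})" for r in rules if not r.passes(record)]


record = {"customer_id": "C-1001", "last_name": "", "country": "us"}
failures = failed_rules(record, rule_set)
print(failures)
```

Keeping the definition separate from the column binding is what allows one definition to be applied to many fields in the same rule set.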

Example

You receive a file every day from an external source. The quality of the data source is often low, which results in problems in other information systems, such as your business reporting system. This daily file currently runs through an InfoSphere QualityStage job to standardize the file and load the output to existing data sources. You want to test the incoming data for completeness by using a set of data rule definitions, and validate the results of the standardized output.

The following figure shows the Data Rules stage, CustomerValidityCheck, in an example job. The Data Rules stage can use one data rule definition or many, depending on the number of data fields that need to be validated. Outputs from the Data Rules stage include valid data, invalid data, and specific violation details.

Figure 1. A Data Rules stage job that validates standardized data

[Figure: An InfoSphere QualityStage job in which a data set is input into a Standardize stage, and the Standardize stage is input into a Data Rules stage. The Data Rules stage has three outputs: valid data, invalid data, and violations.]

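The three outputs of the Data Rules stage in this example can be sketched as a plain Python function that splits records into valid data, invalid data, and violation details. This is an illustration of the concept only, not a DataStage or QualityStage job; the rule names, the column names, and the sample daily file are hypothetical.

```python
def is_present(value):
    """Completeness check: the value is present and not blank."""
    return value is not None and str(value).strip() != ""


# Hypothetical completeness rules, one per field to validate.
RULES = {
    "name_complete": lambda rec: is_present(rec.get("name")),
    "postal_code_complete": lambda rec: is_present(rec.get("postal_code")),
}


def validate(records, rules):
    """Split records into valid and invalid lists and collect violations."""
    valid, invalid, violations = [], [], []
    for index, rec in enumerate(records):
        failed = [name for name, check in rules.items() if not check(rec)]
        if failed:
            invalid.append(rec)
            # One violation entry per failed rule, keyed by record position.
            violations.extend({"record": index, "rule": name} for name in failed)
        else:
            valid.append(rec)
    return valid, invalid, violations


daily_file = [
    {"name": "Ana Diaz", "postal_code": "10001"},
    {"name": "", "postal_code": "94105"},
    {"name": None, "postal_code": " "},
]

valid, invalid, violations = validate(daily_file, RULES)
print(len(valid), len(invalid), violations)
```

A record that fails several rules appears once in the invalid output but produces one violation entry per failed rule.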
By taking advantage of predefined data rule definitions, you can:
  • Reduce the effort to address many common information domains and conditions
  • Provide models and publish data rule definitions for other users to work from
  • Accelerate the process of assessing, testing, and deploying data rules
  • Deploy rule definitions for ongoing quality monitoring and inflight data validation