Phase three: Design and develop jobs

Phase three is to design jobs or sequences of jobs that generate the cleansed data you need. You then run these jobs on the data that was produced in the previous phase.

Designing the components that are required to build data quality jobs with InfoSphere® QualityStage® involves one or more of the following steps.

Step one: Standardizing data

Standardizing data involves preparing and conditioning data. InfoSphere QualityStage provides several features, such as stages and reports, that help you perform one or more of the following functions on your data:

Implement enterprise or industry data-quality standards
Improve addressability of data that is stored in a free form
Prepare data for all of its uses (display, matching, reporting)
Parse free-form or fixed-format columns into single-domain data elements to create a consistent representation of the input data
Give each data element the same content and format
Normalize data values to standard forms
Standardize spelling formats and abbreviations
Prepare data elements for more effective matching
Perform phonetic coding (NYSIIS and SOUNDEX), which can be used in match processing
Review data results and statistics (Standardization Quality Assessment [SQA] reports)
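
The normalization and phonetic coding functions listed above can be illustrated outside the product. The following is a minimal Python sketch, not the InfoSphere QualityStage implementation: the function names and the abbreviation table are hypothetical, and the phonetic code is standard American Soundex (NYSIIS is not shown).

```python
import re

# Hypothetical abbreviation table; a real rule set would be far larger.
ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "NORTH": "N"}

# Standard American Soundex letter-to-digit groups.
_SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def standardize_address(raw: str) -> str:
    """Uppercase, drop punctuation, collapse whitespace, apply abbreviations."""
    tokens = re.sub(r"[^A-Z0-9 ]", " ", raw.upper()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = re.sub(r"[^A-Z]", "", name.upper())
    if not letters:
        return ""
    encoded = letters[0]
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        previous = code  # vowels reset the previous code to ""
    return (encoded + "000")[:4]


if __name__ == "__main__":
    print(standardize_address("123 north Main Street."))  # 123 N MAIN ST
    print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Because "Robert" and "Rupert" receive the same code, a phonetic column of this kind can serve as a grouping or comparison key during matching.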

Step two: Matching data

After the data is standardized, you are ready for matching. You match data to identify either duplicates or cross-references to other files. Your data cleansing assignment determines your matching strategy. After you know what you are looking for, whether it is to match individuals, match companies, perform householding, or reconcile inventory transactions, you can design a matching strategy to meet these goals.

Matching identifies all records in one source (the input source) that correspond to similar records (for example, records for the same person, household, address, or event) in another source (the reference source). Matching also identifies duplicate records in one source and builds relationships between records in multiple sources. Relationships are defined by business rules at the data level.

InfoSphere QualityStage provides Match stages (and a Match Designer that provides a test environment to produce match specifications for the Match stages) to help you perform one or more of the following functions on your data:

Find similar and duplicate data
Consolidate views
Cross-reference data to other sources
Enrich existing data with new attributes from external sources
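
To make the idea of finding duplicates within one source concrete, here is a minimal Python sketch. It is not how the Match stages work: it substitutes simple string similarity from the standard library `difflib` module for a real match specification, and the column names (`id`, `name`, `zip`), the function name, and the threshold are all assumptions chosen for illustration. Records are first grouped by a shared key so that only records in the same group are compared.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicate_pairs(records, block_key="zip", compare_key="name",
                         threshold=0.85):
    """Return (id, id) pairs of records that look like duplicates.

    Records are grouped by block_key, and only records in the same
    group are compared, which keeps the number of comparisons small.
    """
    blocks = defaultdict(list)
    for record in records:
        blocks[record[block_key]].append(record)

    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            # Similarity ratio in [0, 1]; 1.0 means identical strings.
            score = SequenceMatcher(None, a[compare_key], b[compare_key]).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "JOHN SMITH", "zip": "10001"},
        {"id": 2, "name": "JON SMITH", "zip": "10001"},
        {"id": 3, "name": "MARY JONES", "zip": "10001"},
        {"id": 4, "name": "JOHN SMITH", "zip": "94105"},
    ]
    print(find_duplicate_pairs(sample))
```

In this sample, record 4 is never compared with record 1 because the two records fall into different groups. That trade-off between fewer comparisons and missed matches is one reason the grouping key should be built from standardized data.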

Step three: Identifying surviving data

After the data matching is complete, you identify which records (or columns of a set of duplicate records) from the match data survive and become available for formatting, loading, or reporting.

Survivorship ensures that the best available data survives and is correctly prepared for the target destination. It consolidates duplicate records into a best-of-breed representation of the matched data, which enables organizations to cross-populate all data sources with the best available data.

In this step, when you have duplicate records, you must decide whether to:

Keep all the duplicates
Keep only one record that contains all the information that is in the duplicates

InfoSphere QualityStage provides survivorship to help you perform one or more of the following functions on your data:

Resolve conflicts among records that pertain to one entity
Optionally create a cross-reference table to link all surviving records to the legacy source
Supply missing values in one record with values from other records on the same entity
Resolve conflicting data values on an entity according to your business rules
Enrich existing data with data from external sources
Customize the output to meet specific organizational and technical requirements
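
The survivorship functions above, in particular supplying missing values and building a cross-reference table, can be sketched in a few lines of Python. This is an illustration only and does not reflect the product's rule syntax. The rule used here (prefer the most recently updated record, and fall back to older records when a value is empty) is one assumed business rule among many, and the column names and the `survive` function are hypothetical.

```python
def survive(group, columns, order_key="updated"):
    """Consolidate one group of duplicate records into a single record.

    For each column, keep the first non-empty value found when the
    records are ranked from most to least recently updated. Also return
    a cross-reference list of (source id, surviving id) pairs.
    """
    ranked = sorted(group, key=lambda record: record[order_key], reverse=True)
    survivor = {
        column: next((r[column] for r in ranked if r.get(column)), None)
        for column in columns
    }
    surviving_id = ranked[0]["id"]
    survivor["id"] = surviving_id
    xref = [(record["id"], surviving_id) for record in group]
    return survivor, xref


if __name__ == "__main__":
    duplicates = [
        {"id": 1, "name": "JOHN SMITH", "phone": "", "updated": "2020-01-01"},
        {"id": 2, "name": "JON SMITH", "phone": "555-0100", "updated": "2019-06-01"},
    ]
    record, links = survive(duplicates, ["name", "phone"])
    print(record)
    print(links)
```

Here the name comes from the newer record, while the phone number is filled in from the older one because the newer record has no value for it.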