Phase three: Design and develop jobs

In phase three, you design jobs, or sequences of jobs, that generate the cleansed data that you need. You then run these jobs on the data that was produced in the previous phase.

Designing the components that are required to build data quality jobs with InfoSphere® QualityStage® involves one or more of the following steps.
Step one: Standardizing data
Standardizing data involves preparing and conditioning data. InfoSphere QualityStage provides several features, such as stages and reports, that help you perform one or more of the following functions on your data:
  • Implement enterprise or industry data-quality standards
  • Improve addressability of data that is stored in a free form
  • Prepare data for all of its uses (display, matching, reporting)
  • Parse free-form or fixed-format columns into single-domain data elements to create a consistent representation of the input data
  • Ensure that each data element has the same content and format
  • Normalize data values to standard forms
  • Standardize spelling formats and abbreviations
  • Prepare data elements for more effective matching
  • Perform phonetic coding (NYSIIS and SOUNDEX), which can be used in match processing
  • Review data results and statistics (Standardization Quality Assessment [SQA] reports)
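To make the parsing, normalizing, and phonetic coding functions concrete, the following Python sketch shows the general idea outside of the product. It is not InfoSphere QualityStage code and does not use any QualityStage API. The function names (`standardize_address`, `soundex`), the abbreviation table, and the three output elements are assumptions chosen for illustration, and the phonetic code follows the common American Soundex rules, which might differ in detail from what a Standardize stage produces.

```python
import re

# Hypothetical abbreviation table: maps spelling variants to one standard form.
STREET_TYPES = {
    "STREET": "ST", "ST": "ST",
    "AVENUE": "AVE", "AVE": "AVE",
    "ROAD": "RD", "RD": "RD",
}

def standardize_address(raw):
    """Parse a free-form address line into single-domain elements."""
    # Replace punctuation with spaces, fold case, and split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", raw).upper().split()
    house_number = tokens[0] if tokens and tokens[0].isdigit() else ""
    rest = tokens[1:] if house_number else tokens
    street_type = ""
    if rest and rest[-1] in STREET_TYPES:
        street_type = STREET_TYPES[rest[-1]]
        rest = rest[:-1]
    return {
        "house_number": house_number,
        "street_name": " ".join(rest),
        "street_type": street_type,
    }

# Soundex digit for each coded consonant; vowels, H, W, and Y have no digit.
_CODES = {}
for _digit, _letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                         "4": "L", "5": "MN", "6": "R"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(name):
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]

print(standardize_address("123  main Street."))
print(soundex("Robert"), soundex("Rupert"))
```

Both spellings in the last line receive the same phonetic code, which is why such codes are useful as comparison or grouping keys in later match processing.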
Step two: Matching data
After the data is standardized, you are ready for matching. You match data to identify either duplicates or cross-references to other files. Your data cleansing assignment determines your matching strategy. After you know what you are looking for, whether it is to match individuals, match companies, perform householding, or reconcile inventory transactions, you can design a matching strategy to meet these goals.
Matching identifies all records in one source (the input source) that correspond to similar records (such as a person, household, address, or event) in another source (the reference source). Matching also identifies duplicate records in one source and builds relationships between records in multiple sources. Relationships are defined by business rules at the data level.
InfoSphere QualityStage provides Match stages (and a Match Designer that provides a test environment to produce match specifications for the Match stages) to help you perform one or more of the following functions on your data:
  • Find similar and duplicate data
  • Consolidate views
  • Cross-reference data to other sources
  • Enrich existing data with new attributes from external sources
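As a rough illustration of finding similar and duplicate data, the following Python sketch groups standardized records by a shared key and then scores each pair within a group. It is a simplified stand-in, not the matching method that the Match stages or the Match Designer use. The function name `find_duplicate_pairs`, the record layout, the choice of `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold are all assumptions made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def find_duplicate_pairs(records, block_key, compare_field, threshold=0.85):
    """Return (id_a, id_b, score) for record pairs that look like duplicates.

    Records are first grouped by block_key so that only records that share
    that value are compared, which avoids comparing every record to every
    other record.
    """
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec)

    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            # ratio() is a similarity score between 0.0 and 1.0.
            score = SequenceMatcher(None, a[compare_field], b[compare_field]).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"], score))
    return pairs

# Hypothetical standardized input records.
customers = [
    {"id": 1, "name": "JOHN SMITH", "zip": "10001"},
    {"id": 2, "name": "JON SMITH", "zip": "10001"},
    {"id": 3, "name": "MARY JONES", "zip": "10001"},
    {"id": 4, "name": "JOHN SMITH", "zip": "94105"},
]

for id_a, id_b, score in find_duplicate_pairs(customers, "zip", "name"):
    print(id_a, id_b, round(score, 2))
```

In this sketch, records 1 and 4 are never compared because they fall into different groups. That shows the trade-off in choosing a grouping key: a narrow key reduces the number of comparisons but can hide true matches.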
Step three: Identifying surviving data
After the data matching is complete, you identify which records (or columns of a set of duplicate records) from the match data survive and become available for formatting, loading, or reporting.
Survivorship ensures that the best available data survives and is correctly prepared for the target destination. Thus, survivorship consolidates duplicate records, creating a best-of-breed representation of the matched data, enabling organizations to cross-populate all data sources with the best available data.
In this step, when you have duplicate records, you must decide between these options:
  • Keep all the duplicates
  • Keep only one record that contains all the information that is in the duplicates
InfoSphere QualityStage provides survivorship to help you perform one or more of the following functions on your data:
  • Resolve conflicts with records that pertain to one entity
  • Optionally create a cross-reference table to link all surviving records to the legacy source
  • Supply missing values in one record with values from other records on the same entity
  • Resolve conflicting data values on an entity according to your business rules
  • Enrich existing data with data from external sources
  • Customize the output to meet specific organizational and technical requirements
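The following Python sketch illustrates one possible survivorship rule: for each column, keep the value from the most recently updated record that actually has a value. It is only an example of the concept, not the rule syntax or behavior of InfoSphere QualityStage. The function name `survive`, the `updated` column, and the "most recent non-empty value wins" rule are assumptions. Your own business rules might prefer the longest value, the most frequent value, or a trusted source.

```python
def survive(group, columns, recency_field="updated"):
    """Build one surviving record from a group of matched duplicates.

    For each column, take the first non-empty value found when the records
    are ordered from most recently updated to least recently updated.
    """
    ordered = sorted(group, key=lambda r: r[recency_field], reverse=True)
    survivor = {}
    for col in columns:
        # next() falls back to None when no record has a value for the column.
        survivor[col] = next((r[col] for r in ordered if r.get(col)), None)
    return survivor

# Hypothetical group of records that matching identified as one entity.
# ISO-formatted dates sort correctly as plain strings.
matched_group = [
    {"name": "JOHN SMITH", "phone": "", "email": "js@old.example", "updated": "2020-01-01"},
    {"name": "JOHN SMITH", "phone": "555-0100", "email": "", "updated": "2023-06-01"},
]

best = survive(matched_group, ["name", "phone", "email"])
print(best)
```

Here the newer record supplies the phone number and the older record fills in the email address that is missing from the newer one, so the surviving record combines information from both duplicates.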