Phase three is to design jobs or sequences of jobs that
generate the cleansed data you need. You then run these jobs on the
data that was produced in the previous phase.
Designing the components that are required to build data quality
jobs with InfoSphere® QualityStage® involves one or more of the
following steps.
- Step one: Standardizing data
- Standardizing data involves preparing and conditioning
data. InfoSphere QualityStage provides
several features to assist you in standardizing data, such as stages
and reports, to help you perform one or more of the following functions
on your data:
- Implement enterprise or industry data-quality standards
- Improve addressability of data that is stored in free form
- Prepare data for all of its uses (display, matching, reporting)
- Parse free-form or fixed-format columns into single-domain data
elements to create a consistent representation of the input data
- Give each data element the same content and format
- Normalize data values to standard forms
- Standardize spelling formats and abbreviations
- Prepare data elements for more effective matching
- Perform phonetic coding (NYSIIS and SOUNDEX), which can be used
in match processing
- Review data results and statistics (Standardization Quality Assessment
[SQA] reports)
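The standardization functions listed above can be illustrated outside the product. The following is a minimal Python sketch, not QualityStage code: it normalizes a free-form address with a small abbreviation table and computes an American Soundex code of the kind used as a phonetic match key. The names `ABBREVIATIONS`, `standardize_address`, and `soundex` are hypothetical, and the abbreviation table is an assumption that stands in for a real rule set.

```python
import re

# Hypothetical abbreviation table; a real rule set is far larger and
# context-sensitive (for example, "ST" can also mean "Saint").
ABBREVIATIONS = {
    "street": "ST", "st": "ST",
    "avenue": "AVE", "ave": "AVE",
    "road": "RD", "rd": "RD",
}

def standardize_address(raw: str) -> str:
    """Strip punctuation, uppercase tokens, and apply standard abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw).split()
    return " ".join(ABBREVIATIONS.get(t.lower(), t.upper()) for t in tokens)

# Letter-to-digit mapping used by American Soundex.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}

def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate consonants with the same code
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]

print(standardize_address("123 main street, apt. 4"))
print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike, such as Robert and Rupert, receive the same code, which is why phonetic keys are useful for grouping candidate records before matching.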
- Step two: Matching data
- After the data is standardized, you are ready for matching. You
match data to identify either duplicates or cross-references to other
files. Your data cleansing assignment determines your matching strategy.
After you know what you are looking for, whether it is to match individuals,
match companies, perform householding, or reconcile inventory transactions,
you can design a matching strategy to meet these goals.
- Matching identifies all records in one source (the
input source) that correspond to similar records (such as a person,
household, address, or event) in another source (the reference source).
Matching also identifies duplicate records in one source and builds
relationships between records in multiple sources. Relationships are
defined by business rules at the data level.
- InfoSphere QualityStage provides
Match stages (and a Match Designer that provides a test environment
to produce match specifications for the Match stages) to help you
perform one or more of the following functions on your data:
- Find similar and duplicate data
- Consolidate views
- Cross-reference data to other sources
- Enrich existing data with new attributes from external sources
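To make the idea of finding similar and duplicate records concrete, here is a minimal Python sketch. It is not the algorithm that the Match stages use: as a stand-in, it groups records into blocks on one column and then compares pairs within each block with a simple string-similarity score from the standard library. The function name `find_duplicate_pairs`, the column names, the sample records, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def find_duplicate_pairs(records, block_key, compare_key, threshold=0.85):
    """Return (id_a, id_b, score) for record pairs that look like duplicates.

    Records are first grouped by block_key so that only records in the
    same block are compared, which keeps the number of comparisons small.
    """
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec)

    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            score = SequenceMatcher(None, a[compare_key], b[compare_key]).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs

# Hypothetical, already standardized sample data.
records = [
    {"id": 1, "name": "JOHN SMITH", "zip": "10001"},
    {"id": 2, "name": "JON SMITH", "zip": "10001"},
    {"id": 3, "name": "MARY JONES", "zip": "10001"},
    {"id": 4, "name": "JOHN SMITH", "zip": "94105"},
]

pairs = find_duplicate_pairs(records, block_key="zip", compare_key="name")
print(pairs)
```

Records 1 and 2 are paired because they share a block and their names are nearly identical. Record 4 has the same name as record 1 but is never compared with it because it falls into a different block, which shows why the choice of blocking column is part of the matching strategy.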
- Step three: Identifying surviving data
- After the data matching is complete, you identify which records
(or columns of a set of duplicate records) from the match data survive
and become available for formatting, loading, or reporting.
- Survivorship ensures that the best available
data survives and is correctly prepared for the target destination.
Thus, survivorship consolidates duplicate records, creating a best-of-breed
representation of the matched data, enabling organizations to cross-populate
all data sources with the best available data.
- In this step, when you have duplicate records, you must decide
whether to:
- Keep all the duplicates
- Keep only one record that contains all the information that
is in the duplicates
- InfoSphere QualityStage provides
survivorship to help you perform one or more of the following functions
on your data:
- Resolve conflicts with records that pertain to one entity
- Optionally create a cross-reference table to link all surviving
records to the legacy source
- Supply missing values in one record with values from other records
on the same entity
- Resolve conflicting data values on an entity according to your
business rules
- Enrich existing data with data from external sources
- Customize the output to meet specific organizational and technical
requirements
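The survivorship functions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's implementation: the rule used here (for each column, keep the non-empty value from the most recently updated record) is only one possible business rule, and the names `build_survivor`, `build_cross_reference`, the `updated` column, and the sample data are hypothetical.

```python
def build_survivor(group, columns, recency_key="updated"):
    """Build one surviving record from a group of matched duplicates.

    Assumed rule: for each column, take the first non-empty value found
    when the records are ordered from most to least recently updated.
    """
    # ISO-formatted dates sort correctly as plain strings.
    ordered = sorted(group, key=lambda rec: rec[recency_key], reverse=True)
    survivor = {}
    for col in columns:
        survivor[col] = next((rec[col] for rec in ordered if rec.get(col)), None)
    return survivor

def build_cross_reference(group, survivor_id):
    """Link every source record in the group to the surviving record."""
    return [(rec["id"], survivor_id) for rec in group]

# Hypothetical group of records that matching identified as one entity.
matched_group = [
    {"id": 1, "name": "JOHN SMITH", "phone": "",
     "email": "js@example.com", "updated": "2020-01-15"},
    {"id": 2, "name": "JON SMITH", "phone": "555-0100",
     "email": "", "updated": "2023-06-01"},
]

survivor = build_survivor(matched_group, ["name", "phone", "email"])
xref = build_cross_reference(matched_group, survivor_id="S1")
print(survivor)
print(xref)
```

The newer record supplies the name and phone number, and the missing email address is filled in from the older record, so the survivor carries more information than either source record alone. The cross-reference list keeps the link from each source record to the survivor.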