In phase two of the data cleansing workflow, you learn about
your source data, prepare it, and assess its quality.
Phase two of the workflow helps you:
- Identify whether the source data has the basic structure that
your target data requires
- Understand the content of the source data
- Create the input data used in the next phase
Phase two helps you begin to understand the size and complexity
of the project for creating cleansed data. If the granularity and
structure of the source data closely match your initial impression
of the structure and requirements of the target data, then data cleansing
will be less complex. The greater the difference between the two, the
more complex the project.
Most organizations think they know what data they have. But if
you analyzed your data to determine how complete it is, how much of
the information is duplicated, and what types of anomalies exist within
each data field, you might be surprised. Over time, data integrity
weakens. The contents of fields stray from their original intent.
The label might say Name, but the field might also contain a title,
a tax ID number, or a status, such as Deceased. This information is
useful, but not if you cannot locate it.
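The checks described above (completeness and duplication within a
field) can be sketched outside the product in a few lines of Python.
This is only an illustration under stated assumptions: the function
name profile_field and the sample values are hypothetical, and the
sketch does not represent how InfoSphere QualityStage performs its
analysis.

```python
from collections import Counter


def profile_field(values):
    """Summarize completeness and duplication for one field's values.

    Hypothetical helper for illustration only.
    """
    total = len(values)
    # Treat None and whitespace-only strings as blank.
    cleaned = [v.strip() for v in values if v is not None and v.strip()]
    counts = Counter(cleaned)
    # Count every occurrence of a value that appears more than once.
    duplicated = sum(c for c in counts.values() if c > 1)
    return {
        "total": total,
        "blank": total - len(cleaned),
        "distinct": len(counts),
        "duplicated": duplicated,
    }


# Made-up sample of a field labeled Name that has drifted from its intent.
names = ["Ann Lee", "", "Ann Lee", "Dr. Raj Patel", None, "Deceased"]
print(profile_field(names))
```

A profile like this shows how many values are blank or repeated, but
it does not reveal that "Deceased" is a status rather than a name.
Finding that kind of anomaly requires looking at the content itself.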
- Step one: Prepare for data cleansing
  Preparing to work in IBM® InfoSphere® QualityStage® entails:
  - Having general knowledge about the information in the source data
  - Knowing the format of the source data
  - Developing business rules, based on the data structure and
    content, for iterative use throughout the data cleansing process
- Step two: Investigate the source data
  Investigating helps you understand the quality of
  the source data and clarify the direction of succeeding phases of
  the workflow. In addition, it indicates the degree of processing you
  will need to create the cleansed data.
  Investigating data helps you:
  - Better understand the quality of the data
  - Identify problem areas, such as blanks, errors, or formatting
    issues
  - Prove or disprove any assumptions you might have about the data
  - Learn enough about the data to help you establish business rules
    at the data level
InfoSphere QualityStage provides
stages and reports to help you perform one or more of the following
functions on your data:
- Organizing
- Parsing
- Classifying
- Analyzing patterns
This process produces input data for phase three, where you build
your data cleansing jobs.
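To make the idea of analyzing patterns more concrete, the following
Python sketch reduces each value to a character-type pattern and counts
how often each pattern occurs. The mask used here (a for a letter, n
for a digit, b for a space, other characters kept as they are) is a
simplified notation chosen for this illustration. The function names
and sample values are hypothetical and are not part of the product.

```python
from collections import Counter


def char_pattern(value):
    """Map each character to a type symbol: a=letter, n=digit, b=space.

    Illustrative mask only; any other character is kept unchanged.
    """
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        elif ch == " ":
            out.append("b")
        else:
            out.append(ch)
    return "".join(out)


def pattern_frequencies(values):
    """Return (pattern, count) pairs, most frequent pattern first."""
    return Counter(char_pattern(v) for v in values).most_common()


# Made-up sample of a phone number field with inconsistent formatting.
phones = ["555-1234", "555-9876", "5551234", "555 1234", "N/A"]
for pattern, count in pattern_frequencies(phones):
    print(pattern, count)
```

Rare patterns in the resulting list point to values worth inspecting,
such as the placeholder "N/A" in this sample, and can inform the
business rules that you establish at the data level.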