Phase two: Analyze source data

In phase two of the data cleansing workflow, you learn about your source data, prepare it, and assess its quality.

Phase two helps you begin understanding the size and complexity of the project for creating cleansed data. If the granularity and structure of the source data closely match your initial impression of the structure and requirements of the target data, then data cleansing will be less complex. The greater the difference between the two, the more complex the project.

Most organizations think they know what data they have. But if you analyzed your data to determine how complete it is, how much of the information is duplicated, and what types of anomalies exist within each data field, you might be surprised. Over time, data integrity weakens. The contents of fields stray from their original intent. The label might say Name, but the field might also contain a title, a tax ID number, or a status, such as Deceased. This information is useful, but not if you cannot locate it.
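
As a rough illustration of this kind of analysis, the following Python sketch profiles a single field for completeness, duplication, and anomalies. It is a generic example, not part of InfoSphere QualityStage: the sample records, the `profile_field` function, and the anomaly rules are all hypothetical and would need to be adapted to your own data.

```python
import re
from collections import Counter

# Hypothetical sample records; the values are illustrative only.
records = [
    {"name": "Dr. Jane Smith"},
    {"name": "John Doe 123-45-6789"},
    {"name": "Mary Jones Deceased"},
    {"name": ""},
    {"name": "Dr. Jane Smith"},
]

# Assumed anomaly rules: content that does not belong in a Name field.
ANOMALY_RULES = {
    "title": re.compile(r"\b(dr|mr|mrs|ms)\b\.?", re.IGNORECASE),
    "tax_id": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "status": re.compile(r"\bdeceased\b", re.IGNORECASE),
}


def profile_field(rows, field):
    """Count populated values, repeated values, and rule matches for one field."""
    values = [(row.get(field) or "").strip() for row in rows]
    populated = [v for v in values if v]

    # Every occurrence of a value beyond the first counts as a duplicate.
    counts = Counter(populated)
    duplicates = sum(c - 1 for c in counts.values() if c > 1)

    # Number of populated values that match each anomaly rule.
    anomalies = {
        label: sum(1 for v in populated if rule.search(v))
        for label, rule in ANOMALY_RULES.items()
    }
    return {
        "total": len(values),
        "populated": len(populated),
        "duplicates": duplicates,
        "anomalies": anomalies,
    }


profile = profile_field(records, "name")
print(profile)
```

Even on five records, a profile like this shows that the Name field is not always populated and sometimes carries titles, ID numbers, or status values.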

Step one: Prepare for data cleansing
Preparing to work in IBM® InfoSphere® QualityStage® entails:
  • Having general knowledge about the information in the source data
  • Knowing the format of the source data
  • Developing business rules, based on the data structure and content, to use iteratively throughout the data cleansing process
Step two: Investigate the source data
Investigating helps you understand the quality of the source data and clarify the direction of succeeding phases of the workflow. In addition, it indicates the degree of processing you will need to create the cleansed data.
Investigating data helps you:
  • Gain a better understanding of the quality of the data
  • Identify problem areas, such as blanks, errors, or formatting issues
  • Prove or disprove any assumptions you might have about the data
  • Learn enough about the data to help you establish business rules at the data level
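
One common way to surface blanks and formatting issues is to reduce each value to a character pattern and count how often each pattern occurs. The Python sketch below illustrates the idea in generic terms. It does not use or reproduce any InfoSphere QualityStage stage or report, and the function names, the pattern symbols (`a` for a letter, `n` for a digit), and the sample phone values are assumptions made for this example.

```python
from collections import Counter


def char_pattern(value):
    """Map letters to 'a' and digits to 'n'; keep all other characters."""
    if not value.strip():
        return "<blank>"
    return "".join(
        "a" if ch.isalpha() else "n" if ch.isdigit() else ch
        for ch in value
    )


def pattern_frequencies(values):
    """Count how many values share each character pattern."""
    return Counter(char_pattern(v) for v in values)


# Hypothetical phone field: one blank, one missing separator,
# and one value with the letter O typed in place of a zero.
phones = ["555-0100", "555-0101", "5550102", "", "555-01O3"]
freq = pattern_frequencies(phones)

for pattern, count in freq.most_common():
    print(f"{pattern}: {count}")
```

The most frequent pattern usually reflects the intended format. Rare patterns point to values worth examining, and they can suggest business rules at the data level, such as "a phone number must match `nnn-nnnn`".
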
InfoSphere QualityStage provides stages and reports to help you investigate your data.
This process produces input data for phase three, where you build your data cleansing jobs.