Format analysis

Format analysis is used to validate the pattern of characters used to store a data value in selective columns (for example, telephone numbers, Social Security numbers) that have a standard general format. Format analysis is useful when the format of characters is critical to automated procedures or to the display of those data values.

Analysis technique

Each data value in a column's frequency distribution is analyzed to create a general format for that data value that represents the character pattern of that data value. Then, all of the individual general formats for the column are summarized to develop a frequency distribution of general formats for the column. You can then inspect the general formats in the summary and decide for each one if that format conforms to the format requirements of the column, or violates the format requirements of the column.

System capability

During column analysis processing, the system constructs each column's frequency distribution and then analyzes each distinct value to develop a general format for that data value. To create the general format from the distinct data value, the system converts every alphabetic character to an “a” or “A”, depending on the capitalization of the character; converts every numeric character to a “9”; and does not change spaces and special characters, as shown in the following table.

Table 1. General Format examples
Data value General format
John Smith Aaaa Aaaaa
7256-AGW 9999-AAA
PR-609-54 VB AA-999-99 AA

Using all of the individual general formats, the system develops a frequency distribution by general format, which is referred to as the summary. Each general format in the summary has its own conformance flag. Initially, the system assumes that each general format conforms to the format requirements. When you flag a general format as "violates the standard", you can then also request that the system flag all of the affected data values as invalid, as shown in the following figure.

Figure 1. An example of general formats and possible status values
Shows the Format tab.

User responsibility

You can view the format analysis when the column analysis review of columns is viewed. At the detailed column view, format analysis has its own tab for viewing results. The system presents the summary of general formats used in the column. You inspect the general formats summary to identify and flag which general formats violate the requirements of the column. If needed, you can also request that the system flag every distinct data value that uses that general format to be invalid. The process is ultimately completed when you review all of the general formats and mark the format analysis function as reviewed.

Interpreting results

Most columns do not have stringent format requirements that must be enforced. For columns that have format requirements that must be enforced, there will typically be a limited number of general formats to inspect.

When general formats are in violation of the requirements or are unexplainable, you can gain more insight by performing a drill-down of the general format in question. A list of the actual data value or values that are in that general format is displayed. This can be helpful in deciding if the general format conforms to or violates the column's requirements.

The severity of the column's requirements and the need for corrective action will usually dictate the need to flag data values with general format violations as invalid data. If you wish to initiate corrective action on the invalid data, you can begin the process for any distinct data value flagged as invalid by manually entering a new replacement value, in a conforming general format, into a transformation value field in the frequency distribution grid. You can then request the creation of a mapping table that will create a two-column reference file with the first column containing the existing data value, and the second column containing the entered replacement value. This mapping table can be exported to other products or systems for application to the data source.

Like other column properties, there is an advantage to maintaining the same format requirements for the same logical column if it appears in multiple locations.

Finally, in some cases, you might want to enforce the formats of TEXT columns (for example, names, addresses, descriptions). While the format analysis could be used to do this, it is not the most efficient method. The companion InfoSphere® QualityStage product has additional pattern analysis capabilities that can be used to analyze those types of columns more efficiently.

Decisions and actions

First, you must decide if a column requires enforcement of the format of characters in the data values.

If it does, use format analysis and decide for each general format whether it conforms to or violates the column's requirements.

Finally, you decide if the data values that have violated general format should be treated as invalid data.

Performance considerations

There are no significant system performance considerations for the format analysis function.