Data classification

A data class is an asset that categorizes database columns and data file fields according to the type of the data and how the data is used. Data classification is the process by which IBM® InfoSphere® Information Analyzer assigns a data class to a database column during a column analysis job. You can also classify data manually in IBM InfoSphere Information Governance Catalog.

Predefined data classes are automatically installed with IBM InfoSphere Information Analyzer. These data classes can be displayed and used by IBM InfoSphere Information Governance Catalog. In addition to predefined data classes, you can create data classes in InfoSphere Information Analyzer and in InfoSphere Information Governance Catalog.

Note: Data classes that you create in InfoSphere Information Analyzer must be published before they can be displayed in InfoSphere Information Governance Catalog. Data classes that are detected or selected, but not published, are not displayed in InfoSphere Information Governance Catalog.

Use data classes to organize database columns and data file fields for review and subsequent column analysis work. For example, database columns with numeric data typically include numbers within a range of valid values.

The following table compares the actions that can be done in InfoSphere Information Analyzer and in InfoSphere Information Governance Catalog.

| Action | InfoSphere Information Analyzer | InfoSphere Information Governance Catalog |
|---|---|---|
| Create a data class | No | Yes, but not of type JAVA |
| Edit a data class | No | Yes |
| Delete a data class | No | Yes |
| View and query a data class | No | Yes |
| Classify an asset by using a data class | Yes, for database columns | Yes, for database columns and data file fields |
| Assign data class to collection | No | Yes |
| Assign terms, information governance rules, stewards, custom attributes, and labels to a data class | No | Yes |
| Set data classification on an asset | Yes, for database columns | Yes, for database columns and data file fields |
| View data classification of an asset | Yes | Yes |
| Remove data classification from an asset | Yes, for selected data classifications | Yes |
| Analyze an asset according to its data classification | Yes, if the data classification is enabled | No |

Depending on the type of the data class that you want to create in InfoSphere Information Governance Catalog, you can define the following properties on its Details page:
Column Name Match
The column name filter for data classes. A column is analyzed against the data class only if the name of the column matches the filter.
The value is a regular expression.
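As a rough illustration of how a column name filter behaves, the following Python sketch applies a regular expression to a list of column names. The pattern, the column names, and the choice of a case-insensitive full match are assumptions made for this example; the regular expression dialect and matching mode that the product uses might differ.

```python
import re

# Hypothetical filter: column names that contain "phone" or "tel".
# This pattern is illustrative only and is not a predefined product value.
COLUMN_NAME_MATCH = r".*(phone|tel).*"


def columns_to_analyze(column_names, pattern=COLUMN_NAME_MATCH):
    """Return the column names that match the filter.

    Assumption: the whole name must match, ignoring case.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in column_names if regex.fullmatch(name)]


columns = ["CUSTOMER_ID", "HOME_PHONE", "TEL_NUMBER", "EMAIL"]
print(columns_to_analyze(columns))  # ['HOME_PHONE', 'TEL_NUMBER']
```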
Confidence
A value from 1 to 100 that measures the overall quality of the data in a data source and whether the data met expectations. This property is determined by InfoSphere Information Analyzer and cannot be changed in InfoSphere Information Governance Catalog.
Data type
Only data that matches the data type is used in analysis by InfoSphere Information Analyzer.
Detected
Found by InfoSphere Information Analyzer during column analysis.
Enabled
The data classification is used when you run a column analysis job in InfoSphere Information Analyzer.
Example
Text that is an example of a match.
For example, if the data class is a regular expression, you might enter something that matches the regular expression.
Maximum data length
The maximum character count of a value. The maximum data length must be equal to or greater than the minimum data length.
For example, if a value in a database column is Test and the maximum data length is 3, this value is not analyzed.
Minimum data length
The minimum character count of a value.
In the preceding example, if the minimum data length is 4, the value Test is analyzed.
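The two length properties can be read as an inclusive range check on the character count of each value. The following Python sketch reproduces the Test example under that assumption; the function name and the other bounds (1 and 10) are hypothetical and are chosen only to complete the example.

```python
def within_length_bounds(value, min_length, max_length):
    """Return True if the character count of value is inside the bounds.

    Assumption: both bounds are inclusive.
    """
    if max_length < min_length:
        # The maximum data length must be equal to or greater than the minimum.
        raise ValueError("maximum data length is less than minimum data length")
    return min_length <= len(value) <= max_length


# "Test" has 4 characters.
print(within_length_bounds("Test", 1, 3))   # False: longer than the maximum of 3
print(within_length_bounds("Test", 4, 10))  # True: meets the minimum of 4
```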
Selected
Reviewed and approved for use in column analysis by InfoSphere Information Analyzer.
State
The status of the data class. This property is determined by InfoSphere Information Analyzer and cannot be changed in InfoSphere Information Governance Catalog.
The status is one of the following values:
Candidate
Under review for approval.
Inferred
Derived by examination of the characteristics of the data.
Threshold
The percentage of data that must match the properties of the data class. The percentage is an integer value.
For example, suppose that the threshold of a data class that is assigned to a database column is set at 75. In this case, 75 percent of the data in the database column must match the data class for InfoSphere Information Analyzer to analyze data in the database column.
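The threshold arithmetic can be sketched in Python as follows. This is a simplified model and not the product's implementation: the function names are hypothetical, the comparison is assumed to be "at least the threshold", and an empty column is assumed not to meet any threshold.

```python
def meets_threshold(values, matches_class, threshold):
    """Return True if the share of matching values reaches the threshold.

    values        -- the values in the column
    matches_class -- hypothetical predicate that tests one value
    threshold     -- integer percentage, for example 75
    """
    if not values:
        return False  # assumption: an empty column never qualifies
    matched = sum(1 for value in values if matches_class(value))
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return matched * 100 >= threshold * len(values)


column = ["10", "25", "abc", "40"]  # 3 of 4 values are numeric: 75 percent
print(meets_threshold(column, str.isdigit, 75))  # True
print(meets_threshold(column, str.isdigit, 80))  # False
```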
Valid Value Reference File
A file that contains a list of valid values. It must be referenced by a valid URL, for example http://www.ibm.com:80/my/path/to/mydataclass.txt or file:///my/path/to/mydataclass.txt. The file must be available to all IBM InfoSphere Information Server engine tiers, and it must be placed in the same location on each tier.
The content of the file is a list of valid values, where each value is on a separate line. For example:
Value1
Value2
...
ValueN
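To check a reference file before you use it, a short script can read the file through its URL and build the set of valid values. The following Python sketch assumes UTF-8 encoding and ignores blank lines and surrounding whitespace; none of these details are confirmed by the product documentation. The temporary file exists only to make the example self-contained.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def load_valid_values(url, encoding="utf-8"):
    """Read a valid value reference file, one value per line, from a URL.

    Assumptions: UTF-8 text; blank lines and surrounding whitespace are ignored.
    """
    with urlopen(url) as response:  # handles http:// and file:// URLs
        text = response.read().decode(encoding)
    return {line.strip() for line in text.splitlines() if line.strip()}


# Build a throwaway reference file and address it with a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mydataclass.txt"
    path.write_text("Value1\nValue2\nValueN\n", encoding="utf-8")
    valid_values = load_valid_values(path.as_uri())

print("Value2" in valid_values)  # True
print("Other" in valid_values)   # False
```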