IBM InfoSphere DataStage and InfoSphere QualityStage, Version 8.5

Cutoff values

Match and clerical cutoffs are thresholds that determine how to categorize scored record pairs.

Your goal in setting cutoffs is to minimize uncertainty in the match results while limiting the number of false categorizations.

Record pairs with composite weights equal to or greater than the match cutoff are considered matches. Record pairs with composite weights equal to or greater than the clerical cutoff but less than the match cutoff are called clerical pairs. The matching process cannot determine whether clerical pairs are matches or nonmatches. Pairs with composite weights below the clerical cutoff are considered nonmatches. You can set both cutoffs to the same value to eliminate clerical pairs.
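The categorization rules above can be sketched in a few lines of Python. This is only an illustration of the threshold logic, not QualityStage code: the function name `categorize` and its parameters are hypothetical, and the check that the clerical cutoff does not exceed the match cutoff is an assumption of the sketch.

```python
def categorize(weight, match_cutoff, clerical_cutoff):
    """Return 'match', 'clerical', or 'nonmatch' for one composite weight.

    Hypothetical helper for illustration; not part of any product API.
    """
    # Assumption of this sketch: the clerical cutoff never exceeds the match cutoff.
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical cutoff must not exceed match cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


if __name__ == "__main__":
    # Illustrative weights and cutoffs only.
    for w in (12.0, 7.5, 3.0):
        print(w, categorize(w, match_cutoff=10.0, clerical_cutoff=5.0))
```

When both cutoffs are set to the same value, the middle branch can never be reached, so no pair is categorized as clerical.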

You can set a high cutoff threshold to limit the results to better quality matches, though possibly fewer matches. A lower threshold can produce more matches, but some of these matches might be of lesser quality. Business requirements help drive decisions. Results can vary depending on whether you take a conservative or more aggressive approach to defining the cutoff values.

For example, matching for the purpose of docking a person's pay might require a more conservative approach than deduplicating a mailing list for shopping catalogs. As a best practice, keep in mind the business purpose when you tune the match settings.

The composite weights assigned to the record pairs form a distribution of scores that ranges from large negative values to large positive values. The graph in Figure 1 focuses on the area of the histogram where the number of low-scoring pairs tails off and the number of high-scoring pairs starts to increase. In this area of the graph, it is not clear whether a pair is a match or a nonmatch.

You set the cutoff values to tell the matching process how to handle pairs in this range. Differences in the distribution of pairs help to determine the settings. The shape of the graph of matched versus nonmatched records indicates where the cutoff points belong: you typically set the cutoffs on the down slope of the nonmatched distribution and the up slope of the matched distribution. Where you set each cutoff is influenced by both the business objective and the tolerance for error.

The weights between the vertical lines form a gray area in which you cannot say whether a pair is a match. You want enough variables to push the matched and nonmatched groups further apart and so minimize the number of uncertain pairs. A good sign that your match strategy is sound is that the records left in the clerical area mostly contain blank, missing, or default values.

Figure 1. Histogram of weights
The histogram shows the number of pairs on the vertical axis and the weight of comparison on the horizontal axis. At the low end of the weight axis, nonmatched records lie to the left of the low (clerical) cutoff. At the high end, matched records lie to the right of the high (match) cutoff. Clerical records lie between the low and high cutoffs, in the gray middle area that is labeled as the clerical area.
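To inspect a distribution like the one in Figure 1 for your own scored pairs, you can bin the composite weights and count the pairs in each bin. The following is a minimal sketch; the function name `weight_histogram`, the bin width, and the sample weights are all hypothetical.

```python
import math
from collections import Counter


def weight_histogram(weights, bin_width=1.0):
    """Count pairs per weight bin; each key is the lower edge of a bin."""
    bins = Counter(math.floor(w / bin_width) * bin_width for w in weights)
    return dict(sorted(bins.items()))


if __name__ == "__main__":
    # Made-up composite weights, for illustration only.
    sample = [-3.2, -2.9, -0.4, 4.1, 4.8, 9.5]
    for lower_edge, count in weight_histogram(sample, bin_width=2.0).items():
        print(f"{lower_edge:6.1f}  {'#' * count}")
```

Bins with few pairs between the two peaks mark the range where the cutoffs are usually placed.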

The fewer records in the clerical area, the fewer the cases to review, but the greater the probability of errors.

False positives are cases in which records are classified as matches but are actually nonmatches. False negatives are cases in which records are classified as nonmatches but are actually matches.
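The trade-off between the size of the clerical area and the number of errors can be illustrated with a small tally. The sketch below assumes that you have a sample of scored pairs whose true status is already known, for example from a manual review. The function name `tally_errors` and the sample data are hypothetical.

```python
from collections import Counter

# Made-up sample: (composite weight, whether the pair is truly a match).
SAMPLE_PAIRS = [
    (15.2, True),
    (11.0, False),
    (8.4, False),
    (6.1, True),
    (2.0, True),
    (-4.5, False),
]


def tally_errors(scored_pairs, match_cutoff, clerical_cutoff):
    """Count correct results, clerical pairs, and false categorizations."""
    counts = Counter()
    for weight, is_true_match in scored_pairs:
        if weight >= match_cutoff:
            # Classified as a match: an error if the pair is truly a nonmatch.
            counts["true_match" if is_true_match else "false_positive"] += 1
        elif weight >= clerical_cutoff:
            # Left for manual review, so not counted as an error.
            counts["clerical"] += 1
        else:
            # Classified as a nonmatch: an error if the pair is truly a match.
            counts["false_negative" if is_true_match else "true_nonmatch"] += 1
    return counts


if __name__ == "__main__":
    print(tally_errors(SAMPLE_PAIRS, match_cutoff=10.0, clerical_cutoff=5.0))
    print(tally_errors(SAMPLE_PAIRS, match_cutoff=7.0, clerical_cutoff=7.0))
```

With this sample, cutoffs of 10 and 5 leave two pairs for clerical review and produce one false positive and one false negative. Setting both cutoffs to 7 removes the clerical pairs but doubles each kind of error.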

The goal of setting cutoffs is to minimize the number of clerical pairs while limiting the number of false negatives and false positives. You fine-tune the results depending on the goals of your organization.


This topic is also in the IBM InfoSphere QualityStage User's Guide.

Last updated: 2012-9-20