IBM InfoSphere DataStage and InfoSphere QualityStage, Version 8.5

Using the Frequency/Weight histogram

The Match Designer histogram displays the distribution of the composite weights.

Before you begin

Test a match pass.

About this task

If you move the Current Data handle on the histogram to a new weight, it automatically scrolls the data grid to the location of the data with the selected weight. Likewise, if you reposition the selection within the data display, the Current Data handle is repositioned in the histogram.

Use the handles on the histogram in either of the following ways:
  • To display records of a certain weight, move the Current Data handle along the Weight axis. Ascending by Weight Sort moves the Current Data handle to the lowest detail weight value.
    Note: For Unduplicate Match specifications, the Current Data handle is available only when the data display is in match pair order. To display in match pair order, right-click the data display and click Group by Match Pairs.
  • To adjust the Clerical Cutoff or Match Cutoff settings, move the cutoff handle along the Weight axis. The changed cutoffs show in the Cutoff Values pane.
The following list contains some points to remember about cutoffs:
  • The clerical cutoff is a composite weight above which record pairs are considered to be possible matches. Record pairs with weights between the match and the clerical cutoff are known as clericals and typically are reviewed to determine whether they are matches or nonmatches. If you do not want to review clericals, make the match cutoff weight equal to the clerical cutoff weight.
  • Cutoff weights can be negative values. However, setting cutoff weights to negative values creates extremely inclusive sets of matched records. If you use negative values for cutoff weights, the histogram shows many values at highly negative weights, because most cases are nonmatched pairs. However, record pairs that are obvious disagreements are not a large part of the matching process, and thus, negative weights are not often shown.
  • There is another large group of values at highly positive weights for the matched cases. The cutoff values for the match run can be set by inspecting this histogram. Make the clerical review cutoff the weight where the spike in the histogram reaches near the axis. Set the other cutoff weight where the nonmatched cases start to dominate. Experiment and examine the test results as a guide for setting the cutoffs.
  • For a Reference Many-to-One Duplicate match type, there is an additional cutoff weight called a duplicate cutoff. This cutoff is optional. If you use the duplicate cutoff, set it higher than the match cutoff weight. If more than one record pair receives a composite weight that is higher than the match cutoff, these records are declared duplicates if their composite weight is equal to or greater than the duplicate cutoff.
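The cutoff rules in the preceding list can be illustrated with a short sketch. This is not QualityStage code. The function names, the sample weights, the bin width, and the choice to treat a weight that equals a cutoff as reaching that cutoff are all assumptions made for illustration.

```python
import math
from collections import Counter


def classify_pair(weight, match_cutoff, clerical_cutoff):
    """Label one record pair from its composite weight.

    Assumptions (not stated in the product documentation):
    - clerical_cutoff is less than or equal to match_cutoff
    - a weight exactly equal to a cutoff counts as reaching it
    """
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


def weight_histogram(weights, bin_width=1.0):
    """Count composite weights per bin, keyed by the lower edge of each bin."""
    return Counter(math.floor(w / bin_width) * bin_width for w in weights)


# Hypothetical composite weights, not taken from a real match pass.
weights = [-12.4, -11.8, -11.1, -3.2, 4.6, 7.9, 15.3, 15.7, 16.2]

histogram = weight_histogram(weights, bin_width=5.0)
labels = [classify_pair(w, match_cutoff=10.0, clerical_cutoff=4.0) for w in weights]

for lower_edge in sorted(histogram):
    print(f"{lower_edge:7.1f}  {'#' * histogram[lower_edge]}")
print(Counter(labels))
```

With these sample values, the nonmatched pairs cluster at the negative end and the matched pairs cluster at the positive end. Setting both cutoffs to the same weight leaves no pairs in the clerical range, which corresponds to skipping clerical review.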

Procedure

  1. Open the IBM® InfoSphere® DataStage® and QualityStage® Designer.
  2. Open the match specification and select the match pass for which you want to use the histogram.
  3. In the histogram, display the weight that you want to see by moving the Current Data handle or scrolling through the records in the data grid. The position and availability of the Current Data handle depend on the following factors:
    Unduplicate Match specifications
    The Current Data handle is visible only if you group by match pairs, not by match sets. The handle is initially positioned based on the weight of the first duplicate record.
    Reference Match specifications
    The initial position of the Current Data handle is determined by the weight of the first data record.
    Both Unduplicate and Reference Match specifications
    If you scroll through the Test Results data grid, the Current Data handle is repositioned based on the selected record.
  4. Inspect the data in the Test Results data grid to see if the matches are adequate.
  5. Repeat the two previous steps to find the most appropriate match cutoff and clerical cutoff points.
    Note: Do not allow the clerical cutoff value to exceed the match cutoff value. If the clerical cutoff value exceeds the match cutoff value, you receive an error message and must cancel or reduce the clerical cutoff value.
  6. Move the Match Cutoff and Clerical Cutoff handles to the values that you want, which are shown in the Cutoff Values pane.
  7. Test the match pass again with the new cutoff settings.
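
The constraints on cutoff values that this topic describes can be checked before a pass is tested again. The following sketch is an illustration only and is not part of the product. The function name `validate_cutoffs` and its messages are hypothetical.

```python
def validate_cutoffs(match_cutoff, clerical_cutoff, duplicate_cutoff=None):
    """Return a list of problems; an empty list means the values are consistent.

    Rules taken from this topic:
    - the clerical cutoff must not exceed the match cutoff
    - the optional duplicate cutoff is set higher than the match cutoff
    Negative cutoff values are allowed, so no sign check is made.
    """
    problems = []
    if clerical_cutoff > match_cutoff:
        problems.append("clerical cutoff exceeds match cutoff")
    if duplicate_cutoff is not None and duplicate_cutoff <= match_cutoff:
        problems.append("duplicate cutoff is not higher than match cutoff")
    return problems


# Hypothetical values for illustration.
print(validate_cutoffs(match_cutoff=10.0, clerical_cutoff=4.0))
print(validate_cutoffs(match_cutoff=4.0, clerical_cutoff=10.0))
```

The function returns a list rather than raising an error so that all problems can be reported together.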

This topic is also in the IBM InfoSphere QualityStage User's Guide.

Last updated: 2012-9-20