Nearest Neighbor Analysis

Nearest Neighbor Analysis is a method for classifying cases based on their similarity to other cases. In machine learning, it was developed as a way to recognize patterns of data without requiring an exact match to any stored patterns, or cases. Similar cases are near each other and dissimilar cases are distant from each other. Thus, the distance between two cases is a measure of their dissimilarity.

Cases that are near each other are said to be “neighbors.” When a new case (holdout) is presented, its distance from each of the cases in the model is computed. The classifications of the most similar cases – the nearest neighbors – are tallied and the new case is placed into the category that contains the greatest number of nearest neighbors.

You can specify the number of nearest neighbors to examine; this value is called k.

Nearest neighbor analysis can also be used to compute values for a continuous target. In this situation, the average or median target value of the nearest neighbors is used to obtain the predicted value for the new case.

Nearest Neighbor Analysis Data Considerations

Target and features. The target and features can be:

  • Nominal. A variable can be treated as nominal when its values represent categories with no intrinsic ranking (for example, the department of the company in which an employee works). Examples of nominal variables include region, postal code, and religious affiliation.
  • Ordinal. A variable can be treated as ordinal when its values represent categories with some intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly satisfied). Examples of ordinal variables include attitude scores representing degree of satisfaction or confidence and preference rating scores.
  • Scale. A variable can be treated as scale (continuous) when its values represent ordered categories with a meaningful metric, so that distance comparisons between values are appropriate. Examples of scale variables include age in years and income in thousands of dollars.

    Nominal and Ordinal variables are treated equivalently by Nearest Neighbor Analysis. The procedure assumes that the appropriate measurement level has been assigned to each variable; however, you can temporarily change the measurement level for a variable by right-clicking the variable in the source variable list and selecting a measurement level from the pop-up menu. To permanently change the level of measurement for a variable, see Variable measurement level.

An icon next to each variable in the variable list identifies the measurement level and data type:

Table 1. Measurement level icons
  Numeric String Date Time
Scale (Continuous)
Scale icon
n/a
Scale Date icon
Scale Time icon
Ordinal
Ordinal icon
Ordinal String icon
Ordinal Date icon
Ordinal Time icon
Nominal
Nominal icon
Nominal String icon
Nominal Date icon
Nominal Time icon

Categorical variable coding. The procedure temporarily recodes categorical predictors and dependent variables using one-of-c coding for the duration of the procedure. If there are c categories of a variable, then the variable is stored as c vectors, with the first category denoted (1,0,...,0), the next category (0,1,0,...,0), ..., and the final category (0,0,...,0,1).

This coding scheme increases the dimensionality of the feature space. In particular, the total number of dimensions is the number of scale predictors plus the number of categories across all categorical predictors. As a result, this coding scheme can lead to slower training. If your nearest neighbors training is proceeding very slowly, you might try reducing the number of categories in your categorical predictors by combining similar categories or dropping cases that have extremely rare categories before running the procedure.

All one-of-c coding is based on the training data, even if a holdout sample is defined (see Partitions (Nearest Neighbor Analysis)). Thus, if the holdout sample contains cases with predictor categories that are not present in the training data, then those cases are not scored. If the holdout sample contains cases with dependent variable categories that are not present in the training data, then those cases are scored.

Rescaling. Scale features are normalized by default. All rescaling is performed based on the training data, even if a holdout sample is defined (see Partitions (Nearest Neighbor Analysis)). If you specify a variable to define partitions, it is important that the features have similar distributions across the training and holdout samples. Use, for example, the Explore procedure to examine the distributions across partitions.

Frequency weights. Frequency weights are ignored by this procedure.

Replicating results. The procedure uses random number generation during random assignment of partitions and cross-validation folds. If you want to replicate your results exactly, in addition to using the same procedure settings, set a seed for the Mersenne Twister (see Partitions (Nearest Neighbor Analysis)), or use variables to define partitions and cross-validation folds.

To obtain a nearest neighbor analysis

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Nearest Neighbor...

  1. Specify one or more features, which can be thought of independent variables or predictors if there is a target.

    Target (optional). If no target (dependent variable or response) is specified, then the procedure finds the k nearest neighbors only – no classification or prediction is done.

    Normalize scale features. Normalized features have the same range of values, which can improve the performance of the estimation algorithm. Adjusted normalization, [2*(x−min)/(max−min)]−1, is used. Adjusted normalized values fall between −1 and 1.

    Focal case identifier (optional). This allows you to mark cases of particular interest. For example, a researcher wants to determine whether the test scores from one school district – the focal case – are comparable to those from similar school districts. He uses nearest neighbor analysis to find the school districts that are most similar with respect to a given set of features. Then he compares the test scores from the focal school district to those from the nearest neighbors.

    Focal cases could also be used in clinical studies to select control cases that are similar to clinical cases. Focal cases are displayed in the k nearest neighbors and distances table, feature space chart, peers chart, and quadrant map. Information on focal cases is saved to the files specified on the Output tab.

    Cases with a positive value on the specified variable are treated as focal cases. It is invalid to specify a variable with no positive values.

Case label (optional). Cases are labeled using these values in the feature space chart, peers chart, and quadrant map.

Fields with unknown measurement level

The Measurement Level alert is displayed when the measurement level for one or more variables (fields) in the dataset is unknown. Since measurement level affects the computation of results for this procedure, all variables must have a defined measurement level.

Scan Data. Reads the data in the active dataset and assigns default measurement level to any fields with a currently unknown measurement level. If the dataset is large, that may take some time.

Assign Manually. Opens a dialog that lists all fields with an unknown measurement level. You can use this dialog to assign measurement level to those fields. You can also assign measurement level in Variable View of the Data Editor.

Since measurement level is important for this procedure, you cannot access the dialog to run this procedure until all fields have a defined measurement level.

This procedure pastes KNN command syntax.