CLUSTER with distance matrix as input

Troubleshooting


Problem

I want to use the IBM SPSS Statistics CLUSTER procedure to perform a hierarchical cluster analysis of K objects. I already have a KxK matrix of proximities that I wish to use as input. However, when I run the CLUSTER procedure on these data, it computes a new distance matrix from them, as if the rows were cases and the columns were ordinary variables. How can I indicate that the data already comprise a proximity matrix?

Resolving The Problem

To have your matrix of proximities recognized as such by the CLUSTER procedure, you must add two variables to the file and you must run the procedure as a syntax command that includes a /MATRIX IN subcommand. The two variables are ROWTYPE_ and VARNAME_. Both variables are string variables with a width of 8 characters. Content details are provided below.

The variable ROWTYPE_ must be the first variable in the file. It will typically have the value 'PROX' in all K rows. Correlations are sometimes used as similarity measures in cluster analysis; if the proximity matrix is a correlation matrix, then ROWTYPE_ will have the value 'CORR' in all K rows of the data file. When the values of ROWTYPE_ equal 'PROX', the CLUSTER procedure treats the matrix values as distances by default. If the matrix values are actually similarities, you need to indicate this by assigning a value label of 'SIMILARITY' to the ROWTYPE_ value 'PROX'. For a matrix of distance values, you can assign 'PROX' a value label of 'DISSIMILARITY', but this is not necessary.

The variable VARNAME_ will typically be the second variable in the file. A symmetric proximity matrix is assumed by CLUSTER, with row k and column k referring to the same object. Each column of the matrix will have a variable name associated with it. The value of VARNAME_ in each row of the file will be the variable name associated with that row's object, i.e. the variable name for the corresponding column.
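The layout described above can be pictured by typing such a file in directly. The following is only a sketch: it reuses the 5x5 similarity values and the variable names x1 to x5 from the example later in this document, and it assumes that a file built with DATA LIST is treated the same way as one whose ROWTYPE_ and VARNAME_ variables were added by hand.

```spss
* Sketch only, the layout of a 5x5 proximity matrix file.
* ROWTYPE_ comes first, VARNAME_ second, then one column per object.
DATA LIST LIST / ROWTYPE_ (A8) VARNAME_ (A8) x1 x2 x3 x4 x5.
BEGIN DATA
PROX x1 10 4 3 6 7
PROX x2 4 10 5 2 5
PROX x3 3 5 10 8 4
PROX x4 6 2 8 10 1
PROX x5 7 5 4 1 10
END DATA.
* These values are similarities, so PROX is labeled accordingly.
VALUE LABELS ROWTYPE_ 'PROX' 'SIMILARITY'.
```

Each row's VARNAME_ value names the column that refers to the same object, so row k and column k line up.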

Once the new variables have been added to the matrix file, you can run the cluster analysis from a syntax window with the CLUSTER command. The CLUSTER command must include the subcommand:

/MATRIX IN (*)

This subcommand indicates that the active data file is a matrix file, rather than a raw data file. The value of ROWTYPE_ in the data will indicate that it is a matrix of proximities. The '*' in the above subcommand refers to the active data file. If you wish to analyze a file other than the active file, you can replace the '*' with a path and file name in quotes.
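As an illustration of the quoted-path form, the sketch below reads the matrix from a saved file. The path C:\data\proximities.sav and the variable names x1 to x5 are hypothetical placeholders, not values taken from your system.

```spss
* Hypothetical path, replace it with the location of your own matrix file.
CLUSTER x1 x2 x3 x4 x5
  /MATRIX IN ('C:\data\proximities.sav')
  /METHOD BAVERAGE
  /PLOT DENDROGRAM.
```

If no data file is open yet, it may be simpler to open the matrix file first and then refer to it with the asterisk:

```spss
* Same hypothetical path, opened as the active data file first.
GET FILE = 'C:\data\proximities.sav'.
CLUSTER x1 x2 x3 x4 x5
  /MATRIX IN (*)
  /METHOD BAVERAGE
  /PLOT DENDROGRAM.
```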

If your matrix is currently in a text file (or not in a file at all), then the easiest way to prepare the data for cluster analysis in SPSS may be to precede the data with a MATRIX DATA command in a syntax window. The following commands illustrate the reading and analysis of a 5x5 similarity matrix. Use of the MATRIX DATA command will result in the proper placement of the ROWTYPE_ and VARNAME_ variables. The /CONTENTS subcommand indicates how ROWTYPE_ should be filled. The VALUE LABELS command assigns the 'SIMILARITY' label to the ROWTYPE_ value 'PROX'.

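* Read the 5x5 similarity matrix as a full matrix, including the diagonal.
MATRIX DATA VARIABLES = x1 TO x5
  /FORMAT = LIST FULL DIAGONAL
  /CONTENTS = PROX.
BEGIN DATA.
10 4 3 6 7
4 10 5 2 5
3 5 10 8 4
6 2 8 10 1
7 5 4 1 10
END DATA.
* Label PROX as SIMILARITY so that CLUSTER does not treat the values as distances.
VALUE LABELS rowtype_ 'PROX' 'SIMILARITY' .
* Run the cluster analysis on the matrix in the active data file.
CLUSTER x1 x2 x3 x4 x5
  /MATRIX = IN (*)
  /METHOD BAVERAGE
  /PRINT SCHEDULE DISTANCE
  /PLOT DENDROGRAM .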

If your proximity matrix is already in an SPSS data file, then the easiest approach may be to add the ROWTYPE_ and VARNAME_ variables in the Data Editor. You can insert variables at either the top of the Variable View or the far left of the Data View. Highlight the first variable name in the Data View, or the first row in the Variable View, and then click 'Insert Variable' in the Data menu twice. Define both variables in the Variable View, then fill in the ROWTYPE_ value and the variable names in the Data View. You will still need to run the CLUSTER command from a syntax window to include the /MATRIX IN subcommand.
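The same two variables can also be added with syntax instead of the Data Editor. The commands below are a sketch under stated assumptions: the active data file contains exactly five cases holding a 5x5 distance matrix in variables named x1 to x5, and the stand-in names vn and k are arbitrary. Adjust the names and the count to match your own matrix.

```spss
* Sketch only, assumes a 5x5 matrix in x1 to x5 in the active data file.
STRING ROWTYPE_ VARNAME_ (A8).
COMPUTE ROWTYPE_ = 'PROX'.
* Give row k the name of column k.
DO REPEAT vn = 'x1' 'x2' 'x3' 'x4' 'x5' / k = 1 TO 5.
  IF ($CASENUM = k) VARNAME_ = vn.
END REPEAT.
* Move the two new variables to the front of the file.
ADD FILES FILE = * /KEEP = ROWTYPE_ VARNAME_ ALL.
EXECUTE.
```

If the values are similarities rather than distances, the 'SIMILARITY' value label described earlier is still needed before the CLUSTER command is run.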

Historical Number

26381

Document Information

Modified date:
16 April 2020

UID

swg21479630