Clustering binary data with K-Means (should be avoided)
I have a number of variables containing binary data (such as 0-1 or Yes-No responses, also known as dichotomous data). I would like to use K-Means Clustering to form clusters of similar cases. But I have heard that it is inappropriate to cluster binary-valued data. Is this true?
Resolving the problem
Yes, it is unlikely that binary data can be clustered satisfactorily. To see why, consider what happens as the K-Means algorithm processes cases.
For binary data, the Euclidean distance measure used by K-Means reduces to counting the number of variables on which two cases disagree. After the initial centers are chosen (which depends on the order of the cases), the centers are still binary data. For the first iteration, as the cases are compared to cluster means, they will always be at some integer distance from each of the centers. There will often be ties, and the case will be assigned to a cluster in an arbitrary manner. Using Euclidean distance (the only measure available to K-Means), it is impossible to overcome the symmetry and break the ties in any meaningful way.
After the first iteration, the distances will no longer be strictly integer valued, as the cluster centers will have been updated to be the proportion of ones in the cluster. However, the arbitrariness of the assignments up to this point will remain, and subsequent iterations cannot be expected to eliminate it. This is evident if the iteration history is considered: it will very frequently fail to converge to stable cluster assignments after 10 steps (the default). Requesting more steps may result in stable assignments, but not necessarily meaningful ones.
If all of the cluster variables are binary, then one can employ the distance measures for binary variables that are available for the Hierarchical Cluster procedure (CLUSTER command). Hierarchical Cluster is in the Statistics Base module (like K-Means Cluster) and is available from the Analyze->Classify->Hierarchical Cluster menu. If you click the Method button in the main Hierarchical Cluster dialog, you can choose one type of distance measure from interval variable, counts, or binary variable-based distance measures and then choose from a list of specific distance measures for the chosen type. The squared Euclidean distance is the default distance measure for Hierarchical Cluster. Hierarchical Cluster is more memory intensive than the K-Means or TwoStep Cluster procedures, with the memory requirement varying on the order of the square of the number of variables. See Technote 1476125 regarding memory issues for Hierarchical Cluster and Technote 1480659 for a caution regarding the plots produced by Hierarchical Cluster.
The TwoStep Cluster procedure will cluster cases by continous or categorical variables or a mix of such variables. If all of the variables are continuous, then TwoStep will calculate the Euclidean distance between cases. If one or more of the cluster variables are categorical, then TwoStep employs a log-likelihood distance measure. TwoStep is in the Statistics Base module and is available from the SPSS Statistics menu system at
An alternative strategy which is sometimes employed is to run factor analysis or principal component analysis on the binary variables, saving the factor or component scores as new variables and clustering the cases on the basis of those scores. Thus, the data being clustered are no longer binary. See Technote 1477550 for suggestions on factor and principal components analysis of categorical variables, including binary variables.