Find the K in K-means by Parametric Bootstrap
Nina Zumel, Win-Vector LLC
This app explores the use of parametric bootstrap simulations to determine an appropriate k for K-means clustering of a data set. The approach was inspired by the boot.comp function in mixtools, an R package for fitting and analyzing finite mixture models (see this article by Ron Pearson for an example of using mixtools to fit mixtures of gaussians).
The Idea
We test the hypothesis that a data set is composed of k+1 clusters (or more) against the null hypothesis that it has only k clusters. To do this, we K-means cluster the data with k=k, and use that clustering to assign every point to a cluster. We then estimate the mean and covariance matrix of each cluster via principal components analysis. From these estimates, we can generate simulated data sets of the same size as our true data set, where each cluster is drawn from a gaussian distribution with the same mean and covariance matrix as the corresponding cluster in the true data. In other words, the null hypothesis is that the data was generated from k gaussian clusters.
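The simulation step can be sketched in Python with numpy (the app itself is in R; this is an illustration, not the app's code, and for simplicity it estimates each cluster's parameters directly from the sample mean and covariance rather than via PCA):

```python
import numpy as np

def simulate_null(X, labels, rng):
    """Generate one data set under the null hypothesis: each cluster is
    replaced by a gaussian with that cluster's sample mean and covariance.
    (Estimated directly from the sample rather than via PCA as in the text;
    every cluster needs at least 2 points.)"""
    parts = []
    for j in np.unique(labels):
        pts = X[labels == j]                    # points assigned to cluster j
        mu = pts.mean(axis=0)                   # cluster mean
        cov = np.cov(pts, rowvar=False)         # cluster covariance matrix
        parts.append(rng.multivariate_normal(mu, cov, size=len(pts)))
    return np.vstack(parts)
```

Each simulated data set has the same size and cluster proportions as the original, but is gaussian by construction.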
We then K-means cluster the simulated data sets with k=k+1 and collect the distribution of the total within-sum-of-squares (total WSS) of these clusterings. For a data set with k actual clusters, we would expect the total WSS of a k+1 clustering to be lower than that of a k clustering, because total WSS tends to decrease as we partition the data into more (and internally "tighter") clusters. Our simulations give us a plausible range of total WSS values that we would tend to see under the null hypothesis.
If the real data really does fall into k (gaussian) clusters, then when we cluster it with k+1 centers, we would expect a total WSS within the range of our simulations. The hope is that if the data actually has more than k clusters, then K-means will discover a k+1 clustering that is better (lower total WSS) than anything we saw in our simulations, because the data is grouped into tighter clusters than our null hypothesis assumed.
So if the total WSS of the true data's k+1 clustering is lower than at least 95% of the total WSS from the simulated data (for a significance level of p < 0.05), then we reject the null hypothesis, and assume that the data has at least k+1 clusters. This continues with increasing values of k until we can no longer reject our null hypothesis.
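The test at a single value of k might look like the following self-contained Python sketch (again, not the app's actual R code; the minimal Lloyd's-algorithm kmeans helper and the defaults for n_sim and alpha are illustrative choices):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centers, total_wss)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > 0:                    # keep old center if cluster empties
                new_centers[j] = pts.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centers, dists.min(axis=1).sum()

def bootstrap_test(X, k, n_sim=20, alpha=0.05, seed=0):
    """Test H0 'X has k gaussian clusters' against 'X has more than k'.
    Returns True if H0 is rejected at level alpha."""
    rng = np.random.default_rng(seed)
    labels, _, _ = kmeans(X, k, seed=seed)
    # fit a gaussian to each cluster (sample mean and covariance)
    params = [(X[labels == j].mean(axis=0),
               np.cov(X[labels == j], rowvar=False),
               (labels == j).sum()) for j in range(k)]
    # distribution of total WSS over (k+1)-clusterings of simulated data
    sim_wss = [kmeans(np.vstack([rng.multivariate_normal(mu, cov, size=n)
                                 for mu, cov, n in params]), k + 1)[2]
               for _ in range(n_sim)]
    observed = kmeans(X, k + 1)[2]
    # reject if the real (k+1)-clustering beats (1 - alpha) of the simulations
    p = np.mean([s <= observed for s in sim_wss])
    return p < alpha
```

On data with two well-separated clusters, bootstrap_test(X, 1) should reject the null hypothesis: the real 2-clustering has far lower total WSS than 2-clusterings of a single broad gaussian.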
The Stop Early search style stops searching the first time we accept the null hypothesis. The Full Scan search style simulates all the way up to k=10 clusters and selects the last k where the null hypothesis was rejected.
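The two search styles reduce to simple loops over the bootstrap test described above; here is a sketch where reject_null(k) is a hypothetical callback standing in for that test (and the Full Scan return value reflects one plausible reading of "selects the last k where the null hypothesis was rejected": a rejection at k implies at least k+1 clusters):

```python
def choose_k_stop_early(reject_null, k_max=10):
    """Stop Early: stop at the first k where we fail to reject H0."""
    k = 1
    while k < k_max and reject_null(k):
        k += 1
    return k

def choose_k_full_scan(reject_null, k_max=10):
    """Full Scan: test every k up to k_max; the last rejection at k
    implies at least k+1 clusters, so return that k+1."""
    best = 1
    for k in range(1, k_max):
        if reject_null(k):
            best = k + 1
    return best
```

With a well-behaved test, both styles agree; they can differ when the test rejects again at some k after an earlier acceptance.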
Some Points to Consider
Obviously, this method only makes sense if we believe that the clusters are "gaussian-like": in the experiments run by this app, we generate clusters from an n-dimensional spherical gaussian to which we apply a random linear transformation and a random shift.
Even when the clusters in the data are truly generated from gaussians and we know k, K-means cannot always recover the same mixture parameters as a proper mixture model: K-means produces a "hard" clustering, a partitioning where every point belongs to exactly one cluster, whereas a mixture model produces a "soft" partitioning, where a point can have partial membership in several clusters. In particular, K-means cannot recover a cluster that is contained within another cluster, nor can it discriminate between clusters that substantially overlap; a mixture model may be able to detect both of these cases. So if you strongly believe that your data is a mixture of gaussians (or some other distribution), and you want to recover the population parameters rather than simply find a "best partitioning" of the data, you should use mixtools or another mixture-model analysis package.

This idea should extend to other cluster metrics, like the Calinski-Harabasz Index, as well as to other clustering algorithms. You can even modify the null hypothesis to cover other models of generating the clusters. Feel free to try your own experiments. If you want to compare your results to ours, you can download the data sets generated by the app via the "Save Generated Data" button.
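For instance, the Calinski-Harabasz Index (between-cluster dispersion over within-cluster dispersion, adjusted for degrees of freedom) could replace total WSS as the clustering metric; a small numpy sketch of that index, for illustration:

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH = (BSS / (k - 1)) / (WSS / (n - k)); higher is better.
    BSS: between-cluster sum of squares (cluster means vs. grand mean),
    WSS: total within-cluster sum of squares."""
    n = len(X)
    ks = np.unique(labels)
    k = len(ks)
    grand_mean = X.mean(axis=0)
    bss = sum(len(X[labels == j]) *
              ((X[labels == j].mean(axis=0) - grand_mean) ** 2).sum()
              for j in ks)
    wss = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
              for j in ks)
    return (bss / (k - 1)) / (wss / (n - k))
```

A good clustering of well-separated data scores much higher than a random labeling of the same points.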
Links and References
Our blog post on this approach is here.
We cover other approaches to estimating the number of clusters in Chapter 8 of our book Practical Data Science in R (Manning, 2014). This chapter is available as a free sample chapter, here.
The code to run the parametric bootstrap and to generate the test data sets is available from GitHub here.