12 Clustering
Goals
- Learn about partitional clustering
- Learn about hierarchical clustering
- Use clustering validation methods
- Apply different methods to larger data sets
The goal of clustering is to classify data points into groups (clusters) without giving the algorithm any knowledge of the correct classification. This type of approach is called unsupervised learning, and it is appropriate when the “truth” for your data classification is unavailable or difficult to obtain.
If the truth is unknown, we need a way of deciding which data points belong together. One common set of approaches relies on a measure of closeness, or distance between points. Of those, the classic K-means approach is the most straightforward.
12.1 K-means algorithm
- divide data into K clusters
- calculate centroids for each
- go through each data point until nothing changes:
  - calculate distance to each centroid
  - assign to nearest centroid
  - recalculate centroids for the two affected clusters
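A minimal base-R sketch of this loop (for illustration only; in practice one uses the built-in `kmeans()` function, and empty clusters are not handled here):

```r
# A bare-bones k-means sketch; X is a numeric matrix, k the number of clusters.
kmeans_sketch <- function(X, k, max_iter = 100) {
  centroids  <- X[sample(nrow(X), k), , drop = FALSE]  # random initial centroids
  assignment <- rep(0, nrow(X))
  for (iter in seq_len(max_iter)) {
    # Distances from every point to every centroid (Euclidean)
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)      # nearest centroid per point
    if (all(new_assignment == assignment)) break  # nothing changed: done
    assignment <- new_assignment
    # Recalculate the centroid of every cluster
    for (j in seq_len(k)) {
      centroids[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}
```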
Let us apply the k-means algorithm to our well-studied penguin data set. In the script below, we remove the NAs and drop the categorical variables, which are not directly useful for a distance-based algorithm, leaving the four numeric variables to define similarity between individuals. The question is: will they cluster the penguins according to species?
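The original code chunk is not reproduced here; a minimal sketch, assuming the `penguins` data frame from the palmerpenguins package:

```r
library(palmerpenguins)  # assumed source of the penguin data
library(dplyr)

# Remove rows with NAs, then keep only the four numeric measurements
pen_complete <- na.omit(penguins)
pen_numeric  <- pen_complete %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# k-means with K = 3, one cluster per species (we hope); no seed is set,
# so repeated runs may converge to different solutions
pen_km <- kmeans(pen_numeric, centers = 3)
table(cluster = pen_km$cluster, species = pen_complete$species)
```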
The plot is produced by performing PCA to reduce the number of variables from four to two, so that the data points can be shown in the plane that captures most of their variation. Notice that the clusters are not well separated, and when compared with the actual classification given by species, they do not do well.
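One way to make such a plot is `fviz_cluster()` from the factoextra package, which runs PCA automatically when there are more than two variables (a sketch):

```r
library(factoextra)

# Projects the four measurements onto the first two principal components
# and colors the points by their k-means cluster
fviz_cluster(pen_km, data = pen_numeric, geom = "point")
```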
However, the four measurements have very different variances, so we rescale them so that each has unit variance:
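For example, with base R's `scale()`, continuing the code above:

```r
# Center each variable and rescale it to unit variance, then re-cluster
pen_scaled <- scale(pen_numeric)
pen_km_sc  <- kmeans(pen_scaled, centers = 3)

table(cluster = pen_km_sc$cluster, species = pen_complete$species)
fviz_cluster(pen_km_sc, data = pen_scaled, geom = "point")
```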
Now we get much better separation, as well as much better prediction quality. However, if you run the above code several times, you will see different results, because k-means starts from a random selection of centroids. In cases like this, where the clusters are not clearly separated, the algorithm may converge to different classifications. In some trials we see very good prediction quality for all three species, while in others two of the species are commingled.
12.1.1 Assumptions of K-means algorithm
- There is a meaningful distance measure
- Clusters are roughly spherical
- Clusters are of similar size
12.2 Hierarchical clustering
Hierarchical clustering is a different approach from k-means, although it is also based on a notion of distance. The goal is to create a tree, akin to a phylogeny, based on the proximity of different points to each other, and then to divide it into groups by cutting the tree at a certain depth from the root.
12.2.1 Agglomerative clustering
Start with single data points as “clusters,” then iteratively combine the closest pair of clusters. The closeness may be defined in the following ways:
Single Linkage: In single linkage, we define the distance between two clusters as the minimum distance between any single data point in the first cluster and any single data point in the second cluster.
Complete Linkage: In complete linkage, we define the distance between two clusters to be the maximum distance between any single data point in the first cluster and any single data point in the second cluster.
Average Linkage: In average linkage, we define the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster.
Centroid Method: In the centroid method, the distance between two clusters is the distance between the two mean vectors of the clusters.
Ward’s Method: This method does not directly define a measure of distance between two points or clusters; it is an ANOVA-based approach. One-way univariate ANOVAs are done for each variable, with groups defined by the clusters at that stage of the process. At each stage, the two clusters whose merger produces the smallest increase in the combined error sum of squares are joined.
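In base R these choices correspond to the `method` argument of `hclust()`; a sketch on the scaled penguin data (note that R's documentation warns that the centroid method expects squared Euclidean distances):

```r
d <- dist(pen_scaled)  # Euclidean distance matrix

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
hc_centroid <- hclust(d, method = "centroid")  # expects squared distances
hc_ward     <- hclust(d, method = "ward.D2")   # Ward's method on distances
```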
12.2.2 Clustering penguin data using hierarchical methods
Let us try different methods and see which one generates the best results.
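A sketch of such a comparison, cutting each tree into three groups and tabulating the groups against the species:

```r
# Compare how well each linkage recovers the three species
for (m in c("single", "complete", "average", "centroid", "ward.D2")) {
  groups <- cutree(hclust(d, method = m), k = 3)
  cat("\nLinkage:", m, "\n")
  print(table(cluster = groups, species = pen_complete$species))
}

# Dendrogram for one of the methods
plot(hclust(d, method = "ward.D2"), labels = FALSE)
```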
Exercise: Try using different clustering methods!
12.3 Clustering analysis and validation
12.3.1 Hopkins statistic
The Hopkins statistic compares nearest-neighbor distances \(u_i\), measured from points generated uniformly over the data range to their nearest actual data points, with nearest-neighbor distances \(w_i\) measured between the data points themselves (\(d\) is the dimension of the data): \[ H = \frac{\sum_i u_i^d}{\sum_i u_i^d + \sum_i w_i^d} \] This quantifies the “clustering tendency” of the data set.
If \(H\) is substantially above 0.5, we reject the null hypothesis that the data were generated by a Poisson point process (no clustering tendency).
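One implementation is `get_clust_tendency()` from factoextra (a sketch; be aware that some other packages define \(H\) with the opposite orientation, so check the documentation):

```r
library(factoextra)

# Hopkins statistic based on n randomly sampled points (n < number of rows)
tend <- get_clust_tendency(pen_scaled, n = 50, graph = FALSE)
tend$hopkins_stat  # well above 0.5 suggests genuine clustering tendency
```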
A related visual check is a heatmap of the ordered dissimilarity matrix, in which:
- red indicates high similarity (low dissimilarity)
- blue indicates low similarity (high dissimilarity)
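`fviz_dist()` from factoextra draws exactly this kind of plot, with red for low and blue for high dissimilarity by default (a sketch):

```r
# With clusterable data, red blocks of mutually similar points
# appear along the diagonal of the ordered distance matrix
fviz_dist(dist(pen_scaled), show_labels = FALSE)
```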
12.3.2 Elbow method
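The elbow method runs the clustering for a range of values of K and plots the total within-cluster sum of squares against K; the point where the curve bends (the “elbow”) suggests a reasonable number of clusters. A sketch using `fviz_nbclust()` from factoextra:

```r
# Total within-cluster sum of squares for K = 1..10;
# look for the bend ("elbow") in the curve
fviz_nbclust(pen_scaled, kmeans, method = "wss", k.max = 10)
```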
12.3.3 Silhouette Plot
The silhouette coefficient measures how similar an object \(i\) is to the other objects in its own cluster versus the objects outside of its cluster; the values \(S_i\) range from -1 to 1, and a value close to 1 means the object is very similar to those in its own group and dissimilar to the others.
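In the standard definition, with \(a_i\) the mean distance from object \(i\) to the other members of its cluster and \(b_i\) the mean distance to the members of the nearest other cluster, \[ S_i = \frac{b_i - a_i}{\max(a_i, b_i)} \] A silhouette plot for the scaled k-means result can be drawn with the cluster and factoextra packages (a sketch):

```r
library(cluster)

# Silhouette width of every penguin, given its k-means cluster label
sil <- silhouette(pen_km_sc$cluster, dist(pen_scaled))
fviz_silhouette(sil)
```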
12.3.4 Lazy way: use all the methods!
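One such tool is the NbClust package, which computes some 30 validity indices and recommends the K favored by the majority of them (a sketch, assuming the package is installed):

```r
library(NbClust)

# Evaluate K = 2..10 with k-means; a summary of the indices' votes
# is printed, and the per-index choices are returned
nb <- NbClust(pen_scaled, min.nc = 2, max.nc = 10, method = "kmeans")
table(nb$Best.nc["Number_clusters", ])
```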
12.3.5 Validation using bootstrapping
One common approach to validating clustering is bootstrapping, which involves repeatedly sampling from the data set, running the clustering algorithm on each sample, and comparing the results. One algorithm uses the Jaccard coefficient to quantify the similarity between two sets, defined as the number of points in the intersection of the two sets (those that are in both sets), divided by the number of points in their union (those that are in either one or the other set):
\[ J = \frac{ \vert A \cap B \vert }{\vert A \cup B \vert} \] The vertical lines indicate the number of points (cardinality) in the set.
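This scheme is implemented in `clusterboot()` from the fpc package, which reports the mean Jaccard coefficient of each cluster across the bootstrap runs (a sketch; mean values below roughly 0.6 are commonly read as unstable clusters):

```r
library(fpc)

# 100 bootstrap resamples of a 3-cluster k-means solution
cb <- clusterboot(pen_scaled, B = 100,
                  clustermethod = kmeansCBI, krange = 3, seed = 1)
cb$bootmean  # mean Jaccard similarity per cluster
```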
12.4 Application to breast cancer data
The following measurements are based on biopsy data from patients with suspected breast cancer (see [5]). The data set contains several measurements of cell characteristics, as well as the classification of each biopsy as benign or malignant (coded 2 or 4, respectively). Let us see whether clustering on the cell measurements alone can recover this classification.
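A sketch of loading and clustering the data, assuming the file layout of the original Wisconsin data set at the UCI repository (the column names below are ours, and the bare-nuclei column uses "?" for missing values):

```r
# Read the comma-separated data directly from the UCI repository
# (assumed URL and file layout)
url <- paste0("https://archive.ics.uci.edu/ml/machine-learning-databases/",
              "breast-cancer-wisconsin/breast-cancer-wisconsin.data")
bc <- read.csv(url, header = FALSE, na.strings = "?")
names(bc) <- c("id", "thickness", "size_unif", "shape_unif", "adhesion",
               "epithelial", "bare_nuclei", "chromatin", "nucleoli",
               "mitoses", "class")   # class: 2 = benign, 4 = malignant
bc <- na.omit(bc)

# Two-group k-means on the nine scaled cell measurements
bc_km <- kmeans(scale(bc[, 2:10]), centers = 2)
table(cluster = bc_km$cluster, class = bc$class)
```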
12.5 References
- https://www.r-bloggers.com/exploring-assumptions-of-k-means-clustering-using-r/
- https://onlinecourses.science.psu.edu/stat505/node/143/
- https://github.com/hhundiwala/hierarchical-clustering
- https://www.r-bloggers.com/bootstrap-evaluation-of-clusters/
- https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)