Silhouette (clustering)
{{Short description|Quality measure in cluster analysis}}
Silhouette is a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified.{{Cite journal
| doi = 10.1016/0377-0427(87)90125-7
| author = Peter J. Rousseeuw
| author-link = Peter J. Rousseeuw
| title = Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis
| journal = Journal of Computational and Applied Mathematics
| volume = 20
| pages = 53–65
| year = 1987| doi-access = free
}} It was proposed by Belgian statistician Peter Rousseeuw in 1987.
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. A clustering with an average silhouette width of over 0.7 is considered to be "strong", a value over 0.5 "reasonable", and over 0.25 "weak", but with increasing dimensionality of the data, it becomes difficult to achieve such high values because of the curse of dimensionality, as the distances become more similar.{{Cite journal |last1=Beyer |first1=Kevin |last2=Goldstein |first2=Jonathan |last3=Ramakrishnan |first3=Raghu |last4=Shaft |first4=Uri |date=1999 |title=When is "nearest neighbor" meaningful? |journal=Database Theory—ICDT'99: 7th International Conference Jerusalem, Israel, January 10–12, 1999 Proceedings 7 }} The silhouette score is specialized for measuring cluster quality when the clusters are convex-shaped, and may not perform well if the data clusters have irregular shapes or are of varying sizes.{{Cite journal |last1=Monshizadeh |first1=Mehrnoosh |last2=Khatri |first2=Vikramajeet |last3=Kantola |first3=Raimo |last4=Yan |first4=Zheng |date=2022-11-01 |title=A deep density based and self-determining clustering approach to label unknown traffic |journal=Journal of Network and Computer Applications |volume=207 |pages=103513 |doi=10.1016/j.jnca.2022.103513 |issn=1084-8045 |quote=However, both measures [silhouette coefficient and edge correlation] prefer convex-shaped clusters and cannot adapt to all cluster shapes produced by DBSCAN.|doi-access=free }} The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.
== Definition ==
[[File:Silhouette-plot-orange.png|thumb|Silhouette plot of clustered animal data in the [[Orange (software)|Orange]] data mining suite. At the bottom of the plot, silhouette identifies dolphin and porpoise as outliers in the group of mammals.]]
Assume the data have been clustered via any technique, such as [[k-medoids]] or [[k-means clustering|k-means]], into <math>k</math> clusters.

For data point <math>i \in C_I</math> (data point <math>i</math> in the cluster <math>C_I</math>), let

: <math>a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I, i \neq j} d(i, j)</math>

be the mean distance between <math>i</math> and all other data points in the same cluster, where <math>|C_I|</math> is the number of points belonging to cluster <math>C_I</math>, and <math>d(i, j)</math> is the distance between data points <math>i</math> and <math>j</math> (we divide by <math>|C_I| - 1</math> because we do not include the distance <math>d(i, i)</math> in the sum). We can interpret <math>a(i)</math> as a measure of how well <math>i</math> is assigned to its cluster (the smaller the value, the better the assignment).

For each data point <math>i \in C_I</math>, we now define

: <math>b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)</math>

to be the smallest (hence the <math>\min</math> operator in the formula) mean distance of <math>i</math> to all points in any other cluster of which <math>i</math> is not a member. The cluster with this smallest mean dissimilarity is said to be the "neighboring cluster" of <math>i</math>, because it is the next best fit cluster for point <math>i</math>.

We now define a silhouette (value) of one data point <math>i</math>

: <math>s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}</math>, if <math>|C_I| > 1</math>

and

: <math>s(i) = 0</math>, if <math>|C_I| = 1,</math>

which can be also written as:

: <math>s(i) = \begin{cases}
1-a(i)/b(i), & \mbox{if } a(i) < b(i) \\
0, & \mbox{if } a(i) = b(i) \\
b(i)/a(i)-1, & \mbox{if } a(i) > b(i) \\
\end{cases}</math>

From the above definition it is clear that

: <math>-1 \le s(i) \le 1.</math>

Note that <math>a(i)</math> is not clearly defined for clusters of size 1, in which case we set <math>s(i) = 0</math> by convention.

For <math>s(i)</math> to be close to 1, we require <math>a(i) \ll b(i)</math>. As <math>a(i)</math> is a measure of how dissimilar <math>i</math> is to its own cluster, a small value means it is well matched. Furthermore, a large <math>b(i)</math> implies that <math>i</math> is badly matched to its neighboring cluster. Thus an <math>s(i)</math> close to 1 means that the data point is appropriately clustered. If <math>s(i)</math> is close to −1, then by the same logic <math>i</math> would be more appropriately placed in its neighboring cluster. An <math>s(i)</math> near zero means that the point is on the border of two natural clusters.

The mean <math>s(i)</math> over all points of a cluster is a measure of how tightly grouped all the points in the cluster are. Thus the mean <math>s(i)</math> over all data of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and means may be used to determine the natural number of clusters within a dataset. One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster specific.{{Cite journal
| doi = 10.1016/j.ins.2015.06.039
| author = R.C. de Amorim, C. Hennig
| title = Recovering the number of clusters in data sets with noise features using feature rescaling factors
| journal = Information Sciences
| volume = 324
| pages = 126–145
| year = 2015| arxiv = 1602.06989
| s2cid = 315803
}}
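These definitions can be computed directly from any distance function. Below is a minimal Python sketch; the toy clusters and the choice of Euclidean distance are illustrative only:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette(point, own_cluster, other_clusters):
    """Silhouette value s(i) of one point, following the definition above."""
    if len(own_cluster) == 1:
        return 0.0  # s(i) = 0 for singleton clusters, by convention
    # a(i): mean distance to the other points of its own cluster
    a = sum(dist(point, q) for q in own_cluster if q is not point) / (len(own_cluster) - 1)
    # b(i): smallest mean distance to the points of any other cluster
    b = min(sum(dist(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

# Two well-separated toy clusters
c1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
c2 = [(10.0, 10.0), (10.0, 11.0)]
scores = [silhouette(p, c1, [c2]) for p in c1] + \
         [silhouette(p, c2, [c1]) for p in c2]
mean_s = sum(scores) / len(scores)
# Well-separated clusters yield a mean silhouette close to +1
```

Each score lies in [−1, 1]; because the two groups are far apart, every point scores above 0.9 here.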
Kaufman et al. introduced the term silhouette coefficient for the maximum value of the mean <math>s(i)</math> over all data of the entire dataset,{{Cite book
|author1=Leonard Kaufman
|author2=Peter J. Rousseeuw
|author2-link=Peter J. Rousseeuw
|title=Finding groups in data : An introduction to cluster analysis
|date=1990
|publisher=Wiley-Interscience
|location=Hoboken, NJ
|isbn=9780471878766
|doi=10.1002/9780470316801
|page=[https://archive.org/details/findinggroupsind00kauf/page/87 87]
|url=https://archive.org/details/findinggroupsind00kauf/page/87
}}
i.e.,

: <math>SC = \max_k \tilde{s}\left(k\right),</math>

where <math>\tilde{s}\left(k\right)</math> represents the mean <math>s(i)</math> over all data of the entire dataset for a specific number of clusters <math>k</math>.
== Simplified Silhouette and Medoid Silhouette ==
Computing the silhouette coefficient needs all <math>\mathcal{O}(N^2)</math> pairwise distances, making this evaluation much more costly than clustering with k-means. For a clustering with centers <math>\mu_{C_I}</math> for each cluster <math>C_I</math>, we can define a simplified silhouette for each data point <math>i \in C_I</math>, using only the distances to the cluster centers:

: <math>a'(i) = d(i, \mu_{C_I}),</math>

which has the additional benefit that <math>a'(i)</math> is always defined, and

: <math>b'(i) = \min_{C_J \neq C_I} d(i, \mu_{C_J}),</math>

then define accordingly a simplified silhouette and simplified silhouette coefficient

: <math>s'(i) = \frac{b'(i) - a'(i)}{\max\{a'(i), b'(i)\}}</math>

: <math>SC' = \max_k \frac{1}{N} \sum_i s'(i).</math>
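Because only distances to the cluster centers are needed, the simplified silhouette is cheap to evaluate. A small Python sketch (the point and centers are illustrative):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def simplified_silhouette(point, own_center, other_centers):
    """Simplified silhouette s'(i), using distances to cluster centers only."""
    a = dist(point, own_center)                     # a'(i) = d(i, mu_CI)
    b = min(dist(point, c) for c in other_centers)  # b'(i): nearest other center
    return (b - a) / max(a, b)

# A point close to its own center and far from the only other center:
s = simplified_silhouette((1.0, 1.0), (0.0, 0.0), [(10.0, 10.0)])
# a' = sqrt(2), b' = 9*sqrt(2), so s' = 1 - a'/b' = 8/9
```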
If the cluster centers are medoids (as in k-medoids clustering) instead of arithmetic means (as in k-means clustering), this is also called the medoid-based silhouette{{Cite journal |last1=Van der Laan |first1=Mark |last2=Pollard |first2=Katherine |last3=Bryan |first3=Jennifer |date=2003 |title=A new partitioning around medoids algorithm |url=http://www.tandfonline.com/doi/abs/10.1080/0094965031000136012 |journal=Journal of Statistical Computation and Simulation |language=en |volume=73 |issue=8 |pages=575–584 |doi=10.1080/0094965031000136012 |s2cid=17437463 |issn=0094-9655|url-access=subscription }} or medoid silhouette.{{Cite conference |last1=Lenssen |first1=Lars |last2=Schubert |first2=Erich |date=2022 |title=Clustering by Direct Optimization of the Medoid Silhouette |url=https://link.springer.com/10.1007/978-3-031-17849-8_15 |conference=International Conference on Similarity Search and Applications |language=en |pages=190–204 |doi=10.1007/978-3-031-17849-8_15 |access-date=2022-10-20|arxiv=2209.12553 }}
If every object is assigned to the nearest medoid (as in k-medoids clustering), we know that <math>a'(i) \leq b'(i)</math>, and hence the medoid silhouette simplifies to <math>s'(i) = 1 - a'(i)/b'(i)</math>.
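Under nearest-medoid assignment the two forms of <math>s'(i)</math> coincide, as the following illustrative check shows (medoids and point are toy values):

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical medoids and a point assigned to the nearest one
medoids = [(0.0, 0.0), (10.0, 10.0)]
p = (1.0, 0.0)
d = sorted(dist(p, m) for m in medoids)  # d[0] = a'(i), d[1] = b'(i)
s_ratio = 1 - d[0] / d[1]                # medoid silhouette 1 - a'/b'
s_full = (d[1] - d[0]) / max(d)          # general form (b' - a') / max(a', b')
# Both forms agree because a'(i) <= b'(i) under nearest-medoid assignment
```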
== Silhouette Clustering ==
Instead of using the average silhouette to evaluate a clustering obtained from, e.g., k-medoids or k-means, we can try to directly find a solution that maximizes the Silhouette. We do not have a closed form solution to maximize this, but it will usually be best to assign points to the nearest cluster as done by these methods. Van der Laan et al. proposed to adapt the standard algorithm for k-medoids, PAM, for this purpose and call this algorithm PAMSIL:
# Choose initial medoids by using PAM
# Compute the average silhouette of this initial solution
# For each pair of a medoid {{Math|m}} and a non-medoid {{Math|x}}
## swap {{Math|m}} and {{Math|x}}
## compute the average silhouette of the resulting solution
## remember the best swap
## un-swap {{Math|m}} and {{Math|x}} for the next iteration
# Perform the best swap and return to 3, otherwise stop if no improvement was found.
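The swap-based search above can be sketched in Python. The helper functions `assign` and `avg_silhouette`, and the toy data, are illustrative; this naive version recomputes the full average silhouette for every candidate swap, which is exactly what makes PAMSIL expensive:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def assign(points, medoids):
    """Partition the points among their nearest medoids."""
    clusters = [[] for _ in medoids]
    for p in points:
        nearest = min(range(len(medoids)), key=lambda j: dist(p, medoids[j]))
        clusters[nearest].append(p)
    return clusters

def avg_silhouette(points, medoids):
    """Average silhouette of the partition induced by the medoids."""
    clusters = assign(points, medoids)
    total = 0.0
    for ci, cluster in enumerate(clusters):
        others = [c for cj, c in enumerate(clusters) if cj != ci and c]
        for p in cluster:
            if len(cluster) == 1:
                continue  # s(i) = 0 for singleton clusters
            a = sum(dist(p, q) for q in cluster if q is not p) / (len(cluster) - 1)
            b = min(sum(dist(p, q) for q in c) / len(c) for c in others)
            total += (b - a) / max(a, b)
    return total / len(points)

def pamsil(points, medoids):
    """PAMSIL-style local search: repeatedly perform the best swap."""
    best = avg_silhouette(points, medoids)
    while True:
        best_swap = None
        for i in range(len(medoids)):           # each medoid ...
            for x in points:                    # ... against each non-medoid
                if x in medoids:
                    continue
                trial = medoids[:i] + [x] + medoids[i + 1:]
                s = avg_silhouette(points, trial)
                if s > best:
                    best, best_swap = s, trial  # remember the best swap
        if best_swap is None:
            return medoids, best                # no improvement: stop
        medoids = best_swap                     # perform the best swap

pts = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (9.0, 9.0), (9.5, 9.0), (9.0, 9.5)]
meds, score = pamsil(pts, [pts[0], pts[1]])  # deliberately bad initialization
# The swap search moves one medoid into the second group, raising the score
```

Even from a poor initialization with both medoids in the same group, the best-swap loop places one medoid in each natural group.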
The loop in step 3 is executed for <math>\mathcal{O}(k(N-k))</math> pairs of a medoid and a non-medoid, and evaluating the average silhouette of each candidate solution requires <math>\mathcal{O}(N^2)</math> distance computations, so each iteration of PAMSIL takes <math>\mathcal{O}(k(N-k)N^2)</math> time.
Because this is a fairly expensive operation, the authors propose to also use the medoid-based silhouette, and call the resulting algorithm PAMMEDSIL. It needs <math>\mathcal{O}(k^2 N^2)</math> time.
Batool et al. propose a similar algorithm under the name OSil, and propose a CLARA-like sampling strategy for larger data sets, that solves the problem only for a sub-sample.{{Cite journal |last1=Batool |first1=Fatima |last2=Hennig |first2=Christian |date=2021 |title=Clustering with the Average Silhouette Width |url=https://linkinghub.elsevier.com/retrieve/pii/S0167947321000244 |journal=Computational Statistics & Data Analysis |language=en |volume=158 |pages=107190 |doi=10.1016/j.csda.2021.107190|arxiv=1910.11339 |s2cid=219260336 }}
By adopting recent improvements to the PAM algorithm, FastMSC reduces the runtime using the medoid silhouette to just <math>\mathcal{O}(N^2)</math>.
By starting with a maximum number of clusters {{Math|k{{sub|max}}}} and iteratively removing the worst center (in terms of a change in silhouette) and re-optimizing, the best (highest medoid silhouette) clustering can be automatically determined. As data structures can be reused, this reduces the computation cost substantially over repeatedly running the algorithm for different numbers of clusters.{{Cite journal |last1=Lenssen |first1=Lars |last2=Schubert |first2=Erich |date=2024-02-01 |title=Medoid Silhouette clustering with automatic cluster number selection |url=https://www.sciencedirect.com/science/article/pii/S0306437923001266 |journal=Information Systems |volume=120 |pages=102290 |doi=10.1016/j.is.2023.102290 |issn=0306-4379|arxiv=2309.03751 }}
This algorithm needs pairwise distances and is typically implemented with a pairwise distance matrix. The <math>\mathcal{O}(N^2)</math> memory requirement of this matrix is the primary limiting factor for applying it to very large data sets.