Clustering high-dimensional data

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce many measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.

Problems

Four problems need to be overcome for clustering in high-dimensional data:{{Cite journal | doi = 10.1145/1497577.1497578| title = Clustering high-dimensional data| journal = ACM Transactions on Knowledge Discovery from Data| volume = 3| pages = 1–58| year = 2009| last1 = Kriegel | first1 = H. P. | author-link = Hans-Peter Kriegel| last2 = Kröger | first2 = P. | last3 = Zimek | first3 = A. | s2cid = 17363900}}

  • Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, complete enumeration of all subspaces becomes intractable with increasing dimensionality. This problem is known as the curse of dimensionality.
  • The concept of distance becomes less precise as the number of dimensions grows, since the distances between pairs of points in a given dataset tend to converge to the same value. In particular, the distinction between the nearest and the farthest point becomes meaningless:

::\lim_{d \to \infty} \frac{\mathit{dist}_\max - \mathit{dist}_\min}{\mathit{dist}_\min} = 0

  • A cluster is intended to group objects that are related, based on observations of their attributes' values. However, given a large number of attributes, some of the attributes will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient.
  • Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.
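
The loss of contrast between the nearest and the farthest point can be observed in a small simulation. The following is a minimal sketch, assuming points drawn uniformly at random from the unit hypercube and the Euclidean distance; the function name relative_contrast and the sample sizes are illustrative choices, not taken from the cited literature.

```python
import math
import random

def relative_contrast(n_points, dim, rng):
    """Return (dist_max - dist_min) / dist_min for one random query point."""
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so that the run is reproducible
contrasts = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"d={d:5d}  relative contrast={c:.3f}")
```

Under these assumptions the printed ratio shrinks as the dimensionality grows, which is the behaviour described by the limit above. Data with a small number of relevant dimensions may behave differently.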

Research published in 2010 indicates that the discrimination problems occur only when there is a high number of irrelevant dimensions, and that shared-nearest-neighbor approaches can improve results.{{Cite conference | last1 = Houle | first1 = M. E. | last2 = Kriegel | first2 = H. P. | author-link2=Hans-Peter Kriegel | last3 = Kröger | first3 = P.| last4 = Schubert | first4 = E. | last5 = Zimek | first5 = A.| title = Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? | doi = 10.1007/978-3-642-13818-8_34 | conference = Scientific and Statistical Database Management | series = Lecture Notes in Computer Science | volume = 6187 | pages = 482 | year = 2010 | isbn = 978-3-642-13817-1 | url = http://www.dbs.ifi.lmu.de/~zimek/publications/SSDBM2010/SNN-SSDBM2010-preprint.pdf}}
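
A shared-nearest-neighbor similarity can be sketched as follows. This is a minimal illustration that assumes one common definition, the number of neighbors two points have in common among their k nearest neighbors; the helper names knn_sets and snn_similarity and the toy data are hypothetical.

```python
import math

def knn_sets(points, k):
    """For each point, the set of indices of its k nearest neighbors (itself excluded)."""
    neighbors = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(order[:k]))
    return neighbors

def snn_similarity(neighbors, i, j):
    """Number of nearest neighbors shared by points i and j."""
    return len(neighbors[i] & neighbors[j])

# Two well-separated groups of four points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
nn = knn_sets(points, k=3)
same = snn_similarity(nn, 0, 1)   # two points from the same group
cross = snn_similarity(nn, 0, 4)  # two points from different groups
```

The similarity depends only on the ranking of neighbors, not on the absolute distance values, which is why such measures can remain informative when the distances themselves lose contrast.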

Approaches

Approaches towards clustering in axis-parallel or arbitrarily oriented affine subspaces differ in how they interpret the overall goal, which is finding clusters in data with high dimensionality. A fundamentally different approach is to find clusters based on patterns in the data matrix, often referred to as biclustering, a technique frequently used in bioinformatics.

=Subspace clustering=

Image:SubspaceClustering.png

Subspace clustering aims to look for clusters in different combinations of dimensions (i.e., subspaces) and unlike many other clustering approaches does not assume that all of the clusters in a dataset are found in the same set of dimensions.{{Cite journal |last=Parsons |first=Lance |last2=Haque |first2=Ehtesham |last3=Liu |first3=Huan |date=2004-06-01 |title=Subspace clustering for high dimensional data: a review |url=https://doi.org/10.1145/1007730.1007731 |journal=ACM SIGKDD Explorations Newsletter |volume=6 |issue=1 |pages=90–105 |doi=10.1145/1007730.1007731 |issn=1931-0145|url-access=subscription }} Subspace clustering can take bottom-up or top-down approaches. Bottom-up methods (such as CLIQUE) heuristically identify relevant dimensions by dividing the data space into a grid structure, selecting dense units, and then iteratively linking them if they are adjacent and dense.

The adjacent image shows a two-dimensional space in which a number of clusters can be identified. In the one-dimensional subspaces, the clusters c_a (in subspace \{x\}) and c_b, c_c, c_d (in subspace \{y\}) can be found. c_c cannot be considered a cluster in the two-dimensional (sub-)space, since it is too sparsely distributed along the x axis. In two dimensions, the two clusters c_{ab} and c_{ad} can be identified.

The difficulty of subspace clustering is that a space with d dimensions has 2^d different axis-parallel subspaces. If the subspaces are not axis-parallel, an infinite number of subspaces is possible. Hence, subspace clustering algorithms use some kind of heuristic to remain computationally feasible, at the risk of producing inferior results. For example, the downward-closure property (cf. association rules) can be used to build higher-dimensional subspaces only by combining lower-dimensional ones: if a subspace T contains a cluster, then every subspace S ⊆ T also contains that cluster. This approach is taken by most of the traditional algorithms, such as CLIQUE{{Cite journal | last1 = Agrawal | first1 = R. | last2 = Gehrke | first2 = J. | last3 = Gunopulos | first3 = D. | last4 = Raghavan | first4 = P. | title = Automatic Subspace Clustering of High Dimensional Data | doi = 10.1007/s10618-005-1396-1 | journal = Data Mining and Knowledge Discovery | volume = 11 | pages = 5–33 | year = 2005 | citeseerx = 10.1.1.131.5152 | s2cid = 9289572 }} and SUBCLU.{{Cite conference| doi = 10.1137/1.9781611972740.23| title = Density-Connected Subspace Clustering for High-Dimensional Data| conference = Proceedings of the 2004 SIAM International Conference on Data Mining| pages = [https://archive.org/details/proceedingsoffou0000siam/page/246 246]| year = 2004| last1 = Kailing| first1 = K.| last2 = Kriegel| first2 = H. P.| author-link2 = Hans-Peter Kriegel| last3 = Kröger| first3 = P.| isbn = 978-0-89871-568-2| url-access = registration| url = https://archive.org/details/proceedingsoffou0000siam/page/246| doi-access = free}} It is also possible to define a subspace using different degrees of relevance for each dimension, an approach taken by {{Proper name|iMWK-Means}},{{Cite journal | doi = 10.1016/j.patcog.2011.08.012| title = Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering| journal = Pattern Recognition| volume = 45| issue = 3| pages = 1061| year = 2012| last1 = De Amorim | first1 = R.C. | last2 = Mirkin | first2 = B. | bibcode = 2012PatRe..45.1061C}} EBK-Modes{{Cite book|last1=Carbonera|first1=Joel Luis|last2=Abel|first2=Mara|title=2014 IEEE 26th International Conference on Tools with Artificial Intelligence |chapter=An Entropy-Based Subspace Clustering Algorithm for Categorical Data |date=November 2014|pages=272–277 |publisher=IEEE|doi=10.1109/ictai.2014.48|isbn=9781479965724|s2cid=7208538 }} and CBK-Modes.{{Cite book|last1=Carbonera|first1=Joel Luis|last2=Abel|first2=Mara|title=Proceedings of the 17th International Conference on Enterprise Information Systems |chapter=CBK-Modes: A Correlation-based Algorithm for Categorical Data Clustering |date=2015|pages=603–608 |publisher=SCITEPRESS - Science and Technology Publications|doi=10.5220/0005367106030608|isbn=9789897580963}}
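
The pruning enabled by the downward-closure property can be illustrated with a grid-based, bottom-up search. The sketch below is a simplified illustration, not the published CLIQUE or SUBCLU algorithm: it assumes attribute values scaled to [0, 1), treats a subspace as relevant if it contains at least one dense grid cell, and omits the merging of adjacent dense cells into clusters. The names dense_units and dense_subspaces and the parameters bins and min_count are illustrative.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, dims, bins, min_count):
    """Grid cells, restricted to the attributes in dims, that hold at least min_count points."""
    counts = Counter(
        tuple(min(int(p[d] * bins), bins - 1) for d in dims) for p in points
    )
    return {cell for cell, n in counts.items() if n >= min_count}

def dense_subspaces(points, bins=4, min_count=3):
    """Bottom-up search that extends only subspaces whose projections are all dense."""
    n_dims = len(points[0])
    result = {}
    current = [(d,) for d in range(n_dims)]  # start with one-dimensional subspaces
    while current:
        found = {}
        for dims in current:
            units = dense_units(points, dims, bins, min_count)
            if units:
                found[dims] = units
        result.update(found)
        # Downward closure: a candidate of size k+1 is kept only if every
        # k-dimensional subset of its attributes was found to be dense.
        size = len(current[0]) + 1
        attrs = sorted({d for dims in found for d in dims})
        current = [
            cand for cand in combinations(attrs, size)
            if all(sub in found for sub in combinations(cand, size - 1))
        ]
    return result

# Four points form a group in attributes 0 and 1; attribute 2 is spread out.
points = [
    (0.10, 0.10, 0.05), (0.12, 0.15, 0.30), (0.15, 0.12, 0.55),
    (0.18, 0.20, 0.80), (0.60, 0.90, 0.95), (0.90, 0.40, 0.45),
]
result = dense_subspaces(points)
```

In this toy example attribute 2 has no dense cell, so no subspace containing it is ever examined; only the subspaces (0,), (1,) and (0, 1) are reported.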

=Projected clustering=

Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces. The general approach is to use a special distance function together with a regular clustering algorithm.

For example, the PreDeCon algorithm checks, for each point, which attributes seem to support a clustering, and adjusts the distance function so that dimensions with low variance are amplified.{{Cite conference | doi = 10.1109/ICDM.2004.10087| title = Density Connected Clustering with Local Subspace Preferences| conference = Fourth IEEE International Conference on Data Mining (ICDM'04)| pages = 27| year = 2004| last1 = Böhm | first1 = C.| last2 = Kailing | first2 = K.| last3 = Kriegel | first3 = H. -P. | author-link3 = Hans-Peter Kriegel| last4 = Kröger | first4 = P.| isbn = 0-7695-2142-8| url = http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/icdm04-predecon.pdf}} In the figure above, the cluster c_c might be found using DBSCAN with a distance function that places less emphasis on the x-axis and thus exaggerates the small differences along the y-axis enough to group the points into a cluster.
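
The idea of amplifying low-variance dimensions can be sketched with a weighted Euclidean distance. This is a simplified illustration in the spirit of PreDeCon, not the published algorithm: the threshold delta, the amplification factor kappa and the function names are assumptions made for this example, and the density-based clustering step itself is omitted.

```python
import math

def preference_weights(p, neighbors, delta, kappa):
    """Weight kappa for attributes in which the neighbors vary little around p, else 1."""
    weights = []
    for a in range(len(p)):
        var = sum((q[a] - p[a]) ** 2 for q in neighbors) / len(neighbors)
        weights.append(kappa if var <= delta else 1.0)
    return weights

def weighted_distance(p, q, weights):
    """Euclidean distance with a per-attribute weight on the squared differences."""
    return math.sqrt(sum(w * (pi - qi) ** 2 for w, pi, qi in zip(weights, p, q)))

# Points spread widely along x but tightly packed along y.
neighborhood = [(0.0, 1.00), (2.0, 1.02), (4.0, 0.98), (6.0, 1.01)]
p = (3.0, 1.0)
w = preference_weights(p, neighborhood, delta=0.01, kappa=100.0)
d_off_axis = weighted_distance(p, (3.0, 1.5), w)
```

With these values only the y attribute is amplified, so a point that deviates by 0.5 along y is placed at a weighted distance of 5.0 instead of 0.5, while differences along x are left unchanged.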

PROCLUS uses a similar approach with a k-medoid clustering.{{Cite journal | doi = 10.1145/304181.304188| title = Fast algorithms for projected clustering| journal = ACM SIGMOD Record| volume = 28| issue = 2| pages = 61| year = 1999| last1 = Aggarwal | first1 = C. C. | last2 = Wolf | first2 = J. L. | last3 = Yu | first3 = P. S. | last4 = Procopiuc | first4 = C. | last5 = Park | first5 = J. S. | citeseerx = 10.1.1.681.7363}} Initial medoids are guessed, and for each medoid the subspace spanned by attributes with low variance is determined. Points are assigned to the closest medoid, considering only the subspace of that medoid when determining the distance. The algorithm then proceeds as the regular PAM algorithm.
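
The assignment step described above can be sketched as follows. This is a simplified illustration loosely modelled on PROCLUS, not the published algorithm: it assumes a fixed number of attributes per medoid (n_dims_keep), uses the mean absolute deviation as the spread measure, and leaves out medoid initialisation and the iterative replacement of medoids. All function names are hypothetical.

```python
def low_variance_dims(medoid, locality, n_dims_keep):
    """Indices of the attributes along which the locality deviates least from the medoid."""
    n_attrs = len(medoid)
    spread = [
        sum(abs(q[a] - medoid[a]) for q in locality) / len(locality)
        for a in range(n_attrs)
    ]
    return sorted(sorted(range(n_attrs), key=lambda a: spread[a])[:n_dims_keep])

def subspace_distance(p, medoid, dims):
    """Mean absolute difference over the attributes selected for this medoid."""
    return sum(abs(p[a] - medoid[a]) for a in dims) / len(dims)

def assign(points, medoids, dims_per_medoid):
    """Index of the closest medoid for each point, measured in that medoid's subspace."""
    return [
        min(range(len(medoids)),
            key=lambda m: subspace_distance(p, medoids[m], dims_per_medoid[m]))
        for p in points
    ]

medoids = [(1.0, 5.0), (5.0, 1.0)]
localities = [
    [(1.1, 2.0), (0.9, 8.0), (1.0, 4.0)],  # tight along attribute 0
    [(2.0, 1.1), (8.0, 0.9), (4.0, 1.0)],  # tight along attribute 1
]
dims = [low_variance_dims(m, loc, 1) for m, loc in zip(medoids, localities)]
labels = assign([(1.05, 9.0), (9.0, 1.05)], medoids, dims)
```

Each medoid ends up with its own one-dimensional subspace, and each query point is assigned to the medoid whose low-variance attribute it matches.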

If the distance function weights attributes differently, but never with 0 (and hence never drops irrelevant attributes), the algorithm is called a "soft"-projected clustering algorithm.

= Projection-based clustering =

Projection-based clustering is based on a nonlinear projection of high-dimensional data into a two-dimensional space (Thrun, M. C., & Ultsch, A.: Using Projection based Clustering to Find Distance and Density based Clusters in High-Dimensional Data, J. Classif., pp. 1-33, doi: 10.1007/s00357-020-09373-2). Typical projection methods such as t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten, L., & Hinton, G.: Visualizing Data using t-SNE, Journal of Machine Learning Research, Vol. 9(11), pp. 2579-2605, 2008) or the neighbor retrieval visualizer (NerV) (Venna, J., Peltonen, J., Nybo, K., Aidos, H., & Kaski, S.: Information retrieval perspective to nonlinear dimensionality reduction for data visualization, The Journal of Machine Learning Research, Vol. 11, pp. 451-490, 2010) are used to project data explicitly into two dimensions, disregarding subspaces of dimension higher than two and preserving only relevant neighborhoods in the high-dimensional data. In the next step, the Delaunay graph (Delaunay, B.: Sur la sphere vide, Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk, Vol. 7(793-800), pp. 1-2, 1934) between the projected points is calculated, and each edge between two projected points is weighted with the high-dimensional distance between the corresponding high-dimensional data points. Thereafter, the shortest path between every pair of points is computed using the Dijkstra algorithm (Dijkstra, E. W.: A note on two problems in connexion with graphs, Numerische mathematik, Vol. 1(1), pp. 269-271, 1959). The shortest paths are then used in the clustering process, which involves a choice between two alternatives depending on the structure type in the high-dimensional data. This Boolean choice can be decided by looking at the topographic map of high-dimensional structures (Thrun, M. C., & Ultsch, A.: Uncovering High-Dimensional Structures of Projections from Dimensionality Reduction Methods, MethodsX, Vol. 7, pp. 101093, doi: 10.1016/j.mex.2020.101093, 2020).
In a benchmark of 34 comparable clustering methods, projection-based clustering was the only algorithm that was always able to find the high-dimensional distance-based or density-based structure of the dataset. Projection-based clustering is accessible in the open-source R package "ProjectionBasedClustering" on CRAN.{{Cite web|last=|first=|date=|title=CRAN - Package ProjectionBasedClustering|url=https://cran.r-project.org/web/packages/ProjectionBasedClustering/index.html|url-status=live|archive-url=https://web.archive.org/web/20180317152038/http://cran.r-project.org:80/web/packages/ProjectionBasedClustering/index.html |archive-date=2018-03-17 |access-date=|website=}}
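
The graph step of this procedure, weighting the edges of the graph of projected points with high-dimensional distances and then computing shortest paths, can be sketched as follows. The sketch assumes that the edge list has already been obtained from a Delaunay triangulation of the projected points; the edges and points used here are made up for illustration, and the projection and the final clustering step are not shown. It is unrelated to the implementation in the R package mentioned above.

```python
import heapq
import math

def weighted_graph(edges, high_dim_points):
    """Adjacency list whose edge weights are distances in the original space."""
    graph = {i: [] for i in range(len(high_dim_points))}
    for i, j in edges:
        w = math.dist(high_dim_points[i], high_dim_points[j])
        graph[i].append((j, w))
        graph[j].append((i, w))
    return graph

def dijkstra(graph, source):
    """Shortest-path length from source to every reachable vertex."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical input: four 3-dimensional points and the edges that a Delaunay
# triangulation of their 2-dimensional projection might contain.
high_dim_points = [(0, 0, 0), (3, 4, 0), (3, 4, 12), (0, 0, 1)]
edges = [(0, 1), (1, 2), (0, 3), (2, 3)]
graph = weighted_graph(edges, high_dim_points)
paths = dijkstra(graph, 0)
```

Running dijkstra once per vertex yields the shortest path between every pair of points, which is the input to the clustering step described above.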

=Bootstrap-based clustering=

Bootstrap aggregation (bagging) can be used to create multiple clusterings and aggregate the findings. This is done by taking random subsamples of the data, performing a cluster analysis on each of them, and then aggregating the results of the clusterings to generate a dissimilarity measure, which can then be used to explore and cluster the original data (Dudoit, S. and Fridlyand, J. (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19/9, 1090–1099. doi:10.1093/bioinformatics/btg038; Strehl, A. & Ghosh, J. (2002). Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research. 3. 583-617. 10.1162/153244303321897735).
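
The aggregation of clusterings into a dissimilarity measure can be sketched with a co-association count. The following is a minimal illustration under several assumptions: one-dimensional toy data, a tiny 2-means routine as the base clustering, subsampling without replacement, and a dissimilarity defined as one minus the fraction of shared subsamples in which two observations fall into the same cluster. None of these choices is prescribed by the cited papers, and the function names are hypothetical.

```python
import random

def two_means_1d(values, iterations=10):
    """Tiny 2-means on scalars, initialised at the first two values; one label per value."""
    centers = [values[0], values[1]]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min((0, 1), key=lambda c: abs(v - centers[c])) for v in values]
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster runs empty
                centers[c] = sum(members) / len(members)
    return labels

def bagged_dissimilarity(values, rounds=30, sample_size=6, seed=0):
    """1 - (times co-clustered / times co-sampled) for every pair of observations."""
    rng = random.Random(seed)
    n = len(values)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(rounds):
        idx = rng.sample(range(n), sample_size)
        labels = two_means_1d([values[i] for i in idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    # Pairs that were never drawn together get the maximal dissimilarity 1.0.
    return [
        [1.0 - together[i][j] / sampled[i][j] if sampled[i][j] else 1.0
         for j in range(n)]
        for i in range(n)
    ]

values = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
D = bagged_dissimilarity(values)
```

The resulting matrix D can be passed to any clustering method that accepts a dissimilarity matrix, for example hierarchical clustering.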

Since high-dimensional data are likely to have many non-informative features, weights can be used during the bagging process to increase the impact of the more informative aspects. This produces "ABC dissimilarities" which can then be used to explore and cluster the original data and also to assess which features appear to be more impactful in defining the clusters.

Amaratunga, D., Cabrera, J. & Kovtun, V. (2008). Microarray learning with ABC. Biostatistics, 9, 128–136. doi:10.1093/biostatistics/kxm017.

Amaratunga, D., Cabrera, J. & Lee, Y.S. (2014). Resampling-based similarity measures for high-dimensional data. Journal of Computational Biology, 22. doi:10.1089/cmb.2014.0195.

Cherkas, Y., Amaratunga, D., Raghavan, N., Sasaki, J. & McMillian, M. (2016). ABC gene-ranking for prediction of drug-induced cholestasis in rats. Toxicology Reports, 3, 252–261.

=Hybrid approaches=

Not all algorithms try to find either a unique cluster assignment for each point or all clusters in all subspaces; many settle for a result in between, where a set of possibly overlapping, but not necessarily exhaustive, clusters is found. An example is FIRES, which is in its basic approach a subspace clustering algorithm, but uses a heuristic too aggressive to credibly produce all subspace clusters.{{Cite conference | doi = 10.1109/ICDM.2005.5| title = A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data| conference = Fifth IEEE International Conference on Data Mining (ICDM'05)| pages = 250| year = 2005| last1 = Kriegel | first1 = H. | author-link = Hans-Peter Kriegel| last2 = Kröger | first2 = P.| last3 = Renz | first3 = M.| last4 = Wurst | first4 = S.| isbn = 0-7695-2278-5| url = http://www.dbs.informatik.uni-muenchen.de/~kroegerp/papers/ICDM05-FIRES.pdf}} Another hybrid approach is to include a human in the algorithmic loop: human domain expertise can help to reduce an exponential search space through heuristic selection of samples. This can be beneficial in the health domain where, for example, medical doctors are confronted with high-dimensional descriptions of patient conditions and measurements of the success of certain therapies. An important question for such data is how to compare and correlate patient conditions and therapy results across combinations of dimensions. The number of dimensions is often very large, so they need to be mapped to a smaller number of relevant dimensions that are more amenable to expert analysis. This is because irrelevant, redundant, and conflicting dimensions can negatively affect the effectiveness and efficiency of the whole analytic process.{{Cite journal | doi = 10.1007/s40708-016-0043-5| pmid = 27747817| pmc = 5106406| title = Visual analytics for concept exploration in subspaces of patient groups: Making sense of complex datasets with the Doctor-in-the-loop| journal = Brain Informatics| volume = 3| issue = 4| pages = 233–247| year = 2016| last1 = Hund | first1 = M. | last2 = Böhm | first2 = D.| last3 = Sturm | first3 = W.| last4 = Sedlmair | first4 = M.| last5 = Schreck | first5 = T.| last6 = Keim | first6 = D.A.| last7 = Majnaric | first7 = L.| last8 = Holzinger | first8 = A.}}

=Correlation clustering=

Another type of subspace is considered in correlation clustering (data mining).

Software

  • ELKI includes various subspace and correlation clustering algorithms
  • FCPS includes over fifty clustering algorithms (Thrun, M. C., & Stier, Q.: Fundamental Clustering Algorithms Suite, SoftwareX, Vol. 13(C), pp. 100642, doi: 10.1016/j.softx.2020.100642, 2021).
