Interpretable Clustering with the Distinguishability Criterion (2404.15967v2)
Abstract: Cluster analysis is a popular unsupervised learning tool used in many disciplines to identify heterogeneous sub-populations within a sample. However, validating cluster analysis results and determining the number of clusters in a data set remain outstanding problems. In this work, we present a global criterion called the Distinguishability criterion to quantify the separability of identified clusters and validate inferred cluster configurations. Our computational implementation of the Distinguishability criterion corresponds to the Bayes risk of a randomized classifier under the 0-1 loss. We propose a combined loss function-based computational framework that integrates the Distinguishability criterion with many commonly used clustering procedures, such as hierarchical clustering, k-means, and finite mixture models. We present these new algorithms as well as the results from comprehensive data analysis based on simulation studies and real data applications.
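The abstract's link between cluster separability and the risk of a randomized classifier under the 0-1 loss can be illustrated with a small sketch. This is one plausible reading, not the paper's implementation: it assumes each point i comes with a vector of cluster membership probabilities p[i, k] (for example, posterior probabilities from a fitted mixture model), and that the randomized classifier draws a label from that same vector. Under those assumptions the chance of a wrong label for point i is 1 - sum_k p[i, k]^2. The function name `randomized_classifier_risk` and the toy inputs are hypothetical.

```python
import numpy as np


def randomized_classifier_risk(membership_probs):
    """Average 0-1 risk of a classifier that samples label k with prob p[i, k].

    Assumption (not taken from the paper): the true label of point i is also
    distributed according to p[i, :], independently of the sampled label, so
    the mismatch probability for point i is 1 - sum_k p[i, k]**2.
    """
    p = np.asarray(membership_probs, dtype=float)
    if p.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_points, n_clusters)")
    if not np.allclose(p.sum(axis=1), 1.0):
        raise ValueError("each row must sum to 1")
    per_point_risk = 1.0 - np.sum(p ** 2, axis=1)
    return float(per_point_risk.mean())


# Toy membership probabilities (made up for illustration only).
well_separated = np.array([[0.99, 0.01], [0.02, 0.98]])
overlapping = np.array([[0.5, 0.5], [0.5, 0.5]])

risk_separated = randomized_classifier_risk(well_separated)
risk_overlapping = randomized_classifier_risk(overlapping)
print(risk_separated, risk_overlapping)
```

In this sketch a lower value means the clusters are easier to tell apart: nearly one-hot membership vectors give a risk close to 0, while two completely overlapping clusters give 0.5.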