
Deep Clustering Evaluation: How to Validate Internal Clustering Validation Measures (2403.14830v1)

Published 21 Mar 2024 in stat.ML and cs.LG

Abstract: Deep clustering, which partitions complex, high-dimensional data using deep neural networks, presents unique evaluation challenges. Traditional clustering validation measures, designed for low-dimensional spaces, are problematic for deep clustering, which projects data into lower-dimensional embeddings before partitioning. Two key issues are identified: 1) the curse of dimensionality when applying these measures to raw data, and 2) unreliable comparisons of clustering results across different embedding spaces, stemming from variations in training procedures and parameter settings across clustering models. This paper addresses these challenges in evaluating clustering quality in deep learning. We present a theoretical framework that highlights the ineffectiveness of applying internal validation measures to raw and embedded data, and we propose a systematic approach to applying clustering validity indices in deep clustering contexts. Experiments show that this framework aligns better with external validation measures, effectively reducing the misguidance caused by improper use of clustering validity indices in deep learning.
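The core issue the abstract raises can be made concrete with a small sketch (this is an illustration, not the paper's proposed framework): an internal validity index such as the silhouette score gives different readings depending on whether it is computed in the raw high-dimensional space or in the embedding space where clustering actually happened, whereas an external measure like ARI uses ground-truth labels and is space-independent. Here PCA is a stand-in for the deep embedding the paper considers.

```python
# Sketch: internal validity index (silhouette) on raw vs. embedded data,
# contrasted with an external measure (adjusted Rand index) that relies
# on ground-truth labels. PCA stands in for a learned deep embedding.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Synthetic high-dimensional data with known cluster structure.
X, y_true = make_blobs(n_samples=300, centers=4, n_features=50, random_state=0)

# Project to a low-dimensional embedding, then cluster in that space.
Z = PCA(n_components=2, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)

# The same partition scores differently in the two spaces: this ambiguity
# (which space to evaluate in) is one of the problems the paper studies.
print("silhouette (raw space):     ", silhouette_score(X, labels))
print("silhouette (embedding):     ", silhouette_score(Z, labels))
print("ARI vs. ground-truth labels:", adjusted_rand_score(y_true, labels))
```

Because the embedding is optimized (or, in deep clustering, trained) to make clusters compact, the embedded-space silhouette is typically inflated relative to the raw-space one, which is why comparing internal indices across different learned embeddings is unreliable.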

