Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means (2311.15384v1)

Published 26 Nov 2023 in stat.ML, cs.LG, and stat.ME

Abstract: Clustering is one of the central challenges in unsupervised machine learning. Among centroid-based clustering algorithms, the classic $k$-means algorithm, built on Lloyd's heuristic, remains one of the most widely used techniques in the literature. Nonetheless, $k$-means and its variants suffer from notable limitations: heavy reliance on the initial cluster centroids, a tendency to converge to local minima of the objective function, and sensitivity to outliers and noise in the data. When the data contain noisy or outlier-laden observations, the Median-of-Means (MoM) estimator acts as a stabilizing force for any centroid-based clustering framework. A further constraint shared by most existing clustering methods is that the number of clusters must be known in advance. Model-based approaches, such as Bayesian nonparametric models, offer infinite mixture models that remove this requirement. Motivated by these facts, this article presents an efficient and automatic clustering technique that integrates model-based and centroid-based principles, mitigating the effect of noise on clustering quality while ensuring that the number of clusters need not be specified in advance. Statistical guarantees in the form of an upper bound on the clustering error, together with a rigorous assessment on simulated and real datasets, suggest the advantages of the proposed method over existing state-of-the-art clustering algorithms.

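The abstract combines two ingredients: a Median-of-Means (MoM) estimator that robustifies centroid updates against outliers, and a Bayesian nonparametric (Dirichlet process) treatment of the number of clusters so that it need not be fixed in advance. The sketch below is only a minimal illustration of how these pieces can fit together, not the paper's actual algorithm or its guarantees: it pairs a coordinate-wise MoM centre estimate with a DP-means-style penalty for opening a new cluster (in the spirit of Kulis and Jordan's "Revisiting k-means"), and the function names, the `lam` and `n_blocks` parameters, and the fixed iteration count are assumptions made purely for this example.

```python
import numpy as np


def median_of_means(X, n_blocks=5, rng=None):
    """Coordinate-wise Median-of-Means estimate of the centre of the rows of X.

    The rows are shuffled and split into roughly equal blocks, each block is
    averaged, and the coordinate-wise median of the block means is returned.
    With fewer rows than blocks, fall back to the plain mean.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    if n < n_blocks:
        return X.mean(axis=0)
    idx = rng.permutation(n)
    block_means = [X[b].mean(axis=0) for b in np.array_split(idx, n_blocks)]
    return np.median(np.vstack(block_means), axis=0)


def dp_mom_clustering(X, lam, n_blocks=5, n_iter=50, seed=0):
    """DP-means-style clustering with MoM centroid updates (illustrative only).

    lam is the squared-distance penalty for opening a new cluster; it loosely
    plays the role of the concentration control in a Dirichlet process mixture.
    """
    rng = np.random.default_rng(seed)
    centroids = [median_of_means(X, n_blocks, rng)]  # start from a single cluster
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest existing centroid, or a brand-new cluster.
        for i, x in enumerate(X):
            d2 = [np.sum((x - c) ** 2) for c in centroids]
            j = int(np.argmin(d2))
            if d2[j] > lam:                  # too far from every centroid
                centroids.append(x.copy())   # open a new cluster at this point
                labels[i] = len(centroids) - 1
            else:
                labels[i] = j
        # Update step: robust MoM estimate of each cluster centre
        # (keep the old centroid if a cluster happens to be empty).
        centroids = [
            median_of_means(X[labels == k], n_blocks, rng)
            if np.any(labels == k) else c
            for k, c in enumerate(centroids)
        ]
    return np.vstack(centroids), labels


if __name__ == "__main__":
    # Toy data: two well-separated Gaussian blobs plus a few gross outliers.
    gen = np.random.default_rng(1)
    X = np.vstack([
        gen.normal(0.0, 1.0, size=(100, 2)),
        gen.normal(8.0, 1.0, size=(100, 2)),
        gen.uniform(-50.0, 50.0, size=(5, 2)),   # outliers
    ])
    centres, labels = dp_mom_clustering(X, lam=16.0)
    print(f"{len(centres)} clusters found for {X.shape[0]} points")
```

In this naive sketch, points that lie far from every existing centre simply open singleton clusters (a known quirk of DP-means), while the MoM update keeps each retained centre from being dragged by contamination that lands inside a cluster; `lam` loosely controls how readily new clusters open, with smaller values producing more clusters. The paper's actual procedure and its clustering-error bound should be consulted for how these issues are handled rigorously.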