Universal Lower Bounds and Optimal Rates: Achieving Minimax Clustering Error in Sub-Exponential Mixture Models (2402.15432v2)
Abstract: Clustering is a pivotal challenge in unsupervised machine learning and is often investigated through the lens of mixture models. The optimal error rate for recovering cluster labels in Gaussian and sub-Gaussian mixture models involves ad hoc signal-to-noise ratios, and simple iterative algorithms, such as Lloyd's algorithm, attain this optimal rate. In this paper, we first establish a universal lower bound on the error rate for clustering in any mixture model, expressed through a Chernoff divergence, a more versatile measure of model information than signal-to-noise ratios. We then demonstrate that iterative algorithms attain this lower bound in mixture models with sub-exponential tails, with particular emphasis on location-scale mixtures with Laplace-distributed errors. Additionally, for datasets better modelled by Poisson or Negative Binomial mixtures, we study mixture models whose distributions belong to an exponential family. For such mixtures, we establish that Bregman hard clustering, a variant of Lloyd's algorithm employing a Bregman divergence, is rate optimal.
- Abbe, E. and C. Sandon (2015). Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 670–688. IEEE.
- Avrachenkov, K., M. Dreveton, and L. Leskelä (2020). Community recovery in non-binary and temporal stochastic block models. arXiv preprint arXiv:2008.04790.
- Bandeira, A. S. and R. van Handel (2016). Sharp nonasymptotic bounds on the norm of random matrices with independent entries. The Annals of Probability 44(4), 2479–2506.
- Banerjee, A., S. Merugu, I. S. Dhillon, and J. Ghosh (2005). Clustering with Bregman divergences. Journal of Machine Learning Research 6(10), 1705–1749.
- Bouveyron, C. and C. Brunet-Saumard (2014). Model-based clustering of high-dimensional data: A review. Computational Statistics & Data Analysis 71, 52–78.
- Brécheteau, C., A. Fischer, and C. Levrard (2021). Robust Bregman clustering. The Annals of Statistics 49(3), 1679–1701.
- Chen, X. and Y. Yang (2021). Cutoff for exact recovery of Gaussian mixture models. IEEE Transactions on Information Theory 67(6), 4223–4238.
- Chen, X. and A. Y. Zhang (2021). Optimal clustering in anisotropic Gaussian mixture models. arXiv preprint arXiv:2101.05402.
- Cuesta-Albertos, J. A., A. Gordaliza, and C. Matrán (1997). Trimmed k-means: an attempt to robustify quantizers. The Annals of Statistics 25(2), 553–576.
- Tail bounds on the spectral norm of sub-exponential random matrices. Random Matrices: Theory and Applications.
- Dreveton, M., F. S. Fernandes, and D. R. Figueiredo (2023). Exact recovery and Bregman hard clustering of node-attributed stochastic block model. In Thirty-seventh Conference on Neural Information Processing Systems.
- Gao, C., Z. Ma, A. Y. Zhang, and H. H. Zhou (2018). Community detection in degree-corrected block models. The Annals of Statistics 46(5), 2153–2185.
- Gao, C. and A. Y. Zhang (2022). Iterative algorithm for discrete structure recovery. The Annals of Statistics 50(2), 1066–1094.
- García-Escudero, L. A., A. Gordaliza, C. Matrán, and A. Mayo-Iscar (2008). A general trimming approach to robust cluster analysis. The Annals of Statistics 36(3), 1324–1345.
- Gil, M., F. Alajaji, and T. Linder (2013). Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences 249, 124–131.
- A robust model-based clustering based on the geometric median and the median covariation matrix. Statistics and Computing 34(1), 55.
- Grün, D., L. Kester, and A. van Oudenaarden (2014). Validation of noise models for single-cell transcriptomics. Nature Methods 11(6), 637–640.
- Haque, A., J. Engel, S. A. Teichmann, and T. Lönnberg (2017). A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Medicine 9(1), 1–12.
- Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
- Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics 35(1), 73–101.
- Adversarially robust clustering with optimality guarantees. arXiv preprint arXiv:2306.09977.
- Kumar, A., Y. Sabharwal, and S. Sen (2004). A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In 45th Annual IEEE Symposium on Foundations of Computer Science, pp. 454–462. IEEE.
- Lelarge, M. and L. Miolane (2019). Asymptotic Bayes risk for Gaussian mixture in a semi-supervised setting. In 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 639–643. IEEE.
- Li, T., X. Yi, C. Caramanis, and P. Ravikumar (2017). Minimax Gaussian classification & clustering. In Artificial Intelligence and Statistics, pp. 1–9.
- Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137.
- Lu, Y. and H. H. Zhou (2016). Statistical and computational guarantees of Lloyd’s algorithm and its variants. arXiv preprint arXiv:1612.02099.
- McLachlan, G. J., S. X. Lee, and S. I. Rathnayake (2019). Finite mixture models. Annual Review of Statistics and Its Application 6, 355–378.
- Minsker, S., M. Ndaoud, and Y. Shen (2021). Minimax supervised clustering in the anisotropic Gaussian mixture model: A new take on robust interpolation. arXiv preprint arXiv:2111.07041.
- Ndaoud, M. (2022). Sharp optimal recovery in the two component Gaussian mixture model. The Annals of Statistics 50(4), 2096–2126.
- Consistency of Lloyd’s algorithm under perturbations. arXiv preprint arXiv:2309.00578.
- Can semi-supervised learning use all the data effectively? A lower bound perspective. In Thirty-seventh Conference on Neural Information Processing Systems.
- Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In Contributions to Probability and Statistics, pp. 448–485.
- Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, Volume 47. Cambridge University Press.
- Wu, X., V. Kumar, et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems 14, 1–37.