Robust Mixture Learning when Outliers Overwhelm Small Groups (2407.15792v1)
Abstract: We study the problem of estimating the means of well-separated mixtures when an adversary may add arbitrary outliers. While strong guarantees are available when the outlier fraction is significantly smaller than the minimum mixing weight, much less is known when outliers may crowd out low-weight clusters, a setting we refer to as list-decodable mixture learning (LD-ML). In this case, adversarial outliers can simulate additional spurious mixture components. Hence, if all means of the mixture must be recovered up to a small error in the output list, the list size needs to be larger than the number of (true) components. We propose an algorithm that obtains order-optimal error guarantees for each mixture mean with a minimal list-size overhead, significantly improving upon list-decodable mean estimation, the only existing method applicable to LD-ML. Although improvements are observed even when the mixture is non-separated, our algorithm achieves particularly strong guarantees when the mixture is separated: it can leverage the mixture structure to partially cluster the samples before carefully iterating a base learner for list-decodable mean estimation at different scales.
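The two-stage structure described in the abstract, partially clustering the samples and then running a base list-decodable mean estimator inside each cluster, can be illustrated with a toy sketch. Everything below is an assumption for illustration: `partial_cluster` is a naive greedy distance-threshold grouping, and `base_learner` is a stand-in that returns coordinate-wise medians of random subsamples, not the paper's actual base learner or guarantees.

```python
import numpy as np

def partial_cluster(X, radius):
    """Greedy grouping (illustrative stand-in for the paper's
    clustering step): a point joins the first cluster whose
    representative is within `radius`, else starts a new cluster."""
    reps, clusters = [], []
    for x in X:
        for i, r in enumerate(reps):
            if np.linalg.norm(x - r) <= radius:
                clusters[i].append(x)
                break
        else:
            reps.append(x)
            clusters.append([x])
    return [np.array(c) for c in clusters]

def base_learner(points, list_size):
    """Toy stand-in for a list-decodable mean estimator: returns
    `list_size` candidate means as coordinate-wise medians of
    random half-subsamples of the cluster."""
    rng = np.random.default_rng(0)
    cands = []
    for _ in range(list_size):
        idx = rng.choice(len(points), size=max(1, len(points) // 2),
                         replace=False)
        cands.append(np.median(points[idx], axis=0))
    return cands

def ld_ml_sketch(X, radius, list_size_per_cluster):
    """Two-stage LD-ML sketch: partially cluster, then run the base
    learner inside each cluster; the output list is the union of
    all candidate means across clusters."""
    out = []
    for cluster in partial_cluster(X, radius):
        out.extend(base_learner(cluster, list_size_per_cluster))
    return np.array(out)
```

On a well-separated mixture with a far-away outlier blob, the outliers land in their own cluster and inflate the list size (the "spurious component" effect), while each true mean is still approximated by some candidate in the list.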