Adversarially robust clustering with optimality guarantees (2306.09977v2)

Published 16 Jun 2023 in math.ST, cs.LG, and stat.TH

Abstract: We consider the problem of clustering data points drawn from sub-Gaussian mixtures. Existing methods that provably achieve the optimal mislabeling error, such as the Lloyd algorithm, are usually vulnerable to outliers. Conversely, clustering methods that appear robust to adversarial perturbations are not known to satisfy the optimal statistical guarantees. We propose a simple robust algorithm based on the coordinatewise median that obtains the optimal mislabeling rate even in the presence of adversarial outliers. Our algorithm achieves the optimal error rate in a constant number of iterations when a weak initialization condition is satisfied. In the absence of outliers, in fixed dimensions, our theoretical guarantees are similar to those of the Lloyd algorithm. Extensive experiments on various simulated and public datasets are conducted to support the theoretical guarantees of our method.
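The core idea in the abstract, replacing the mean update in Lloyd-style iterations with a coordinatewise median, can be illustrated with a minimal sketch. This is a hypothetical illustration of the general technique, not the authors' reference implementation; the function name, the plain random initialization (the paper assumes a weak initialization condition, which careful seeding would serve better), and the fixed iteration count are all assumptions made for brevity.

```python
import numpy as np

def median_clustering(X, k, n_iter=10, seed=0):
    """Lloyd-style clustering that updates each center with the
    coordinatewise median of its cluster (a sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    # Initialize centers by sampling k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: coordinatewise median instead of the mean,
        # so a few adversarial outliers cannot drag a center away.
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = np.median(pts, axis=0)
    return labels, centers
```

The only change from the classical Lloyd algorithm is the update step: `np.median(pts, axis=0)` in place of `pts.mean(axis=0)`. Since each coordinate's median has a breakdown point of 1/2 within a cluster, a bounded fraction of outliers leaves the centers essentially unmoved, which is the intuition behind the robustness claim.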
