
A provable initialization and robust clustering method for general mixture models (2401.05574v3)

Published 10 Jan 2024 in math.ST, stat.ML, and stat.TH

Abstract: Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data. Most recent results focus primarily on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors. Yet the restrictive sub-Gaussian model is often invalid in practice, since various real-world applications exhibit heavy-tailed distributions around the centroids or suffer from possible adversarial attacks that call for robust clustering with a robust data-driven initialization. In this paper, we present initialization and subsequent clustering methods that provably guarantee near-optimal mislabeling for general mixture models when the number of clusters and data dimensions are finite. We first introduce a hybrid clustering technique with a novel multivariate trimmed-mean-type centroid estimate to produce mislabeling guarantees under a weak initialization condition for general error distributions around the centroids. A matching lower bound is derived, up to factors depending on the number of clusters. In addition, our approach also produces similar mislabeling guarantees even in the presence of adversarial outliers. Our results reduce to the sub-Gaussian case in finite dimensions when errors follow sub-Gaussian distributions. To solve the problem thoroughly, we also present novel data-driven robust initialization techniques and show that, with probabilities approaching one, these initial centroid estimates are sufficiently good for the subsequent clustering algorithm to achieve the optimal mislabeling rates. Furthermore, we demonstrate that the Lloyd algorithm is suboptimal for more than two clusters even when errors are Gaussian, and for two clusters when error distributions have heavy tails. Both simulated data and real data examples further support our robust initialization procedure and clustering algorithm.
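To make the idea of a trimmed-mean centroid update concrete, here is a minimal Python sketch. It is not the paper's estimator: the abstract does not specify the multivariate trimmed mean or the hybrid clustering step, so this sketch assumes a plain coordinate-wise trimmed mean inside a Lloyd-style assign/update loop. The function names (`trimmed_mean`, `robust_lloyd`) and the parameter `trim_frac` are hypothetical, chosen only for illustration, and the initial centroids are supplied by hand rather than by a data-driven initialization.

```python
import random


def trimmed_mean(values, trim_frac):
    """Average the values after dropping the trim_frac smallest and largest."""
    ordered = sorted(values)
    k = int(trim_frac * len(ordered))  # number dropped from each end
    kept = ordered[k:len(ordered) - k] if len(ordered) > 2 * k else ordered
    return sum(kept) / len(kept)


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def robust_lloyd(points, init_centroids, trim_frac=0.1, n_iter=20):
    """Lloyd-style iteration with a coordinate-wise trimmed-mean update.

    Assumption: this coordinate-wise update is a stand-in for illustration,
    not the multivariate estimator analyzed in the paper.
    """
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest current centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Update step: trimmed mean per coordinate within each cluster.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                continue  # keep the old centroid if the cluster is empty
            dim = len(members[0])
            centroids[j] = [
                trimmed_mean([m[d] for m in members], trim_frac)
                for d in range(dim)
            ]
    return centroids, labels


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
    data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]
    data += [(1000.0, 1000.0)] * 3  # a few gross outliers
    centers, _ = robust_lloyd(data, [(1.0, 1.0), (9.0, 9.0)])
    print(centers)
```

With an ordinary mean in the update step, the three outliers would drag the second centroid far away from (10, 10); trimming each coordinate discards them as long as `trim_frac` exceeds the outlier fraction within the cluster.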
