A provable initialization and robust clustering method for general mixture models (2401.05574v3)
Abstract: Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data. Most recent results focus on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors. Yet the restrictive sub-Gaussian model is often invalid in practice, since many real-world applications exhibit heavy-tailed distributions around the centroids or suffer from possible adversarial attacks, which calls for robust clustering with a robust data-driven initialization. In this paper, we present initialization and subsequent clustering methods that provably guarantee near-optimal mislabeling for general mixture models when the number of clusters and the data dimension are finite. We first introduce a hybrid clustering technique with a novel multivariate trimmed-mean-type centroid estimate that produces mislabeling guarantees under a weak initialization condition for general error distributions around the centroids. A matching lower bound is derived, up to factors depending on the number of clusters. Our approach also produces similar mislabeling guarantees in the presence of adversarial outliers. Our results reduce to the sub-Gaussian case in finite dimensions when errors follow sub-Gaussian distributions. To solve the problem thoroughly, we also present novel data-driven robust initialization techniques and show that, with probability approaching one, these initial centroid estimates are sufficiently good for the subsequent clustering algorithm to achieve the optimal mislabeling rates. Furthermore, we demonstrate that the Lloyd algorithm is suboptimal for more than two clusters even when errors are Gaussian, and for two clusters when error distributions have heavy tails. Both simulated and real data examples further support our robust initialization procedure and clustering algorithm.