Debiasing and a local analysis for population clustering using semidefinite programming (2401.10927v1)
Abstract: In this paper, we consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions. In particular, we analyze computationally efficient algorithms proposed by the same author, to partition data into two groups approximately according to their population of origin given a small sample. This work is motivated by the application of clustering individuals according to their population of origin using $p$ markers, when the divergence between any two of the populations is small. We build upon the semidefinite relaxation of an integer quadratic program that is formulated essentially as finding the maximum cut on a graph, where edge weights in the cut represent dissimilarity scores between two nodes based on their $p$ features. Here we use $\Delta^2 := p \gamma$ to denote the $\ell_2^2$ distance between two centers (mean vectors), namely, $\mu^{(1)}, \mu^{(2)} \in \mathbb{R}^p$. The goal is to allow a full range of tradeoffs between $n, p, \gamma$ in the sense that partial recovery (success rate $< 100\%$) is feasible once the signal to noise ratio $s^2 := \min\{np \gamma^2, \Delta^2\}$ is lower bounded by a constant. Importantly, we prove that the misclassification error decays exponentially with respect to the SNR $s^2$. This result was introduced earlier without a full proof. We therefore present the full proof in the present work. Finally, for balanced partitions, we consider a variant of the SDP1, and show that the new estimator has a superb debiasing property. This is novel to the best of our knowledge.
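As a small illustration of the signal to noise ratio defined in the abstract, the sketch below evaluates $\Delta^2 = p\gamma$ and $s^2 = \min\{np\gamma^2, \Delta^2\}$ for given $n$, $p$, $\gamma$. The function name `snr_squared` and the numeric values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def snr_squared(n: int, p: int, gamma: float) -> float:
    """Return s^2 = min(n * p * gamma^2, Delta^2), where Delta^2 = p * gamma.

    n: sample size, p: number of features (markers),
    gamma: per-feature separation, so that Delta^2 = p * gamma is the
    squared l2 distance between the two mean vectors.
    """
    delta_sq = p * gamma              # squared distance between the centers
    sample_term = n * p * gamma ** 2  # term that depends on the sample size
    return min(sample_term, delta_sq)


if __name__ == "__main__":
    # Hypothetical values: the sample term binds exactly when n * gamma < 1.
    for n in (100, 1000):
        print(n, snr_squared(n, p=10_000, gamma=0.005))
```

Since $np\gamma^2 < \Delta^2$ exactly when $n\gamma < 1$, the first call is limited by the sample-size term and the second by $\Delta^2$, which is the tradeoff between $n$, $p$, and $\gamma$ that the abstract refers to.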
Author: Shuheng Zhou