Semi-Supervised Clustering of Sparse Graphs: Crossing the Information-Theoretic Threshold (2205.11677v4)
Abstract: The stochastic block model is a canonical random graph model for clustering and community detection on network-structured data. Decades of study have established many profound results, among which the phase transition at the Kesten-Stigum threshold is particularly interesting from both a mathematical and an applied standpoint. It states that, on sparse graphs, no estimator based on the network topology alone can perform substantially better than chance when the model parameter is below a certain threshold. Nevertheless, if we slightly extend the horizon to the ubiquitous semi-supervised setting, this fundamental limitation disappears completely. We prove that, with an arbitrary fraction of the labels revealed, the detection problem is feasible throughout the parameter domain. Moreover, we introduce two efficient algorithms, one combinatorial and one based on optimization, to integrate label information with graph structures. Our work brings a new perspective to stochastic models of networks and to semidefinite programming research.
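To make the setting concrete, the following is a minimal sketch of the semi-supervised model, not the paper's algorithms. It assumes the common symmetric two-community parametrization, which the abstract does not spell out: vertices in the same community are joined with probability a/n and vertices in different communities with probability b/n. Under that parametrization the Kesten-Stigum condition reads (a - b)^2 > 2(a + b). The function names, the `reveal_fraction` parameter, and the choice to reveal each label independently are illustrative assumptions.

```python
import random


def ks_snr(a: float, b: float) -> float:
    """Signal-to-noise ratio (a - b)^2 / (2 (a + b)) of the symmetric
    two-community sparse SBM (assumed parametrization: a/n within, b/n across).
    The Kesten-Stigum threshold corresponds to this ratio being equal to 1."""
    return (a - b) ** 2 / (2.0 * (a + b))


def sample_semi_supervised_sbm(n, a, b, reveal_fraction, seed=0):
    """Sample labels, edges, and a randomly revealed subset of labels.

    Hypothetical helper for illustration: each label is revealed
    independently with probability reveal_fraction.
    """
    rng = random.Random(seed)
    # Hidden community labels, uniform over {-1, +1}.
    labels = [rng.choice((-1, 1)) for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Sparse regime: edge probabilities scale as 1/n.
            p = a / n if labels[i] == labels[j] else b / n
            if rng.random() < p:
                edges.append((i, j))
    # Side information: a fraction of the true labels is observed.
    revealed = {i: labels[i] for i in range(n) if rng.random() < reveal_fraction}
    return labels, edges, revealed


if __name__ == "__main__":
    a, b = 3.0, 2.0  # below the Kesten-Stigum threshold: ks_snr(a, b) < 1
    labels, edges, revealed = sample_semi_supervised_sbm(500, a, b, 0.1)
    print("SNR:", ks_snr(a, b))
    print("edges:", len(edges), "revealed labels:", len(revealed))
```

With a = 3 and b = 2 the ratio is well below 1, which is the regime where topology-only estimators fail and where, per the abstract, revealed labels make detection feasible.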