Untangling Gaussian Mixtures (2403.06671v1)
Abstract: Tangles were originally introduced as a concept to formalize regions of high connectivity in graphs. In recent years, they have also been discovered as a link between structural graph theory and data science: when interpreting similarity in data sets as connectivity between points, finding clusters in the data essentially amounts to finding tangles in the underlying graphs. This paper further explores the potential of tangles in data sets as a means for a formal study of clusters. Real-world data often follow a normal distribution. Accounting for this, we develop a quantitative theory of tangles in data sets drawn from Gaussian mixtures. To this end, we equip the data with a graph structure that models similarity between the points and allows us to apply tangle theory to the data. We provide explicit conditions under which tangles associated with the marginal Gaussian distributions exist asymptotically almost surely. This can be regarded as a sufficient formal criterion for the separability of clusters in the data.
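To make the setting concrete, the following is a minimal sketch of data drawn from a Gaussian mixture and equipped with a similarity graph. The abstract does not specify the graph construction, so everything here is an assumption made for illustration: the one-dimensional mixture, the threshold graph with parameter `eps`, the helper names `sample_mixture`, `similarity_graph` and `cut_order`, and the choice of counting crossing edges as the order of a bipartition. It is not the construction used in the paper.

```python
import random


def sample_mixture(n, means, sigmas, weights, seed=0):
    """Draw n points from a 1-D Gaussian mixture; return (points, labels)."""
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n):
        # Pick a mixture component, then sample from its Gaussian.
        k = rng.choices(range(len(means)), weights=weights)[0]
        points.append(rng.gauss(means[k], sigmas[k]))
        labels.append(k)
    return points, labels


def similarity_graph(points, eps):
    """Assumed similarity model: join two points if they are within eps."""
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(points[i] - points[j]) <= eps]


def cut_order(edges, side):
    """Number of edges crossing the bipartition (side, complement)."""
    return sum((i in side) != (j in side) for i, j in edges)


points, labels = sample_mixture(
    200, means=[0.0, 10.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5]
)
edges = similarity_graph(points, eps=0.5)

# Bipartition at the midpoint between the two (hypothetical) means.
side = {i for i, x in enumerate(points) if x < 5.0}
print("edges:", len(edges), "crossing edges:", cut_order(edges, side))
```

In this toy model, a bipartition that separates the two components is crossed by few edges, while a bipartition that splits one component is crossed by many. Low-order cuts of the first kind are the ones a tangle would orient consistently towards a cluster.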