$G$-Mapper: Learning a Cover in the Mapper Construction (2309.06634v3)
Abstract: The Mapper algorithm is a visualization technique in topological data analysis (TDA) that outputs a graph reflecting the structure of a given dataset. However, the Mapper algorithm requires tuning several parameters in order to generate a "nice" Mapper graph. This paper focuses on selecting the cover parameter. We present an algorithm that optimizes the cover of a Mapper graph by repeatedly splitting the cover according to a statistical test for normality. Our algorithm is based on $G$-means clustering, which searches for the optimal number of clusters in $k$-means by iteratively applying the Anderson-Darling test. Our splitting procedure employs a Gaussian mixture model to choose the cover according to the distribution of the given data. Experiments on synthetic and real-world datasets demonstrate that our algorithm generates covers for which the Mapper graphs retain the essence of the datasets, while also running quickly.
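The splitting idea described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names (`anderson_darling`, `fit_gmm2`, `learn_cover`), the overlap parameter `g`, the 5% critical value `0.752`, the `min_points` and `max_depth` stopping rules, and the exact rule for placing the two overlapping endpoints are all assumptions made for illustration. A hand-written EM loop for a two-component one-dimensional Gaussian mixture stands in for a library GMM so that the example needs only the Python standard library.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def anderson_darling(xs):
    """Modified Anderson-Darling statistic for normality (mean, variance estimated)."""
    n = len(xs)
    mu, sigma = fmean(xs), stdev(xs)
    std = NormalDist()
    z = sorted((x - mu) / sigma for x in xs)
    # Clamp CDF values away from 0 and 1 so the logarithms stay finite.
    cdf = [min(max(std.cdf(v), 1e-12), 1 - 1e-12) for v in z]
    s = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
            for i in range(n))
    a2 = -n - s / n
    # Small-sample correction for the case of unknown mean and variance.
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)


def fit_gmm2(xs, iters=50):
    """Fit a two-component 1D Gaussian mixture with plain EM (illustrative only)."""
    ordered = sorted(xs)
    half = len(ordered) // 2
    lo, hi = ordered[:half], ordered[half:]
    # Each component is (weight, mean, standard deviation).
    comps = [(0.5, fmean(lo), max(stdev(lo), 1e-6)),
             (0.5, fmean(hi), max(stdev(hi), 1e-6))]
    for _ in range(iters):
        dists = [NormalDist(m, s) for _, m, s in comps]
        resp = []
        for x in xs:  # E-step: responsibilities of each component for x
            p = [comps[k][0] * dists[k].pdf(x) for k in range(2)]
            tot = sum(p)
            resp.append([pk / tot for pk in p] if tot > 0 else [0.5, 0.5])
        new = []
        for k in range(2):  # M-step: re-estimate weight, mean, std
            nk = max(sum(r[k] for r in resp), 1e-12)
            m = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - m) ** 2 for r, x in zip(resp, xs)) / nk
            new.append((nk / len(xs), m, max(math.sqrt(var), 1e-6)))
        comps = new
    return sorted(comps, key=lambda c: c[1])


def learn_cover(values, a, b, g=0.1, crit=0.752, min_points=20,
                depth=0, max_depth=6):
    """Recursively split [a, b] while the values inside fail the normality test."""
    inside = [v for v in values if a <= v <= b]
    if depth >= max_depth or len(inside) < min_points or stdev(inside) == 0:
        return [(a, b)]
    if anderson_darling(inside) <= crit:
        return [(a, b)]  # looks Gaussian enough: keep the interval
    (_, m1, s1), (_, m2, s2) = fit_gmm2(inside)
    d = m2 - m1
    if d <= 0:
        return [(a, b)]
    # Assumed split rule: place the cut between the two means, weighted by the
    # standard deviations, and extend each side by a fraction g to overlap.
    upper = min(b, m1 + (1 + g) * s1 / (s1 + s2) * d)
    lower = max(a, m2 - (1 + g) * s2 / (s1 + s2) * d)
    return (learn_cover(values, a, upper, g, crit, min_points, depth + 1, max_depth)
            + learn_cover(values, lower, b, g, crit, min_points, depth + 1, max_depth))


# Hypothetical lens values: two well-separated Gaussian clusters.
rng = random.Random(0)
data = ([rng.gauss(0, 1) for _ in range(200)]
        + [rng.gauss(10, 1) for _ in range(200)])
cover = learn_cover(data, min(data), max(data))
for left, right in cover:
    print(f"[{left:.2f}, {right:.2f}]")
```

In this sketch an interval is kept once the values inside it pass the normality test, so the number of cover elements adapts to the data rather than being fixed in advance. The depth limit and minimum point count are only safeguards against endless splitting of truncated samples.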
- Enrique Alvarado
- Robin Belton
- Emily Fischer
- Kang-Ju Lee
- Sourabh Palande
- Sarah Percival
- Emilie Purvine