Hierarchical topological clustering (2601.00892v1)

Published 31 Dec 2025 in cs.LG, cs.CV, physics.data-an, stat.ME, and stat.ML

Abstract: Topological methods have the potential of exploring data clouds without making assumptions on their structure. Here we propose a hierarchical topological clustering algorithm that can be implemented with any distance choice. The persistence of outliers and clusters of arbitrary shape is inferred from the resulting hierarchy. We demonstrate the potential of the algorithm on selected datasets in which outliers play relevant roles, consisting of images, medical and economic data. These methods can provide meaningful clusters in situations in which other techniques fail to do so.


Summary

The paper presents a method that leverages persistent H0 homology to extract clusters and persistent outliers without arbitrary parameter tuning.
The algorithm constructs a hierarchical partition via the Vietoris-Rips filtration, ensuring robust detection in diverse datasets including biomedical and economic data.
Practical applications illustrate superior performance over traditional methods like K-means and DBSCAN, particularly in detecting rare events and semantically salient structures.

Hierarchical Topological Clustering: Persistent Homology for Data Organization

Introduction and Motivation

Traditional clustering algorithms such as K-means, agglomerative hierarchical clustering, and DBSCAN each possess inherent biases in favor of specific geometric or density-driven data structures, and their ability to handle noise and outliers is limited by explicit or implicit parameter choices. The "Hierarchical topological clustering" (HTC) framework (2601.00892) directly addresses the challenge of extracting meaningful clusters and persistent outliers from datasets with arbitrary geometry by leveraging concepts from topological data analysis, particularly persistent homology at the $H_0$ level.

HTC is specifically designed to (1) operate under arbitrary (user-specified) metrics, (2) provide a natural interpretation of clusters and outliers based on their persistence across filtration values, and (3) avoid ambiguous hyperparameter tuning regarding cluster shape or minimal density. The paper’s focus spans a variety of domains where outlier discovery is semantically relevant: cellular interfaces, image compression artifacts, trade networks, and molecular biology.

Algorithmic Framework

The HTC algorithm constructs a hierarchical partition of a dataset by examining how connected components (equivalently, zero-dimensional homology classes) evolve within the Vietoris-Rips filtration induced by a chosen dissimilarity metric. The procedure is as follows:

Given a point set $X$ with a distance function $d$, the maximal and minimal pairwise distances, $r_{\max}$ and $r_{\min}$, are computed.
A discrete sequence of filtration values $\{ r_m \}$ is chosen covering $[0, r_{\max}]$. For each $r_m$, edges are included between points separated by distance at most $r_m$.
For each value $r_m$, clusters are defined as maximal sets of points connected via paths of such edges.
As $r$ increases, clusters merge, and the entire process naturally induces a hierarchy (dendrogram).
Persistent outliers are those points or clusters that remain distinct until large values of $r$.

This strictly homological construction decouples cluster identification from assumptions regarding convexity, density, or distribution. All cluster merging and outlier detection is parameter-free except for the chosen metric and the granularity of the filtration.
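The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `htc_levels`, the evenly spaced grid of filtration values, and the `n_levels` parameter are assumptions made here, and the Euclidean `math.dist` default stands in for whatever metric the user supplies. Connected components are tracked with a union-find structure as edges are added in order of increasing length.

```python
from itertools import combinations
import math


def htc_levels(points, dist=math.dist, n_levels=10):
    """Return [(r_m, labels)] for evenly spaced r_m in [0, r_max].

    labels[i] is the component index of points[i] at filtration value r_m.
    Requires at least two points. The grid spacing is an illustrative choice.
    """
    n = len(points)
    # All pairwise distances, sorted once so edges can be added incrementally.
    edges = sorted(
        (dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    r_max = edges[-1][0]
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    levels = []
    k = 0  # index of the next edge not yet added
    for m in range(n_levels + 1):
        # Use r_max exactly at the last level to avoid floating-point drift.
        r = r_max if m == n_levels else r_max * m / n_levels
        # Add every edge of length at most r.
        while k < len(edges) and edges[k][0] <= r:
            _, i, j = edges[k]
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri
            k += 1
        # Relabel component roots as 0, 1, 2, ... in order of first appearance.
        relabel = {}
        labels = [relabel.setdefault(find(i), len(relabel)) for i in range(n)]
        levels.append((r, labels))
    return levels
```

For the four points `[(0, 0), (1, 0), (0, 1), (10, 10)]`, the three nearby points share a label from the first nonzero level onward, while the distant point keeps its own label until the final level, which is the behavior the text describes for a persistent outlier.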

Figure 1: Hierarchical topological clustering applied to a fragmented front separating malignant and healthy cells.

Performance on Geometric and Biomedical Data

The efficacy of HTC is first illustrated by analyzing the interface between malignant and healthy cells, represented as a fragmented two-dimensional point set. As the filtration parameter increases, the principal interface rapidly aggregates into a dominant component, while subpopulations corresponding to "detached islands" of malignant cells persist as outliers until high filtration levels. These islands are biologically significant, representing rare infiltration events of carcinoma cells. Classical K-means fails because it enforces roughly spherical partitions, and DBSCAN’s outcome depends strongly on the choice of its neighborhood radius $\varepsilon$ and minimum-points threshold $M_p$; without explicit user tuning it often misclassifies fragments or misses rare events.

Figure 2: Evolution of the topological clusters with the filtration parameter $r$.

Figure 3: Hierarchical clustering, K-means, and DBSCAN applied to the fragmented front; traditional methods misinterpret biologically salient structure.

Figure 4: Persistence barcode for the $H_0$ homology of the fragmented front dataset, visually encoding persistence of clusters and outliers.
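A barcode of this kind can be computed directly from the pairwise distances. The sketch below assumes the standard convention for $H_0$ of a Vietoris-Rips filtration: every point is born at $r = 0$, and one component dies each time an added edge joins two previously separate components. The function name `h0_barcode` and the Euclidean default metric are illustrative choices, not taken from the paper.

```python
from itertools import combinations
import math


def h0_barcode(points, dist=math.dist):
    """Return the H0 bars as (birth, death) pairs, sorted by death value.

    Edges are processed in order of increasing length (Kruskal's algorithm).
    Each edge that merges two components closes one bar at that length.
    """
    n = len(points)
    edges = sorted(
        (dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            bars.append((0.0, d))  # one component dies at scale d
    # The last surviving component never dies.
    bars.append((0.0, math.inf))
    return bars
```

Long finite bars correspond to points or groups that stay separate until large $r$; a gap between the short bars and the long ones is what singles out persistent outliers visually.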

Application to Images and Structural Outlier Detection

In an experiment involving compressed and defect-laden digital images, HTC, using the Wasserstein image distance, not only separates images into clusters by compression level but also isolates outliers with visible defects. While complete linkage hierarchical clustering groups images predominantly by compression, HTC distinguishes between global structure (compression-induced blurring) and localized anomalies (line defects), permitting the automated detection of both poor-quality images and images with inserted artifacts.

Figure 5: Series of images with decreasing compression and eventual defects for the image test.

Figure 6: Hierarchical clustering with highest cophenetic correlation coefficient versus topological hierarchical clustering for the image test; HTC produces interpretable structure.

Figure 7: Hierarchical clustering with complete linkage versus topological hierarchical clustering for the image test; HTC produces partitions that correspond to semantically meaningful groupings.

Figure 8: Persistence barcode for the $H_0$ homology on images, showing merging behavior of clusters and persisting outliers.

Outlier and Partner Analysis in Economic Trade Data

HTC was applied to multi-dimensional international trade data for Spain and European partners, treating each country's trade profile as a point in high-dimensional space of normalized import/export statistics. Rather than arbitrary cluster assignment dependent on the choice of $k$ (K-means) or linkage type (agglomerative clustering), HTC generates a natural hierarchy:

Most countries with negligible trade rapidly merge into a dominant "background" cluster.
Major trade partners (France, Germany, Italy, Portugal) are persistent outliers: they are identifiable without parameter tuning and remain separated until $r$ is large, which quantifies their status as key outliers.

Figure 9: Relevant and irrelevant partners in 2019 Spanish trade visualized as clusters and persistent outliers.
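One way to turn "remains separated until $r$ is large" into a per-point number is sketched below. The convention used here is an assumption for illustration and is not stated in the paper: whenever two components merge, the smaller one is treated as absorbed, and each of its points records that merge distance. The name `absorption_scales` is hypothetical. Points in the background cluster end up with small values, while persistent outliers receive large ones.

```python
from itertools import combinations
import math


def absorption_scales(points, dist=math.dist):
    """Return, for each point, the largest scale at which its component
    was the smaller side of a merge (0.0 if that never happened).

    Illustrative convention: on ties in size, the second component found
    is treated as the absorbed one.
    """
    n = len(points)
    edges = sorted(
        (dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))
    members = {i: [i] for i in range(n)}  # root -> points in that component
    scale = [0.0] * n

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        if len(members[ri]) < len(members[rj]):
            ri, rj = rj, ri  # make ri the larger component
        for p in members[rj]:
            scale[p] = d  # edges are sorted, so later merges overwrite earlier ones
        members[ri].extend(members.pop(rj))
        parent[rj] = ri
    return scale
```

Sorting the points by decreasing scale gives a ranking in which the last-to-merge items come first, analogous to how the major trade partners stand out in the hierarchy.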

Clustering and Outlier Discovery in Gene Expression

In the domain of cancer genomics, HTC was employed to cluster mRNA expression profiles for genes implicated in cell cycle regulation in breast cancer samples relative to a healthy baseline. Using both Euclidean and Fermat distances on gene-normalized data, HTC sequentially merges genes into a majority cluster while highlighting persistent outliers including CCNE1, SMC1B, CDKN2A, CDC6, PKMYT1, and CDK1. All of these are strongly implicated in tumorigenesis and therapy resistance according to the literature. This direct correspondence between persistent outliers and biological significance contrasts sharply with standard hierarchical clustering, which produces groups of ambiguous interpretation.
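Because HTC accepts any distance, swapping the Euclidean metric for a Fermat distance only changes the distance matrix. The sketch below uses the common sample definition of the Fermat distance, namely the minimum over paths through the data of the sum of edge lengths raised to a power $\alpha$. This summary does not give the paper's exact definition or exponent, so the function name `fermat_distances` and the default `alpha=2.0` are assumptions. Shortest paths are found with a plain Floyd-Warshall loop, which is cubic in the number of points and suitable only for small datasets.

```python
import math


def fermat_distances(points, alpha=2.0, dist=math.dist):
    """Return the matrix of sample Fermat distances between all points.

    D[i][j] is the minimum, over paths through the data points, of the sum
    of (edge length ** alpha). With alpha = 1 this reduces to the base metric.
    """
    n = len(points)
    # Start from direct edges weighted by distance ** alpha.
    D = [[dist(points[i], points[j]) ** alpha for j in range(n)] for i in range(n)]
    # Floyd-Warshall: allow each point k in turn as an intermediate stop.
    for k in range(n):
        Dk = D[k]
        for i in range(n):
            Di = D[i]
            Dik = Di[k]
            for j in range(n):
                if Dik + Dk[j] < Di[j]:
                    Di[j] = Dik + Dk[j]
    return D
```

With $\alpha > 1$, paths made of many short hops through dense regions become cheaper than one long jump, so points connected through dense regions appear closer than their straight-line distance suggests.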

Figure 10: HTC study of mRNA gene expression using TCGA breast cancer data for cell cycle genes; last-to-merge outliers correspond to crucial cancer drivers or markers.

Figure 11: Persistence barcode for $H_0$ homology of the mRNA gene expression data for genes involved in the cell cycle.

Figure 12: Hierarchical clustering with weighted linkage applied to mRNA gene expression data; HTC provides sharper signal on relevant outliers.

Implications and Future Perspectives

HTC’s primary advantage is its interpretability and geometrically sound foundation: outliers and clusters are produced according to explicit, metric-governed merging criteria related to path-connectivity in the induced simplicial complex, without recourse to ad hoc parameters or ambiguous heuristics. Notably:

Practical Implications: The method is directly applicable to quality control in imaging (e.g., automated outlier detection in large medical or industrial imaging repositories), unsupervised feature extraction in multi-omics, and partner or anomaly analysis in economic or networked systems.
Theoretical Implications: HTC demonstrates the productive intersection of TDA and classical unsupervised learning, indicating a robust alternative for settings where cluster shape, density, or the presence of multiple scales of structure confound traditional approaches. Its potential as a front-end for subsequent supervised analysis (e.g., to select features for biological validation) is significant.
Scalability and Extensions: While the base algorithm has quadratic complexity in the number of data points, application to moderate datasets is efficient. Integration with sparsification or graph-theoretic pruning could extend scalability. Future work could generalize the approach to higher homology levels ($H_k$, $k>0$) for detecting higher-dimensional features such as cycles or voids, and to adaptive or learned metric selection for more nuanced data modalities.

Conclusion

The paper "Hierarchical topological clustering" (2601.00892) presents a methodologically rigorous topological clustering algorithm, substantiated through diverse real-world data applications. By relying on persistent $H_0$ homology and metric-induced connectivity, HTC provides robust and inherently interpretable detection of clusters and meaningful outliers, often outperforming or complementing standard methods in instances where geometric or structural fidelity is required. The approach serves both as a foundational tool for exploratory analysis and as a bridge between topological data analysis and mainstream machine learning.