Hierarchical Topological Clustering Algorithm

Updated 6 January 2026
  • Hierarchical topological clustering is a methodology that constructs multiscale nested clusters using algebraic topology and persistent homology to capture complex data structures.
  • It leverages techniques like Vietoris–Rips filtrations, nerve constructions, and cosheaf theory to support arbitrary cluster shapes and robust outlier detection.
  • The approach enables scalable implementations with adaptive parameter selection, proving effective in fields such as image analysis, genomics, and manifold learning.

A hierarchical topological clustering algorithm is a clustering paradigm that constructs a hierarchy, or tree, of nested groupings of a dataset, where the organization and extraction of clusters are governed by topological and/or density-based principles rather than solely by metric or linkage criteria. By leveraging mathematical notions such as persistence, cosheaves, filtered complexes, and topological invariants, these algorithms provide a rigorous multiscale view of clustering structure, robust detection of outliers, and support for arbitrary cluster shapes and heterogeneous similarity measures.

1. Mathematical and Topological Foundations

Hierarchical topological clustering frameworks are grounded in concepts from algebraic topology, metric geometry, and persistent homology. Central objects include:

  • Vietoris–Rips filtrations: For a dataset $X = \{x_1, \dots, x_N\}$ with a metric $d$, at each scale $r$ one forms the complex $\mathrm{VR}(X, r)$ whose simplices consist of all subsets of $X$ with pairwise distances at most $r$. The $H_0$ persistence diagram tracks the birth (isolation) and death (merging) of connected components (clusters) across filtration values (Carpio et al., 31 Dec 2025).
  • Nerve constructions and covers: The nerve of a cover $\{U(i)\}_{i \in I}$ of $X$ is the simplicial complex whose simplices correspond to subsets of indices $\sigma \subseteq I$ with non-empty intersection $\bigcap_{i \in \sigma} U(i) \neq \emptyset$. This approach generalizes classical clustering to arbitrarily shaped, overlapping cluster structure (Joyce et al., 2023); a small code sketch of the construction follows this list.
  • Persistence diagrams: Clusters and outliers are ordered by the persistence (lifespan) of their corresponding features in the filtration; significant clusters are associated with long bars in the barcode.
  • Cosheaves and merge trees: The formalism of cosheaves allows clustering structure (e.g., connected components) to be captured functorially for each open set or scale, with the entire hierarchy encoded in a merge tree or dendrogram (Joyce et al., 2023).
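To make the nerve construction concrete, the following minimal Python sketch (illustrative only, not drawn from any cited implementation) enumerates the simplices of the nerve of a finite cover, with each cover set given as a set of data-point indices:

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Nerve of a finite cover: one (k-1)-simplex for each k cover sets
    with non-empty common intersection. `cover` maps an index to the
    set of data-point ids that the cover set contains."""
    indices = sorted(cover)
    simplices = [(i,) for i in indices]        # one vertex per cover set
    for k in range(2, max_dim + 2):            # edges, triangles, ...
        for sigma in combinations(indices, k):
            if set.intersection(*(cover[i] for i in sigma)):
                simplices.append(sigma)
    return simplices

# Cover sets 0 and 1 share point 2, so the nerve contains the edge (0, 1);
# set 2 overlaps nothing and stays an isolated vertex.
cover = {0: {0, 1, 2}, 1: {2, 3, 4}, 2: {5, 6}}
print(nerve(cover))  # [(0,), (1,), (2,), (0, 1)]
```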

This topological view avoids assumptions of convexity or spherical shape and provides robust cluster identification and outlier ranking via topological persistence.

2. Algorithmic Principles and Representative Methods

Several major hierarchical topological clustering algorithms instantiate these principles:

(a) Hierarchical Topological Clustering (HTC):

HTC applies a Vietoris–Rips filtration to arbitrary metric data, connecting data points whose pairwise distance falls below each threshold $r$ and recording cluster components and their merging history. The resulting persistence diagram directly encodes significant clusters and persistent outliers (Carpio et al., 31 Dec 2025).
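The $H_0$ bookkeeping behind this construction can be sketched with a union-find over edges sorted by length; this is an illustrative reimplementation, not the authors' code:

```python
import numpy as np
from itertools import combinations

def h0_persistence(X):
    """H_0 persistence of a Vietoris-Rips filtration: all points are born
    at scale 0; when two components merge at scale r, one bar dies.
    Returns (birth, death) pairs; the last component never dies."""
    n = len(X)
    parent = list(range(n))

    def find(i):                       # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((float(np.linalg.norm(X[i] - X[j])), i, j)
                   for i, j in combinations(range(n), 2))
    bars = []
    for r, i, j in edges:
        a, b = find(i), find(j)
        if a != b:
            parent[b] = a              # components merge: one bar dies at r
            bars.append((0.0, r))
    bars.append((0.0, np.inf))         # the surviving component
    return bars

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(h0_persistence(pts))  # bars: (0, 0.1), (0, ~7.0), (0, inf)
```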

(b) Accelerated HDBSCAN*:

This algorithm builds on the mutual reachability metric $d_{\mathrm{mreach}}(x, y) = \max\{\mathrm{core}_k(x), \mathrm{core}_k(y), d(x, y)\}$, where $\mathrm{core}_k(x)$ is the distance to the $k$-th nearest neighbor of $x$. A minimum spanning tree (MST) is constructed in this metric. Hierarchies of density-based clusters are extracted by sweeping over MST edge weights; cluster persistence is used to score robustness, analogous to $H_0$-persistence in topological data analysis (McInnes et al., 2017).
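A brief sketch of the mutual reachability metric itself (not the paper's accelerated dual-tree machinery), using NumPy and SciPy:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_reachability(X, k=5):
    """Mutual reachability distances:
    d_mreach(x, y) = max(core_k(x), core_k(y), d(x, y))."""
    D = squareform(pdist(X))
    core = np.sort(D, axis=1)[:, k]    # k-th NN distance (column 0 is self)
    M = np.maximum(D, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(M, 0.0)
    return M

X = np.random.default_rng(0).normal(size=(50, 2))
mst = minimum_spanning_tree(mutual_reachability(X, k=5))
# Sweeping the MST edge weights from small to large yields the
# density-based cluster hierarchy described above.
```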

(c) Topological Hierarchical Decompositions (THD):

THDs generalize single-linkage clustering via the category of cosheaves and nerves of covers. For each resolution, clusters are components of the nerve; the entire hierarchy forms a merge tree functorially associated to the filtration. This scheme can encode classical, Mapper-based, and Reeb graph-based clusterings (Joyce et al., 2023).
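As a concrete baseline, the single-linkage merge tree at the root of this generalization can be computed with SciPy and cut at several resolutions (an illustrative instance, not the cosheaf formalism itself):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.spatial.distance import pdist

# The single-linkage merge tree is the simplest THD: each row of Z records
# which two components join, at what scale, and the size of the result.
X = np.random.default_rng(1).normal(size=(30, 2))
Z = linkage(pdist(X), method="single")
labels = cut_tree(Z, height=[0.3, 0.6, 1.0])  # cluster labels at 3 resolutions
print(labels.shape)                           # (30, 3): one column per height
```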

(d) Pretopology-based and ART-based approaches:

Methods such as PretopoMD use set-theoretic pseudoclosures defined by logical rules (e.g., DNF over prenetworks) to generate a hierarchy of clusters and a dendrogram via quasi-hierarchical structures, supporting mixed data types directly (Levy et al., 27 Nov 2025). ART-based divisive topological clustering constructs a hierarchy using vigilance-driven prototype adaptation and edge-based graph structures, supporting continual learning (Masuyama et al., 2022).
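A toy illustration of the pseudoclosure mechanism (a generic pretopology rule assumed here for exposition; PretopoMD's actual rules are DNF expressions over prenetworks):

```python
def pseudoclosure(A, neighbors, theta=1):
    """One pseudoclosure step: absorb every point with at least `theta`
    neighbors already inside A (a generic pretopology rule, assumed here
    for illustration; PretopoMD uses richer logical rules)."""
    return A | {x for x in neighbors if len(neighbors[x] & A) >= theta}

def quasi_closure(A, neighbors, theta=1):
    """Iterate the pseudoclosure to its fixpoint: a quasi-closed set
    that serves as one cluster in the hierarchy."""
    while True:
        B = pseudoclosure(A, neighbors, theta)
        if B == A:
            return A
        A = B

# Chain 0-1-2 plus an isolated point 3: the closure of {0} absorbs 1, then 2.
neighbors = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
print(quasi_closure({0}, neighbors))  # {0, 1, 2}
```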

(e) Graph-based and parallelized methods:

TMFG-DBHT builds a sparse planar, chordal graph structure (TMFG) and uses bubble-based hierarchies (DBHT) for robust, parallelizable multiscale cluster extraction, with notable speedup and accuracy on structured data (Raphael et al., 2024).

3. Algorithmic Implementation and Computational Complexity

Algorithmic implementations are heterogeneous, but a common structure is as follows (a minimal end-to-end sketch follows the list):

  • Distance/similarity computation: Compute pairwise distances or similarities (e.g., Euclidean, Wasserstein, domain-specific).
  • Graph or filtration construction: Build graphs (NN-graphs, MSTs, TMFG) or filtered complexes (Vietoris–Rips, Mapper nerves).
  • Hierarchy formation: Trace merging events (component merges in filtration, edge contraction in MST, cover nerve inclusions).
  • Persistence/scoring: Quantify cluster significance by persistence, connected component stability, or custom scoring indices (e.g., mutual reachability, logical rules).
  • Extraction/pruning: Extract significant clusters using dynamic programming (maximum stability), logical rule aggregation, or component persistence thresholds.
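A minimal end-to-end sketch of these five steps, using single linkage and a largest-gap cut as stand-ins for the method-specific choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def topological_pipeline(X):
    d = pdist(X)                          # 1. distance computation
    Z = linkage(d, method="single")       # 2-3. filtration + merge hierarchy
    heights = Z[:, 2]                     # merge scales = component death times
    gaps = np.diff(heights)               # 4. persistence = gaps in the barcode
    cut = heights[np.argmax(gaps)] + gaps.max() / 2   # cut in the widest gap
    return fcluster(Z, t=cut, criterion="distance")   # 5. extract clusters

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (40, 2)), rng.normal(4, 0.2, (40, 2))])
print(np.unique(topological_pipeline(X)))  # two well-separated blobs -> [1 2]
```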

Complexity is method-specific:

  • HTC: $O(MN^2)$ worst case ($M$ filtration steps), typically dominated by distance computations (Carpio et al., 31 Dec 2025);
  • Accelerated HDBSCAN*: $O(N \log N)$ on typical data due to dual-tree MST construction and efficient $k$-NN search (McInnes et al., 2017);
  • Representative aggregation (SRSC): $O(n \log n)$ via k-d tree nearest-neighbor search per level and recursive aggregation (Xie et al., 2021);
  • TMFG-DBHT: $O(n^2 \log n)$, with substantial parallelism and lazy heap-based optimizations (Raphael et al., 2024).

4. Outlier Detection, Arbitrary Shape, and Robustness

Hierarchical topological clustering inherently identifies outliers as persistent components surviving to large scales in the filtration. Because the core mechanism is based on proximity or connectivity thresholds, with no assumption of convexity, non-spherical, elongated, or arbitrarily shaped clusters are naturally supported (Carpio et al., 31 Dec 2025, McInnes et al., 2017).

Topological persistence (the length of bars in the $H_0$ barcode) or analogous stability scoring (as in HDBSCAN*) directly quantifies the prominence and significance of clusters and outliers, making these algorithms robust to noise and sampling artifacts.
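One hedged way to operationalize this ranking (the `min_size` parameter and the scoring rule below are illustrative choices, not taken from the cited papers) is to score each point by the first filtration scale at which it belongs to a sufficiently large component, so that persistent outliers receive the largest scores:

```python
import numpy as np
from itertools import combinations

def outlier_scores(X, min_size=5):
    """Score each point by the first scale at which its component reaches
    `min_size` points; persistent outliers get the largest scores."""
    n = len(X)
    parent, members = list(range(n)), {i: [i] for i in range(n)}
    score = np.full(n, np.inf)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((float(np.linalg.norm(X[i] - X[j])), i, j)
                   for i, j in combinations(range(n), 2))
    for r, i, j in edges:
        a, b = find(i), find(j)
        if a == b:
            continue
        parent[b] = a
        members[a] += members.pop(b)
        if len(members[a]) >= min_size:
            for p in members[a]:
                score[p] = min(score[p], r)   # first entry into a big component
    return score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), [[6.0, 6.0]]])
print(outlier_scores(X).argmax())  # 20: the injected far-away point
```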

Empirical evaluations demonstrate that topological methods (e.g., HTC, HDBSCAN*, TMFG-DBHT) outperform classical methods (K-means, standard hierarchical agglomerative clustering, DBSCAN) in settings where cluster shapes are heterogeneous, clusters vary in density, or significant outliers are present. Notably, methods such as PretopoMD and ART-based divisive clustering successfully integrate heterogeneous (mixed) data, logical cluster definitions, and continual learning requirements, showing superior reliability and interpretability in practical tests (Levy et al., 27 Nov 2025, Masuyama et al., 2022).

5. Practical Considerations and Parameter Selection

Key considerations for practitioners include:

  • Distance metric choice: Select domain-relevant metrics, including Euclidean for geometry, Wasserstein for images, Fermat for high-dimensional profiles (Carpio et al., 31 Dec 2025).
  • Filtration granularity: The number of filtration steps (HTC), persistence thresholds, minimum cluster size ($m$ for HDBSCAN*), or cover refinement level (THD) should be chosen to resolve natural scale gaps in the data. Inspection of inter-point distances or barcode gaps is recommended (McInnes et al., 2017, Carpio et al., 31 Dec 2025); a sketch of this heuristic follows the list.
  • Interpretability: Logical rule-based methods produce explicit cluster membership criteria, facilitating transparency. Persistent merges and dendrogram structure provide clear multiscale summaries.
  • Scalability: Algorithms leveraging k-d trees, dual-tree traversal, or parallelization (TMFG-DBHT) are suitable for large datasets, achieving practical $O(N \log N)$ to $O(n^2)$ runtimes (McInnes et al., 2017, Xie et al., 2021, Raphael et al., 2024).
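A small sketch of the gap-inspection heuristic (the choice $k = 4$ and the midpoint rule are assumptions made for illustration): sort the $k$-th nearest-neighbor distances and place the threshold inside the largest jump in that curve:

```python
import numpy as np
from scipy.spatial import cKDTree

def suggest_threshold(X, k=4):
    """Sort each point's k-th nearest-neighbor distance and place the
    threshold in the middle of the largest jump in that curve."""
    d, _ = cKDTree(X).query(X, k=k + 1)   # column 0 is the point itself
    knn = np.sort(d[:, k])
    jump = np.argmax(np.diff(knn))
    return (knn[jump] + knn[jump + 1]) / 2

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2)),
               [[10.0, 10.0]]])          # two tight blobs plus one outlier
print(suggest_threshold(X))  # lands between blob-scale and outlier-scale gaps
```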

Possible limitations include worst-case quadratic or higher complexity for dense or poorly scaled data, the need for parameter tuning in some methods, and scalability bottlenecks for extremely high-dimensional data or very large $n$.

6. Comparative Performance and Applications

Hierarchical topological clustering algorithms deliver state-of-the-art performance across domains where classical clustering fails:

  • Fragmented or manifold-like data: HTC outperforms K-means and average-linkage in fragment detection (e.g., detaching cancer fronts, complex economic interactomes) (Carpio et al., 31 Dec 2025).
  • Time-series and high-dimensional clustering: TMFG-DBHT achieves high accuracy and parallel efficiency versus agglomerative clustering on time-series benchmarks (Raphael et al., 2024).
  • Mixed and heterogeneous data: PretopoMD and ART-based divisive clustering address arbitrary feature mixes, continual data streams, and robust, interpretable clustering (Levy et al., 27 Nov 2025, Masuyama et al., 2022).
  • Explainable clustering: THD frameworks directly enable explainability and functoriality in unsupervised tasks (e.g., via cosheaf-theoretic decompositions) (Joyce et al., 2023).

Empirical results confirm these advantages, e.g., significant gains in clustering accuracy (SRSC Rand Index 0.7461 vs. 0.4754 for HAC-A), loop closure improvements in SLAM using hierarchical unsupervised clustering, and clear detection of outlier classes and driver genes in biological data.

7. Extensions, Current Research, and Future Directions

Current research in hierarchical topological clustering encompasses:

  • Multiparameter persistence: Generalizing single-parameter filtrations to sheaf or zigzag frameworks, treating multiple clustering parameters simultaneously (Shiebler, 2021, McInnes et al., 2017).
  • Scalable distributed and approximate algorithms: Enhancing MST and cover constructions with parallel, approximate, or graph partitioning methods (Raphael et al., 2024, McInnes et al., 2017).
  • Integration with cosheaf theory and category theory: Developing stable, functorial summaries and Bayesian inference pipelines for parameter learning and uncertainty quantification (Shiebler, 2021, Joyce et al., 2023).
  • Logical and explainable clustering: Advancing hybrid approaches (PretopoMD) for interpretable and expert-driven clustering of heterogeneous, raw data (Levy et al., 27 Nov 2025).
  • Continual and streaming learning: ART-based frameworks supporting online, adaptive clustering and resilience to concept drift (Masuyama et al., 2022).

A plausible implication is that, due to their capacity for arbitrary metric support, multiscale robustness, and formal explainability, hierarchical topological clustering methods will continue to expand in applications ranging from high-dimensional genomics and imaging to sequential data analytics, robotics SLAM, and explainable AI.
