PH-landmarks: Outlier-Robust PH Subsampling
- The paper introduces PH-landmarks, a method that uses local PH outlier scores to select a fixed-size landmark set that preserves topological signals in noisy datasets.
- It computes persistent homology on local δ-neighborhoods via a δ-link construction inspired by the Mayer–Vietoris sequence, ensuring theoretical rigor.
- Experimental results on synthetic datasets demonstrate that PH-landmarks outperforms traditional methods like random, max–min, and robust k-means in capturing the true data signal.
PH-landmarks is an outlier-robust subsampling methodology for landmark selection in persistent homology (PH) computations, designed to efficiently capture the topological signal of large, noisy point clouds. Motivated by the Mayer–Vietoris sequence, the algorithm assigns each point a local PH-based outlier score by evaluating the maximal persistence of homology classes in the δ-link of its neighborhood. Using these scores, PH-landmarks selects a fixed-size subset of representative or vital points whose inclusion preserves the topological structure of the data while discarding noise and outliers. Empirical comparisons on synthetic datasets demonstrate its superiority over conventional random, sequential max–min, and robust k-means-based landmarking for low to moderate sampling densities in the presence of outlier noise (Stolz, 2021).
1. Mathematical Formulation and Outlier Definition
Given a point cloud $X = \{x_1, \dots, x_N\}$ with distance $d$, the PH-landmarks approach selects a subset of landmarks $L \subset X$ such that the Vietoris–Rips PH of $L$ approximates that of the “signal” in $X$, mitigating the influence of outliers. Two core parameters are fixed: a local neighborhood radius $\delta > 0$ and the desired number of landmarks $m$.
For each $y \in X$, the local neighborhood is defined as:
$$N_\delta(y) = \{x \in X : d(x, y) \le \delta\}.$$
The Vietoris–Rips complex $\mathrm{VR}(N_\delta(y))$ is constructed at all scales. The δ-link $\mathrm{Lk}_\delta(y)$ is the subcomplex comprising all simplices in $\mathrm{VR}(N_\delta(y))$ not containing $y$. Its “star closure” $\mathrm{St}_\delta(y)$ is the union of $\mathrm{Lk}_\delta(y)$ and all simplices containing $y$ and entirely supported in $N_\delta(y)$; by construction, this is contractible and serves as a local neighborhood for Mayer–Vietoris analysis.
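As a minimal sketch of the construction above: because a Vietoris–Rips complex is determined by its vertices, the δ-link can be represented by the neighborhood of a point with the point itself removed. The function names, the closed ball (distance at most δ), and the Euclidean metric are assumptions made for illustration, not details taken from the paper.

```python
import math

def delta_neighborhood(points, y_idx, delta):
    """Indices of all points within distance delta of points[y_idx] (y included)."""
    y = points[y_idx]
    return [i for i, x in enumerate(points) if math.dist(x, y) <= delta]

def delta_link_vertices(points, y_idx, delta):
    """Vertex set of the delta-link: the neighborhood with y itself removed."""
    return [i for i in delta_neighborhood(points, y_idx, delta) if i != y_idx]

points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.8), (3.0, 3.0)]
print(delta_neighborhood(points, 0, 1.0))   # [0, 1, 2]
print(delta_link_vertices(points, 0, 1.0))  # [1, 2]
print(delta_link_vertices(points, 3, 1.0))  # [] (an isolated point)
```

The Vietoris–Rips filtration of the link is then built on the returned vertices only, which keeps each PH computation local.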
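The local PH-score quantifying outlierness in homology dimension $i$ is the maximal finite persistence in the δ-link $\mathrm{Lk}_\delta(y)$ of $y$:
$$\mathrm{out}_i(y) = \max \{\, d - b : [b, d) \in \mathrm{PH}_i(\mathrm{Lk}_\delta(y)),\ d < \infty \,\},$$
where $\mathrm{PH}_i$ denotes the persistence barcode in dimension $i$ (excluding infinite intervals). Two variants are used:
- $\mathrm{out}^{0,1}(y) = \max\{\mathrm{out}_0(y), \mathrm{out}_1(y)\}$ (captures large deviations in low dimensions)
- $\mathrm{out}^{1}(y) = \mathrm{out}_1(y)$ (sensitive only to 1D topological features)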
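A point with no other data point within distance $\delta$ is deemed a “super-outlier,” indicating extreme isolation. Such points are ranked last and only selected if the landmark budget $m$ exceeds $N - s$, where $N$ is the total number of points and $s$ the number of super-outliers, ensuring outliers are rarely prioritized.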
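The score definition can be illustrated with a short sketch that takes barcodes as plain lists of (birth, death) pairs. The barcode values below are made up for illustration, and the function names and the convention of returning 0.0 when no finite interval exists are assumptions rather than choices documented in the paper.

```python
import math

def max_finite_persistence(barcode):
    """Largest death - birth over finite intervals; 0.0 if there are none."""
    finite = [d - b for b, d in barcode if math.isfinite(d)]
    return max(finite, default=0.0)

def outlier_score(barcodes_by_dim, dims=(0, 1)):
    """Maximal finite persistence over the chosen homology dimensions."""
    return max(max_finite_persistence(barcodes_by_dim.get(k, [])) for k in dims)

# Hypothetical barcodes of one delta-link, keyed by homology dimension.
barcodes = {
    0: [(0.0, 0.25), (0.0, 1.0), (0.0, math.inf)],
    1: [(0.5, 1.25)],
}
print(outlier_score(barcodes, dims=(0, 1)))  # 1.0  (dimensions 0 and 1)
print(outlier_score(barcodes, dims=(1,)))    # 0.75 (dimension 1 only)
```

The infinite interval in dimension 0 is ignored, matching the exclusion of infinite intervals in the definition.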
Two landmarking strategies are defined:
- “Representative” (PH-I): select points with smallest outlier score.
- “Vital” (PH-II): select points with largest outlier score.
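Given precomputed scores, the two strategies differ only in sort direction. The following sketch assumes scores are stored in a dictionary keyed by point index and that super-outliers are kept in a separate list; both are illustrative choices, as are the function and argument names.

```python
def select_landmarks(scores, super_outliers, m, strategy="PH-I"):
    """Return m point indices.

    scores:         dict mapping point index -> outlier score (non-super-outliers)
    super_outliers: indices of super-outliers, always ranked last
    strategy:       "PH-I" (smallest scores first) or "PH-II" (largest first)
    """
    if strategy not in ("PH-I", "PH-II"):
        raise ValueError("strategy must be 'PH-I' or 'PH-II'")
    ranked = sorted(scores, key=lambda i: scores[i], reverse=(strategy == "PH-II"))
    return (ranked + list(super_outliers))[:m]

scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.1}
print(select_landmarks(scores, [4], 2, "PH-I"))   # [3, 0]
print(select_landmarks(scores, [4], 2, "PH-II"))  # [1, 2]
print(select_landmarks(scores, [4], 5, "PH-I"))   # [3, 0, 2, 1, 4]
```

The super-outlier with index 4 appears only when the budget exceeds the number of scored points.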
2. Mayer–Vietoris Motivation and Theoretical Rationale
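The PH-landmarks methodology is rooted in the Mayer–Vietoris long exact sequence. For a complex $K = A \cup B$ with subcomplexes $A$ and $B$:
$$\cdots \to H_n(A \cap B) \to H_n(A) \oplus H_n(B) \to H_n(K) \to H_{n-1}(A \cap B) \to \cdots$$
Setting $A$ to be the subcomplex of simplices not containing $y$, $B$ the star closure of $y$ (which is contractible), and $A \cap B$ the δ-link of $y$, if $\tilde H_n(A \cap B) = 0$ for all $n$, then for $n \ge 1$, $H_n(A) \cong H_n(K)$. In practical data settings, the homology of the δ-link rarely vanishes exactly, but the largest finite-persistence class in its barcode measures the deviation from this isomorphism. Thus, points with small outlier score are inessential to global PH, while large values indicate points critical to the dataset’s topology (Stolz, 2021).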
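The isomorphism follows from exactness in two steps. The derivation below is a sketch under stated assumptions: reduced homology $\tilde H$, a nonempty intersection, and the symbols $A$, $B$, $K = A \cup B$ used as generic names for the complex without the point, the contractible star closure, and the full complex.

```latex
% Segment of the reduced Mayer--Vietoris sequence for K = A \cup B:
\tilde H_n(A \cap B) \;\to\; \tilde H_n(A) \oplus \tilde H_n(B)
  \;\to\; \tilde H_n(K) \;\to\; \tilde H_{n-1}(A \cap B)

% Step 1: B is contractible, so \tilde H_n(B) = 0 for all n.
% Step 2: if the link A \cap B is acyclic, both outer terms vanish, leaving
0 \;\to\; \tilde H_n(A) \;\to\; \tilde H_n(K) \;\to\; 0
\quad\Longrightarrow\quad
\tilde H_n(A) \cong \tilde H_n(K).
```

When the outer terms are small but nonzero, the middle map can fail to be an isomorphism by at most what the link contributes, which is the intuition behind scoring points by the persistence of their link.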
3. Algorithmic Workflow and Pseudocode
The annotated PH-landmarks (representative version) workflow is as follows:
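- Initialization: $S = \emptyset$ (super-outlier set).
- For each point $y$ in the point cloud $X$:
  - Compute the δ-neighborhood of $y$.
  - If $y$ is a super-outlier: assign $y$ to $S$.
  - Otherwise: build the Vietoris–Rips filtration of the δ-link of $y$ up to dimension 2 and compute the outlier score $\mathrm{out}(y)$.
- Sort $X \setminus S$ by non-decreasing $\mathrm{out}(y)$.
- Select the first $m$ points for $L$ (the landmark set).
- If $|L| < m$, append arbitrary points from $S$ to reach size $m$.
- Return $L$.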
All Vietoris–Rips computations are performed locally on the δ-neighborhoods, significantly reducing computational complexity for a large number of points $N$. Super-outliers are included as landmarks only when unavoidable, guarding against isolated noise dominating the landmark set.
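The workflow can be sketched end to end in pure Python. Two simplifications are assumptions of this sketch and not the paper's method: the default score uses dimension-0 persistence only (the largest minimum-spanning-tree edge of the link, with edges entering the filtration at their length), and a super-outlier is taken to be a point with an empty δ-link. A dimension-1 score would need a PH library and can be passed in through the hypothetical `score_fn` argument.

```python
import math

def h0_max_persistence(pts):
    """Largest finite H0 death in the Vietoris-Rips filtration of pts.

    Finite H0 deaths equal the minimum-spanning-tree edge lengths,
    found here with Kruskal's algorithm and a union-find structure.
    """
    parent = list(range(len(pts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(pts[i], pts[j]), i, j)
                   for i in range(len(pts)) for j in range(i + 1, len(pts)))
    longest = 0.0
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:           # merging two components kills one H0 class
            parent[ri] = rj
            longest = length   # edges are sorted, so the last merge is longest
    return longest

def ph_landmarks(points, delta, m, score_fn=h0_max_persistence):
    """Representative (PH-I) selection: smallest scores first, super-outliers last."""
    scores, super_outliers = {}, []
    for y, p in enumerate(points):
        link = [q for x, q in enumerate(points)
                if x != y and math.dist(p, q) <= delta]
        if not link:                       # assumed super-outlier criterion
            super_outliers.append(y)
        else:
            scores[y] = score_fn(link)     # PH is computed on the link only
    ranked = sorted(scores, key=scores.get)
    return (ranked + super_outliers)[:m]

# Four corners of a unit square plus one far-away point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(ph_landmarks(points, delta=1.5, m=3))  # [0, 1, 2]
print(ph_landmarks(points, delta=1.5, m=5))  # [0, 1, 2, 3, 4]
```

The isolated point at index 4 enters the landmark set only when the budget exceeds the four scored points.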
4. Computational Complexity
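Let $N$ be the number of points, $m$ the landmark budget, and $k$ the average neighborhood size, where typically $k \ll N$. The main computational costs are:
- Building the Vietoris–Rips complex of a neighborhood up to dimension 2: $O(k^3)$ simplices in the worst case (all 3-vertex cliques).
- Computing persistence via Ripser: cubic in the number of simplices in the worst case, typically much faster.
- Total over all points: $N$ times the per-neighborhood cost.
- Sorting outlier scores: $O(N \log N)$.
- Top-$m$ selection: $O(m)$.
Thus, overall cost is dominated by the $N$ local PH computations, with $k$ increasing with $\delta$, so $\delta$ is chosen to keep $k$ moderate (Stolz, 2021).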
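To make the dependence on neighborhood size concrete, the following sketch counts the simplices of a complete Vietoris–Rips complex up to dimension 2 on a given number of vertices. The function name and the sample sizes are illustrative assumptions.

```python
from math import comb

def max_simplices_up_to_dim2(k):
    """Worst-case simplex count on k vertices: vertices + edges + triangles."""
    return comb(k, 1) + comb(k, 2) + comb(k, 3)

for k in (10, 50, 200):
    print(k, max_simplices_up_to_dim2(k))
# 10 175
# 50 20875
# 200 1333500
```

Going from 50 to 200 neighbors multiplies the worst-case complex size by roughly 64, which is why the radius is kept small enough for neighborhoods to stay moderate.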
5. Experimental Evaluation on Synthetic Data
PH-landmarks was benchmarked on six synthetic point clouds, each a mixture of signal (structured manifold) and noise components:
- 3D examples:
  - Sphere–cube
  - Sphere–plane
  - Sphere–line, Sphere–Laplace-line (with additive Laplace noise)
- 4D examples:
  - Torus (plus thickened noise)
  - Klein bottle (similarly thickened)
For each dataset, sampling density was varied from $0.05$ to $1.0$. The primary metric was the “fraction of signal landmarks” (i.e., selected points drawn from the ground-truth manifold). Standard deviation over 20 randomizations was recorded for stochastic methods.
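The metric is simple to compute when ground-truth labels are available. The sketch below assumes a boolean list marking which points come from the signal; the names, the toy labels, and the random baseline are illustrative and not data from the paper.

```python
import random
import statistics

def signal_fraction(landmarks, is_signal):
    """Share of selected landmark indices whose ground-truth label is signal."""
    if not landmarks:
        raise ValueError("landmark set is empty")
    return sum(1 for i in landmarks if is_signal[i]) / len(landmarks)

is_signal = [True, True, True, False, False]
print(signal_fraction([0, 1, 3, 4], is_signal))  # 0.5

# Mean and standard deviation over 20 random draws, as for a stochastic method.
rng = random.Random(0)
runs = [signal_fraction(rng.sample(range(5), 3), is_signal) for _ in range(20)]
print(statistics.mean(runs), statistics.stdev(runs))
```

For deterministic selection rules a single value is reported, so no standard deviation applies.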
6. Quantitative Comparison with Competing Subsampling Schemes
PH-landmarks was compared with random-uniform, sequential max–min (de Silva–Carlsson), and robust k-means-- (Chawla–Gionis, with and without explicit outlier removal) subsampling. The following table (for the “sphere–cube” dataset) summarises the fraction of signal landmarks at various sampling densities; columns (a) and (b) are the two k-means-- variants, and values with ± report the standard deviation for stochastic methods:
| Sampling density | Random | MaxMin | k-means-- (a) | k-means-- (b) | PH-I | PH-II |
|---|---|---|---|---|---|---|
| 0.05 | 0.60 ± 0.02 | 0.12 ± 0.05 | 0.58 ± 0.03 | 0.42 ± 0.04 | 0.75 | 0.68 |
| 0.10 | 0.60 ± 0.02 | 0.08 ± 0.04 | 0.60 ± 0.03 | 0.55 ± 0.04 | 0.82 | 0.75 |
| 0.50 | 0.60 ± 0.02 | 0.45 ± 0.06 | 0.62 ± 0.02 | 0.61 ± 0.02 | 0.79 | 0.83 |
| 1.00 | 0.60 ± 0.02 | 0.60 ± 0.02 | 0.60 ± 0.02 | 0.60 ± 0.02 | 0.60 | 0.60 |
At low sampling densities, PH-I and PH-II consistently achieve higher signal fractions than all competitors. Max–min is highly susceptible to noise in low-density regions, resulting in very low signal coverage at small sampling densities. Random and k-means-- maintain signal fractions near $0.60$ but lack explicit outlier rejection. PH-landmarks outperforms by down-weighting or excluding outliers, with PH-I favored when the sampling density is low or noise is uniform, and PH-II performing best with clustered noise or high sampling density (Stolz, 2021).
7. Conclusions and Significance
PH-landmarks establishes a theoretically motivated, outlier-robust framework for topological landmark selection in persistent homology. By leveraging local PH computations tied to the Mayer–Vietoris sequence, the method efficiently ranks and selects representative or vital subsamples, preserving topological signal in the presence of substantial noise. Computational tractability is maintained as all PH calculations are localized. Benchmarking on synthetic data indicates that PH-landmarks surpasses random, max–min, and robust k-means-- schemes at low to moderate sampling densities, demonstrating reliable outlier resistance and signal fidelity (Stolz, 2021). This suggests PH-landmarks constitutes a practical advance in scalable and robust topological data analysis.