
PH-landmarks: Outlier-Robust PH Subsampling

Updated 16 December 2025
  • The paper introduces PH-landmarks, a method that uses local PH outlier scores to select a fixed-size landmark set that preserves topological signals in noisy datasets.
  • It computes persistent homology on local δ-neighborhoods via a δ-link construction; the Mayer–Vietoris sequence supplies the theoretical motivation for the resulting outlier score.
  • Experimental results on synthetic datasets demonstrate that PH-landmarks outperforms traditional methods like random, max–min, and robust k-means in capturing the true data signal.

PH-landmarks is an outlier-robust subsampling methodology for landmark selection in persistent homology (PH) computations, designed to efficiently capture the topological signal of large, noisy point clouds. Motivated by the Mayer–Vietoris sequence, the algorithm assigns each point a local PH-based outlier score by evaluating the maximal persistence of homology classes in the δ-link of its neighborhood. Using these scores, PH-landmarks selects a fixed-size subset of representative or vital points whose inclusion preserves the topological structure of the data while discarding noise and outliers. Empirical comparisons on synthetic datasets demonstrate its superiority over conventional random, sequential max–min, and robust $k$-means-based landmarking for low to moderate sampling densities in the presence of outlier noise (Stolz, 2021).

1. Mathematical Formulation and Outlier Definition

Given a point cloud $D = \{y_1, \dots, y_n\} \subset \mathbb{R}^d$, the PH-landmarks approach selects a subset $L \subset D$ of $m \ll n$ landmarks such that the Vietoris–Rips PH of $L$ approximates that of the “signal” in $D$, mitigating the influence of outliers. Two core parameters are fixed: a local neighborhood radius $\delta > 0$ and the desired number of landmarks $m$.

For each $y \in D$, the local neighborhood is defined as:

$$\Delta_y = \{ z \in D \setminus \{y\} : \|z - y\| \leq \delta \}.$$
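The neighborhood definition can be illustrated with a brute-force sketch in pure Python. The function name `delta_neighborhood` and the toy points are hypothetical, chosen only for illustration; a spatial index would be preferable for large $n$.

```python
import math

def delta_neighborhood(points, i, delta):
    """Indices j != i whose Euclidean distance to points[i] is at most delta."""
    y = points[i]
    return [j for j, z in enumerate(points)
            if j != i and math.dist(y, z) <= delta]

# Toy data (illustrative only): three nearby points and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.8), (3.0, 3.0)]
print(delta_neighborhood(points, 0, 1.0))  # [1, 2]
print(delta_neighborhood(points, 3, 1.0))  # []
```

The isolated point has an empty neighborhood, which is the situation the super-outlier rule in this section is meant to catch.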

The Vietoris–Rips complex $\mathrm{VR}(\Delta_y)$ is constructed at all scales. The δ-link $Lk^\delta(y)$ is the subcomplex comprising all simplices in $\mathrm{VR}(\Delta_y)$ not containing $y$. Its “star closure” $\overline{\mathrm{St}}^\delta(y)$ is the union of $Lk^\delta(y)$ and all simplices that contain $y$ and whose remaining vertices lie in $\Delta_y$; by construction, this is contractible and serves as a local neighborhood for Mayer–Vietoris analysis.

The local PH-score quantifying outlierness is:

$$|\mathrm{PH}_n(Lk^\delta(y))| := \max_{[b, e) \in \mathcal{B}_n(Lk^\delta(y))}(e - b),$$

where $\mathcal{B}_n(\cdot)$ denotes the persistence barcode in dimension $n$ (excluding infinite intervals). Two variants are used:

  • $out\_PH^{0,1,2}(y) := \max_{n=0,1,2} |\mathrm{PH}_n(Lk^\delta(y))|$ (captures large deviations in low dimensions)
  • $out\_PH^1(y) := |\mathrm{PH}_1(Lk^\delta(y))|$ (sensitive only to 1D topological features)
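As a rough illustration of the score, the sketch below computes the largest finite bar length in a chosen dimension for a small point set, using the standard column reduction over $\mathbb{Z}/2$ on a Vietoris–Rips filtration. It is a pure-Python toy, not the Ripser-based computation mentioned later in the text; the function names are hypothetical, and it assumes that bars in dimension `dim` require simplices up to dimension `dim + 1`.

```python
import math
from itertools import combinations

def rips_simplices(points, max_dim):
    """(filtration value, simplex) pairs of the Rips filtration, faces first."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    simplices = [(0.0, (i,)) for i in range(n)]
    for dim in range(1, max_dim + 1):
        for s in combinations(range(n), dim + 1):
            # A simplex enters at the length of its longest edge.
            val = max(d[a][b] for a, b in combinations(s, 2))
            simplices.append((val, s))
    # Sort by value, then dimension, so every face precedes its cofaces.
    simplices.sort(key=lambda t: (t[0], len(t[1]), t[1]))
    return simplices

def max_finite_persistence(points, dim):
    """Largest e - b over finite bars [b, e) in dimension `dim` (0.0 if none)."""
    simplices = rips_simplices(points, dim + 1)
    index = {s: k for k, (_, s) in enumerate(simplices)}
    pivot_owner = {}  # lowest boundary index -> reduced column (set of indices)
    best = 0.0
    for val, s in simplices:
        if len(s) == 1:
            continue  # vertices have empty boundary
        col = {index[f] for f in combinations(s, len(s) - 1)}
        # Standard reduction: cancel the lowest entry while it is already owned.
        while col and max(col) in pivot_owner:
            col ^= pivot_owner[max(col)]
        if col:
            low = max(col)
            pivot_owner[low] = col
            birth_val, birth_simplex = simplices[low]
            if len(birth_simplex) - 1 == dim:
                best = max(best, val - birth_val)
    return best

# Unit square: the 1-cycle is born at scale 1 and dies at the diagonal length.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(max_finite_persistence(square, 1))  # about 0.414, i.e. sqrt(2) - 1
print(max_finite_persistence(square, 0))  # 1.0
```

Applied to the points of $\Delta_y$, `max_finite_persistence(nbhd, 1)` would play the role of $out\_PH^1(y)$ in this sketch, and the maximum over `dim` in `(0, 1, 2)` the role of $out\_PH^{0,1,2}(y)$.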

A point $y$ with $|\Delta_y| < 2$ is deemed a “super-outlier,” indicating extreme isolation. Such points are ranked last and selected only if $|D|$ minus the number of super-outliers is less than $m$, so that isolated points enter the landmark set only when the budget cannot otherwise be filled.

Two landmarking strategies are defined:

  • “Representative” (PH-I): select $m$ points with smallest outlier score.
  • “Vital” (PH-II): select $m$ points with largest outlier score.

2. Mayer–Vietoris Motivation and Theoretical Rationale

The PH-landmarks methodology is rooted in the Mayer–Vietoris long exact sequence. For a complex $X = A \cup B$:

$$\dots \to H_n(A \cap B) \xrightarrow{\Phi_*} H_n(A) \oplus H_n(B) \xrightarrow{\Psi_*} H_n(X) \xrightarrow{\partial_*} H_{n-1}(A \cap B) \to \dots$$

Setting $A = X \setminus \{y\}$, $B = \overline{\mathrm{St}}(y)$ (which is contractible), and $A \cap B = Lk(y)$: if the reduced homology of $Lk(y)$ vanishes in degrees $n$ and $n-1$, then $H_n(A) \cong H_n(X)$ for $n > 0$, so removing $y$ leaves the degree-$n$ homology unchanged. In practical data settings, $H_n(Lk^\delta(y))$ rarely vanishes exactly, but the largest finite-persistence class in $\mathrm{PH}(Lk^\delta(y))$ is used to gauge the deviation from this isomorphism. Thus, points with small $|\mathrm{PH}_n(Lk^\delta(y))|$ are inessential to global PH, while large values indicate points critical to the dataset’s topology (Stolz, 2021).
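To spell out the exactness argument: since $B$ is contractible, $H_n(B) = 0$ for $n > 0$, and the relevant segment of the sequence reduces to the following, where $\iota_*$ (a label introduced here for illustration) denotes the map induced by the inclusion $A \hookrightarrow X$:

$$H_n(Lk(y)) \xrightarrow{\Phi_*} H_n(A) \xrightarrow{\iota_*} H_n(X) \xrightarrow{\partial_*} H_{n-1}(Lk(y)).$$

Exactness at $H_n(A)$ gives $\ker \iota_* = \operatorname{im} \Phi_*$, which is zero when $H_n(Lk(y)) = 0$, so $\iota_*$ is injective. Exactness at $H_n(X)$ gives $\operatorname{im} \iota_* = \ker \partial_*$, which is all of $H_n(X)$ when $H_{n-1}(Lk(y)) = 0$ (taking reduced homology for $n = 1$), so $\iota_*$ is surjective. Both conditions together yield $H_n(A) \cong H_n(X)$.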

3. Algorithmic Workflow and Pseudocode

The annotated PH-landmarks (representative version) workflow is as follows:

  1. Initialization: $S \leftarrow \varnothing$ (super-outlier set).
  2. For each $y \in D$:
    • Compute $\Delta_y$.
    • If $|\Delta_y| < 2$: assign $y$ to $S$.
    • Otherwise: build the Vietoris–Rips filtration of $\Delta_y$ up to dimension 2 and compute $out\_PH(y)$.
  3. Sort $D' = D \setminus S$ by non-decreasing $out\_PH(y)$.
  4. Select the first $\min(m, |D'|)$ points for $L$ (the landmark set).
  5. If $|L| < m$, append arbitrary points from $S$ to reach size $m$.
  6. Return $(L, S)$.
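The steps above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `ph_landmarks`, the `score_fn` callback, and the `representative` flag are hypothetical, and the demo uses a stand-in score (the neighborhood diameter) where the method would use the local PH score.

```python
import math
from itertools import combinations

def ph_landmarks(points, m, delta, score_fn, representative=True):
    """Return (landmark indices, super-outlier indices) for the workflow above."""
    scored, super_outliers = [], []
    for i, y in enumerate(points):
        nbhd = [z for j, z in enumerate(points)
                if j != i and math.dist(y, z) <= delta]
        if len(nbhd) < 2:
            super_outliers.append(i)      # step 2: too isolated to score
        else:
            scored.append((score_fn(nbhd), i))
    # Step 3: ascending scores for PH-I; descending would give PH-II.
    scored.sort(key=lambda t: t[0], reverse=not representative)
    landmarks = [i for _, i in scored[:m]]                # step 4
    landmarks += super_outliers[:m - len(landmarks)]      # step 5
    return landmarks, super_outliers

def diameter(nbhd):
    """Stand-in score for the demo only; NOT the PH-based outlier score."""
    return max(math.dist(p, q) for p, q in combinations(nbhd, 2))

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
landmarks, supers = ph_landmarks(points, m=2, delta=0.5, score_fn=diameter)
print(supers)  # [4]
```

Passing the score as a callback keeps the selection logic separate from the persistence computation, so either score variant could be swapped in without touching the ranking code.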

All Vietoris–Rips computations are performed locally on $\Delta_y$, significantly reducing computational complexity for large $n$. Super-outliers are included as landmarks only when unavoidable, guarding against isolated noise dominating the landmark set.

4. Computational Complexity

Let $n = |D|$, let $m$ be the landmark budget, and let $k$ bound the neighborhood size, $|\Delta_y| \leq k$, where typically $k \ll n$. The main computational costs are:

  • Building $\mathrm{VR}(\Delta_y)$ up to dimension 2: $O(k^3)$ in the worst case (up to $\binom{k}{3}$ triangles).
  • Computing persistence via Ripser: $O(k^3)$ worst-case, typically much faster.
  • Total over all points: $O(n k^3)$.
  • Sorting $n$ outlier scores: $O(n \log n)$.
  • Top $m$ selection: $O(m)$.

Thus, overall cost is $O(n C(\delta) + n \log n)$ with $C(\delta) \approx k^3$ increasing with $\delta$, so $\delta$ is chosen to keep $k$ moderate (Stolz, 2021).
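A quick arithmetic sketch shows why keeping $k$ moderate matters. The helper names are hypothetical, the estimate ignores constants, and the values $n = 3000$, $k = 20$ are chosen only for illustration.

```python
from math import comb, log2

def local_simplex_count(k):
    """Simplices of dimension <= 2 on k vertices when all pairs are connected."""
    return comb(k, 1) + comb(k, 2) + comb(k, 3)

def total_cost_estimate(n, k):
    """Rough operation count n * k^3 + n * log2(n), constants ignored."""
    return n * k**3 + n * log2(n)

print(local_simplex_count(20))   # 1350
print(comb(3000, 3))             # 4495501000 triangles in one global complex
print(total_cost_estimate(3000, 20) < comb(3000, 3))  # True
```

Under these assumptions, 3000 local complexes of at most 1350 simplices each are far cheaper to handle than a single complex on all 3000 points.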

5. Experimental Evaluation on Synthetic Data

PH-landmarks was benchmarked on six synthetic point clouds ($n = 3000$), each a mixture of signal (structured manifold) and noise components:

  • 3D examples (with $p \in \{0.3, 0.6, 0.9\}$):
    • Sphere–cube ($p \cdot \mathrm{Uniform}(S^2) + (1-p) \cdot \mathrm{Uniform}([-1,1]^3)$)
    • Sphere–plane ($p \cdot \mathrm{Uniform}(S^2) + (1-p) \cdot \mathrm{Uniform}([-3,3]^2)$ in the plane $z = 0$)
    • Sphere–line, Sphere–Laplace-line (with additive Laplace noise)
  • 4D examples:
    • Torus ($p \cdot \mathrm{Uniform}(T^2 \subset \mathbb{R}^4)$ plus thickened noise)
    • Klein bottle (similarly thickened)

For each dataset, sampling density $s = m/n$ was varied from $0.05$ to $1.0$. The primary metric was the “fraction of signal landmarks” (i.e., selected points drawn from the ground-truth manifold). Standard deviation over 20 randomizations was recorded for stochastic methods.
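The sphere–cube setup and the evaluation metric can be sketched as below. This is an illustration under stated assumptions: each point is drawn from the sphere independently with probability $p$ (the source does not say whether the split is exact or random), and the function names are hypothetical.

```python
import math
import random

def sample_sphere_cube(n, p, rng):
    """Return (points, is_signal): S^2 with probability p, else cube [-1,1]^3."""
    points, is_signal = [], []
    for _ in range(n):
        if rng.random() < p:
            # Normalizing a Gaussian vector gives a uniform point on S^2.
            v = [rng.gauss(0.0, 1.0) for _ in range(3)]
            r = math.sqrt(sum(c * c for c in v))
            points.append(tuple(c / r for c in v))
            is_signal.append(True)
        else:
            points.append(tuple(rng.uniform(-1.0, 1.0) for _ in range(3)))
            is_signal.append(False)
    return points, is_signal

def signal_fraction(landmarks, is_signal):
    """Fraction of selected indices that come from the signal component."""
    return sum(is_signal[i] for i in landmarks) / len(landmarks)

rng = random.Random(0)
points, is_signal = sample_sphere_cube(3000, 0.6, rng)
random_landmarks = rng.sample(range(3000), 150)   # s = 0.05
print(signal_fraction(random_landmarks, is_signal))
```

A uniformly random subsample is expected to score close to $p$ on this metric, which is the baseline any outlier-aware scheme has to beat.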

6. Quantitative Comparison with Competing Subsampling Schemes

PH-landmarks was compared with random-uniform, sequential max–min (de Silva–Carlsson), and robust $k$-means-- (Chawla–Gionis, with and without explicit outlier removal) subsampling. The following table (for $p = 0.6$ in the “sphere–cube” dataset) summarises the fraction of signal landmarks at various sampling densities ($s$):

| $s = m/n$ | Random | MaxMin | $k^{--}$ | $k^{--}_{\mathrm{outf}}$ | PH-I | PH-II |
|---|---|---|---|---|---|---|
| 0.05 | 0.60 ± 0.02 | 0.12 ± 0.05 | 0.58 ± 0.03 | 0.42 ± 0.04 | 0.75 | 0.68 |
| 0.10 | 0.60 ± 0.02 | 0.08 ± 0.04 | 0.60 ± 0.03 | 0.55 ± 0.04 | 0.82 | 0.75 |
| 0.50 | 0.60 ± 0.02 | 0.45 ± 0.06 | 0.62 ± 0.02 | 0.61 ± 0.02 | 0.79 | 0.83 |
| 1.00 | 0.60 ± 0.02 | 0.60 ± 0.02 | 0.60 ± 0.02 | 0.60 ± 0.02 | 0.60 | 0.60 |

At low $s$ ($s \lesssim 0.3$), PH-I and PH-II consistently achieve higher signal fractions than all competitors. Max–min is highly susceptible to noise in low-density regions, resulting in very low signal coverage at small $s$. Random and $k$-means-- maintain signal fractions near $p$ but lack explicit outlier rejection. PH-landmarks outperforms them by down-weighting or excluding outliers, with PH-I favored when $p$ is low or noise is uniform, and PH-II performing best with clustered noise or high $p$ (Stolz, 2021).
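The sensitivity of max–min to isolated points follows directly from how it selects: each step adds the point farthest from everything chosen so far. A minimal sketch of sequential max–min selection, with a hypothetical function name and toy data:

```python
import math

def maxmin_landmarks(points, m, start=0):
    """Sequential max-min: repeatedly add the point farthest from the chosen set."""
    m = min(m, len(points))
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen landmark.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

# Toy data (illustrative only): a tight cluster plus one far-away outlier.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
outlier = [(5.0, 5.0)]
print(maxmin_landmarks(cluster + outlier, 2))  # [0, 4]
```

With a budget of two landmarks, the outlier is picked immediately after the seed point, which mirrors the low signal fractions reported for MaxMin at small $s$.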

7. Conclusions and Significance

PH-landmarks establishes a theoretically motivated, outlier-robust framework for topological landmark selection in persistent homology. By leveraging local PH computations tied to the Mayer–Vietoris sequence, the method efficiently ranks and selects representative or vital subsamples, preserving topological signal in the presence of substantial noise. Computational tractability is maintained as all PH calculations are localized. Benchmarking on the six synthetic datasets indicates that PH-landmarks surpasses random, max–min, and robust $k$-means-- schemes at low to moderate sampling densities, demonstrating reliable outlier resistance and signal fidelity (Stolz, 2021). This suggests PH-landmarks constitutes a practical advance in scalable and robust topological data analysis.
