Subsampling Methods in Topological Data Analysis
- Subsampling methods are efficient algorithmic strategies that extract persistent topological features while significantly reducing computational costs.
- Techniques such as random sampling, landmark selection, and adaptive density approaches balance bias and variance to preserve critical invariants like Betti numbers.
- Empirical and theoretical guarantees ensure near-optimal preservation of topological invariants, making TDA scalable for large, high-dimensional datasets.
Subsampling methods for topological data analysis (TDA) are algorithmic strategies designed to reduce the computational cost of extracting topological features from large, high-dimensional datasets. These methodologies aim to preserve the essential topological descriptors—such as persistent homology, Betti curves, and Euler characteristic functions—while operating on a fraction of the original data or coordinates. Techniques include random and structured point-cloud subsampling, sparse landmark selection, adaptive density sampling, directional ε-nets, and low-rank representation-based column selection. By leveraging stability results for persistent homology and geometric considerations of data manifolds, subsampling enables practical computation and statistical inference with rigorous control over error bounds and loss of topological fidelity.
1. Foundations and Rationale
The computational complexity of persistent homology and related TDA pipelines scales super-linearly (often exponentially) in the number of data points, dimensions, or simplices, rendering full-scale analysis infeasible for large datasets. Subsampling methods alleviate this bottleneck by working with representative subsets whose topological invariants (e.g., persistence diagrams) are provably close to those of the full dataset under various metrics, such as the bottleneck or Wasserstein distances (Chazal et al., 2014, Cao et al., 2022, Stier et al., 3 Sep 2025, Minian, 26 Nov 2025). The stability of persistent homology under Gromov–Hausdorff or sup-norm perturbations (Cao et al., 2022) ensures that appropriately chosen subsamples can maintain global and local topological features. Key design principles include:
- Bias–variance tradeoff: Selecting subsample size and number balances approximation bias and estimator variance (Cao et al., 2022, Chazal et al., 2014).
- Geometric coverage: Dense sampling or ε-net covering controls the loss of observability for vertices and features (Fasy et al., 15 Nov 2025).
- Feature preservation: Strategies such as PH-aware landmark selection or strong collapses guarantee critical cycles and Betti numbers are retained (Stolz, 2021, Minian, 26 Nov 2025).
- Memory and runtime reduction: Subsampling shifts computational burden from intractable O(n^k) operations to multiple O(m^k) computations with m≪n (Chazal et al., 2014, Stier et al., 3 Sep 2025).
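To make the scale of this reduction concrete, the toy count below compares the number of simplices a Vietoris–Rips complex can contain in the worst case (at most C(n, k+1) simplices of dimension k on n vertices) for one full-data computation versus B small subsampled computations; the sizes are illustrative only and not drawn from any of the cited benchmarks.

```python
from math import comb

# Toy illustration (not tied to any specific library): a Rips complex on n
# vertices can contain up to C(n, k+1) simplices of dimension k, so the full
# complex can dwarf the combined size of many small subsampled complexes.
n, m, B, k = 100_000, 100, 500, 2   # full size, subsample size, #subsamples, homology dim

full_worst_case = comb(n, k + 1)            # simplices the full complex could contain
subsampled_worst_case = B * comb(m, k + 1)  # total across all B subsampled complexes

print(f"full complex (worst case):      {full_worst_case:.3e}")
print(f"B subsampled complexes (worst): {subsampled_worst_case:.3e}")
```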
2. Taxonomy of Subsampling Techniques
Several distinct approaches have emerged, each suited to specific data modalities and theoretical regimes. Table 1 compares essential categories.
| Technique | Main Mechanism | Typical Use Case |
|---|---|---|
| Random/Bootstrap Sampling | Uniform or weighted subsamples | Generic point clouds, statistical inference (Chazal et al., 2014, Cao et al., 2022) |
| Landmark Selection | PH‐aware local filtering, MaxMin | Large clouds, outlier/noise scenarios (Stolz, 2021, Minian, 26 Nov 2025) |
| ε-Net on Spheres | Geometric covering of directions | Directional transforms, stratified data (Fasy et al., 15 Nov 2025) |
| δ-Core Strong Collapse | Removing dominated vertices | Simplicial complexes, homotopy reduction (Minian, 26 Nov 2025) |
| Sparse Coordinates/QR Pivot | Low-rank factorization, DEIM pivot | Persistence images, multi-way classification (Guo et al., 2017) |
| Adaptive Density Sampling | Numerical geometry, coverage tests | Algebraic varieties, manifold sampling (Dufresne et al., 2018) |
| Subsequence Embedding | Arithmetic subsequence extraction | Irregular time series, TDE+TDA (Dakurah et al., 17 Oct 2024) |
| Stochastic Filtration (ALBATROSS) | Iterated small random subsamples | Large biological, neuroimaging matrices (Stier et al., 3 Sep 2025) |
Each method exploits specific algebraic, geometric, or statistical properties to minimize subsample size while preserving critical topological descriptors.
3. Algorithmic Frameworks and Representative Procedures
Subsampling pipelines generally consist of three core stages: (a) selection of subsample points or coordinates, (b) computation of topological transforms (persistent homology, Betti/Euler functions), and (c) aggregation or model fitting for downstream tasks.
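A minimal sketch of this three-stage pipeline is given below, assuming ripser as the persistent-homology backend and using a deliberately simple aggregate (the mean of the maximal H1 persistence across subsamples); the subsample size, number of subsamples, and choice of summary statistic are illustrative, not prescriptions from the cited works.

```python
import numpy as np
from ripser import ripser  # one possible PH backend; any Rips PH routine would do

def subsample_ph_summary(X, m=100, B=200, maxdim=1, rng=None):
    """Stages (a)-(c): draw B random subsamples of size m, compute persistence
    diagrams on each, and aggregate a simple scalar summary (here, the maximal
    H1 persistence) by averaging across subsamples."""
    rng = np.random.default_rng(rng)
    summaries = []
    for _ in range(B):
        idx = rng.choice(len(X), size=m, replace=False)   # (a) random subsample
        dgms = ripser(X[idx], maxdim=maxdim)['dgms']       # (b) persistent homology
        h1 = dgms[1]
        max_pers = float((h1[:, 1] - h1[:, 0]).max()) if len(h1) else 0.0
        summaries.append(max_pers)
    return float(np.mean(summaries)), float(np.std(summaries))  # (c) aggregation

# Example: a noisy circle with many points; the averaged summary remains stable
# even though each PH computation only ever sees m points.
theta = np.random.default_rng(0).uniform(0, 2 * np.pi, 20_000)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.default_rng(1).normal(size=(20_000, 2))
print(subsample_ph_summary(X, m=100, B=50))
```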
Sparse-TDA: Low-Rank Selection via Pivoted QR
The Sparse-TDA algorithm (Guo et al., 2017) computes persistence images (PI) from data samples, stacks these as rows of a matrix A, approximates A by its rank-r truncated SVD, and applies a column-pivoted QR on the right singular vectors to select the r most discriminative PI coordinates. These indices J reduce each PI vector from high dimension d to r, supporting efficient and interpretable classification. The pipeline is:
- Compute persistence diagrams D_i via PH for each sample X_i.
- Convert each D_i to a PI vector x_i ∈ ℝ^d.
- Form A ∈ ℝ^{N×d}; approximate via truncated SVD to rank r.
- Apply QR with column pivoting to V_r^T from the SVD, obtaining indices J.
- Extract the reduced features x_i[J] (the PI coordinates indexed by J); use them for classification.
Error bounds show near-optimal approximation, and empirical results demonstrate ~10×–50× speedups in classifier training (Guo et al., 2017).
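A minimal sketch of the coordinate-selection core of this pipeline follows, using NumPy/SciPy; the persistence-image featurization is assumed to have already produced the matrix A, and the helper name `sparse_tda_select` is ours, not from Guo et al. (2017).

```python
import numpy as np
from scipy.linalg import qr

def sparse_tda_select(A, r):
    """Sketch of the coordinate selection described above: A stacks one
    persistence-image vector per row (shape N x d); returns the indices J of
    the r selected PI coordinates."""
    # Rank-r truncated SVD of the PI matrix.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Vr_t = Vt[:r, :]                      # r x d: top-r right singular vectors, transposed
    # Column-pivoted QR on V_r^T picks the r most "independent" PI coordinates.
    _, _, piv = qr(Vr_t, pivoting=True)
    return piv[:r]

# Usage sketch: reduce each PI vector from d coordinates to r, then train any classifier.
# A = np.vstack([pi_vector(d_i) for d_i in diagrams])   # hypothetical PI featurization
# J = sparse_tda_select(A, r=64)
# A_reduced = A[:, J]
```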
PH-landmarks: Outlier-Robust Local PH Subsampling
The PH-landmarks method (Stolz, 2021) ranks each point in a cloud by the maximal persistence of its local neighborhood filtration. Landmarks are chosen to minimize the bottleneck distance between the full and subsampled PH, yielding robustness against outliers and dense regions. The algorithm computes local PH for each point's δ-neighborhood, sorts points by outlierness, and selects the m landmarks inducing the lowest perturbation. In noisy data this outperforms MaxMin and random selection schemes.
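The sketch below illustrates only the local-PH scoring idea, again assuming ripser as the PH backend; the exact outlierness score, tie-breaking, and selection rule in Stolz (2021) are more refined, and the δ radius, the treatment of isolated points, and the landmark count here are illustrative choices.

```python
import numpy as np
from scipy.spatial import cKDTree
from ripser import ripser  # assumed PH backend, as above

def ph_landmark_scores(X, delta, maxdim=1):
    """Simplified local-PH scoring: for each point, compute persistent homology
    of its delta-neighborhood and record the maximal finite persistence."""
    tree = cKDTree(X)
    scores = np.zeros(len(X))
    for i, x in enumerate(X):
        nbr = tree.query_ball_point(x, r=delta)
        if len(nbr) < 3:                      # too few neighbors to form any cycle
            scores[i] = np.inf                # illustrative choice for isolated points
            continue
        dgms = ripser(X[nbr], maxdim=maxdim)['dgms']
        pers = np.concatenate([d[:, 1] - d[:, 0] for d in dgms if len(d)])
        pers = pers[np.isfinite(pers)]
        scores[i] = pers.max() if len(pers) else 0.0
    return scores

# Landmarks are then selected from the ranking described above, e.g.:
# order = np.argsort(ph_landmark_scores(X, delta=0.3))
# landmarks = X[order[:m]]   # which end of the ranking to keep is method-specific
```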
ε-Net Covering for Directional Transforms
Subsampling for directional transforms (e.g., Persistent Homology Transform, Euler Characteristic Transform) relies on covering the sphere S^d with a minimal ε-net (Fasy et al., 15 Nov 2025). Greedy farthest-point seeding constructs sample sets of directions that guarantee all observable regions are hit, with the size of the net scaling as O(ε^{-d}). This ensures faithfulness but demands balancing cost against loss of features.
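A greedy farthest-point construction of such a net can be sketched as follows; the random candidate pool and the geodesic-distance stopping rule are illustrative simplifications of the covering construction in Fasy et al. (15 Nov 2025).

```python
import numpy as np

def greedy_sphere_net(eps, dim=2, n_candidates=20_000, seed=0):
    """Greedy farthest-point construction of an eps-net of directions on S^dim,
    drawn from a large random candidate pool (exact covering guarantees require
    a sufficiently dense candidate set)."""
    rng = np.random.default_rng(seed)
    cand = rng.normal(size=(n_candidates, dim + 1))
    cand /= np.linalg.norm(cand, axis=1, keepdims=True)    # unit directions on S^dim

    net = [cand[0]]
    # Geodesic distance on the sphere = arccos of the dot product.
    dist = np.arccos(np.clip(cand @ net[0], -1.0, 1.0))
    while dist.max() > eps:
        nxt = cand[np.argmax(dist)]                         # farthest uncovered direction
        net.append(nxt)
        dist = np.minimum(dist, np.arccos(np.clip(cand @ nxt, -1.0, 1.0)))
    return np.array(net)

directions = greedy_sphere_net(eps=0.3, dim=2)
print(len(directions), "directions cover the candidate pool at radius 0.3")
```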
δ-Core Strong Collapse
δ-core subsampling (Minian, 26 Nov 2025) removes dominated vertices from a point cloud or simplicial complex, preserving homotopy type and inducing a δ-interleaving of Rips filtrations. This enables substantial reduction in simplex count (often >60%), lowers bottleneck and Wasserstein distances compared to alternatives, and maintains computational tractability for persistent homology.
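The domination test underlying strong collapses can be sketched as below for the neighborhood graph at scale δ; this illustrates the general dominated-vertex idea rather than the exact δ-core construction of Minian (26 Nov 2025), and the dense-distance-matrix implementation is only suitable for small examples.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def delta_core(X, delta):
    """Sketch of dominated-vertex reduction at scale delta: in the graph with
    edges {i, j : d(i, j) <= delta}, vertex v is dominated if its closed
    neighborhood is contained in that of some other vertex; dominated vertices
    are removed iteratively. Returns indices of the surviving vertices."""
    D = squareform(pdist(X))
    keep = np.arange(len(X))
    changed = True
    while changed:
        changed = False
        A = D[np.ix_(keep, keep)] <= delta           # closed neighborhoods (diagonal is True)
        for i in range(len(keep)):
            # i is dominated by some j != i if N[i] is a subset of N[j].
            if any(np.all(A[i] <= A[j]) for j in range(len(keep)) if j != i):
                keep = np.delete(keep, i)
                changed = True
                break                                 # recompute neighborhoods and restart
    return keep

# core_idx = delta_core(X, delta=0.2); X_core = X[core_idx]
```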
4. Theoretical Guarantees and Error Bounds
Subsampling strategies are informed by several key results:
- Stability of persistent homology: Under Gromov–Hausdorff or bottleneck metrics, small perturbations of the input data or filtrations yield controlled variations in the persistence diagram (Chazal et al., 2014, Cao et al., 2022, Minian, 26 Nov 2025).
- Bias–variance decomposition: Bootstrapped mean persistence diagrams converge to the ideal diagram at rates O(m^{-1/2}) (variance) and O(n^{-β}) (bias, depending on subsample size and data distribution parameters) (Cao et al., 2022).
- Homology inference for algebraic varieties: For adaptive (δ, ε)-dense samples, provided the homological feature size exceeds 2(ε+δ), true Betti numbers are observed in the sampled diagrams (Dufresne et al., 2018).
- δ-interleaving: δ-core filtration and original filtration are δ-interleaved, so bottleneck distances between their persistence diagrams are at most δ (Minian, 26 Nov 2025).
- Sparse coordinate selection: QR pivoting ensures a polynomial overhead over optimal SVD approximation for feature compression (Guo et al., 2017).
These error guarantees permit principled selection of subsample size, density, and parameters to control topological fidelity.
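Schematically, with constants and exponents left unspecified and with m, n, β as in the bias–variance item above, the bias–variance and δ-interleaving statements take the following form, where D̂_{m,B} denotes the aggregated diagram built from B subsamples of size m, D the ideal diagram of the full data X, and C_δ the δ-core; this is a restatement of the rates listed above, not a precise theorem.

```latex
% Schematic restatement of the rates above; C_1, C_2, and beta depend on the
% estimator and the data distribution (see the cited works for exact statements).
\[
  \mathbb{E}\bigl[d_B\bigl(\widehat{D}_{m,B},\, D\bigr)\bigr]
    \;\lesssim\; \underbrace{C_1\, m^{-1/2}}_{\text{variance}}
    \;+\; \underbrace{C_2\, n^{-\beta}}_{\text{bias}},
  \qquad
  d_B\Bigl(\mathrm{Dgm}\bigl(\mathcal{R}(X)\bigr),\,
           \mathrm{Dgm}\bigl(\mathcal{R}(C_\delta)\bigr)\Bigr) \;\le\; \delta .
\]
```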
5. Comparative Applications, Trade-offs, and Practical Recommendations
Empirical benchmarks and workflow studies reveal characteristic trade-offs:
- Computational efficiency: Memory and time requirements drop from infeasible O(N^k) or exponential costs to tractable O(B·m^k) via subsampling (B subsamples of size m≪N) (Chazal et al., 2014, Stier et al., 3 Sep 2025).
- Classification accuracy: In multi-way tasks (e.g., image texture, mesh posture), Sparse-TDA achieves nearly the discriminatory power of full persistence images, with under 10% accuracy loss and a radical reduction in training time (Guo et al., 2017).
- Geometric content: Adaptive sampling for varieties yields fully recovered Betti numbers and geometric features when sampling density matches theoretical lower bounds (Dufresne et al., 2018).
- Directional transforms: ε-net sizes must be tuned to mesh geometry; oversampling secures fidelity but is often costly, while undersampling can erase vertices or features (Fasy et al., 15 Nov 2025).
- Outlier-robustness: PH-landmarks preserve a high fraction of true signal features even at low sampling densities, outperforming random and MaxMin selection in noisy regimes (Stolz, 2021).
- Memory savings and scalability: Protocols such as ALBATROSS allow PH analysis of datasets with >100,000 points using subsamples of only 30–100 points; accuracy is stable for subsample sizes of 30 or more combined with hundreds of subsamples (Stier et al., 3 Sep 2025).
Recommendations include:
- Use bias/variance rate formulas to select subsample size/time budgets (Cao et al., 2022, Chazal et al., 2014).
- For persistent images, truncated SVD with QR pivoting reliably compresses features with interpretable indices (Guo et al., 2017).
- For algebraic varieties, adaptively tune ε/δ to match the homological feature size (Dufresne et al., 2018).
- For directional transforms, construct ε-nets of size O(ε^{-d}) with ε below the minimal observability angle (Fasy et al., 15 Nov 2025).
- Combine PH-aware local filtering and outlier scoring for robust landmark selection (Stolz, 2021).
- For time series, use subsequence embedding to avoid artifacts from missing or irregular sampling (Dakurah et al., 17 Oct 2024).
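As an illustration of the last recommendation, a subsequence-restricted time-delay embedding can be sketched as follows; the parameter names and the arithmetic-grid restriction rule are illustrative and not taken from Dakurah et al. (17 Oct 2024).

```python
import numpy as np

def subsequence_delay_embedding(x, t, step, dim=2, delay=1):
    """Sketch: restrict a time series to an arithmetic subsequence of its
    (integer) time indices with spacing `step`, then apply a standard
    time-delay embedding to the restricted values."""
    x, t = np.asarray(x, float), np.asarray(t)
    # Keep only observations whose time index lies on the grid t0, t0+step, ...
    on_grid = (t - t[0]) % step == 0
    xs = x[on_grid]
    # Standard time-delay embedding of the regularly spaced subsequence.
    n = len(xs) - (dim - 1) * delay
    return np.stack([xs[i : i + n] for i in range(0, dim * delay, delay)], axis=1)

# cloud = subsequence_delay_embedding(values, time_indices, step=3, dim=3)
# ...then feed `cloud` into any of the PH pipelines sketched above.
```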
6. Extensions and Current Research Directions
Recent work examines blending subsampling with density weighting (Agerberg et al., 2022), streaming parallel computation, multi-scale core formation, and hybrid scoring methods (e.g., combining outlierness with geometric density or curvature) (Stolz, 2021, Minian, 26 Nov 2025). Open problems include:
- Tightening theoretical constants in risk bounds for subsampling estimators (Chazal et al., 2014).
- Optimizing subsampling for very high-dimensional or highly heterogeneous manifolds (Minian, 26 Nov 2025).
- Efficient construction of ε-nets and stratification in combinatorially complex meshes (Fasy et al., 15 Nov 2025).
- Statistical inference on aggregated persistence summaries and feature curves (stable ranks) (Agerberg et al., 2022, Stier et al., 3 Sep 2025).
- Integration with numerically certified algebraic solvers for polynomial systems (Dufresne et al., 2018).
- Practical calibration of parameters (subsample size, density thresholds, net resolution) via cross-validation or adaptive grid search.
Subsampling is now central to practical TDA, making large-scale, multi-class, and real-world applications tractable, while preserving rigorous control over topological invariants.