Denoising data reduction algorithm for Topological Data Analysis

Published 31 Mar 2026 in cs.CG, math.AT, and math.GT | (2603.29248v1)

Abstract: Persistent homology is a central tool in topological data analysis, but its application to large and noisy datasets is often limited by computational cost and the presence of spurious topological features. Noise not only increases data size but also obscures the underlying structure of the data. In this paper, we propose the Refined Characteristic Lattice Algorithm (RCLA), a grid-based method that integrates data reduction with threshold-based denoising in a single procedure. By incorporating a threshold parameter $k$, RCLA removes noise while preserving the essential structure of the data in a single pass. We further provide a theoretical guarantee by proving a stability theorem under a homogeneous Poisson noise model, which bounds the bottleneck distance between the persistence diagrams of the output and the underlying shape with high probability. In addition, we introduce an automatic parameter selection method based on nearest-neighbor statistics. Experimental results demonstrate that RCLA consistently outperforms existing methods, and its effectiveness is further validated on a 3D shape classification task.


Authors (4)

Summary

The paper presents RCLA, which integrates topology-preserving data reduction and denoising into a single automated pipeline using a density threshold k.
It establishes a formal stability theorem ensuring that the persistence diagrams of processed data closely match the true topological signal under homogeneous Poisson noise.
Extensive experiments on synthetic and 3D animal mesh data show RCLA’s superior noise suppression and robust performance in downstream classification tasks.

Denoising Data Reduction for Topological Data Analysis: The Refined Characteristic Lattice Algorithm

Introduction

Persistent homology provides a robust framework for extracting meaningful topological summaries from high-dimensional and complex datasets; however, its computational cost grows rapidly with data size and is further exacerbated by the presence of noise. The work "Denoising data reduction algorithm for Topological Data Analysis" (2603.29248) addresses these challenges by presenting the Refined Characteristic Lattice Algorithm (RCLA), which integrates topology-preserving data reduction and threshold-based denoising into a single, automatically parameterized pipeline. RCLA is rigorously analyzed from both theoretical and empirical perspectives, including a formal stability guarantee under homogeneous Poisson noise and extensive experiments on synthetic and real 3D shape data.

Background: Data Reduction and Denoising in Persistent Homology

While data reduction techniques (e.g., witness complexes, random projection, cluster-based methods) have been standard for overcoming the combinatorial explosion in constructing Vietoris-Rips complexes, they generally treat all data points equally and transmit noise into reduced datasets. In contrast, denoising methods (e.g., Adaptive DBSCAN, LDOF, LUNAR) operate as preprocessing steps but lack guarantees of topological fidelity and require non-trivial parameter tuning.

The Characteristic Lattice Algorithm (CLA) improves computational tractability by partitioning the ambient space into a uniform grid of hypercubes and representing occupied cells by a single sample point, typically the cell center. However, CLA is agnostic to the presence of noise, which it systematically retains.

The Refined Characteristic Lattice Algorithm (RCLA)

RCLA extends CLA by introducing a density threshold $k$: only grid cells containing at least $k$ points are selected for representative extraction, and all points within sparser cells are filtered out as noise. The steps of the procedure are shown in Figure 1.

Figure 1: The steps of the CLA and RCLA algorithms. The key difference is that RCLA applies a cardinality threshold per cell, providing topology-aware denoising.

This refinement ensures that both topological signal preservation and noise removal are handled in a single operation, and the choice of threshold $k$ is automated by modeling the ambient noise as a Poisson process.
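The thresholding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rcla`, the grid anchored at the origin, and the choice of the cell center as representative (mentioned above as typical for CLA) are assumptions made for illustration.

```python
import numpy as np

def rcla(points, delta, k):
    """Return the center of every grid cell that holds at least k points.

    Illustrative sketch only: the grid is anchored at the origin and has
    side length delta in every coordinate (both are assumptions).
    """
    points = np.asarray(points, dtype=float)
    # Integer cell index of each point on the uniform grid.
    cells = np.floor(points / delta).astype(np.int64)
    # Distinct occupied cells and the number of points in each.
    occupied, counts = np.unique(cells, axis=0, return_counts=True)
    # Keep only cells that meet the density threshold.
    dense = occupied[counts >= k]
    # Cell center = lower corner plus half a side length per coordinate.
    return (dense + 0.5) * delta

pts = np.array([[0.1, 0.1], [0.2, 0.3], [0.4, 0.2],  # three points in cell (0, 0)
                [1.5, 1.5],                          # isolated point in cell (1, 1)
                [2.1, 0.2], [2.3, 0.4]])             # two points in cell (2, 0)
reduced = rcla(pts, delta=1.0, k=2)   # drops the isolated cell
cla_like = rcla(pts, delta=1.0, k=1)  # keeps every occupied cell
```

With `k=1` every occupied cell is kept, which corresponds to the CLA behavior described earlier; raising `k` discards sparsely populated cells.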

Stability Guarantee

A central contribution is a formal stability theorem quantifying the topological proximity (in the bottleneck distance) between the persistence diagrams (PDs) of the RCLA output and the ground-truth signal, given homogeneous Poisson noise. For suitable lattice diameter $\delta$ and threshold $k$, the output satisfies

$d_B\left(B_n(X_{\mathrm{shape}}), B_n(X^*_{\delta,k})\right) \leq \sqrt{m} \delta$

with a probability governed by the Poisson intensity; explicit formulas are given for the error probabilities associated with false positives (retained noise cells) and coverage failures (discarded signal). This result provides a rigorous, instance-dependent guarantee on the output quality that previous denoising or reduction schemes lack.
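To make the false-positive term concrete, here is a sketch of how such a probability can be written under a homogeneous Poisson noise model. The notation is assumed for illustration and is not taken from the paper: $\lambda$ is the noise intensity, $m$ the ambient dimension, $\delta$ the side length of a cell, $N$ the number of noise points in a cell containing no signal, and $M$ the number of such cells. The paper's own expressions may differ.

```latex
N \sim \mathrm{Poisson}(\mu), \qquad \mu = \lambda\,\delta^{m},
\qquad
\Pr[N \ge k] \;=\; 1 - \sum_{j=0}^{k-1} e^{-\mu}\,\frac{\mu^{j}}{j!},
\qquad
\Pr[\text{some noise-only cell is retained}] \;\le\; M \,\Pr[N \ge k].
```

The last inequality is a union bound over the $M$ noise-only cells. It shows why a larger $k$ or a smaller $\delta$ lowers the chance of retaining noise, at the price of a higher risk of discarding sparsely sampled signal.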

Automatic Parameter Selection

Selecting $(\delta, k)$ optimally is nontrivial, particularly under unknown and heterogeneous noise. The proposed method draws candidate grid sizes from quantiles of empirical nearest-neighbor distances and estimates the per-cell Poisson intensity through Bayesian modeling, using the observed number of empty cells. The threshold $k$ is determined to control the expected number of retained noise-only cells below a user-specified budget.

Grid candidates are scored using a cost functional penalizing both over-fragmentation (reflected by the number of connected components among cell centers at a specified linking radius) and high variance in representative spacing, and the optimal pair is chosen as the minimizer.
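A scoring function of this kind could look like the sketch below. The exact functional form is not given here, so the weighted sum, the weights `w_frag` and `w_var`, and the use of nearest-neighbor distance as the measure of representative spacing are all assumptions. The quadratic-time loops are meant for small illustrative inputs only.

```python
import math
from statistics import pvariance

def n_components(points, radius):
    """Count connected components when points within `radius` are linked."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(len(points))})

def spacing_variance(points):
    """Variance of each point's distance to its nearest other point (needs >= 2 points)."""
    nearest = [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
               for i, p in enumerate(points)]
    return pvariance(nearest)

def grid_cost(points, radius, w_frag=1.0, w_var=1.0):
    """Hypothetical cost: fragmentation penalty plus spacing-variance penalty."""
    return w_frag * n_components(points, radius) + w_var * spacing_variance(points)

even = [(0, 0), (1, 0), (2, 0), (3, 0)]
fragmented = [(0, 0), (1, 0), (5, 0), (5.5, 0)]
cost_even = grid_cost(even, radius=1.5)
cost_fragmented = grid_cost(fragmented, radius=1.5)
```

The evenly spaced representatives form one component with zero spacing variance, while the second set splits into two components with uneven spacing and scores higher. Choosing a grid then amounts to taking the minimum of `grid_cost` over the candidate outputs.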

Figure 2: Example datasets preprocessed by CLA (left) and RCLA (right). RCLA aggressively suppresses spurious isolated cells, resulting in outputs that better reflect the underlying shape.

Experimental Validation

Data and Methods

Experiments are conducted on synthetic datasets comprising combinations of geometric shapes (single and double circles) contaminated with increasing fractions of homogeneous background noise. Multiple denoising approaches are benchmarked: RCLA, CLA, Adaptive DBSCAN, LDOF, and LUNAR.

Figure 3: The first point cloud, consisting of a noisy circle, serves as a canonical test for topological denoising and reduction.

Topological Fidelity

Quantitative evaluation is performed by comparing the $H_1$ bottleneck distances between the PDs of denoised/reduced outputs and the noise-free ground truth.

Figure 4: Visual and persistence diagram comparison among raw data, CLA, and RCLA. The output of RCLA closely matches the topological structure of the underlying clean shape, as evident from the near-perfect matching PDs.

RCLA consistently achieves lower bottleneck distances than all baselines, including CLA on the circle data at noise level $r=0.1$. Increasing the noise ratio degrades all methods, but RCLA maintains the best mean fidelity and markedly lower variance over 20 trials, which indicates its robustness.

Figure 5: Point clouds from each method, selected as those with the smallest and largest bottleneck distance to the noise-free ground truth. RCLA exhibits uniformly high fidelity and stability, while the variance in LDOF and LUNAR outputs is evident.

Shape Classification Robustness

RCLA is applied to 3D animal mesh datasets (camel, elephant, horse) both with and without injected homogeneous Poisson point process (HPPP) noise, across 20 different parameterizations. As part of a topological machine learning pipeline, persistence diagram statistics (44-dimensional feature vectors) are extracted and passed to a linear SVM for classification. RCLA achieves about 94.5% data compression, and classification accuracy remains at or above 99.5% in all scenarios (mean 99.88%).

Figure 6: Visualization of the 3D animal datasets (clean and noisy). RCLA enables effective downstream classification even under high noise, supporting the utility of topological denoising in practical analysis pipelines.

Implications and Future Directions

The RCLA framework demonstrates that grid-based, topology-aware denoising with formal stability control is achievable in practice, integrating naturally with persistent homology workflows and persistence-based machine learning pipelines. The theoretical guarantees provide practitioners with quantitative assurances regarding the preservation of topological signal, and the automatic parameter selection sidesteps the need for manual, data-specific tuning prevalent in earlier works.

Potential avenues for extension include non-uniform grid schemes adaptive to local data density, coupling RCLA-style reduction with learned feature representations or hierarchical multi-scale analysis for improved performance on strongly non-uniform or manifold-based datasets, and further generalization of the stability analysis to dependent or structured noise models.

Conclusion

RCLA establishes a principled paradigm for simultaneous data reduction and denoising in topological data analysis, with explicit statistical and topological guarantees and strong empirical performance across noise regimes. The method fills a critical methodological gap by unifying noise suppression and structure preservation, enabling scalable and reliable topological computation for large, noisy datasets. Future research may further enhance adaptivity and computational efficiency, potentially broadening the class of applicable data sources and downstream tasks.
