Sparse-TDA: Low-Rank Selection via Pivoted QR
- The paper introduces a novel subset selection algorithm using pivoted QR and sRRQR, achieving rigorous error bounds in CUR matrix approximations.
- It leverages deterministic and randomized methods with sketching techniques to reduce computational costs in processing large and sparse matrices.
- The approach enables accurate feature extraction for persistent homology and TDA, providing scalable and interpretable solutions for high-dimensional data.
Sparse-TDA is a suite of algorithms and analysis techniques centering on the selection of low-rank subsets via pivoted or strong rank-revealing QR (sRRQR) factorizations, developed to address both computational efficiency and interpretability in large-scale topological data analysis (TDA) and matrix approximation. The core methodology involves identifying a small subset of rows and columns that capture the dominant topological or algebraic variability in a matrix, under explicit low-rank, incoherence, and sparsity structural assumptions. Sparse-TDA leverages both deterministic and randomized QR-based pivoting for subset selection, providing rigorous error bounds for recovery and approximation—in particular, in the context of persistence-feature matrix sampling and CUR (column–row) low-rank approximations relevant to TDA and machine learning tasks.
1. Structural Foundations and Problem Statement
Sparse-TDA is formulated for a matrix $A \in \mathbb{R}^{m \times n}$ assumed to have target rank $k \ll \min(m,n)$. The main goal is to select index sets $I$ and $J$, typically of size $O(k)$, such that a CUR decomposition $A \approx CUR$ (with $C = A(:,J)$, $R = A(I,:)$, $U = A(I,J)^{\dagger}$) achieves small spectral-norm error $\|A - CUR\|_2$. The algorithmic focus is on reducing the number of entries of $A$ accessed (sublinear or streaming regime), crucial when $A$ is very large or implicitly defined, as in simplex boundary matrices for persistent homology or persistence image feature matrices.
The underlying structural assumptions are that $A = W B H^{\top} + E$, where $W$ and $H$ admit low coherence and/or column sparsity, $B$ is block-diagonal, and $E$ is a small perturbation. These assumptions are tailored to typical scenarios in TDA, where persistent cycles or holes are supported on sparse subsets and possess near-orthogonality in an appropriate embedding (Cortinovis et al., 21 Feb 2024, Guo et al., 2017).
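Under these assumptions, once index sets have been chosen, assembling the CUR approximation is mechanical. The following minimal sketch (illustrative names, not code from the cited papers) forms $C$, $U$, $R$ with the standard pseudoinverse core and checks the spectral-norm error on a synthetic low-rank-plus-noise matrix; the uniform index sampling here is only a placeholder for the QR-based selection described in the following sections.

```python
import numpy as np

def build_cur(A, row_idx, col_idx):
    """Assemble a CUR approximation from chosen index sets:
    C = A[:, J], R = A[I, :], U = pinv(A[I, J]) (standard core choice)."""
    C = A[:, col_idx]
    R = A[row_idx, :]
    U = np.linalg.pinv(A[np.ix_(row_idx, col_idx)])
    return C, U, R

# Synthetic matrix matching the structural assumptions: rank-k plus a small perturbation.
rng = np.random.default_rng(0)
m, n, k = 300, 200, 5
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A += 1e-8 * rng.standard_normal((m, n))

# Uniform sampling as a stand-in for pivoted-QR selection (Sections 2-3).
rows = rng.choice(m, size=2 * k, replace=False)
cols = rng.choice(n, size=2 * k, replace=False)
C, U, R = build_cur(A, rows, cols)
print(np.linalg.norm(A - C @ U @ R, 2))   # small spectral-norm error
```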
2. Pivoted QR and Strong Rank-Revealing QR
Pivoted QR (QR with column or row pivoting, QRCP) is central to Sparse-TDA, as it greedily selects representative features maximizing residual variance. Strong rank-revealing QR (sRRQR) extends classical QRCP by enforcing additional block structure and interlacing conditions that guarantee the leading pivots not only span a large subspace, but also have well-controlled spectral approximation properties (Gu–Eisenstat's criteria). For $A \in \mathbb{R}^{m \times n}$ and rank $k$, sRRQR produces a factorization

$$A\,\Pi = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix},$$

with $\Pi$ a permutation, $Q$ orthogonal, such that $\sigma_{\min}(R_{11}) \geq \sigma_k(A)/q(k,n)$ and $\|R_{22}\|_2 \leq q(k,n)\,\sigma_{k+1}(A)$, for a factor $q(k,n)$ bounded polynomially in $k$ and $n$ for a fixed sRRQR parameter $f$.
In topological settings, pivoted QR is often applied to persistence image matrices, selecting a pixel (row) subset whose restricted features retain near-optimal classification power or signal reconstruction, and to boundary/chain matrices in persistent homology, extracting generator cycles (Guo et al., 2017, Cortinovis et al., 21 Feb 2024, Duersch et al., 2015, Grigori et al., 24 Mar 2025).
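As a concrete illustration, the sketch below uses SciPy's QR with column pivoting (classical QRCP, standing in for sRRQR, whose Gu–Eisenstat variant is not provided by SciPy) to select $k$ columns of a synthetic matrix and compares their projection error with the best rank-$k$ SVD error; all names and sizes are illustrative.

```python
import numpy as np
from scipy.linalg import qr

def qrcp_select(A, k):
    """Greedy column subset selection via QR with column pivoting (QRCP):
    return the indices of the first k pivot columns."""
    _, _, piv = qr(A, mode='economic', pivoting=True)
    return piv[:k]

rng = np.random.default_rng(1)
m, n, k = 400, 250, 10
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 1e-6 * rng.standard_normal((m, n))

cols = qrcp_select(A, k)
C = A[:, cols]
proj_err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 2)   # ||A - C C^+ A||_2
svd_err = np.linalg.svd(A, compute_uv=False)[k]               # sigma_{k+1}(A)
print(proj_err, svd_err)   # proj_err stays within a modest factor of svd_err
```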
3. Randomized and Sublinear Algorithms
Randomization accelerates subset selection in high dimensions by first sketching the matrix with a random projection, followed by deterministic QR-based pivoting on the reduced matrix. Key procedures include:
- Randomized Subspace Embedding: Form a sketch $Y = \Omega A$, where $\Omega \in \mathbb{R}^{\ell \times m}$ is an OSE (e.g., Gaussian, SRHT, CountSketch), with $\ell = O(k/\varepsilon^2)$ (up to logarithmic factors for structured embeddings) sufficing for an $\varepsilon$-embedding (Grigori et al., 24 Mar 2025, Duersch et al., 2020, Fakih et al., 3 Sep 2025).
- sRRQR on Sketch: Perform sRRQR (pivoted QR with strong stopping criterion) on $Y$ to select $k$ (or slightly more, with oversampling) columns, induce the resulting permutation on $A$ itself, and, if needed, pull back to the original column set via the sparsity structure of $\Omega$ (Grigori et al., 24 Mar 2025, Fakih et al., 3 Sep 2025).
- Symmetric Row–Column Selection: Apply the above procedure to $A^{\top}$ for row selection, yielding index sets for both CUR columns and rows (Cortinovis et al., 21 Feb 2024).
Randomized QR-based subset selection substantially reduces computational cost relative to direct QRCP on $A$, and requires only a single pass or a small number of passes over $A$. For massive or streaming TDA matrices, sketching via sparse random projections ensures that only the nonzero entries of $A$ need be accessed (Duersch et al., 2020, Grigori et al., 24 Mar 2025).
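A minimal sketch-then-pivot column selection is shown below, with a Gaussian embedding standing in for a sparse OSE and QRCP standing in for sRRQR; the function name and parameters are illustrative.

```python
import numpy as np
from scipy.linalg import qr

def sketched_column_select(A, k, oversample=5, seed=0):
    """Column subset selection on a random sketch: compress the row dimension
    to l = k + oversample, then run pivoted QR only on the small l x n sketch."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    l = k + oversample
    Omega = rng.standard_normal((l, m)) / np.sqrt(l)    # Gaussian embedding (OSE proxy)
    Y = Omega @ A                                       # l x n sketch: one pass over A
    _, _, piv = qr(Y, mode='economic', pivoting=True)   # pivoting touches only the sketch
    return piv[:k]                                      # pull back to columns of A

rng = np.random.default_rng(2)
m, n, k = 2000, 500, 8
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

cols = sketched_column_select(A, k)
C = A[:, cols]
print(np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 2))   # near zero for exact rank k
```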
4. Theoretical Guarantees and Approximation Error
Sparse-TDA provides spectral and low-rank approximation bounds matching classical RRQR, modulated by embedding distortion, structural coherence, and sparsity constants.
CUR Approximation Error (Noiseless and Noisy Cases)
For exact rank-$k$ matrices, with suitable sampling parameters, the selected index sets yield exact recovery,

$$\|A - CUR\|_2 = 0,$$

with failure probability decaying exponentially in the oversampling parameter (Cortinovis et al., 21 Feb 2024).
For a perturbed matrix $\tilde{A} = A + E$, with $\|E\|_2 \leq \epsilon$, the column-subset selection satisfies

$$\|\tilde{A} - C C^{\dagger} \tilde{A}\|_2 \leq c_1\,\epsilon,$$

with $c_1$ prescribed by lower bounds on local singular values and by the sample sizes. The full CUR error (after symmetric row selection) is bounded by

$$\|\tilde{A} - CUR\|_2 \leq c_2\,\epsilon,$$

with total failure probability $\delta$ under incoherence/sparsity assumptions (Cortinovis et al., 21 Feb 2024).
Pivoted QR Feature Selection
For a persistence-feature matrix $X \in \mathbb{R}^{d \times n}$ with $d$ pixels and $n$ samples, QRCP on $X^{\top}$ (row-pivoted QR) yields $k$ pivot rows, indexed by $I$, so that the spectral-norm projection error

$$\|X - X\,X_I^{\dagger} X_I\|_2 \leq q(k,d)\,\sigma_{k+1}(X)$$

approaches the best rank-$k$ SVD error $\sigma_{k+1}(X)$. In practice, the factor $q(k,d)$ is often close to unity due to strong decay of singular values (Guo et al., 2017, Duersch et al., 2015).
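A small numerical check of this behavior, on a synthetic matrix with rapidly decaying singular values standing in for a persistence-image feature matrix (all sizes and names illustrative):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(4)
d, n, k = 900, 120, 15                        # d pixels, n samples, target rank k
U, _ = np.linalg.qr(rng.standard_normal((d, k + 1)))
V, _ = np.linalg.qr(rng.standard_normal((n, k + 1)))
s = 2.0 ** -np.arange(k + 1)                  # strong singular-value decay
X = (U * s) @ V.T                             # synthetic persistence-image matrix

# Row (pixel) selection = column-pivoted QR on X^T.
_, _, piv = qr(X.T, mode='economic', pivoting=True)
X_I = X[piv[:k], :]

proj_err = np.linalg.norm(X - X @ np.linalg.pinv(X_I) @ X_I, 2)  # ||X - X X_I^+ X_I||_2
svd_err = np.linalg.svd(X, compute_uv=False)[k]                   # sigma_{k+1}(X)
print(proj_err / svd_err)   # typically a modest factor, near 1 under strong decay
```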
Randomized Guarantees
If $\Omega$ is an $\varepsilon$-embedding and $f$ is the sRRQR parameter, the resulting error bounds take the form

$$\|A - C C^{\dagger} A\|_2 \leq \sqrt{\tfrac{1+\varepsilon}{1-\varepsilon}}\;q(k,n;f)\,\sigma_{k+1}(A),$$

and the selected subset achieves spectral compression comparable to the best possible $k$-column subset (Grigori et al., 24 Mar 2025, Fakih et al., 3 Sep 2025).
5. Algorithms and Computational Complexity
The general Sparse-TDA selection pipeline comprises:
- Random Sampling: Select a subset of rows/columns (for row/column subset selection) either uniformly or by leverage-score estimates.
- Sketching: Apply a sparse subspace embedding to obtain a compressed representation.
- sRRQR-based Selection: Apply (deterministic) sRRQR on the sketch.
- Pullback: Identify the corresponding columns/rows in $A$.
- (Optional) Iterative Refinement: Alternate column and row selection for improved approximation.
A complexity summary follows; a pseudocode instantiation of the pipeline is sketched after the table.
| Step | Complexity | Notes |
|---|---|---|
| Sketching | $O(\mathrm{nnz}(A))$ or $O(mn\log\ell)$ | Sparse OSE or SRHT |
| sRRQR | $O(\ell n k)$ | Sketch of size $\ell \times n$ |
| Unpivoted QR | $O(mk^2)$ | On the selected columns $A(:,J)$ |
| Final update | $O(mnk)$ | Optional |
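The following runnable Python sketch instantiates the pipeline above, with QRCP standing in for sRRQR and a Gaussian map standing in for a sparse OSE; the function name, parameters, and the alternating refinement scheme are illustrative choices, not taken from the cited papers.

```python
import numpy as np
from scipy.linalg import qr

def sketched_cur(A, k, oversample=5, n_refine=1, seed=0):
    """Sketch -> pivoted QR -> pullback -> (optional) alternating refinement."""
    rng = np.random.default_rng(seed)
    l = k + oversample

    def sketch_and_pivot(M):
        Y = rng.standard_normal((l, M.shape[0])) @ M        # compress row dimension
        _, _, piv = qr(Y, mode='economic', pivoting=True)   # pivot on the sketch only
        return np.sort(piv[:k])                             # pull back to indices of M

    cols = sketch_and_pivot(A)          # column selection
    rows = sketch_and_pivot(A.T)        # symmetric row selection
    for _ in range(n_refine):           # one plausible alternating refinement scheme
        _, _, p = qr(A[rows, :], mode='economic', pivoting=True)
        cols = np.sort(p[:k])           # re-select columns given current rows
        _, _, p = qr(A[:, cols].T, mode='economic', pivoting=True)
        rows = np.sort(p[:k])           # re-select rows given current columns

    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(A[np.ix_(rows, cols)])   # CUR core
    return C, U, R

# Synthetic test: rank-12 matrix with geometric singular-value decay.
rng = np.random.default_rng(1)
A = (rng.standard_normal((1500, 12)) * 2.0 ** -np.arange(12)) @ rng.standard_normal((12, 800))
C, U, R = sketched_cur(A, k=12)
print(np.linalg.norm(A - C @ U @ R, 2))   # close to zero for exact rank 12
```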
Iterative refinement may involve 2–4 passes; each pass further reduces the CUR error in practice (Cortinovis et al., 21 Feb 2024, Guo et al., 2017). Choices of sample size $\ell$ and oversampling (e.g., an oversampling parameter of up to 10) depend on incoherence and the desired failure probability.
For dense persistence-image feature matrices, Sparse-TDA's QR-based subset selection can be preceded by truncated SVD to determine numerical rank and accelerate subsequent QRCP (Guo et al., 2017). For large or sparse matrices, fully randomized or two-stage sketch/sRRQR algorithms (e.g., SE-QRCS, RQRCP) provide sample complexity and communication advantages (Fakih et al., 3 Sep 2025, Duersch et al., 2015, Duersch et al., 2020).
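A hedged sketch of the two-stage recipe for dense feature matrices is given below (a full SVD stands in for the truncated SVD, and the tolerance and names are illustrative):

```python
import numpy as np
from scipy.linalg import qr

def rank_then_qrcp(X, tol=1e-6):
    """Estimate the numerical rank from the singular values, then run QRCP on X^T
    to pick that many pixel rows."""
    s = np.linalg.svd(X, compute_uv=False)   # a truncated SVD would suffice here
    k = int(np.sum(s > tol * s[0]))          # numerical-rank estimate
    _, _, piv = qr(X.T, mode='economic', pivoting=True)
    return piv[:k], k
```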
6. Practical Implementation in TDA and Empirical Results
Sparse-TDA serves key roles in modern TDA feature pipelines:
- Persistent Homology Landmark Selection: For a boundary matrix $\partial$, columns selected by pivoted QR/sRRQR correspond to persistent generator cycles. By restricting to columns with maximal residual variance, the selected features span the principal homological classes, controllably approximating Betti numbers and persistent cohomology with small algebraic error (Grigori et al., 24 Mar 2025).
- Persistence Image Feature Dimensionality Reduction: Pivoted QR on the matrix of persistence images retains a highly informative sparse set of pixels, reducing classifier input dimension and dramatically accelerating training without significant loss in accuracy. Empirical results show that Sparse-TDA achieves accuracy competitive with kernel-based TDA at a fraction of the computational cost and often improves over $L_1$-penalized SVM feature selection (Guo et al., 2017).
| Method | SHREC Syn | SHREC Real | Outex Texture |
|---|---|---|---|
| L1-SVM (linear) | 89.6% | 63.9% | 55.1% |
| Sparse-TDA (linear) | 91.5% | 68.8% | 62.6% |
| Kernel TDA | 97.8% | 65.3% | 69.2% |
Sparse-TDA also reduces training time by up to 45× compared to traditional feature pipelines, enabling real-time multi-way classification for topological features (Guo et al., 2017).
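To illustrate how the pixel-selection step feeds a linear classifier, the sketch below pairs QRCP-selected pixels with scikit-learn's LinearSVC; the data are synthetic placeholders, not the benchmark datasets above.

```python
import numpy as np
from scipy.linalg import qr
from sklearn.svm import LinearSVC

def select_pixels(X, n_pixels):
    """Pick the n_pixels most informative persistence-image pixels
    (rows of the d x n feature matrix) via column-pivoted QR on X^T."""
    _, _, piv = qr(X.T, mode='economic', pivoting=True)
    return piv[:n_pixels]

# Placeholder feature matrix (d pixels x n samples) with a crude class signal.
rng = np.random.default_rng(0)
d, n = 400, 150
y = rng.integers(0, 3, size=n)
X = rng.random((d, n)) + 0.5 * y            # purely illustrative data

pix = select_pixels(X, n_pixels=30)
clf = LinearSVC().fit(X[pix, :].T, y)       # samples x selected pixels
print(clf.score(X[pix, :].T, y))
```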
For extremely sparse or massive complexes, embedding via CountSketch or a sparse Johnson–Lindenstrauss transform and then applying sRRQR on the sketch makes the pivot-selection phase streaming-efficient; only the columns of the sketch and the selected subset of the original matrix need to be materialized (Duersch et al., 2020, Grigori et al., 24 Mar 2025, Fakih et al., 3 Sep 2025).
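A minimal CountSketch construction with SciPy sparse matrices is sketched below; it shows that forming the sketch visits each stored entry of a sparse matrix once before the small dense pivoting step. Sizes and names are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import qr

def countsketch(m, l, rng):
    """CountSketch operator S (l x m): one random +-1 entry per column,
    so S @ A visits each stored nonzero of a sparse A exactly once."""
    rows = rng.integers(0, l, size=m)
    signs = rng.choice([-1.0, 1.0], size=m)
    return sp.csr_matrix((signs, (rows, np.arange(m))), shape=(l, m))

rng = np.random.default_rng(0)
m, n, k = 100_000, 5_000, 20
A = sp.random(m, n, density=1e-4, random_state=0, format='csr')

S = countsketch(m, l=4 * k, rng=rng)
Y = (S @ A).toarray()                               # small dense sketch (l x n)
_, _, piv = qr(Y, mode='economic', pivoting=True)   # pivot selection on the sketch
selected = A[:, piv[:k]]                            # only these columns are materialized
```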
7. Extensions, Limitations, and Parameter Choices
Sparse-TDA methods can be adapted to general rectangular matrices (including those that are neither square nor particularly wide) via two-stage embeddings or by pre-sketching both row and column spaces (Fakih et al., 3 Sep 2025). The theoretical guarantees are robust under modest oversampling and unknown coherence/sparsity; oversampling by a factor of 2–3 is advised if these parameters are not precisely known. All core algorithms are backward-stable and avoid common pivoting pathologies thanks to sRRQR's interlacing and block-norm controls (Cortinovis et al., 21 Feb 2024, Duersch et al., 2015).
A plausible implication is that, for large-scale persistent homology or high-dimensional classification tasks, Sparse-TDA provides a computationally tractable, theoretically justified, and interpretable mechanism for extracting the most topologically and statistically significant features via minimal, randomly guided, and QR-refined sampling.
Sparse-TDA's foundational approach is now at the core of scalable topological feature extraction, CUR-based surrogate modeling, and computational geometric learning in data-driven and high-throughput TDA contexts. Its guarantees represent the state of the art among randomized low-rank subset selection methods, especially in settings combining theoretical rigor and practical speed (Cortinovis et al., 21 Feb 2024, Grigori et al., 24 Mar 2025, Fakih et al., 3 Sep 2025, Duersch et al., 2020, Guo et al., 2017).