
Sparse-TDA: Low-Rank Selection via Pivoted QR

Updated 16 December 2025
  • The paper introduces a novel subset selection algorithm using pivoted QR and sRRQR, achieving rigorous error bounds in CUR matrix approximations.
  • It leverages deterministic and randomized methods with sketching techniques to reduce computational costs in processing large and sparse matrices.
  • The approach enables accurate feature extraction for persistent homology and TDA, providing scalable and interpretable solutions for high-dimensional data.

Sparse-TDA is a suite of algorithms and analysis techniques centering on the selection of low-rank subsets via pivoted or strong rank-revealing QR (sRRQR) factorizations, developed to address both computational efficiency and interpretability in large-scale topological data analysis (TDA) and matrix approximation. The core methodology involves identifying a small subset of rows and columns that capture the dominant topological or algebraic variability in a matrix, under explicit low-rank, incoherence, and sparsity structural assumptions. Sparse-TDA leverages both deterministic and randomized QR-based pivoting for subset selection, providing rigorous error bounds for recovery and approximation—in particular, in the context of persistence-feature matrix sampling and CUR (column–row) low-rank approximations relevant to TDA and machine learning tasks.

1. Structural Foundations and Problem Statement

Sparse-TDA is formulated for a matrix $A \in \mathbb{R}^{n \times m}$ assumed to have target rank $k \ll \min\{n, m\}$. The main goal is to select index sets $I \subset \{1,\ldots,n\}$ and $J \subset \{1,\ldots,m\}$, typically of size $\ell \gtrsim k$, such that a CUR decomposition $A \approx C U R$ (with $C = A(:, J)$, $R = A(I, :)$, $U = A(I, J)^\dagger$) achieves small spectral-norm error $\|A - C U R\|_2$. The algorithmic focus is on reducing the number of entries of $A$ accessed (sublinear or streaming regime), which is crucial when $A$ is very large or implicitly defined, as in simplex boundary matrices for persistent homology or persistence image feature matrices.
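As a minimal illustration of the target decomposition (with the index sets simply given here, not selected by the algorithms discussed below), a NumPy sketch of assembling $C$, $U$, $R$ and measuring the spectral-norm error:

```python
import numpy as np

def cur_from_indices(A, row_idx, col_idx):
    """Assemble a CUR approximation A ~ C U R from given row/column index sets."""
    C = A[:, col_idx]                                   # C = A(:, J)
    R = A[row_idx, :]                                   # R = A(I, :)
    U = np.linalg.pinv(A[np.ix_(row_idx, col_idx)])     # U = A(I, J)^+
    return C, U, R

# Toy example: an exactly rank-3 matrix; any 4 generic rows/columns then recover A.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 150))
C, U, R = cur_from_indices(A, row_idx=[0, 1, 2, 3], col_idx=[0, 1, 2, 3])
print(np.linalg.norm(A - C @ U @ R, 2))   # spectral-norm error, ~machine precision here
```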

The underlying structural assumptions are that $A = X Z Y^\top + E$, where $X \in \mathbb{R}^{n \times k}$ and $Y \in \mathbb{R}^{m \times k}$ admit low coherence and/or column sparsity, $Z$ is block-diagonal, and $E$ is a small perturbation. These assumptions are tailored to typical scenarios in TDA, where persistent cycles or holes are supported on sparse subsets and possess near-orthogonality in an appropriate embedding (Cortinovis et al., 21 Feb 2024, Guo et al., 2017).

2. Pivoted QR and Strong Rank-Revealing QR

Pivoted QR (QR with column or row pivoting, QRCP) is central to Sparse-TDA, as it greedily selects representative features maximizing residual variance. Strong rank-revealing QR (sRRQR) extends classical QRCP by enforcing additional block structure and interlacing conditions that guarantee the leading $k$ pivots not only span a large subspace, but also have well-controlled spectral approximation properties (Gu–Eisenstat's criteria). For $M \in \mathbb{R}^{p \times q}$ and rank $r$, sRRQR produces a factorization

$$M P = Q \begin{pmatrix} M_{11} & M_{12} \\ 0 & M_{22} \end{pmatrix}$$

with $P$ a permutation and $Q$ orthogonal, such that $\sigma_i(M_{11}) \ge \sigma_i(M)/\sqrt{1+\eta r(p-r)}$ and $\|M_{11}^{-1}M_{12}\|_{\max} \le 1$, for a constant $\eta \approx 1$.

In topological settings, pivoted QR is often applied to persistence image matrices, selecting a pixel (row) subset whose restricted features retain near-optimal classification power or signal reconstruction, and to boundary/chain matrices in persistent homology, extracting generator cycles (Guo et al., 2017, Cortinovis et al., 21 Feb 2024, Duersch et al., 2015, Grigori et al., 24 Mar 2025).
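A minimal column-subset-selection sketch using SciPy's pivoted QR; note that this is plain QRCP rather than full sRRQR (the extra Gu–Eisenstat swap phase is assumed to live in a specialized routine), so the worst-case constants are weaker even though typical behavior is similar:

```python
import numpy as np
from scipy.linalg import qr

def qrcp_column_subset(M, k):
    """Greedy column selection: QR with column pivoting returns the k leading pivot columns."""
    _, _, piv = qr(M, mode='economic', pivoting=True)   # piv is the column permutation
    return piv[:k]

def qrcp_row_subset(M, k):
    """Row (e.g., pixel) selection is the same call applied to the transpose."""
    return qrcp_column_subset(M.T, k)
```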

3. Randomized and Sublinear Algorithms

Randomization accelerates subset selection in high dimensions by first sketching the matrix with a random projection, followed by deterministic QR-based pivoting on the reduced matrix. Key procedures include:

  • Randomized Subspace Embedding: Form a sketch $A^{\mathrm{sk}} = \Omega A$, where $\Omega$ is an OSE (e.g., Gaussian, SRHT, CountSketch), with $d = O(\varepsilon^{-2}(k + \log(1/\delta)))$ sufficing for an $(\varepsilon, \delta, k)$-embedding (Grigori et al., 24 Mar 2025, Duersch et al., 2020, Fakih et al., 3 Sep 2025).
  • sRRQR on Sketch: Perform sRRQR (pivoted QR with a strong stopping criterion) on $A^{\mathrm{sk}}$ to select $k$ or $k'$ columns, induce the permutation on $A$ itself, and, if needed, pull back to the original column set via the sparsity structure of $\Omega$ (Grigori et al., 24 Mar 2025, Fakih et al., 3 Sep 2025).
  • Symmetric Row–Column Selection: Apply the above procedure to $A^\top$ for row selection, yielding index sets for both CUR columns and rows (Cortinovis et al., 21 Feb 2024).

Randomized QR-based subset selection reduces the computational cost relative to direct QRCP by a factor of roughly $n/m$, or to $O(\mathrm{nnz}(A)\cdot k)$ for sparse inputs, and requires only a single pass or a small number of passes over $A$. For massive or streaming TDA matrices, sketching via sparse random projections ensures that only $O(k \log n)$ nonzero entries of $A$ need be accessed (Duersch et al., 2020, Grigori et al., 24 Mar 2025).
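A hedged sketch of the sketch-then-pivot pattern above, with a dense Gaussian embedding standing in for the structured/sparse OSEs cited (the sketch size $d = k + 10$ is an illustrative choice rather than a prescribed parameter):

```python
import numpy as np
from scipy.linalg import qr

def randomized_column_subset(A, k, oversample=10, seed=0):
    """Select k columns of A by running QRCP on a small random sketch Omega @ A."""
    rng = np.random.default_rng(seed)
    d = k + oversample                                   # sketch size d >= k
    Omega = rng.standard_normal((d, A.shape[0])) / np.sqrt(d)
    A_sk = Omega @ A                                     # d x m sketch: cheap to pivot on
    _, _, piv = qr(A_sk, mode='economic', pivoting=True)
    return piv[:k]                                       # pivots pull back directly to columns of A

# Symmetric row selection: apply the same routine to A.T.
```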

4. Theoretical Guarantees and Approximation Error

Sparse-TDA provides spectral and low-rank approximation bounds matching classical RRQR, modulated by embedding distortion, structural coherence, and sparsity constants.

CUR Approximation Error (Noiseless and Noisy Cases)

For exact-rank $A$, with suitable sampling parameters,

$$A = A(:, J)\, A(:, J)^\dagger A = A(:, J)\, A(:, J)^\dagger A\, A(I, :)^\dagger A(I, :)$$

with failure probability decaying exponentially in the oversampling parameter $\alpha$ (Cortinovis et al., 21 Feb 2024).

For perturbed $A = X Z Y^\top + E$ with $\|E\| \leq \epsilon$, the column-subset selection satisfies

$$\|A - A(:, J)A(:, J)^\dagger A\| \leq \epsilon \left(1 + \sqrt{2}\,\Delta_A^{-1} \Delta_C^{-1} \sqrt{1+\Delta_A^2+\Delta_C^2} \right)$$

with $\Delta_A^{-1}, \Delta_C^{-1}$ prescribed by lower bounds on local singular values and sample sizes. The full CUR error (after symmetric row selection) is bounded by

$$\|A - C U R\| = O\!\left(\epsilon \sqrt{\frac{n m k}{\ell}}\,\frac{\sigma_1}{\sigma_k}\,(\|X^\dagger\|+\|Y^\dagger\|)\right)$$

with total failure probability $O(k e^{-\alpha})$ under incoherence/sparsity assumptions (Cortinovis et al., 21 Feb 2024).

Pivoted QR Feature Selection

For a persistence-feature matrix $F$ with $m$ pixels and $n$ samples, QRCP on $F^\top$ (row-pivoted QR) yields pivot rows $S = \{p_1,\ldots,p_k\}$ so that the spectral-norm projection error

$$\|F - F\,F(S,:)^{\dagger} F(S,:)\|_2 \leq \sqrt{1 + k(n-k)}\,\sigma_{k+1}(F)$$

approaches the best rank-$k$ SVD error. In practice, the factor is often close to unity due to strong decay of the singular values (Guo et al., 2017, Duersch et al., 2015).
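A small numerical check of this style of bound on a synthetic low-rank-plus-noise feature matrix (sizes and noise level are illustrative, and plain QRCP is used in place of sRRQR):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(1)
m_pixels, n_samples, k = 400, 300, 10
# Synthetic "persistence image" feature matrix (pixels x samples): rank k plus small noise.
F = rng.standard_normal((m_pixels, k)) @ rng.standard_normal((k, n_samples))
F += 1e-6 * rng.standard_normal(F.shape)

# Row-pivoted QR: column pivots of F^T are pivot rows (pixels) of F.
_, _, piv = qr(F.T, mode='economic', pivoting=True)
S = piv[:k]

FS = F[S, :]
err = np.linalg.norm(F - F @ np.linalg.pinv(FS) @ FS, 2)   # projection error onto selected rows
tail = np.linalg.svd(F, compute_uv=False)[k]               # sigma_{k+1}(F)
print(err, tail)   # err is at least tail, and typically only modestly larger
```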

Randomized Guarantees

If $\Omega$ is an $\varepsilon$-embedding and $f$ is the sRRQR parameter, the resulting error bounds are

$$\|M - M(:, J)X\|_2 \leq \sigma_{k+1}(M) \sqrt{1 + \tfrac{1+\varepsilon}{1-\varepsilon}f^2 k(n-k)}$$

and the selected subset $J$ achieves spectrum compression comparable to the best possible $k$-column subset (Grigori et al., 24 Mar 2025, Fakih et al., 3 Sep 2025).

5. Algorithms and Computational Complexity

The general Sparse-TDA selection pipeline comprises:

  1. Random Sampling: Select a subset of rows/columns (for row/column subset selection) either uniformly or by leverage-score estimates.
  2. Sketching: Apply a sparse subspace embedding to obtain a compressed representation.
  3. sRRQR-based Selection: Apply (deterministic) sRRQR on the sketch.
  4. Pullback: Identify the corresponding columns/rows in AA.
  5. (Optional) Iterative Refinement: Alternate column and row selection for improved approximation.

Pseudocode instantiations and complexity summaries:

| Step | Complexity | Notes |
|---|---|---|
| Sketching | $O(\mathrm{nnz}(A) \cdot k)$ or $O(m n \log m)$ | Sparse OSE |
| sRRQR | $O(d n k + t_f n (d+n))$ | Sketch of size $d$ |
| Unpivoted QR | $O(2 m n k)$ | On an $m \times n$ matrix |
| Final update | $O(m n k)$ | Optional |
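A compact, self-contained sketch of steps 1–4 for both columns and rows, assembled into a CUR factorization (a Gaussian sketch again stands in for the sparse OSE, and the oversampling choice is an illustrative assumption):

```python
import numpy as np
from scipy.linalg import qr

def sketched_pivots(A, k, d, rng):
    """Sketch A, run pivoted QR on the sketch, and pull the pivots back as column indices of A."""
    Omega = rng.standard_normal((d, A.shape[0])) / np.sqrt(d)
    _, _, piv = qr(Omega @ A, mode='economic', pivoting=True)
    return piv[:k]

def sparse_tda_cur(A, k, oversample=10, seed=0):
    """Column pass, symmetric row pass on A^T, then CUR assembly."""
    rng = np.random.default_rng(seed)
    d = k + oversample
    J = sketched_pivots(A, k, d, rng)        # column subset
    I = sketched_pivots(A.T, k, d, rng)      # row subset
    C, R = A[:, J], A[I, :]
    U = np.linalg.pinv(A[np.ix_(I, J)])
    return C, U, R, I, J
```

A refinement pass (step 5) would then re-select rows deterministically against $A(:, J)$ and columns against $A(I, :)$.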

Iterative refinement may involve 2–4 passes; each pass further reduces the CUR error in practice (Cortinovis et al., 21 Feb 2024, Guo et al., 2017). Choices of sample size and oversampling (e.g., $\ell_0, \ell_a, \ell_b = \Theta(\mu k)$, $\alpha = 5$–$10$) depend on the incoherence and the desired failure probability.

For dense persistence-image feature matrices, Sparse-TDA's QR-based subset selection can be preceded by truncated SVD to determine numerical rank and accelerate subsequent QRCP (Guo et al., 2017). For large or sparse matrices, fully randomized or two-stage sketch/sRRQR algorithms (e.g., SE-QRCS, RQRCP) provide sample complexity and communication advantages (Fakih et al., 3 Sep 2025, Duersch et al., 2015, Duersch et al., 2020).
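For the dense persistence-image case, a minimal sketch of the rank-estimate-then-pivot pattern (the relative tolerance and the use of a full SVD here are simplifying assumptions; at scale a truncated or randomized SVD would be substituted):

```python
import numpy as np
from scipy.linalg import qr

def rank_then_row_pivots(F, tol=1e-3):
    """Estimate numerical rank from the singular values, then pick that many pivot rows via QRCP."""
    s = np.linalg.svd(F, compute_uv=False)
    k = int(np.sum(s > tol * s[0]))                      # numerical rank at relative tolerance tol
    _, _, piv = qr(F.T, mode='economic', pivoting=True)  # column pivots of F^T = pivot rows of F
    return k, piv[:k]
```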

6. Practical Implementation in TDA and Empirical Results

Sparse-TDA serves key roles in modern TDA feature pipelines:

  • Persistent Homology Landmark Selection: For a boundary matrix $B$, columns selected by pivoted QR/sRRQR correspond to persistent generator cycles. By restricting to the $k$ columns with maximal residual variance, the selected features span the principal homological classes, controllably approximating Betti numbers and persistent cohomology with small algebraic error (Grigori et al., 24 Mar 2025).
  • Persistence Image Feature Dimensionality Reduction: Pivoted QR on the matrix of persistence images retains a highly informative sparse set of pixels, reducing classifier input dimensions and dramatically accelerating training without significant loss in accuracy. Empirical results show that Sparse-TDA achieves accuracy competitive with kernel-based TDA at a fraction of the computational cost and often improves over $\ell_1$-based SVM feature selection (Guo et al., 2017):
| Method | SHREC Syn | SHREC Real | Outex Texture |
|---|---|---|---|
| L1-SVM (linear) | 89.6% | 63.9% | 55.1% |
| Sparse-TDA (linear) | 91.5% | 68.8% | 62.6% |
| Kernel TDA | 97.8% | 65.3% | 69.2% |

Sparse-TDA also reduces training time by up to 45× compared to traditional feature pipelines, enabling real-time multi-way classification for topological features (Guo et al., 2017).
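A hedged end-to-end sketch of this classification use (pixel selection by QRCP followed by a linear SVM from scikit-learn; the feature matrix, labels, and pixel budget below are placeholders, not the datasets from the table above):

```python
import numpy as np
from scipy.linalg import qr
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def select_pixels(F, k):
    """F: (n_samples, n_pixels) flattened persistence images; returns k pivot pixel indices."""
    _, _, piv = qr(F, mode='economic', pivoting=True)   # column pivots = informative pixels
    return piv[:k]

# Placeholder data: substitute real persistence-image features and class labels.
rng = np.random.default_rng(0)
F = rng.random((500, 2500))            # 500 samples, 50x50 persistence images flattened
y = rng.integers(0, 3, size=500)       # 3-way labels (illustrative)

pixels = select_pixels(F, k=100)
X_tr, X_te, y_tr, y_te = train_test_split(F[:, pixels], y, random_state=0)
clf = LinearSVC(dual=False).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))           # near chance on random data; meaningful on real features
```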

For extremely sparse or massive complexes, embedding via CountSketch or sparse JL transforms and then applying sRRQR on the sketch ensures that the pivot-selection phase is streaming-efficient; only the columns of the sketch and the selected subset of the original matrix need to be materialized (Duersch et al., 2020, Grigori et al., 24 Mar 2025, Fakih et al., 3 Sep 2025).
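A minimal CountSketch-style embedding to illustrate the single-pass property (one random signed bucket per row; a simplified stand-in for the sparse embeddings in the cited works):

```python
import numpy as np
from scipy import sparse

def countsketch(A, d, seed=0):
    """Apply a d x n CountSketch map to A (n x m): each row of A is added, with a random
    sign, to exactly one of d buckets, so every nonzero of A is touched once."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    buckets = rng.integers(0, d, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    S = sparse.csr_matrix((signs, (buckets, np.arange(n))), shape=(d, n))
    return S @ A   # works for dense or scipy.sparse A
```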

7. Extensions, Limitations, and Parameter Choices

Sparse-TDA methods can be adapted to general non-square or not particularly wide matrices via two-stage embeddings or by pre-sketching both the row and column spaces (Fakih et al., 3 Sep 2025). The theoretical guarantees are robust under modest oversampling and unknown coherence/sparsity; oversampling by a factor of 2–3 is advised if these parameters are not precisely known. All core algorithms are backward-stable and avoid common pivoting pathologies thanks to sRRQR's interlacing and block-norm controls (Cortinovis et al., 21 Feb 2024, Duersch et al., 2015).

A plausible implication is that, for large-scale persistent homology or high-dimensional classification tasks, Sparse-TDA provides a computationally tractable, theoretically justified, and interpretable mechanism for extracting the most topologically and statistically significant features via minimal, randomly guided, and QR-refined sampling.

Sparse-TDA's foundational approach is now at the core of scalable topological feature extraction, CUR-based surrogate modeling, and computational geometric learning in data-driven and high-throughput TDA contexts. Its guarantees represent the state of the art among randomized low-rank subset selection methods, especially in settings combining theoretical rigor and practical speed (Cortinovis et al., 21 Feb 2024, Grigori et al., 24 Mar 2025, Fakih et al., 3 Sep 2025, Duersch et al., 2020, Guo et al., 2017).
