
Nyström Approximation in Scalable Kernel Methods

Updated 3 April 2026
  • Nyström approximation is a randomized algorithm that constructs low-rank approximations for PSD matrices by sampling a subset of columns.
  • It leverages the pseudoinverse of a selected submatrix to achieve near-linear time and subquadratic memory, making large-scale kernel methods tractable.
  • The method offers strong spectral and Frobenius norm error guarantees, which enhance its applicability in machine learning, numerical analysis, and scientific computing.

The Nyström approximation is a fundamental randomized linear algebra technique for constructing low-rank approximations to positive semidefinite (PSD) matrices, with pivotal applications to kernel methods, scalable statistical estimation, numerical analysis, optimization, and scientific computing. By selecting a subset of columns (and, in some settings, rows), and leveraging the resulting submatrix to form a global surrogate, Nyström methods enable near-linear-time and subquadratic-memory algorithms for problems otherwise dominated by costly matrix decompositions. Ongoing research has established a rich theoretical landscape—spanning spectral-norm guarantees, sharp statistical learning rates, randomized and deterministic sampling schemes, adaptive and ensemble variants, as well as generalizations to infinite-dimensional operators and tensor-structured data.

1. Core Principles and Algorithmic Framework

Let $K \in \mathbb{R}^{n\times n}$ be a PSD matrix (typically, a kernel or Gram matrix). The classical Nyström method proceeds as follows:

  1. Selection of Landmarks: Choose a subset $S \subset [n]$, $|S| = m \ll n$, via random sampling (uniform, leverage score, $k$-means, etc.).
  2. Formation of Submatrices:
    • $C := K_{:,S} \in \mathbb{R}^{n\times m}$
    • $W := K_{S,S} \in \mathbb{R}^{m\times m}$
  3. Low-Rank Approximation:
    • $\widehat{K} := C\,W^\dagger\,C^\top$, where $W^\dagger$ denotes the Moore–Penrose pseudoinverse.

This procedure yields a surrogate of rank at most $m$. It is equivalently viewed as projecting $K$ onto the span of the selected columns and extending via the Schur complement, or as the Gram matrix of the feature map $\Phi = C\,(W^\dagger)^{1/2}$, so that $\widehat{K} = \Phi\,\Phi^\top$.
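The construction above is short enough to state directly in code. The following NumPy sketch (illustrative function and variable names, not from any referenced implementation) forms $C$, $W$, and $\widehat{K} = C W^\dagger C^\top$ with uniformly sampled landmarks.

```python
import numpy as np

def nystrom_approximation(K, m, seed=0):
    """Rank-<=m Nystrom surrogate K_hat = C W^+ C^T of a PSD matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    S = rng.choice(n, size=m, replace=False)   # landmark set S, |S| = m
    C = K[:, S]                                # C = K[:, S]   (n x m)
    W = K[np.ix_(S, S)]                        # W = K[S, S]   (m x m)
    K_hat = C @ np.linalg.pinv(W) @ C.T        # K_hat = C W^+ C^T
    return K_hat, S

# Tiny usage example on an RBF Gram matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                          # PSD kernel matrix
K_hat, _ = nystrom_approximation(K, m=50)
print("spectral error:", np.linalg.norm(K - K_hat, 2))
```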

Extensions include:

  • Fixed-rank truncation: Project $W$ (or $\widehat{K}$ itself) to rank $k$ by EVD/SVD and propagate the truncation to $\widehat{K}$ (a minimal truncation sketch follows this list).
  • Recursive sampling: Multi-stage, leverage-score-driven construction for improved theoretical guarantees (Musco et al., 2016).
  • Ensemble, boosting, and block-average schemes: Reduce variance and exploit block structure (Hamm et al., 2023, Garg et al., 21 Jun 2025).
  • Adaptive/online and continuous optimization variants: Update landmark sets dynamically or via relaxable continuous surrogates (Mathur et al., 2023, Si et al., 2018).
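As a concrete illustration of the fixed-rank truncation above, the following hedged NumPy sketch truncates $W$ to rank $k$ by eigendecomposition before forming the surrogate; the function names and the $10^{-12}$ eigenvalue cutoff are illustrative choices, not taken from a cited implementation.

```python
import numpy as np

def nystrom_fixed_rank(K, m, k, seed=0):
    """Rank-k Nystrom approximation: truncate W = K[S, S] to rank k before use."""
    rng = np.random.default_rng(seed)
    S = rng.choice(K.shape[0], size=m, replace=False)
    C, W = K[:, S], K[np.ix_(S, S)]
    evals, evecs = np.linalg.eigh(W)           # eigenvalues in ascending order
    top = np.argsort(evals)[-k:]               # indices of the k largest eigenvalues
    keep = evals[top] > 1e-12                  # drop numerically null directions
    U = evecs[:, top][:, keep]                 # top eigenvectors of W
    W_k_pinv = (U / evals[top][keep]) @ U.T    # pseudoinverse of the rank-k truncation
    return C @ W_k_pinv @ C.T                  # surrogate of rank <= k
```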

2. Theoretical Guarantees and Sampling Schemes

The approximation error for the Nyström method is characterized in spectral, Frobenius, and nuclear norms. Main results:

  • Spectral bounds:

For $m = O\!\big(d_{\mathrm{eff}}(\lambda)\log(d_{\mathrm{eff}}(\lambda)/\delta)\big)$ columns sampled by their $\lambda$-ridge leverage scores (where $d_{\mathrm{eff}}(\lambda) = \operatorname{tr}\!\big(K(K+\lambda I)^{-1}\big)$ is the effective dimension), $\|K - \widehat{K}\|_2 \le \lambda$ holds with high probability (Musco et al., 2016, Garg et al., 21 Jun 2025). Because $\widehat{K} \preceq K$ for any Nyström approximation, this yields the two-sided spectral sandwich

$$\widehat{K} \;\preceq\; K \;\preceq\; \widehat{K} + \lambda I$$

for $m$ of this order (a brute-force leverage-score sampling sketch follows this list).

  • Relative error for nuclear/Frobenius norm:

Given a target rank $k$ and $O(k/\epsilon)$ adaptively sampled columns (Wang et al., 2013),

$$\mathbb{E}\,\|K - \widehat{K}\|_F \;\le\; (1+\epsilon)\,\|K - K_k\|_F,$$

where $K_k$ is the best rank-$k$ approximation of $K$.

By mixing multiple small Nyström approximations, Block-Nyström achieves uniform spectral-approximation guarantees of this form while reducing cost, which is critical for heavy-tailed spectra (Garg et al., 21 Jun 2025).
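The leverage-score sampling sketch referenced above: a brute-force computation of the $\lambda$-ridge leverage scores $\ell_i(\lambda) = \big(K(K+\lambda I)^{-1}\big)_{ii}$ followed by proportional sampling. This forms and inverts $K + \lambda I$ explicitly and is purely illustrative; the recursive estimator of Musco et al. (2016) avoids that cost.

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    """Exact lambda-ridge leverage scores: the diagonal of K (K + lam I)^{-1}."""
    n = K.shape[0]
    return np.diagonal(np.linalg.solve(K + lam * np.eye(n), K)).copy()

def sample_by_rls(K, lam, m, seed=0):
    """Draw m landmarks with probability proportional to their ridge leverage scores."""
    scores = ridge_leverage_scores(K, lam)   # their sum is the effective dimension d_eff(lam)
    rng = np.random.default_rng(seed)
    return rng.choice(K.shape[0], size=m, replace=False, p=scores / scores.sum())
```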

3. Computational Complexity, Memory, and Implementation

  • Standard Nyström: $O(nm)$ kernel evaluations to form $C$, $O(m^3)$ to (pseudo-)invert $W$, and $O(nm^2)$ to assemble or apply the final approximation.
  • Recursive Leverage/Nyström: roughly $O(nm^2)$ time for recursive leverage-score estimation; approaches near-linear time in $n$ for practical $m$ (Musco et al., 2016).
  • Block-Nyström: cost is incurred block by block, requiring only a few independent inversions of small $m \times m$ landmark blocks rather than a single inversion cubic in the total number of sampled columns (Garg et al., 21 Jun 2025).
  • Low-precision/single-pass variants:

Efficient even with mixed numerical precision (Carson et al., 2022).

  • Memory reduction:

Only $O(nm)$ storage is required for $C$ and $O(m^2)$ for $W$, much less than the $O(n^2)$ needed for the full kernel matrix (a sketch that forms only these blocks directly from data follows this list). Column/row streaming and partitioned sketches further reduce requirements (Homrighausen et al., 2016).
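The memory claims above rest on never materializing the $n \times n$ kernel matrix. A minimal sketch (illustrative names, RBF kernel assumed) that builds only the $n \times m$ and $m \times m$ blocks from the data:

```python
import numpy as np

def rbf_block(A, B, gamma=0.5):
    """RBF kernel block k(A, B) between two sets of points."""
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def nystrom_factors_from_data(X, m, gamma=0.5, seed=0):
    """Form only C (n x m) and W (m x m): O(nm + m^2) memory, never O(n^2)."""
    rng = np.random.default_rng(seed)
    S = rng.choice(X.shape[0], size=m, replace=False)
    Z = X[S]                                   # landmark points
    C = rbf_block(X, Z, gamma)                 # n x m cross block
    W = rbf_block(Z, Z, gamma)                 # m x m landmark block
    return C, W, S
```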

4. Applications Across Scientific and Statistical Domains

  • Kernel learning and SVM/Logistic regression:

Nyström transforms nonlinear kernel methods into tractable linear problems in low-dimensional feature spaces, e.g. SVM and kernel logistic regression (KLR), even at dataset sizes for which the full kernel matrix cannot be formed (Martín-Baos et al., 2024); a minimal end-to-end sketch of this feature-map workflow appears at the end of this section.

  • Kernel PCA and principal subspace estimation:

Directly approximates principal subspaces or KPCA decompositions with statistical guarantees: the Nyström estimator matches full KPCA up to negligible error once the number of landmarks is large enough relative to the effective dimension of the spectrum (Sterge et al., 2021, Homrighausen et al., 2016, Arcolano et al., 2011).

  • Covariance estimation:

Provides shrinkage estimators with explicit bias/MSE formulas, suitable for high-dimensional settings and regularization (Arcolano et al., 2011).

  • Scientific and numerical computing:

Used for preconditioning, Hamiltonian simulation, and approximating high-dimensional operators or tensors (Garg et al., 21 Jun 2025, Rudi et al., 2018, Bucci et al., 2023, Persson et al., 2024).

  • Integration and quadrature in learning theory:

Underpins kernel quadrature via low-rank surrogates, with sharp error bounds in both i.i.d. and non-i.i.d. (e.g., DPP) landmark regimes (Hayakawa et al., 2023).
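The end-to-end feature-map workflow referenced in the kernel-learning item above, sketched with scikit-learn's Nystroem transformer followed by a linear SVM; the synthetic dataset, kernel bandwidth, and number of components are illustrative choices, not taken from the cited papers.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

# Nystrom feature map (m = 300 landmarks) followed by a linear SVM:
# together they approximate a full RBF-kernel SVM at much lower cost.
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.05, n_components=300, random_state=0),
    LinearSVC(),
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```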

5. Variants, Enhancements, and Generalizations

  • Recursive Sampling and Continuous Optimization:

RLS-Nyström and continuous/SGD-formulated selection approximate the optimal combinatorial subset, yielding near-greedy approximation quality at scalable cost (Musco et al., 2016, Mathur et al., 2023).

  • Boosting and Ensemble Approaches:

Sequentially constructed “weak” Nyström approximations, aggregated adaptively, attain lower error and variance than parallel ensemble bagging (Hamm et al., 2023); a simple averaged-ensemble sketch appears at the end of this section.

  • High-accuracy/Hierarchical refinement frameworks:

Progressive, alternating cross/skeleton-based methods achieve near-machine-precision error for a given rank, with fast error estimation and practical heuristics (Xia, 2023).

  • Multilinear/Tensor Structured Nyström:

Extends the method to tensors in Tucker format, enabling single-pass, streaming low-rank approximations for high-order data with robust error guarantees (Bucci et al., 2023).

  • Low-precision, memory, and streaming constraints:

Enables scalable deployment in data regimes and hardware settings otherwise prohibitive; error is controlled by precision and rank (Carson et al., 2022).

  • Infinite-dimensional extensions:

The randomized Nyström method extends, with rigorous error bounds, to non-negative self-adjoint trace-class operators on Hilbert spaces; empirical errors track the operator spectrum, with stability guarantees (Persson et al., 2024, Hayakawa et al., 2023).
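As a concrete reference point for the ensemble item above, the following sketch uniformly averages several independent small Nyström approximations; the boosted scheme of Hamm et al. (2023) instead weights and aggregates the weak approximations adaptively, which this uniform average does not capture.

```python
import numpy as np

def ensemble_nystrom(K, m, n_members=5, seed=0):
    """Uniform average of n_members independent rank-<=m Nystrom approximations of K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    K_bar = np.zeros_like(K, dtype=float)
    for _ in range(n_members):
        S = rng.choice(n, size=m, replace=False)
        C, W = K[:, S], K[np.ix_(S, S)]
        K_bar += C @ np.linalg.pinv(W) @ C.T
    return K_bar / n_members
```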

6. Empirical Benchmarks, Selection Guidelines, and Practical Impact

  • Landmark number ($m$):

Typically, $m$ on the order of the effective dimension (up to logarithmic factors) achieves a prescribed spectral error; far fewer landmarks than $n$ suffice for subspace/statistical matching in PCA/KPCA (Sterge et al., 2021, Homrighausen et al., 2016). QR-decomposition-based Nyström is preferred for fixed-rank (rank-restricted) approximation (Pourkamali-Anaraki et al., 2017).

  • Selection scheme:

Leverage scores yield the smallest sample size for a target error, with recursive computation enabling scalability (Musco et al., 2016, Garg et al., 21 Jun 2025). $k$-means landmark selection outperforms uniform sampling in KLR and SVM (Martín-Baos et al., 2024); a clustered-landmark sketch follows this list.

  • Downstream performance:

Modern Nyström variants, including recursive, block, and boosted versions, consistently match or outperform random-feature projection at the same or lower computational/feature cost; error bounds translate to better classification, regression, and clustering accuracy under resource constraints (Hamm et al., 2023, Mathur et al., 2023).

  • Robustness:

Continuous and adaptive landmark selection is resilient to distributional and spectral structure; block-average strategies are effective for heavy-tailed spectra.
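The clustered-landmark sketch referenced above: $k$-means cluster centers are used as landmarks in place of sampled data points, with a generic `kernel(A, B)` callable assumed to return the corresponding Gram block (for example, `rbf_block` from the sketch in Section 3).

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_landmarks(X, m, seed=0):
    """Choose m landmark points as k-means cluster centers of the data."""
    return KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X).cluster_centers_

def clustered_nystrom_factors(X, m, kernel, seed=0):
    """Nystrom factors built on k-means centers; kernel(A, B) returns a Gram block."""
    Z = kmeans_landmarks(X, m, seed)
    C = kernel(X, Z)                           # n x m cross-kernel block
    W = kernel(Z, Z)                           # m x m landmark block
    return C, W, Z
```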

7. Extensions, Open Problems, and Ongoing Directions

  • Operator and functional approximations:

Nyström methodology for infinite-dimensional settings (e.g., integral, covariance, and spectral operators) is developed with sharp norm bounds in trace, Hilbert–Schmidt, and operator norms (Persson et al., 2024, Hayakawa et al., 2023).

  • Tensor and nonlinear structures:

Multilinear Nyström extends to high-order tensors with strong stability and streaming properties (Bucci et al., 2023).

  • Learning-theoretic and statistical optimality:

Recent works establish minimax optimality of Nyström-based estimators in kernel learning tasks and clarify the sample/approximation trade-offs relative to random features and column-sampling (Sterge et al., 2021, Homrighausen et al., 2016).

  • Adaptive/online learning:

Efficiently tracks changing subspaces and data distribution in streaming or online machine learning settings (Si et al., 2018).

  • Practical implementation questions:

Ongoing research addresses numerical stability under rounding/mixed-precision (Carson et al., 2022), landmark selection (combining greediness and continuous relaxation), and hybrid variants with nonlinear and deep architectures (Giffon et al., 2019).

  • Applications:

Expanding beyond kernel methods, Nyström features are used in neural networks (as trainable kernel layers), matrix completion, scalable quadrature, and scientific simulation problems (Giffon et al., 2019, Rudi et al., 2018, Fu, 2020).


Fundamentally, the Nyström approximation is a versatile and theoretically well-founded algorithmic primitive that enables scalable approximation and inference for a broad class of linear, nonlinear, and operator-valued problems in high dimensions. Its ongoing development at the intersection of theoretical computer science, numerical analysis, and machine learning continues to yield new algorithmic advances and performance guarantees.
