Deterministic Ridge Leverage Score Sampling
- Deterministic Ridge Leverage Score Sampling is a data-dependent column selection method that balances low-rank approximation with regularization while ensuring strong worst-case guarantees.
- It computes ridge leverage scores via SVD and sorts columns to select a subset until a cumulative score threshold is met, offering interpretability and repeatability.
- The method guarantees tight spectral bounds and controlled risk inflation in regression, making it effective for feature selection, data valuation, and kernel approximations.
Deterministic Ridge Leverage Score Sampling is a data-dependent, subset selection technique for linear algebraic and statistical tasks that balances the goals of low-rank approximation and regularization. Unlike randomized variants, the deterministic approach provides interpretability, repeatability, and strong worst-case guarantees, making it particularly attractive for applications in scientific data analysis, regression, matrix approximation, and feature selection.
1. Formal Definition and Structural Properties
Given a matrix $A \in \mathbb{R}^{n \times d}$ with columns $a_1, \dots, a_d$ and a regularization parameter $\lambda \ge 0$, the ridge leverage score for the $i$th column is defined as
$$\tau_i^{\lambda}(A) = a_i^{\top} \left( A A^{\top} + \lambda I_n \right)^{+} a_i,$$
where $(\cdot)^{+}$ denotes the Moore–Penrose pseudoinverse. In terms of the thin SVD $A = U \Sigma V^{\top}$, with singular values $\sigma_1 \ge \sigma_2 \ge \cdots$ and right singular vectors $v_1, v_2, \dots$, this score becomes
$$\tau_i^{\lambda}(A) = \sum_{j} \frac{\sigma_j^2}{\sigma_j^2 + \lambda}\, v_j[i]^2.$$
Ridge leverage scores interpolate between subspace leverage scores (unregularized) and their regularized counterparts, smoothly down-weighting directions associated with small singular values. This stabilization makes ridge leverage scores adaptive for both regularized regression and low-rank matrix approximations, simultaneously capturing the informative structure and dampening the effect of noise or degeneracy (McCurdy, 2018).
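The SVD form of the scores is a few lines of code. The following sketch (function name is illustrative) computes the ridge leverage scores of all columns of a matrix:

```python
import numpy as np

def ridge_leverage_scores(A, lam):
    """Ridge leverage scores of the columns of A.

    Equivalent to a_i^T (A A^T + lam * I)^+ a_i; computed here via the
    thin SVD as sum_j (sigma_j^2 / (sigma_j^2 + lam)) * V[i, j]^2.
    """
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    shrink = s**2 / (s**2 + lam)            # down-weights small singular directions
    return (shrink[:, None] * Vt**2).sum(axis=0)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
tau = ridge_leverage_scores(A, lam=1.0)     # one score per column, each in [0, 1)
```

As the regularization tends to zero the scores recover the classical subspace leverage scores; increasing it shrinks the contribution of directions with small singular values, which is exactly the interpolation described above.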
2. Deterministic Sampling Algorithms
Deterministic ridge leverage score sampling proceeds by ranking columns according to their ridge leverage scores and selecting the subset with maximal cumulative score. The canonical procedure is:
- Compute all ridge leverage scores $\tau_i^{\lambda}(A)$, $i = 1, \dots, d$.
- Sort the column indices so that $\tau_{\pi(1)}^{\lambda} \ge \tau_{\pi(2)}^{\lambda} \ge \cdots \ge \tau_{\pi(d)}^{\lambda}$.
- Initialize an empty selection set $\Theta = \emptyset$ and a partial sum $s = 0$.
- Iteratively add $\pi(t)$ to $\Theta$ and update $s \leftarrow s + \tau_{\pi(t)}^{\lambda}$ until $s \ge T$, with $T = \sum_i \tau_i^{\lambda}(A) - \varepsilon k$ for target rank $k$ and tolerance $\varepsilon$, so that the deselected columns carry at most $\varepsilon k$ of the total score mass.
- If $|\Theta| < k$, continue selecting the largest remaining scores to ensure the subset has at least $k$ columns.
- Construct the sampling matrix $S$ (the columns of the $d \times d$ identity indexed by $\Theta$) and the resulting sketch $C = AS$.
This routine yields an unweighted, deterministic subset of columns, with computational complexity $O(nd \min(n, d))$ for score computation via the SVD and $O(d \log d)$ for sorting (McCurdy, 2018). For kernelized and feature-map settings, the deterministic variant is implemented by sorting data points by (kernel) ridge leverage scores and taking the top set (Schreurs et al., 2021, Chen et al., 2021).
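The selection loop can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the stopping rule is stated here as the assumption that columns are kept until the deselected score mass is at most eps*k, and the regularization is set to the paper's choice of the rank-k tail energy divided by k.

```python
import numpy as np

def drls_select(A, k, eps):
    """Deterministic ridge leverage score column selection (illustrative sketch)."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    lam = (s[k:]**2).sum() / k                       # lam = ||A - A_k||_F^2 / k
    tau = ((s**2 / (s**2 + lam))[:, None] * Vt**2).sum(axis=0)
    order = np.argsort(tau)[::-1]                    # decreasing scores
    total, partial, chosen = tau.sum(), 0.0, []
    for i in order:
        # stop once the deselected mass is <= eps*k, but keep at least k columns
        if partial >= total - eps * k and len(chosen) >= k:
            break
        chosen.append(i)
        partial += tau[i]
    cols = np.sort(np.array(chosen))
    return cols, A[:, cols]                          # unweighted sketch C

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 20))
cols, C = drls_select(A, k=5, eps=0.5)
```

Note that the output is an unweighted subset of the original columns, so the sketch inherits the interpretability of the raw features.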
3. Theoretical Guarantees
Deterministic ridge leverage score sampling provides strong spectral and prediction risk bounds:
- Additive-multiplicative spectral bound: For $C = AS$ the selected column subset,
$$(1 - \varepsilon)\, A A^{\top} - \frac{\varepsilon}{k} \|A - A_k\|_F^2\, I_n \preceq C C^{\top} \preceq A A^{\top},$$
where $\varepsilon$ is the selection tolerance and $A_k$ is the best rank-$k$ approximation of $A$ (McCurdy, 2018).
- Projection-cost preservation: For any orthogonal projector $X$ of rank at most $k$,
$$(1 - \varepsilon)\, \|A - XA\|_F^2 \le \|C - XC\|_F^2 + c \le (1 + \varepsilon)\, \|A - XA\|_F^2$$
with a constant $c \ge 0$ independent of $X$, preserving objectives such as low-rank approximation and $k$-means cost up to $(1 \pm \varepsilon)$ factors (McCurdy, 2018).
- Risk inflation in regression: For ridge regression on the sketch $C = AS$ versus the full data, the statistical risk of the sketched estimator exceeds that of the full estimator by at most a $(1 + O(\varepsilon))$ factor (McCurdy, 2018).
- Sample complexity: When the sorted ridge leverage scores decay as a power law, $\tau_{(i)}^{\lambda} \le c\, i^{-a}$ with decay exponent $a > 1$, the deterministic subset size matches or exceeds the efficiency of randomized sampling, requiring no more columns than the corresponding randomized ridge leverage score sample (McCurdy, 2018).
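One half of the spectral bound holds unconditionally: for any unweighted column subset, C C^T is dominated by A A^T in the positive semidefinite order, because A A^T - C C^T is the sum of the deselected rank-one terms a_i a_i^T. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((25, 40))
subset = rng.choice(40, size=10, replace=False)   # any column subset works here
C = A[:, subset]
# A A^T - C C^T is a sum of outer products a_i a_i^T over deselected i, so PSD
gap = np.linalg.eigvalsh(A @ A.T - C @ C.T)
print(gap.min())   # nonnegative up to floating-point error
```

The substance of the theorem is the matching lower bound, which is where the deterministic threshold on the deselected score mass enters.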
4. Applications in Regression, Feature Selection, and Kernel Methods
Deterministic ridge leverage sampling is applicable and effective in a range of settings:
- Ridge regression and classification: Using a sketch $C = AS$ formed from deterministic RLS sampling, regression coefficients corresponding to non-selected columns are forced to zero, resulting in built-in feature selection with provable risk control. Selecting with respect to the regularized leverage scores yields a risk bound competitive with alternatives such as the elastic net (McCurdy, 2018).
- Design and data valuation: Ridge leverage scores measure marginal gain under A- and D-optimality criteria, and when normalized can serve as Shapley-like data value surrogates (Mendoza-Smith, 3 Nov 2025).
- Active learning and data subset selection: In deterministic active learning, acquiring samples with the highest ridge leverage scores yields models whose test accuracy closely matches or exceeds classical uncertainty-based and geometric selection strategies (Mendoza-Smith, 3 Nov 2025).
- Nyström approximations in kernel ridge regression: Deterministic selection of kernel landmarks by (approximate) ridge leverage yields Nyström approximations with the same in-sample risk as full KRR and near-linear computational complexity, especially in cases with stationary kernels (Chen et al., 2021).
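The built-in feature selection in the first bullet can be illustrated directly: fitting ridge regression on a column sketch forces the coefficients of deselected columns to exact zeros. The subset below is a stand-in for a DRLS-selected set, chosen only to keep the example short:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 100, 30, 1.0
X = rng.standard_normal((n, d))
beta_true = np.zeros(d)
beta_true[:5] = rng.standard_normal(5)             # sparse ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

S = np.arange(10)                                  # stand-in for DRLS selection
C = X[:, S]                                        # sketch of selected columns
beta_S = np.linalg.solve(C.T @ C + lam * np.eye(len(S)), C.T @ y)
beta = np.zeros(d)
beta[S] = beta_S                                   # deselected coefficients stay 0
```

Unlike lasso-style shrinkage, the sparsity pattern here is fixed in advance by the scores, so the zeros are exact and the remaining coefficients solve an ordinary ridge problem.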
5. Connections to Spectral Sparsification and Feature Selection
Deterministic feature selection via RLS connects closely to single-set spectral sparsification (BSS), wherein a greedy procedure selects rows with weights to spectrally approximate the Gram matrix $A^{\top} A$ up to $(1 \pm \varepsilon)$ factors, ensuring that the risk of ridge regression on the reduced feature space inflates by at most a constant factor depending on $\varepsilon$ relative to the original (Paul et al., 2015). Both methods yield deterministic, interpretable subset selectors with explicit sample complexity guarantees, typically $O(k / \varepsilon^2)$ rows for rank $k$ and error $\varepsilon$.
6. Practical Considerations and Empirical Observations
- Parameter selection: Good practice suggests setting $k$ at the elbow of the singular value spectrum, $\lambda = \|A - A_k\|_F^2 / k$, and $\varepsilon$ to balance sketch size and error (McCurdy, 2018).
- Computational cost: Direct SVD or Cholesky factorization is $O(nd \min(n, d))$ for dense problems, with further acceleration possible via randomized or approximate techniques for large $n$ (Schreurs et al., 2021, Chen et al., 2021).
- Empirical efficacy: In applications such as multi-omic cancer data and deep-learning model training, deterministic RLS sampling yields compact, interpretable data sketches (small $|\Theta|$) with negligible loss in predictive accuracy, and in GAN training, empirically corrects mode drop and improves rare-mode coverage (McCurdy, 2018, Schreurs et al., 2021).
- Feature and landmark selection: Deterministic top-$k$ RLS selection matches or outperforms standard baselines (uncertainty, margin, entropy) in data-efficient regimes, particularly for high-dimensional or overparameterized models (Mendoza-Smith, 3 Nov 2025).
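The parameter recipe in the first bullet can be written out directly. The elbow rule below (largest drop between consecutive singular values) is one simple illustrative heuristic, not a prescription from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(6)
# Synthetic matrix with a decaying spectrum
A = rng.standard_normal((80, 40)) @ np.diag(np.linspace(2.0, 0.05, 40))
s = np.linalg.svd(A, compute_uv=False)

# Pick k at the spectrum "elbow": largest gap between consecutive singular values
k = int(np.argmax(s[:-1] - s[1:])) + 1
lam = (s[k:] ** 2).sum() / k        # lam = ||A - A_k||_F^2 / k
```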
7. Extensions and Kernel Generalizations
Kernelized deterministic ridge leverage score sampling extends the theory and algorithmic guarantees to non-linear settings. In those contexts, the scores are computed in dual (kernel) or primal (feature) form and can leverage the structure of stationary kernels for efficient approximation. A one-dimensional integral formula, based on the input density and the kernel's spectral density, enables linear-time (up to poly-logarithmic factors) computation of approximate scores. Sorting and selecting the top estimated scores yields a deterministic Nyström approximation matching the statistical risk of full-data solutions under regularity assumptions (Chen et al., 2021). Tuning of the regularization, landmark count, and kernel hyperparameters directly impacts both efficiency and downstream generalization.
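In the dual form, the kernel ridge leverage scores are the diagonal entries of K (K + n*lam*I)^{-1}; sorting them and keeping the top points as landmarks gives a deterministic Nyström approximation. A small sketch, where the Gaussian kernel and its bandwidth are illustrative choices rather than anything prescribed by the source:

```python
import numpy as np

def kernel_ridge_leverage_scores(K, lam):
    """tau_i = [K (K + n*lam*I)^{-1}]_{ii} for an n x n kernel matrix K."""
    n = K.shape[0]
    return np.diag(np.linalg.solve(K + n * lam * np.eye(n), K))

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                               # Gaussian (stationary) kernel

tau = kernel_ridge_leverage_scores(K, lam=1e-3)
landmarks = np.argsort(tau)[::-1][:15]              # deterministic top landmarks
K_nm = K[:, landmarks]
K_mm = K[np.ix_(landmarks, landmarks)]
K_hat = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T        # Nystrom approximation of K
```

The linear-time variant in Chen et al. (2021) replaces the exact solve above with the one-dimensional integral approximation of the scores; the selection step is unchanged.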
Key references: (McCurdy, 2018, Schreurs et al., 2021, Mendoza-Smith, 3 Nov 2025, Paul et al., 2015, Chen et al., 2021).