
High-Dimensional KRR: Spectrum & Sample Complexity

Updated 8 October 2025
  • High-Dimensional Kernel Ridge Regression (KRR) is a nonparametric method that estimates regression functions in reproducing kernel Hilbert spaces using regularized empirical risk minimization.
  • Its kernel spectrum inherits a power-law decay from anisotropic Gaussian data, creating distinct energy bands and an effective dimension that governs statistical risk.
  • When the regression target aligns with high-variance directions, KRR achieves optimal sample complexity, overcoming limitations of traditional isotropic analysis.

Kernel ridge regression (KRR) in high dimensions is a powerful nonparametric learning technique that estimates regression functions in reproducing kernel Hilbert spaces (RKHS) via regularized empirical risk minimization. Recent theoretical and methodological advances have focused on the impact of data covariance structure, particularly power-law anisotropy, on both the spectrum of nonlinear kernels and the corresponding statistical generalization properties. In the canonical power-law regime, the data distribution is non-isotropic, with input covariances decaying as $\sigma_j = C_\alpha j^{-\alpha}$ for some $\alpha > 0$. This induces fundamentally different spectral behavior from the isotropic case and strongly influences statistical rates, sample complexity, and the effective capacity of kernel methods.
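As a concrete reference point, the following minimal sketch fits KRR to synthetic data with a power-law covariance. The dimension, decay exponent, target function, and ridge level are illustrative assumptions, not settings taken from the paper.

```python
# Minimal sketch: KRR on anisotropic Gaussian inputs with power-law covariance.
# All modeling choices (d, alpha, the target, the ridge level) are illustrative.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n, d, alpha = 2000, 200, 1.5

# Power-law covariance: sigma_j = j^{-alpha} (C_alpha = 1 for simplicity).
sigma = np.arange(1, d + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((n, d)) * np.sqrt(sigma)      # diagonal-covariance Gaussian data
f_star = lambda X: X[:, 0] + 0.5 * X[:, 1] ** 2       # target aligned with high-variance directions
y = f_star(X) + 0.1 * rng.standard_normal(n)

# Inner-product (polynomial) kernel KRR; the regularization alpha is a free knob.
model = KernelRidge(alpha=1e-2, kernel="poly", degree=3, coef0=1.0)
model.fit(X, y)

X_test = rng.standard_normal((500, d)) * np.sqrt(sigma)
print("test MSE:", np.mean((model.predict(X_test) - f_star(X_test)) ** 2))
```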

1. Kernel Spectrum Inheritance from Power-Law Data

A central technical contribution is the explicit characterization of the kernel integral operator’s spectrum under power-law anisotropic Gaussian inputs. Consider inner-product kernels of the form $k(x, x') = h(\langle x, x' \rangle)$ with a smooth expansion $h(t) = \sum_{m=0}^{\infty} h_m t^m$. For mean-zero Gaussian data $x \in \mathbb{R}^d$ with diagonal covariance $\operatorname{diag}(\sigma_1, \ldots, \sigma_d)$ and $\sigma_j = C_\alpha j^{-\alpha}$ ($C_\alpha > 0$, $j = 1, \ldots, d$), the Mercer decomposition of the kernel allows eigenpairs to be indexed by multi-indices $\beta = (\beta_1, \ldots, \beta_d)$, with the corresponding eigenvalues given by

$$
\lambda_\beta \asymp \mathrm{const} \cdot \prod_{j=1}^d \sigma_j^{\beta_j} = \mathrm{const} \cdot \exp\!\left( -\alpha \sum_{j=1}^d \beta_j \log j \right).
$$

For monomial or polynomial kernels of degree DD, the spectral structure exhibits two regimes:

  • For low-degree polynomials, the spectrum presents distinct “energy bands” (spectral gaps) between degrees.
  • For higher-degree terms, the spectrum fills in, with eigenvalues “overlapping” to form a near-continuum.

Specifically, for the monomial kernel $k(x, x') = \langle x, x' \rangle^D$, the $m$-th largest eigenvalue behaves as

$$
\lambda_{(m)} \asymp \frac{m^{-\alpha} \operatorname{polylog}(d)}{r_0(\Sigma)^{D}}
$$

where the data effective dimension $r_0(\Sigma)$ is given by

$$
r_0(\Sigma) = \frac{\sum_{j=1}^d \sigma_j}{\max_j \sigma_j} \sim \begin{cases} d^{1-\alpha} & \alpha \in [0,1) \\ \log d & \alpha = 1 \\ O(1) & \alpha > 1 \end{cases}
$$

This analysis shows that for anisotropic (power-law) data, the kernel eigen-spectrum directly inherits the decay profile of the data covariance, in contrast to isotropic scenarios where degeneracies lead to flat energy levels.
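A quick way to see this inheritance empirically is to eigendecompose the Gram matrix of a monomial kernel on synthetic power-law data. The sketch below does exactly that; all parameter choices are illustrative assumptions.

```python
# Sketch: empirical spectrum of a degree-D monomial kernel on power-law Gaussian
# data. Eigenvalues of (Gram matrix / n) approximate the integral-operator spectrum.
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, D = 2000, 500, 1.0, 3

sigma = np.arange(1, d + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((n, d)) * np.sqrt(sigma)

K = (X @ X.T) ** D                      # monomial kernel <x, x'>^D
eigs = np.linalg.eigvalsh(K / n)[::-1]  # eigenvalues in descending order

r0 = sigma.sum() / sigma.max()          # effective dimension r_0(Sigma)
print("r_0(Sigma) ~", r0)
print("top eigenvalues:", eigs[:10])
# Plotting log(eigs) against log(rank) should show power-law decay inherited from
# sigma_j, with visible "energy bands" from the low-degree contributions.
```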

2. Excess Risk and High-Dimensional Sample Complexity

The paper provides a non-asymptotic characterization of the excess risk for KRR under this spectral regime. The KRR estimator $\hat{f}_\lambda$ is written in terms of the eigenfunction expansion of the kernel operator, and the regression target $f^\star$ is decomposed into “low-frequency” components (well represented by the kernel eigenfunctions with large eigenvalues) and “high-frequency” components.

The main result is that, in the regime $n = O(d^\kappa)$ ($\kappa > 0$), the generalization error asymptotically obeys

$$
R(\hat{f}_\lambda) = \left\| \left(I - S^{\mathsf{Low}(n)}\right) f^\star_{\mathsf{Low}(n)} \right\|_{L^2}^2 + o_d(1)
$$

where $S^{\mathsf{Low}(n)}$ is a shrinkage operator (depending on $\lambda$ and the spectrum) acting on the “low-frequency” part of $f^\star$. The set $\mathsf{Low}(n)$, indexing the learned eigenfunctions, grows with $n$; in structured power-law data, its cardinality is determined by the effective dimension $r_0(\Sigma)$ rather than the full ambient dimension $d$.
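To make the role of the shrinkage operator concrete, the sketch below evaluates the bias term in the eigenbasis using the standard per-mode ridge shrinkage factor $s_i = \lambda_i / (\lambda_i + \lambda/n)$. This specific form is an assumption for illustration and may differ from the paper's exact definition of $S^{\mathsf{Low}(n)}$.

```python
# Sketch of the bias term ||(I - S) f*_Low||^2 in the kernel eigenbasis, assuming
# the standard ridge shrinkage s_i = lambda_i / (lambda_i + reg/n).
import numpy as np

def krr_bias(eigenvalues, target_coeffs, reg, n):
    """Squared bias of KRR on the learned ('low-frequency') eigenfunctions."""
    s = eigenvalues / (eigenvalues + reg / n)        # per-mode shrinkage factor
    return np.sum(((1.0 - s) * target_coeffs) ** 2)  # ||(I - S) f*_Low||^2

# Toy power-law spectrum and a target concentrated on the leading modes.
lam = np.arange(1, 1001, dtype=float) ** (-1.5)
coeffs = np.arange(1, 1001, dtype=float) ** (-1.0)
print(krr_bias(lam, coeffs, reg=1e-3, n=10_000))
```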

This implies that the sample complexity required to accurately estimate $f^\star$ (up to the desired bias) scales as a function of $r_0(\Sigma)$ (see the numerical sketch after this list):

  • For $\alpha \in [0,1)$ (mild decay), $r_0(\Sigma) \sim d^{1-\alpha}$, so only $O(d^{1-\alpha})$ samples are needed for each degree of freedom.
  • For $\alpha = 1$ (borderline decay), $r_0(\Sigma) \sim \log d$.
  • For $\alpha > 1$ (strong decay), $r_0(\Sigma) = O(1)$, so the sample complexity does not grow with $d$.
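The three regimes can be checked numerically from the definition $r_0(\Sigma) = \sum_j \sigma_j / \max_j \sigma_j$, as in the sketch below (with $C_\alpha = 1$).

```python
# Sketch: scaling of the effective dimension r_0(Sigma) with d in the three
# power-law regimes (constants C_alpha set to 1).
import numpy as np

def effective_dimension(d, alpha):
    sigma = np.arange(1, d + 1, dtype=float) ** (-alpha)
    return sigma.sum() / sigma.max()

for alpha in (0.5, 1.0, 2.0):
    print(f"alpha = {alpha}:",
          [round(effective_dimension(d, alpha), 1) for d in (10**2, 10**3, 10**4)])
# alpha = 0.5 grows roughly like d^{1/2}; alpha = 1 grows like log d;
# alpha = 2 saturates at a constant (pi^2 / 6, about 1.64).
```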

3. Statistically Optimal Regimes and Alignment

The statistical advantage of KRR on power-law data is critically dependent on the alignment of the regression target with the high-variance directions of the input. When $f^\star$ projects primarily onto kernel eigenfunctions associated with large data variances (i.e., low-index coordinates in the power-law basis), the bias and sample complexity are governed by $r_0(\Sigma)$. In contrast, if $f^\star$ depends strongly on low-variance (high-index) input directions, the benefits of anisotropy disappear due to the vanishing spectral weights.

This effect resolves longstanding theoretical questions regarding the sharpness and optimality of KRR rates: sample complexity is dictated by the intrinsic effective dimension of the data, not the ambient input dimension, provided that the target function is appropriately aligned.
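A simple simulation illustrates the alignment effect: two unit-variance targets, one depending on a high-variance coordinate and one on a low-variance coordinate, are fit with the same kernel and sample size. The setup below is an illustrative assumption, not an experiment from the paper.

```python
# Sketch: target alignment with high- vs low-variance directions under the same
# kernel, sample size, and regularization. All choices here are illustrative.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n, d, alpha = 2000, 200, 1.5
sigma = np.arange(1, d + 1, dtype=float) ** (-alpha)

def sample(m):
    return rng.standard_normal((m, d)) * np.sqrt(sigma)

X, X_test = sample(n), sample(1000)
for name, j in [("aligned (x_1)", 0), ("misaligned (x_d)", d - 1)]:
    y = X[:, j] / np.sqrt(sigma[j])           # normalize so both targets have unit variance
    y_test = X_test[:, j] / np.sqrt(sigma[j])
    model = KernelRidge(alpha=1e-2, kernel="poly", degree=3, coef0=1.0).fit(X, y)
    mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"{name}: test MSE = {mse:.3f}")
# The aligned target is typically recovered with far smaller error at the same n,
# reflecting the vanishing spectral weight on low-variance directions.
```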

4. Comparison to Classical Source and Capacity Conditions

Traditional KRR theory often assumes power-law or exponential decay of the kernel operator’s spectrum (capacity condition) and makes smoothness assumptions on $f^\star$ relative to the RKHS (source condition). Previous analyses typically treated the data as isotropic, and imposed the power-law conditions directly on the kernel spectrum.

The novel aspect of this work is the rigorous derivation of how the data covariance, rather than the kernel function per se, determines the eigen-spectrum in high dimensions. Thus, data anisotropy (parameterized by $\alpha$) creates new generalization regimes and can reduce effective “function class complexity” even for rich, nonlinear kernels.

5. Theoretical and Methodological Implications

This analysis offers a precise technical framework for understanding why, in many modern tasks involving large structured data sets (e.g., with heavy-tailed or low-rank covariance), kernel and neural tangent kernel models display unexpectedly good generalization:

  • The rapid eigenvalue decay concentrates statistical capacity in a low-dimensional subspace.
  • Sample complexity is minimized when regression targets align with high-variance directions.
  • Spectral gaps persist for low-degree polynomials and facilitate accurate recovery of key signals even in non-asymptotic regimes.

The rigorous spectral description also provides a path for constructing optimally-adaptive KRR algorithms: tuning regularization and model selection strategies to the empirical effective dimension $r_0(\Sigma)$ and the associated data covariance spectrum, rather than the kernel alone.
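One way such adaptivity could look in practice is to estimate $r_0(\hat\Sigma)$ from the sample covariance and couple the ridge level to it. The rule $\lambda \propto 1/r_0$ in the sketch below is a heuristic assumption, not a prescription from the paper.

```python
# Sketch: heuristic KRR tuning driven by the empirical effective dimension.
# The coupling lambda = 1 / r_0 is an illustrative assumption.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def empirical_effective_dimension(X):
    """r_0 of the sample covariance: sum of eigenvalues over the largest one."""
    evals = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0.0, None)
    return evals.sum() / evals.max()

def adaptive_krr(X, y, degree=3):
    r0 = empirical_effective_dimension(X)
    reg = 1.0 / r0                                   # heuristic ridge level tied to r_0
    model = KernelRidge(alpha=reg, kernel="poly", degree=degree, coef0=1.0).fit(X, y)
    return model, r0
```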

6. Relevant Mathematical Expressions

  • Data eigenvalue (power-law decay of the data covariance): $\sigma_j = C_\alpha\, j^{-\alpha}$
  • Kernel eigenvalue (for multi-index $\beta$): $\lambda_\beta \sim \prod_j \sigma_j^{\beta_j}$
  • Effective dimension (sum-to-max ratio; data capacity scale): $r_0(\Sigma) \sim d^{1-\alpha}$ for $\alpha < 1$
  • Bias/excess risk (error on the low-frequency, learned eigenspace): $R(\hat{f}_\lambda) = \Vert (I - S^{\mathsf{Low}}) f^\star_{\mathsf{Low}} \Vert^2 + o_d(1)$
  • Spectral gap (between degree-$m$ and degree-$(m+1)$ polynomial kernel eigenvalues): explicit; the gap disappears at high $m$ for $\alpha > 0$

7. Significance and Extensions

These results represent the first rigorous analysis of nonlinear kernel ridge regression on non-isotropic, power-law data. The spectral characterizations directly inform practitioners when and why to expect dimensionality reduction, accelerated training, and favorable generalization in modern high-dimensional, structured-data settings. They also clarify that exploiting data anisotropy—either via kernel methods or even infinitely wide neural “lazy” models—can yield statistically optimal learning rates not accessible under isotropic or adversarial scenarios.

Future work may involve extending this spectral approach to non-Gaussian or non-diagonal covariance models, quantifying benefits under weaker alignment, and empirically calibrating regularization strategies to estimated r0(Σ)r_0(\Sigma) for automated model selection.


In conclusion, this direction establishes a fundamental link between the high-dimensional geometry of the data and the inductive bias of kernel methods, with clear implications for statistical machine learning, feature learning, and modern overparameterized models (Wortsman et al., 6 Oct 2025).
