
High-Dimensional KRR: Spectrum & Sample Complexity

Updated 8 October 2025
  • High-Dimensional Kernel Ridge Regression (KRR) is a nonparametric method that estimates regression functions in reproducing kernel Hilbert spaces using regularized empirical risk minimization.
  • Its kernel spectrum inherits a power-law decay from anisotropic Gaussian data, creating distinct energy bands and an effective dimension that governs statistical risk.
  • When the regression target aligns with high-variance directions, KRR achieves optimal sample complexity, overcoming limitations of traditional isotropic analysis.

Kernel ridge regression (KRR) in high dimensions is a powerful nonparametric learning technique that estimates regression functions in reproducing kernel Hilbert spaces (RKHS) via regularized empirical risk minimization. Recent theoretical and methodological advances have focused on the impact of data covariance structure, particularly power-law anisotropy, on both the spectrum of nonlinear kernels and the corresponding statistical generalization properties. In the canonical power-law regime, the data distribution is non-isotropic, with input covariances decaying as $\sigma_j = C_\alpha j^{-\alpha}$ for some $\alpha > 0$. This induces fundamentally different spectral behavior from the isotropic case and strongly influences statistical rates, sample complexity, and the effective capacity of kernel methods.
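As a concrete reference point, the following minimal sketch fits KRR to synthetic data with a power-law covariance. The dimension, decay exponent, target function, and ridge level are illustrative assumptions, not settings taken from the paper.

```python
# Minimal sketch: KRR on anisotropic Gaussian inputs with power-law covariance.
# All modeling choices (d, alpha, the target, the ridge level) are illustrative.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n, d, alpha = 2000, 200, 1.5

# Power-law covariance: sigma_j = j^{-alpha} (C_alpha = 1 for simplicity).
sigma = np.arange(1, d + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((n, d)) * np.sqrt(sigma)      # diagonal-covariance Gaussian data
f_star = lambda X: X[:, 0] + 0.5 * X[:, 1] ** 2       # target aligned with high-variance directions
y = f_star(X) + 0.1 * rng.standard_normal(n)

# Inner-product (polynomial) kernel KRR; the regularization alpha is a free knob.
model = KernelRidge(alpha=1e-2, kernel="poly", degree=3, coef0=1.0)
model.fit(X, y)

X_test = rng.standard_normal((500, d)) * np.sqrt(sigma)
print("test MSE:", np.mean((model.predict(X_test) - f_star(X_test)) ** 2))
```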

1. Kernel Spectrum Inheritance from Power-Law Data

A central technical contribution is the explicit characterization of the kernel integral operator’s spectrum under power-law anisotropic Gaussian inputs. Consider inner-product kernels of the form $k(x, x') = h(\langle x, x' \rangle)$ with a smooth expansion $h(t) = \sum_{m=0}^{\infty} h_m t^m$. For mean-zero Gaussian data $x \in \mathbb{R}^d$ with diagonal covariance $\operatorname{diag}(\sigma_1, \ldots, \sigma_d)$ and $\sigma_j = C_\alpha j^{-\alpha}$ ($C_\alpha > 0$, $j = 1, \ldots, d$), the Mercer decomposition of the kernel allows eigenpairs to be indexed by multi-indices $\beta = (\beta_1, \ldots, \beta_d)$, with the corresponding eigenvalues given by

$$
\lambda_\beta \asymp \mathrm{const} \cdot \prod_{j=1}^d \sigma_j^{\beta_j} = \mathrm{const} \cdot \exp\!\left( -\alpha \sum_{j=1}^d \beta_j \log j \right).
$$

For monomial or polynomial kernels of degree DD, the spectral structure exhibits two regimes:

  • For low-degree polynomials, the spectrum presents distinct “energy bands” (spectral gaps) between degrees.
  • For higher-degree terms, the spectrum fills in, with eigenvalues “overlapping” to form a near-continuum.

Specifically, for the monomial kernel $k(x, x') = \langle x, x' \rangle^D$, the $m$-th largest eigenvalue behaves as

$$
\lambda_{(m)} \asymp \frac{m^{-\alpha} \operatorname{polylog}(d)}{r_0(\Sigma)^{D}}
$$

where the data effective dimension $r_0(\Sigma)$ is given by

$$
r_0(\Sigma) = \frac{\sum_{j=1}^d \sigma_j}{\max_j \sigma_j} \sim \begin{cases} d^{1-\alpha} & \alpha \in [0,1) \\ \log d & \alpha = 1 \\ O(1) & \alpha > 1 \end{cases}
$$

This analysis shows that for anisotropic (power-law) data, the kernel eigen-spectrum directly inherits the decay profile of the data covariance, in contrast to isotropic scenarios where degeneracies lead to flat energy levels.
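A quick way to see this inheritance empirically is to eigendecompose the Gram matrix of a monomial kernel on synthetic power-law data. The sketch below does exactly that; all parameter choices are illustrative assumptions.

```python
# Sketch: empirical spectrum of a degree-D monomial kernel on power-law Gaussian
# data. Eigenvalues of (Gram matrix / n) approximate the integral-operator spectrum.
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, D = 2000, 500, 1.0, 3

sigma = np.arange(1, d + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((n, d)) * np.sqrt(sigma)

K = (X @ X.T) ** D                      # monomial kernel <x, x'>^D
eigs = np.linalg.eigvalsh(K / n)[::-1]  # eigenvalues in descending order

r0 = sigma.sum() / sigma.max()          # effective dimension r_0(Sigma)
print("r_0(Sigma) ~", r0)
print("top eigenvalues:", eigs[:10])
# Plotting log(eigs) against log(rank) should show power-law decay inherited from
# sigma_j, with visible "energy bands" from the low-degree contributions.
```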

2. Excess Risk and High-Dimensional Sample Complexity

The paper provides a non-asymptotic characterization of the excess risk for KRR under this spectral regime. The KRR estimator $\hat{f}_\lambda$ is written in terms of the eigenfunction expansion of the kernel operator, and the regression target $f^\star$ is decomposed into “low-frequency” components (well represented by the kernel eigenfunctions with large eigenvalues) and “high-frequency” components.

The main result is that, in the regime $n = O(d^\kappa)$ ($\kappa > 0$), the generalization error asymptotically obeys

$$
R(\hat{f}_\lambda) = \left\| \left(I - S^{\mathsf{Low}(n)}\right) f^\star_{\mathsf{Low}(n)} \right\|_{L^2}^2 + o_d(1)
$$

where $S^{\mathsf{Low}(n)}$ is a shrinkage operator (depending on $\lambda$ and the spectrum) acting on the “low-frequency” part of $f^\star$. The set $\mathsf{Low}(n)$, indexing the learned eigenfunctions, grows with $n$; in structured power-law data, its cardinality is determined by the effective dimension $r_0(\Sigma)$ rather than the full ambient dimension $d$.
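To make the role of the shrinkage operator concrete, the sketch below evaluates the bias term in the eigenbasis using the standard per-mode ridge shrinkage factor $s_i = \lambda_i / (\lambda_i + \lambda/n)$. This specific form is an assumption for illustration and may differ from the paper's exact definition of $S^{\mathsf{Low}(n)}$.

```python
# Sketch of the bias term ||(I - S) f*_Low||^2 in the kernel eigenbasis, assuming
# the standard ridge shrinkage s_i = lambda_i / (lambda_i + reg/n).
import numpy as np

def krr_bias(eigenvalues, target_coeffs, reg, n):
    """Squared bias of KRR on the learned ('low-frequency') eigenfunctions."""
    s = eigenvalues / (eigenvalues + reg / n)        # per-mode shrinkage factor
    return np.sum(((1.0 - s) * target_coeffs) ** 2)  # ||(I - S) f*_Low||^2

# Toy power-law spectrum and a target concentrated on the leading modes.
lam = np.arange(1, 1001, dtype=float) ** (-1.5)
coeffs = np.arange(1, 1001, dtype=float) ** (-1.0)
print(krr_bias(lam, coeffs, reg=1e-3, n=10_000))
```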

This implies that the sample complexity required to accurately estimate $f^\star$ (up to the desired bias) scales as a function of $r_0(\Sigma)$ (see the numerical sketch after this list):

  • For $\alpha \in [0,1)$ (mild decay), $r_0(\Sigma) \sim d^{1-\alpha}$, so only $O(d^{1-\alpha})$ samples are needed for each degree of freedom.
  • For $\alpha = 1$ (borderline decay), $r_0(\Sigma) \sim \log d$.
  • For $\alpha > 1$ (strong decay), $r_0(\Sigma) = O(1)$, so the sample complexity does not grow with $d$.
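The three regimes can be checked numerically from the definition $r_0(\Sigma) = \sum_j \sigma_j / \max_j \sigma_j$, as in the sketch below (with $C_\alpha = 1$).

```python
# Sketch: scaling of the effective dimension r_0(Sigma) with d in the three
# power-law regimes (constants C_alpha set to 1).
import numpy as np

def effective_dimension(d, alpha):
    sigma = np.arange(1, d + 1, dtype=float) ** (-alpha)
    return sigma.sum() / sigma.max()

for alpha in (0.5, 1.0, 2.0):
    print(f"alpha = {alpha}:",
          [round(effective_dimension(d, alpha), 1) for d in (10**2, 10**3, 10**4)])
# alpha = 0.5 grows roughly like d^{1/2}; alpha = 1 grows like log d;
# alpha = 2 saturates at a constant (pi^2 / 6, about 1.64).
```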

3. Statistically Optimal Regimes and Alignment

The statistical advantage of KRR on power-law data is critically dependent on the alignment of the regression target with the high-variance directions of the input. When $f^\star$ projects primarily onto kernel eigenfunctions associated with large data variances (i.e., low-index coordinates in the power-law basis), the bias and sample complexity are governed by $r_0(\Sigma)$. In contrast, if $f^\star$ depends strongly on low-variance (high-index) input directions, the benefits of anisotropy disappear due to the vanishing spectral weights.

This effect resolves longstanding theoretical questions regarding the sharpness and optimality of KRR rates: sample complexity is dictated by the intrinsic effective dimension of the data, not the ambient input dimension, provided that the target function is appropriately aligned.
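A simple simulation illustrates the alignment effect: two unit-variance targets, one depending on a high-variance coordinate and one on a low-variance coordinate, are fit with the same kernel and sample size. The setup below is an illustrative assumption, not an experiment from the paper.

```python
# Sketch: target alignment with high- vs low-variance directions under the same
# kernel, sample size, and regularization. All choices here are illustrative.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n, d, alpha = 2000, 200, 1.5
sigma = np.arange(1, d + 1, dtype=float) ** (-alpha)

def sample(m):
    return rng.standard_normal((m, d)) * np.sqrt(sigma)

X, X_test = sample(n), sample(1000)
for name, j in [("aligned (x_1)", 0), ("misaligned (x_d)", d - 1)]:
    y = X[:, j] / np.sqrt(sigma[j])           # normalize so both targets have unit variance
    y_test = X_test[:, j] / np.sqrt(sigma[j])
    model = KernelRidge(alpha=1e-2, kernel="poly", degree=3, coef0=1.0).fit(X, y)
    mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"{name}: test MSE = {mse:.3f}")
# The aligned target is typically recovered with far smaller error at the same n,
# reflecting the vanishing spectral weight on low-variance directions.
```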

4. Comparison to Classical Source and Capacity Conditions

Traditional KRR theory often assumes power-law or exponential decay of the kernel operator’s spectrum (capacity condition) and makes smoothness assumptions on $f^\star$ relative to the RKHS (source condition). Previous analyses typically treated the data as isotropic, and imposed the power-law conditions directly on the kernel spectrum.

The novel aspect of this work is the rigorous derivation of how the data covariance, rather than the kernel function per se, determines the eigen-spectrum in high dimensions. Thus, data anisotropy (parameterized by $\alpha$) creates new generalization regimes and can reduce effective “function class complexity” even for rich, nonlinear kernels.

5. Theoretical and Methodological Implications

This analysis offers a precise technical framework for understanding why, in many modern tasks involving large structured data sets (e.g., with heavy-tailed or low-rank covariance), kernel and neural tangent kernel models display unexpectedly good generalization:

  • The rapid eigenvalue decay concentrates statistical capacity in a low-dimensional subspace.
  • Sample complexity is minimized when regression targets align with high-variance directions.
  • Spectral gaps persist for low-degree polynomials and facilitate accurate recovery of key signals even in non-asymptotic regimes.

The rigorous spectral description also provides a path for constructing optimally-adaptive KRR algorithms: tuning regularization and model selection strategies to the empirical effective dimension $r_0(\Sigma)$ and the associated data covariance spectrum, rather than the kernel alone.
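One way such adaptivity could look in practice is to estimate $r_0(\hat\Sigma)$ from the sample covariance and couple the ridge level to it. The rule $\lambda \propto 1/r_0$ in the sketch below is a heuristic assumption, not a prescription from the paper.

```python
# Sketch: heuristic KRR tuning driven by the empirical effective dimension.
# The coupling lambda = 1 / r_0 is an illustrative assumption.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def empirical_effective_dimension(X):
    """r_0 of the sample covariance: sum of eigenvalues over the largest one."""
    evals = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0.0, None)
    return evals.sum() / evals.max()

def adaptive_krr(X, y, degree=3):
    r0 = empirical_effective_dimension(X)
    reg = 1.0 / r0                                   # heuristic ridge level tied to r_0
    model = KernelRidge(alpha=reg, kernel="poly", degree=degree, coef0=1.0).fit(X, y)
    return model, r0
```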

6. Relevant Mathematical Expressions

  • Data eigenvalue (power-law decay of the data covariance): $\sigma_j = C_\alpha\, j^{-\alpha}$
  • Kernel eigenvalue (for multi-index $\beta$): $\lambda_\beta \sim \prod_j \sigma_j^{\beta_j}$
  • Effective dimension (sum-to-max ratio; data capacity scale): $r_0(\Sigma) \sim d^{1-\alpha}$ for $\alpha < 1$
  • Bias/excess risk (error on the low-frequency, learned eigenspace): $R(\hat{f}_\lambda) = \Vert (I - S^{\mathsf{Low}}) f^\star_{\mathsf{Low}} \Vert^2 + o_d(1)$
  • Spectral gap (between degree-$m$ and degree-$(m+1)$ polynomial kernel eigenvalues): explicit; the gap disappears at high $m$ for $\alpha > 0$

7. Significance and Extensions

These results represent the first rigorous analysis of nonlinear kernel ridge regression on non-isotropic, power-law data. The spectral characterizations directly inform practitioners when and why to expect dimensionality reduction, accelerated training, and favorable generalization in modern high-dimensional, structured-data settings. They also clarify that exploiting data anisotropy—either via kernel methods or even infinitely wide neural “lazy” models—can yield statistically optimal learning rates not accessible under isotropic or adversarial scenarios.

Future work may involve extending this spectral approach to non-Gaussian or non-diagonal covariance models, quantifying benefits under weaker alignment, and empirically calibrating regularization strategies to estimated r0(Σ)r_0(\Sigma) for automated model selection.


In conclusion, this direction establishes a fundamental link between the high-dimensional geometry of the data and the inductive bias of kernel methods, with clear implications for statistical machine learning, feature learning, and modern overparameterized models (Wortsman et al., 6 Oct 2025).
