High-Dimensional KRR: Spectrum & Sample Complexity
- High-Dimensional Kernel Ridge Regression (KRR) is a nonparametric method that estimates regression functions in reproducing kernel Hilbert spaces using regularized empirical risk minimization.
- Its kernel spectrum inherits a power-law decay from anisotropic Gaussian data, creating distinct energy bands and an effective dimension that governs statistical risk.
- When the regression target aligns with high-variance directions, KRR achieves optimal sample complexity, overcoming limitations of traditional isotropic analysis.
Kernel ridge regression (KRR) in high dimensions is a powerful nonparametric learning technique that estimates regression functions in reproducing kernel Hilbert spaces (RKHS) via regularized empirical risk minimization. Recent theoretical and methodological advances have focused on the impact of data covariance structure, particularly power-law anisotropy, on both the spectrum of nonlinear kernels and the corresponding statistical generalization properties. In the canonical power-law regime, the data distribution is non-isotropic, with input covariance eigenvalues decaying as $\lambda_i \asymp i^{-\alpha}$ for some exponent $\alpha > 0$. This induces fundamentally different spectral behavior from the isotropic case and strongly influences statistical rates, sample complexity, and the effective capacity of kernel methods.
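As a concrete reference point, the data model underlying this regime can be written down in a few lines. The following is a minimal sketch, assuming covariance eigenvalues that decay exactly as $i^{-\alpha}$; the dimension, exponent, and sample size are illustrative choices, not values from the paper.

```python
import numpy as np

# Illustrative parameters (assumptions, not values from the paper).
d, alpha, n = 500, 1.2, 200

# Power-law covariance spectrum: lambda_i = i^{-alpha}, i = 1, ..., d.
lam = np.arange(1, d + 1, dtype=float) ** (-alpha)

# Mean-zero anisotropic Gaussian inputs X in R^{n x d} with diagonal covariance diag(lam).
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d)) * np.sqrt(lam)

print(X.shape)   # (200, 500)
print(lam[:5])   # fast-decaying leading variances
```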
1. Kernel Spectrum Inheritance from Power-Law Data
A central technical contribution is the explicit characterization of the kernel integral operator’s spectrum under power-law anisotropic Gaussian inputs. Consider inner-product kernels of the form $K(x, x') = \kappa(\langle x, x' \rangle / d)$ with a smooth expansion $\kappa(t) = \sum_{k \ge 0} c_k t^k$. For mean-zero Gaussian data with diagonal covariance $\Sigma = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ and $\lambda_i \asymp i^{-\alpha}$ ($i = 1, \dots, d$, $\alpha > 0$), the Mercer decomposition of the kernel allows eigenpairs to be indexed by multi-indices $\beta \in \mathbb{N}^d$, with the corresponding eigenvalues given (up to combinatorial constants) by

$$\mu_\beta \;\asymp\; c_{|\beta|} \prod_{i=1}^{d} \lambda_i^{\beta_i}.$$
For monomial or polynomial kernels of degree $k$, the spectral structure exhibits two regimes:
- For low-degree polynomials, the spectrum presents distinct “energy bands” (spectral gaps) between degrees.
- For higher-degree terms, the spectrum fills in, with eigenvalues “overlapping” to form a near-continuum.
Specifically, for the degree-$k$ monomial kernel the $j$-th largest eigenvalue behaves as

$$\mu_j \;\asymp\; d_{\mathrm{eff}}^{-k}\, j^{-\alpha},$$
where the data effective dimension is given by

$$d_{\mathrm{eff}} \;=\; \frac{\sum_{i=1}^{d} \lambda_i}{\lambda_1} \;\asymp\; \sum_{i=1}^{d} i^{-\alpha}.$$
This analysis shows that for anisotropic (power-law) data, the kernel eigen-spectrum directly inherits the decay profile of the data covariance, in contrast to isotropic scenarios where degeneracies lead to flat energy levels.
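The inheritance of the power-law decay can be observed numerically. The sketch below assumes the same data model and an illustrative low-degree inner-product kernel $\kappa(t) = (1 + t)^3$, then computes the empirical Gram spectrum and a rough log-log slope of its leading eigenvalues; the kernel choice and sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha, n = 300, 1.0, 1000          # illustrative sizes and decay exponent

# Anisotropic Gaussian inputs with power-law covariance eigenvalues.
lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((n, d)) * np.sqrt(lam)

# Inner-product kernel K(x, x') = kappa(<x, x'> / d) with kappa(t) = (1 + t)^3.
G = (1.0 + X @ X.T / d) ** 3

# Empirical kernel eigenvalues (descending); dividing by n approximates the
# spectrum of the kernel integral operator.
eigs = np.sort(np.linalg.eigvalsh(G))[::-1] / n

# Rough log-log slope over the leading eigenvalues: the decay is inherited from the
# data covariance, with plateaus/gaps marking the low-degree energy bands.
j = np.arange(1, 101)
slope = np.polyfit(np.log(j), np.log(eigs[:100]), 1)[0]
print(f"approximate decay exponent of the top 100 eigenvalues: {slope:.2f}")
```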
2. Excess Risk and High-Dimensional Sample Complexity
The paper provides a non-asymptotic characterization of the excess risk for KRR under this spectral regime. The KRR estimator is written in terms of the eigenfunction expansion of the kernel operator, and the regression target is decomposed into “low-frequency” (well-represented by the kernel eigenfunctions with large eigenvalues) and “high-frequency” components.
The main result is that, in the high-dimensional regime ($n, d \to \infty$ with $n$ growing polynomially in $d_{\mathrm{eff}}$), the generalization error asymptotically obeys

$$\mathbb{E}\,\big\lVert \hat f_\lambda - f^* \big\rVert_{L^2}^2 \;\simeq\; \big\lVert (I - \mathcal{S})\, f^*_{\mathcal{B}} \big\rVert_{L^2}^2 \;+\; \big\lVert f^*_{\mathcal{B}^c} \big\rVert_{L^2}^2,$$
where $\mathcal{S}$ is a shrinkage operator (depending on $n$, the ridge parameter $\lambda$, and the spectrum) acting on the “low-frequency” part $f^*_{\mathcal{B}}$ of $f^*$. The set $\mathcal{B}$, indexing the learned eigenfunctions, grows with $n$; in structured power-law data, its cardinality is determined by the effective dimension $d_{\mathrm{eff}}$, rather than the full ambient dimension $d$.
This implies that the sample complexity required to accurately estimate $f^*$ (up to the desired bias) scales as a function of $d_{\mathrm{eff}}$ rather than the ambient dimension $d$ (the resulting regimes are checked numerically in the sketch after this list):
- For $\alpha < 1$ (mild decay), $d_{\mathrm{eff}} \asymp d^{1-\alpha}$, so only on the order of $d^{1-\alpha}$ samples are needed for each degree of freedom.
- For $\alpha = 1$ (borderline decay), $d_{\mathrm{eff}} \asymp \log d$.
- For $\alpha > 1$ (strong decay), $d_{\mathrm{eff}} = O(1)$, so the sample complexity does not grow with $d$.
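The three regimes above follow directly from the sum-to-max definition of $d_{\mathrm{eff}}$ and are easy to check numerically; the dimensions in this sketch are arbitrary illustrative values.

```python
import numpy as np

def d_eff(d, alpha):
    """Sum-to-max ratio of a power-law covariance spectrum lambda_i = i^{-alpha}."""
    lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
    return lam.sum() / lam.max()

for alpha in (0.5, 1.0, 1.5):              # mild, borderline, strong decay
    for d in (10**3, 10**4, 10**5):
        print(f"alpha={alpha:3.1f}  d={d:6d}  d_eff={d_eff(d, alpha):9.1f}")
# alpha=0.5: d_eff grows roughly like d^{1/2}
# alpha=1.0: d_eff grows roughly like log d
# alpha=1.5: d_eff saturates at a constant
```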
3. Statistically Optimal Regimes and Alignment
The statistical advantage of KRR on power-law data is critically dependent on the alignment of the regression target with the high-variance directions of the input. When $f^*$ projects primarily onto kernel eigenfunctions associated with large data variances (i.e., low-index coordinates in the power-law basis), the bias and sample complexity are governed by $d_{\mathrm{eff}}$. In contrast, if $f^*$ depends strongly on low-variance (high-index) input directions, the benefits of anisotropy disappear due to the vanishing spectral weights.
This effect resolves longstanding theoretical questions regarding the sharpness and optimality of KRR rates: sample complexity is dictated by the intrinsic effective dimension of the data, not the ambient input dimension, provided that the target function is appropriately aligned.
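A small synthetic experiment makes the alignment effect visible. The sketch below fits KRR with an illustrative quadratic inner-product kernel to a linear target supported on a single input coordinate, once aligned with the highest-variance direction and once with the lowest-variance one; the kernel, ridge level, and sample sizes are assumptions chosen for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)
d, alpha, n_train, n_test, ridge = 200, 1.0, 400, 2000, 1e-3   # illustrative values

lam = np.arange(1, d + 1, dtype=float) ** (-alpha)

def sample(n):
    """Mean-zero Gaussian inputs with power-law diagonal covariance."""
    return rng.standard_normal((n, d)) * np.sqrt(lam)

def kern(A, B):
    """Quadratic inner-product kernel kappa(t) = (1 + t)^2."""
    return (1.0 + A @ B.T / d) ** 2

def relative_krr_error(coord):
    """Relative test error of KRR on the target f*(x) = x[coord]."""
    Xtr, Xte = sample(n_train), sample(n_test)
    ytr, yte = Xtr[:, coord], Xte[:, coord]
    weights = np.linalg.solve(kern(Xtr, Xtr) + ridge * n_train * np.eye(n_train), ytr)
    pred = kern(Xte, Xtr) @ weights
    return np.mean((pred - yte) ** 2) / np.mean(yte ** 2)

print("aligned target (highest-variance coordinate): ", relative_krr_error(0))
print("misaligned target (lowest-variance coordinate):", relative_krr_error(d - 1))
```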
4. Comparison to Classical Source and Capacity Conditions
Traditional KRR theory often assumes power-law or exponential decay of the kernel operator’s spectrum (capacity condition) and makes smoothness assumptions on $f^*$ relative to the RKHS (source condition). Previous analyses typically treated the data as isotropic, and imposed the power-law conditions directly on the kernel spectrum.
The novel aspect of this work is the rigorous derivation of how the data covariance, rather than the kernel function per se, determines the eigen-spectrum in high dimensions. Thus, data anisotropy (parameterized by the decay exponent $\alpha$) creates new generalization regimes and can reduce effective “function class complexity” even for rich, nonlinear kernels.
5. Theoretical and Methodological Implications
This analysis offers a precise technical framework for understanding why, in many modern tasks involving large structured data sets (e.g., with heavy-tailed or low-rank covariance), kernel and neural tangent kernel models display unexpectedly good generalization:
- The rapid eigenvalue decay concentrates statistical capacity in a low-dimensional subspace.
- Sample complexity is minimized when regression targets align with high-variance directions.
- Spectral gaps persist for low-degree polynomials and facilitate accurate recovery of key signals even in non-asymptotic regimes.
The rigorous spectral description also provides a path for constructing optimally-adaptive KRR algorithms: tuning regularization and model selection strategies to the empirical effective dimension and the associated data covariance spectrum, rather than the kernel alone.
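One way such adaptivity could look in practice is sketched below: estimate the effective dimension from the sample covariance spectrum and scale the ridge penalty with the estimated capacity per sample. Both the plug-in estimator and the rule $\lambda \propto \hat d_{\mathrm{eff}} / n$ are illustrative heuristics, not a tuning rule taken from the paper.

```python
import numpy as np

def empirical_effective_dimension(X):
    """Plug-in estimate of d_eff: sum-to-max ratio of the sample covariance spectrum."""
    spec = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    spec = np.clip(spec, 0.0, None)        # guard against tiny negative eigenvalues
    return spec.sum() / spec.max()

def adaptive_ridge(X, c=1.0):
    """Heuristic ridge level scaling with estimated capacity per sample: c * d_eff / n."""
    return c * empirical_effective_dimension(X) / X.shape[0]

# Example on synthetic power-law data (illustrative parameters).
rng = np.random.default_rng(3)
d, alpha, n = 300, 1.2, 500
lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((n, d)) * np.sqrt(lam)

print("estimated d_eff :", empirical_effective_dimension(X))
print("suggested ridge :", adaptive_ridge(X))
```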
6. Relevant Mathematical Expressions
| Quantity | Description | Formula/Scaling |
|---|---|---|
| Data eigenvalue $\lambda_i$ | Power-law decay of data covariance | $\lambda_i \asymp i^{-\alpha}$ |
| Kernel eigenvalue $\mu_\beta$ | For multi-index $\beta$ of total degree $k$ | $\mu_\beta \asymp c_k \prod_i \lambda_i^{\beta_i}$ |
| Effective dimension $d_{\mathrm{eff}}$ | Sum-to-max ratio (data capacity scale) | $d_{\mathrm{eff}} = \sum_i \lambda_i / \lambda_1$ ($\asymp d^{1-\alpha}$ for $\alpha < 1$, $\asymp \log d$ for $\alpha = 1$, $O(1)$ for $\alpha > 1$) |
| Bias/excess risk | Error on low-freq. components (learned eigenspace) | $\lVert (I - \mathcal{S}) f^*_{\mathcal{B}} \rVert_{L^2}^2 + \lVert f^*_{\mathcal{B}^c} \rVert_{L^2}^2$ |
| Spectral gap | Gap between degree-$k$ and degree-$(k+1)$ polynomial kernel eigenvalues | Explicit at low degrees; gap disappears at high $k$ for power-law data |
7. Significance and Extensions
These results represent the first rigorous analysis of nonlinear kernel ridge regression on non-isotropic, power-law data. The spectral characterizations directly inform practitioners when and why to expect dimensionality reduction, accelerated training, and favorable generalization in modern high-dimensional, structured-data settings. They also clarify that exploiting data anisotropy—either via kernel methods or even infinitely wide neural “lazy” models—can yield statistically optimal learning rates not accessible under isotropic or adversarial scenarios.
Future work may involve extending this spectral approach to non-Gaussian or non-diagonal covariance models, quantifying benefits under weaker alignment, and empirically calibrating regularization strategies to the estimated $d_{\mathrm{eff}}$ for automated model selection.
In conclusion, this direction establishes a fundamental link between the high-dimensional geometry of the data and the inductive bias of kernel methods, with clear implications for statistical machine learning, feature learning, and modern overparameterized models (Wortsman et al., 6 Oct 2025).