
Polynomial Convergence for Gaussian KRR

Updated 18 August 2025
  • The paper establishes explicit polynomial convergence rates for Gaussian KRR by linking bias–variance decompositions with polynomial eigenvalue decay.
  • It details both L2 and uniform error bounds, emphasizing the roles of smoothness, source conditions, and saturation effects in estimation accuracy.
  • Empirical validations and scalable algorithms illustrate that fixed-width Gaussian KRR can efficiently handle high-dimensional nonparametric regression.

Polynomial convergence rates for Gaussian kernel ridge regression (KRR) describe the rate at which the KRR estimator’s prediction error decays as a function of the sample size under polynomial eigenvalue decay scenarios. For the Gaussian kernel, which is infinitely smooth, recent results have precisely quantified these rates for both $L^2$ and uniform norms. This encompasses classical bias–variance decompositions, saturation effects, alignment phenomena, and the interplay with spectral and statistical characteristics. The topic is central to theoretical nonparametric regression, distributed algorithms, scalable solvers, and the statistical learning theory of kernel methods.

1. Theoretical Foundations and Frameworks

Kernel ridge regression estimates a target function $f_0$ by minimizing a regularized empirical risk functional over a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ generated by a positive-definite kernel $K$. For data $\{(x_i, y_i)\}_{i=1}^{n}$, the estimator

$$\hat{f}_{\lambda} = \arg\min_{f \in \mathcal{H}_{K}} \frac{1}{n} \sum_{i=1}^{n} (y_{i} - f(x_{i}))^{2} + \lambda \|f\|_{\mathcal{H}_{K}}^{2}$$

is characterized by first-order optimality conditions and Mercer decompositions. In the context of Gaussian kernels (and, more generally, radial kernels), the eigenvalues $\{\mu_j\}$ of the associated integral operator play a fundamental role, with rates often depending on their polynomial decay $\mu_j \sim j^{-t}$ (Zhang et al., 2013).
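
As a concrete reference point, the following minimal sketch (in Python with NumPy) computes the estimator above via the representer theorem, i.e. $\hat{f}_{\lambda}(x) = \sum_i \alpha_i K(x, x_i)$ with $\alpha = (K + n\lambda I)^{-1} y$. The bandwidth, function names, and synthetic example are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-sq / (2.0 * bandwidth**2))

def krr_fit(X, y, lam, bandwidth=1.0):
    """Representer-theorem solution alpha = (K + n*lam*I)^{-1} y for the objective
    (1/n) * sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(X_train, alpha, X_new, bandwidth=1.0):
    """Evaluate f_hat(x) = sum_i alpha_i K(x, x_i) at new points."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha

# Illustrative synthetic usage
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = np.sin(X @ np.array([3.0, 1.0])) + 0.1 * rng.standard_normal(500)
alpha = krr_fit(X, y, lam=1e-3)
y_hat = krr_predict(X, alpha, X)
```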

Error analysis is typically based on a bias–variance decomposition, with the squared $L^2$ error controlled by the regularization parameter $\lambda$, the spectral decay, and the smoothness of the target function:

$$\mathbb{E}\, \| \hat{f}_{\lambda} - f_{0} \|_{L^2}^{2} \lesssim \lambda \|f_{0}\|_{\mathcal{H}_K}^{2} + \frac{\text{effective dimension}}{n},$$

where the effective dimension is $\sum_j 1/(1 + \lambda/\mu_j)$ (Zhang et al., 2013).
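
A short sketch of the effective-dimension quantity used in this decomposition, applied to an illustrative polynomial eigenvalue sequence (the truncation level and decay exponent are assumptions made for the example):

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """N(lam) = sum_j 1 / (1 + lam / mu_j) = sum_j mu_j / (mu_j + lam)."""
    mu = np.asarray(eigvals, dtype=float)
    return float(np.sum(mu / (mu + lam)))

# Illustrative polynomial decay mu_j ~ j^{-t}, truncated at 10^4 terms
t, lam = 2.0, 1e-3
mu = np.arange(1, 10_001, dtype=float) ** (-t)
print(effective_dimension(mu, lam))  # grows roughly like lam^{-1/t} as lam -> 0
```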

2. Polynomial Rates for Gaussian Kernel Ridge Regression

Recent advances have established explicit polynomial convergence rates under fixed Gaussian kernel hyperparameters, closing historical gaps in both $L^2$ and uniform norms (Dommel et al., 15 Aug 2025). Under the assumption that $f_0 \in H^{s}(\mathcal{X})$ on $\mathcal{X} = [0,1]^d$ with $s \geq 12d$, the estimator obeys

$$\| \hat{f}_{\lambda} - f_{0} \|_{L^{2}}^{2} \leq C_{L^{2}} (1 + \tau^{2})\, n^{-s/(2d+s)}$$

for the regularization sequence $\lambda(n) = \exp(-n^{-2/(2d+s)})$, with probability at least $1 - 8 e^{-\tau}$ (Dommel et al., 15 Aug 2025). This quantifies the polynomial rate $n^{-s/(2d+s)}$ for smooth $f_0$.

For uniform convergence,

$$\|\hat{f}_{\lambda} - f_{0}\|_{C(\mathcal{X})} \leq A_{\infty}(\delta)\, (1 + \tau^{2})^{1/2}\, n^{-[s/(2(2d+s))]\, s_{\delta}}, \qquad s_{\delta} = \frac{s - 14d - 4d\delta}{s - 2d - 4d\delta},$$

under stronger smoothness conditions ($s > 14d$) and additional decay assumptions on the expansion coefficients (Dommel et al., 15 Aug 2025).
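
The following helpers simply evaluate the two rate exponents above for given $(s, d, \delta)$ and check the stated smoothness conditions; the function names and example values are illustrative, not from Dommel et al.:

```python
def l2_rate_exponent(s, d):
    """Exponent in the L2 bound n^{-s/(2d+s)}; the cited result assumes s >= 12d."""
    assert s >= 12 * d, "L2 bound stated for s >= 12d"
    return s / (2 * d + s)

def uniform_rate_exponent(s, d, delta):
    """Exponent in the uniform bound n^{-[s/(2(2d+s))] * s_delta}; assumes s > 14d."""
    assert s > 14 * d, "uniform bound stated for s > 14d"
    s_delta = (s - 14 * d - 4 * d * delta) / (s - 2 * d - 4 * d * delta)
    return (s / (2 * (2 * d + s))) * s_delta

# Illustrative values: d = 1, s = 20, delta = 0.1
print(l2_rate_exponent(20, 1))            # 20/22 ~ 0.909
print(uniform_rate_exponent(20, 1, 0.1))  # ~ 0.145
```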

These bounds provide theoretical justification for using fixed-width Gaussian KRR in nonparametric regression, correcting prior beliefs that only sub-polynomial or logarithmic rates were possible for fixed bandwidths.

3. Role of Smoothness, Source Condition, and Saturation

The convergence rate for Gaussian KRR is sensitive to the interplay between the kernel eigenvalue decay and the smoothness of the target function. The source condition is typically formulated as $f_0 \in H^{s}$, an interpolation space of smoothness $s > 0$; under eigenvalue decay $\mu_j \sim j^{-t}$, the polynomial rate becomes $n^{-t \min(s,2)}$. For $s \leq 2$, the estimator is minimax optimal (it matches the lower bound); for $s > 2$, the rate “saturates” at $n^{-2t}$, reflecting the fact that further smoothness does not yield better rates, a phenomenon known as saturation (Long et al., 24 Feb 2024).
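
A minimal sketch of the exponent rule just stated, showing the saturation plateau at $2t$ once $s$ exceeds 2 (the numerical values of $s$ and $t$ are illustrative assumptions):

```python
def krr_rate_exponent(s, t):
    """Rate exponent t * min(s, 2) under eigen-decay mu_j ~ j^{-t} and source smoothness s;
    the exponent stops improving ("saturates") at 2*t once s > 2."""
    return t * min(s, 2.0)

t = 1.5
for s in [0.5, 1.0, 2.0, 3.0, 5.0]:
    print(s, krr_rate_exponent(s, t))
# 0.75, 1.5, 3.0, 3.0, 3.0 -- the plateau beyond s = 2 is the saturation effect
```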

In high-dimensional settings with sample size $n \asymp d^{\gamma}$, one observes periodic plateau behavior and multiple descent phenomena: the error rate remains constant over intervals of $\gamma$ and then drops sharply as $\gamma$ increases (Zhang et al., 2 Jan 2024), elucidating non-monotonic phases in the learning curve. This analysis unifies results from several previous works by allowing the interpolation parameter $s$ to vary freely.

4. Connections to Gaussian Process Regression and Capacity-Dependent Analysis

The optimal convergence rates for Gaussian KRR align closely with those of Gaussian process (GP) regression, especially when the imposed kernel is smoother than the underlying true function (Wang et al., 2021). GP sample paths typically have smoothness $m_0(f)$, and if the KRR kernel has smoothness $m \geq m_0$, both regression procedures achieve the minimax rate $n^{-m_0/(2m_0 + d)}$, with the Gaussian kernel’s effective dimension modulating the capacity and the learning rate.

Capacity-dependent analysis addresses the scenario in which the true regression function does not lie in the RKHS. The rates then depend explicitly on a regularity (source) parameter $\zeta$ and an effective-dimension exponent $\gamma$, yielding (Lin et al., 2018)

$$\mathbb{E}\, \| S g - f_{\rho}\|_{\rho}^{2} \lesssim N^{-2\zeta/(2\zeta+\gamma)},$$

where $\lambda \sim N^{-1/(2\zeta+\gamma)}$ optimally balances bias and variance.
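
A tiny worked example of this trade-off, evaluating the rate exponent and the matching regularization exponent for assumed values of $\zeta$ and $\gamma$:

```python
def capacity_dependent_rate(zeta, gamma):
    """Return (rate exponent 2*zeta/(2*zeta+gamma), regularization exponent 1/(2*zeta+gamma))."""
    return 2 * zeta / (2 * zeta + gamma), 1 / (2 * zeta + gamma)

# Illustrative values: zeta = 0.5, gamma = 1.0 gives error ~ N^{-1/2} with lambda ~ N^{-1/2}
print(capacity_dependent_rate(0.5, 1.0))  # (0.5, 0.5)
```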

5. Computational and Algorithmic Aspects

Polynomial rates interact with computational complexity in scalable KRR solvers. Partition-based approaches decompose the estimation error into approximation, bias, variance, and regularization components; distributed algorithms attain minimax optimal rates as long as the partitioning preserves the effective dimensionality (Zhang et al., 2013, Tandon et al., 2016). Sparse approximations (Nyström, SVGP) enable polynomial rates at dramatically reduced cost, with the required number of inducing points $m$ scaling as follows (see the sketch after this list):

  • SE kernel: $m = O((\log n)^{d+2})$, rate $O(1/\sqrt{n})$
  • Matérn kernel: $m = O(n^{d/(2\nu+d)})$, rate $O(n^{-\nu/(2\nu+d)})$ (Vakili et al., 2022)
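
A small sketch of the inducing-point budgets implied by these orders; constants are ignored, so the outputs indicate scaling only, and the example $n$, $d$, and $\nu$ are assumptions:

```python
import numpy as np

def se_inducing_points(n, d):
    """Order of inducing points for the SE kernel, m = O((log n)^(d+2)), constants ignored."""
    return int(np.ceil(np.log(n) ** (d + 2)))

def matern_inducing_points(n, d, nu):
    """Order of inducing points for the Matern kernel, m = O(n^(d/(2*nu+d))), constants ignored."""
    return int(np.ceil(n ** (d / (2 * nu + d))))

n, d = 100_000, 2
print(se_inducing_points(n, d))           # poly-logarithmic growth in n
print(matern_inducing_points(n, d, 2.5))  # polynomial growth with exponent d/(2*nu+d)
```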

Randomized preconditioners (RPCholesky, KRILL) decouple convergence rates from the size and conditioning of the kernel matrix, ensuring rapid, condition-number-independent CG convergence when the spectrum decays polynomially (Díaz et al., 2023). Linear convergence of full KRR with scalable solvers such as ASkotch is achieved via Nyström preconditioners of rank comparable to the effective dimension (Rathore et al., 14 Jul 2024).
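
The sketch below illustrates the general idea behind such preconditioners, not any specific published algorithm: a rank-limited Nyström approximation (with uniform column sampling standing in for the pivoted or leverage-score selection used by RPCholesky-type methods) is inverted via the Woodbury identity and used inside a standard preconditioned CG loop for $(K + n\lambda I)\alpha = y$.

```python
import numpy as np

def nystrom_preconditioned_cg(K, y, mu, rank, rng, tol=1e-8, max_iter=500):
    """Solve (K + mu*I) alpha = y by CG, preconditioned with a rank-`rank` Nystrom
    approximation of K built from uniformly sampled columns (a stand-in for the
    pivoted/leverage-score column selection used by published preconditioners)."""
    n = K.shape[0]
    idx = rng.choice(n, size=rank, replace=False)
    C = K[:, idx]                                   # n x r sampled columns
    W = K[np.ix_(idx, idx)] + 1e-10 * np.eye(rank)  # r x r block, small jitter
    # Preconditioner P = C W^{-1} C^T + mu*I; by the Woodbury identity
    #   P^{-1} v = (v - C (mu*W + C^T C)^{-1} C^T v) / mu
    inner = mu * W + C.T @ C
    apply_Pinv = lambda v: (v - C @ np.linalg.solve(inner, C.T @ v)) / mu
    matvec = lambda v: K @ v + mu * v
    # Standard preconditioned conjugate gradient iteration
    alpha = np.zeros(n)
    r = y - matvec(alpha)
    z = apply_Pinv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = matvec(p)
        step = rz / (p @ Ap)
        alpha = alpha + step * p
        r = r - step * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(y):
            break
        z = apply_Pinv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return alpha

# Illustrative usage on a synthetic Gaussian kernel matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((1500, 3))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
K = np.exp(-sq / 2.0)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(1500)
alpha = nystrom_preconditioned_cg(K, y, mu=1500 * 1e-6, rank=150, rng=rng)
```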

6. Alignment, Truncation, and Transient Phenomena

Alignment between the target function and the kernel spectrum can induce faster polynomial rates, particularly under spectral truncation (TKRR). If the target’s expansion coefficients decay as $i^{-2\gamma\alpha - 1}$ and the kernel eigenvalues as $i^{-\alpha}$, an alignment parameter $\gamma > 1$ produces the accelerated rate $(\sigma^2/n)^{2\gamma\alpha/(2\gamma\alpha+1)}$ for TKRR, surpassing the standard KRR rate $(\sigma^2/n)^{2\alpha/(2\alpha+1)}$ in “over-aligned” regimes (Amini et al., 2022). Truncation also induces multiple descent and non-monotonic learning curve phenomena, especially when the target’s spectrum is bandlimited.
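
A short comparison of the two exponents above, showing how the accelerated TKRR exponent exceeds the standard KRR exponent once $\gamma > 1$ (the value of $\alpha$ is an illustrative assumption):

```python
def standard_krr_exponent(alpha):
    """Standard KRR rate exponent 2*alpha/(2*alpha+1) under eigen-decay i^{-alpha}."""
    return 2 * alpha / (2 * alpha + 1)

def tkrr_exponent(alpha, gamma):
    """Accelerated TKRR exponent 2*gamma*alpha/(2*gamma*alpha+1) under alignment gamma."""
    return 2 * gamma * alpha / (2 * gamma * alpha + 1)

alpha = 1.5
for gamma in [1.0, 2.0, 4.0]:
    print(gamma, standard_krr_exponent(alpha), tkrr_exponent(alpha, gamma))
# gamma = 1 recovers the standard exponent; gamma > 1 (over-alignment) strictly improves it
```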

7. Empirical Validation and Implications

Extensive numerical experiments verify the theoretical polynomial rates for both noiseless and noisy KRR estimators under varying smoothness, dimension, and kernel choices. The empirical risk matches the polynomial bounds, confirming minimax optimality for $s \leq 2$ and showing saturation for $s > 2$ (Long et al., 24 Feb 2024, Saber et al., 2023). Distributed and partitioned estimators have demonstrated computational superiority while retaining optimal rates (Tandon et al., 2016).
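
As a hedged illustration of how such empirical checks are typically run, the sketch below fits fixed-bandwidth Gaussian KRR on synthetic one-dimensional data at several sample sizes and regresses log test error on $\log n$ to estimate the empirical rate; the target function, noise level, bandwidth, and regularization schedule $\lambda(n) = 1/n$ are all assumptions made for the example, not the settings of the cited experiments.

```python
import numpy as np

def gaussian_kernel(X, Z, bw=0.5):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-sq / (2.0 * bw**2))

def krr_test_mse(n, rng, lam_fn, n_test=2000):
    """Fit fixed-bandwidth Gaussian KRR on n noisy samples of a smooth 1-d target
    and return the mean squared error on an independent test set."""
    f0 = lambda x: np.sin(2 * np.pi * x[:, 0])
    X = rng.uniform(size=(n, 1))
    y = f0(X) + 0.1 * rng.standard_normal(n)
    X_test = rng.uniform(size=(n_test, 1))
    K = gaussian_kernel(X, X)
    alpha = np.linalg.solve(K + n * lam_fn(n) * np.eye(n), y)
    pred = gaussian_kernel(X_test, X) @ alpha
    return np.mean((pred - f0(X_test)) ** 2)

rng = np.random.default_rng(0)
ns = np.array([200, 400, 800, 1600, 3200])
mses = np.array([krr_test_mse(n, rng, lam_fn=lambda m: 1.0 / m) for n in ns])
# The fitted slope of log(MSE) against log(n) estimates the empirical polynomial rate
slope = np.polyfit(np.log(ns), np.log(mses), 1)[0]
print(f"empirical decay ~ n^({slope:.2f})")
```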

These results justify the use of fixed-bandwidth Gaussian kernel ridge regression in large-scale, high-dimensional regression under mild smoothness and noise conditions, and they provide concrete guidance for selecting regularization and approximation parameters to achieve predictable polynomial error decay.

References

  • Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates (Zhang et al., 2013)
  • Kernel Ridge Regression via Partitioning (Tandon et al., 2016)
  • Optimal Rates of Kernel Ridge Regression under Source Condition in Large Dimensions (Zhang et al., 2 Jan 2024)
  • Uniform convergence for Gaussian kernel ridge regression (Dommel et al., 15 Aug 2025)
  • Optimal Rates and Saturation for Noiseless Kernel Ridge Regression (Long et al., 24 Feb 2024)
  • Spectrum of inner-product kernel matrices in the polynomial regime and multiple descent phenomenon in kernel ridge regression (Misiakiewicz, 2022)
  • Sharp Asymptotics of Kernel Ridge Regression Beyond the Linear Regime (Hu et al., 2022)
  • Target alignment in truncated kernel ridge regression (Amini et al., 2022)
  • Robust, randomized preconditioning for kernel ridge regression (Díaz et al., 2023)
  • Have ASkotch: A Neat Solution for Large-scale Kernel Ridge Regression (Rathore et al., 14 Jul 2024)
  • A Comprehensive Analysis on the Learning Curve in Kernel Ridge Regression (Cheng et al., 23 Oct 2024)
  • On the Improved Rates of Convergence for Matérn-type Kernel Ridge Regression, with Application to Calibration of Computer Models (Tuo et al., 2020)
  • Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms (Lin et al., 2018)
  • Convergence of Gaussian process regression: Optimality, robustness, and relationship with kernel ridge regression (Wang et al., 2021)
  • Improved Convergence Rates for Sparse Approximation Methods in Kernel-Based Learning (Vakili et al., 2022)
  • A Distribution Free Truncated Kernel Ridge Regression Estimator and Related Spectral Analyses (Saber et al., 2023)