
Polynomial Convergence for Gaussian KRR

Updated 18 August 2025
  • The paper establishes explicit polynomial convergence rates for Gaussian KRR by linking bias–variance decompositions with polynomial eigenvalue decay.
  • It details both L2 and uniform error bounds, emphasizing the roles of smoothness, source conditions, and saturation effects in estimation accuracy.
  • Empirical validations and scalable algorithms illustrate that fixed-width Gaussian KRR can efficiently handle high-dimensional nonparametric regression.

Polynomial convergence rates for Gaussian kernel ridge regression (KRR) describe the rate at which the KRR estimator’s prediction error decays as a function of the sample size under polynomial eigenvalue decay scenarios. For the Gaussian kernel, which is infinitely smooth, recent results have precisely quantified these rates for both $L^2$ and uniform norms. This encompasses classical bias–variance decompositions, saturation effects, alignment phenomena, and the interplay with spectral and statistical characteristics. The topic is central to theoretical nonparametric regression, distributed algorithms, scalable solvers, and the statistical learning theory of kernel methods.

1. Theoretical Foundations and Frameworks

Kernel ridge regression estimates a target function $f_0$ by minimizing a regularized empirical risk functional over a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ generated by a positive-definite kernel $K$. For data $\{(x_i, y_i)\}_{i=1}^{n}$, the estimator

$$\hat{f}_{\lambda} = \arg\min_{f \in \mathcal{H}_{K}} \frac{1}{n} \sum_{i=1}^{n} (y_{i} - f(x_{i}))^{2} + \lambda \|f\|_{\mathcal{H}_{K}}^{2}$$

is characterized by first-order optimality conditions and Mercer decompositions. In the context of Gaussian kernels (and, more generally, radial kernels), the eigenvalues $\{\mu_j\}$ of the associated integral operator play a fundamental role, with rates often depending on their polynomial decay $\mu_j \sim j^{-t}$ (Zhang et al., 2013).
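
As a concrete reference point, the following minimal sketch (in Python with NumPy) computes the estimator above via the representer theorem, i.e. $\hat{f}_{\lambda}(x) = \sum_i \alpha_i K(x, x_i)$ with $\alpha = (K + n\lambda I)^{-1} y$. The bandwidth, function names, and synthetic example are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-sq / (2.0 * bandwidth**2))

def krr_fit(X, y, lam, bandwidth=1.0):
    """Representer-theorem solution alpha = (K + n*lam*I)^{-1} y for the objective
    (1/n) * sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(X_train, alpha, X_new, bandwidth=1.0):
    """Evaluate f_hat(x) = sum_i alpha_i K(x, x_i) at new points."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha

# Illustrative synthetic usage
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = np.sin(X @ np.array([3.0, 1.0])) + 0.1 * rng.standard_normal(500)
alpha = krr_fit(X, y, lam=1e-3)
y_hat = krr_predict(X, alpha, X)
```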

Error analysis is typically based on a bias–variance decomposition, with the squared $L^2$ error controlled by the regularization parameter $\lambda$, the spectral decay, and the smoothness of the target function:

$$\mathbb{E}\, \| \hat{f}_{\lambda} - f_{0} \|_{L^2}^{2} \lesssim \lambda \|f_{0}\|_{\mathcal{H}_K}^{2} + \frac{\text{effective dimension}}{n},$$

where the effective dimension is $\sum_j 1/(1 + \lambda/\mu_j)$ (Zhang et al., 2013).
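
A short sketch of the effective-dimension quantity used in this decomposition, applied to an illustrative polynomial eigenvalue sequence (the truncation level and decay exponent are assumptions made for the example):

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """N(lam) = sum_j 1 / (1 + lam / mu_j) = sum_j mu_j / (mu_j + lam)."""
    mu = np.asarray(eigvals, dtype=float)
    return float(np.sum(mu / (mu + lam)))

# Illustrative polynomial decay mu_j ~ j^{-t}, truncated at 10^4 terms
t, lam = 2.0, 1e-3
mu = np.arange(1, 10_001, dtype=float) ** (-t)
print(effective_dimension(mu, lam))  # grows roughly like lam^{-1/t} as lam -> 0
```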

2. Polynomial Rates for Gaussian Kernel Ridge Regression

Recent advances have established explicit polynomial convergence rates under fixed Gaussian kernel hyperparameters, closing historical gaps in both $L^2$ and uniform norms (Dommel et al., 15 Aug 2025). Under the assumption that $f_0 \in H^{s}(\mathcal{X})$ on $\mathcal{X} = [0,1]^d$ with $s \geq 12d$, the estimator obeys

$$\| \hat{f}_{\lambda} - f_{0} \|_{L^{2}}^{2} \leq C_{L^{2}} (1 + \tau^{2})\, n^{-s/(2d+s)}$$

for the regularization sequence $\lambda(n) = \exp(-n^{-2/(2d+s)})$, with probability at least $1 - 8 e^{-\tau}$ (Dommel et al., 15 Aug 2025). This quantifies the polynomial rate $n^{-s/(2d+s)}$ for smooth $f_0$.

For uniform convergence,

$$\|\hat{f}_{\lambda} - f_{0}\|_{C(\mathcal{X})} \leq A_{\infty}(\delta)\, (1 + \tau^{2})^{1/2}\, n^{-[s/(2(2d+s))]\, s_{\delta}}, \qquad s_{\delta} = \frac{s - 14d - 4d\delta}{s - 2d - 4d\delta},$$

under stronger smoothness conditions ($s > 14d$) and additional decay assumptions on the expansion coefficients (Dommel et al., 15 Aug 2025).
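
The following helpers simply evaluate the two rate exponents above for given $(s, d, \delta)$ and check the stated smoothness conditions; the function names and example values are illustrative, not from Dommel et al.:

```python
def l2_rate_exponent(s, d):
    """Exponent in the L2 bound n^{-s/(2d+s)}; the cited result assumes s >= 12d."""
    assert s >= 12 * d, "L2 bound stated for s >= 12d"
    return s / (2 * d + s)

def uniform_rate_exponent(s, d, delta):
    """Exponent in the uniform bound n^{-[s/(2(2d+s))] * s_delta}; assumes s > 14d."""
    assert s > 14 * d, "uniform bound stated for s > 14d"
    s_delta = (s - 14 * d - 4 * d * delta) / (s - 2 * d - 4 * d * delta)
    return (s / (2 * (2 * d + s))) * s_delta

# Illustrative values: d = 1, s = 20, delta = 0.1
print(l2_rate_exponent(20, 1))            # 20/22 ~ 0.909
print(uniform_rate_exponent(20, 1, 0.1))  # ~ 0.145
```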

These bounds provide theoretical justification for using fixed-width Gaussian KRR in nonparametric regression, correcting prior beliefs that only sub-polynomial or logarithmic rates were possible for fixed bandwidths.

3. Role of Smoothness, Source Condition, and Saturation

The convergence rate for Gaussian KRR is sensitive to the interplay between the kernel eigenvalue decay and the smoothness of the target function. The source condition is typically formulated as $f_0 \in H^{s}$, an interpolation space of smoothness $s > 0$; under eigenvalue decay $\mu_j \sim j^{-t}$, the polynomial rate becomes $n^{-t \min(s,2)}$. For $s \leq 2$, the estimator is minimax optimal (it matches the lower bound); for $s > 2$, the rate “saturates” at $n^{-2t}$, reflecting the fact that further smoothness does not yield better rates, a phenomenon known as saturation (Long et al., 24 Feb 2024).
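
A minimal sketch of the exponent rule just stated, showing the saturation plateau at $2t$ once $s$ exceeds 2 (the numerical values of $s$ and $t$ are illustrative assumptions):

```python
def krr_rate_exponent(s, t):
    """Rate exponent t * min(s, 2) under eigen-decay mu_j ~ j^{-t} and source smoothness s;
    the exponent stops improving ("saturates") at 2*t once s > 2."""
    return t * min(s, 2.0)

t = 1.5
for s in [0.5, 1.0, 2.0, 3.0, 5.0]:
    print(s, krr_rate_exponent(s, t))
# 0.75, 1.5, 3.0, 3.0, 3.0 -- the plateau beyond s = 2 is the saturation effect
```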

In high-dimensional settings with sample size $n \asymp d^{\gamma}$, one observes periodic plateau behavior and multiple descent phenomena: the error rate remains constant over intervals of $\gamma$ and then drops sharply as $\gamma$ increases (Zhang et al., 2 Jan 2024), elucidating non-monotonic phases in the learning curve. This analysis unifies results from several previous works by allowing the interpolation parameter $s$ to vary freely.

4. Connections to Gaussian Process Regression and Capacity-Dependent Analysis

The optimal convergence rates for Gaussian KRR align closely with those of Gaussian process (GP) regression, especially when the imposed kernel is smoother than the underlying true function (Wang et al., 2021). GP sample paths typically have smoothness $m_0(f)$, and if the KRR kernel has smoothness $m \geq m_0$, both regression procedures achieve the minimax rate $n^{-m_0/(2m_0 + d)}$, with the Gaussian kernel’s effective dimension modulating the capacity and the learning rate.

Capacity-dependent analysis addresses the scenario in which the true regression function does not lie in the RKHS. The rates then depend explicitly on a regularity (source) parameter $\zeta$ and an effective-dimension exponent $\gamma$, yielding (Lin et al., 2018)

$$\mathbb{E}\, \| S g - f_{\rho}\|_{\rho}^{2} \lesssim N^{-2\zeta/(2\zeta+\gamma)},$$

where $\lambda \sim N^{-1/(2\zeta+\gamma)}$ optimally balances bias and variance.
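
A tiny worked example of this trade-off, evaluating the rate exponent and the matching regularization exponent for assumed values of $\zeta$ and $\gamma$:

```python
def capacity_dependent_rate(zeta, gamma):
    """Return (rate exponent 2*zeta/(2*zeta+gamma), regularization exponent 1/(2*zeta+gamma))."""
    return 2 * zeta / (2 * zeta + gamma), 1 / (2 * zeta + gamma)

# Illustrative values: zeta = 0.5, gamma = 1.0 gives error ~ N^{-1/2} with lambda ~ N^{-1/2}
print(capacity_dependent_rate(0.5, 1.0))  # (0.5, 0.5)
```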

5. Computational and Algorithmic Aspects

Polynomial rates interact with computational complexity in scalable KRR solvers. Partition-based approaches decompose the estimation error into approximation, bias, variance, and regularization components; distributed algorithms attain minimax optimal rates as long as the partitioning preserves the effective dimensionality (Zhang et al., 2013, Tandon et al., 2016). Sparse approximations (Nyström, SVGP) enable polynomial rates at dramatically reduced cost, with the required number of inducing points $m$ scaling as follows (see the sketch after this list):

  • SE kernel: $m = O((\log n)^{d+2})$, rate $O(1/\sqrt{n})$
  • Matérn kernel: $m = O(n^{d/(2\nu+d)})$, rate $O(n^{-\nu/(2\nu+d)})$ (Vakili et al., 2022)
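
A small sketch of the inducing-point budgets implied by these orders; constants are ignored, so the outputs indicate scaling only, and the example $n$, $d$, and $\nu$ are assumptions:

```python
import numpy as np

def se_inducing_points(n, d):
    """Order of inducing points for the SE kernel, m = O((log n)^(d+2)), constants ignored."""
    return int(np.ceil(np.log(n) ** (d + 2)))

def matern_inducing_points(n, d, nu):
    """Order of inducing points for the Matern kernel, m = O(n^(d/(2*nu+d))), constants ignored."""
    return int(np.ceil(n ** (d / (2 * nu + d))))

n, d = 100_000, 2
print(se_inducing_points(n, d))           # poly-logarithmic growth in n
print(matern_inducing_points(n, d, 2.5))  # polynomial growth with exponent d/(2*nu+d)
```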

Randomized preconditioners (RPCholesky, KRILL) decouple convergence rates from the size and conditioning of the kernel matrix, ensuring rapid, condition-number-independent CG convergence when the spectrum decays polynomially (Díaz et al., 2023). Linear convergence of full KRR with scalable solvers such as ASkotch is achieved via Nyström preconditioners of rank comparable to the effective dimension (Rathore et al., 14 Jul 2024).
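
The sketch below illustrates the general idea behind such preconditioners, not any specific published algorithm: a rank-limited Nyström approximation (with uniform column sampling standing in for the pivoted or leverage-score selection used by RPCholesky-type methods) is inverted via the Woodbury identity and used inside a standard preconditioned CG loop for $(K + n\lambda I)\alpha = y$.

```python
import numpy as np

def nystrom_preconditioned_cg(K, y, mu, rank, rng, tol=1e-8, max_iter=500):
    """Solve (K + mu*I) alpha = y by CG, preconditioned with a rank-`rank` Nystrom
    approximation of K built from uniformly sampled columns (a stand-in for the
    pivoted/leverage-score column selection used by published preconditioners)."""
    n = K.shape[0]
    idx = rng.choice(n, size=rank, replace=False)
    C = K[:, idx]                                   # n x r sampled columns
    W = K[np.ix_(idx, idx)] + 1e-10 * np.eye(rank)  # r x r block, small jitter
    # Preconditioner P = C W^{-1} C^T + mu*I; by the Woodbury identity
    #   P^{-1} v = (v - C (mu*W + C^T C)^{-1} C^T v) / mu
    inner = mu * W + C.T @ C
    apply_Pinv = lambda v: (v - C @ np.linalg.solve(inner, C.T @ v)) / mu
    matvec = lambda v: K @ v + mu * v
    # Standard preconditioned conjugate gradient iteration
    alpha = np.zeros(n)
    r = y - matvec(alpha)
    z = apply_Pinv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = matvec(p)
        step = rz / (p @ Ap)
        alpha = alpha + step * p
        r = r - step * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(y):
            break
        z = apply_Pinv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return alpha

# Illustrative usage on a synthetic Gaussian kernel matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((1500, 3))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
K = np.exp(-sq / 2.0)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(1500)
alpha = nystrom_preconditioned_cg(K, y, mu=1500 * 1e-6, rank=150, rng=rng)
```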

6. Alignment, Truncation, and Transient Phenomena

Alignment between the target function and the kernel spectrum can induce faster polynomial rates, particularly under spectral truncation (TKRR). If the target’s expansion coefficients decay as $i^{-2\gamma\alpha - 1}$ and the kernel eigenvalues as $i^{-\alpha}$, an alignment parameter $\gamma > 1$ produces the accelerated rate $(\sigma^2/n)^{2\gamma\alpha/(2\gamma\alpha+1)}$ for TKRR, surpassing the standard KRR rate $(\sigma^2/n)^{2\alpha/(2\alpha+1)}$ in “over-aligned” regimes (Amini et al., 2022). Truncation also induces multiple descent and non-monotonic learning curve phenomena, especially when the target’s spectrum is bandlimited.
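
A short comparison of the two exponents above, showing how the accelerated TKRR exponent exceeds the standard KRR exponent once $\gamma > 1$ (the value of $\alpha$ is an illustrative assumption):

```python
def standard_krr_exponent(alpha):
    """Standard KRR rate exponent 2*alpha/(2*alpha+1) under eigen-decay i^{-alpha}."""
    return 2 * alpha / (2 * alpha + 1)

def tkrr_exponent(alpha, gamma):
    """Accelerated TKRR exponent 2*gamma*alpha/(2*gamma*alpha+1) under alignment gamma."""
    return 2 * gamma * alpha / (2 * gamma * alpha + 1)

alpha = 1.5
for gamma in [1.0, 2.0, 4.0]:
    print(gamma, standard_krr_exponent(alpha), tkrr_exponent(alpha, gamma))
# gamma = 1 recovers the standard exponent; gamma > 1 (over-alignment) strictly improves it
```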

7. Empirical Validation and Implications

Extensive numerical experiments verify the theoretical polynomial rates for both noiseless and noisy KRR estimators under varying smoothness, dimension, and kernel choices. The empirical risk matches the polynomial bounds, confirming minimax optimality for $s \leq 2$ and showing saturation for $s > 2$ (Long et al., 24 Feb 2024, Saber et al., 2023). Distributed and partitioned estimators have demonstrated computational superiority while retaining optimal rates (Tandon et al., 2016).
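
As a hedged illustration of how such empirical checks are typically run, the sketch below fits fixed-bandwidth Gaussian KRR on synthetic one-dimensional data at several sample sizes and regresses log test error on $\log n$ to estimate the empirical rate; the target function, noise level, bandwidth, and regularization schedule $\lambda(n) = 1/n$ are all assumptions made for the example, not the settings of the cited experiments.

```python
import numpy as np

def gaussian_kernel(X, Z, bw=0.5):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-sq / (2.0 * bw**2))

def krr_test_mse(n, rng, lam_fn, n_test=2000):
    """Fit fixed-bandwidth Gaussian KRR on n noisy samples of a smooth 1-d target
    and return the mean squared error on an independent test set."""
    f0 = lambda x: np.sin(2 * np.pi * x[:, 0])
    X = rng.uniform(size=(n, 1))
    y = f0(X) + 0.1 * rng.standard_normal(n)
    X_test = rng.uniform(size=(n_test, 1))
    K = gaussian_kernel(X, X)
    alpha = np.linalg.solve(K + n * lam_fn(n) * np.eye(n), y)
    pred = gaussian_kernel(X_test, X) @ alpha
    return np.mean((pred - f0(X_test)) ** 2)

rng = np.random.default_rng(0)
ns = np.array([200, 400, 800, 1600, 3200])
mses = np.array([krr_test_mse(n, rng, lam_fn=lambda m: 1.0 / m) for n in ns])
# The fitted slope of log(MSE) against log(n) estimates the empirical polynomial rate
slope = np.polyfit(np.log(ns), np.log(mses), 1)[0]
print(f"empirical decay ~ n^({slope:.2f})")
```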

These results justify the use of fixed-bandwidth Gaussian kernel ridge regression in large-scale, high-dimensional regression under mild smoothness and noise conditions, and they provide concrete guidance for selecting regularization and approximation parameters to achieve predictable polynomial error decay.

References

  • Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates (Zhang et al., 2013)
  • Kernel Ridge Regression via Partitioning (Tandon et al., 2016)
  • Optimal Rates of Kernel Ridge Regression under Source Condition in Large Dimensions (Zhang et al., 2 Jan 2024)
  • Uniform convergence for Gaussian kernel ridge regression (Dommel et al., 15 Aug 2025)
  • Optimal Rates and Saturation for Noiseless Kernel Ridge Regression (Long et al., 24 Feb 2024)
  • Spectrum of inner-product kernel matrices in the polynomial regime and multiple descent phenomenon in kernel ridge regression (Misiakiewicz, 2022)
  • Sharp Asymptotics of Kernel Ridge Regression Beyond the Linear Regime (Hu et al., 2022)
  • Target alignment in truncated kernel ridge regression (Amini et al., 2022)
  • Robust, randomized preconditioning for kernel ridge regression (Díaz et al., 2023)
  • Have ASkotch: A Neat Solution for Large-scale Kernel Ridge Regression (Rathore et al., 14 Jul 2024)
  • A Comprehensive Analysis on the Learning Curve in Kernel Ridge Regression (Cheng et al., 23 Oct 2024)
  • On the Improved Rates of Convergence for Matérn-type Kernel Ridge Regression, with Application to Calibration of Computer Models (Tuo et al., 2020)
  • Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms (Lin et al., 2018)
  • Convergence of Gaussian process regression: Optimality, robustness, and relationship with kernel ridge regression (Wang et al., 2021)
  • Improved Convergence Rates for Sparse Approximation Methods in Kernel-Based Learning (Vakili et al., 2022)
  • A Distribution Free Truncated Kernel Ridge Regression Estimator and Related Spectral Analyses (Saber et al., 2023)