Universality of kernel random matrices and kernel regression in the quadratic regime (2408.01062v1)

Published 2 Aug 2024 in stat.ML, cs.LG, math.PR, math.ST, and stat.TH

Abstract: Kernel ridge regression (KRR) is a popular class of machine learning models that has become an important tool for understanding deep learning. Much of the focus has been on studying the proportional asymptotic regime, $n \asymp d$, where $n$ is the number of training samples and $d$ is the dimension of the dataset. In this regime, under certain conditions on the data distribution, the kernel random matrix involved in KRR exhibits behavior akin to that of a linear kernel. In this work, we extend the study of kernel regression to the quadratic asymptotic regime, where $n \asymp d^2$. In this regime, we demonstrate that a broad class of inner-product kernels exhibit behavior similar to a quadratic kernel. Specifically, we establish an operator norm approximation bound for the difference between the original kernel random matrix and a quadratic kernel random matrix with additional correction terms compared to the Taylor expansion of the kernel functions. The approximation works for general data distributions under a Gaussian-moment-matching assumption with a covariance structure. This new approximation is utilized to obtain a limiting spectral distribution of the original kernel matrix and characterize the precise asymptotic training and generalization errors for KRR in the quadratic regime when $n/d^2$ converges to a non-zero constant. The generalization errors are obtained for both deterministic and random teacher models. Our proof techniques combine moment methods, Wick's formula, orthogonal polynomials, and resolvent analysis of random matrices with correlated entries.

Summary

  • The paper introduces a quadratic kernel approximation that mimics inner-product kernel behavior in high-dimensional settings.
  • It demonstrates that the empirical spectral distribution converges to a deformed Marchenko–Pastur law when n scales quadratically with dimension.
  • The study reveals that kernel ridge regression attains deterministic limits in training error and generalization performance in the quadratic regime.

Essay on "Universality of kernel random matrices and kernel regression in the quadratic regime"

The paper "Universality of kernel random matrices and kernel regression in the quadratic regime" by Parthe Pandit, Zhichao Wang, and Yizhe Zhu extends the comprehension of Kernel Ridge Regression (KRR) within the high-dimensional statistical limits. This paper pivots from the extensively examined proportional asymptotic regime, where the sample size (nn) aligns with the data dimension (dd), to a quadratic regime where nd2n \asymp d^2.

Problem and Methodology

The authors aim to decipher the asymptotics of kernel regression matrices in this quadratic regime. Specifically, they concentrate on inner-product kernels, investigating how these kernels, when applied to high-dimensional datasets, can be approximated by simpler, quadratic kernels. The paper's foundation lies in establishing an operator-norm equivalence between inner-product kernel matrices and a quadratic kernel matrix under high-dimensional scaling.

Main Results

Quadratic Kernel Approximation

The authors introduce a critical approximation: the kernel matrix $K$ can be approximated by a quadratic kernel matrix $K^{(2)}$, given by:

$$K^{(2)} = a_0 \mathbf{1}\mathbf{1}^\top + a_1 XX^\top + a_2 (XX^\top)^{\odot 2} + a I,$$

where $a_0$, $a_1$, $a_2$, and $a$ are constants derived from the kernel function and the data covariance. This form combines a rank-one term, a linear term, a Hadamard-squared (quadratic) term, and a diagonal regularization term. Theorem 1 substantiates this approximation through a non-asymptotic concentration bound, asserting that $K$ and $K^{(2)}$ are close in spectral norm with high probability.
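
To make the structure concrete, here is a minimal numpy sketch (not the authors' code) that builds an inner-product kernel matrix $K_{ij} = f(\langle x_i, x_j\rangle/d)$ and a quadratic surrogate of the form above, then measures their operator-norm distance. The $1/d$ scaling, the kernel choice, and the placeholder coefficients are assumptions for illustration; the paper derives $a_0, a_1, a_2, a$ from the kernel function and the data covariance, including correction terms beyond the naive Taylor expansion.

```python
import numpy as np

def inner_product_kernel(X, f):
    """K_ij = f(<x_i, x_j> / d) for the rows x_i of X (shape n x d)."""
    d = X.shape[1]
    return f(X @ X.T / d)

def quadratic_surrogate(X, a0, a1, a2, a):
    """a0*11^T + a1*G + a2*G^{Hadamard 2} + a*I with G = XX^T / d.

    The 1/d scaling and the coefficients passed in are illustrative; the
    paper derives a0, a1, a2, a from the kernel function and the data
    covariance, with corrections beyond the plain Taylor expansion.
    """
    n = X.shape[0]
    G = X @ X.T / X.shape[1]
    return a0 * np.ones((n, n)) + a1 * G + a2 * G**2 + a * np.eye(n)

rng = np.random.default_rng(0)
d = 40
n = d**2 // 2                       # quadratic regime: n ~ d^2
X = rng.standard_normal((n, d))

f = np.exp                          # a smooth inner-product kernel
K = inner_product_kernel(X, f)

# Placeholder coefficients from the Taylor expansion of exp at 0
# (f(0), f'(0), f''(0)/2); `a` crudely absorbs the leftover diagonal mass.
a0, a1, a2 = 1.0, 1.0, 0.5
a = np.mean(np.diag(K)) - (a0 + a1 + a2)
K2 = quadratic_surrogate(X, a0, a1, a2, a)

# Theorem 1 controls this operator-norm distance with high probability.
print(np.linalg.norm(K - K2, ord=2))
```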

Limiting Spectral Distribution

In Theorem 2, the authors leverage their quadratic kernel approximation to describe the limiting spectral distribution of the kernel matrix. They identify that as $n, d \to \infty$ with $n \asymp d^2$, the empirical spectral distribution of the normalized kernel matrix converges to a deformed Marchenko-Pastur law. Specifically, this law depends on the aspect ratio $\alpha = \lim_{d \to \infty}\frac{d^2}{2n}$ and the data covariance structure.
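
As a numerical illustration of this statement, one can histogram the bulk eigenvalues of the kernel matrix for data in the quadratic regime; only the empirical side is sketched here, and the centering and rescaling below are assumptions that differ in detail from the exact normalization in Theorem 2.

```python
import numpy as np

# Sketch: empirical spectrum of the kernel matrix in the quadratic regime.
# The centering/rescaling is an illustrative assumption; Theorem 2 specifies
# the exact normalization under which the spectrum converges to a deformed
# Marchenko-Pastur law with aspect ratio alpha = lim d^2 / (2n).
rng = np.random.default_rng(1)
d, alpha = 40, 1.0
n = int(d**2 / (2 * alpha))         # so that d^2 / (2n) -> alpha
X = rng.standard_normal((n, d))

K = np.exp(X @ X.T / d)

# Remove the rank-one mean component and the diagonal shift to expose the bulk.
K_bulk = K - K.mean() * np.ones((n, n))
K_bulk -= np.diag(np.diag(K_bulk))

eigs = np.linalg.eigvalsh(K_bulk)
hist, edges = np.histogram(eigs, bins=60, density=True)
print("bulk eigenvalue range:", float(eigs.min()), float(eigs.max()))
```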

Kernel Ridge Regression Performance

The implications of these spectral properties are significant for KRR. In Theorem 3, the authors show that in the quadratic regime, the training error converges to a deterministic limit:

$$\mathcal{E}_{\text{train}}(\lambda) \to \lambda^2 \int \frac{\frac{c_2^2}{\alpha} x + \sigma_\varepsilon^2}{\left(\frac{f''(0)}{4\alpha}x + a^* + \lambda\right)^2} \, d\mu_{\alpha, \Sigma^{(2)}}(x).$$

This result underscores that the kernel model can completely fit the target function's linear components even for $\lambda > 0$.
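
Below is a minimal sketch of the finite-$n$ quantity whose limit Theorem 3 describes, using the standard KRR residual identity $y - K(K+\lambda I)^{-1}y = \lambda (K+\lambda I)^{-1} y$; the kernel, teacher, and noise level are illustrative stand-ins rather than the paper's exact model.

```python
import numpy as np

def krr_train_error(K, y, lam):
    """Training MSE of kernel ridge regression.

    Residual identity: y - K (K + lam*I)^{-1} y = lam * (K + lam*I)^{-1} y,
    so E_train(lam) = (lam^2 / n) * ||(K + lam*I)^{-1} y||^2.
    """
    n = K.shape[0]
    resid = lam * np.linalg.solve(K + lam * np.eye(n), y)
    return float(np.mean(resid**2))

rng = np.random.default_rng(2)
d, lam, sigma_eps = 30, 0.5, 0.1
n = d**2                              # quadratic regime
X = rng.standard_normal((n, d))
K = np.exp(X @ X.T / d)               # an inner-product kernel (illustrative)

# Illustrative quadratic teacher plus noise (not the paper's exact model).
A = rng.standard_normal((d, d)) / d
y = np.einsum('ij,jk,ik->i', X, A, X) + sigma_eps * rng.standard_normal(n)

print(krr_train_error(K, y, lam))     # a finite-n proxy for E_train(lam)
```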

In exploring the generalization errors in Theorems 4 and 5, the paper contrasts random and deterministic quadratic teachers. The random-teacher result shows that the generalization error asymptotically matches the ridge regression's variance and bias terms derived from the kernel spectrum:

$$\mathcal{R}(\lambda) - \sigma_\varepsilon^2(\lambda_*) - \mathcal{B}(\lambda_*) \to 0,$$

where $\lambda_*$ is derived from the fixed-point equation involving the spectrum's Stieltjes transform. Remarkably, for deterministic teachers, KRR's bias vanishes asymptotically, showcasing the robust learning capacity in the quadratic regime.
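
For intuition on what these limits describe, here is a hedged Monte-Carlo sketch of the finite-sample generalization error of KRR with a quadratic teacher; the teacher distribution, kernel, and regularization are illustrative choices, and the deterministic bias/variance limits and the fixed point $\lambda_*$ from Theorems 4 and 5 are not computed here.

```python
import numpy as np

def krr_fit_predict(X_tr, y_tr, X_te, f, lam):
    """KRR with inner-product kernel K_ij = f(<x_i, x_j>/d)."""
    d = X_tr.shape[1]
    K = f(X_tr @ X_tr.T / d)
    coef = np.linalg.solve(K + lam * np.eye(len(y_tr)), y_tr)
    return f(X_te @ X_tr.T / d) @ coef

def quad_teacher(Z, A):
    """Noiseless quadratic teacher y = z^T A z."""
    return np.einsum('ij,jk,ik->i', Z, A, Z)

rng = np.random.default_rng(3)
d, lam, sigma_eps = 30, 0.5, 0.1
n, n_test = d**2, 2000

A = rng.standard_normal((d, d)) / d          # a "random teacher" stand-in
X_tr = rng.standard_normal((n, d))
X_te = rng.standard_normal((n_test, d))
y_tr = quad_teacher(X_tr, A) + sigma_eps * rng.standard_normal(n)

y_hat = krr_fit_predict(X_tr, y_tr, X_te, np.exp, lam)
print("test MSE:", float(np.mean((y_hat - quad_teacher(X_te, A))**2)))
```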

Technical Contributions

Moment Method and Random Matrix Theory: The authors employ orthogonal polynomials, resolvent analysis, and Wick's formula in deriving their concentration inequalities and approximations. These technical tools enable precise asymptotic characterizations of kernel matrices beyond linear regimes.
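
As a concrete instance of the pairing combinatorics behind such moment computations, Wick's (Isserlis') formula for jointly Gaussian, mean-zero variables $g_1, g_2, g_3, g_4$ with covariance $\Sigma$ reads

$$\mathbb{E}[g_1 g_2 g_3 g_4] = \Sigma_{12}\Sigma_{34} + \Sigma_{13}\Sigma_{24} + \Sigma_{14}\Sigma_{23},$$

and higher mixed moments expand analogously as sums over pair partitions; this kind of bookkeeping is what moment-method arguments for kernel random matrices rely on.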

General Polynomial Regimes: The paper sets the stage for extending the analysis to polynomial regimes $n \asymp d^\ell$, presenting a more comprehensive framework for high-dimensional statistics across a range of asymptotic scalings.

Future Directions

This work encourages further exploration in several avenues:

  • Optimal Moment Conditions: Refining the moment matching conditions to generalize the analysis beyond Gaussian-like distributions.
  • Higher-Order Regimes: Extending the results to cases where $n \asymp d^\ell$ for $\ell > 2$, further dissecting the complexity of kernel and random matrices.
  • Practical Implications: Translating these theoretical results into empirical insights, particularly for training deep learning models with kernel-based methods.

Conclusion

This paper fundamentally advances the understanding of KRR in high-dimensional settings, specifically the quadratic regime, offering a rigorous approximation of kernel behaviors and their implications for learning performance. Through meticulous random matrix theory and moment-method analysis, it paves the way for new insights and methodologies in high-dimensional data analysis.