Kernel Regime in Neural Networks

Updated 21 April 2026

The kernel regime is defined by training dynamics in which the parameters move so little that the tangent kernel stays fixed, so the model can be treated as linear in a potentially infinite-dimensional RKHS.
It reduces nonlinear optimization to convex kernel regression, yielding a unique minimum-norm solution whose generalization can be quantified through the kernel spectrum (spectral bias).
This regime contrasts sharply with feature-learning dynamics, with explicit scaling laws and phase transitions delineating its applicability and limitations in practical networks.

The kernel regime encompasses a set of asymptotic, dynamical, and statistical limits in which nonlinear models—classically overparameterized neural networks—undergo training dynamics that are effectively equivalent to linearized gradient descent in a (potentially infinite-dimensional) feature space defined by a deterministic kernel, typically the neural tangent kernel (NTK) or closely related constructs. In the kernel regime, the parameter update is sufficiently small relative to the initialization that the tangent kernel remains essentially constant throughout training; consequently, optimization reduces to a convex problem in reproducing kernel Hilbert space (RKHS) and the learned predictor is the minimum-norm solution subject to interpolation constraints. This regime is in sharp contrast with the "rich" or "feature-learning" regimes, where the kernel evolves notably during training and the network acquires solution biases outside the fixed RKHS norm. The kernel regime is central to contemporary analyses of generalization, expressivity, and phase transitions in high-dimensional statistics and machine learning, providing tractable, non-asymptotic, and precise quantitative results across a diverse set of models and methodologies (Woodworth et al., 2020).

1. Definition and Dynamical Criteria of the Kernel Regime

The kernel regime—often called the "lazy" regime—obtains when the training dynamics of a nonlinear model can be well-approximated by a first-order Taylor expansion at initialization: $f(\theta_t, x) \approx f(\theta_0, x) + \nabla_\theta f(\theta_0, x)^\top (\theta_t - \theta_0)$ with the parameter displacement $\|\theta_t - \theta_0\|$ remaining small. The corresponding tangent kernel,

$K_\theta(x, x') = \nabla_\theta f(\theta, x)^\top \nabla_\theta f(\theta, x'),$

remains effectively fixed at its initial value $K_0(x, x') = K_{\theta_0}(x, x')$. Training is then mathematically equivalent to kernel ridge regression in the associated RKHS. The regime is reached as the width (and hence the number of parameters) tends to infinity and/or through initialization scaling, as in $D$-homogeneous networks with initialization $\theta(0) = \alpha\, \theta_0$ and $\alpha \to \infty$ (Woodworth et al., 2020).
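
The first-order expansion and the tangent kernel above can be checked numerically on a toy model. The following is a minimal sketch, assuming a two-layer tanh network $f(\theta, x) = a^\top \tanh(Wx)/\sqrt{m}$; the architecture, the sizes, and the helper names `forward`, `jacobian`, and `linearization_error` are illustrative choices, not taken from the cited works. It builds the Gram matrix $K_0$ from the Jacobian at initialization and measures how far the network departs from its linearization for two displacement sizes.

```python
import numpy as np

def forward(W, a, X):
    """f(theta, x) = a^T tanh(W x) / sqrt(m) for each row x of X."""
    return np.tanh(X @ W.T) @ a / np.sqrt(a.size)

def jacobian(W, a, X):
    """Row i holds grad_theta f(theta, x_i), with theta flattened as [W, a]."""
    H = np.tanh(X @ W.T)                                  # (n, m) hidden activations
    dW = ((1.0 - H**2) * a)[:, :, None] * X[:, None, :]   # (n, m, d): sqrt(m) * df/dW
    return np.hstack([dW.reshape(len(X), -1), H]) / np.sqrt(a.size)

rng = np.random.default_rng(0)
n, d, m = 5, 3, 64                      # toy sizes (assumed)
X = rng.standard_normal((n, d))
W0, a0 = rng.standard_normal((m, d)), rng.standard_normal(m)

J0 = jacobian(W0, a0, X)
K0 = J0 @ J0.T                          # tangent kernel Gram matrix K_0(x_i, x_j)

# A fixed random direction in parameter space, used to move away from theta_0.
dW, da = rng.standard_normal((m, d)), rng.standard_normal(m)
direction = np.concatenate([dW.ravel(), da])

def linearization_error(eps):
    """Largest gap between the network and its first-order expansion at theta_0."""
    exact = forward(W0 + eps * dW, a0 + eps * da, X)
    linear = forward(W0, a0, X) + J0 @ (eps * direction)
    return np.max(np.abs(exact - linear))

err_large, err_small = linearization_error(1e-2), linearization_error(1e-3)
print(err_large, err_small)  # the gap should shrink roughly quadratically in eps
```

Because the remainder of a first-order Taylor expansion is second order in the displacement, shrinking the displacement tenfold should reduce the gap by roughly a factor of one hundred, which is the sense in which small parameter movement keeps training "lazy".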

In classic fully-connected deep networks, the NTK, defined as

$\Theta(x, x') = \left\langle \nabla_{\theta} f(x ;\theta_0),\; \nabla_{\theta} f(x' ;\theta_0) \right\rangle,$

converges almost surely to a deterministic kernel in the infinite-width limit, and the ensuing training dynamics are linear by construction (Nitta, 2018). During training, the dynamic NTK $K_t(x, x')$ stays "frozen" at $K_0$, so the kernel regime persists for as long as the parameter drift $\|\theta_t - \theta_0\|$ remains small (Woodworth et al., 2020, Bowman et al., 2022).
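
Convergence of the NTK at initialization to a deterministic limit can be illustrated on the simplest case. The sketch below assumes a two-layer ReLU network $f(x) = a^\top \mathrm{relu}(Wx)/\sqrt{m}$ with standard Gaussian weights and unit-norm inputs, for which the limit can be written with the standard arc-cosine kernel expressions; the function names and the specific inputs are hypothetical, and the cited works treat far more general architectures.

```python
import numpy as np

def empirical_ntk(x1, x2, width, rng):
    """NTK at a random initialization of f(x) = a^T relu(W x) / sqrt(width)."""
    W = rng.standard_normal((width, x1.size))
    a = rng.standard_normal(width)
    z1, z2 = W @ x1, W @ x2
    # Gradients with respect to the output weights a.
    k_out = np.maximum(z1, 0.0) @ np.maximum(z2, 0.0)
    # Gradients with respect to the hidden weights W.
    k_hid = np.sum(a**2 * (z1 > 0) * (z2 > 0)) * (x1 @ x2)
    return (k_out + k_hid) / width

def limiting_ntk(x1, x2):
    """Infinite-width limit for unit-norm inputs (arc-cosine kernels of degree 1 and 0)."""
    c = np.clip(x1 @ x2, -1.0, 1.0)
    angle = np.arccos(c)
    k_out = (np.sin(angle) + (np.pi - angle) * c) / (2.0 * np.pi)
    k_hid = c * (np.pi - angle) / (2.0 * np.pi)
    return k_out + k_hid

rng = np.random.default_rng(0)
x1 = np.array([1.0, 0.0])
x2 = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])  # 60 degrees from x1

limit = limiting_ntk(x1, x2)
gaps = {m: abs(empirical_ntk(x1, x2, m, rng) - limit) for m in (10, 1_000, 100_000)}
for m, gap in gaps.items():
    print(f"width={m:>7d}  |empirical - limit| = {gap:.4f}")
```

The empirical kernel is an average of `width` independent terms, so its fluctuation around the limit is expected to shrink on the order of $1/\sqrt{m}$.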

2. Characteristic Properties and Solution Structure

Within the kernel regime, optimization under gradient flow with squared loss converges to the minimum-norm interpolant in the RKHS:

$\hat f(x) = K_0(x, X)\, K_0(X, X)^{-1}\, y,$

where $K_0(X, X)$ is the fixed NTK Gram matrix on the training inputs $X$, $y$ is the vector of training targets, and the initial output $f(\theta_0, \cdot)$ is taken to be zero. The optimization problem is convex and its solution unique, eliminating spurious local minima (Nitta, 2018). Generalization and sample complexity are therefore governed by classical kernel theory: the Rademacher complexity, spectral bias, and RKHS norm depend critically on the kernel's eigenstructure and the input distribution (Woodworth et al., 2020, Bowman et al., 2022). Specifically, learning is biased towards the top eigenfunctions ("spectral bias"): the residual along an eigendirection with eigenvalue $\lambda_i$ decays like $e^{-\lambda_i t}$ up to the learning-rate scaling, so high-eigenvalue directions are fit at a faster exponential rate, irrespective of the target $f^*$ (Bowman et al., 2022). The theory encompasses not only fully connected networks but also models with convolutional and residual architectures, provided the scaling required for the kernel regime holds (Bowman et al., 2022).
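
The link between gradient descent on the linearized model and kernel regression can be made concrete. The sketch below is an illustration under stated assumptions: random Gaussian rows stand in for the tangent features $\nabla_\theta f(\theta_0, x_i)$, the initial output is taken to be zero, and all variable names are hypothetical. It runs gradient descent on the linearized squared loss from zero displacement and compares the result with the minimum-norm displacement and with the kernel predictor built from the Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 40                                     # fewer samples than parameters
Phi = rng.standard_normal((n, p)) / np.sqrt(p)   # rows stand in for grad_theta f(theta_0, x_i)
phi_new = rng.standard_normal(p) / np.sqrt(p)    # tangent features of a held-out input
y = rng.standard_normal(n)                       # targets (initial output assumed zero)

K = Phi @ Phi.T                                  # Gram matrix of the fixed tangent kernel
k_new = Phi @ phi_new                            # kernel between held-out and training inputs

# Gradient descent on 0.5 * ||Phi @ delta - y||^2, starting at delta = theta - theta_0 = 0.
delta = np.zeros(p)
step = 1.0 / np.linalg.eigvalsh(K).max()
for _ in range(5000):
    delta -= step * Phi.T @ (Phi @ delta - y)

delta_min_norm = np.linalg.pinv(Phi) @ y         # smallest displacement that interpolates
pred_gd = phi_new @ delta                        # linearized network on the held-out input
pred_kernel = k_new @ np.linalg.solve(K, y)      # kernel interpolant K(x, X) K(X, X)^{-1} y

print(np.linalg.norm(delta - delta_min_norm), abs(pred_gd - pred_kernel))
```

Starting from zero, the iterates never leave the row space of the feature matrix, which is why gradient descent selects the minimum-norm displacement among all interpolating ones and why its prediction coincides with the kernel formula.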

3. Scaling Laws, Phase Transitions, and Model Transitions

The kernel regime is controlled by explicit scaling laws in width, initialization, and depth. For $D$-homogeneous models, large initialization ($\alpha \to \infty$) suppresses feature learning, and the evolution remains linear in parameters, maintaining the kernel regime (Woodworth et al., 2020). For finite (but large) width, the regime is approximate, with corrections that can be analyzed quantitatively (Bowman et al., 2022, Shilton et al., 2024). As initialization or width is reduced, or depth increased, networks can transition to the "rich" regime, where feature learning and non-RKHS biases arise. In this latter regime, optimization is no longer characterized by the minimum RKHS norm; instead, implicit $\ell_1$, nuclear norm, or other structured inductive biases may emerge (Woodworth et al., 2020).
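
The effect of the initialization scale can be worked out in closed form for a simple homogeneous model. The sketch below assumes the two-layer diagonal parameterization $f(x) = \langle w_+^2 - w_-^2, x \rangle$ with $w_+(0) = w_-(0) = \alpha \mathbf{1}$, which is one reading of the diagonal models studied by Woodworth et al. (2020) and should be treated as an assumption here. Its tangent kernel is $K_w(x, x') = 4 \sum_i (w_{+,i}^2 + w_{-,i}^2)\, x_i x'_i$, and gradient flow conserves $w_{+,i} w_{-,i} = \alpha^2$, so the kernel weights at any point along the trajectory depend only on the current coefficients $\beta = w_+^2 - w_-^2$. The helper name `relative_kernel_drift` is hypothetical.

```python
import numpy as np

def relative_kernel_drift(beta, alpha):
    """Relative change of the tangent-kernel weights w_+^2 + w_-^2 from initialization.

    Uses the gradient-flow invariant w_+ * w_- = alpha^2, which gives
    w_+^2 + w_-^2 = sqrt(beta^2 + 4 * alpha^4) when w_+^2 - w_-^2 = beta.
    """
    beta = np.asarray(beta, dtype=float)
    current = np.sqrt(beta**2 + 4.0 * alpha**4)
    initial = 2.0 * alpha**2
    return np.max(np.abs(current - initial)) / initial

beta = [1.0, 0.0, -2.0]   # coefficients of some fitted predictor (illustrative values)
for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha:5.1f}  relative kernel drift = {relative_kernel_drift(beta, alpha):.2e}")
```

For large $\alpha$ the drift behaves like $\max_i \beta_i^2 / (8\alpha^4)$, so reaching the same predictor changes the kernel by a vanishing relative amount, whereas for small $\alpha$ the kernel at the end of training bears little resemblance to the initial one.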

Empirical results confirm this transition in practical settings, for instance in deep ReLU networks and matrix completion. For depth-$D$ diagonal models, the kernel regime yields $\ell_2$-type implicit regularization, whereas as the initialization scale decreases the model's bias interpolates towards $\ell_1$, with the speed of the transition increasing with depth (Woodworth et al., 2020). Matrix factorization models must satisfy an analogous scaling condition to maintain kernel behavior, and its breakdown leads to a sharp transition in generalization and exact recovery (Woodworth et al., 2020).
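
The interpolation between $\ell_2$-type and $\ell_1$-type bias can be made explicit for the depth-2 case. The sketch below assumes the diagonal model $\beta = w_+^2 - w_-^2$ with $w_\pm(0) = \alpha \mathbf{1}$ trained by gradient flow on squared loss. Integrating the flow gives $\beta_i = 2\alpha^2 \sinh(\nu_i)$ with $\nu$ in the row space of the data, which is the stationarity condition for minimizing $Q_\alpha(\beta) = \alpha^2 \sum_i q(\beta_i/\alpha^2)$ over interpolating solutions, where $q'(z) = \operatorname{arcsinh}(z/2)$. This matches the form of the potential reported by Woodworth et al. (2020) as far as can be reconstructed here; the function names and test vectors are illustrative.

```python
import numpy as np

def q(z):
    """Scalar potential with q(0) = 0 and q'(z) = arcsinh(z / 2)."""
    z = np.asarray(z, dtype=float)
    return 2.0 - np.sqrt(4.0 + z**2) + z * np.arcsinh(z / 2.0)

def implicit_penalty(beta, alpha):
    """Q_alpha(beta) = alpha^2 * sum_i q(beta_i / alpha^2)."""
    beta = np.asarray(beta, dtype=float)
    return alpha**2 * np.sum(q(beta / alpha**2))

# Two vectors with the same l1 norm but different l2 norms.
sparse = np.array([1.0, 0.0, 0.0, 0.0])
dense = np.array([0.25, 0.25, 0.25, 0.25])

ratio_large_init = implicit_penalty(sparse, 10.0) / implicit_penalty(dense, 10.0)
ratio_small_init = implicit_penalty(sparse, 1e-3) / implicit_penalty(dense, 1e-3)
print(ratio_large_init, ratio_small_init)
```

With large $\alpha$ the penalty behaves like $\|\beta\|_2^2 / (4\alpha^2)$, so the sparse vector costs about four times as much as the dense one. With small $\alpha$ the penalty grows like $\|\beta\|_1 \log(1/\alpha^2)$ and the two costs approach each other, although only at a logarithmic rate in $\alpha$.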

4. Generalization Bounds and Spectral Bias

In the kernel regime, generalization error can be bounded as in classical kernel methods, provided the number of samples $n$ exceeds the kernel rank, or under stronger conditions for sparse targets (Woodworth et al., 2020). The bias towards learning high-NTK-eigenvalue components first (the spectral bias) is inherent to the NTK flow and yields exponentially faster fitting along "smooth" modes determined by the architecture and input distribution, not by the target function (Bowman et al., 2022). Because these modes do not depend on the target, the order in which components are learned follows from the kernel and the data, not from the task (Bowman et al., 2022).
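
Spectral bias on the training set follows directly from the fixed-kernel dynamics. With the loss $\tfrac12 \sum_i (f(x_i) - y_i)^2$, kernel gradient flow gives $\dot r_t = -K r_t$ for the residual $r_t = f_t(X) - y$, so the component of $r_t$ along an eigenvector of $K$ with eigenvalue $\lambda_i$ decays as $e^{-\lambda_i t}$. The sketch below assumes a random feature matrix as a stand-in for the NTK Gram matrix and cross-checks the closed form against explicit Euler steps; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 50
Phi = rng.standard_normal((n, p)) / np.sqrt(p)   # stand-in tangent features
K = Phi @ Phi.T                                  # fixed kernel Gram matrix
y = rng.standard_normal(n)

lam, V = np.linalg.eigh(K)                       # eigenvalues in ascending order
coef0 = V.T @ (-y)                               # residual f_0 - y in the eigenbasis, f_0 = 0

def residual_modes(t):
    """Closed-form residual components under kernel gradient flow."""
    return coef0 * np.exp(-lam * t)

def euler_residual_modes(t, steps=5000):
    """Same quantity from explicit Euler integration of dr/dt = -K r."""
    r = -y.copy()
    dt = t / steps
    for _ in range(steps):
        r = r - dt * (K @ r)
    return V.T @ r

t = 3.0
remaining = np.exp(-lam * t)                     # fraction of each mode still unfit at time t
print("fraction left, slowest to fastest mode:", np.round(remaining, 4))
```

The decay rates depend only on the eigenvalues of $K$; the target enters only through the initial coefficients, which is the sense in which the ordering of learning is set by the kernel and the data.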

5. Contrast with the Rich Regime and Beyond-Kernel Phenomena

The "rich" or "active" regime arises when feature dynamics are not negligible; this occurs at small initialization or width (or large depth). In this regime, gradient flow no longer produces the minimum-RKHS-norm solution: networks can approximate implicit optimization over $\ell_1$, nuclear norm, or other non-kernel regularizers, and can outperform the kernel regime on tasks with underlying sparsity or low rank (Woodworth et al., 2020).

Intermediate and beyond-kernel approaches aim to capture these richer dynamics, as in the evolving-kernel and silent alignment frameworks (Atanasov et al., 2021, Shilton et al., 2024). In these settings, networks may first “silently align” the NTK to the data before expanding its overall scale without significant training loss change, eventually converging to kernel regression with the learned, data-dependent final kernel (Atanasov et al., 2021). Comprehensive extensions include matrix-valued kernels and representer theorems capturing perturbations of finite-width, finite-step-size networks beyond the classical NTK limit (Shilton et al., 2024). These models introduce higher-order, cross-term, and kernel-warping corrections essential to faithfully describe finite-size, non-lazy training.
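
One way to monitor whether a kernel is aligning with the data is a kernel-target alignment score, the cosine similarity between the Gram matrix $K$ and the label matrix $y y^\top$. The sketch below uses this common diagnostic as an assumption; whether the cited works use exactly this normalization is not established here, and the function name is hypothetical.

```python
import numpy as np

def kernel_target_alignment(K, y):
    """Cosine similarity <K, y y^T>_F / (||K||_F * ||y y^T||_F)."""
    K = np.asarray(K, dtype=float)
    y = np.asarray(y, dtype=float)
    # <K, y y^T>_F = y^T K y and ||y y^T||_F = ||y||^2.
    return float(y @ K @ y) / (np.linalg.norm(K, "fro") * float(y @ y))

y = np.array([1.0, -1.0, 1.0, -1.0])
aligned = kernel_target_alignment(np.outer(y, y), y)   # kernel proportional to y y^T
isotropic = kernel_target_alignment(np.eye(4), y)      # kernel that ignores the labels
print(aligned, isotropic)
```

Evaluating such a score on the empirical tangent kernel at several points during training would show a kernel that rotates towards the labels as an increase in alignment even while its overall scale, and hence the training loss, has barely changed.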

6. Extensions Across Methodologies and Applications

The kernel regime is a unifying paradigm across diverse statistical and learning models. In self-supervised learning, it underpins closed-form analysis of representation learning via fixed kernels, decorrelating negatives and correlating positives, and enabling bounds on generalization in terms of induced kernels (Kiani et al., 2022). In high-dimensional PCA, the kernel regime leads to phase transitions governed by data and kernel nonlinearity, where detectability and support recovery depend on critical parameters analogous to the BBP threshold, but modulated by sparsity and kernel structure (Feldman et al., 2024). In kernel ridge regression, the regime determines the risk landscape—including multiple descent phenomena—in proportional, quadratic, or polynomial sample-to-dimension scalings, with critical points and scaling laws that explain transitions in test error behavior (Misiakiewicz, 2022, Pandit et al., 2024). The regime also governs phase transitions in kernel density estimation in high-dimensional statistics, delimiting when estimators behave classically or exhibit extreme-value, glassy, or stable-law statistics, and informing bandwidth selection (Biroli et al., 2024).

7. Practical Implications and Limitations

The kernel regime yields powerful, mathematically explicit predictions for optimization, generalization, and phase transitions, but with clear limitations. The regime is exact only in the infinite-width limit with appropriate initialization; finite-width corrections can be significant in practice, and richer feature-learning dynamics—central to deep learning's empirical successes—generally arise outside the kernel regime, especially as parameters deviate substantially from initialization. The analytical tractability of the kernel regime, however, provides a baseline for understanding both its own predictions and the boundary where genuinely non-kernel behaviors emerge (Woodworth et al., 2020, Atanasov et al., 2021, Shilton et al., 2024).