Neural Tangent Kernel Regime
- The Neural Tangent Kernel regime is defined as the infinite-width limit of neural networks where training dynamics linearize and are described by a deterministic kernel.
- In this regime, network training follows kernel gradient descent with exponential convergence and simplified loss landscapes aligned with RKHS theory.
- Despite its theoretical appeal, the NTK framework has limitations as it cannot capture feature learning in finite-width networks, motivating further research.
The Neural Tangent Kernel (NTK) regime refers to the infinite-width limit of artificial neural networks, in which the training dynamics of overparameterized models linearize around their initialization and are exactly described by a deterministic kernel, the NTK. In this regime, neural networks with certain parameterizations and initialization schemes evolve under gradient descent according to kernel gradient flow, enabling the application of kernel methods to analyze neural network learning, generalization, and loss landscape properties. The NTK regime is instrumental in establishing a theoretical correspondence between neural networks and reproducing kernel Hilbert spaces (RKHS), providing insights into convergence guarantees, feature learning limitations, and the relationship between neural and kernel-based approaches.
1. Definition and Mathematical Foundations
The NTK regime is realized by taking the width of each hidden layer in a neural network to infinity (with parameters initialized independently from a normal distribution). In this regime, for a neural network with parameters $\theta$ and output $f(x;\theta)$, the NTK is defined as
$$\Theta_\theta(x, x') \;=\; \nabla_\theta f(x;\theta)^\top \nabla_\theta f(x';\theta).$$
At initialization, as the width $n \to \infty$, $\Theta_\theta$ converges in probability to a deterministic and time-independent kernel $\Theta_\infty$, provided the network is parameterized and initialized appropriately ("NTK parameterization") (Golikov et al., 2022).
In the NTK regime, the network's outputs evolve under gradient flow as
$$\frac{d}{dt} f_t(x) \;=\; -\sum_{i=1}^{N} \Theta_\infty(x, x_i)\,\big(f_t(x_i) - y_i\big),$$
where $(x_i, y_i)_{i=1}^{N}$ denote the training inputs and labels. Thus, training corresponds to kernel ridge regression in the RKHS induced by $\Theta_\infty$.
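As a concrete illustration, the following sketch computes the empirical NTK of a toy two-layer ReLU network in JAX by contracting parameter gradients. The architecture, its NTK parameterization, and the names `init_params`, `mlp`, and `empirical_ntk` are illustrative assumptions, not code from the cited works.

```python
import jax
import jax.numpy as jnp

def init_params(key, d_in=3, width=512):
    # NTK parameterization: weights are standard normal; the 1/sqrt(fan_in)
    # scaling is applied in the forward pass.
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (width, d_in)),
            "W2": jax.random.normal(k2, (1, width))}

def mlp(params, x):
    h = jnp.maximum(params["W1"] @ x / jnp.sqrt(x.shape[0]), 0.0)   # ReLU hidden layer
    return (params["W2"] @ h / jnp.sqrt(h.shape[0]))[0]             # scalar output

def empirical_ntk(params, x1, x2):
    # Theta(x1, x2) = <grad_theta f(x1; theta), grad_theta f(x2; theta)>
    g1 = jax.grad(mlp)(params, x1)
    g2 = jax.grad(mlp)(params, x2)
    return sum(jnp.vdot(a, b) for a, b in zip(jax.tree_util.tree_leaves(g1),
                                              jax.tree_util.tree_leaves(g2)))

key = jax.random.PRNGKey(0)
params = init_params(key)
x1, x2 = jnp.ones(3), jnp.arange(3.0)
print(empirical_ntk(params, x1, x2))
```

As the width grows, repeated draws of `params` give values of `empirical_ntk` that concentrate around the deterministic limit $\Theta_\infty(x_1, x_2)$.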
2. Model Behavior: Linearization and Training Dynamics
The NTK regime linearizes the learning dynamics of wide neural networks. When the widths are infinite and the standard NTK parameterization is used, the network function can be accurately approximated by its first-order Taylor expansion in the parameters around initialization:
$$f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0).$$
The NTK remains constant during training, so the network outputs follow linear kernel regression dynamics. For overparameterized (wide) networks under mild conditions, gradient descent converges exponentially to global minimizers if the Gram matrix $\Theta_\infty(X, X)$ is positive definite, with the convergence rate determined by its smallest eigenvalue $\lambda_{\min}$:
$$\|f_t(X) - y\|_2^2 \;\le\; e^{-2\lambda_{\min} t}\,\|f_0(X) - y\|_2^2$$
(Golikov et al., 2022, Nitta, 2018). This linearization is sometimes described as a "lazy training" regime.
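To make the exponential-convergence statement concrete, here is a minimal sketch of the closed-form evolution of the outputs on the training set, assuming a fixed NTK Gram matrix `K`, squared loss, and gradient flow; the function name and arguments are illustrative.

```python
import jax.numpy as jnp

def ntk_flow_on_train_set(K, f0, y, t):
    # Residuals r_t = f_t - y obey dr/dt = -K r under kernel gradient flow,
    # so r_t = exp(-t K) r_0. Diagonalize the symmetric PSD Gram matrix K
    # and decay each eigen-mode; the slowest mode decays at rate lambda_min(K).
    lam, V = jnp.linalg.eigh(K)
    r0 = V.T @ (f0 - y)
    return y + V @ (jnp.exp(-t * lam) * r0)
```

In this form, the squared residual norm is bounded by $e^{-2\lambda_{\min} t}\,\|f_0(X) - y\|_2^2$, matching the rate above.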
3. Loss Landscape and Optimization Properties
In the NTK regime, the non-convex loss surfaces of deep neural networks become benign, with strong theoretical guarantees on the absence of spurious local minima for specific architectures and activations. For deep ReLU nets, activation patterns become independent of weights and inputs, and every local minimum is a global minimum; all other critical points are saddles (Nitta, 2018):
- Activation probability for any path through $L$ ReLU layers: $(1/2)^L$, independent of the particular weights and inputs (a numerical check appears at the end of this section)
- Independence of activation from weights and data at infinite width
- Corollary: every local minimum is global; non-global critical points are saddles with negative Hessian eigenvalues
This favorable optimization geometry is a direct result of the parameter dynamics staying close to their initial values in the infinite-width limit.
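The path-activation claim in the first bullet can be checked numerically. The sketch below is a toy Monte Carlo experiment under assumed Gaussian initialization (all names and sizes are illustrative): it estimates the probability that a fixed path through a deep ReLU network is active and compares it with $(1/2)^L$.

```python
import jax
import jax.numpy as jnp

def path_active(key, depth=4, width=64, d=8):
    # Follow the path through unit 0 of each hidden layer and record whether
    # every preactivation along the path is positive.
    keys = jax.random.split(key, depth + 1)
    h = jax.random.normal(keys[0], (d,))                 # random input
    active = True
    for l in range(depth):
        W = jax.random.normal(keys[l + 1], (width, h.shape[0])) / jnp.sqrt(h.shape[0])
        pre = W @ h
        active = active & (pre[0] > 0)
        h = jnp.maximum(pre, 0.0)
    return active

keys = jax.random.split(jax.random.PRNGKey(1), 1000)
freq = jnp.mean(jax.vmap(path_active)(keys).astype(jnp.float32))
print(freq, 0.5 ** 4)                                    # empirical frequency vs. (1/2)^L
```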
4. Generalization, Convergence Rates, and RKHS Theory
NTK theory connects neural network training to classical nonparametric statistics and kernel methods. In supervised learning, the network output converges to the kernel regression predictor in the NTK-induced RKHS. Excess risk analysis shows that for overparameterized two-layer networks (or neural operators), stochastic gradient descent (SGD) achieves the minimax-optimal rate $O\big(n^{-\frac{2r\beta}{2r\beta+1}}\big)$ when the eigenvalues of the kernel integral operator decay as $\mu_i \asymp i^{-\beta}$ and the target function satisfies a source condition of regularity $r$ (Nitanda et al., 2020, Nguyen et al., 23 Dec 2024). These results rigorously link the fast convergence and generalization of (overparameterized) wide neural nets in the NTK regime to well-understood kernel regression theory.
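A minimal sketch of how the NTK-induced kernel regression predictor is used in practice, assuming precomputed Gram matrices (e.g., from the empirical-NTK sketch in Section 1); the ridge term and function name are illustrative.

```python
import jax.numpy as jnp

def ntk_ridge_predict(K_train, K_test_train, y_train, ridge=1e-6):
    # Kernel ridge regression with the NTK Gram matrix:
    #   f(x_test) = Theta(x_test, X) (Theta(X, X) + ridge * I)^{-1} y
    # A small ridge stabilizes the solve; ridge -> 0 recovers the interpolating NTK predictor.
    n = K_train.shape[0]
    alpha = jnp.linalg.solve(K_train + ridge * jnp.eye(n), y_train)
    return K_test_train @ alpha
```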
In recent extensions, the NTK concept is generalized to operator-valued settings, enabling minimax-optimal learning of solution operators to PDEs using neural operators (Nguyen et al., 23 Dec 2024). For neural operators, the NTK is operator-valued, and the excess error rates as a function of the number of gradient steps and the network width mirror those of the standard kernel regime.
5. Role of Initialization, Activation, and Architecture
The asymptotic properties of the NTK—its expressivity, rate of degeneration with depth, and test error—are highly sensitive to the initialization hyperparameters and activation function. For example (Hayou et al., 2019):
- At the Edge of Chaos (EOC), NTK degeneration with depth is polynomial in the depth $L$ rather than exponential, as it is in the “Ordered” or “Chaotic” phases.
- For deep ReLU or tanh nets, the limiting NTK becomes trivial (rank one) as depth $L \to \infty$, with expressivity controlled by the initialization hyperparameters.
For ResNets, the NTK degenerates more slowly with depth, suggesting that architectural choices can preserve useful “signal propagation” and NTK expressivity.
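The depth behaviour described above can be illustrated with the standard analytic NTK recursion for a fully connected ReLU network. The sketch below assumes the usual bias-free, unit-variance setup and an illustrative function name; the normalized NTK between two distinct inputs approaches 1 as the depth grows, which is the rank-one degeneration noted above.

```python
import jax.numpy as jnp

def normalized_relu_ntk(x1, x2, depth):
    # Analytic infinite-width NTK recursion for a bias-free ReLU MLP.
    # s** track the NNGP kernel Sigma, t** track the NTK Theta.
    s11, s22, s12 = jnp.dot(x1, x1), jnp.dot(x2, x2), jnp.dot(x1, x2)
    t11, t22, t12 = s11, s22, s12
    for _ in range(depth):
        rho = jnp.clip(s12 / jnp.sqrt(s11 * s22), -1.0, 1.0)
        theta = jnp.arccos(rho)
        # Arc-cosine updates for ReLU (with the factor 2 that preserves the diagonal):
        s12_new = jnp.sqrt(s11 * s22) / jnp.pi * (jnp.sin(theta) + (jnp.pi - theta) * jnp.cos(theta))
        sdot = (jnp.pi - theta) / jnp.pi          # derivative kernel Sigma-dot
        t12 = t12 * sdot + s12_new                # NTK recursion: Theta <- Theta * Sigma_dot + Sigma
        t11, t22 = t11 + s11, t22 + s22           # diagonal: Sigma(x,x) preserved, Sigma_dot(x,x) = 1
        s12 = s12_new
    return t12 / jnp.sqrt(t11 * t22)              # -> 1 as depth grows (rank-one limit)

x1 = jnp.array([1.0, 0.0])
x2 = jnp.array([0.0, 1.0])
for L in (1, 5, 20, 100):
    print(L, normalized_relu_ntk(x1, x2, L))
```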
6. Finite-Width Networks, Feature Learning, and Corrections Beyond the NTK Regime
While the NTK regime is exact in the infinite-width limit, for finite-width (practical) networks, the NTK is not constant during training. Corrections of order $1/n$ and higher arise, described by a hierarchy of higher-order tensors and differential equations (Huang et al., 2019, Golikov et al., 2022, Hanin et al., 2019):
- Finite-width effects cause the NTK to evolve, enabling weak feature learning and partially explaining the observed performance gap between finite-width networks and strict NTK-based kernel regression.
- In deep and wide networks where both the depth $L$ and the width $n$ go to infinity but the ratio $L/n$ is non-negligible, the NTK remains random (with fluctuations controlled by $L/n$), and early feature learning becomes possible (Hanin et al., 2019).
The NTK expansion for finite-width networks includes explicit label-dependent corrections, and these corrections can in principle improve generalization over the static (“fixed-feature”) NTK regime.
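As a sanity check of the finite-width picture, the following toy sketch trains a small MLP for a few steps and measures the relative change of its empirical NTK Gram matrix; data, loss, step size, and all names are illustrative assumptions. Repeating it at larger widths shows the drift shrinking, consistent with the $1/n$ corrections described above.

```python
import jax
import jax.numpy as jnp

def init(key, d=4, width=64):
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (width, d)),
            "W2": jax.random.normal(k2, (width,))}

def f(params, x):  # scalar output, NTK parameterization
    h = jnp.maximum(params["W1"] @ x / jnp.sqrt(x.shape[0]), 0.0)
    return params["W2"] @ h / jnp.sqrt(h.shape[0])

def ntk_matrix(params, X):
    # Empirical NTK Gram matrix from per-example parameter gradients.
    J = jax.vmap(lambda x: jax.grad(f)(params, x))(X)
    flat = jnp.concatenate([jnp.reshape(l, (X.shape[0], -1))
                            for l in jax.tree_util.tree_leaves(J)], axis=1)
    return flat @ flat.T

def loss(params, X, y):
    preds = jax.vmap(lambda x: f(params, x))(X)
    return jnp.mean((preds - y) ** 2)

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (16, 4))
y = jnp.sin(X[:, 0])
params = init(key)
K0 = ntk_matrix(params, X)
for _ in range(50):                                    # a few full-batch GD steps
    grads = jax.grad(loss)(params, X, y)
    params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
K1 = ntk_matrix(params, X)
print(jnp.linalg.norm(K1 - K0) / jnp.linalg.norm(K0))  # relative kernel drift, shrinks with width
```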
7. Methodological Extensions and Applications
The NTK regime has been extended to a wide variety of architectures and applications:
- Recurrent networks: the Recurrent NTK (RNTK) treats infinite-width RNNs, incorporating time dependence and pooling (Alemohammad et al., 2020).
- Graph neural networks: NTK informs the design of graph shift operators via alignment analysis, optimizing learning speed and generalization (Khalafi et al., 2023).
- Quantum-classical hybrid models: the NTK regime enables analytically tractable quantum kernel design and leverages both quantum feature extraction and efficient classical training (Nakaji et al., 2021).
- Neural operators: operator-valued NTKs characterize generalization for PDE solution operators, with minimax-optimal convergence under early stopping (Nguyen et al., 23 Dec 2024).
In practice, NTK-based predictors are applied as kernel ridge regression methods on small to moderate datasets, and are used in neural architecture search (via the condition number of the Gram matrix), matrix completion, image inpainting, and federated learning (Golikov et al., 2022).
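For the architecture-search use mentioned above, a minimal sketch of the Gram-matrix condition-number score follows, assuming `K` is an empirical NTK Gram matrix computed as in the earlier sketches; the function name is illustrative.

```python
import jax.numpy as jnp

def ntk_condition_number(K):
    # Condition number lambda_max / lambda_min of the (symmetric PSD) NTK Gram matrix;
    # larger values indicate slower, less uniform convergence of kernel gradient descent.
    lam = jnp.linalg.eigvalsh(K)                 # eigenvalues in ascending order
    return lam[-1] / jnp.maximum(lam[0], 1e-12)
```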
8. Limitations and Open Problems
The NTK regime is subject to significant limitations:
- The convergence of the NTK to a constant kernel depends on the network being parameterized and initialized in a specific way, which differs from many “practical” implementations.
- Infinite-width NTK models cannot capture feature learning, which real neural networks perform when their weights deviate significantly from initialization (Seleznova et al., 2020).
- Computational costs scale poorly with dataset size, limiting direct application to large problems.
- For finite-width networks (especially those with large depth-to-width ratios or “chaotic” initialization), NTK theory does not reliably predict training or generalization dynamics (Seleznova et al., 2020).
- Generalizing NTK theory to realistic settings including convolutional networks, data-dependent initialization, and large deviations from random initialization remains a challenge.
9. Representative Mathematical Expressions
| NTK Definition | Linearized Output Evolution | Excess Risk Rate |
|---|---|---|
| $\Theta_\theta(x, x') = \nabla_\theta f(x;\theta)^\top \nabla_\theta f(x';\theta)$ | $\frac{d}{dt} f_t(x) = -\sum_i \Theta_\infty(x, x_i)\,(f_t(x_i) - y_i)$ | $O\big(n^{-\frac{2r\beta}{2r\beta+1}}\big)$ |
These formulas summarize the NTK definition, the kernel gradient-descent evolution of the linearized network, and the statistical rate for the excess risk of NTK-based learning.
The NTK regime establishes a foundational theoretical correspondence between infinitely wide neural networks and kernel methods, enabling the transfer of statistical learning theory to the analysis of deep learning. It provides a tractable framework for understanding loss surfaces, convergence, and generalization; at the same time, its limitations—in particular, its inability to capture feature learning and its dependence on idealized infinite-width assumptions—highlight the necessity of more comprehensive theories for finite-width, practical neural networks.