
Neural Tangent Kernel (NTK)

Updated 30 June 2025

The Neural Tangent Kernel (NTK) is a mathematical construct that describes the functional evolution of artificial neural networks (ANNs) during training with gradient descent, particularly in the regime where all hidden layer widths tend to infinity. Originally formalized by Jacot, Gabriel, and Hongler in 2018, the NTK provides a precise kernel-based perspective on both the convergence properties and generalization behavior of wide neural networks. In this framework, the NTK acts as a bridge between neural networks and classical kernel methods, allowing for rigorous analysis in function space rather than parameter space.

1. Definition and Mathematical Formulation

The NTK is defined for a parameterized neural network function $f_\theta: \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ with parameter vector $\theta \in \mathbb{R}^P$. The NTK at parameter configuration $\theta$ is given by

$$\Theta^{(L)}(\theta) = \sum_{p=1}^P \partial_{\theta_p} F^{(L)}(\theta) \otimes \partial_{\theta_p} F^{(L)}(\theta),$$

where $F^{(L)}$ maps $\theta$ to the function $f_\theta$, and the sum ranges over all network parameters. The kernel $\Theta^{(L)}(\theta)$ captures how infinitesimal parameter changes affect network outputs, including the cross-couplings of such effects across different inputs.

During training by gradient descent, the network’s output evolution can be described as

$$\frac{d}{dt} f_{\theta(t)} = -\nabla_{\Theta^{(L)}(\theta(t))} C \big|_{f_{\theta(t)}},$$

where $C$ is the loss functional and $\nabla_K C|_f$ denotes the kernel gradient of $C$ with respect to the kernel $K$, evaluated at the function $f$. This equation formalizes how, in function space, gradient descent takes the form of a kernel gradient flow.
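To make the definition concrete, the following sketch computes the empirical NTK of a single-hidden-layer ReLU network at initialization, directly as the sum over parameters of outer products of output gradients. It is a minimal illustration only: the architecture, the $1/\sqrt{\text{width}}$ scaling, and the function names are assumptions chosen for the example, not prescribed by the general definition.

```python
import numpy as np

def empirical_ntk(X, n_hidden=512, seed=0):
    """Empirical NTK Theta(theta) of a one-hidden-layer ReLU network at random
    initialization, computed as the sum over all parameters of outer products
    of output gradients (NTK parameterization with 1/sqrt(width) scaling)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.standard_normal((n_hidden, d))
    w2 = rng.standard_normal(n_hidden)

    pre = X @ W1.T / np.sqrt(d)               # pre-activations, shape (n, n_hidden)
    act = np.maximum(pre, 0.0)                # ReLU activations
    act_dot = (pre > 0).astype(float)         # ReLU derivative

    # d f(x) / d w2[j]    = act_j(x) / sqrt(n_hidden)
    grad_w2 = act / np.sqrt(n_hidden)                                     # (n, n_hidden)
    # d f(x) / d W1[j, k] = w2[j] * sigma'(pre_j(x)) * x[k] / sqrt(d * n_hidden)
    grad_W1 = (w2 * act_dot)[:, :, None] * X[:, None, :] / np.sqrt(d * n_hidden)

    # Theta(x, x') = sum_p d_p f(x) d_p f(x')
    return grad_w2 @ grad_w2.T + np.einsum('ijk,ljk->il', grad_W1, grad_W1)

X = np.random.randn(6, 3)
print(empirical_ntk(X))
```

As discussed in Section 6, for widths in the hundreds to thousands the matrix produced by such a computation is already close to the deterministic infinite-width kernel described next.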

2. Behavior in the Infinite-Width Limit

A central result is that as all hidden layer widths $n_1, \ldots, n_{L-1} \to \infty$, the NTK $\Theta^{(L)}(\theta)$ converges in probability to a deterministic, constant kernel $\Theta^{(L)}_\infty$:
$$\Theta^{(L)}(\theta) \xrightarrow[\;n_\ell \to \infty\;]{} \Theta^{(L)}_\infty \otimes I_{n_L}.$$
The limiting kernel is given recursively as

$$\begin{aligned} \Theta_\infty^{(1)}(x, x') &= \Sigma^{(1)}(x, x'), \\ \Theta_\infty^{(L+1)}(x, x') &= \Theta_\infty^{(L)}(x, x') \, \dot{\Sigma}^{(L+1)}(x, x') + \Sigma^{(L+1)}(x, x'), \end{aligned}$$

where $\Sigma^{(1)}(x, x') = \frac{1}{n_0} x^\top x' + \beta^2$, and

$$\dot{\Sigma}^{(L+1)}(x, x') = \mathbb{E}_{f \sim \mathcal{N}(0, \Sigma^{(L)})}\left[\dot{\sigma}(f(x))\, \dot{\sigma}(f(x'))\right].$$

Here, $\sigma$ is the network nonlinearity, $\dot{\sigma}$ its derivative, and $\beta$ scales the bias terms.
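For the ReLU nonlinearity these Gaussian expectations have closed forms (the arc-cosine kernel formulas), which makes the recursion easy to evaluate numerically. The sketch below assumes that choice; the function names, the bias scale, and the normalization conventions are illustrative and vary between references.

```python
import numpy as np

def relu_expectations(K):
    """Closed-form Gaussian expectations for the ReLU nonlinearity (arc-cosine
    kernel formulas). K is an n x n covariance matrix over the inputs."""
    scale = np.sqrt(np.diag(K))
    norm = np.outer(scale, scale)
    cos_theta = np.clip(K / norm, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # E[sigma(u) sigma(v)] and E[sigma'(u) sigma'(v)] for (u, v) ~ N(0, K)
    sigma = (norm / (2 * np.pi)) * (np.sin(theta) + (np.pi - theta) * cos_theta)
    sigma_dot = (np.pi - theta) / (2 * np.pi)
    return sigma, sigma_dot

def ntk(X, depth, beta=0.1):
    """Limiting NTK Theta_inf^(depth) on the rows of X via the layer-wise recursion."""
    n0 = X.shape[1]
    Sigma = X @ X.T / n0 + beta**2            # Sigma^(1)
    Theta = Sigma.copy()                      # Theta^(1) = Sigma^(1)
    for _ in range(depth - 1):
        Sigma_next, Sigma_dot = relu_expectations(Sigma)
        Sigma_next = Sigma_next + beta**2     # Sigma^(L+1)
        Theta = Theta * Sigma_dot + Sigma_next   # Theta^(L+1) recursion
        Sigma = Sigma_next
    return Theta

X = np.random.randn(5, 10)
print(ntk(X, depth=3))
```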

In this limit:

  • The network output at initialization is a realization from a Gaussian process.
  • The NTK $\Theta^{(L)}_\infty$ remains constant throughout training; it does not change as network parameters are updated by gradient descent.
  • Training with gradient descent becomes equivalent to performing kernel (ridge) regression in function space, using the NTK as the kernel.

3. Relationship to Gaussian Processes and Kernel Methods

At random initialization, an infinite-width neural network’s output function is a Gaussian process with covariance determined by the kernel $\Sigma^{(L)}$. As training proceeds in the infinite-width limit, the evolution of the output function is governed by kernel gradient descent with respect to the constant NTK $\Theta^{(L)}_\infty$. For the square loss,

$$\frac{d}{dt} f_t = \Phi_{K}\!\left(\langle f^* - f_t, \cdot \rangle_{p^{\mathrm{in}}}\right),$$

where $\Phi_{K}$ denotes the map from dual elements to functions induced by the kernel $K$, and $p^{\mathrm{in}}$ is the input distribution. The solution follows a linear differential equation, and convergence is governed by the eigendecomposition of the NTK Gram matrix: convergence is fast along directions of large kernel eigenvalues and slow along directions of small eigenvalues. This directly mirrors the properties of classical kernel regression and Gaussian process inference.
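To make the linear dynamics explicit, one can restrict attention to the $N$ training inputs and take $C$ to be the square loss; the $1/N$ factor below comes from the empirical input measure, and normalization conventions vary between presentations, so this is a sketch rather than the unique form. The flow then reduces to the finite linear system
$$\frac{d}{dt} f_t(X) = -\frac{1}{N}\, K \left( f_t(X) - y \right), \qquad K = \Theta^{(L)}_\infty(X, X), \quad y = f^*(X),$$
whose solution is $f_t(X) = y + e^{-tK/N}\left(f_0(X) - y\right)$. Diagonalizing $K$ decouples this into independent exponential decays, one per eigendirection, which is exactly the spectral picture developed in Section 5.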

4. Convergence, Positive-Definiteness, and Generalization

A crucial requirement for the well-posedness of kernel gradient descent (and thus for guaranteed convergence in function space) is the positive-definiteness of the limiting NTK $\Theta^{(L)}_\infty$. The paper proves that for data supported on the unit sphere and for any non-polynomial Lipschitz nonlinearity $\sigma$, the NTK is positive-definite for deep ($L \geq 2$) fully connected networks. Positive-definiteness ensures that (a small numerical check follows the list below):

  • The Gram matrix on training data is invertible.
  • Gradient descent in function space converges for convex loss functionals.
  • The least-squares regression solution exactly matches kernel regression with $\Theta^{(L)}_\infty$.
  • Generalization properties derive from properties of the NTK as a kernel.
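As a small numerical check of these consequences, the sketch below evaluates a depth-2 ReLU NTK on unit-sphere data (inlining one step of the recursion from Section 2), verifies that its smallest eigenvalue is positive, and solves the corresponding kernel regression problem. The data, targets, and bias scale are placeholders chosen for illustration.

```python
import numpy as np

# Depth-2 ReLU NTK on unit-sphere data, then a positive-definiteness check
# and the kernel regression solution on the resulting Gram matrix.
rng = np.random.default_rng(0)
n, d, beta = 20, 5, 0.1
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # place the data on the unit sphere
y = rng.standard_normal(n)                            # placeholder training targets

Sigma1 = X @ X.T / d + beta**2                        # Sigma^(1) = Theta^(1)
scale = np.sqrt(np.diag(Sigma1))
norm = np.outer(scale, scale)
cos = np.clip(Sigma1 / norm, -1.0, 1.0)
theta = np.arccos(cos)
Sigma2 = norm * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi) + beta**2
Sigma2_dot = (np.pi - theta) / (2 * np.pi)
Theta2 = Sigma1 * Sigma2_dot + Sigma2                 # Theta^(2) = Theta^(1) * Sigma_dot^(2) + Sigma^(2)

print("smallest eigenvalue:", np.linalg.eigvalsh(Theta2).min())    # > 0: Gram matrix is invertible
alpha = np.linalg.solve(Theta2, y)                    # kernel regression coefficients
print("max training residual:", np.abs(Theta2 @ alpha - y).max())  # ~ 0: targets are interpolated
```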

5. Spectral Perspective and Early Stopping

The spectrum of the limiting NTK operator $\Pi$ (the integral operator associated with $\Theta^{(L)}_\infty$ and the input distribution) underlies the convergence dynamics in function space. If $\Pi$ has eigenfunctions $f^{(i)}$ with eigenvalues $\lambda_i$, then during training
$$f_t = f^* + \sum_{i} e^{-t\lambda_i} \Delta_f^i,$$
where the $\Delta_f^i$ are the projections of the initial error onto the eigencomponents. This spectral perspective provides a theoretical basis for early stopping as a regularization strategy: convergence is rapid in directions corresponding to kernel principal components with large eigenvalues (low-complexity or “signal” components), while convergence is slow in directions associated with small eigenvalues (typically “noisy” or high-frequency components). Early stopping naturally biases the learning process toward low-noise features.
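A short sketch of this filtering effect, using the training-set restriction from Section 3: each residual component decays as $e^{-t\lambda_i}$. A random positive semi-definite matrix stands in for the NTK Gram matrix so the snippet runs on its own, and the $1/N$ factor is dropped for simplicity.

```python
import numpy as np

# Spectral view of kernel gradient flow on the training set: the residual along
# each eigendirection of the (stand-in) NTK Gram matrix decays as exp(-t * lambda_i).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
K = A @ A.T / 50                                   # stand-in for Theta_inf(X, X)
lam, U = np.linalg.eigh(K)                         # eigenvalues (ascending) and eigenvectors

y = rng.standard_normal(50)                        # training targets f*(X)
f0 = np.zeros(50)                                  # network output at initialization (simplified)

def f_t(t):
    """f_t(X) = y + sum_i exp(-t * lam_i) <f0 - y, u_i> u_i (solution of the linear flow)."""
    return y + U @ (np.exp(-t * lam) * (U.T @ (f0 - y)))

for t in [0.0, 1.0, 10.0, 1000.0]:
    proj = U.T @ (f_t(t) - y)                      # residual coordinates in the eigenbasis
    print(f"t={t:7.1f}  residual energy: top-5 eigendirections {np.sum(proj[-5:]**2):8.4f}, "
          f"bottom-5 {np.sum(proj[:5]**2):8.4f}")
```

Stopping at a moderate $t$ therefore removes most of the error along the large-eigenvalue directions while barely fitting the small-eigenvalue ones, which is the regularization effect described above.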

6. Numerical Observations and Empirical Regime

Empirical analysis confirms that even for moderate network widths (hundreds to thousands of units per layer), the observed NTK at initialization closely matches the infinite-width Θ(L)\Theta^{(L)}_\infty. During training, for sufficiently wide networks, the NTK remains nearly constant, validating the infinite-width theory. Furthermore, function outputs at convergence are distributed according to predictions from the kernel regression model. Deviations (such as small “inflations” of the NTK during training) decrease as the width increases. In both artificial and real data experiments, such as on points sampled from a circle or on MNIST, observed convergence and generalization behaviors closely track the theoretical predictions based on spectral analysis of the NTK.

7. Significance and Theoretical Implications

The NTK framework recasts the training of wide neural networks as a linear kernel regression problem in function space, with the kernel structure entirely determined by the architecture and nonlinearity. This provides a unified account of the random function behavior at initialization and the deterministic learning trajectory during training. The framework explains why highly overparameterized (wide) networks are reliably trainable: the function-space loss landscape becomes convex under the NTK’s induced metric. Early stopping’s effectiveness is justified via the NTK’s spectral decomposition, providing a natural explanation for regularization phenomena observed in deep learning.

The significance of this framework is further enhanced by rigorous proofs of NTK convergence, positive-definiteness, and matching of regression solutions. The approach has substantial implications for theoretical analyses of generalization and for guiding practical choices in architecture and training of deep, wide neural networks.


Table: Key NTK Formulas

| Quantity | Formula |
|---|---|
| NTK at $\theta$ | $\Theta^{(L)}(\theta) = \sum_p \partial_{\theta_p} F^{(L)}(\theta) \otimes \partial_{\theta_p} F^{(L)}(\theta)$ |
| Kernel gradient of loss | $\nabla_K C\vert_{f_0}(x) = \frac{1}{N}\sum_{j=1}^N K(x, x_j)\, d\vert_{f_0}(x_j)$ |
| Differential equation (least squares) | $\frac{d}{dt} f_t = \Phi_K\!\left(\langle f^* - f_t, \cdot \rangle_{p^{\mathrm{in}}}\right)$ |
| NTK recursion (infinite width) | $\Theta_\infty^{(L+1)}(x, x') = \Theta_\infty^{(L)}(x, x')\, \dot{\Sigma}^{(L+1)}(x, x') + \Sigma^{(L+1)}(x, x')$ |
| $\dot{\Sigma}$ definition | $\dot{\Sigma}^{(L+1)}(x, x') = \mathbb{E}_{f \sim \mathcal{N}(0, \Sigma^{(L)})}[\dot{\sigma}(f(x))\, \dot{\sigma}(f(x'))]$ |

The neural tangent kernel formalism has become a foundational tool in the theoretical analysis of deep learning, providing actionable insight into the convergence and generalization properties of wide neural networks and revealing deep connections between modern deep learning and classical kernel methods.