
Neural Tangent Kernel (NTK) Regime

Updated 29 January 2026
  • The NTK regime is a framework that describes infinitely wide neural networks through linearized dynamics and a static kernel approximation.
  • It establishes an explicit correspondence with kernel regression, elucidating both optimization and generalization characteristics.
  • Insights from the NTK regime guide practical strategies in network initialization, learning rate scaling, and understanding finite-width effects.

The Neural Tangent Kernel (NTK) Regime

The Neural Tangent Kernel (NTK) regime provides a precise analytic framework for understanding the functional dynamics of neural networks in the infinite-width limit. In this regime, training a sufficiently wide neural network with appropriate scaling of initialization and learning rate causes the network's output to evolve linearly in a feature space defined by the so-called neural tangent feature map. The NTK governs both optimization and generalization properties of the resulting model and enables an explicit correspondence with kernel regression. Crucially, NTK theory reveals when deep learning reduces effectively to a classical kernel method and, conversely, clarifies the circumstances under which this reduction fails due to finite width or distributional nonstationarity (Liu et al., 21 Jul 2025, Seleznova et al., 2020).

1. Formal Definition and Theoretical Foundations

Given a parametric neural network $f(x;\theta)$ with parameters $\theta \in \mathbb{R}^P$ and input $x \in \mathcal{X}$, the Neural Tangent Kernel is defined as

$$\Theta(x,x') = \nabla_\theta f(x;\theta_0)^T \nabla_\theta f(x';\theta_0),$$

where $\theta_0$ denotes the initial parameters. In the limit where all hidden-layer widths tend to infinity, under the so-called NTK parameterization (weight variance $\sim 1/\text{width}$) and with sufficiently small learning rate $\eta \propto 1/\text{width}$, the linearization $$f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^T (\theta - \theta_0)$$ holds, and the NTK $\Theta(x, x')$ remains effectively static during training (Liu et al., 21 Jul 2025, Mysore et al., 9 Dec 2025).
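For concreteness, the definition above can be computed directly for a small network. The following is an illustrative sketch (not code from the cited papers): a one-hidden-layer ReLU network under NTK parameterization, with the empirical NTK formed as an inner product of parameter gradients; all sizes and variable names are arbitrary choices.

```python
import numpy as np

# Empirical NTK of f(x) = (1/sqrt(m)) * a^T relu(w * x) for scalar inputs,
# under NTK parameterization (unit-variance weights, 1/sqrt(m) output scaling).
rng = np.random.default_rng(0)
m = 4096                        # hidden width
w = rng.normal(size=m)          # first-layer weights, unit variance
a = rng.normal(size=m)          # second-layer weights, unit variance

def grad_f(x):
    """Gradient of f(x; theta) w.r.t. all parameters (w, a) at initialization."""
    pre = w * x
    act = np.maximum(pre, 0.0)           # relu(w_j x)
    d_act = (pre > 0).astype(float)      # relu'(w_j x)
    g_w = a * d_act * x / np.sqrt(m)     # df/dw_j
    g_a = act / np.sqrt(m)               # df/da_j
    return np.concatenate([g_w, g_a])

def ntk(x, x_prime):
    """Theta(x, x') = grad f(x)^T grad f(x')."""
    return grad_f(x) @ grad_f(x_prime)

xs = np.array([-1.0, 0.5, 2.0])
Theta = np.array([[ntk(xi, xj) for xj in xs] for xi in xs])
# Theta is a symmetric positive semidefinite Gram matrix by construction.
```

Because $\Theta$ is a Gram matrix of gradient features, symmetry and positive semidefiniteness hold automatically; at large width, repeated draws of the initialization yield nearly identical kernels.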

This regime, often referred to as "lazy training" or the "kernel regime," thus reduces the training dynamics to those of kernel gradient descent: $$\frac{d}{dt} f(x, t) = -\frac{1}{n} \sum_{i=1}^n \Theta(x, x_i)\,(f(x_i, t) - y_i),$$ where the $y_i$ are target values. The solution is expressible in closed form via the kernel Gram matrix on the training data, establishing equivalence to kernel ridge regression in the NTK reproducing kernel Hilbert space (RKHS) (Liu et al., 21 Jul 2025, Mysore et al., 9 Dec 2025, Seleznova et al., 2020, Nitanda et al., 2020).
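The kernel gradient-descent dynamics can be simulated directly once a Gram matrix is fixed. A minimal sketch, using a stand-in RBF kernel in place of an actual NTK (the dynamics depend only on the Gram matrix, not on where it came from):

```python
import numpy as np

# Discretized kernel gradient descent on a toy 1-D regression problem.
n = 20
X = np.linspace(-1.0, 1.0, n)
y = np.sin(np.pi * X)                          # toy regression targets

def k(x, xp):
    return np.exp(-0.5 * (x - xp) ** 2 / 0.1)  # stand-in kernel (illustrative)

K = k(X[:, None], X[None, :])                  # Gram matrix on training data
f = np.zeros(n)                                # f(x_i, 0) = 0
eta = n / np.linalg.eigvalsh(K).max()          # eta * lambda_max / n = 1: stable
for _ in range(2000):                          # Euler steps of the kernel ODE
    f = f - eta / n * K @ (f - y)              # df/dt = -(1/n) K (f - y)
```

Because the update is linear, the iterate has the closed form $f_T = (I - (I - \eta K/n)^T)\,y$, converging mode by mode at rates set by the eigenvalues of $K$.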

2. Characterization of the NTK Regime

The NTK regime is characterized by invariance of the kernel throughout training and minimal movement of the parameters away from initialization. This necessitates:

  • Infinite width (or large finite width): Ensures deterministic initialization and constancy of the NTK.
  • Appropriate learning rate scaling: $\eta \sim 1/\text{width}$ to prevent large parameter updates.
  • Single, stationary data distribution: Distributional invariance; dynamic or nonstationary settings destroy the static-kernel approximation.
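As an illustration of the parameterization requirement, the sketch below implements NTK parameterization (unit-variance weights with an explicit $1/\sqrt{\text{fan-in}}$ rescaling), so activations stay $O(1)$ as the hidden width grows; the architecture and sizes are arbitrary choices.

```python
import numpy as np

# Forward pass under NTK parameterization: N(0,1) weight entries, with each
# layer's pre-activation rescaled by 1/sqrt(fan_in).
rng = np.random.default_rng(2)

def ntk_forward(x, widths):
    h = x
    for i, (fan_in, fan_out) in enumerate(zip(widths[:-1], widths[1:])):
        W = rng.normal(size=(fan_out, fan_in))   # unit-variance entries
        h = W @ h / np.sqrt(fan_in)              # explicit 1/sqrt(width) scaling
        if i < len(widths) - 2:                  # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h

x = rng.normal(size=8)
outputs = {m: ntk_forward(x, [8, m, 1])[0] for m in (64, 1024, 16384)}
# the output magnitude stays O(1) regardless of the hidden width m
```

Under standard (e.g., Kaiming) parameterization the same scaling is absorbed into the weight variance instead; the two coincide at initialization but produce different training dynamics.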

Empirical and mean-field analyses reveal that NTK theory only applies (i) when gradients do not explode (ordered phase) and (ii) for network depths $L$ much smaller than widths $M$ in the chaotic phase (at the so-called edge of chaos) (Seleznova et al., 2020, Hayou et al., 2019). In practical finite-width settings, or under distribution shifts, the NTK can change considerably during training, entering the feature learning regime (Liu et al., 21 Jul 2025, Huang et al., 2019).

3. Dynamics Beyond the Static-Kernel Limit

In practical networks of finite width, especially under standard (e.g., Kaiming) initialization or larger learning rates, the NTK evolves appreciably during training. Performance improvements over infinite-width NTK kernel regression predictors are routinely attributed to this kernel evolution (Huang et al., 2019).

The process of NTK evolution is captured by the so-called Neural Tangent Hierarchy (NTH), a system of coupled ODEs for the outputs $f_\alpha(t)$ and higher-order "kernels" $\Theta_t^{(r)}$: $$\begin{cases} \dot{f}_\alpha(t) = -\dfrac{1}{n} \sum_{\beta} \Theta^{(2)}_t(x_\alpha, x_\beta)\,(f_\beta(t) - y_\beta), \\[1ex] \dot{\Theta}_t^{(r)}(x_{\alpha_1}, \ldots, x_{\alpha_r}) = -\dfrac{1}{n} \sum_{\beta} \Theta^{(r+1)}_t(x_{\alpha_1}, \ldots, x_{\alpha_r}, x_\beta)\,(f_\beta(t) - y_\beta), \end{cases}$$ where the truncation order $p$ controls the tradeoff between approximation error, which decays as $O(m^{-p/2})$ for width $m$, and computational tractability (Huang et al., 2019). Allowing for NTK evolution enables "feature learning" and accounts for the empirically observed generalization gap between infinite-width and large-but-finite-width neural networks.

4. Experimental Probes and Scaling Laws

The constancy of the NTK in the lazy regime and its departure beyond that regime is quantifiable via several spectral and alignment metrics:

  • Kernel spectral norm: $\lambda_{\max}(\Theta_t)$, measuring maximal mode strength.
  • CKA kernel distance: $S(\Theta_t, \Theta_0) = 1 - \mathrm{CKA}(\Theta_t, \Theta_0) = 1 - \langle\Theta_t, \Theta_0\rangle_F / (\|\Theta_t\|_F \|\Theta_0\|_F)$.
  • Normalized Frobenius difference: $\Delta\Theta(t) = \|\Theta_t - \Theta_0\|_F / \|\Theta_0\|_F$.
  • Kernel velocity: $v(t) = S(\Theta_t, \Theta_{t+dt})/dt$, quantifying instantaneous kernel change.
  • Kernel-label alignment: $A(t) = \mathrm{CKA}(\Theta_t, y y^T)$ for label vector $y$.
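These metrics are straightforward to compute from two Gram matrices. A sketch using synthetic positive semidefinite matrices in place of actual NTKs, with the uncentered CKA form given in the definitions above:

```python
import numpy as np

# Drift metrics between Theta_0 (at initialization) and Theta_t (during
# training); both are synthetic PSD stand-ins here, purely for illustration.
rng = np.random.default_rng(3)
n = 50
A = rng.normal(size=(n, n))
Theta_0 = A @ A.T                                  # PSD stand-in for the initial NTK
P = rng.normal(size=(n, n))
Theta_t = Theta_0 + 0.05 * (P + P.T)               # small symmetric perturbation

def cka(K1, K2):
    """Uncentered CKA: Frobenius inner product of normalized Gram matrices."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

lam_max = np.linalg.eigvalsh(Theta_t).max()        # kernel spectral norm
S = 1.0 - cka(Theta_t, Theta_0)                    # CKA kernel distance
dF = np.linalg.norm(Theta_t - Theta_0) / np.linalg.norm(Theta_0)
y = np.sign(rng.normal(size=n))                    # toy +/-1 labels
alignment = cka(Theta_t, np.outer(y, y))           # kernel-label alignment
```

In the lazy regime these quantities stay near zero (and alignment stays nearly constant) throughout training; sustained growth of $\Delta\Theta(t)$ or $S(\Theta_t, \Theta_0)$ signals departure into the feature-learning regime.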

In continual (nonstationary) learning, abrupt "reactivation" phenomena occur at task boundaries, with transient collapse and subsequent recovery of kernel norm and alignment, even for wide networks in the NTK regime (Liu et al., 21 Jul 2025). Such observations indicate that a fixed NTK cannot account for forgetting, interference, or rapid adaptation in dynamic environments.

5. Implications for Optimization and Generalization

The NTK spectrum determines the explicit convergence rates of gradient descent, with convergence along an eigen-direction $v_i$ set by the associated eigenvalue $\lambda_i$: $$f_i(t+1) = f_i(t) - \eta\,\lambda_i(t)\,(f_i(t) - y_i).$$ Bounding eigenvalue growth via architectural or algorithmic interventions (e.g., NTK-Eigenvalue-Controlled Residual Networks, stochastic depth, Fourier embeddings) provides explicit control over convergence stability and generalization error (Mysore et al., 9 Dec 2025). The generalization bound typically reads $$\mathbb{E}_{\text{gen}} \leq \sum_{i=1}^n \frac{(f_i(0) - y_i)^2}{\lambda_i(\infty)} + \epsilon,$$ with $\epsilon$ quantifying finite-width fluctuations.
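The per-mode recursion above can be checked numerically: each residual component contracts geometrically at rate $(1 - \eta\lambda_i)$, so large-eigenvalue modes are fit first. The eigenvalues below are illustrative values, not from any cited experiment:

```python
import numpy as np

# Per-eigenmode gradient-descent dynamics with a static NTK spectrum.
lam = np.array([10.0, 1.0, 0.1, 0.01])   # NTK eigenvalues, largest first
eta = 0.05                               # step size; eta * max(lam) < 2 for stability
f = np.zeros(4)                          # f_i(0) = 0 in the eigenbasis
y = np.ones(4)                           # targets in the eigenbasis
for t in range(200):
    f = f - eta * lam * (f - y)          # f_i(t+1) = f_i(t) - eta*lam_i*(f_i(t) - y_i)

residual = np.abs(f - y)                 # analytically (1 - eta*lam)**200 per mode
```

Slow modes (small $\lambda_i$) dominate the late-time residual, which is why spectral interventions that control $\lambda_i(t)$ translate directly into convergence and generalization guarantees.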

In the NTK regime, averaged stochastic gradient descent (ASGD) achieves minimax-optimal rates for regression in the RKHS determined by the NTK, with explicit exponents set by the smoothness $r$ and the kernel eigen-decay $\beta$: $$\mathbb{E}\,\|g_{\overline{\Theta}^{(T)}} - g_\rho\|_{L_2}^2 = O\!\left(T^{-2 r \beta / (2 r \beta + 1)}\right)$$ (Nitanda et al., 2020). For neural operators, an analogous NTK regime provides minimax rates in operator learning with explicit sample-complexity and width requirements (Nguyen et al., 2024).

6. Limitations and Failure Modes

Several empirical and theoretical analyses delineate the limits of the NTK regime:

  • Depth-width constraints: The finite-width NTK matches the infinite-width limit only when $L/M \ll 1$, or at the edge of chaos for certain initializations; otherwise, kernels can be highly random or drift during training (Seleznova et al., 2020, Hayou et al., 2019).
  • Nonstationarity: Shifts in the data distribution cause the NTK to evolve significantly, invalidating the static kernel approximation and leading to qualitative phenomena such as reactivation at task boundaries (Liu et al., 21 Jul 2025).
  • Feature learning: The NTK regime precludes exploration of "good" directions in parameter space critical for learning sparse, high-degree, or compositional structure (Nichani et al., 2022).
  • Practical mismatch: Experimentally, the widths/depths required for classical NTK predictions to hold are often orders of magnitude beyond those used in practice, rendering NTK-based algorithmic choices unreliable in realistic settings (Wenger et al., 2023).

7. Extensions, Modifications, and Theoretical Outlook

To address the breakdown of NTK regime assumptions, several theoretical developments have emerged:

  • Neural Tangent Hierarchy (NTH): Systematic $1/m$ corrections obtained by including higher-order tensors $\Theta^{(r)}_t$, enabling precise quantitative modeling of finite-width networks (Huang et al., 2019).
  • NTK-Eigenvalue-Controlled Residual Networks: Architectural modifications to stabilize kernel spectra and generalization (Mysore et al., 9 Dec 2025).
  • Integrated frameworks: Unified theories interpolating between deterministic NTK dynamics and full Bayesian NNGP posteriors, clarifying distinct timescales and phases of learning (Avidan et al., 2023).
  • Regularized NTK dynamics: Incorporating explicit regularization to sustain the lazy regime facilitates PAC-Bayesian analysis and uniform convergence results (Clerico et al., 2023).

A central theoretical imperative is to explicitly model training-induced evolution of Θt\Theta_t as a function of both initialization and the data-distribution trajectory. Such models promise to yield new algorithmic strategies for harnessing feature reactivation and mitigating catastrophic forgetting in non-stationary or continual learning (Liu et al., 21 Jul 2025, Mysore et al., 9 Dec 2025).

