
Neural Tangent Kernel (NTK) Overview

Updated 13 December 2025
  • The Neural Tangent Kernel (NTK) is a mathematical framework that models the training dynamics of wide neural networks using gradient flow and kernel methods.
  • It linearizes the training process in the infinite-width regime, linking network optimization closely with kernel ridge regression.
  • NTK insights reveal practical implications for feature learning, kernel alignment, and neural collapse while inspiring scalable approximations in deep learning.

The Neural Tangent Kernel (NTK) framework provides a rigorous, operator-theoretic description of the training dynamics of wide neural networks under gradient flow. The NTK governs the evolution of neural network outputs, predicts the speed and nature of convergence, and serves as a powerful mathematical link between neural network optimization and kernel methods. In the infinite-width regime, the NTK becomes deterministic and typically remains constant during training, resulting in linearized dynamics that can be analyzed using tools from functional analysis and random matrix theory. This framework has profound implications for understanding generalization, the emergence of geometric structures such as Neural Collapse, feature learning via kernel alignment, finite-width corrections, and the limits of kernelized models in practical deep learning systems.

1. Definition and Structural Principles of the Neural Tangent Kernel

The NTK is defined for a parameterized neural network $f_\theta: \mathbb{R}^d \to \mathbb{R}^k$ with parameter vector $\theta \in \mathbb{R}^P$ as

$$\Theta(x, x') = \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x') \in \mathbb{R}^{k\times k}$$

(Jacot et al., 2018, Yang, 2020). In practical settings, the empirical NTK matrix $K_\mathrm{NTK}$ evaluated on a dataset $X = \{x_i\}$ is

$$K_\mathrm{NTK}(x_i, x_j) = \nabla_\theta f_\theta(x_i)^\top \nabla_\theta f_\theta(x_j)$$

(Ge et al., 1 Oct 2025, Engel et al., 2022). The NTK quantifies the functional interdependence between outputs for pairs of inputs induced by shared network parameters.
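
The empirical NTK above can be computed directly from per-sample parameter gradients. Below is a minimal sketch for a scalar-output PyTorch model (the toy `model` and inputs `xs` are illustrative assumptions); multi-output networks require stacking per-output Jacobians, and libraries such as torchNTK (Engel et al., 2022) offer more scalable implementations.

```python
import torch


def empirical_ntk(model, xs):
    """Empirical NTK Gram matrix K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>
    for a scalar-output model, via one backward pass per sample."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x in xs:
        out = model(x.unsqueeze(0)).squeeze()          # scalar output f_theta(x)
        g = torch.autograd.grad(out, params)           # gradient w.r.t. all parameters
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    J = torch.stack(grads)                             # (n_samples, n_params) Jacobian
    return J @ J.T                                     # (n_samples, n_samples) Gram matrix


# Illustrative usage on a toy network and random inputs.
model = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
xs = torch.randn(16, 8)
K = empirical_ntk(model, xs)                           # symmetric, positive semi-definite
```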

In the infinite-width limit (width $\to\infty$ in all hidden layers), and under random Gaussian initialization, the NTK converges to a deterministic, architecture-specific kernel $\Theta_\infty(x, x')$ (Jacot et al., 2018, Yang, 2020, Mysore et al., 9 Dec 2025). Under gradient flow on the loss (e.g., squared error), the output evolution is governed by

$$\frac{d}{dt} f_t(x) = -\eta \sum_j \Theta_\infty(x, x_j)\,[f_t(x_j) - y_j]$$

(Jacot et al., 2018, Mysore et al., 9 Dec 2025), which generalizes to vector-valued and multi-output settings.

The NTK matrix can be constructed recursively for standard architectures (MLPs, CNNs, GNNs) and extended, via the Tensor Programs formalism, to essentially any parametric network, including recurrent and attention-based models. For ReLU architectures, the NTK admits explicit recursions in terms of layerwise statistics of preactivation covariances and their derivatives (Yang, 2020).

2. Infinite-Width Dynamics and Kernel Linearization

In the infinite-width regime, the NTK remains constant during training ("lazy training"), and learning reduces to linearized dynamics

$$f_t(x) = f_0(x) - \int_0^t \Theta_\infty(x, X)\,[f_s(X) - y]\, ds$$

(Jacot et al., 2018, Mysore et al., 9 Dec 2025). This renders training equivalent to kernel gradient descent in the reproducing kernel Hilbert space (RKHS) governed by $\Theta_\infty$. For regression with mean squared error, convergence occurs along the principal components of $\Theta_\infty$, with each mode decaying at a rate proportional to its eigenvalue (see Theorem 2.2 in Mysore et al., 9 Dec 2025):

$$f_i(t) - y_i = (1 - \eta\,\lambda_i)^t\,(f_i(0) - y_i),$$

where $\lambda_i$ are the eigenvalues of $\Theta_\infty$ on the training data.
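
The mode-wise decay can be checked numerically on any positive-definite Gram matrix standing in for $\Theta_\infty$ restricted to the training set. A minimal sketch, with assumed toy data, an RBF kernel as a stand-in, and $f_0 = 0$:

```python
import numpy as np

# Toy data and an assumed RBF Gram matrix standing in for Theta_infinity on the training set.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

lam, U = np.linalg.eigh(K)                 # eigenvalues lambda_i and eigenvectors of the kernel
eta = 0.5 / lam.max()                      # step size ensuring |1 - eta * lambda_i| < 1
r0 = U.T @ (0.0 - y)                       # initial residual f_0 - y in the eigenbasis (f_0 = 0)

for t in (0, 10, 100, 1000):
    r_t = (1.0 - eta * lam) ** t * r0      # each mode decays as (1 - eta * lambda_i)^t
    print(t, float(np.linalg.norm(r_t)))   # error shrinks fastest along large-eigenvalue modes
```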

Positive-definiteness of $\Theta_\infty$ ensures global convergence to zero training loss and, in the zero-regularization limit, training reduces to kernel ridge regression with the NTK as the kernel (Jacot et al., 2018, Ge et al., 1 Oct 2025). Additionally, random initialization creates an equivalence between wide neural networks at initialization and Gaussian processes: the NTK encodes the time evolution of learning, while the NNGP kernel describes the output covariance (Avidan et al., 2023).
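
In this zero-regularization limit, predictions take the closed form of kernel (ridge) regression with the NTK Gram matrix. A minimal sketch, assuming precomputed kernel blocks (e.g., from the empirical-NTK routine in Section 1):

```python
import numpy as np


def ntk_ridge_predict(K_train, K_test_train, y_train, ridge=1e-6):
    """Kernel ridge regression with the NTK as the kernel:
    f(x*) = K(x*, X) (K(X, X) + ridge * I)^{-1} y; ridge -> 0 gives the interpolating limit."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + ridge * np.eye(n), y_train)
    return K_test_train @ alpha
```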

3. Kernel Alignment, Feature Learning, and Specialization

While infinite-width theory treats the NTK as static, practical finite-width networks exhibit "kernel alignment": progressive adaptation of the empirical NTK to strengthen correlation with label-relevant directions during training (Shan et al., 2021). Key results include:

  • Alignment metric: $A(t) = \langle K(t), yy^T \rangle_F \,/\, \big(\|K(t)\|_F\, \|yy^T\|_F\big)$ quantifies how strongly the learned NTK aligns with the label structure (Shan et al., 2021); a minimal computation is sketched after this list.
  • Alignment accelerates the decay of errors along label directions and induces faster, selective generalization.
  • In deep linear and two-layer ReLU networks, analytical results show that NTK alignment and "kernel specialization" (where each output-channel's subkernel aligns preferentially to its target) are necessary and sufficient for feature learning beyond the fixed-kernel regime (Shan et al., 2021, Khalafi et al., 2023).
  • Empirical studies confirm that alignment and specialization are stronger in deep, nonlinear networks and in clustered, multi-output settings.
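
A minimal sketch of the alignment metric from the first bullet, for a given empirical NTK Gram matrix `K` (e.g., from the routine in Section 1) and label vector `y`; tracking it across training checkpoints yields the curve $A(t)$:

```python
import numpy as np


def kernel_alignment(K, y):
    """Uncentered kernel-target alignment A = <K, y y^T>_F / (||K||_F ||y y^T||_F)."""
    yyT = np.outer(y, y)
    return float((K * yyT).sum() / (np.linalg.norm(K) * np.linalg.norm(yyT)))
```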

4. Block Structure and Neural Collapse in Classification

A canonical empirical observation is the emergence of block structure in the NTK during late-phase training of classification networks: stronger intra-class correlations and weaker inter-class correlations (Seleznova et al., 2023). Assume the NTK develops a three-valued block structure,

$$\Theta(x, x) = \lambda_d I_C, \qquad \Theta(x_i^c, x_j^c) = \lambda_c I_C \ \ (\text{same class},\ i \neq j), \qquad \Theta(x_i^c, x_j^{c'}) = \lambda_n I_C \ \ (c \neq c'),$$

with $\lambda_d > \lambda_c > \lambda_n \ge 0$. Then the dynamics of last-layer features, class means, and classifier weights decouple into tractable eigenspaces. When a particular dynamical invariant vanishes, the dynamics provably result in Neural Collapse: within-class variability collapses, class means form a simplex equiangular tight frame (ETF), classifier weights align with the class means (self-duality), and the nearest-class-center decision rule holds (Seleznova et al., 2023). This analysis provides the first end-to-end NTK-based derivation of Neural Collapse under mean squared error training.
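
To make the block-structure assumption concrete, the sketch below builds the three-valued Gram matrix for $C$ classes with $n$ samples each (scalar outputs, i.e., dropping the $I_C$ factor) and inspects its spectrum; the eigenvalue grouping in the comments follows from elementary linear algebra and illustrates how the dynamics decouple into within-class, between-class, and global-mean directions.

```python
import numpy as np

C, n = 4, 10                                   # number of classes, samples per class
lam_d, lam_c, lam_n = 3.0, 2.0, 0.5            # diagonal, same-class, cross-class kernel values

labels = np.repeat(np.arange(C), n)
same_class = labels[:, None] == labels[None, :]
K = np.where(same_class, lam_c, lam_n).astype(float)
np.fill_diagonal(K, lam_d)                     # three-valued block-structured NTK Gram matrix

values, counts = np.unique(np.round(np.linalg.eigvalsh(K), 6), return_counts=True)
print(values, counts)
# For lam_d > lam_c > lam_n >= 0 the spectrum splits into three groups:
#   lam_d - lam_c                           multiplicity C*(n-1)   (within-class directions)
#   lam_d + (n-1)*lam_c - n*lam_n           multiplicity C-1       (between-class-mean directions)
#   lam_d + (n-1)*lam_c + (C-1)*n*lam_n     multiplicity 1         (global mean direction)
```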

5. Extensions: Generalization, Scalability, and Empirical Phenomena

The NTK framework has been extended to numerous settings:

  • Graph Neural Networks: Alignment of NTK eigenvectors and input-output cross-covariance determines optimal graph shift operators; GNNs with cross-covariance graphs achieve improved convergence and generalization (Khalafi et al., 2023).
  • Quantum-Enhanced Networks: NTK formalism naturally extends to quantum-classical hybrid networks by incorporating quantum kernels as the base layer in infinite-width NTK recursions, providing closed-form training dynamics and Gaussian process covariance structure (Nakaji et al., 2021).
  • Neural Operators: For operator learning (e.g., PDE surrogates), two-layer neural operators inherit a vector-valued NTK (vvNTK) structure that controls early-stopped gradient descent rates, minimax-optimal generalization, and sharp bias-variance tradeoffs (Nguyen et al., 23 Dec 2024).

Computational Challenges: Exact NTK computation scales quadratically to cubically in sample size, motivating approximate methods. Efficient random-feature and sketching methods yield randomized low-dimensional NTK feature maps that allow linear-time evaluation with provable uniform approximation guarantees; such approaches scale NTK-based methods to practical dataset sizes with little loss of accuracy (Zandieh, 2021, Zandieh et al., 2021).
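
As a generic illustration of why low-rank structure helps (the cited works use more elaborate random-feature and sketching constructions, not reproduced here), a Nyström approximation replaces the $n \times n$ NTK Gram matrix with a rank-$m$ factorization built from $m \ll n$ landmark points:

```python
import numpy as np


def nystrom_factors(kernel_fn, X, m, seed=0):
    """Rank-m Nystrom approximation K ~= C @ W_pinv @ C.T from m landmark points.
    kernel_fn(A, B) must return the kernel matrix between the rows of A and B."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    C = kernel_fn(X, X[idx])                              # (n, m) cross-kernel block
    W_pinv = np.linalg.pinv(kernel_fn(X[idx], X[idx]))    # (m, m) landmark block, pseudo-inverted
    return C, W_pinv                                      # downstream solves cost O(n m^2) instead of O(n^3)
```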

Empirical Gaps and Limitations: Recent critical works have established that the infinite-width NTK regime requires unrealistically large widths (several orders of magnitude larger than depth) to guarantee constancy and linearization (Wenger et al., 2023, Liu et al., 19 Jan 2025). For realistic finite widths:

  • The empirical NTK can evolve significantly during training, leading to feature learning beyond the kernel regime (Huang et al., 2019, Guillen et al., 15 Aug 2025); a simple drift diagnostic is sketched after this list.
  • Equivalence between kernel regression with the static NTK and neural network training breaks down, and finite-width networks often outperform the corresponding kernel model (Liu et al., 19 Jan 2025).
  • Algorithms relying on NTK-derived properties (optimization rates, uncertainty quantification, continual learning bounds) fail to yield predicted benefits for practical model widths (Wenger et al., 2023).
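
A simple diagnostic for the first point is to compare the empirical NTK at initialization with the one after training on the same inputs (e.g., two Gram matrices from the `empirical_ntk` sketch in Section 1); near-zero drift and near-unit similarity indicate lazy training, while large drift signals feature learning beyond the static-kernel description.

```python
import torch


def kernel_drift(K0, K1):
    """Compare NTK Gram matrices at initialization (K0) and after training (K1)."""
    rel_change = torch.linalg.norm(K1 - K0) / torch.linalg.norm(K0)   # relative Frobenius drift
    similarity = (K0 * K1).sum() / (torch.linalg.norm(K0) * torch.linalg.norm(K1))
    return rel_change.item(), similarity.item()
```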

6. Advanced Topics: Finite-Width Corrections and Architectural Control

Finite-width corrections to NTK statistics can be rigorously captured using a $1/n$ expansion (where $n$ is the width), systematically organized via Feynman diagrammatics (Guillen et al., 15 Aug 2025). Key findings include:

  • At infinite width, NTK is deterministic and "frozen"; finite-width corrections yield NTK evolution, enabling feature learning.
  • At "criticality" (suitable scaling of weights and activations), higher-order cumulants and NTK corrections remain stable with increasing depth.
  • For scale-invariant activations (e.g., ReLU), no finite-width correction appears on the NTK diagonal; empirically, the diagonal entries of the NTK match their infinite-width values at arbitrary width.

Architectural choices directly affect NTK spectral properties and, thus, optimization and generalization:

  • Fourier feature embeddings inject high-frequency components, improving learnability for oscillatory targets (Mysore et al., 9 Dec 2025); see the sketch after this list.
  • Residual scaling and stochastic depth regularize the NTK spectrum, preventing eigenvalue explosion and ensuring optimization stability.
  • The evolution and conditioning of NTK eigenvalues dictate mode-wise convergence and generalization error bounds.
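
A minimal sketch of the first bullet: a random Fourier feature embedding $\gamma(x) = [\cos(2\pi Bx), \sin(2\pi Bx)]$, with an assumed Gaussian frequency matrix $B$, is prepended to the network, so the NTK is computed on inputs whose frequency content is controlled by the embedding scale.

```python
import torch


class FourierFeatures(torch.nn.Module):
    """Fixed random embedding x -> [cos(2*pi*B x), sin(2*pi*B x)]; larger `scale`
    injects higher frequencies into the inputs seen by the network (and its NTK)."""

    def __init__(self, in_dim, n_features=64, scale=10.0):
        super().__init__()
        self.register_buffer("B", scale * torch.randn(in_dim, n_features))

    def forward(self, x):
        proj = 2.0 * torch.pi * (x @ self.B)
        return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)


# Illustrative usage: embed 2-D coordinates before a small MLP (embedding width is 2 * n_features).
net = torch.nn.Sequential(FourierFeatures(2), torch.nn.Linear(128, 64),
                          torch.nn.ReLU(), torch.nn.Linear(64, 1))
```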

7. Applications and Practical Implications

NTK-driven methodologies find application in federated learning, where NTK-based updates can be used for convergence analysis and algorithmic improvements in both centralized (FedAvg) and decentralized (DFL) settings (Huang et al., 2021, Thompson et al., 2 Oct 2024). In genetic risk modeling, embedding the empirical NTK into classical statistical models enables both accuracy and interpretability: variance-component estimation and heritability decomposition in linear mixed models directly leverage the NTK (NTK-LMM), combining nonlinear representation power with the advantages of convex statistical estimation (Ge et al., 1 Oct 2025).

The NTK has motivated scalable software implementations such as torchNTK (Engel et al., 2022), which provides practical layerwise decomposition and optimization diagnostics. The NTK-FL and NTK-DFL frameworks achieve communication-efficient federated learning by combining kernel-based analytical updates with model averaging, and are particularly effective in heterogeneous data regimes.

References

  • (Jacot et al., 2018) Jacot, Gabriel, and Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks"
  • (Seleznova et al., 2023) Goldt et al., "Neural (Tangent Kernel) Collapse"
  • (Shan et al., 2021) Lampinen and Ganguli, "A Theory of Neural Tangent Kernel Alignment and Its Influence on Training"
  • (Wenger et al., 2023) Zech et al., "On the Disconnect Between Theory and Practice of Neural Networks: Limits of the NTK Perspective"
  • (Engel et al., 2022) Engel et al., "TorchNTK: A Library for Calculation of Neural Tangent Kernels of PyTorch Models"
  • (Nakaji et al., 2021) Cerezo et al., "Quantum-enhanced neural networks in the neural tangent kernel framework"
  • (Guillen et al., 15 Aug 2025) Srivastava et al., "Finite-Width Neural Tangent Kernels from Feynman Diagrams"
  • (Nguyen et al., 23 Dec 2024) Kabanikhin et al., "Optimal Convergence Rates for Neural Operators"
  • (Mysore et al., 9 Dec 2025) Benigni and Paquette, "Mathematical Foundations of Neural Tangents and Infinite-Width Networks"
  • (Ge et al., 1 Oct 2025) Park et al., "Neural Tangent Kernels for Complex Genetic Risk Prediction: Bridging Deep Learning and Kernel Methods in Genomics"
  • (Huang et al., 2021) Xie et al., "FL-NTK: A Neural Tangent Kernel-based Framework for Federated Learning Convergence Analysis"
  • (Thompson et al., 2 Oct 2024) Dinh et al., "NTK-DFL: Enhancing Decentralized Federated Learning in Heterogeneous Settings via Neural Tangent Kernel"
  • (Yang, 2020) Yang, "Tensor Programs II: Neural Tangent Kernel for Any Architecture"
  • (Khalafi et al., 2023) Becker et al., "Neural Tangent Kernels Motivate Graph Neural Networks with Cross-Covariance Graphs"
  • (Huang et al., 2019) Huang and Yau, "Dynamics of Deep Neural Networks and Neural Tangent Hierarchy"
  • (Liu et al., 19 Jan 2025) Liu et al., "Issues with Neural Tangent Kernel Approach to Neural Networks"
  • (Zandieh, 2021) Chen et al., "Learning with Neural Tangent Kernels in Near Input Sparsity Time"
  • (Zandieh et al., 2021) Chen et al., "Scaling Neural Tangent Kernels via Sketching and Random Features"
  • (Avidan et al., 2023) Schnaack et al., "Connecting NTK and NNGP: A Unified Theoretical Framework for Wide Neural Network Learning Dynamics"

This confluence of kernel methods, random matrix theory, computational physics, and statistical learning theory continues to drive both theoretical progress and practical tools in deep learning research.
