Deep ReLU Networks: NTK Dynamics
- Deep ReLU networks are multilayer feedforward architectures that use ReLU activations and are analyzed via neural tangent kernel (NTK) theory to understand function-space evolution.
- The NTK framework uses explicit kernel recursions to capture training dynamics, from frozen-kernel behavior at infinite width to slow kernel evolution and feature learning at finite width.
- Randomized feature maps enable efficient kernel approximation at scale, while the neural tangent hierarchy quantifies finite-width corrections, with implications for generalization, spectral bias, and computational scaling.
Deep ReLU networks are multilayer feedforward architectures with rectified linear unit (ReLU) activations, typically employing widths and depths large enough to access the asymptotic regimes in which neural tangent kernel (NTK) theory provides an accurate description of both function-space evolution and generalization. These systems exhibit a range of behaviors—frozen kernel training at infinite width, slow kernel evolution and feature learning at finite width, and alignment and specialization dynamics under gradient descent—that have been characterized via explicit kernel recursions, infinite ordinary differential equation hierarchies, and rigorous spectral and statistical analyses.
1. Definition and Infinite-Width Limit of the Neural Tangent Kernel
In a deep fully-connected ReLU network, each hidden layer applies $x^{(\ell)} = \sqrt{2/m}\,\sigma\big(W^{(\ell)} x^{(\ell-1)}\big)$ for width $m$, depth $L$, and activation $\sigma(u) = \max(u, 0)$. The output is $f(x;\theta) = W^{(L+1)} x^{(L)}$. Under gradient flow, the empirical NTK at any time $t$ is given by:

$$\Theta_t(x, x') = \nabla_\theta f(x;\theta_t)^\top \nabla_\theta f(x';\theta_t).$$

This kernel governs the evolution of the network output via:

$$\frac{d}{dt} f(x;\theta_t) = -\sum_{i=1}^{n} \Theta_t(x, x_i)\,\nabla_{f(x_i)} \mathcal{L}.$$

In the infinite-width limit ($m \to \infty$), $\Theta_t$ concentrates to a deterministic kernel $\Theta_\infty$ that remains constant throughout training, and function-space dynamics become linearized (Jacot et al., 2018):

$$\frac{d}{dt} f_t = -\Theta_\infty\,(f_t - y) \quad \text{(squared loss)}.$$

The solution coincides with kernel regression in RKHS($\Theta_\infty$), and positive definiteness of $\Theta_\infty$ implies exponential convergence to zero loss. The limiting infinite-width NTK for ReLU networks is constructed recursively via the arc-cosine kernel (Zandieh, 2021, Han et al., 2021):

$$\kappa_0(u) = \frac{\pi - \arccos(u)}{\pi}, \qquad \kappa_1(u) = \frac{u\,(\pi - \arccos(u)) + \sqrt{1-u^2}}{\pi},$$

where $\Theta_\infty$ is built from the closed-form $\kappa_0$ and $\kappa_1$ functions via nested recurrences ($\Sigma^{(\ell)} = \kappa_1 \circ \Sigma^{(\ell-1)}$ for the layerwise GP covariance and $\dot\Sigma^{(\ell)} = \kappa_0 \circ \Sigma^{(\ell-1)}$ for its derivative counterpart).
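To make the definitions concrete, the following is a minimal NumPy sketch of the empirical NTK for a one-hidden-layer special case, computed by explicitly forming parameter gradients. The architecture, function names, and sizes are illustrative assumptions, not constructions from the cited papers.

```python
import numpy as np

def init_params(d, m, seed=0):
    """NTK parameterization: all weights i.i.d. standard normal."""
    rng = np.random.default_rng(seed)
    return {"W1": rng.standard_normal((m, d)), "w2": rng.standard_normal(m)}

def forward(params, x):
    """Scalar output f(x) = w2 . sqrt(2/m) * relu(W1 x)."""
    m = params["w2"].shape[0]
    h = np.maximum(params["W1"] @ x, 0.0)
    return params["w2"] @ (np.sqrt(2.0 / m) * h)

def grad_params(params, x):
    """Gradient of the scalar output with respect to all parameters, flattened."""
    m = params["w2"].shape[0]
    scale = np.sqrt(2.0 / m)
    pre = params["W1"] @ x
    h = np.maximum(pre, 0.0)
    g_w2 = scale * h                                        # df/dw2
    g_W1 = scale * np.outer(params["w2"] * (pre > 0), x)    # df/dW1
    return np.concatenate([g_W1.ravel(), g_w2])

def empirical_ntk(params, X):
    """Theta_t(x_i, x_j) = <grad_theta f(x_i), grad_theta f(x_j)>."""
    J = np.stack([grad_params(params, x) for x in X])       # shape (n, #params)
    return J @ J.T

X = np.random.default_rng(1).standard_normal((8, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)               # unit-sphere inputs
params = init_params(d=5, m=4096)
Theta = empirical_ntk(params, X)                             # 8 x 8 kernel Gram matrix
print(forward(params, X[0]), Theta[0, 0])
```

At large width, the entries of this empirical Gram matrix concentrate around the deterministic arc-cosine recursion described above.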
2. Explicit Recursion and Positive Definiteness
The NTK for depth-$L$ networks is defined by a recursion:

$$\Theta^{(1)}(x,x') = \Sigma^{(1)}(x,x'), \qquad \Theta^{(\ell)}(x,x') = \Theta^{(\ell-1)}(x,x')\,\dot\Sigma^{(\ell)}(x,x') + \Sigma^{(\ell)}(x,x'),$$

with $\Sigma^{(\ell)}$ and $\dot\Sigma^{(\ell)}$ corresponding to the layer-$\ell$ GP covariance and activation-derivative statistics. For all standard non-polynomial activations, including ReLU, the infinite-width NTK is strictly positive definite for any set of distinct inputs (Carvalho et al., 19 Apr 2024). This positivity is ensured at all depths and under minimal smoothness assumptions on the activation, providing exponential convergence of gradient descent to zero loss for any target (Jacot et al., 2018).
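A minimal sketch of this recursion for unit-norm inputs, assuming the arc-cosine closed forms $\kappa_0$, $\kappa_1$ given in Section 1 (function names are illustrative), together with an eigenvalue check of strict positive definiteness on a set of distinct inputs:

```python
import numpy as np

def kappa0(u):
    """Degree-0 arc-cosine function: activation-derivative statistics of ReLU."""
    return (np.pi - np.arccos(np.clip(u, -1.0, 1.0))) / np.pi

def kappa1(u):
    """Degree-1 arc-cosine function: GP covariance of a ReLU layer."""
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u * u)) / np.pi

def relu_ntk(X, depth):
    """Infinite-width NTK Gram matrix for unit-norm rows of X."""
    Sigma = X @ X.T                  # layer-0 covariance = input Gram matrix
    Theta = Sigma.copy()
    for _ in range(depth):
        Sigma_dot = kappa0(Sigma)    # derivative kernel of the current layer
        Sigma = kappa1(Sigma)        # GP covariance of the next layer
        Theta = Theta * Sigma_dot + Sigma
    return Theta

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # distinct unit-norm inputs
Theta = relu_ntk(X, depth=3)
print(np.linalg.eigvalsh(Theta).min())               # strictly positive for distinct inputs
```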
3. Finite-Width Corrections and the Neural Tangent Hierarchy
At finite width, the NTK evolves during training. The precise dynamics can be captured via a hierarchy of ODEs, the neural tangent hierarchy (NTH), whose elements are higher-order kernels built from higher derivatives of the network output with respect to the parameters (Huang et al., 2019):

$$\frac{d}{dt} f_t(x_\alpha) = -\sum_{\beta=1}^{n} \Theta^{(2)}_t(x_\alpha, x_\beta)\,\big(f_t(x_\beta) - y_\beta\big), \qquad \frac{d}{dt}\,\Theta^{(r)}_t(x_{\alpha_1},\dots,x_{\alpha_r}) = -\sum_{\beta=1}^{n} \Theta^{(r+1)}_t(x_{\alpha_1},\dots,x_{\alpha_r}, x_\beta)\,\big(f_t(x_\beta) - y_\beta\big),$$

where $\Theta^{(2)}_t$ is the empirical NTK and the evolution of each kernel is driven by the next one in the hierarchy.
The hierarchy can be truncated at a finite level $p$, with rigorously bounded truncation error, for both kernel values and network outputs, over training horizons that grow with the width $m$. The NTK itself evolves at a rate that vanishes as the width increases, enabling quantification of generalization improvements due to feature learning, which is absent in the infinite-width (static-kernel) limit.
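A rough empirical probe of this width dependence (an illustrative experiment, not the NTH construction itself): train a small ReLU network with full-batch gradient descent on a squared loss and track how far the empirical kernel drifts from its initial value at several widths. The sketch reuses init_params and empirical_ntk from the Section 1 sketch; all other names and hyperparameters are assumptions.

```python
import numpy as np

def loss_grads(params, X, y):
    """Gradients of 0.5 * sum_i (f(x_i) - y_i)^2 with respect to W1 and w2."""
    m = params["w2"].shape[0]
    scale = np.sqrt(2.0 / m)
    gW1 = np.zeros_like(params["W1"])
    gw2 = np.zeros_like(params["w2"])
    for x, t in zip(X, y):
        pre = params["W1"] @ x
        h = np.maximum(pre, 0.0)
        r = params["w2"] @ (scale * h) - t                   # residual f(x) - y
        gw2 += r * scale * h
        gW1 += r * scale * np.outer(params["w2"] * (pre > 0), x)
    return gW1, gw2

def ntk_drift(width, X, y, steps=300, lr=0.05, seed=0):
    """Relative Frobenius drift ||Theta_T - Theta_0|| / ||Theta_0|| after training."""
    params = init_params(X.shape[1], width, seed)            # from the Section 1 sketch
    Theta0 = empirical_ntk(params, X)
    for _ in range(steps):
        gW1, gw2 = loss_grads(params, X, y)
        params["W1"] -= lr * gW1
        params["w2"] -= lr * gw2
    return np.linalg.norm(empirical_ntk(params, X) - Theta0) / np.linalg.norm(Theta0)

rng = np.random.default_rng(2)
X = rng.standard_normal((16, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0])
for m in (64, 256, 1024, 4096):
    print(m, ntk_drift(m, X, y))       # drift typically shrinks as width grows
```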
4. Computational Scaling and Fast Feature Maps
Exact evaluation of the NTK for deep networks is computationally intractable on large datasets: the Gram matrix costs scale quadratically in the dataset size $n$, with each entry requiring the depth-$L$ kernel recursion on inputs of dimension $d$. To address this, randomized feature maps leveraging polynomial expansions and TensorSketch-type sketching enable near input-sparsity time approximation (Zandieh, 2021, Han et al., 2021, Zandieh et al., 2021); a simplified random-feature illustration follows the list below:
- Polynomially expand arc-cosine kernels underlying deep NTK recursions.
- Sketch high-degree tensor products via randomized embeddings.
- Achieve $(1 \pm \varepsilon)$ multiplicative error on all kernel entries, with feature dimension and per-input runtime polynomial in the network depth and $1/\varepsilon$ and near-linear in the input sparsity.
- Convolutional NTKs (CNTKs) can be similarly sketched for CNNs, preserving locality and yielding runtime linear in the number of image pixels (Zandieh, 2021). Empirically, NTK sketches outperform Nyström approximations in speed and accuracy, scale to million-sized datasets, and achieve top performance on small-scale data (Zandieh, 2021, Han et al., 2021, Zandieh et al., 2021).
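The cited algorithms combine polynomial expansion of the arc-cosine kernels with TensorSketch; a faithful implementation is involved, so the sketch below shows only the simpler underlying random-feature idea: random ReLU features whose expected inner product is the degree-1 arc-cosine kernel (the first-layer GP covariance), with error shrinking as the feature dimension grows. This is plain random features, not the TensorSketch-based NTK sketch of the cited papers; all names are illustrative.

```python
import numpy as np

def arccos1_exact(X):
    """Degree-1 arc-cosine kernel on unit-norm rows (first-layer GP covariance)."""
    u = np.clip(X @ X.T, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u * u)) / np.pi

def arccos1_features(X, num_features, seed=0):
    """Random ReLU features with E[phi(x) . phi(x')] equal to the kernel above."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, X.shape[1]))
    return np.sqrt(2.0 / num_features) * np.maximum(X @ W.T, 0.0)

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K_exact = arccos1_exact(X)
for D in (256, 1024, 4096):
    Phi = arccos1_features(X, D)                         # explicit feature map, n x D
    err = np.linalg.norm(Phi @ Phi.T - K_exact) / np.linalg.norm(K_exact)
    print(D, round(err, 4))                              # error shrinks roughly as 1/sqrt(D)
```

Working with the explicit $n \times D$ feature matrix instead of the $n \times n$ Gram matrix is what allows downstream kernel regression to scale near-linearly in $n$.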
5. Alignment, Feature Learning, and Specialization Dynamics
For finite-width deep ReLU networks, dynamic evolution of the empirical NTK under training reflects feature learning and kernel alignment with targets (Huang et al., 2019, Shan et al., 2021). The NTK Gram matrix aligns with label structure, developing block-wise or directional enhancements:
- Alignment: The NTK increases its projection onto the label Gram matrix $yy^\top$, accelerating loss decay in relevant directions (see the alignment sketch after this list).
- Specialization: In multi-output and multiclass settings, subkernels preferentially align to each output's target function, yielding kernel specialization.
- These alignment effects, absent in the infinite-width regime, are argued to underlie superior generalization and faster convergence of realistic-width networks relative to their infinite-width (frozen-kernel) counterparts (Shan et al., 2021).
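Kernel-target alignment is commonly measured as a normalized Frobenius inner product between the kernel Gram matrix and the label Gram matrix $yy^\top$. A minimal sketch of this statistic (the reuse of relu_ntk from the Section 2 sketch and the example data are assumptions for illustration):

```python
import numpy as np

def kernel_alignment(K, y):
    """A(K, yy^T) = <K, yy^T>_F / (||K||_F * ||yy^T||_F); lies in [0, 1] for PSD K."""
    Y = np.outer(y, y)
    return float(np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y)))

rng = np.random.default_rng(4)
X = rng.standard_normal((16, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0])
print(kernel_alignment(relu_ntk(X, depth=3), y))   # relu_ntk from the Section 2 sketch
# For a finite-width network one would compare kernel_alignment(Theta_0, y) with
# kernel_alignment(Theta_T, y); the trained kernel typically shows higher alignment.
```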
6. Generalization, Spectral Bias, and Operator Extensions
Deep ReLU NTK theory underpins generalization analyses, explaining spectral bias—the preferential learning of lower-complexity (e.g., low-frequency) functions—and offering minimax-optimal convergence rates for regression under operator and physics-informed settings (Gan et al., 14 Mar 2025, Nguyen et al., 23 Dec 2024):
- Spectral bias is unaffected by the insertion of differential operators into the loss; the eigenvalue decay rate of the NTK does not accelerate under such modifications (Gan et al., 14 Mar 2025). An eigenmode-decay illustration appears after this list.
- For neural operator learning, early-stopped gradient descent in the NTK regime attains rates depending on source/regularity and effective dimension parameters, matching RKHS optimality for high-dimensional regression (Nguyen et al., 23 Dec 2024).
- Surrogate gradient training for non-differentiable activations (e.g. spiking networks) is accurately characterized by a generalized surrogate NTK, enabling correct kernel-theoretic predictions (Eilers et al., 24 May 2024).
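Under the frozen-kernel, squared-loss dynamics of Section 1, the error component along the $i$-th NTK eigendirection decays as $\exp(-\lambda_i t)$, so small-eigenvalue (high-complexity) modes are learned last; this is the mechanism behind spectral bias. A minimal sketch of this eigenmode decay, reusing relu_ntk from the Section 2 sketch (times and data are illustrative assumptions):

```python
import numpy as np

def mode_errors(Theta, y, times):
    """Residual along each NTK eigendirection under frozen-kernel gradient flow:
    with f_0 = 0, |v_i . (f_t - y)| = exp(-lambda_i * t) * |v_i . y|."""
    lam, V = np.linalg.eigh(Theta)                 # eigenvalues in ascending order
    coeffs = np.abs(V.T @ y)                       # target projected onto the eigenbasis
    return np.array([np.exp(-lam * t) * coeffs for t in times])

rng = np.random.default_rng(5)
X = rng.standard_normal((32, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0])
errs = mode_errors(relu_ntk(X, depth=3), y, times=[0.0, 1.0, 10.0])   # relu_ntk: Section 2
print(errs.round(3))   # large-eigenvalue modes (rightmost columns) are fit first
```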
7. Physical Limits, Construction, and Mode Decoupling
Any positive-semidefinite dot-product kernel on the unit sphere can be exactly realized as the NTK of a single-hidden-layer network with suitable Hermite-series activation (Simon et al., 2021). The NTK of matrix product states (MPS) converges to a site-wise kernel in the infinite bond dimension limit, is strictly positive definite, and enables exact closed-form training dynamics for both supervised and unsupervised settings (Guo et al., 2021). Modifications of the NTK formalism, such as using cross-covariance structure in GNNs, can optimize alignment and convergence (Khalafi et al., 2023).
Deep ReLU networks thus provide an analytically tractable laboratory for the interplay among depth, width, kernel dynamics, feature learning, generalization, and efficient computation. Kernel theory, especially NTK and its generalizations, offers foundational machinery for rigorous analysis, design, and practical implementation of very wide and deep network architectures. The emergence of alignment and specialization at finite width connects kernel-induced function-space optimality with observed generalization phenomena in practical deep learning.