
NTK Stability in Deep Learning

Updated 30 July 2025
  • NTK stability describes whether the Neural Tangent Kernel remains (approximately) constant or evolves during training, which determines how accurately kernel-based approximations capture deep network behavior.
  • Research shows that while infinite-width networks maintain a fixed, deterministic NTK, finite-width networks exhibit dynamic kernel evolution that drives data-adaptive feature learning.
  • The stability of the NTK is modulated by depth-to-width ratio and initialization parameters, significantly impacting training efficiency and generalization performance.

Neural Tangent Kernel (NTK) stability refers to the behavior of the NTK under various network widths, depths, training regimes, and problem settings. It concerns the extent to which the NTK remains constant or evolves during training, and how faithfully NTK-based theory captures the learning dynamics and generalization of actual deep neural networks. The topic is central to understanding how well the NTK regime approximates deep learning, when finite-width corrections matter, how feature learning arises, and under which specific circumstances the NTK remains stable or develops structured, data-adaptive changes. The cumulative body of research delineates a highly nuanced picture: NTK stability is asymptotically exact only under specific conditions (width much larger than depth, hyperparameters in the “ordered” phase, near-lazy training), but practical networks frequently operate outside this idealized regime, causing the NTK to change dynamically with distinct theoretical and empirical signatures.

1. NTK Stability in Infinite-Width, Ordered Regimes

In the infinite-width limit of standard feedforward, convolutional, and recurrent architectures, the NTK converges almost surely to a deterministic, time-invariant kernel throughout training (Huang et al., 2019, Yang, 2020). This convergence is underpinned by rigorous tensor program analyses, showing that for any architecture respecting certain initialization and regularity conditions (e.g., gradient independence, “null avoid,” and “rank stability”), the NTK’s distribution concentrates tightly around its mean and remains constant under gradient flow. In this regime, the network’s output dynamics are governed by linear (kernel) equations and the learned function is equivalent to kernel regression under the limiting NTK. The constancy of the NTK is a direct consequence of the law of large numbers in high dimensions and underpins the stability of the “NTK regime” (Yang, 2020).

The classic NTK formula for a network with parameters θ is

$$\Theta(x, x') = \langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(x') \rangle,$$

which, for infinite width, admits a recursive and architecture-specific deterministic limit. Under these conditions, training preserves the NTK, and the NTK-based theory accurately describes training and generalization, as confirmed by empirical studies on wide networks with shallow or moderate depth (Alemohammad et al., 2020, Geifman et al., 2020). In this regime, the predicted decay of eigenvalues and the learning rates of modes (spectral bias) are well-matched between NTK and its classical analogues (e.g., Laplace kernel), indicating deep theoretical stability (Geifman et al., 2020).
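
To make this object concrete, the following sketch computes the empirical (finite-width) NTK Gram matrix of a small network by stacking per-example parameter gradients into a Jacobian and forming $JJ^\top$. It is a minimal PyTorch illustration; the architecture, width, and data are placeholder choices, not taken from the cited papers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp(width=512, depth=3, d_in=10):
    """Plain ReLU MLP with scalar output (an illustrative stand-in for f_theta)."""
    layers, d = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)

def empirical_ntk(model, X):
    """Return the n x n Gram matrix Theta[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for i in range(X.shape[0]):
        out = model(X[i:i + 1]).squeeze()          # scalar output f(x_i)
        grads = torch.autograd.grad(out, params)   # gradient w.r.t. every parameter tensor
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)                          # (n, num_params) Jacobian
    return J @ J.T                                 # empirical NTK Gram matrix

model = make_mlp()
X = torch.randn(8, 10)
K = empirical_ntk(model, X)
print(K.shape, torch.linalg.eigvalsh(K)[-1].item())  # kernel size and largest eigenvalue
```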

2. Evolution of the NTK at Finite Width and Finite Depth

For finite-width networks, the NTK no longer remains fixed during training. The kernel evolves due to changes in the parameter-space gradients as the network adapts under gradient descent (Huang et al., 2019). This evolution is captured mathematically by an “infinite neural tangent hierarchy” (NTH) of coupled ODEs:

$$\partial_t K^{(r)}_t(x_1, \ldots, x_r) = -\frac{1}{n}\sum_{\beta=1}^n K^{(r+1)}_t(x_1, \ldots, x_r, x_\beta)\,\bigl(f(x_\beta, \theta_t) - y_\beta\bigr)$$

for $r \geq 2$. This hierarchy describes how parameter-dependent NTK corrections propagate to higher-order kernel-like objects. Under truncation to finite order $p$, the approximation error vanishes polynomially in width as

$$\lVert f(t) - \tilde{f}(t) \rVert_2 \lesssim \frac{(1+t)\, t^{p-1} \sqrt{n}}{m^{p/2} \cdot \min\{t,\, n/\lambda_\mathrm{min}\}},$$

with $\lambda_\mathrm{min}$ the smallest eigenvalue of $K^{(2)}_0$ (Huang et al., 2019).

These finite-width NTK changes are $O(1/m)$, but they can be sufficient to induce data-dependent feature learning, causing network outputs and generalization behavior to deviate systematically from their fixed-kernel, infinite-width counterparts (Samarin et al., 2020). Empirical studies confirm that, except at very large widths, the NTK of conventional architectures (AlexNet, LeNet) changes non-trivially during training and is significantly more expressive than, and harder to emulate with, fixed kernel methods, especially on complex datasets.
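
The kernel drift described above can be measured directly. The sketch below is a minimal toy experiment, assuming a one-hidden-layer ReLU network with $1/\sqrt{\text{width}}$ output scaling (so the kernel scale stays comparable across widths), a synthetic regression task, and arbitrary hyperparameters; it reports the relative Frobenius change of the empirical NTK between initialization and the end of training, which typically shrinks as the width grows.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

class ScaledMLP(nn.Module):
    """One-hidden-layer ReLU net with 1/sqrt(width) output scaling,
    keeping the kernel scale comparable across widths."""
    def __init__(self, d_in, width):
        super().__init__()
        self.hidden = nn.Linear(d_in, width)
        self.out = nn.Linear(width, 1)
        self.scale = 1.0 / math.sqrt(width)

    def forward(self, x):
        return self.scale * self.out(torch.relu(self.hidden(x)))

def empirical_ntk(model, X):
    params = list(model.parameters())
    rows = []
    for i in range(X.shape[0]):
        out = model(X[i:i + 1]).squeeze()
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)
    return J @ J.T

def ntk_drift(width, steps=300, lr=0.5):
    """Relative Frobenius change of the empirical NTK over a short training run."""
    X, y = torch.randn(32, 5), torch.randn(32, 1)
    model = ScaledMLP(5, width)
    K0 = empirical_ntk(model, X)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((model(X) - y) ** 2).mean().backward()
        opt.step()
    Kt = empirical_ntk(model, X)
    return (torch.linalg.norm(Kt - K0) / torch.linalg.norm(K0)).item()

# The drift typically shrinks as width grows, consistent with the finite-width picture above.
for m in (32, 256, 2048):
    print(f"width {m:5d}  relative NTK drift {ntk_drift(m):.4f}")
```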

3. Critical Dependence on Depth-to-Width Ratio and Initialization

NTK stability is sharply modulated by depth-to-width ratio ($\lambda = L/M$) and initialization hyperparameters. Research identifies three “phases” based on the parameter $a = \sigma_w^2/2$ (Seleznova et al., 2022, Seleznova et al., 2020):

  • Ordered phase ($a < 1$): gradients vanish; the NTK is deterministic at initialization and remains constant during training.
  • Edge of chaos (EOC, $a = 1$): gradients and activations propagate neutrally; NTK variance grows exponentially with the depth-to-width ratio, but more slowly than in the chaotic regime.
  • Chaotic phase ($a > 1$): gradients explode, the NTK’s variance becomes exponentially large in depth, and the kernel can change rapidly during training.

The normalized dispersion indicator

$$\frac{\mathbb{E}[\Theta^2(x,x)]}{(\mathbb{E}[\Theta(x,x)])^2}$$

is approximately one in the ordered phase, but grows like $e^{5\lambda}$ in the chaotic and EOC phases (Seleznova et al., 2022). In the ordered phase, NTK theory is valid and the kernel is stable; in the other phases, it becomes highly random and mutable, limiting the applicability of NTK-based linear approximations. Training in the chaotic or EOC regime (or with high depth-to-width ratio) requires exponentially small learning rates to preserve NTK constancy, rendering the theory practically inapplicable for common settings (Seleznova et al., 2020, Seleznova et al., 2022).
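
A Monte Carlo estimate of this dispersion indicator is straightforward. The sketch below assumes a plain ReLU MLP with i.i.d. $\mathcal{N}(0, \sigma_w^2/\text{fan-in})$ weights and no biases, which may differ in detail from the parameterization used in the cited papers; it samples $\Theta(x,x)$ at initialization over many random draws, for ordered, EOC, and chaotic values of $a$ and two depth-to-width ratios.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def ntk_diagonal(x, width, depth, sigma_w):
    """One random draw of Theta(x, x) = ||grad_theta f(x)||^2 at initialization."""
    layers, d = [], x.shape[1]
    for _ in range(depth):
        lin = nn.Linear(d, width, bias=False)
        nn.init.normal_(lin.weight, std=sigma_w / d ** 0.5)   # Var = sigma_w^2 / fan_in
        layers += [lin, nn.ReLU()]
        d = width
    head = nn.Linear(d, 1, bias=False)
    nn.init.normal_(head.weight, std=sigma_w / d ** 0.5)
    model = nn.Sequential(*layers, head)
    out = model(x).squeeze()
    grads = torch.autograd.grad(out, list(model.parameters()))
    g = torch.cat([v.reshape(-1) for v in grads])
    return (g @ g).item()

def dispersion(width, depth, a, n_draws=200):
    """Monte Carlo estimate of E[Theta^2(x,x)] / (E[Theta(x,x)])^2 over random inits."""
    sigma_w = (2.0 * a) ** 0.5                     # a = sigma_w^2 / 2
    x = torch.randn(1, 16)
    vals = torch.tensor([ntk_diagonal(x, width, depth, sigma_w) for _ in range(n_draws)],
                        dtype=torch.float64)
    return (vals.pow(2).mean() / vals.mean() ** 2).item()

# Ordered (a < 1), EOC (a = 1), and chaotic (a > 1) initializations,
# at a small and a large depth-to-width ratio lambda = L / M.
for depth, width in [(4, 64), (32, 64)]:
    for a in (0.5, 1.0, 1.5):
        print(f"L/M = {depth / width:.2f}  a = {a:.1f}  "
              f"dispersion = {dispersion(width, depth, a):.2f}")
```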

4. Empirical Deviations, Feature Learning, and Generalization Gaps

Empirical analyses on standard architectures and real datasets demonstrate that finite-width networks outperform their associated NTK regression baselines, by several percentage points in accuracy and with better scaling exponents (Samarin et al., 2020, Vyas et al., 2022). This “performance gap” reflects the fact that, as the network trains, its NTK adapts to align with salient data structures, especially under large learning rates or at the "edge of stability" (Jiang et al., 17 Jul 2025, Long, 2021).

This dynamic evolution is often block-structured, with intra-class entries of the NTK matrix becoming larger than inter-class entries. Blockwise “NTK collapse” tightly predicts the emergence of neural collapse phenomena (classwise mean collapse, equiangularity, and classifier alignment), showing that structured NTK evolution is critical for the symmetry and geometry of learned representations (Seleznova et al., 2023). Larger learning rates can further drive the leading NTK eigenvectors to align with the target labels, enhancing generalization by “rotating” the NTK in task-relevant directions (Jiang et al., 17 Jul 2025). These effects are fundamentally absent in the infinite-width, lazy NTK regime, confirming a crucial link between mutable NTK structure and feature learning.
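
One common diagnostic for this kind of task-relevant NTK rotation is kernel-target alignment. The following sketch, assuming a toy two-class Gaussian-blob problem, a small MLP, and arbitrary hyperparameters, computes a centered alignment score between the empirical NTK and the label Gram matrix $yy^\top$ before and after training; on such separable data the score typically increases as the kernel adapts.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def empirical_ntk(model, X):
    params = list(model.parameters())
    rows = []
    for i in range(X.shape[0]):
        out = model(X[i:i + 1]).squeeze()
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)
    return J @ J.T

def alignment(K, y):
    """Centered alignment <K_c, (yy^T)_c>_F / (||K_c||_F ||(yy^T)_c||_F)."""
    n = K.shape[0]
    H = torch.eye(n) - torch.ones(n, n) / n        # centering matrix
    Kc, Yc = H @ K @ H, H @ (y @ y.T) @ H
    return ((Kc * Yc).sum() / (torch.linalg.norm(Kc) * torch.linalg.norm(Yc))).item()

# Two Gaussian blobs with +/-1 labels (toy, linearly separable data).
X = torch.cat([torch.randn(16, 5) + 1.0, torch.randn(16, 5) - 1.0])
y = torch.cat([torch.ones(16, 1), -torch.ones(16, 1)])

model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 1))
print("alignment at init:    ", alignment(empirical_ntk(model, X), y))

opt = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(500):
    opt.zero_grad()
    ((model(X) - y) ** 2).mean().backward()
    opt.step()

print("alignment after train:", alignment(empirical_ntk(model, X), y))
```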

5. Stability under Noise, Regularization, and Physics-Informed Losses

NTK stability can be extended beyond the standard “frozen” regime to certain regularized and noisy training settings (Chen et al., 2020). In a mean-field analysis, the network’s weight distribution, when measured in Wasserstein or KL divergence (not just Euclidean norm), can remain stably close to initialization even with weight decay and gradient noise. This ensures that the kernel-induced dynamics (i.e., the generalized NTK) continue to offer valid approximations up to a statistical error floor set by regularization and kernel invertibility.

In the context of physics-informed neural networks (PINNs), where the loss applies linear differential operators $T$ to the output, the resulting NTK is modified as $K_T(x,x') = T_x T_{x'} K^{NT}(x,x')$ (Gan et al., 14 Mar 2025). Theoretical and empirical evidence shows that the spectral bias of the NTK (rate of eigenvalue decay) is not dramatically worsened by $T$, and stability is largely maintained if standard conditions are met.
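
Because $T$ is linear and commutes with differentiation in $\theta$, the operator-modified NTK can also be written as the Gram matrix of $\nabla_\theta (T u_\theta)(x)$. The sketch below assumes the simplest case, a one-dimensional input with $T = d/dx$ and a small tanh network standing in for $u_\theta$; the cited work treats general linear differential operators.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small network standing in for u_theta (placeholder architecture).
model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                      nn.Linear(64, 64), nn.Tanh(),
                      nn.Linear(64, 1))
params = list(model.parameters())

def operator_ntk(X):
    """Gram matrix of grad_theta (T u)(x_i) with T = d/dx, i.e. K_T = T_x T_x' K^NT."""
    rows = []
    for i in range(X.shape[0]):
        x = X[i:i + 1].clone().requires_grad_(True)
        u = model(x)
        # T u = du/dx, built with create_graph=True so it can be differentiated w.r.t. theta.
        du_dx = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
        grads = torch.autograd.grad(du_dx.sum(), params, allow_unused=True)
        # The final bias does not enter du/dx, so its gradient is None; substitute zeros.
        flat = [(g if g is not None else torch.zeros_like(p)).reshape(-1)
                for g, p in zip(grads, params)]
        rows.append(torch.cat(flat))
    J = torch.stack(rows)
    return J @ J.T

X = torch.linspace(-1.0, 1.0, 10).unsqueeze(1)   # collocation points
K_T = operator_ntk(X)
print(torch.linalg.eigvalsh(K_T)[-5:])           # five largest eigenvalues (ascending order)
```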

6. Extensions to Specialized Architectures and Dynamic Regimes

For recurrent networks, the Recurrent NTK (RNTK) extends NTK theory to sequential and variable-length inputs, preserving kernel stability in the infinite-width limit under suitable initialization (Alemohammad et al., 2020). In deep equilibrium (DEQ) models—effectively “infinite-depth” fixed-point networks with weight sharing—the NTK remains deterministic even as both width and depth go to infinity, provided mild contraction and regularity conditions hold. The NTK for DEQs can be efficiently computed via root-finding and fixed-point equations, bypassing degenerate or random kernel behavior (Feng et al., 2023).

In regimes with nontrivial optimization dynamics, such as training near or at the "edge of stability" (characterized by the NTK's largest eigenvalue oscillating near $2/\eta$), the NTK is highly dynamic. The eigenstructure shifts so that leading eigenvectors become more aligned with the targets, and the kernel's adaptation under gradient descent is crucial for efficient feature learning and improved generalization (Agarwala et al., 2022, Jiang et al., 17 Jul 2025).
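
The following sketch, using a toy full-batch setup with arbitrary hyperparameters, tracks the relevant curvature proxy during training: with a mean-squared-error loss over $n$ examples, the Gauss-Newton part of the Hessian has largest eigenvalue $(2/n)\,\lambda_{\max}(\Theta)$, which is the quantity compared against the $2/\eta$ threshold (the exact constant depends on the loss normalization used in a given paper).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def empirical_ntk(model, X):
    params = list(model.parameters())
    rows = []
    for i in range(X.shape[0]):
        out = model(X[i:i + 1]).squeeze()
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)
    return J @ J.T

n = 32
X, y = torch.randn(n, 8), torch.randn(n, 1)
model = nn.Sequential(nn.Linear(8, 128), nn.Tanh(), nn.Linear(128, 1))
eta = 0.02
opt = torch.optim.SGD(model.parameters(), lr=eta)

for step in range(501):
    if step % 100 == 0:
        lam_max = torch.linalg.eigvalsh(empirical_ntk(model, X))[-1].item()
        sharpness = 2.0 * lam_max / n            # Gauss-Newton curvature for MSE loss
        print(f"step {step:3d}  (2/n) lambda_max = {sharpness:7.3f}  threshold 2/eta = {2 / eta:.1f}")
    opt.zero_grad()
    ((model(X) - y) ** 2).mean().backward()
    opt.step()
```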

Under continual learning and task shifts, empirical evidence shows that the NTK can change abruptly at task boundaries, even in wide, lazy-trained models, contradicting the static NTK assumption. Reactivation of feature learning is observed as a spike in kernel velocity and a “drop and recovery” in NTK norms, indicating that continual adaptation of the NTK is a necessary mechanism for handling non-stationary data (Liu et al., 21 Jul 2025).
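
A simple version of the kernel-velocity diagnostic mentioned above is sketched below, on a toy pair of regression tasks with an abrupt switch halfway through training; everything about the setup (tasks, model, hyperparameters) is illustrative. The cited work reports a spike in this quantity at task boundaries even for wide networks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def empirical_ntk(model, X):
    params = list(model.parameters())
    rows = []
    for i in range(X.shape[0]):
        out = model(X[i:i + 1]).squeeze()
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)
    return J @ J.T

X = torch.randn(32, 8)
task_a = torch.sin(X[:, :1])            # task A targets
task_b = torch.cos(2 * X[:, 1:2])       # task B targets, switched in halfway through

model = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

K_prev = empirical_ntk(model, X)
for step in range(1, 401):
    y = task_a if step <= 200 else task_b   # task boundary at step 200
    opt.zero_grad()
    ((model(X) - y) ** 2).mean().backward()
    opt.step()
    if step % 50 == 0:
        K = empirical_ntk(model, X)
        velocity = (torch.linalg.norm(K - K_prev) / torch.linalg.norm(K_prev)).item()
        print(f"step {step:3d}  kernel velocity {velocity:.4f}")
        K_prev = K
```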

7. Limitations, Generalized Theories, and Practical Implications

While NTK theory provides a powerful lens for understanding overparameterized training dynamics, it has clear limitations:

  • It fails to predict the superior scaling exponents and sample efficiency of finite-width neural networks on realistic datasets, due to its inability to encode evolving data-dependent representations (Vyas et al., 2022).
  • Its static, target-independent nature in infinite-width settings is suboptimal compared to multiple kernel learning approaches that iteratively optimize data-dependent kernel weights, as shown for convex “gated ReLU” networks (Dwaraknath et al., 2023).
  • Many practical architectures and regimes (e.g., deep, narrow, or nonstationary networks; networks trained with label noise) operate beyond what can be accurately captured by NTK-based linearization.
  • Stability with respect to generalization is best understood and achieved when connecting algorithmic stability, shortest trajectory interpolants, and early stopping rather than solely relying on a kernel-induced regularization framework (Richards et al., 2021).

Future research directions include bridging the finite-width, feature-learning regime and the NTK regime through higher-order analysis (e.g., neural tangent hierarchy), studying task-adaptive NTK evolution, and developing refined theories for continual learning and non-stationary tasks that properly account for kernel reactivation and dynamic adaptation.


The synthesis above integrates the principal axes of NTK stability as revealed by the cited corpus, highlighting both the theoretical conditions for NTK constancy and the empirical/practical domains where adaptive kernel evolution is essential for modern deep learning phenomena.
