Lazy ODE Regime Overview
- The lazy ODE regime is defined by models whose parameters remain near initialization, so that training follows a linear ODE that approximates the full dynamics.
- It enables rapid, predictable convergence by leveraging kernel-like dynamics with explicit error bounds linked to scaling parameters.
- Despite its analytical strengths, the lazy regime limits adaptive feature learning, often resulting in suboptimal generalization in complex tasks.
The Lazy ODE Regime is a foundational concept characterizing systems, particularly in modern machine learning and optimization, where the evolution of parameters during training is dominated by linear, tangent, or kernel-like dynamics. In this regime, either by design (via explicit scaling) or in the limit of strong overparameterization, the system follows an ordinary differential equation (ODE) that tracks its linearization, i.e., its first-order Taylor approximation around initialization. The resulting training or optimization typically yields rapid, analytically tractable, and predictable convergence, but at the expense of adaptivity and expressive (nonlinear) feature learning. The lazy regime is pervasive across domains: from the training of neural networks and quantum circuits, to temporal-difference learning, continuous-time optimization, and even approaches to regularize or analyze non-local and high-dimensional ODEs.
1. Mathematical Structure and Theoretical Foundations
The lazy ODE regime is grounded in the observation that, under strong overparameterization and/or large scaling, a parameterized model $h(w)$ with loss $R(h(w))$ can be trained so that the parameter vector $w(t)$ remains close to its initialization $w_0$. In this setting, $h$ is well-approximated by its linearization:

$$\bar{h}(w) = h(w_0) + Dh(w_0)(w - w_0),$$

and the corresponding linearized objective is

$$\bar{F}(w) = R(\bar{h}(w)).$$

A key principle is that if a scaling parameter $\alpha > 0$ is introduced (e.g., by minimizing $F_\alpha(w) = \frac{1}{\alpha^2} R(\alpha h(w))$), the dynamics of $w(t)$ under gradient flow,

$$\dot{w}(t) = -\nabla F_\alpha(w(t)),$$

remain close to those of the linearized dynamics for all $t$, with explicit bounds of the form

$$\sup_{t \ge 0} \big\| \alpha h(w(t)) - \alpha \bar{h}(\bar{w}(t)) \big\| = O(1/\alpha).$$

In this regime, the Jacobian $Dh(w(t))$ is nearly constant, and the dynamics can often be recast as learning in a reproducing kernel Hilbert space (RKHS) with the "tangent kernel" (also known as the Neural Tangent Kernel, NTK):

$$K(x, x') = \big\langle \nabla_w h(w_0; x),\; \nabla_w h(w_0; x') \big\rangle.$$
Training is then equivalent to kernel regression in this RKHS.
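This correspondence can be checked numerically. Below is a minimal numpy sketch (illustrative only, not code from the cited papers): a small two-layer tanh model is trained alongside its linearization at initialization under the scaled loss $F_\alpha(w) = \frac{1}{2\alpha^2}\|\alpha(h(w) - h(w_0)) - y\|^2$ (the output is centered so it vanishes at initialization), and the gap to the linearized trajectory shrinks as $\alpha$ grows. All sizes and learning rates are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 5, 20                    # samples, input dim, hidden width (toy sizes)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def model_and_jacobian(w):
    """Two-layer tanh model h(w) on the batch X, with its Jacobian Dh(w)."""
    W1, w2 = w[:m * d].reshape(m, d), w[m * d:]
    T = np.tanh(X @ W1.T)                          # (n, m) hidden activations
    h = T @ w2 / np.sqrt(m)
    G = (1 - T**2) * w2 / np.sqrt(m)               # gating term for dh/dW1
    J = np.concatenate([(G[:, :, None] * X[:, None, :]).reshape(n, -1),
                        T / np.sqrt(m)], axis=1)   # (n, m*d + m)
    return h, J

def output_gap(alpha, steps=2000, lr=0.1):
    """Train full vs. linearized model; return the final output discrepancy."""
    w0 = rng.standard_normal(m * d + m)
    h0, J0 = model_and_jacobian(w0)
    w, wl = w0.copy(), w0.copy()
    for _ in range(steps):
        h, J = model_and_jacobian(w)
        r = alpha * (h - h0) - y                   # centered, scaled residual
        w -= lr * J.T @ r / alpha                  # grad of ||r||^2 / (2*alpha^2)
        rl = alpha * (J0 @ (wl - w0)) - y          # same flow, linearized model
        wl -= lr * J0.T @ rl / alpha
    h, _ = model_and_jacobian(w)
    return np.linalg.norm(alpha * (h - h0) - alpha * (J0 @ (wl - w0)))

for alpha in [1.0, 10.0, 100.0]:
    print(f"alpha = {alpha:6.1f}   deviation from linearized training: {output_gap(alpha):.4f}")
```

As $\alpha$ grows, the parameters move on a scale $O(1/\alpha)$ and the full trajectory shadows the linearized one.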
2. Criteria, Scaling, and Transition Mechanisms
The lazy regime does not arise solely from overparameterization, but rather from the relative scaling between the model, the parameterization, and the task. The critical control parameter for the onset of laziness is often a scale $\alpha$, the initialization variance $\sigma^2$, or a parameter linking feature evolution and network width.
Key quantitative criteria characterize the lazy regime. For the squared loss and a model $h$, laziness at initialization $w_0$ is governed by a ratio of the form

$$\kappa_h(w_0) = \|h(w_0) - y^\star\| \, \frac{\|D^2 h(w_0)\|}{\|Dh(w_0)\|^2}.$$

If $\kappa_h(w_0) \ll 1$, then a gradient step changes the model output significantly (as measured by the loss) but moves the tangent features only negligibly, confirming laziness.
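A scalar toy sketch of this criterion (assuming the $\alpha$-scaling with centered initialization, as in the lazy-training setup above; all numbers are arbitrary): one gradient step moves the output by an $\alpha$-independent relative amount, while the tangent features move by only $O(1/\alpha)$.

```python
import numpy as np

# Toy scalar model with centered, scaled output f(w) = alpha*(g(w) - g(w_init))
# and squared loss (f(w) - y)^2 / (2*alpha^2). One gradient step moves the
# output at an alpha-independent relative rate, but the tangent features
# (the gradient of g) move only O(1/alpha): the signature of laziness.
y, lr = 0.5, 1e-2
w_init = np.array([0.3, -0.7])

def g(w):
    return np.tanh(w[0] * w[1])

def grad_g(w):
    s = 1.0 - np.tanh(w[0] * w[1]) ** 2
    return np.array([s * w[1], s * w[0]])

g0 = g(w_init)
for alpha in [1.0, 10.0, 100.0]:
    w = w_init.copy()
    r = alpha * (g(w) - g0) - y                   # residual (= -y at init)
    delta = -(lr / alpha) * r * grad_g(w)         # one gradient step on the loss
    rel_out = abs(alpha * (g(w + delta) - g(w))) / abs(r)
    rel_feat = (np.linalg.norm(grad_g(w + delta) - grad_g(w))
                / np.linalg.norm(grad_g(w)))
    print(f"alpha={alpha:6.1f}  relative output change {rel_out:.2e}  "
          f"relative feature change {rel_feat:.2e}")
```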
The transition between lazy and rich learning is controlled by a scaling law of the form

$$\tilde{\alpha} = \alpha \sqrt{h},$$

where $h$ is the network width. For $\tilde{\alpha} \gg 1$, the network is lazy; for $\tilde{\alpha} \ll 1$, it is in the feature-learning regime (Geiger et al., 2019). Similar laws occur across architectures, reinforcement learning settings, and other models: for instance, in continual learning, a single scaling parameter controls this continuum, with one extreme corresponding to laziness (Graldi et al., 20 Jun 2025).
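One way to probe this transition numerically is to track how much the tangent kernel $K = Dh\,Dh^\top$ moves during training at fixed width while the output scale $\alpha$ is swept. The sketch below is in that spirit only (it is not the experiment of Geiger et al., and all sizes are toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def kernel_movement(m, alpha, steps=500, lr=0.1):
    """Relative change of the tangent kernel K = J J^T over training,
    for a width-m two-layer net with centered output scaled by alpha."""
    W, a = rng.standard_normal((m, d)), rng.standard_normal(m)

    def forward(W, a):
        T = np.tanh(X @ W.T)                          # (n, m)
        f = T @ a / np.sqrt(m)
        G = (1 - T**2) * a / np.sqrt(m)
        J = np.concatenate([(G[:, :, None] * X[:, None, :]).reshape(n, -1),
                            T / np.sqrt(m)], axis=1)
        return f, J

    f0, J0 = forward(W, a)
    K0 = J0 @ J0.T
    for _ in range(steps):
        f, J = forward(W, a)
        grad = J.T @ (alpha * (f - f0) - y) / alpha   # scaled-loss gradient
        W -= lr * grad[:m * d].reshape(m, d)
        a -= lr * grad[m * d:]
    _, J = forward(W, a)
    K = J @ J.T
    return np.linalg.norm(K - K0) / np.linalg.norm(K0)

for m, alpha in [(100, 0.2), (100, 20.0)]:            # alpha*sqrt(m): 2 vs 200
    print(f"width {m}, alpha {alpha:5.1f}: relative kernel movement "
          f"{kernel_movement(m, alpha):.4f}")
```

At small $\tilde{\alpha}$ the kernel moves substantially (features evolve); at large $\tilde{\alpha}$ it barely changes.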
3. Dynamics, Convergence, and Expressivity
In the lazy ODE regime, the system's trajectory follows that of a linear dynamical system. In the neural network context, this is the NTK gradient flow; in temporal-difference (TD) learning, it is the linearized TD update (Agazzi et al., 2019). In optimization, analogous reductions occur: e.g., gradient descent is described as a discretization of the gradient flow $\dot{w}(t) = -\nabla f(w(t))$ (Orvieto et al., 2019), and the difference between discrete and continuous time vanishes as the step size $\eta \to 0$ (supported via shadowing theory).
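A quick sketch of this vanishing gap (the quadratic objective and step sizes are arbitrary toy choices):

```python
import numpy as np

# Gradient descent as a discretization of the flow w' = -grad f(w): the gap
# to the continuous trajectory shrinks with the step size (cf. shadowing).
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # f(w) = 0.5 * w^T A w, strictly convex

def grad_f(w):
    return A @ w

w0, T = np.array([1.0, -1.0]), 2.0

def integrate(step):
    w = w0.copy()
    for _ in range(int(T / step)):
        w -= step * grad_f(w)
    return w

w_cont = integrate(1e-4)                  # fine Euler run as "ground truth" flow
for eta in [0.2, 0.1, 0.05, 0.01]:
    gap = np.linalg.norm(integrate(eta) - w_cont)
    print(f"step size {eta:5.2f}: distance to continuous flow {gap:.5f}")
```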
Convergence under lazy dynamics is generally exponential when the loss is locally strongly convex, with bounds of the form

$$F(w(t)) - \inf F \;\le\; e^{-ct}\,\big(F(w_0) - \inf F\big)$$

for a rate $c > 0$ set by the conditioning of the tangent kernel.
Global convergence is achieved for overparameterized systems (full-rank Jacobian $Dh(w_0)$), while only local or projected convergence is possible otherwise (Chizat et al., 2018, Agazzi et al., 2019).
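For the linearized model $f(w) = Jw$ with full-rank $K = JJ^\top$, gradient flow on $\frac{1}{2}\|Jw - y\|^2$ drives the residual as $r(t) = e^{-Kt} r(0)$, so $\|r(t)\| \le e^{-\lambda_{\min}(K)\,t}\,\|r(0)\|$. A minimal check, with a random $J$ standing in for $Dh(w_0)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 50                          # overparameterized: p >> n, K full rank
J = rng.standard_normal((n, p)) / np.sqrt(p)   # stand-in for Dh(w_0)
y = rng.standard_normal(n)
lam_min = np.linalg.eigvalsh(J @ J.T)[0]

w, dt, T = np.zeros(p), 0.01, 20.0
r0 = np.linalg.norm(J @ w - y)
for _ in range(int(T / dt)):
    w -= dt * J.T @ (J @ w - y)       # Euler step of the gradient flow
print(f"exp(-lambda_min * T) * |r(0)| = {np.exp(-lam_min * T) * r0:.3e}")
print(f"measured residual |r(T)|      = {np.linalg.norm(J @ w - y):.3e}")
```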
However, the expressivity in the lazy regime is fundamentally limited. In nonlinear models, training never deviates far from the initial tangent space, so the capacity to adapt to new features or to perform complex learning is restricted. For example, in overparameterized tensor decomposition, lazy regression can only approximate a random rank-one tensor of order $l$ in dimension $d$ using on the order of $d^{l-1}$ components (far more than the true rank), a requirement that improves to $O(d)$ components by escaping the lazy regime (Wang et al., 2020).
4. Practical Implications, Benefits, and Limitations
Empirical studies show that while lazy training simplifies analysis, it often degrades generalization, particularly in high-dimensional, highly structured tasks such as computer vision. For instance, in deep CNNs (e.g., VGG-11 or ResNet) on CIFAR-10, entering the lazy regime by increasing $\alpha$ or the initialization variance reduces test accuracy and "freezes" activations at initialization (Chizat et al., 2018).
The lazy regime is robust and supports rapid convergence, but it suppresses the network's ability to adaptively learn representations. Models trained lazily behave as kernel machines using fixed random features, a representation that is provably suboptimal for tasks demanding hierarchical or compositional features.
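The "kernel machine with fixed features" statement can be made concrete: training the linearized model to its least-norm interpolant yields exactly the tangent-kernel regression predictor. A minimal identity check, where random Gaussian features stand in for the tangent features $\nabla_w h(w_0; x)$ at train and test points (any fixed feature map gives the same identity):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_test, p = 20, 5, 400
J_tr = rng.standard_normal((n, p)) / np.sqrt(p)       # tangent features, train
J_te = rng.standard_normal((n_test, p)) / np.sqrt(p)  # tangent features, test
y = rng.standard_normal(n)

# Least-norm interpolating solution of the linearized model f(w) = J w
w = J_tr.T @ np.linalg.solve(J_tr @ J_tr.T, y)

# Kernel regression with the fixed tangent kernel K = J J^T
pred_kernel = (J_te @ J_tr.T) @ np.linalg.solve(J_tr @ J_tr.T, y)
print(np.allclose(J_te @ w, pred_kernel))             # True: identical predictor
```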
In reinforcement learning, the lazy regime supports stability (e.g., convergent TD-learning dynamics), but at the acknowledged cost of expressivity relative to the mean-field regime, which, while more difficult to analyze, permits global minimization and full use of the network's capacity (Agazzi et al., 2019).
5. Relation to Kernel Methods, Layerwise Linear Models, and Unifying Perspectives
The lazy ODE regime explicitly bridges differentiable programming and classical kernel methods. In the limit of infinite width and appropriate scaling, neural networks become equivalent to kernel ridge regression with the tangent kernel (Chizat et al., 2018, Geiger et al., 2019). This insight unifies the analysis of classical kernel machines and modern overparameterized networks.
Layerwise linear models serve as minimal settings that make the lazy dynamics analytically transparent (Nam et al., 28 Feb 2025). In such models, even multilayer architectures reduce to linear ODE evolution in the lazy regime, with the output parameter (e.g., $\beta$) following equations of the form

$$\dot{\beta}(t) = -c\,\big(\beta(t) - \beta^\star\big),$$

which directly admits exponential convergence, in contrast to the nonlinear, staged convergence and feature learning of the "rich" regime. Mixed regimes, where some modes are lazy and some are active, give rise to more intricate phase diagrams, exemplified by recent results for linear networks (Tu et al., 27 May 2024).
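A two-parameter example makes the contrast concrete: for $f = uv$ with squared loss, the product $\beta = uv$ obeys $\dot\beta = -(u^2 + v^2)(\beta - \beta^\star)$. With a large (centered) initialization, $u^2 + v^2$ stays nearly constant and $\beta$ relaxes exponentially; with a small initialization the same equation becomes the logistic-type $\dot\beta \approx -2\beta(\beta - \beta^\star)$, producing slow, staged growth. A sketch (toy values throughout):

```python
import numpy as np

# Two-layer scalar linear model f = u*v fitting target beta_star with
# squared loss; beta = u*v obeys beta' = -(u^2 + v^2)(beta - beta_star).
beta_star, dt, steps = 1.0, 1e-3, 20000

def run(u, v):
    f0, path = u * v, []                    # center the output at init
    for _ in range(steps):
        r = (u * v - f0) - beta_star        # residual of centered output
        u, v = u - dt * v * r, v - dt * u * r
        path.append(u * v - f0)
    return np.array(path)

lazy = run(4.0, 4.0)      # large init: u^2+v^2 ~ const, linear ODE, fast
rich = run(1e-3, 1e-3)    # small init: sigmoidal, staged feature growth
for t in [1000, 5000, 10000, 20000]:
    print(f"time {t * dt:5.1f}   lazy beta: {lazy[t-1]:.4f}   rich beta: {rich[t-1]:.6f}")
```

The lazy run converges exponentially almost immediately, while the rich run shows the characteristic plateau-then-growth staging.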
6. Extensions Beyond Neural Networks
The lazy ODE regime is a general paradigm and appears in a variety of non-neural contexts. In quantum machine learning, parameterized quantum circuits with geometrically local structure and large qubit numbers enter a lazy regime, admitting a linearized ODE reduction characterized by a time-independent tangent kernel (Abedi et al., 2022). In nonlocal PDEs, the "lazy regime" refers to reductions of complex, nonlocal equations into infinite systems of decoupled classical ODEs via spectral decomposition, admitting the use of Hamiltonians, Wronskians, and other ODE tools (Ao et al., 2019).
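The spectral-reduction idea is easy to illustrate (this is a generic fractional-heat example, not the specific equations studied in (Ao et al., 2019)): on a periodic grid, $u_t = -(-\Delta)^s u$ diagonalizes in Fourier space, so each mode $c_k$ solves the decoupled scalar ODE $\dot{c}_k = -|k|^{2s} c_k$ in closed form.

```python
import numpy as np

# Nonlocal evolution reduced to decoupled scalar ODEs via the spectrum:
# fractional heat equation u_t = -(-Laplacian)^s u on a periodic grid.
N, s, T = 128, 0.5, 0.1
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
u0 = np.exp(np.cos(x))                        # smooth periodic initial datum

k = np.fft.fftfreq(N, d=1.0 / N)              # integer wavenumbers
c0 = np.fft.fft(u0)
cT = c0 * np.exp(-np.abs(k) ** (2 * s) * T)   # each mode's ODE, solved exactly
uT = np.fft.ifft(cT).real

print("mean (k=0 mode) conserved:", np.isclose(uT.mean(), u0.mean()))
print(f"k=1 mode damped by factor {np.exp(-T):.4f}")
```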
In optimization theory, lazy ODE analysis underpins frameworks for understanding algorithmic convergence: e.g., in the $O(s^r)$-resolution ODE framework, the lazy regime corresponds to capturing only the leading-order ($O(1)$) behavior, which suffices for linear convergence under strongly monotone problems but is inadequate for more nuanced situations such as minimax or adversarial problems, where higher-order corrections are essential (Lu, 2020).
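A sketch of why the $O(1)$ level can mislead, mirroring the bilinear minimax example from that framework (the correction term below is derived for this toy problem, not quoted from the paper): for gradient descent-ascent on $f(x, y) = xy$ with step size $s$, the $O(1)$ ODE conserves $x^2 + y^2$ and predicts cycling, while the $O(s)$-resolution ODE adds a $+\frac{s}{2}(x, y)$ drift and predicts the norm growth $e^{st/2}$ actually seen in the iterates.

```python
import numpy as np

# Gradient descent-ascent on the bilinear minimax problem f(x, y) = x*y.
# O(1)-resolution ODE: x' = -y, y' = x (norm conserved, pure cycling).
# O(s)-resolution ODE adds +s/2 * (x, y), so |z(t)| grows like e^{s t / 2}.
s, n_steps = 0.1, 200
z = np.array([1.0, 0.0])
for _ in range(n_steps):
    x, y = z
    z = np.array([x - s * y, y + s * x])          # one GDA step

t = s * n_steps
print(f"O(1) ODE prediction for |z|: {1.0:.4f} (cycles, norm conserved)")
print(f"O(s) ODE prediction for |z|: {np.exp(s * t / 2):.4f}")
print(f"actual GDA iterate norm:     {np.linalg.norm(z):.4f}")
```

The discrete iterates diverge at exactly the rate the $O(s)$-resolution correction predicts, which the lazy $O(1)$ reduction misses entirely.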
7. Theoretical and Empirical Challenges; Transition to Feature Learning
While the lazy regime is mathematically appealing, recent research identifies its limitations and the necessity of moving beyond it for rich feature learning, improved generalization, and adaptive behavior (Chizat et al., 2018, Geiger et al., 2019, George et al., 2022, Graldi et al., 20 Jun 2025).
The transition from lazy to rich regimes, typically governed by scaling laws or other tuning parameters, is linked to major neural phenomena such as grokking, curriculum learning, and the trade-off between plasticity and stability in continual learning. Analyses reveal that in the lazy regime, learning is uniform and "example agnostic," while rich dynamics emerge when feature learning and alignment with the task structure are enabled, giving rise to curriculum-style prioritization and improved performance (George et al., 2022, Kumar et al., 2023, Graldi et al., 20 Jun 2025).
Summary Table: Distinctive Properties of the Lazy ODE Regime
| Aspect | Lazy ODE Regime | Feature Learning/Rich Regime |
|---|---|---|
| Training Dynamics | Linear, kernel-like, fixed features | Nonlinear, adaptive features |
| Convergence | Exponential, analytic | Non-uniform, staged/greedy |
| Expressivity | Limited to initial tangent space | Adapts to task structure |
| Generalization | Often suboptimal | Superior for high-dimensional tasks |
| Kernel Equivalence | Yes (NTK) | No (kernel evolves) |
| Implementation Robustness | High | Lower |
References
- (Chizat et al., 2018) for theoretical structure and convergence properties of lazy training in differentiable programming
- (Agazzi et al., 2019, Geiger et al., 2019) for scaling laws and distinction between lazy and feature-learning regimes
- (Abedi et al., 2022) for quantum analogues
- (Ao et al., 2019) for ODE reductions in non-local PDEs
- (Orvieto et al., 2019, Lu, 2020) for ODE perspectives in optimization algorithms
- (Wang et al., 2020) for tensor decomposition and the expressivity limit of lazy regimes
- (George et al., 2022, Kumar et al., 2023) for the impact on curriculum and grokking phenomena
- (Graldi et al., 20 Jun 2025) for implications in continual learning
The lazy ODE regime thus provides both a powerful analytic framework for understanding the early or overparameterized behavior of modern models and a lens through which to appreciate the necessity of escaping linearized dynamics for harnessing the full power of feature learning.