Lazy ODE Regime Overview

Updated 15 September 2025
  • The lazy ODE regime describes models whose parameters stay near initialization, so training follows a linear ODE that closely approximates the true dynamics.
  • It yields rapid, predictable convergence via kernel-like dynamics, with explicit error bounds governed by scaling parameters.
  • Despite its analytical tractability, the lazy regime suppresses adaptive feature learning and often yields suboptimal generalization on complex tasks.

The Lazy ODE Regime is a foundational concept characterizing systems, particularly in modern machine learning and optimization, in which the evolution of parameters during training or optimization is dominated by linear, tangent, or kernel-like dynamics. In this regime, either by design (via explicit scaling) or in the limit of strong overparameterization, the system follows an ordinary differential equation (ODE) that tracks its linearization, i.e., its first-order Taylor approximation around initialization. The resulting training or optimization typically yields rapid, analytically tractable, and predictable convergence, but at the expense of adaptivity and expressive (nonlinear) feature learning. The lazy regime is pervasive across domains: from the training of neural networks and quantum circuits, to temporal-difference learning, continuous-time optimization, and even approaches to regularize or analyze non-local and high-dimensional ODEs.

1. Mathematical Structure and Theoretical Foundations

The lazy ODE regime is grounded in the observation that, under strong overparameterization and/or large scaling, a parameterized model $h: \mathbb{R}^p \to F$ with loss $R: F \to \mathbb{R}_+$ can be trained so that the parameter vector $w$ remains close to its initialization $w_0$. In this setting, $h$ is well-approximated by its linearization:

$$\bar{h}(w) = h(w_0) + Dh(w_0)\,(w - w_0),$$

and the corresponding linearized objective is

$$\bar{F}(w) = R(\bar{h}(w)).$$

A key principle is that if a scaling parameter $\alpha \gg 1$ is introduced (e.g., by minimizing the rescaled objective $F_\alpha(w) = \frac{1}{\alpha^2} R(\alpha h(w))$), then the dynamics of $w_\alpha(t)$ under gradient flow,

$$\frac{d}{dt} w_\alpha(t) = -\nabla F_\alpha(w_\alpha(t)),$$

remain close to those of the linearized dynamics for all $t \in [0, T]$, with explicit bounds of the form:

$$\sup_{t \in [0, T]} \|w_\alpha(t) - w_0\| = O(1/\alpha), \qquad \sup_{t \in [0, T]} \|w_\alpha(t) - \bar{w}_\alpha(t)\| = O(1/\alpha^2).$$
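The $1/\alpha$ rate can be made plausible by a short chain-rule computation (a heuristic sketch, assuming the residual gradient $\nabla R(\alpha h(w))$ stays bounded over the horizon $[0, T]$):

$$\nabla F_\alpha(w) = \frac{1}{\alpha^2}\,\alpha\, Dh(w)^T \nabla R\big(\alpha h(w)\big) = \frac{1}{\alpha}\, Dh(w)^T \nabla R\big(\alpha h(w)\big),$$

so the gradient-flow velocity carries an explicit $1/\alpha$ factor and the parameters can travel only $O(T/\alpha)$ over a fixed time window, while the rescaled output $\alpha h(w_\alpha(t))$ evolves at a rate that is independent of $\alpha$ to leading order.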

In this regime, $Dh$ is nearly constant, and the dynamics can often be recast as learning in a reproducing kernel Hilbert space (RKHS) with the "tangent kernel" (also known as the Neural Tangent Kernel, NTK):

$$K(x, x') = Dh(w_0, x)\, Dh(w_0, x')^T.$$

Training is then equivalent to kernel regression in this RKHS.
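To make the kernel-regression correspondence concrete, here is a minimal numpy sketch (illustrative only; the toy two-layer model, sizes, and variable names are assumptions, not taken from the cited works). It builds the tangent features $Dh(w_0, x)$ of a small network, forms the tangent kernel, and checks that kernel regression in this RKHS and least-squares fitting of the linearized model give the same held-out predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, n_test = 3, 200, 20, 5              # input dim, width, train/test sizes

W0 = rng.normal(size=(m, d))                 # hidden weights at initialization
a0 = rng.normal(size=m)                      # output weights at initialization

X = rng.normal(size=(n, d))
X_test = rng.normal(size=(n_test, d))
y = np.sin(X @ np.ones(d))                   # synthetic regression targets

def model(W, a, Xb):
    """h(x; W, a) = a . tanh(W x) / sqrt(m)."""
    return (np.tanh(Xb @ W.T) @ a) / np.sqrt(m)

def tangent_features(W, a, Xb):
    """Rows are Dh(w0, x): the gradient of the output w.r.t. all parameters."""
    A = np.tanh(Xb @ W.T)                    # (batch, m) hidden activations
    dA = 1.0 - A**2                          # tanh'
    JW = (a * dA)[:, :, None] * Xb[:, None, :] / np.sqrt(m)   # d h / d W
    Ja = A / np.sqrt(m)                      # d h / d a
    return np.concatenate([JW.reshape(len(Xb), -1), Ja], axis=1)

Phi = tangent_features(W0, a0, X)            # (n, p) with p = m*d + m >> n
Phi_test = tangent_features(W0, a0, X_test)
K = Phi @ Phi.T                              # tangent kernel on the training set
K_test = Phi_test @ Phi.T

resid = y - model(W0, a0, X)                 # lazy training fits this residual

# (a) Kernel regression in the tangent RKHS
coef = np.linalg.solve(K, resid)
pred_kernel = model(W0, a0, X_test) + K_test @ coef

# (b) Minimum-norm least squares on the linearized model h(w0) + Dh(w0)(w - w0)
delta, *_ = np.linalg.lstsq(Phi, resid, rcond=None)
pred_linear = model(W0, a0, X_test) + Phi_test @ delta

print(np.max(np.abs(pred_kernel - pred_linear)))   # agree up to numerical error
```

The two predictors coincide because the minimum-norm linear fit lives in the row space of $\Phi$, which is exactly the span of the tangent-kernel representers.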

2. Criteria, Scaling, and Transition Mechanisms

The lazy regime does not arise solely from overparameterization, but rather from the relative scaling between the model, the parameterization, and the task. The critical control parameter for the onset of laziness is often a scale $\alpha$, the initialization variance $\tau^2$, or a parameter $\gamma_0$ linking feature evolution and network width.

Key quantitative criteria characterize the lazy regime. For the squared loss and a model $h$:

$$\kappa_h(w_0) = \|h(w_0) - y^*\| \cdot \frac{\|D^2 h(w_0)\|}{\|D h(w_0)\|^2}.$$

If $\kappa_h(w_0) \ll 1$, a gradient step changes the loss substantially while leaving the tangent features $Dh$ essentially unchanged, which is the hallmark of laziness.
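As a rough illustration (a sketch under assumed notation, not code from the cited papers), one can estimate $\kappa_h(w_0)$ for a small model with finite differences and observe that rescaling the output by $\alpha$ drives it toward zero, which is the onset of laziness. The Hessian norm is only lower-bounded here along a random direction, so the numbers are qualitative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 3, 30, 10
p = m * d + m                                 # total number of parameters

X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w0 = rng.normal(size=p)

def h(w):
    """Toy two-layer model as a function of the flattened parameter vector."""
    W, a = w[: m * d].reshape(m, d), w[m * d:]
    return (np.tanh(X @ W.T) @ a) / np.sqrt(m)

h_init = h(w0)
def h_centered(w):
    return h(w) - h_init                      # "unbiased" output: h_c(w0) = 0

def jac(f, w, eps=1e-5):
    """Finite-difference Jacobian of f: R^p -> R^n."""
    out = f(w)
    J = np.zeros((len(out), len(w)))
    for i in range(len(w)):
        e = np.zeros(len(w)); e[i] = eps
        J[:, i] = (f(w + e) - f(w - e)) / (2 * eps)
    return J

def kappa(alpha, eps=1e-3):
    f = lambda w: alpha * h_centered(w)
    J = jac(f, w0)
    u = rng.normal(size=p); u /= np.linalg.norm(u)
    # crude lower bound on ||D^2 f(w0)||: change of the Jacobian along one direction
    d2 = np.linalg.norm(jac(f, w0 + eps * u) - J, 2) / eps
    return np.linalg.norm(f(w0) - y) * d2 / np.linalg.norm(J, 2) ** 2

for alpha in (1.0, 10.0, 100.0):
    print(alpha, kappa(alpha))                # kappa shrinks roughly like 1/alpha
```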

The transition between lazy and rich learning is controlled by the scaling law:

$$\alpha^* = O(h^{-1/2}),$$

where $h$ is the network width. For $\sqrt{h}\,\alpha \gg 1$, the network is lazy; for $\sqrt{h}\,\alpha \ll 1$, it is in the feature-learning regime (Geiger et al., 2019). Similar laws occur across architectures, reinforcement learning settings, and other models: for instance, in continual learning, the parameter $\gamma_0$ controls this continuum, with $\gamma_0 \to 0$ corresponding to laziness (Graldi et al., 20 Jun 2025).
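A qualitative way to see this transition numerically (a toy sketch with assumed hyperparameters, not an experiment from the cited papers) is to train the same small network under the $\frac{1}{\alpha^2} R(\alpha h)$ scaling for several values of $\alpha$ and measure how far the tangent kernel moves from its value at initialization; the movement shrinks as $\alpha$ grows, signalling entry into the lazy regime.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 3, 100, 20
X = rng.normal(size=(n, d))
y = np.sin(X @ np.ones(d))
W_init, a_init = rng.normal(size=(m, d)), rng.normal(size=m)

def tangent_kernel(W, a):
    A = np.tanh(X @ W.T)
    dA = 1.0 - A**2
    JW = (a * dA)[:, :, None] * X[:, None, :] / np.sqrt(m)
    Ja = A / np.sqrt(m)
    Phi = np.concatenate([JW.reshape(n, -1), Ja], axis=1)
    return Phi @ Phi.T

def kernel_movement(alpha, steps=500, lr=0.01):
    W, a = W_init.copy(), a_init.copy()
    K0 = tangent_kernel(W, a)
    for _ in range(steps):
        A = np.tanh(X @ W.T)
        out = (A @ a) / np.sqrt(m)
        r = alpha * out - y                   # residual of the alpha-scaled model
        # gradient of F_alpha(w) = ||alpha*h(w) - y||^2 / (2*alpha^2)
        ga = A.T @ r / (np.sqrt(m) * alpha)
        gW = ((a * (1.0 - A**2)) * r[:, None]).T @ X / (np.sqrt(m) * alpha)
        a -= lr * ga
        W -= lr * gW
    K1 = tangent_kernel(W, a)
    return np.linalg.norm(K1 - K0) / np.linalg.norm(K0)

for alpha in (0.3, 1.0, 3.0, 10.0):
    print(alpha, kernel_movement(alpha))      # relative kernel change drops with alpha
```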

3. Dynamics, Convergence, and Expressivity

In the lazy ODE regime, the system's trajectory follows that of a linear dynamical system. In the neural network context, this is the NTK gradient flow; in temporal-difference (TD) learning, it is the linearized TD update (Agazzi et al., 2019). In optimization, analogous reductions occur: e.g., gradient descent is described as a discretization of $y'(t) = -\nabla f(y(t))$ (Orvieto et al., 2019), and the gap between the discrete and continuous-time trajectories vanishes as the step size $h \to 0$ (a result supported via shadowing theory).

Convergence under lazy dynamics is generally exponential when the loss is locally strongly convex:

$$\|V^* - \alpha V_{w(t)}\|_0^2 \leq \|V^* - \alpha V_{w(0)}\|_0^2 \exp\!\left(-\frac{1-\gamma}{2\kappa^2}\, t\right).$$

Global convergence is achieved for overparameterized systems (full-rank Jacobian $Dh(w_0)$), while only local or projected convergence is possible otherwise (Chizat et al., 2018, Agazzi et al., 2019).

However, the expressivity in the lazy regime is fundamentally limited. In nonlinear models, training never deviates far from the initial tangent space, so the capacity to adapt to new features or to perform complex learning is restricted. For example, in overparameterized tensor decomposition, lazy regression needs $m = \Omega(d^{l-1})$ components to approximate even a random rank-one tensor (far more than the true rank), whereas escaping the lazy regime reduces the requirement to $m = O^*(r^{2.5l}\log d)$ for rank-$r$ targets (Wang et al., 2020).

4. Practical Implications, Benefits, and Limitations

Empirical studies show that while lazy training simplifies analysis, it often degrades generalization, particularly in high-dimensional, highly structured tasks such as computer vision. For instance, in deep CNNs (e.g., VGG-11 or ResNet) on CIFAR-10, entering the lazy regime by increasing $\alpha$ or the initialization variance reduces test accuracy and "freezes" activations at initialization (Chizat et al., 2018).

The lazy regime is robust and supports rapid convergence, but it suppresses the network’s ability to adaptively learn representations. Models trained lazily behave as kernel machines using fixed random features, a representation that is provably suboptimal for tasks demanding hierarchical or compositional features.

In reinforcement learning, the lazy regime supports stability (e.g., converging TD learning dynamics), but at the acknowledged cost of expressivity versus the mean-field regime, which, while more difficult to analyze, allows global minimization and full utilization of network expressivity (Agazzi et al., 2019).

5. Relation to Kernel Methods, Layerwise Linear Models, and Unifying Perspectives

The lazy ODE regime explicitly bridges differentiable programming and classical kernel methods. In the limit of infinite width and appropriate scaling, neural networks become equivalent to kernel ridge regression with the tangent kernel (Chizat et al., 2018, Geiger et al., 2019). This insight unifies the analysis of classical kernel machines and modern overparameterized networks.

Layerwise linear models serve as minimal settings that make the lazy dynamics analytically transparent (Nam et al., 28 Feb 2025). In such models, even multilayer architectures reduce to linear ODE evolution in the lazy regime, with the output parameter (e.g., $\theta$) following equations of the form:

$$\frac{d\theta}{dt} = -\lambda\,(\theta - S)$$

which directly admits exponential convergence, in contrast to the nonlinear, staged convergence and feature learning of the "rich" regime. Mixed regimes—where some modes are lazy and some are active—give rise to more intricate phase diagrams, exemplified by recent results for linear networks (Tu et al., 27 May 2024).
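For concreteness, a minimal sketch (with arbitrary illustrative constants) integrates this lazy-regime ODE with forward Euler and compares it against the closed-form solution $\theta(t) = S + (\theta(0) - S)e^{-\lambda t}$, which makes the exponential convergence explicit:

```python
import numpy as np

lam, S, theta0 = 2.0, 1.5, -0.5               # illustrative constants
dt, T = 1e-3, 5.0
ts = np.arange(0.0, T, dt)

theta, trajectory = theta0, []
for _ in ts:
    trajectory.append(theta)
    theta += dt * (-lam * (theta - S))        # forward Euler step of d(theta)/dt

closed_form = S + (theta0 - S) * np.exp(-lam * ts)
print(np.max(np.abs(np.array(trajectory) - closed_form)))   # O(dt) discretization error
```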

6. Extensions Beyond Neural Networks

The lazy ODE regime is a general paradigm and appears in a variety of non-neural contexts. In quantum machine learning, parameterized quantum circuits with geometrically local structure and large qubit numbers enter a lazy regime, admitting a linearized ODE reduction characterized by a time-independent tangent kernel (Abedi et al., 2022). In nonlocal PDEs, the "lazy regime" refers to reductions of complex, nonlocal equations into infinite systems of decoupled classical ODEs via spectral decomposition, admitting the use of Hamiltonians, Wronskians, and other ODE tools (Ao et al., 2019).

In optimization theory, lazy ODE analysis underpins frameworks for understanding algorithmic convergence: e.g., in the $O(s^r)$-resolution ODE framework, the lazy regime corresponds to capturing only the leading-order ($O(1)$) behavior, which suffices for linear convergence under strongly monotone problems but is inadequate for more nuanced situations such as minimax or adversarial problems, where higher-order corrections are essential (Lu, 2020).

7. Theoretical and Empirical Challenges; Transition to Feature Learning

While the lazy regime is mathematically appealing, recent research identifies its limitations and the necessity of moving beyond it for rich feature learning, improved generalization, and adaptive behavior (Chizat et al., 2018, Geiger et al., 2019, George et al., 2022, Graldi et al., 20 Jun 2025).

The transition from lazy to rich regimes—typically governed by scaling laws or other tuning parameters—is linked to major neural phenomena such as grokking, curriculum learning, and the trade-off between plasticity and stability in continual learning. Analyses reveal that in the lazy regime, learning is uniform and “example agnostic,” while rich dynamics emerge when feature learning and alignment with the task structure are enabled, giving rise to curriculum-style prioritization and improved performance (George et al., 2022, Kumar et al., 2023, Graldi et al., 20 Jun 2025).

Summary Table: Distinctive Properties of the Lazy ODE Regime

| Aspect | Lazy ODE Regime | Feature Learning / Rich Regime |
| --- | --- | --- |
| Training dynamics | Linear, kernel-like, fixed features | Nonlinear, adaptive features |
| Convergence | Exponential, analytic | Non-uniform, staged/greedy |
| Expressivity | Limited to initial tangent space | Adapts to task structure |
| Generalization | Often suboptimal | Superior for high-dimensional tasks |
| Kernel equivalence | Yes (NTK) | No (kernel evolves) |
| Implementation robustness | High | Lower |

The lazy ODE regime thus provides both a powerful analytic framework for understanding the early or overparameterized behavior of modern models and a lens through which to appreciate the necessity of escaping linearized dynamics for harnessing the full power of feature learning.
