
The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams (2509.10167v1)

Published 12 Sep 2025 in cs.LG

Abstract: We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an arbitrary hidden width $M$, the training dynamics converges to a Neural Mean ODE training dynamics. Remarkably, the limit is independent of the scaling of $M$, covering practical cases of, say, Transformers, where $M$ (the number of hidden units or attention heads per layer) is typically of the order of $D$. For a residual scale $\Theta_D\big(\frac{\alpha}{LM}\big)$, we obtain the error bound $O_D\big(\frac{1}{L}+ \frac{\alpha}{\sqrt{LM}}\big)$ between the model's output and its limit after a fixed number of gradient steps, and we verify empirically that this rate is tight. When $\alpha=\Theta(1)$, the limit exhibits complete feature learning, i.e. the Mean ODE is genuinely non-linearly parameterized. In contrast, we show that $\alpha \to \infty$ yields a lazy ODE regime where the Mean ODE is linearly parameterized. We then focus on the particular case of ResNets with two-layer perceptron blocks, for which we study how these scalings depend on the embedding dimension $D$. We show that for this model, the only residual scale that leads to complete feature learning is $\Theta\big(\frac{\sqrt{D}}{LM}\big)$. In this regime, we prove the error bound $O\big(\frac{1}{L}+ \frac{\sqrt{D}}{\sqrt{LM}}\big)$ between the ResNet and its limit after a fixed number of gradient steps, which is also empirically tight. Our convergence results rely on a novel mathematical perspective on ResNets: (i) due to the randomness of the initialization, the forward and backward pass through the ResNet behave as the stochastic approximation of certain mean ODEs, and (ii) by propagation of chaos (that is, asymptotic independence of the units) this behavior is preserved through the training dynamics.

Summary

  • The paper establishes that deep ResNets converge to a Neural Mean ODE in the infinite-depth limit, independent of how the hidden width scales.
  • It provides non-asymptotic error bounds that combine a discretization error (O(1/L)) and a sampling error (O(α/√(LM))), validated through extensive numerical experiments.
  • Phase diagrams delineate regimes for complete feature learning versus lazy training, offering principled guidelines for hyperparameter selection in large-scale architectures.

Tight Error Bounds and Phase Diagrams for Deep ResNets

Overview and Motivation

This paper provides a rigorous analysis of the training dynamics of deep residual networks (ResNets) under standard random initializations, focusing on the interplay between depth ($L$), hidden width ($M$), and embedding dimension ($D$). The central result is that, as depth diverges ($L \to \infty$), the training dynamics of ResNets converge to a Neural Mean ODE, regardless of how $M$ scales with $D$; this independence is significant for practical architectures such as Transformers, where $M$ is typically of the order of $D$. The paper establishes tight non-asymptotic error bounds between finite ResNets and their infinite-depth/width limit, and characterizes the regimes of feature learning and lazy training via phase diagrams.

Figure 1: Forward pass (1D projection, fixed input) of trained ResNets with two-layer-perceptron blocks, varying depths $L$ and hidden width $M=1$. The red curve shows the corresponding forward pass for the limit model.

Theoretical Results: Error Bounds and Regimes

Neural Mean ODE Limit

The paper formalizes the convergence of ResNet training dynamics to a Neural Mean ODE as $L \to \infty$, with arbitrary scaling of $M$. The forward pass recursion for a ResNet is given by

$$\hat h^{\ell}(x,z) = \hat h^{\ell-1}(x,z) + \frac{\alpha}{LM} \sum_{i=1}^M \phi\big(\hat h^{\ell-1}(x,z), z^{i,\ell}\big),$$

where $\phi$ is a generic block (e.g., MLP, attention head).
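To make the recursion concrete, here is a minimal NumPy sketch of the forward pass, instantiating $\phi$ as a two-layer-perceptron (2LP) unit $\phi(h,(a,b)) = b\,\mathrm{ReLU}(a^\top h)$ (one of the block types considered in the paper); the i.i.d. Gaussian initialization and its normalization are illustrative assumptions, not necessarily the paper's exact setup.

```python
import numpy as np

# Minimal sketch of the finite-ResNet forward recursion
#   h^l = h^{l-1} + (alpha / (L*M)) * sum_{i=1}^M phi(h^{l-1}, z^{i,l}),
# with phi taken to be a two-layer-perceptron (2LP) unit
#   phi(h, (a, b)) = b * relu(a . h).
# The Gaussian initialization below is an illustrative assumption, not
# necessarily the paper's exact normalization.

def resnet_forward(x, A, B, alpha):
    """x: (D,) embedding; A, B: (L, M, D) inner/outer weights of the 2LP blocks."""
    L, M, _ = A.shape
    h = x.copy()
    for l in range(L):
        hidden = np.maximum(A[l] @ h, 0.0)             # (M,) hidden activations of the M units
        h = h + (alpha / (L * M)) * (B[l].T @ hidden)  # residual update with scale alpha/(L*M)
    return h

D, L, M, alpha = 16, 256, 4, 1.0
rng = np.random.default_rng(0)
x = rng.normal(size=D) / np.sqrt(D)
A = rng.normal(size=(L, M, D)) / np.sqrt(D)            # z^{i,l} = (a^{i,l}, b^{i,l}), i.i.d. Gaussian
B = rng.normal(size=(L, M, D)) / np.sqrt(D)
print(resnet_forward(x, A, B, alpha))
```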

The corresponding limit model is described by the ODE

$$\partial_s h(s,x,Z) = \alpha\, \mathbb{E}\big[\phi\big(h(s,x,Z), Z(s)\big)\big],$$

with $Z$ a stochastic process over depth.
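As a sketch of how such a limit can be simulated, the snippet below integrates the Mean ODE with an explicit Euler scheme, estimating the expectation over $Z(s)$ by Monte Carlo. The sampler `sample_Z` is a hypothetical stand-in for the law of the parameter process (for instance, its law after training); with the mean-zero Gaussian initialization used in the example, the drift is zero in expectation and the flow stays close to the input.

```python
import numpy as np

# Sketch: explicit-Euler integration of the Neural Mean ODE
#   d h(s) / d s = alpha * E[ phi(h(s), Z(s)) ],
# with the expectation over Z(s) estimated by Monte Carlo. `sample_Z` is a
# hypothetical stand-in for the law of the parameter process Z(s); phi is the
# same 2LP unit as in the previous sketch.

def mean_ode_euler(x, sample_Z, alpha, n_steps=200, n_mc=5_000):
    h = x.copy()
    ds = 1.0 / n_steps
    for k in range(n_steps):
        A, B = sample_Z(k * ds, n_mc)                   # n_mc samples of Z(s), each of shape (n_mc, D)
        drift = (B * np.maximum(A @ h, 0.0)[:, None]).mean(axis=0)
        h = h + ds * alpha * drift                      # Euler step of size 1/n_steps
    return h

# Under the mean-zero Gaussian initialization assumed here the drift is zero
# in expectation, so the limit forward pass stays close to x; a trained law
# of Z(s) would produce a nontrivial flow.
D = 16
rng = np.random.default_rng(0)
sample_init = lambda s, n: (rng.normal(size=(n, D)) / np.sqrt(D),
                            rng.normal(size=(n, D)) / np.sqrt(D))
print(mean_ode_euler(rng.normal(size=D) / np.sqrt(D), sample_init, alpha=1.0))
```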

Tight Error Bounds

The main quantitative result is a non-asymptotic error bound between the output of a finite ResNet and its Neural Mean ODE limit after $K$ gradient steps:

$$O_{D,K}\left(\frac{1}{L} + \frac{\alpha}{\sqrt{LM}}\right).$$

This bound consists of a depth-discretization error ($O(1/L)$, Euler method) and a sampling error ($O(\alpha/\sqrt{LM})$, Monte Carlo rate), with $LM$ acting as an effective width. Empirical results confirm the tightness of these rates.

Figure 2: Error on output vs depth $L$.
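As a rough, illustrative check of the sampling term, the toy experiment below measures the output gap of a randomly initialized finite ResNet. Under the mean-zero Gaussian initialization assumed in the earlier sketches, the Mean ODE drift vanishes and the limit forward pass reduces to $h(s) = x$, so the gap isolates a fluctuation expected to shrink like $1/\sqrt{LM}$; this is not a reproduction of the paper's experiments, which track the gap through training.

```python
import numpy as np

# Toy check of the sampling term: at a mean-zero Gaussian initialization the
# Mean ODE drift vanishes, so ||h^L - x|| is pure Monte-Carlo fluctuation and
# should scale like 1/sqrt(L*M). Illustrative only; the initialization and
# constants are assumptions.

def forward(x, A, B, alpha):
    L, M, _ = A.shape
    h = x.copy()
    for l in range(L):
        h = h + (alpha / (L * M)) * (B[l].T @ np.maximum(A[l] @ h, 0.0))
    return h

rng = np.random.default_rng(1)
D, M, alpha, n_trials = 16, 4, 1.0, 50
x = rng.normal(size=D) / np.sqrt(D)

for L in (64, 256, 1024):
    gaps = [np.linalg.norm(forward(x,
                                   rng.normal(size=(L, M, D)) / np.sqrt(D),
                                   rng.normal(size=(L, M, D)) / np.sqrt(D),
                                   alpha) - x)
            for _ in range(n_trials)]
    # sqrt(L*M) * gap should stay roughly constant as L grows.
    print(f"L={L:5d}  mean gap={np.mean(gaps):.5f}  sqrt(LM)*gap={np.sqrt(L*M)*np.mean(gaps):.3f}")
```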

Feature Learning vs Lazy Regimes

  • Feature Learning Regime ($\alpha = \Theta(1)$): The limit exhibits complete feature learning, with nonlinear parameterization and nontrivial evolution of features.
  • Lazy ODE Regime ($\alpha \to \infty$): The limit is linearly parameterized (Neural Tangent ODE), and the error bound becomes

$$O_{D,K}\left(\frac{1}{\alpha} + \frac{1}{L} + \frac{\alpha}{\sqrt{LM}}\right).$$

In this regime, parameter updates are suppressed, and feature learning is minimal.

Figure 3: Evolution of the square loss, indicating convergence of the Mean ODE model.

Explicit Scaling Laws for Two-Layer Perceptron Blocks

The analysis is extended to ResNets with two-layer perceptron (2LP) blocks, tracking explicit dependencies on $D$. The critical residual scale for complete feature learning is shown to be $\Theta\big(\sqrt{D}/(LM)\big)$. The error bound in this regime is

$$O\left(\frac{1}{L} + \frac{\sqrt{D}}{\sqrt{LM}}\right).$$

This scaling is validated both theoretically and empirically, and is shown to be necessary for nontrivial feature evolution in practical settings where $M \approx D$.

Figure 4: Fluctuations vs $\alpha$, illustrating the sampling error scaling with $\alpha$ and $D$.
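Reading off the sampling term $\sqrt{D}/\sqrt{LM}$, keeping it below a target accuracy $\epsilon$ requires an effective width $LM \gtrsim D/\epsilon^2$ (hidden constants ignored). A hypothetical helper making this back-of-the-envelope calculation explicit:

```python
import math

def min_effective_width(D: int, eps: float) -> int:
    """Smallest effective width L*M so that sqrt(D / (L*M)) <= eps.

    Back-of-the-envelope reading of the sampling term sqrt(D)/sqrt(L*M);
    hidden constants in the bound are ignored.
    """
    return math.ceil(D / eps**2)

# Example: D = 512 and a 5% sampling error suggest L*M on the order of 2e5,
# e.g. depth L ~ 400 at hidden width M = D = 512.
print(min_effective_width(512, 0.05))  # 204800
```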

Phase Diagrams and Hyperparameter Implications

The paper synthesizes these results into phase diagrams that classify the regimes of ResNet training as a function of residual scale, depth, width, and embedding dimension. The only residual scale that leads to complete feature learning for arbitrary architectures with $D = O(LM)$ is $\Theta\big(\sqrt{D}/(LM)\big)$. Larger scales yield lazy training, while smaller scales result in semi-complete regimes with limited feature diversity.
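As a hypothetical illustration of how such a phase diagram could be consulted in practice, the helper below classifies the regime of a 2LP-block ResNet from its residual prefactor $\beta$ (writing the residual scale as $\beta/(LM)$), with $\beta \asymp \sqrt{D}$ as the critical value; the factor-of-ten cut-offs are arbitrary placeholders, since the regimes are only characterized up to $\Theta(\cdot)$ scalings.

```python
import math

def classify_regime(beta: float, D: int) -> str:
    """Classify the training regime of a 2LP-block ResNet with residual scale beta/(L*M).

    Follows the phase-diagram reading above: beta ~ sqrt(D) gives complete
    feature learning, much larger beta gives lazy training, much smaller beta
    a semi-complete regime. The factor-of-10 cut-offs are illustrative only.
    """
    critical = math.sqrt(D)
    if beta >= 10 * critical:
        return "lazy (linearized Mean ODE)"
    if beta <= critical / 10:
        return "semi-complete (limited feature diversity)"
    return "complete feature learning"

print(classify_regime(beta=8.0, D=64))    # near sqrt(64) -> complete feature learning
print(classify_regime(beta=800.0, D=64))  # far above the critical scale -> lazy
```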

The analysis provides principled guidance for hyperparameter selection in large-scale architectures, including Transformers, and clarifies the trade-offs between depth, width, and initialization scale.

Mathematical Techniques

The proofs leverage stochastic approximation theory and propagation of chaos, showing that the forward and backward passes through a randomly initialized ResNet behave as stochastic approximations of mean ODEs, and that this behavior is preserved through training. The error analysis combines discretization and sampling errors, and is robust to various block types (MLP, attention).
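To convey the stochastic-approximation mechanism in isolation, the generic toy below runs a noisy Euler-type recursion with $L$ steps of size $1/L$ and compares its endpoint to the corresponding mean ODE; it is a textbook illustration of the averaging effect, not the paper's ResNet construction.

```python
import numpy as np

# Generic stochastic-approximation toy: the noisy recursion
#   x_l = x_{l-1} + (1/L) * (xi_l - x_{l-1}),   xi_l ~ N(mu, 1) i.i.d.,
# tracks the mean ODE  dx/dt = mu - x  on [0, 1], whose value at time 1 is
#   x(1) = mu + (x0 - mu) * exp(-1).
# The RMS gap between the recursion endpoint and x(1) combines an Euler
# discretization error O(1/L) with a Monte-Carlo fluctuation O(1/sqrt(L)).

rng = np.random.default_rng(0)
mu, x0, n_trials = 2.0, 0.0, 200
ode_value = mu + (x0 - mu) * np.exp(-1.0)

for L in (10, 100, 1000, 10000):
    xi = rng.normal(mu, 1.0, size=(n_trials, L))   # fresh noise for each step and trial
    x = np.full(n_trials, x0)
    for l in range(L):
        x += (xi[:, l] - x) / L                    # one noisy Euler step, all trials at once
    rms = np.sqrt(np.mean((x - ode_value) ** 2))
    print(f"L={L:6d}  RMS gap to mean-ODE value = {rms:.4f}")
```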

Empirical Validation

Extensive numerical experiments confirm the theoretical predictions, demonstrating tight agreement between observed errors and theoretical bounds across a range of LL, MM, DD, and residual scales. The experiments also illustrate the regularity of the limit dynamics and the transition between feature learning and lazy regimes.

Implications and Future Directions

The results have direct implications for the design and scaling of deep architectures, especially in settings where width and depth are both large but not infinite. The identification of $LM$ as the effective width and the explicit scaling laws for feature learning provide actionable guidelines for practitioners. The phase diagrams unify previous mean-field and Neural ODE analyses, and clarify the conditions under which feature learning is possible.

Theoretically, the work suggests further exploration of stochastic approximation methods for other architectures and training algorithms, and motivates the study of dynamical phenomena beyond initialization. Practically, the scaling laws and error bounds can inform automated hyperparameter tuning and architecture search in large-scale deep learning.

Conclusion

This paper establishes tight non-asymptotic error bounds for the convergence of deep ResNets to their Neural Mean ODE limit, identifies the critical scaling laws for feature learning, and provides phase diagrams that organize the hyperparameter space. The results are validated both theoretically and empirically, and have significant implications for the design and analysis of large-scale neural architectures. The mathematical framework developed here is broadly applicable and sets the stage for further advances in the theory and practice of deep learning.
