
Depth-muP: Scaling Deep Neural Networks

Updated 25 December 2025
  • Depth-muP is a scaling framework that extends maximal update parametrization to both infinite-width and infinite-depth regimes, ensuring nontrivial feature learning.
  • It prescribes precise initialization and learning rate scaling rules that yield Super Consistency and reliable hyperparameter transfer across various deep residual architectures.
  • The framework identifies and addresses feature learning collapse in multi-layer blocks with depth-aware corrections, offering a principled path to robust model optimization.

Depth-muP refers to the “depth-extended Maximal Update Parametrization” (depth-μP), a formal scaling rule for neural network initialization and learning-rate parametrization designed to ensure nontrivial feature learning in deep, wide networks as both depth $L$ and width $n$ diverge. The depth-muP framework generalizes and extends the μP (“maximal update parametrization”) approach, which originated for width scaling, into the depth direction, especially for residual architectures. The depth-muP regime enables remarkable invariance properties of the Hessian eigenstructure (“Super Consistency”), supports reliable hyperparameter transfer across model scales, and reveals structural mechanisms behind the breakdown of feature learning in ultra-deep architectures. This makes depth-muP a central object of study for understanding scaling laws and optimization stability in deep neural architectures (Yao et al., 24 Dec 2025, Noci et al., 27 Feb 2024).

1. Maximal Update Parametrization and Depth-muP Formalism

The original μP rule prescribes a scaling of weight matrices and learning rates in standard neural networks such that the “feature learning” regime is maintained as the width $n \to \infty$. In residual networks with $L$ blocks, the depth-muP parametrization prescribes:

  • Each residual branch weight is initialized as $W^{(\ell)} \sim \mathcal{N}(0, I)$ and enters the forward pass as $W^{(\ell)}/\sqrt{nL}$.
  • The readout layer uses $W^{(L)}/\sqrt{n}$.
  • In SGD, a base learning rate $\eta_0$ is scaled with width: $\eta_{\text{eff}} = \eta_0\, n$.

The update for any block weight is thus (for pre-activation ResNets with single-layer blocks):

  • Forward: $h_{\ell+1} = h_{\ell} + \frac{1}{\sqrt{n}\sqrt{L}}\, W^{(\ell)}\phi(h_{\ell})$
  • Backward update: $\Delta W^{(\ell)} = -(\eta_{0}/\sqrt{n})\cdot (1/\sqrt{L})\cdot \partial \mathcal{L}/\partial W^{(\ell)}$

This scaling guarantees that all hidden representations, gradients, and weight updates remain $\mathcal{O}(1)$ as both $n, L \to \infty$ (Noci et al., 27 Feb 2024).
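
As a concrete illustration, the following is a minimal PyTorch sketch of this parametrization for a residual MLP with single-layer blocks. The module structure, the input embedding, and all names are illustrative assumptions, not code from the cited papers; only the initialization, the $1/\sqrt{nL}$ branch factor, the $1/\sqrt{n}$ readout factor, and the width-scaled learning rate follow the rules quoted above.

```python
import math
import torch
import torch.nn as nn

class DepthMuPResNet(nn.Module):
    """Residual MLP with single-layer blocks under depth-muP scaling (a sketch).

    Branch weights are drawn from N(0, 1) and each branch output is multiplied
    by 1/sqrt(n*L); the readout is divided by sqrt(n), as described above.
    """
    def __init__(self, n: int, L: int, d_in: int, d_out: int):
        super().__init__()
        self.n, self.L = n, L
        self.embed = nn.Linear(d_in, n, bias=False)   # input layer (assumption)
        self.blocks = nn.ModuleList([nn.Linear(n, n, bias=False) for _ in range(L)])
        self.readout = nn.Linear(n, d_out, bias=False)
        for W in self.blocks:
            nn.init.normal_(W.weight, mean=0.0, std=1.0)      # W^(l) ~ N(0, I)
        nn.init.normal_(self.readout.weight, mean=0.0, std=1.0)

    def forward(self, x):
        h = self.embed(x)
        for W in self.blocks:
            # h_{l+1} = h_l + W^(l) phi(h_l) / sqrt(n L)
            h = h + W(torch.relu(h)) / math.sqrt(self.n * self.L)
        return self.readout(h) / math.sqrt(self.n)            # readout / sqrt(n)

n, L, eta_0 = 512, 64, 0.1
model = DepthMuPResNet(n=n, L=L, d_in=32, d_out=10)
# SGD learning rate scaled with width, eta_eff = eta_0 * n, per the rule above.
# (A full μP setup would give the input and readout layers their own scalings;
#  here all parameters share one group for brevity.)
opt = torch.optim.SGD(model.parameters(), lr=eta_0 * n)
```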

2. Neural Feature Dynamics and Infinite-Depth Limit

In the joint infinite-width, infinite-depth regime ($n, L \to \infty$), the dynamics of feature learning in residual networks under depth-muP can be described by a pair of coupled stochastic differential equations (SDEs) for the hidden features $h_t$ and the backpropagated gradients $g_t$. Specifically, in the continuum limit the layer index $\ell$ is rewritten as a time variable ($t = \ell\,\Delta t$ with $\Delta t = T/L$), yielding:

  • Forward SDE: $dh_t = \sqrt{\mathbb{E}[\phi(h_t)^2]}\, dW_t$
  • Backward SDE: $dg_t = \sqrt{\mathbb{E}[\phi'(h_t)^2\, \mathbb{E}[(g_t)^2]]}\, dB_t$

Training under SGD is described by the evolution of these SDEs, whose coupled forward–backward process captures an emergent nonlinear, stochastic feature-learning regime that is unreachable in the NTK (Neural Tangent Kernel) “lazy training” limit. The depth-muP scaling also restores the so-called Gradient Independence Assumption (GIA) in the $L \to \infty$ limit, even though it fails at any finite $L$ (Yao et al., 24 Dec 2025).
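
As a toy illustration of the forward SDE (my own sketch, not code from the cited papers), the snippet below integrates $dh_t = \sqrt{\mathbb{E}[\phi(h_t)^2]}\, dW_t$ with an Euler–Maruyama step of size $\Delta t = T/L$, estimating the expectation by an average over an ensemble of independent paths that stands in for the width average.

```python
import numpy as np

def simulate_forward_sde(L=1024, T=1.0, n_paths=4096, phi=np.tanh, seed=0):
    """Euler-Maruyama integration of dh_t = sqrt(E[phi(h_t)^2]) dW_t.

    E[phi(h_t)^2] is estimated by an average over the ensemble of paths,
    standing in for the width average of the infinite-width limit.
    """
    rng = np.random.default_rng(seed)
    dt = T / L                              # layer index as time: t = l * dt
    h = rng.standard_normal(n_paths)        # h_0 ~ N(0, 1), an illustrative choice
    for _ in range(L):
        scale = np.sqrt(np.mean(phi(h) ** 2))
        h = h + scale * np.sqrt(dt) * rng.standard_normal(n_paths)
    return h

h_T = simulate_forward_sde()
print("E[h_T^2] ≈", float(np.mean(h_T ** 2)))   # second moment stays O(1) in L
```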

3. Super Consistency and Hyperparameter Transfer

An essential empirical property of depth-muP is “Super Consistency”: the invariance of critical spectral properties of the loss landscape, including the Hessian sharpness (largest eigenvalue), to width and depth scaling. Under depth-muP, the learning rate required for stability and rapid convergence can be chosen on a small model and transferred almost unchanged to very large models. The spectra of both the Hessian and the Neural Tangent Kernel remain consistent as $(n, L)$ vary, resulting in robust optimization dynamics and providing a structural explanation for the success of deep, wide neural networks in practice (Noci et al., 27 Feb 2024).

Key findings include:

  • The sharpness $\lambda_{\max}$ trajectory during training collapses onto a universal curve across different $(n, L)$.
  • The “edge-of-stability” threshold $2/\eta_0$ determines stable training for a wide range of depths and widths.
  • The learning rate $\eta_0$ tuned on a small model remains optimal (within 10%) as both $n$ and $L$ scale up.

In contrast, the NTK scaling does not exhibit this property: sharpness and learning rates drift with $n$ and $L$, and no feature learning persists in the infinite-width limit (Noci et al., 27 Feb 2024).
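
One way to probe Super Consistency in practice is to track the sharpness $\lambda_{\max}$ during training and compare it with the edge-of-stability threshold $2/\eta_0$. The sketch below (an illustrative recipe, not the measurement code of Noci et al.) estimates $\lambda_{\max}$ by power iteration on Hessian-vector products; the small model and data at the end are placeholders just to make the recipe runnable.

```python
import torch
import torch.nn.functional as F

def sharpness(loss_fn, params, n_iter=20):
    """Estimate the largest Hessian eigenvalue by power iteration on
    Hessian-vector products (double backprop)."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    lam = 0.0
    for _ in range(n_iter):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))      # grad . v
        hv = torch.autograd.grad(gv, params, retain_graph=True)  # H v
        lam = sum((h * vi).sum() for h, vi in zip(hv, v)).item() # Rayleigh quotient
        v = [h.detach() for h in hv]
    return lam

# Placeholder model and data for illustration only.
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
x, y = torch.randn(128, 16), torch.randn(128, 1)
eta_0 = 0.1
lam = sharpness(lambda: F.mse_loss(model(x), y), [p for p in model.parameters()])
print(f"lambda_max ≈ {lam:.3f}; edge-of-stability threshold 2/eta_0 = {2 / eta_0:.1f}")
```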

4. Mechanism for Feature-Learning Collapse in Multi-Layer Residual Blocks

While depth-muP preserves feature learning and hyperparameter invariance for single-layer residual blocks, it breaks down in multi-layer blocks (as used in modern architectures, e.g., Transformers). For a two-layer residual block under standard depth-muP scaling:

  • The first layer’s representation $x_\ell$ is updated by an amount $\mathcal{O}(1/\sqrt{L})$ after each SGD step, causing updates to vanish as $L \to \infty$.
  • This leads to “feature-learning collapse” in the first internal layer, while the second layer can retain nontrivial learning dynamics.

This structural collapse explains empirical failures of hyperparameter transfer and training stagnation in deep residual architectures built from multi-layer blocks when naive depth-muP scaling is used (Yao et al., 24 Dec 2025).
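
The shrinkage can be checked numerically. The sketch below is my own toy setup: the exact placement of the width and depth factors inside the block is one plausible convention, not necessarily the papers’ convention, and the loss is a toy quadratic. It builds a single two-layer block, takes one SGD step with the same learning rate in both layers, and measures the relative change of the first internal layer’s output for several depths $L$.

```python
import math
import torch
import torch.nn as nn

def first_layer_feature_change(n=256, L=64, eta_0=0.1, seed=0):
    """Relative per-step change of the first internal layer's output x in a
    two-layer residual block after a single plain-SGD step."""
    torch.manual_seed(seed)
    W1 = nn.Linear(n, n, bias=False); nn.init.normal_(W1.weight, std=1.0)
    W2 = nn.Linear(n, n, bias=False); nn.init.normal_(W2.weight, std=1.0)
    h = torch.randn(32, n)

    def block(h):
        x = W1(torch.relu(h)) / math.sqrt(n)                  # first internal layer
        out = h + W2(torch.relu(x)) / math.sqrt(n * L)        # branch scaled by 1/sqrt(nL)
        return out, x

    out, x_before = block(h)
    out.pow(2).mean().backward()                              # toy loss
    with torch.no_grad():
        W1.weight -= eta_0 * W1.weight.grad                   # same lr for both layers
        W2.weight -= eta_0 * W2.weight.grad
    _, x_after = block(h)
    return ((x_after - x_before).norm() / x_before.norm()).item()

for L in (16, 64, 256):
    print(L, first_layer_feature_change(L=L))   # shrinks roughly like 1/sqrt(L)
```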

5. Depth-Aware Learning Rate Correction for Multi-Layer Blocks

To counteract the vanishing feature learning in the first layer of multi-layer residual blocks, a simple block-wise learning-rate correction is introduced:

  • In a two-layer block, set the learning rate of the first layer to $\eta_{\ell,1} = \eta_0 \sqrt{L}$ and leave the second layer’s learning rate at $\eta_0$.
  • With this correction, the effective update $\Delta x_\ell$ remains $\mathcal{O}(1)$ as $L \to \infty$, thereby restoring nontrivial feature learning (see the sketch below).
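
A minimal way to implement this correction in PyTorch is via optimizer parameter groups, as in the sketch below. The block structure and names (`TwoLayerBlock`, `W1`, `W2`) are illustrative assumptions; only the relative learning rates follow the rule above.

```python
import math
import torch
import torch.nn as nn

class TwoLayerBlock(nn.Module):
    """Residual block with two internal layers (illustrative structure)."""
    def __init__(self, n, L):
        super().__init__()
        self.n, self.L = n, L
        self.W1 = nn.Linear(n, n, bias=False)   # first internal layer
        self.W2 = nn.Linear(n, n, bias=False)   # second internal layer
        for W in (self.W1, self.W2):
            nn.init.normal_(W.weight, std=1.0)

    def forward(self, h):
        x = self.W1(torch.relu(h)) / math.sqrt(self.n)
        return h + self.W2(torch.relu(x)) / math.sqrt(self.n * self.L)

n, L, eta_0 = 256, 64, 0.1
blocks = nn.ModuleList([TwoLayerBlock(n, L) for _ in range(L)])

# Depth-aware correction: the first layer of each block trains at eta_0 * sqrt(L),
# the second layer at the base rate eta_0.
param_groups = []
for b in blocks:
    param_groups.append({"params": b.W1.parameters(), "lr": eta_0 * math.sqrt(L)})
    param_groups.append({"params": b.W2.parameters(), "lr": eta_0})
opt = torch.optim.SGD(param_groups, lr=eta_0)
```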

Empirical evaluation confirms that with the depth-aware correction:

  • The per-step change of the first-layer feature becomes independent of $L$.
  • The same base learning rate $\eta_0$ aligns test curves across $L$ (restoring hyperparameter transfer).
  • Deep networks with two-layer blocks (e.g., a 64-layer ResNet) achieve improved test accuracy and lower test loss compared to depth-muP without the correction (Yao et al., 24 Dec 2025).

6. Implications and Practical Significance

Depth-muP’s design principles and the attendant learning-rate corrections for multi-layer blocks unify the treatment of wide and deep scaling, providing a mathematically principled path to robust, scalable architecture design. The restoration of feature learning, Super Consistency, and hyperparameter transfer is critical for practical large-scale model design, especially as architectures become deeper or more complex. Depth-muP provides an avenue to analytically tractable joint scaling regimes with provable properties, in contrast to heuristically motivated or NTK-based initialization schemes.

A plausible implication is that as architectures increase in complexity, careful attention to scaling—particularly depth-dependent learning-rate adjustments—will be required to maintain both feature learning and reliable optimization performance. In summary, depth-muP delineates a precise framework to extend the benefits of maximal update parametrization to very deep models, exposes limitations in naïve scaling for complex residual architectures, and motivates principled corrections that are empirically validated on realistic benchmarks (Yao et al., 24 Dec 2025, Noci et al., 27 Feb 2024).
