Depth-muP: Scaling Deep Neural Networks
- Depth-muP is a scaling framework that extends maximal update parametrization to both infinite-width and infinite-depth regimes, ensuring nontrivial feature learning.
- It prescribes precise initialization and learning rate scaling rules that yield Super Consistency and reliable hyperparameter transfer across various deep residual architectures.
- The framework identifies and addresses feature learning collapse in multi-layer blocks with depth-aware corrections, offering a principled path to robust model optimization.
Depth-muP refers to the “depth-extended Maximal Update Parametrization” (depth-μP), a formal scaling rule for neural network initialization and learning-rate parametrization designed to ensure nontrivial feature learning in deep, wide networks as both depth and width diverge. The depth-muP framework generalizes and extends the μP (“maximal update parametrization”) approach, which originated for width scaling, into the depth direction, especially for residual architectures. The depth-muP regime enables remarkable invariance properties for Hessian eigenstructure (“Super Consistency”), supports reliable hyperparameter transfer across model scales, and reveals structural mechanisms behind the breakdown of feature learning in ultra-deep architectures. This makes depth-muP a central object of study for understanding scaling laws and optimization stability in deep neural architectures (Yao et al., 24 Dec 2025, Noci et al., 27 Feb 2024).
1. Maximal Update Parametrization and Depth-muP Formalism
The original μP rule prescribes a scaling of weight-matrix initializations and learning rates in standard neural networks such that the "feature learning" regime is maintained as the width n → ∞. In residual networks with L blocks of width n, the depth-muP parametrization prescribes:
- Each residual branch weight W^l is initialized with i.i.d. entries of variance 1/n (the μP hidden-layer initialization) and enters the forward pass through a 1/√L branch multiplier.
- The readout layer uses the μP output scaling (a 1/n multiplier).
- In SGD, the base learning rate is rescaled with the width n following the μP rule, so that per-step feature updates remain of order one as n → ∞.
The update for any block weight W^l is thus (for pre-activation ResNets with single-layer blocks):
- Forward: x^{l+1} = x^l + (1/√L) W^l φ(x^l)
- Backward update (SGD): ΔW^l = −(η/√L) δ^{l+1} φ(x^l)^⊤, where δ^{l+1} = ∂L/∂x^{l+1} is the backpropagated gradient at the block output.
This scaling guarantees that hidden representations, backpropagated gradients, and per-step feature updates all remain nontrivial (Θ(1) in the appropriate norm) as both n → ∞ and L → ∞ (Noci et al., 27 Feb 2024).
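The parametrization above can be made concrete in a few lines. The following is a minimal sketch, assuming a plain pre-activation residual MLP trained with SGD; the class and variable names (DepthMuPResNet, width, depth) are illustrative and not taken from the cited papers, and the width-dependent learning-rate rescaling is left to whichever μP convention is in use.

```python
# Minimal sketch (illustrative names, not from the cited papers) of depth-muP
# scaling for a residual network with single-layer blocks.
import math
import torch
import torch.nn as nn

class DepthMuPResNet(nn.Module):
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.width, self.depth = width, depth
        # Residual branch weights: i.i.d. entries with variance 1/width (muP hidden init).
        self.blocks = nn.ModuleList(
            nn.Linear(width, width, bias=False) for _ in range(depth)
        )
        for blk in self.blocks:
            nn.init.normal_(blk.weight, std=1.0 / math.sqrt(width))
        # Readout: muP output scaling, applied as a 1/width multiplier in forward().
        self.readout = nn.Linear(width, 1, bias=False)
        nn.init.normal_(self.readout.weight, std=1.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            # Pre-activation residual branch with the 1/sqrt(L) depth multiplier.
            x = x + blk(torch.relu(x)) / math.sqrt(self.depth)
        return self.readout(x) / self.width
```

Because the 1/√L factor sits inside the forward pass, an ordinary SGD step on this module realizes the update ΔW^l = −(η/√L) δ^{l+1} φ(x^l)^⊤ written above (up to the width-dependent learning-rate rescaling, which the sketch omits).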
2. Neural Feature Dynamics and Infinite-Depth Limit
In the joint infinite-width, infinite-depth regime (n, L → ∞), the dynamics of feature learning in residual networks under depth-muP can be described by a pair of coupled stochastic differential equations (SDEs) for the hidden features x^l and the backpropagated gradients δ^l. Specifically, in the continuum limit the layer index l is rewritten as a depth-time variable t = l/L ∈ [0, 1] (with step dt = 1/L, so the 1/√L branch multiplier plays the role of √dt), yielding:
- A forward SDE governing the evolution of the features x_t along depth, driven by the (random) block weights and their accumulated updates.
- A backward SDE governing the evolution of the backpropagated gradients δ_t, coupled to the forward process.
Training under SGD drives the evolution of this coupled forward–backward process, which captures the emergent nonlinear, stochastic feature-learning regime that is unreachable in the NTK (Neural Tangent Kernel) "lazy training" limit. The depth-muP scaling also restores the so-called Gradient Independence Assumption (GIA) in the limit, even though it fails at any finite depth (Yao et al., 24 Dec 2025).
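A quick way to see the depth-continuum behavior numerically is to check that forward-pass statistics stabilize as L grows when the 1/√L multiplier is used. The snippet below is a hedged sanity check of my own (not an experiment from the cited papers); the width, the list of depths, and the mean-absolute-feature statistic are arbitrary choices.

```python
# Sanity check (not from the cited papers): under the 1/sqrt(L) residual scaling,
# the layer index behaves like time with step dt = 1/L, so feature statistics at
# "time" t = 1 should converge to a depth-independent value as L grows.
import math
import torch

torch.manual_seed(0)
width = 512
x0 = torch.randn(width)  # O(1) input features

for depth in [8, 32, 128, 512]:
    x = x0.clone()
    for _ in range(depth):
        W = torch.randn(width, width) / math.sqrt(width)  # N(0, 1/n) block weight
        x = x + (W @ torch.relu(x)) / math.sqrt(depth)    # sqrt(dt)-scaled increment
    print(f"L = {depth:4d}   mean |x_i| at t = 1: {x.abs().mean().item():.3f}")
# The printed statistic approaches a fixed value as L increases, the discrete
# counterpart of evaluating the forward SDE at t = 1.
```

Repeating the same loop with a 1/L (or no) branch multiplier makes the statistic collapse to the input (or blow up) with depth, which is why the 1/√L choice is the one admitting a nontrivial depth-continuum limit.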
3. Super Consistency and Hyperparameter Transfer
An essential empirical property of depth-muP is "Super Consistency": the invariance of critical spectral properties of the loss landscape, including the Hessian sharpness (largest eigenvalue), to width and depth scaling. Under depth-muP, the learning rate required for stability and rapid convergence can be chosen on a small model and transferred almost unchanged to very large models. The spectra of both the Hessian and the Neural Tangent Kernel remain consistent as n and L vary, resulting in robust optimization dynamics and providing a structural explanation for the success of deep, wide neural networks in practice (Noci et al., 27 Feb 2024).
Key findings include:
- The sharpness trajectory during training collapses onto a universal curve across different widths n and depths L.
- The "edge-of-stability" sharpness threshold (approximately 2/η for gradient descent) governs stable training for a wide range of depths and widths.
- A learning rate tuned on a small model remains optimal (within 10%) as both n and L scale up.
In contrast, the NTK scaling does not exhibit this property: sharpness and optimal learning rates drift with n and L, and no feature learning persists in the infinite-width limit (Noci et al., 27 Feb 2024).
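Super Consistency is a statement about the Hessian spectrum, so it can be probed directly. Below is a hedged sketch (my own measurement recipe, not the protocol of Noci et al.) that estimates the sharpness, i.e. the top Hessian eigenvalue, via power iteration on Hessian-vector products; comparing the returned values across models of different (n, L) trained with the same transferred learning rate is the kind of check the Super Consistency claim suggests.

```python
# Hedged sketch: estimate the loss sharpness (largest Hessian eigenvalue) with
# power iteration on Hessian-vector products, to compare across widths/depths.
import torch

def top_hessian_eigenvalue(loss: torch.Tensor, params, iters: int = 30) -> float:
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Start from a random unit-norm direction in parameter space.
    v = [torch.randn_like(p) for p in params]
    vnorm = torch.sqrt(sum((u ** 2).sum() for u in v))
    v = [u / vnorm for u in v]
    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiate <grad(loss), v> w.r.t. the parameters.
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        hv_norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (hv_norm + 1e-12) for h in hv]
        eig = hv_norm.item()  # for unit-norm v, ||Hv|| converges to |lambda_max|
    return eig
```

Under depth-muP, sharpness trajectories measured this way should approximately overlap across (n, L) and sit near the edge-of-stability level set by the (transferred) learning rate.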
4. Mechanism for Feature-Learning Collapse in Multi-Layer Residual Blocks
While depth-muP preserves feature learning and hyperparameter invariance for single-layer residual blocks, it breaks down in multi-layer blocks (as used in modern architectures, e.g., Transformers). For a two-layer residual block under standard depth-muP scaling:
- After each SGD step, the first layer's representation changes only by an amount that is suppressed by an additional depth-dependent factor, so its updates vanish as L → ∞.
- This leads to "feature-learning collapse" in the first internal layer, while the second layer retains nontrivial learning dynamics.
This structural collapse explains empirical failures of hyperparameter transfer and training stagnation in deep, multi-layered residual architectures when using naive depth-muP scaling (Yao et al., 24 Dec 2025).
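The collapse can be probed even in a toy setting. The sketch below (my own diagnostic with arbitrary names and a quadratic toy loss, not the experiment of Yao et al.) takes a single SGD step on a two-layer-block residual net under naive depth-muP and reports how much the first layer of a block moves the features as L increases.

```python
# Toy diagnostic (not the papers' experiment): one SGD step on a residual net with
# two-layer blocks under naive depth-muP, measuring the per-step change of the
# first-layer features for increasing depth L.
import math
import torch
import torch.nn as nn

def make_two_layer_block_net(width: int, depth: int):
    blocks = nn.ModuleList()
    for _ in range(depth):
        w1 = nn.Linear(width, width, bias=False)
        w2 = nn.Linear(width, width, bias=False)
        nn.init.normal_(w1.weight, std=1.0 / math.sqrt(width))
        nn.init.normal_(w2.weight, std=1.0 / math.sqrt(width))
        blocks.append(nn.ModuleDict({"w1": w1, "w2": w2}))
    readout = nn.Linear(width, 1, bias=False)
    nn.init.normal_(readout.weight, std=1.0)
    return blocks, readout

def forward(blocks, readout, x, width, depth):
    for blk in blocks:
        x = x + blk["w2"](torch.relu(blk["w1"](x))) / math.sqrt(depth)
    return readout(x) / width

width, lr = 256, 1.0
x_probe = torch.randn(1, width)
for depth in [4, 16, 64]:
    torch.manual_seed(0)
    blocks, readout = make_two_layer_block_net(width, depth)
    first_layer = blocks[0]["w1"]
    features_before = first_layer(x_probe).detach().clone()
    loss = forward(blocks, readout, x_probe, width, depth).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        for p in list(blocks.parameters()) + list(readout.parameters()):
            p -= lr * p.grad  # one vanilla SGD step, same base lr for every layer
    features_after = first_layer(x_probe).detach()
    delta = (features_after - features_before).norm().item()
    print(f"L = {depth:3d}   per-step change of first-layer features: {delta:.2e}")
# Under naive depth-muP this change shrinks as L grows (first-layer collapse);
# the depth-aware correction of Section 5 is designed to keep it L-independent.
```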
5. Depth-Aware Learning Rate Correction for Multi-Layer Blocks
To counteract the vanishing feature learning in the first layer of multi-layer residual blocks, a simple block-wise learning-rate correction is introduced (a wiring sketch follows this list):
- In a two-layer block, scale the first layer's learning rate up by a depth-dependent factor that cancels the suppression, while the second layer keeps the standard depth-muP learning rate.
- With this correction, the effective first-layer update remains Θ(1) as L → ∞, thereby restoring nontrivial feature learning.
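In an implementation, the correction amounts to giving the first layer of each block its own optimizer parameter group. The sketch below reuses the two-layer-block layout from the previous section; the depth-dependent boost is exposed as a first_layer_scale argument rather than hard-coded, since its precise value comes from the analysis in (Yao et al., 24 Dec 2025) and is not asserted by this sketch.

```python
# Sketch: block-wise learning-rate correction via SGD parameter groups.
# `first_layer_scale` stands for the depth-dependent factor from the analysis;
# it is a placeholder argument, not a value claimed by this sketch.
import torch

def make_corrected_sgd(blocks, readout, base_lr: float, first_layer_scale: float):
    first_layer_params, standard_params = [], []
    for blk in blocks:  # blocks laid out as in the two-layer-block sketch above
        first_layer_params += list(blk["w1"].parameters())  # boosted learning rate
        standard_params += list(blk["w2"].parameters())      # standard depth-muP rate
    standard_params += list(readout.parameters())
    return torch.optim.SGD([
        {"params": first_layer_params, "lr": base_lr * first_layer_scale},
        {"params": standard_params, "lr": base_lr},
    ])
```

With the groups set up this way, the same base_lr can be tuned on a shallow model and reused at larger depths, which is the hyperparameter-transfer behavior reported in the findings below.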
Empirical evaluation confirms that with the depth-aware correction:
- The per-step change in the first-layer feature becomes independent of the depth L.
- The same base learning rate aligns test curves across depths L (restoring hyperparameter transfer).
- Deep networks with two-layer blocks (e.g., a 64-layer ResNet) achieve improved test accuracy and lower test loss compared to depth-muP without the correction (Yao et al., 24 Dec 2025).
6. Implications and Practical Significance
Depth-muP’s design principles and the attendant learning-rate corrections for multi-layer blocks unify the treatment of wide and deep scaling, providing a mathematically principled path to robust, scalable architecture design. The restoration of feature learning, Super Consistency, and hyperparameter transfer are critical for practical large-scale model design—especially as architectures become deeper or more complicated. Depth-muP provides an avenue to analytically tractable joint scaling regimes with provable properties, in contrast to heuristically motivated or NTK-based initialization schemes.
A plausible implication is that as architectures increase in complexity, careful attention to scaling—particularly depth-dependent learning-rate adjustments—will be required to maintain both feature learning and reliable optimization performance. In summary, depth-muP delineates a precise framework to extend the benefits of maximal update parametrization to very deep models, exposes limitations in naïve scaling for complex residual architectures, and motivates principled corrections that are empirically validated on realistic benchmarks (Yao et al., 24 Dec 2025, Noci et al., 27 Feb 2024).