Depth-wise Hyperparameter Transfer

Updated 2 October 2025
  • Depth-wise hyperparameter transfer is a methodology that ensures optimal hyperparameters remain invariant as both network depth and width scale, using specific scaling of residual branches.
  • It leverages parameterizations like Depth-μP and CompleteP to maintain stable training dynamics, ensuring consistent learning rates and feature updates across deep networks.
  • Empirical studies in computer vision, language modeling, and scientific ML demonstrate that tuning on a small base model can be effectively transferred to larger architectures, reducing costly re-tuning.

Depth-wise hyperparameter transfer comprises a set of principles, algorithms, and parameterizations that enable hyperparameters—such as learning rate, momentum, weight decay, and initialization variances—tuned on a “base” model to transfer effectively across neural network depth. Recent research advances show that, when networks are equipped with specific scaling rules (notably the Maximal Update Parametrization μP and extensions), optimal hyperparameters remain invariant as models scale in depth, not just width. This property enables practitioners to “tune small and train large” without the otherwise prohibitive cost of re-tuning as model size expands in modern deep learning scenarios.

1. Fundamental Concepts of Depth-wise Hyperparameter Transfer

Depth-wise hyperparameter transfer is defined by the invariance of optimal hyperparameters as both the width and the depth of a neural network are scaled. In standard parameterizations, the optimal settings for learning rate and related hyperparameters can vary dramatically when the number of layers is increased, necessitating nontrivial re-tuning. The μP parameterization, originally developed for width-wise transfer, ensures a layer-wise maximal feature update scale and thus cross-width transferability; however, it does not guarantee transfer across depth. The theoretical breakthrough, initiated by (Bordelon et al., 2023, Yang et al., 2023, Noci et al., 27 Feb 2024, Bordelon et al., 4 Feb 2025) and further refined in (Dey et al., 2 May 2025), was to introduce depth-wise scaling of residual branches, typically of the form $1/\sqrt{L}$ or $1/L$ for $L$ layers, in combination with μP, allowing for depth-wise transfer. In models such as ResNets, Vision Transformers, and deep linear networks, this scaling has been shown to preserve the training dynamics and loss landscape curvature, supporting stable hyperparameter transfer.
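
As a concrete illustration, the following PyTorch sketch shows one way such a depth-scaled residual branch could be written. The class and helper names (ResidualBlock, make_residual_stack, alpha, rule) are illustrative rather than taken from any of the cited papers, and the MLP block stands in for the ResNet/Transformer blocks discussed above.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A pre-norm residual block whose branch output is scaled by a
    depth-dependent multiplier alpha (e.g. 1/sqrt(L) or 1/L)."""
    def __init__(self, width: int, alpha: float):
        super().__init__()
        self.norm = nn.LayerNorm(width)
        self.fc = nn.Linear(width, width)
        self.alpha = alpha  # depth-dependent residual multiplier

    def forward(self, x):
        # h_{l+1} = h_l + alpha * f(h_l): the 1/sqrt(L) (or 1/L) factor keeps
        # the cumulative contribution of L blocks O(1) as L grows.
        return x + self.alpha * self.fc(torch.relu(self.norm(x)))

def make_residual_stack(width: int, depth: int, rule: str = "depth-mup") -> nn.Sequential:
    # "depth-mup" uses a 1/sqrt(L) multiplier; "completep" uses 1/L.
    alpha = depth ** -0.5 if rule == "depth-mup" else 1.0 / depth
    return nn.Sequential(*[ResidualBlock(width, alpha) for _ in range(depth)])
```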

2. Optimal Parameterizations for Depth-wise Transfer

The key technical innovation in depth-wise transfer is the precise scaling of the residual branch and optimizer parameters. “Depth-μP” (Yang et al., 2023) sets the residual branch multiplier to $a/\sqrt{L}$ for $L$ blocks and adjusts the learning rate scaling for block parameters: SGD uses a constant learning rate, while Adam uses $1/\sqrt{L}$. Alternatively, “CompleteP” (Dey et al., 2 May 2025) applies $L^{-1}$ scaling, shown to yield “complete feature learning” and prevent lazy learning in all layers. For Fourier Neural Operators, scaling the kernel integral operator by $1/\sqrt{d \log K}$ (where $K$ is the number of Fourier modes and $d$ the PDE dimension) enables transfer across $K$, analogous to layer depth (Li et al., 24 Jun 2025).

| Parameterization | Residual Scaling | Learning Rate Scaling |
|---|---|---|
| μP | width only | varies with width (and sometimes depth) |
| Depth-μP | $a/\sqrt{L}$ | SGD: constant; Adam: $1/\sqrt{L}$ |
| CompleteP | $L^{-1}$ | invariant across depth |
| μTransfer-FNO | $1/\sqrt{d \log K}$ | $\propto 1/\sqrt{d \log K}$ |

These scaling schemes guarantee that, as $L$ changes, the effective scale of each parameter update and the sharpness (largest Hessian eigenvalue) of the loss surface remain $O(1)$, satisfying the “super consistency” property described in (Noci et al., 27 Feb 2024).
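
The following sketch illustrates how the depth rows of the table translate into learning-rate rescaling when moving from a tuned base depth to a deeper target. The function transfer_lr and its arguments are hypothetical helpers, not code from the cited papers, and only the Depth-μP and CompleteP rules are encoded.

```python
def transfer_lr(base_lr: float, base_depth: int, target_depth: int,
                rule: str, optimizer: str = "adam") -> float:
    """Rescale a learning rate tuned at base_depth to a deeper model.

    Illustrative rules following the table above:
      - depth-mup + Adam : lr scales as 1/sqrt(L) -> multiply by sqrt(L_base / L_target)
      - depth-mup + SGD  : constant in depth
      - completep        : invariant across depth
    """
    if rule == "depth-mup" and optimizer == "adam":
        return base_lr * (base_depth / target_depth) ** 0.5
    if rule == "completep" or (rule == "depth-mup" and optimizer == "sgd"):
        return base_lr
    raise ValueError(f"unknown rule/optimizer combination: {rule}/{optimizer}")

# Example: an Adam learning rate of 3e-3 tuned on an 8-block model,
# transferred to a 64-block model under Depth-muP.
print(transfer_lr(3e-3, base_depth=8, target_depth=64, rule="depth-mup"))  # ~1.06e-3
```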

3. Theoretical Framework: DMFT and Loss Landscape Consistency

Research demonstrates, through dynamical mean-field theory (DMFT), that depth-wise scaling induces a well-defined infinite-width and infinite-depth limit in neural network training dynamics (Bordelon et al., 2023, Bordelon et al., 4 Feb 2025). The principal order parameters—feature kernels and gradient kernels—converge as $L \to \infty$ if residual branches are scaled by $1/\sqrt{L}$. This scaling ensures nontrivial evolution of internal features (feature learning) instead of “lazy” dynamics, where updates remain linearized and effective nonlinear feature acquisition is lost (Dey et al., 2 May 2025).
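
A simple empirical proxy for distinguishing feature learning from lazy dynamics is to measure how much the feature (Gram) kernel of a hidden layer moves over training. The sketch below, with illustrative helper names, computes this relative movement; it is only a rough diagnostic and not the DMFT order parameter itself.

```python
import torch

@torch.no_grad()
def feature_kernel(features: torch.Tensor) -> torch.Tensor:
    # Feature (Gram) kernel K = Phi Phi^T / width for a batch of hidden
    # representations Phi of shape (batch, width).
    return features @ features.T / features.shape[1]

@torch.no_grad()
def kernel_movement(phi_init: torch.Tensor, phi_trained: torch.Tensor) -> float:
    """Relative change of the feature kernel between initialization and the
    end of training: values near 0 indicate lazy dynamics, while O(1) values
    indicate genuine feature learning."""
    k0, k1 = feature_kernel(phi_init), feature_kernel(phi_trained)
    return (torch.norm(k1 - k0) / torch.norm(k0)).item()
```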

The “super consistency” property, empirically validated in (Noci et al., 27 Feb 2024), asserts that the spectrum of the Hessian—especially the largest eigenvalue dictating step size stability—remains invariant with width and depth when μP and its depth-extensions are used. Mathematically,

$$L(\theta_{t+1}) - L(\theta_t) \leq -\left(\eta - \frac{\eta^2 \lambda_t}{2}\right)\|\nabla L(\theta_t)\|^2,$$

and when $\lambda_t$ (the Hessian's top eigenvalue) is scale-invariant, so are the stability window and the optimal learning rate.
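
In practice, this can be probed by estimating the top Hessian eigenvalue via power iteration on Hessian-vector products and checking that it stays roughly constant across depths. The sketch below is a generic estimator; the function name and closure-based interface are assumptions for illustration, not code from the cited work.

```python
import torch

def top_hessian_eigenvalue(loss_fn, params, n_iter: int = 20) -> float:
    """Estimate the largest Hessian eigenvalue (sharpness) by power iteration
    on Hessian-vector products; loss_fn is a closure that recomputes the loss."""
    params = [p for p in params if p.requires_grad]
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(n_iter):
        # Normalize the probe vector across all parameter tensors.
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        loss = loss_fn()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params)          # Hessian-vector product
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig
```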

4. Empirical Evidence and Impact on Compute Efficiency

Experimental results across computer vision (CIFAR-10, ImageNet), language modeling (BERT, GPT-3, Megatron), and scientific learning (FNOs for PDEs) confirm that, with proper parametrization, hyperparameters (notably the learning rate, Adam $\varepsilon$, and initialization scales) tuned on shallow (or reduced-resolution) models transfer efficiently to deep (or high-resolution) models (Yang et al., 2022, Bordelon et al., 2023, Li et al., 24 Jun 2025, Dey et al., 2 May 2025). CompleteP yields compute-optimal configurations with up to 34.4% FLOP savings over μP for very deep transformers (Dey et al., 2 May 2025), and μTransfer-FNO achieves stable accuracy at large scale with only 30% of the tuning compute required by baseline approaches (Li et al., 24 Jun 2025). In deep ResNets, applying $1/\sqrt{L}$ scaling yields hyperparameter invariance for both learning rate and momentum across depths (Bordelon et al., 2023).

5. Applications Across Architectures and Domains

Depth-wise hyperparameter transfer has direct applications in:

  • Computer vision: deep ResNets and Vision Transformers trained on datasets such as CIFAR-10 and ImageNet.
  • Language modeling: large transformer models in the style of BERT, GPT-3, and Megatron, where depth is scaled aggressively.
  • Scientific machine learning: Fourier Neural Operators for PDEs, where the number of Fourier modes plays a role analogous to depth.

6. Limitations and Open Research Directions

Multiple studies note limitations. In architectures with block depth > 1 (e.g., transformers, multi-layer-per-block ResNets), Depth-μP and similar scaling prescriptions encounter fundamental constraints: feature diversity across layers diminishes, and hyperparameter transfer becomes brittle (Yang et al., 2023). Future research may focus on developing new parameterizations for multi-layer blocks, refining DMFT to address discrete-time SGD with momentum, and discovering scaling rules for architectures with nontrivial layer interactions. In applied contexts, further work is needed to extend empirical and theoretical guarantees to other models and data regimes (e.g., multi-fidelity tasks or attention-based modules).

7. Practical Implementation Considerations

For practical adoption, practitioners should:

  • Apply depth-scaled residual branch multipliers (e.g., $L^{-1}$ for CompleteP or $1/\sqrt{L}$ for Depth-μP) throughout the model, and rescale optimizer parameters and initialization variances according to the corresponding scaling rules ((Dey et al., 2 May 2025), Table 1); a minimal sketch follows this list.
  • Tune hyperparameters on a shallow base model, then transfer the settings to deep models as-is, with no additional costly search.
  • Exploit open-source implementations, such as the mup package for PyTorch, which automates μP and depth-wise scaling (Yang et al., 2022).
  • Monitor avoidance of the lazy-learning regime and maintenance of feature diversity across layers as diagnostics of successful transfer.
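
A minimal end-to-end sketch of this workflow, using a CompleteP-style $1/L$ residual multiplier in plain PyTorch rather than the mup package, might look as follows. The class name, widths, depths, and the learning rate shown are illustrative placeholders, not values from the cited papers.

```python
import torch
import torch.nn as nn

class ResidualStack(nn.Module):
    """Residual MLP stack with a 1/L (CompleteP-style) branch multiplier;
    use depth ** -0.5 instead for a Depth-muP-style setup."""
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.alpha = 1.0 / depth
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(width), nn.Linear(width, width), nn.GELU())
            for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            x = x + self.alpha * block(x)
        return x

# 1. Tune hyperparameters on a shallow proxy, e.g. a learning-rate sweep over
#    ResidualStack(width=256, depth=4).
best_lr = 3e-3  # hypothetical result of the small-scale sweep

# 2. Train the deep target with the same settings: under CompleteP-style
#    scaling the optimum is expected to transfer across depth unchanged.
deep_model = ResidualStack(width=256, depth=64)
optimizer = torch.optim.AdamW(deep_model.parameters(), lr=best_lr, weight_decay=0.1)
```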

Depth-wise hyperparameter transfer thus constitutes a rigorous framework—underpinned by theoretical analysis and broad empirical validation—for scaling modern neural networks in a compute-efficient and reliable manner. The ongoing extension of these principles to new architectures, domains, and scaling limits is a central topic of current deep learning research.
