
Don't be lazy: CompleteP enables compute-efficient deep transformers (2505.01618v2)

Published 2 May 2025 in cs.LG and cs.AI

Abstract: We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as learning rate) across changes in model depth, requiring practitioners to either re-tune these HPs as they scale up (expensive), or accept sub-optimal training when re-tuning is prohibitive. Even when they achieve HP transfer, we develop theory to show parameterizations may still exist in the lazy learning regime where layers learn only features close to their linearization, preventing effective use of depth and nonlinearity. Finally, we identify and adopt the parameterization we call CompleteP that achieves both depth-wise HP transfer and non-lazy learning in all layers. CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts. Moreover, CompleteP enables 12-34% compute efficiency improvements over the prior state-of-the-art.

Summary

  • The paper demonstrates CompleteP's key contribution: hyperparameter transfer across model depths, which supports tune-small, train-large strategies.
  • The paper shows that CompleteP achieves significant compute efficiency, with up to 34.4% FLOP savings for deep models compared to previous parameterizations.
  • The paper provides theoretical justification, proving that only α = 1 guarantees stable initialization, maximal residual updates, and complete feature learning.

LLMs achieve better performance with increased scale, but training larger models is computationally expensive. Finding optimal hyperparameters (HPs) for these large models is challenging and costly. The paper addresses this by studying parameterizations, which are rules for adjusting model and optimizer HPs as model size (width and depth) changes, aiming for HP transferability across scales and improved compute efficiency.

Existing parameterizations such as the standard parameterization (SP) and the maximal update parameterization (µP) primarily focus on scaling width, and they struggle to maintain optimal HPs and training stability when scaling model depth. The authors investigate depth-scaling extensions of µP, indexed by a parameter α ∈ [0.5, 1]. These parameterizations rescale the output of each residual block F_ℓ by a factor of L^{-α} before adding it to the residual stream: h^{ℓ+1} = h^ℓ + L^{-α} · F_ℓ(h^ℓ).
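As a concrete illustration, here is a minimal PyTorch sketch of such a depth-scaled residual block (the class name, sub-block contents, and hyperparameters are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class DepthScaledBlock(nn.Module):
    """Residual block whose update is rescaled by L^{-alpha}.

    num_layers is the total depth L; alpha in [0.5, 1] selects the
    parameterization (alpha = 1 corresponds to CompleteP).
    """
    def __init__(self, width: int, num_layers: int, alpha: float = 1.0):
        super().__init__()
        self.ln = nn.LayerNorm(width)
        # Stand-in for an attention or MLP sub-block F_l
        self.mlp = nn.Sequential(
            nn.Linear(width, 4 * width),
            nn.GELU(),
            nn.Linear(4 * width, width),
        )
        self.residual_scale = num_layers ** (-alpha)  # L^{-alpha}

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h^{l+1} = h^l + L^{-alpha} * F_l(h^l)
        return h + self.residual_scale * self.mlp(self.ln(h))
```

With α = 1, each block's contribution shrinks as 1/L, so stacking more blocks keeps the total residual-stream update at a stable scale.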

The paper identifies a unique parameterization, called CompleteP, which corresponds to setting α = 1. The authors demonstrate both empirically and theoretically that CompleteP offers significant advantages over other values of α and over prior parameterizations such as SP and µP.

Key Findings and Contributions:

  1. Depth-wise Hyperparameter Transfer: The authors empirically evaluate HP transferability across model depths (L) while keeping width (N) fixed. They find that SP, µP, and α = 0.5 fail to maintain stable optimal base learning rates (η_base) and initialization standard deviations (σ_base) as depth increases: HPs tuned on a shallower base model yield suboptimal training for deeper models under these parameterizations. In contrast, CompleteP (α = 1) exhibits stable optimal HPs across varying depths, enabling effective "tune small and train large" strategies for deep models. This HP transferability holds even in compute-optimal training settings (using 20 tokens per parameter and batch sizes scaled with FLOPs).
  2. Compute-Optimal Shape and Efficiency: The paper revisits the question of compute-optimal transformer shapes (width-to-depth ratio, N:L) under proper scaling control for both width and depth. By training models with varying N and L while holding the number of non-embedding parameters approximately constant, the authors compare training loss at a fixed compute budget (20 TPP). While the optimal N:L ratio increases with model scale for all parameterizations, CompleteP consistently achieves lower loss for a given parameter count and compute budget, with FLOP savings over µP that grow as model depth increases. For 1.5B-parameter models, CompleteP achieves 11.8% FLOP savings over µP at the optimal N:L ratio and 34.4% savings for the deepest models tested. CompleteP also keeps a wider range of N:L ratios, particularly deeper, narrower models, close to compute-optimal. These upstream gains translate into improved performance on downstream tasks.
  3. Theoretical Justification via Desiderata: The authors propose three desiderata for designing parameterizations that enable HP transfer and effective scaling:

    • Stable Initialization: Hidden layers and outputs remain stable at initialization as width and depth grow. This requires α ≥ 0.5.
    • Maximal Residual Stream Update: Each residual block's weight updates contribute a consistent, non-trivial amount to the change in hidden representations (a Θ(1/L) scale per step for residual blocks, Θ(1) for non-residual layers). This requirement determines the learning rate scaling with depth, η = Θ(L^{α-1}), and constrains α ≤ 1 (see the short derivation after this list).
    • Complete Feature Learning: Neither hidden layers nor the model output should become "lazy" with respect to any model parameters as scale increases. A layer is considered lazy if its update behavior asymptotically matches that of its linearization around initialization. The authors show via a simple example that only α = 1 keeps the non-linear contribution to the update comparable in scale to the linear contribution as L → ∞. This property ensures that deeper layers continue to learn complex, non-linear features.

    CompleteP (α = 1) is presented as the unique parameterization satisfying all three desiderata, offering stable training, maximal updates, and complete feature learning across scales.

  4. Extended Parameterization Rules: To fully realize the benefits of depth scaling, the authors extend existing parameterization rules to include guidance for LayerNorm (LN) parameters, bias learning rates, AdamW weight decay (λ), and AdamW ε. They derive scalings for these HPs as functions of N and L (summarized in the paper's parameterization table). Empirical checks confirm these adjustments are necessary for stable training, especially for α = 0.5.
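The learning-rate scaling referenced in the Maximal Residual Stream Update desideratum follows from a short argument; the derivation below is a sketch consistent with the statements above, not the paper's full proof:

```latex
% Each block enters the residual stream scaled by L^{-\alpha}, and a
% weight update of size \eta changes the block output by \Theta(\eta),
% so the per-step residual-stream update from one block is
\[
  \Delta \mathbf{h} \;=\; L^{-\alpha}\,\Delta \mathcal{F}_\ell
  \;=\; \Theta\!\left(L^{-\alpha}\,\eta\right).
\]
% Imposing the maximal-update requirement of \Theta(1/L) per block gives
\[
  L^{-\alpha}\,\eta = \Theta\!\left(L^{-1}\right)
  \quad\Longrightarrow\quad
  \eta = \Theta\!\left(L^{\alpha-1}\right),
\]
% which stays \Theta(1) in depth exactly when \alpha = 1.
```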

Implementation:

Implementing CompleteP primarily involves adjusting initialization variances, learning rates, weight decay, and AdamW ε based on the model's width multiplier m_N = N/N_base and depth multiplier m_L = L/L_base, following the rules in the paper's parameterization table. The base model (m_N = 1, m_L = 1) can be a small, easily tunable model. The adjustments, sketched in code after this list, include:

  • Initializing hidden weights with variance σ_base^2 · m_N^{-1}.
  • Scaling hidden learning rates as η_base · m_N^{-1} · m_L^{α-1} (with α = 1, this is η_base · m_N^{-1}).
  • Scaling bias and LayerNorm learning rates as η_base · m_L^{α-1} (with α = 1, this is η_base).
  • Scaling residual block outputs by m_L^{-α} before the residual connection (with α = 1, this is m_L^{-1}).
  • Scaling hidden weight decay as λ_base · m_N.
  • Scaling AdamW ε in residual blocks as ε_base · m_N^{-1} · m_L^{-α} (with α = 1, this is ε_base · m_N^{-1} · m_L^{-1}).

The authors provide a minimal implementation example based on nanoGPT.
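In the same spirit, the following sketch collects these scaling rules in one place (the function name, argument names, and base values are illustrative assumptions; the rules themselves follow the list above):

```python
import math

def completep_hparams(width, depth, base_width, base_depth,
                      eta_base, sigma_base, lambda_base, eps_base,
                      alpha=1.0):
    """Scale base HPs to a target (width, depth); alpha = 1 is CompleteP."""
    m_N = width / base_width  # width multiplier
    m_L = depth / base_depth  # depth multiplier
    return {
        # hidden weight variance sigma_base^2 * m_N^{-1} (std scales as m_N^{-1/2})
        "hidden_init_std": sigma_base / math.sqrt(m_N),
        # hidden LR: eta_base * m_N^{-1} * m_L^{alpha-1}
        "hidden_lr": eta_base / m_N * m_L ** (alpha - 1),
        # bias / LayerNorm LR: eta_base * m_L^{alpha-1}
        "bias_ln_lr": eta_base * m_L ** (alpha - 1),
        # residual branch multiplier: m_L^{-alpha}
        "residual_scale": m_L ** (-alpha),
        # hidden weight decay: lambda_base * m_N
        "hidden_weight_decay": lambda_base * m_N,
        # AdamW eps in residual blocks: eps_base * m_N^{-1} * m_L^{-alpha}
        "adamw_eps": eps_base / m_N * m_L ** (-alpha),
    }

# Example: tune a 256-wide, 4-layer proxy, then scale to a 2048 x 32 model.
hp = completep_hparams(width=2048, depth=32, base_width=256, base_depth=4,
                       eta_base=1e-2, sigma_base=0.02, lambda_base=0.1,
                       eps_base=1e-8)
print(hp)
```

In a full training loop, these values would feed separate AdamW parameter groups (hidden weights vs. biases and LN gains) plus the residual multiplier used in the block sketch above.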

Practical Implications:

  • Reduced HP Tuning Cost: CompleteP's HP transferability significantly reduces the need for costly HP tuning when scaling model depth, allowing practitioners to tune on smaller models and scale up reliably.
  • Improved Compute Efficiency: By enabling more effective training of deep models and allowing for compute-efficient performance across a wider range of N:L ratios, CompleteP offers substantial FLOP savings during pre-training.
  • Flexible Model Shapes: CompleteP makes deeper, narrower models more competitive in compute efficiency, which can benefit hardware settings with memory constraints (e.g., those requiring weight streaming).
  • Deeper Understanding of Scaling: The concept of Complete Feature Learning provides a novel theoretical lens for understanding why certain parameterizations perform better when scaling depth.

The paper demonstrates these benefits empirically on pre-LN decoder-only transformers up to 1.5B parameters, trained in compute-optimal settings.

Limitations: The empirical results are specific to pre-LN decoder-only transformers with AdamW on text data. The theoretical analysis simplifies certain aspects (e.g., using a fixed token count limit). Compute constraints limited the scale of the largest models tested and the precision of scaling law fits.

Broader Impacts: The compute efficiency gains from CompleteP can help reduce the significant carbon emissions and financial costs associated with training large LLMs, promoting more environmentally sustainable and equitable AI research practices.