- The paper's key contribution, CompleteP, enables hyperparameter transfer across model depths, supporting tune-small, train-large strategies.
- CompleteP delivers substantial compute efficiency, with up to 34.4\% FLOP savings for deep models compared to previous parameterizations.
- The paper provides theoretical justification, showing that only $\alpha = 1$ simultaneously guarantees stable initialization, maximal residual updates, and complete feature learning.
LLMs achieve better performance with increased scale, but training larger models is computationally expensive. Finding optimal hyperparameters (HPs) for these large models is challenging and costly. The paper addresses this by studying parameterizations, which are rules for adjusting model and optimizer HPs as model size (width and depth) changes, aiming for HP transferability across scales and improved compute efficiency.
Existing parameterizations like the standard parameterization (SP) and the maximal update parameterization (\textmu P) primarily focus on scaling width and struggle to maintain optimal HPs and training stability when scaling model depth. The authors investigate extensions of \textmu P that incorporate depth scaling, specifically parameterizations indexed by a parameter $\alpha \in [0.5, 1]$. These parameterizations rescale the output of each residual block $F_\ell$ by a factor of $L^{-\alpha}$, where $L$ is the depth, before adding it to the residual stream: $h_{\ell+1} = h_\ell + L^{-\alpha} F_\ell(h_\ell)$.
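As a concrete illustration, here is a minimal PyTorch sketch of a pre-LN residual block with this depth-dependent rescaling (not the paper's code; the MLP branch is a placeholder for any attention or feed-forward sub-layer):

```python
# Minimal sketch of a pre-LN residual block with depth-dependent rescaling:
# h_{l+1} = h_l + L^{-alpha} * F_l(h_l); alpha = 1 recovers CompleteP.
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    def __init__(self, d_model: int, depth_L: int, alpha: float = 1.0):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        # Placeholder branch F_l; in a transformer this would be an
        # attention or MLP sub-layer.
        self.f = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.branch_scale = depth_L ** (-alpha)  # L^{-alpha}

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.branch_scale * self.f(self.ln(h))
```

With $\alpha = 1$, doubling the depth halves the branch multiplier while, as discussed below, the optimal learning rates stay fixed.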
The paper identifies a unique parameterization, called CompleteP, which corresponds to setting $\alpha = 1$. The authors demonstrate both empirically and theoretically that CompleteP offers significant advantages over other values of $\alpha$ and prior parameterizations like SP and \textmu P.
Key Findings and Contributions:
- Depth-wise Hyperparameter Transfer: The authors empirically evaluate HP transferability across model depths ($L$) while keeping width ($N$) fixed. They find that SP, \textmu P, and $\alpha = 0.5$ fail to maintain stable optimal base learning rates ($\eta_{\text{base}}$) and initialization standard deviations ($\sigma_{\text{base}}$) as depth increases: HPs tuned for a shallower base model lead to suboptimal training of deeper models under these parameterizations. In contrast, CompleteP ($\alpha = 1$) shows stable optimal HPs across varying depths, enabling effective "tune small and train large" strategies for deep models (\cref{fig:mutransfer-300m}, \cref{fig:mutransfer-lr-20tpp-tepoch}). This HP transferability is shown to hold even in compute-optimal training settings (20 tokens per parameter, with batch sizes scaled by FLOPs).
- Compute-Optimal Shape and Efficiency: The paper revisits the question of compute-optimal transformer shapes (width-to-depth ratio, $N{:}L$) with proper scaling control over both width and depth. Training models with varying $N$ and $L$ while approximately holding the number of non-embedding parameters constant, they compare training loss at a fixed compute budget (20 TPP). While the optimal $N{:}L$ ratio increases with model scale for all parameterizations, CompleteP consistently achieves lower loss at a given parameter count and compute budget (\cref{fig:aspect}a-c), with FLOP savings over \textmu P that grow with model depth (\cref{fig:aspect}e). For 1.5B-parameter models, CompleteP achieves 11.8\% FLOP savings at the optimal $N{:}L$ ratio and 34.4\% savings for the deepest models tested, relative to \textmu P. CompleteP also keeps a wider range of $N{:}L$ ratios, particularly deeper-narrower models, close to compute-optimal (\cref{fig:aspect}f). These upstream gains translate into improved performance on downstream tasks (\cref{tab:downstream-1p5b}).
- Theoretical Justification via Desiderata: The authors propose three desiderata for designing parameterizations that enable HP transfer and effective scaling:
- Stable Initialization: Hidden layers and outputs remain stable at initialization as width and depth grow. This requires $\alpha \geq 0.5$.
- Maximal Residual Stream Update: Each residual block's weight updates contribute a consistent, non-trivial amount to the change in hidden representations ($\Theta(1/L)$ per step for residual blocks, $\Theta(1)$ for non-residual layers). This requirement determines the learning rate scaling with depth, $\eta = \Theta(L^{\alpha-1})$, and constrains $\alpha \leq 1$.
- Complete Feature Learning: Neither hidden layers nor the model output should become "lazy" with respect to any model parameters as scale increases. A layer is considered lazy if its update behavior asymptotically resembles that of its linearization around initialization. The authors show via a simple example that only $\alpha = 1$ keeps the non-linear contribution to the update comparable in scale to the linear contribution as $L \to \infty$ (a sketch of this computation follows the list below). This property ensures that deeper layers continue to learn complex, non-linear features.
CompleteP ($\alpha = 1$) is presented as the unique parameterization satisfying all three desiderata, offering stable training, maximal updates, and complete feature learning across scales.
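To make this concrete, here is a rough order-of-magnitude sketch of the laziness argument (constants and higher-order terms suppressed; it follows the spirit of the paper's simple example, not its full derivation). With $\eta = \Theta(L^{\alpha-1})$, a single optimizer step changes a residual block's weights by $\Delta W = \Theta(L^{\alpha-1})$, and Taylor-expanding the rescaled branch around initialization gives:

```latex
L^{-\alpha} F(h; W + \Delta W)
  = L^{-\alpha} F(h; W)
  + \underbrace{L^{-\alpha}\, \partial_W F(h; W)\, \Delta W}_{\Theta(L^{-1}) \text{ (linear)}}
  + \underbrace{L^{-\alpha}\, O\!\left(\lVert \Delta W \rVert^2\right)}_{\Theta(L^{\alpha-2}) \text{ (non-linear)}}
```

The linear term is $\Theta(L^{-1})$ for every $\alpha$, matching the maximal residual stream update desideratum, while the ratio of the non-linear term to the linear term is $\Theta(L^{\alpha-1})$: it vanishes as $L \to \infty$ for $\alpha < 1$ (the block becomes lazy) and remains $\Theta(1)$ only at $\alpha = 1$.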
- Extended Parameterization Rules: To fully realize the benefits of depth scaling, the authors extend existing parameterization rules to include guidance for LayerNorm (LN) parameters, bias learning rates, AdamW weight decay ($\lambda$), and AdamW $\epsilon$. They derive scalings for these HPs as functions of $N$ and $L$ (\cref{tab:parameterization-summary}). Empirical checks confirm these adjustments are necessary for stable training, especially for $\alpha = 0.5$ (\cref{fig:nonlinear-coordinate-check}, \cref{fig:mutransfer-eps}).
Implementation:
Implementing CompleteP primarily involves adjusting initialization variances, learning rates, weight decay, and AdamW $\epsilon$ based on the model's width multiplier $m_N = N / N_{\text{base}}$ and depth multiplier $m_L = L / L_{\text{base}}$, following the rules in \cref{tab:parameterization-summary}. The base model ($m_N = 1$, $m_L = 1$) can be a small, easily tunable model. The adjustments include:
- Initializing hidden weights with variance $\sigma_{\text{base}}^2 \cdot m_N^{-1}$.
- Scaling hidden learning rates as $\eta_{\text{base}} \cdot m_N^{-1} \cdot m_L^{\alpha-1}$ (with $\alpha = 1$, this is $\eta_{\text{base}} \cdot m_N^{-1}$).
- Scaling bias and LayerNorm learning rates as $\eta_{\text{base}} \cdot m_L^{\alpha-1}$ (with $\alpha = 1$, this is $\eta_{\text{base}}$).
- Scaling residual block outputs by $m_L^{-\alpha}$ before the residual connection (with $\alpha = 1$, this is $m_L^{-1}$).
- Scaling hidden weight decay as $\lambda_{\text{base}} \cdot m_N$.
- Scaling AdamW $\epsilon$ in residual blocks as $\epsilon_{\text{base}} \cdot m_N^{-1} \cdot m_L^{-\alpha}$ (with $\alpha = 1$, this is $\epsilon_{\text{base}} \cdot m_N^{-1} \cdot m_L^{-1}$).
The authors provide a minimal implementation example based on nanoGPT.
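As an illustration of these rules, here is a hedged PyTorch sketch (not the authors' nanoGPT example; the function names and the parameter-grouping heuristic are assumptions, and embedding/unembedding parameters, which have their own rules in \cref{tab:parameterization-summary}, are omitted):

```python
# Hedged sketch of CompleteP HP scaling (alpha = 1 by default), following
# the rules listed above. Embedding/unembedding parameters are omitted;
# the paper's parameterization table gives their separate rules.
import torch

def completep_hparams(N, L, N_base, L_base, lr_base, sigma_base,
                      wd_base, eps_base, alpha=1.0):
    m_N = N / N_base  # width multiplier
    m_L = L / L_base  # depth multiplier
    return {
        # Hidden weight init std: sqrt(sigma_base^2 * m_N^{-1}).
        "hidden_init_std": sigma_base * m_N ** -0.5,
        # Hidden LR: lr_base * m_N^{-1} * m_L^{alpha - 1}.
        "hidden_lr": lr_base * m_N ** -1.0 * m_L ** (alpha - 1.0),
        # Bias / LayerNorm LR: lr_base * m_L^{alpha - 1}.
        "bias_ln_lr": lr_base * m_L ** (alpha - 1.0),
        # Residual branch multiplier m_L^{-alpha}, applied in the model.
        "residual_scale": m_L ** -alpha,
        # Hidden weight decay: wd_base * m_N.
        "hidden_wd": wd_base * m_N,
        # AdamW eps in residual blocks: eps_base * m_N^{-1} * m_L^{-alpha}.
        "hidden_eps": eps_base * m_N ** -1.0 * m_L ** -alpha,
    }

def make_adamw(model, hp):
    # torch.optim.AdamW lets each param group override lr/weight_decay/eps.
    # Crude heuristic: 1-D params are biases or LayerNorm gains/shifts.
    hidden = [p for p in model.parameters() if p.ndim >= 2]
    bias_ln = [p for p in model.parameters() if p.ndim < 2]
    return torch.optim.AdamW([
        {"params": hidden, "lr": hp["hidden_lr"],
         "weight_decay": hp["hidden_wd"], "eps": hp["hidden_eps"]},
        {"params": bias_ln, "lr": hp["bias_ln_lr"], "weight_decay": 0.0},
    ])
```

With $\alpha = 1$, doubling the depth leaves every learning rate unchanged (enabling depth-wise HP transfer) while halving the residual-branch multiplier and the hidden-layer $\epsilon$; doubling the width halves the hidden learning rates and $\epsilon$ and doubles the hidden weight decay.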
Practical Implications:
- Reduced HP Tuning Cost: CompleteP's HP transferability significantly reduces the need for costly HP tuning when scaling model depth, allowing practitioners to tune on smaller models and scale up reliably.
- Improved Compute Efficiency: By enabling more effective training of deep models and allowing for compute-efficient performance across a wider range of N:L ratios, CompleteP offers substantial FLOP savings during pre-training.
- Flexible Model Shapes: CompleteP makes deeper, narrower models more competitive in terms of compute efficiency, potentially beneficial for hardware with memory constraints (e.g., requiring weight streaming).
- Deeper Understanding of Scaling: The concept of Complete Feature Learning provides a novel theoretical lens for understanding why certain parameterizations perform better when scaling depth.
The paper demonstrates these benefits empirically on pre-LN decoder-only transformers up to 1.5B parameters, trained in compute-optimal settings.
Limitations: The empirical results are specific to pre-LN decoder-only transformers with AdamW on text data. The theoretical analysis simplifies certain aspects (e.g., using a fixed token count limit). Compute constraints limited the scale of the largest models tested and the precision of scaling law fits.
Broader Impacts: The compute efficiency gains from CompleteP can help reduce the significant carbon emissions and financial costs associated with training large LLMs, promoting more environmentally sustainable and equitable AI research practices.