- The paper presents a novel method that combines μP with Unit Scaling to simplify hyperparameter tuning across model sizes.
- It achieves stable low-precision training by ensuring weights, activations, and gradients begin training with unit variance and by reducing hyperparameter interdependence.
- Empirical results demonstrate enhanced learning rate transfer and efficient hyperparameter search, making large-scale LLM training more practical.
The paper "u-μP: The Unit-Scaled Maximal Update Parametrization" (2407.17465) introduces a new approach to scaling neural network training, particularly for LLMs, by combining the Maximal Update Parametrization (μP) with Unit Scaling. The core motivation is to address the practical challenges of training large models, including the high cost of hyperparameter tuning and the difficulties of low-precision training.
μP aims to make optimal hyperparameters (HPs), especially the learning rate, independent of model size, allowing practitioners to find optimal HPs on smaller "proxy" models and transfer them to larger "target" models (μTransfer). While μP has theoretical grounding and has been used in some open LLM projects, the authors identify several practical limitations: μTransfer does not consistently work out-of-the-box with standard training setups (requiring fixes like non-parametric norms and independent weight decay), selecting which HPs to sweep is often arbitrary, the concept of a "base shape" adds complexity, and μP models can struggle with low-precision training despite theoretical claims about activation scales.
The proposed method, u-μP (Unit-Scaled Maximal Update Parametrization), synthesizes μP and Unit Scaling. Unit Scaling is a method that ensures weights, activations, and gradients begin training with unit variance, which facilitates stable training in low precision. By combining these, u-μP aims for a simpler, more stable, and more practical scaling scheme.
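As a rough illustration of the Unit Scaling idea (a minimal sketch, not the paper's accompanying library; the class name `UnitScaledLinear` and the decision to scale only the forward pass are assumptions made here), a unit-scaled linear layer initializes its weight with unit variance and applies a fixed $1/\sqrt{\text{fan-in}}$ multiplier in the forward pass, so unit-variance inputs produce roughly unit-variance outputs at initialization:

```python
import math
import torch
import torch.nn as nn

class UnitScaledLinear(nn.Module):
    """Minimal sketch of a unit-scaled linear layer (illustrative only).

    The weight is initialized with unit variance, and a fixed 1/sqrt(fan_in)
    factor is applied in the forward pass so that unit-variance inputs give
    roughly unit-variance outputs at the start of training.
    """

    def __init__(self, fan_in: int, fan_out: int):
        super().__init__()
        # Unit-variance initialization (no 1/sqrt(fan_in) baked into the init).
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))
        self.scale = 1.0 / math.sqrt(fan_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Static scale factor keeps output RMS near 1 for unit-variance inputs.
        return nn.functional.linear(x, self.weight) * self.scale

x = torch.randn(512, 1024)          # unit-variance activations
layer = UnitScaledLinear(1024, 4096)
y = layer(x)
print(y.std())                      # close to 1.0 at initialization
```

Note that full Unit Scaling also prescribes scale factors for the backward pass so that gradients are unit-scaled too; this sketch covers only forward-pass activations.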
Key aspects and contributions of u-μP include:
- Simplified Scaling Rules: u-μP derives new scaling rules by applying Unit Scaling principles and leveraging abc-symmetry (a property of parametrizations that allows scale to be shifted between the initialization, parameter, and LR multipliers; see the worked example after this list) to modify the μP rules. This eliminates the "base shape" and $\sigma_\text{init}$ hyperparameters of standard μP, simplifying implementation. The final u-μP scheme (Table 2) sets all initialization scales to 1 and folds the Unit Scaling factors (such as $1/\sqrt{\text{fan-in}}$) into the parameter multipliers, while retaining μP-style width-dependent scaling of the Adam LR.
- Out-of-the-box Low-Precision Training: A major benefit of integrating Unit Scaling is improved numerical stability. u-μP models start training with tensors whose RMS is close to 1, making them well suited to finite-range formats like FP8. The authors propose a simple FP8 scheme in which most matmul inputs are cast directly to E4M3 or E5M2 without complex dynamic scaling (a simulation of this casting is sketched after this list). For very large models and long training runs, where some layer inputs exhibit scale growth, a lightweight form of dynamic rescaling is applied to those specific layers. Experiments show u-μP can train in FP8 with minimal degradation, while μP fails under the same simple FP8 scheme (Figure 1c, Figure 13).
- Principled and Interpretable Hyperparameters: The paper proposes a refined set of hyperparameters for u-μP that are designed to be minimal, expressive, interpretable, and to have low interdependency. Instead of associating α HPs with weights, u-μP attaches them to the non-homogeneous operations (softmax, non-linear activations, residual additions) that break scale invariance. These α HPs can be interpreted as controlling the scale of the inputs to these critical functions (e.g., acting as a softmax temperature; see the attention sketch after this list). The residual-layer hyperparameters are also redesigned for better interpretability and independence. This structure leads to significantly reduced interdependency between HPs compared with standard μP (Figure 6, Appendix C), simplifying HP search.
- Improved Hyperparameter Transfer and Search: The reduced HP interdependence enables a cheaper "independent search" strategy for u-μP: a learning rate sweep followed by parallel one-dimensional sweeps over the other HPs (Appendix D; a sketch of this procedure follows the list). This is far cheaper than the random search commonly used with μP. The authors show that for u-μP, sweeping only the learning rate (with the other HPs fixed at their default value of 1) is often sufficient to reach near-optimal loss (Figure 1a). An improved empirical rule for the embedding-layer LR, scaling it as $1/\sqrt{\text{fan-out}}$, significantly improves performance at larger widths (Figure 4). Experiments demonstrate strong LR transfer for u-μP across width, depth, training steps, and batch size (Figure 1b, Figure 7).
- Practical Implementation Guidance: The paper includes detailed appendices providing a guide to using u-μP (Appendix B) and introducing an accompanying open-source library implementing u-μP functions, layers, and optimizers in PyTorch (Appendix A, Appendix G). This library provides unit-scaled versions of common operations and handles the necessary parameter initialization and LR scaling rules.
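To make the abc-symmetry mentioned in the scaling-rules bullet concrete, here is a short worked identity in generic notation (the symbols $a$, $b$, $c$, $\theta$ are illustrative and not necessarily the paper's exact notation; the statement is for Adam, whose update magnitude is insensitive to gradient scale, while SGD would require a different LR exponent):

```latex
% Generic abc-parametrization of one weight: effective weight W = a * w,
% with w initialized as w_0 ~ N(0, b^2) and trained with learning rate c.
% For any theta > 0, rescaling (a, b, c) -> (theta*a, b/theta, c/theta)
% leaves Adam training dynamics unchanged:
\begin{aligned}
  W_0      &= a\,w_0 \;=\; (\theta a)\,\frac{w_0}{\theta}, \\
  \Delta W &= a\,(-c\,\hat{u}) \;=\; (\theta a)\!\left(-\frac{c}{\theta}\,\hat{u}\right),
\end{aligned}
% where \hat{u} is Adam's normalized update direction, whose magnitude does not
% depend on the gradient's scale. (For SGD the LR would rescale as c/theta^2.)
```

This freedom is what lets scale be moved out of the initialization (so every weight can start with unit variance, making a separate $\sigma_\text{init}$ unnecessary) and into the fixed multipliers.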
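For the low-precision bullet above, the sketch below simulates a simple cast-to-FP8 recipe by round-tripping matmul inputs through PyTorch's float8 dtypes (requires a recent PyTorch version). This is a quantization simulation for illustration, not the paper's actual FP8 kernels or format assignment; computing the product in bf16 is a portability fallback assumed here.

```python
import torch

def fp8_matmul_sim(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Simulate a static-cast FP8 matmul: cast inputs to float8, compute in bf16.

    Round-tripping through FP8 models the quantization error of a simple
    cast-based scheme with no per-tensor dynamic scaling. Real FP8 training
    would use dedicated FP8 matmul kernels instead of the bf16 fallback here.
    """
    x_fp8 = x.to(torch.float8_e4m3fn).to(torch.bfloat16)
    w_fp8 = w.to(torch.float8_e4m3fn).to(torch.bfloat16)
    return x_fp8 @ w_fp8

# Unit-scaled tensors sit comfortably inside FP8's representable range,
# so a static cast loses little information.
x = torch.randn(512, 1024, dtype=torch.bfloat16)   # unit-variance activations
w = torch.randn(1024, 1024, dtype=torch.bfloat16)  # unit-variance weight
y = fp8_matmul_sim(x, w) * (1.0 / 1024**0.5)       # static scale kept outside the cast
```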
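To make the interpretability point concrete, the attention sketch below shows how an α hyperparameter placed on the input of a non-homogeneous op such as softmax acts as an inverse temperature. Variable names are hypothetical, and the $1/\sqrt{d}$ factor is only there to keep the example's logits near unit scale; it is not claimed to be the paper's exact attention scaling rule.

```python
import math
import torch

def attention_weights(q: torch.Tensor, k: torch.Tensor,
                      alpha_attn: float = 1.0) -> torch.Tensor:
    """Sketch: alpha_attn rescales the (roughly unit-scale) logits fed to softmax.

    When the logits entering softmax have RMS near 1, alpha_attn directly sets
    the softmax temperature: values > 1 sharpen attention, values < 1 flatten it.
    """
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / math.sqrt(d)  # ~unit scale for unit-variance q, k
    return torch.softmax(alpha_attn * logits, dim=-1)

q = torch.randn(4, 128, 64)
k = torch.randn(4, 128, 64)
attn_default = attention_weights(q, k)                  # alpha at its default of 1
attn_sharp = attention_weights(q, k, alpha_attn=2.0)    # sharper attention distribution
```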
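Finally, for the hyperparameter-transfer bullet, here is a minimal sketch of the two-stage "independent search" strategy as described above (a user-supplied `train_and_eval(hparams) -> loss` function is assumed; the names and structure are illustrative, not the paper's exact procedure):

```python
def independent_search(train_and_eval, lr_grid, hp_grids):
    """Sketch of an 'independent' (1-D) hyperparameter search.

    train_and_eval: callable mapping a dict of HPs to a validation loss
                    (assumed to exist, e.g. training a small proxy model).
    lr_grid:        candidate learning rates.
    hp_grids:       {hp_name: candidate values}, each swept one-dimensionally.
    """
    defaults = {name: 1.0 for name in hp_grids}  # u-muP HPs default to 1

    # Stage 1: sweep only the learning rate, all other HPs at their defaults.
    best_lr = min(lr_grid, key=lambda lr: train_and_eval({"lr": lr, **defaults}))

    # Stage 2: one-dimensional sweeps around the chosen LR. Each sweep keeps
    # every other HP at its default, so all runs can be launched in parallel.
    best = {"lr": best_lr, **defaults}
    for name, grid in hp_grids.items():
        best[name] = min(
            grid,
            key=lambda v: train_and_eval({"lr": best_lr, **defaults, name: v}),
        )
    return best
```

The low interdependency between HPs is what makes fixing the others at their default of 1 a reasonable approximation, and since no stage-2 sweep depends on another's result, they can all run concurrently.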
Experiments conducted on Llama-style models trained on WikiText-103 and SlimPajama datasets validate the proposed scheme. u-μP models consistently achieve lower loss than comparable μP models at larger widths (Figure 1b) and demonstrate stable convergence during large-scale FP8 training up to 7B parameters (Figure 13).
In conclusion, u-μP offers a practical and effective method for scaling LLM training by simplifying the parametrization and hyperparameter tuning process, enhancing numerical stability for low-precision training, and improving the interpretability and independence of hyperparameters. The authors provide tools and guidance to facilitate the adoption of u-μP in practice.