- The paper presents a novel method that combines μP with Unit Scaling to simplify hyperparameter tuning across model sizes.
- It achieves stable low-precision training by ensuring weights, activations, and gradients begin training with unit variance and by reducing hyperparameter interdependence.
- Empirical results demonstrate enhanced learning rate transfer and efficient hyperparameter search, making large-scale LLM training more practical.
The paper "u-μP: The Unit-Scaled Maximal Update Parametrization" (2407.17465) introduces a new approach to scaling neural network training, particularly for LLMs, by combining the Maximal Update Parametrization (μP) with Unit Scaling. The core motivation is to address the practical challenges of training large models, including the high cost of hyperparameter tuning and the difficulties of low-precision training.
μP aims to make optimal hyperparameters (HPs), especially the learning rate, independent of model size, allowing practitioners to find optimal HPs on smaller "proxy" models and transfer them to larger "target" models (μTransfer). While μP has theoretical grounding and has been used in some open LLM projects, the authors identify several practical limitations: μTransfer does not consistently work out-of-the-box with standard training setups (requiring fixes like non-parametric norms and independent weight decay), selecting which HPs to sweep is often arbitrary, the concept of a "base shape" adds complexity, and μP models can struggle with low-precision training despite theoretical claims about activation scales.
The proposed method, u-μP (Unit-Scaled Maximal Update Parametrization), synthesizes μP and Unit Scaling. Unit Scaling is a method that ensures weights, activations, and gradients begin training with unit variance, which facilitates stable training in low precision. By combining these, u-μP aims for a simpler, more stable, and more practical scaling scheme.
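As a rough illustration of the Unit Scaling idea (a minimal sketch, not the paper's accompanying library; the class name `UnitScaledLinear` and the decision to scale only the forward pass are assumptions made here), a unit-scaled linear layer initializes its weight with unit variance and applies a fixed $1/\sqrt{\text{fan-in}}$ multiplier in the forward pass, so unit-variance inputs produce roughly unit-variance outputs at initialization:

```python
import math
import torch
import torch.nn as nn

class UnitScaledLinear(nn.Module):
    """Minimal sketch of a unit-scaled linear layer (illustrative only).

    The weight is initialized with unit variance, and a fixed 1/sqrt(fan_in)
    factor is applied in the forward pass so that unit-variance inputs give
    roughly unit-variance outputs at the start of training.
    """

    def __init__(self, fan_in: int, fan_out: int):
        super().__init__()
        # Unit-variance initialization (no 1/sqrt(fan_in) baked into the init).
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))
        self.scale = 1.0 / math.sqrt(fan_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Static scale factor keeps output RMS near 1 for unit-variance inputs.
        return nn.functional.linear(x, self.weight) * self.scale

x = torch.randn(512, 1024)          # unit-variance activations
layer = UnitScaledLinear(1024, 4096)
y = layer(x)
print(y.std())                      # close to 1.0 at initialization
```

Note that full Unit Scaling also prescribes scale factors for the backward pass so that gradients are unit-scaled too; this sketch covers only forward-pass activations.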
Key aspects and contributions of u-μP include:
- Simplified Scaling Rules: u-μP derives new scaling rules by applying Unit Scaling principles and leveraging abc-symmetry (a property of parametrizations that allows scale to be shifted between the initialization, parameter, and LR multipliers; see the worked example after this list) to modify the μP rules. This eliminates the "base shape" and $\sigma_\text{init}$ hyperparameters of standard μP, simplifying implementation. The final u-μP scheme (Table 2) sets all initialization scales to 1 and folds the Unit Scaling factors (such as $1/\sqrt{\text{fan-in}}$) into the parameter multipliers, while retaining μP-style width-dependent scaling of the Adam LR.
- Out-of-the-box Low-Precision Training: A major benefit of integrating Unit Scaling is improved numerical stability. u-μP models start training with tensors whose RMS is close to 1, making them well suited to finite-range formats like FP8. The authors propose a simple FP8 scheme in which most matmul inputs are cast directly to E4M3 or E5M2 without complex dynamic scaling (a simulation of this casting is sketched after this list). For very large models and long training runs, where some layer inputs exhibit scale growth, a lightweight form of dynamic rescaling is applied to those specific layers. Experiments show u-μP can train in FP8 with minimal degradation, while μP fails under the same simple FP8 scheme (Figure 1c, Figure 13).
- Principled and Interpretable Hyperparameters: The paper proposes a refined set of hyperparameters for u-μP that are designed to be minimal, expressive, interpretable, and to have low interdependency. Instead of associating α HPs with weights, u-μP attaches them to the non-homogeneous operations (softmax, non-linear activations, residual additions) that break scale invariance. These α HPs can be interpreted as controlling the scale of the inputs to these critical functions (e.g., acting as a softmax temperature; see the attention sketch after this list). The residual-layer hyperparameters are also redesigned for better interpretability and independence. This structure leads to significantly reduced interdependency between HPs compared with standard μP (Figure 6, Appendix C), simplifying HP search.
- Improved Hyperparameter Transfer and Search: The reduced HP interdependence enables a cheaper "independent search" strategy for u-μP: a learning rate sweep followed by parallel one-dimensional sweeps over the other HPs (Appendix D; a sketch of this procedure follows the list). This is far cheaper than the random search commonly used with μP. The authors show that for u-μP, sweeping only the learning rate (with the other HPs fixed at their default value of 1) is often sufficient to reach near-optimal loss (Figure 1a). An improved empirical rule for the embedding-layer LR, scaling it as $1/\sqrt{\text{fan-out}}$, significantly improves performance at larger widths (Figure 4). Experiments demonstrate strong LR transfer for u-μP across width, depth, training steps, and batch size (Figure 1b, Figure 7).
- Practical Implementation Guidance: The paper includes detailed appendices providing a guide to using u-μP (Appendix B) and introducing an accompanying open-source library implementing u-μP functions, layers, and optimizers in PyTorch (Appendix A, Appendix G). This library provides unit-scaled versions of common operations and handles the necessary parameter initialization and LR scaling rules.
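To make the abc-symmetry mentioned in the scaling-rules bullet concrete, here is a short worked identity in generic notation (the symbols $a$, $b$, $c$, $\theta$ are illustrative and not necessarily the paper's exact notation; the statement is for Adam, whose update magnitude is insensitive to gradient scale, while SGD would require a different LR exponent):

```latex
% Generic abc-parametrization of one weight: effective weight W = a * w,
% with w initialized as w_0 ~ N(0, b^2) and trained with learning rate c.
% For any theta > 0, rescaling (a, b, c) -> (theta*a, b/theta, c/theta)
% leaves Adam training dynamics unchanged:
\begin{aligned}
  W_0      &= a\,w_0 \;=\; (\theta a)\,\frac{w_0}{\theta}, \\
  \Delta W &= a\,(-c\,\hat{u}) \;=\; (\theta a)\!\left(-\frac{c}{\theta}\,\hat{u}\right),
\end{aligned}
% where \hat{u} is Adam's normalized update direction, whose magnitude does not
% depend on the gradient's scale. (For SGD the LR would rescale as c/theta^2.)
```

This freedom is what lets scale be moved out of the initialization (so every weight can start with unit variance, making a separate $\sigma_\text{init}$ unnecessary) and into the fixed multipliers.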
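For the low-precision bullet above, the sketch below simulates a simple cast-to-FP8 recipe by round-tripping matmul inputs through PyTorch's float8 dtypes (requires a recent PyTorch version). This is a quantization simulation for illustration, not the paper's actual FP8 kernels or format assignment; computing the product in bf16 is a portability fallback assumed here.

```python
import torch

def fp8_matmul_sim(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Simulate a static-cast FP8 matmul: cast inputs to float8, compute in bf16.

    Round-tripping through FP8 models the quantization error of a simple
    cast-based scheme with no per-tensor dynamic scaling. Real FP8 training
    would use dedicated FP8 matmul kernels instead of the bf16 fallback here.
    """
    x_fp8 = x.to(torch.float8_e4m3fn).to(torch.bfloat16)
    w_fp8 = w.to(torch.float8_e4m3fn).to(torch.bfloat16)
    return x_fp8 @ w_fp8

# Unit-scaled tensors sit comfortably inside FP8's representable range,
# so a static cast loses little information.
x = torch.randn(512, 1024, dtype=torch.bfloat16)   # unit-variance activations
w = torch.randn(1024, 1024, dtype=torch.bfloat16)  # unit-variance weight
y = fp8_matmul_sim(x, w) * (1.0 / 1024**0.5)       # static scale kept outside the cast
```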
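To make the interpretability point concrete, the attention sketch below shows how an α hyperparameter placed on the input of a non-homogeneous op such as softmax acts as an inverse temperature. Variable names are hypothetical, and the $1/\sqrt{d}$ factor is only there to keep the example's logits near unit scale; it is not claimed to be the paper's exact attention scaling rule.

```python
import math
import torch

def attention_weights(q: torch.Tensor, k: torch.Tensor,
                      alpha_attn: float = 1.0) -> torch.Tensor:
    """Sketch: alpha_attn rescales the (roughly unit-scale) logits fed to softmax.

    When the logits entering softmax have RMS near 1, alpha_attn directly sets
    the softmax temperature: values > 1 sharpen attention, values < 1 flatten it.
    """
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / math.sqrt(d)  # ~unit scale for unit-variance q, k
    return torch.softmax(alpha_attn * logits, dim=-1)

q = torch.randn(4, 128, 64)
k = torch.randn(4, 128, 64)
attn_default = attention_weights(q, k)                  # alpha at its default of 1
attn_sharp = attention_weights(q, k, alpha_attn=2.0)    # sharper attention distribution
```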
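Finally, for the hyperparameter-transfer bullet, here is a minimal sketch of the two-stage "independent search" strategy as described above (a user-supplied `train_and_eval(hparams) -> loss` function is assumed; the names and structure are illustrative, not the paper's exact procedure):

```python
def independent_search(train_and_eval, lr_grid, hp_grids):
    """Sketch of an 'independent' (1-D) hyperparameter search.

    train_and_eval: callable mapping a dict of HPs to a validation loss
                    (assumed to exist, e.g. training a small proxy model).
    lr_grid:        candidate learning rates.
    hp_grids:       {hp_name: candidate values}, each swept one-dimensionally.
    """
    defaults = {name: 1.0 for name in hp_grids}  # u-muP HPs default to 1

    # Stage 1: sweep only the learning rate, all other HPs at their defaults.
    best_lr = min(lr_grid, key=lambda lr: train_and_eval({"lr": lr, **defaults}))

    # Stage 2: one-dimensional sweeps around the chosen LR. Each sweep keeps
    # every other HP at its default, so all runs can be launched in parallel.
    best = {"lr": best_lr, **defaults}
    for name, grid in hp_grids.items():
        best[name] = min(
            grid,
            key=lambda v: train_and_eval({"lr": best_lr, **defaults, name: v}),
        )
    return best
```

The low interdependency between HPs is what makes fixing the others at their default of 1 a reasonable approximation, and since no stage-2 sweep depends on another's result, they can all run concurrently.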
Experiments conducted on Llama-style models trained on WikiText-103 and SlimPajama datasets validate the proposed scheme. u-μP models consistently achieve lower loss than comparable μP models at larger widths (Figure 1b) and demonstrate stable convergence during large-scale FP8 training up to 7B parameters (Figure 13).
In conclusion, u-μP offers a practical and effective method for scaling LLM training by simplifying the parametrization and hyperparameter tuning process, enhancing numerical stability for low-precision training, and improving the interpretability and independence of hyperparameters. The authors provide tools and guidance to facilitate the adoption of u-μP in practice.