u-$μ$P: The Unit-Scaled Maximal Update Parametrization (2407.17465v3)

Published 24 Jul 2024 in cs.LG

Abstract: The Maximal Update Parametrization ($\mu$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$\mu$P, which improves upon $\mu$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low-precision. The two techniques have a natural affinity: $\mu$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$\mu$P models reaching a loss that is equal to or lower than comparable $\mu$P models and working out-of-the-box in FP8.

Citations (6)

Summary

  • The paper presents a novel method that combines μP with unit scaling to simplify hyperparameter tuning across varied model sizes.
  • It achieves stable low-precision training by initializing weights, activations, and gradients with unit variance and reducing hyperparameter interdependence.
  • Empirical results demonstrate enhanced learning rate transfer and efficient hyperparameter search, making large-scale LLM training more practical.

The paper "u-μ\muP: The Unit-Scaled Maximal Update Parametrization" (2407.17465) introduces a new approach to scaling neural network training, particularly for LLMs, by combining the Maximal Update Parametrization (μ\muP) with Unit Scaling. The core motivation is to address the practical challenges of training large models, including the high cost of hyperparameter tuning and the difficulties of low-precision training.

μP aims to make optimal hyperparameters (HPs), especially the learning rate, independent of model size, allowing practitioners to find optimal HPs on smaller "proxy" models and transfer them to larger "target" models (μTransfer). While μP has theoretical grounding and has been used in some open LLM projects, the authors identify several practical limitations: μTransfer does not consistently work out-of-the-box with standard training setups (requiring fixes such as non-parametric norms and independent weight decay), selecting which HPs to sweep is often arbitrary, the concept of a "base shape" adds complexity, and μP models can struggle with low-precision training despite theoretical claims about activation scales.

The proposed method, u-μP (Unit-Scaled Maximal Update Parametrization), synthesizes μP and Unit Scaling. Unit Scaling is a method that ensures weights, activations, and gradients begin training with unit variance, which facilitates stable training in low precision. By combining these, u-μP aims for a simpler, more stable, and more practical scaling scheme.
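To make the Unit Scaling side of this combination concrete, here is a minimal sketch (not the paper's implementation; the function name is ours) of a unit-scaled linear layer: weights are initialized with unit variance and the matmul output is rescaled by $1/\sqrt{\text{fan-in}}$, so activations start training with RMS close to 1 at any width.

```python
import torch

def unit_scaled_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Linear layer whose output has unit scale when x and weight have unit scale."""
    fan_in = weight.shape[1]
    return (x @ weight.t()) / fan_in**0.5

torch.manual_seed(0)
for width in (256, 1024, 4096):
    x = torch.randn(32, width)            # unit-scale input
    w = torch.randn(4 * width, width)     # unit-variance init (sigma_init = 1)
    y = unit_scaled_linear(x, w)
    print(width, float(y.pow(2).mean().sqrt()))  # output RMS stays ~1.0 regardless of width
```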

Key aspects and contributions of u-μP include:

  1. Simplified Scaling Rules: u-μP derives new scaling rules by applying Unit Scaling principles and leveraging abc-symmetry (a property of parametrizations that allows scale to be moved between the initialization, the parameter multiplier, and the learning rate) to modify the μP rules. This eliminates the "base shape" and $\sigma_{\text{init}}$ hyperparameters of standard μP, leading to a simpler implementation. The final u-μP scheme (Table 2) specifies initialization scales of 1 and folds the Unit Scaling factors (such as $1/\sqrt{\text{fan-in}}$) into the parameter multipliers, while retaining width-dependent scaling of the Adam LR, akin to μP; a sketch of these rules appears after this list.
  2. Out-of-the-box Low-Precision Training: A major benefit of integrating Unit Scaling is improved numerical stability. u-μP models start training with tensor RMS close to 1, making them well-suited to finite-range formats such as FP8. The authors propose a simple FP8 scheme in which most matmul inputs are cast to E4M3 or E5M2 without complex dynamic scaling (a simulation of such a cast is sketched after this list). For very large models and long training runs, where some layer inputs exhibit scale growth, a lightweight form of dynamic rescaling is introduced for those specific layers. Experiments show that u-μP can train in FP8 with minimal degradation, while μP fails under the same simple FP8 scheme (Figure 1c, Figure 13).
  3. Principled and Interpretable Hyperparameters: The paper proposes a refined set of hyperparameters for u-μP that are designed to be minimal, expressive, interpretable, and to have low interdependency. Instead of associating α HPs with weights, u-μP associates them with non-homogeneous operations (such as softmax, non-linear activations, and residual additions) that break scale invariance. These α HPs can be interpreted as controlling the scale of the inputs to these critical functions (e.g., acting on the softmax temperature); an illustration follows this list. The residual-layer hyperparameters are also redesigned for better interpretability and independence. This structure leads to significantly reduced interdependency between HPs compared with standard μP (Figure 6, Appendix C), simplifying HP search.
  4. Improved Hyperparameter Transfer and Search: The reduced HP interdependence enables a more efficient "independent search" strategy for u-μP: a learning-rate sweep followed by parallel one-dimensional sweeps over the other HPs (Appendix D; a sketch of this procedure appears after this list). This is much cheaper than the random search commonly used for μP. The authors show that for u-μP, simply sweeping the learning rate (with the other HPs fixed at their default value of 1) is often sufficient to reach near-optimal loss (Figure 1a). An improved empirical rule for scaling the embedding-layer LR (proportional to $1/\sqrt{\text{fan-out}}$) is also introduced, which significantly improves performance at larger widths (Figure 4). Experiments demonstrate strong LR transfer for u-μP across width, depth, training steps, and batch size (Figure 1b, Figure 7).
  5. Practical Implementation Guidance: The paper includes detailed appendices providing a guide to using u-μP (Appendix B) and introducing an accompanying open-source library that implements u-μP functions, layers, and optimizers in PyTorch (Appendix A, Appendix G). The library provides unit-scaled versions of common operations and handles the required parameter-initialization and LR-scaling rules.
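As referenced in item 1, the sketch below shows what these rules might look like for a single hidden linear layer: unit-variance initialization, a $1/\sqrt{\text{fan-in}}$ multiplier in the forward pass, and a width-dependent Adam LR. The class and helper names are hypothetical, and the exact per-tensor factors (including the LR exponent assumed here) should be taken from Table 2 of the paper and its accompanying library rather than from this illustration.

```python
import torch
import torch.nn as nn

class UnitScaledHiddenLinear(nn.Module):
    """Illustrative hidden linear layer in the spirit of the u-muP rules above:
    unit-variance init plus a 1/sqrt(fan_in) multiplier applied in the forward pass."""
    def __init__(self, fan_in: int, fan_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))  # init scale of 1
        self.fan_in = fan_in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.t() / self.fan_in**0.5  # unit-scaling multiplier

def hidden_adam_lr(layer: UnitScaledHiddenLinear, base_lr: float = 1.0) -> float:
    # Width-dependent Adam LR for hidden weights; the 1/sqrt(fan_in) exponent is an
    # assumption for illustration -- consult the paper's Table 2 for the exact rule.
    return base_lr / layer.fan_in**0.5

layer = UnitScaledHiddenLinear(1024, 4096)
opt = torch.optim.Adam([{"params": layer.parameters(), "lr": hidden_adam_lr(layer)}])
```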
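Item 2's static-cast FP8 recipe can be approximated as follows. The snippet only simulates the cast by round-tripping tensors through E4M3 and computing in bfloat16 (real FP8 training uses dedicated FP8 matmul kernels); it assumes a PyTorch build with float8 dtypes, and the choice of E4M3 for both operands is illustrative.

```python
import torch

def simulate_fp8_cast(t: torch.Tensor, dtype=torch.float8_e4m3fn) -> torch.Tensor:
    """Round-trip a tensor through an FP8 format to mimic a static (scale-free) cast."""
    return t.to(dtype).to(torch.bfloat16)

x = torch.randn(32, 1024, dtype=torch.bfloat16)    # unit-scale activations
w = torch.randn(4096, 1024, dtype=torch.bfloat16)  # unit-scale weights
y = simulate_fp8_cast(x) @ simulate_fp8_cast(w).t()
# With unit-scaled tensors this cast loses little information; with badly scaled
# tensors (e.g. std 1e-3 or 1e3) the same static cast would underflow or saturate E4M3.
```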
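Item 3's α hyperparameters can be pictured as explicit scale controls placed at non-homogeneous operations. The functions below are a hedged illustration of that idea rather than the paper's definitions: the softmax α acts as an inverse temperature on a unit-scale input, and the residual α is a hypothetical unit-preserving weighting of the branch.

```python
import torch

def softmax_with_alpha(logits: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """alpha rescales the (unit-scale) softmax input, acting as an inverse temperature.
    With alpha = 1 the op sees unit-scale inputs by default; the HP name is illustrative."""
    return torch.softmax(alpha * logits, dim=-1)

def residual_add(residual: torch.Tensor, branch: torch.Tensor, alpha_res: float = 1.0) -> torch.Tensor:
    # Hypothetical unit-preserving residual add: alpha_res sets the relative weight of
    # the branch, and the sum is renormalized so its scale stays near 1 for
    # independent unit-scale inputs.
    return (residual + alpha_res * branch) / (1.0 + alpha_res**2) ** 0.5
```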
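The independent search of item 4 is simple to state in code: sweep the learning rate first with every other HP at its default of 1, then run one-dimensional sweeps of the remaining HPs at the chosen LR. Here `train_and_eval` and the HP names in the toy example are hypothetical stand-ins for training a proxy model and returning its validation loss.

```python
def independent_search(train_and_eval, lrs, other_hp_grid):
    """train_and_eval(lr=..., **hps) -> validation loss (user-supplied, hypothetical)."""
    # Stage 1: 1-D learning-rate sweep with all other HPs at their default of 1.
    best_lr = min(lrs, key=lambda lr: train_and_eval(lr=lr))
    # Stage 2: independent 1-D sweeps of the remaining HPs at the chosen LR
    # (these can run in parallel, since the HPs have low interdependency).
    best_hps = {}
    for name, values in other_hp_grid.items():
        best_hps[name] = min(values, key=lambda v: train_and_eval(lr=best_lr, **{name: v}))
    return best_lr, best_hps

# Toy objective standing in for real proxy-model training runs:
toy = lambda lr=1.0, alpha_attn=1.0, alpha_res=1.0: (lr - 0.03)**2 + (alpha_attn - 1)**2 + (alpha_res - 2)**2
print(independent_search(toy, lrs=[0.01, 0.03, 0.1],
                         other_hp_grid={"alpha_attn": [0.5, 1, 2], "alpha_res": [1, 2, 4]}))
```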

Experiments conducted on Llama-style models trained on the WikiText-103 and SlimPajama datasets validate the proposed scheme. u-μP models consistently achieve lower loss than comparable μP models at larger widths (Figure 1b) and demonstrate stable convergence during large-scale FP8 training at up to 7B parameters (Figure 13).

In conclusion, u-μP offers a practical and effective method for scaling LLM training by simplifying the parametrization and hyperparameter-tuning process, enhancing numerical stability for low-precision training, and improving the interpretability and independence of hyperparameters. The authors provide tools and guidance to facilitate the adoption of u-μP in practice.
