A Large-Scale Exploration of $μ$-Transfer (2404.05728v5)

Published 8 Apr 2024 in cs.LG

Abstract: Large artificial neural networks have become a mainstay of language, vision, and audio processing and synthesis, yet their initializations and learning rates are often set in an unsophisticated fashion, due to the high cost of hyperparameter sweeps at scale. The $\mu$-Parameterization ($\mu$P) offers a potential solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $\mu$P method is not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models of up to 10B parameters and training budgets of up to 190B tokens, we find $\mu$-Transfer works as intended for the majority of important cases, yet also identify a few cases where it may not.


Summary

  • The paper demonstrates that μ-Transfer reliably carries optimal learning rates from a small proxy model (2M parameters) to much larger transformer models (up to 10B parameters), enabling efficient training without costly hyperparameter sweeps at scale.
  • It shows that trainable scale parameters and certain adjustments to the attention mechanism can disrupt the effectiveness of μ-Transfer.
  • The study confirms that μP reduces hyperparameter tuning costs, although its compatibility varies with different optimizers and training schemes.

Empirical Investigation of μ-Parameterization for Large-Scale Transformer Models

Introduction

Training transformer models efficiently, whether for NLP or computer vision, requires setting hyperparameters such as initialization scales and learning rates, which is typically done heuristically and inconsistently across model sizes. The μ-Parameterization (μP) prescribes scaling rules for these quantities, enabling hyperparameters tuned on a small "proxy" model to be transferred to substantially larger target models with little or no loss in quality, a process termed μ-Transfer. Despite its potential, μP's implementation complexity and theoretical underpinnings may have hindered its widespread adoption. This paper conducts a comprehensive empirical examination of μ-Transfer for the transformer architecture, across models ranging from 2M to 10B parameters, to evaluate its practical effectiveness and identify where it may falter.

Background and Notation

Transformers consist of an embedding layer, a stack of transformer blocks, and an unembedding (output) projection. Each block contains two residual sub-blocks: multi-head attention (MHA) and a multi-layer perceptron (MLP). The crucial aspect of μP is its specific set of initialization and learning-rate scaling rules, which are designed to keep training dynamics stable as model size grows. This paper treats the transformer's width as the scaling dimension and adjusts hyperparameters according to the μP prescriptions, as sketched below.
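To make these rules concrete, here is a minimal sketch of width-based μP-style scaling for Adam-trained transformers, assuming hyperparameters tuned at a small base width. The multipliers and parameter groupings follow common μP presentations and are illustrative assumptions, not the paper's exact configuration.

```python
import math

def mup_scaled_hparams(base_width: int, target_width: int,
                       base_lr: float, base_init_std: float) -> dict:
    """Illustrative width-based muP-style scaling for Adam (one common variant)."""
    m = target_width / base_width  # width multiplier relative to the tuned proxy
    return {
        # Embedding-like ("vector-like") parameters: init std and Adam LR stay constant.
        "embedding": {"init_std": base_init_std, "lr": base_lr},
        # Hidden ("matrix-like") weights: init std shrinks like 1/sqrt(width),
        # Adam LR shrinks like 1/width.
        "hidden": {"init_std": base_init_std / math.sqrt(m), "lr": base_lr / m},
        # Unembedding: commonly zero-initialized under muP; its output multiplier
        # and LR treatment differ across muP variants.
        "unembedding": {"init_std": 0.0, "lr": base_lr / m},
    }

# muP also replaces the usual 1/sqrt(d_head) attention scale with 1/d_head.
# Example: hyperparameters tuned at width 256, rescaled for a width-4096 target.
print(mup_scaled_hparams(256, 4096, base_lr=3e-3, base_init_std=0.02))
```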

Experimental Findings

Baseline μP Performance

The paper outlines several experimental scenarios covering different aspects of transformer design and training to test μ-Transfer's reliability. The baseline experiments verify that, under μP, optimal learning rates transfer across model sizes, confirming the premise that a well-tuned small model can accurately inform the training setup of much larger models.
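The transfer recipe itself is simple to state; the sketch below outlines it. Here `build_model` and `train_and_eval` are hypothetical helpers standing in for an actual training pipeline, and the widths and learning-rate grid are placeholder values.

```python
def tune_on_proxy(lr_grid, proxy_width=256):
    """Sweep the base learning rate on a small proxy model (hypothetical helpers)."""
    losses = {}
    for lr in lr_grid:
        model = build_model(width=proxy_width)          # small proxy, e.g. ~2M params
        losses[lr] = train_and_eval(model, base_lr=lr)  # short training run
    return min(losses, key=losses.get)                  # learning rate with lowest loss

# Zero-shot transfer: reuse the proxy-optimal base learning rate at the target width.
# The per-parameter-group rescaling (e.g. 1/width for hidden matrices under Adam)
# is supplied by the muP parameterization itself and is not re-tuned.
best_lr = tune_on_proxy(lr_grid=[2.0 ** -k for k in range(4, 12)])
target = build_model(width=4096)                        # large target, e.g. ~10B params
train_and_eval(target, base_lr=best_lr)
```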

Projection Biases and RMSNorm Gains

Incorporating projection biases does not meaningfully affect the transferability of learning rates, nor does it yield a notable improvement in model quality. Trainable RMSNorm scales (gains), by contrast, can disrupt hyperparameter transfer, pointing to a potential area of incompatibility with μP.
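For reference, the following is a small NumPy sketch of RMSNorm with an optional trainable gain, to make concrete which parameter the gain-related findings refer to; the class name, shapes, and defaults are illustrative assumptions rather than the paper's code.

```python
import numpy as np

class RMSNorm:
    def __init__(self, dim: int, trainable_gain: bool = False, eps: float = 1e-6):
        self.eps = eps
        # The trainable gain is an elementwise scale applied after normalization;
        # fixing it at 1 (no gain) is the configuration the summary above treats
        # as the safer choice for mu-Transfer.
        self.gain = np.ones(dim) if trainable_gain else None

    def __call__(self, x: np.ndarray) -> np.ndarray:
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + self.eps)
        y = x / rms
        return y * self.gain if self.gain is not None else y

x = np.random.randn(2, 8)
print(RMSNorm(dim=8, trainable_gain=True)(x).shape)  # (2, 8)
```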

Attention Mechanism Adjustments

Experiments adjusting the attention mechanism's initialization and scaling highlight the sensitivity of μ-Transfer to these components. For instance, switching to a standard parameterization for the unembedding projection, or replacing μP's $1/D$ attention scale with the conventional $1/\sqrt{D}$, adversely affects transferability and performance.
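The attention-scale change can be illustrated with a toy single-head example; the NumPy sketch below only contrasts the two scale factors and is not the paper's implementation.

```python
import numpy as np

def attention_scores(q: np.ndarray, k: np.ndarray, mup_scale: bool) -> np.ndarray:
    d_head = q.shape[-1]
    scale = 1.0 / d_head if mup_scale else 1.0 / np.sqrt(d_head)  # muP vs. standard
    logits = (q @ k.T) * scale
    logits -= logits.max(axis=-1, keepdims=True)                  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

q, k = np.random.randn(4, 64), np.random.randn(4, 64)
print(attention_scores(q, k, mup_scale=True).sum(axis=-1))  # each row sums to 1
```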

Optimizer and Training Scheme Compatibility

The paper also examines the compatibility of μ-Transfer with different optimizers and training schemes. The findings indicate that the Lion optimizer does not support seamless μ-Transfer, whereas changes such as varying the batch size and adopting nonlinearities like SwiGLU (a multiplicative, gated activation) or Squared ReLU are accommodated well by μP.
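To clarify the terminology, the toy definitions below show what makes SwiGLU a multiplicative (gated) nonlinearity, alongside the elementwise Squared ReLU; the weight shapes are arbitrary and no biases are included, purely for illustration.

```python
import numpy as np

def swiglu(x: np.ndarray, W: np.ndarray, V: np.ndarray) -> np.ndarray:
    """SwiGLU(x) = SiLU(xW) * (xV) -- the elementwise product acts as a gate."""
    a, b = x @ W, x @ V
    silu = a / (1.0 + np.exp(-a))   # SiLU(a) = a * sigmoid(a)
    return silu * b

def squared_relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0) ** 2

x = np.random.randn(2, 16)
W, V = np.random.randn(16, 32), np.random.randn(16, 32)
print(swiglu(x, W, V).shape, squared_relu(x).shape)  # (2, 32) (2, 16)
```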

Large-Scale Transfer Experiment

A pivotal part of the study is a large-scale transfer experiment scaling models up to 10 billion parameters. This run combines the architectural adjustments found to be both compatible with μ-Transfer and beneficial to quality, and demonstrates that the optimal learning rate identified on a 2 million parameter proxy accurately predicts the optimal learning rate for the 10 billion parameter model.

Implications and Future Directions

The findings validate μ-Transfer's efficacy in a broad scope of settings, with some noted exceptions which invite further research. Given the successful transfer to a 10 billion parameter model, the results affirm the scalability of μP and suggest its potential to substantially reduce the computational costs associated with tuning large-scale models. Future inquiries may delve into refining μP's compatibility with trainable scale parameters and exploring adaptable attention scales, possibly extending these principles beyond transformer architectures.

Conclusion

This paper substantiates the effectiveness of μP and μ-Transfer in optimizing hyperparameter settings across transformer models of vastly differing sizes, with some limitations. By navigating through various architectural and training adaptations, the research elucidates pathways and pitfalls for applying μP in practice, contributing valuable insights to the continuous evolution of large-scale model training methodologies.

The robust testing across multiple configurations underscores μP's promise in simplifying the arduous process of hyperparameter optimization for large neural networks, paving the way for more efficient and scalable AI systems.