A Large-Scale Exploration of $μ$-Transfer (2404.05728v5)

Published 8 Apr 2024 in cs.LG

Abstract: Large artificial neural networks have become a mainstay of language, vision, and audio processing and synthesis, yet their initializations and learning rates are often set in an unsophisticated fashion, due to the high cost of hyperparameter sweeps at scale. The $\mu$-Parameterization ($\mu$P) offers a potential solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $\mu$P method is not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models of up to 10B parameters and training budgets of up to 190B tokens, we find $\mu$-Transfer works as intended for the majority of important cases, yet also identify a few cases where it may not.


Summary

  • The paper demonstrates that μ-Transfer reliably carries optimal learning rates from a small proxy model (2M parameters) to much larger transformer models (up to 10B parameters), enabling efficient training without costly hyperparameter sweeps at scale.
  • It shows that trainable scale parameters and certain adjustments to the attention mechanism can disrupt the effectiveness of μ-Transfer.
  • The study confirms that μP reduces hyperparameter tuning costs, although its compatibility varies with different optimizers and training schemes.

Empirical Investigation of μ-Parameterization for Large-Scale Transformer Models

Introduction

Training transformer models efficiently, whether for NLP or computer vision, requires setting hyperparameters such as initialization scales and learning rates, which is typically done heuristically and inconsistently across model sizes. The μ-Parameterization (μP) prescribes scaling rules for these quantities, enabling hyperparameters tuned on a small "proxy" model to be transferred to substantially larger target models with little or no loss in quality, a process termed μ-Transfer. Despite its potential, μP's implementation complexity and theoretical underpinnings may have hindered its widespread adoption. This paper conducts a comprehensive empirical examination of μ-Transfer for the transformer architecture, across models ranging from 2M to 10B parameters, to evaluate its practical effectiveness and identify where it may falter.

Background and Notation

Transformers consist of an embedding layer, a stack of transformer blocks, and an unembedding (output) projection. Each block contains two residual sub-blocks: multi-head attention (MHA) and a multi-layer perceptron (MLP). The crucial aspect of μP is its specific set of initialization and learning-rate scaling rules, which are designed to keep training dynamics stable as model size grows. This paper treats the transformer's width as the scaling dimension and adjusts hyperparameters according to the μP prescriptions, as sketched below.
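To make these rules concrete, here is a minimal sketch of width-based μP-style scaling for Adam-trained transformers, assuming hyperparameters tuned at a small base width. The multipliers and parameter groupings follow common μP presentations and are illustrative assumptions, not the paper's exact configuration.

```python
import math

def mup_scaled_hparams(base_width: int, target_width: int,
                       base_lr: float, base_init_std: float) -> dict:
    """Illustrative width-based muP-style scaling for Adam (one common variant)."""
    m = target_width / base_width  # width multiplier relative to the tuned proxy
    return {
        # Embedding-like ("vector-like") parameters: init std and Adam LR stay constant.
        "embedding": {"init_std": base_init_std, "lr": base_lr},
        # Hidden ("matrix-like") weights: init std shrinks like 1/sqrt(width),
        # Adam LR shrinks like 1/width.
        "hidden": {"init_std": base_init_std / math.sqrt(m), "lr": base_lr / m},
        # Unembedding: commonly zero-initialized under muP; its output multiplier
        # and LR treatment differ across muP variants.
        "unembedding": {"init_std": 0.0, "lr": base_lr / m},
    }

# muP also replaces the usual 1/sqrt(d_head) attention scale with 1/d_head.
# Example: hyperparameters tuned at width 256, rescaled for a width-4096 target.
print(mup_scaled_hparams(256, 4096, base_lr=3e-3, base_init_std=0.02))
```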

Experimental Findings

Baseline μP Performance

The paper outlines several experimental scenarios covering different aspects of transformer design and training to test μ-Transfer's reliability. The baseline experiments verify that, under μP, optimal learning rates transfer across model sizes, confirming the premise that a well-tuned small model can accurately inform the training setup of much larger models.
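The transfer recipe itself is simple to state; the sketch below outlines it. Here `build_model` and `train_and_eval` are hypothetical helpers standing in for an actual training pipeline, and the widths and learning-rate grid are placeholder values.

```python
def tune_on_proxy(lr_grid, proxy_width=256):
    """Sweep the base learning rate on a small proxy model (hypothetical helpers)."""
    losses = {}
    for lr in lr_grid:
        model = build_model(width=proxy_width)          # small proxy, e.g. ~2M params
        losses[lr] = train_and_eval(model, base_lr=lr)  # short training run
    return min(losses, key=losses.get)                  # learning rate with lowest loss

# Zero-shot transfer: reuse the proxy-optimal base learning rate at the target width.
# The per-parameter-group rescaling (e.g. 1/width for hidden matrices under Adam)
# is supplied by the muP parameterization itself and is not re-tuned.
best_lr = tune_on_proxy(lr_grid=[2.0 ** -k for k in range(4, 12)])
target = build_model(width=4096)                        # large target, e.g. ~10B params
train_and_eval(target, base_lr=best_lr)
```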

Projection Biases and RMSNorm Gains

Incorporating projection biases does not meaningfully affect the transferability of learning rates, nor does it yield a notable improvement in model quality. Trainable RMSNorm scales (gains), by contrast, can disrupt hyperparameter transfer, pointing to a potential area of incompatibility with μP.
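For reference, the following is a small NumPy sketch of RMSNorm with an optional trainable gain, to make concrete which parameter the gain-related findings refer to; the class name, shapes, and defaults are illustrative assumptions rather than the paper's code.

```python
import numpy as np

class RMSNorm:
    def __init__(self, dim: int, trainable_gain: bool = False, eps: float = 1e-6):
        self.eps = eps
        # The trainable gain is an elementwise scale applied after normalization;
        # fixing it at 1 (no gain) is the configuration the summary above treats
        # as the safer choice for mu-Transfer.
        self.gain = np.ones(dim) if trainable_gain else None

    def __call__(self, x: np.ndarray) -> np.ndarray:
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + self.eps)
        y = x / rms
        return y * self.gain if self.gain is not None else y

x = np.random.randn(2, 8)
print(RMSNorm(dim=8, trainable_gain=True)(x).shape)  # (2, 8)
```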

Attention Mechanism Adjustments

Experiments adjusting the attention mechanism's initialization and scaling highlight the sensitivity of μ-Transfer to these components. For instance, switching to a standard parameterization for the unembedding projection, or replacing μP's $1/D$ attention scale with the conventional $1/\sqrt{D}$, adversely affects transferability and performance.
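The attention-scale change can be illustrated with a toy single-head example; the NumPy sketch below only contrasts the two scale factors and is not the paper's implementation.

```python
import numpy as np

def attention_scores(q: np.ndarray, k: np.ndarray, mup_scale: bool) -> np.ndarray:
    d_head = q.shape[-1]
    scale = 1.0 / d_head if mup_scale else 1.0 / np.sqrt(d_head)  # muP vs. standard
    logits = (q @ k.T) * scale
    logits -= logits.max(axis=-1, keepdims=True)                  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

q, k = np.random.randn(4, 64), np.random.randn(4, 64)
print(attention_scores(q, k, mup_scale=True).sum(axis=-1))  # each row sums to 1
```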

Optimizer and Training Scheme Compatibility

The paper also examines the compatibility of μ-Transfer with different optimizers and training schemes. The findings indicate that the Lion optimizer does not support seamless μ-Transfer, whereas changes such as varying the batch size and adopting nonlinearities like SwiGLU (a multiplicative, gated activation) or Squared ReLU are accommodated well by μP.
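To clarify the terminology, the toy definitions below show what makes SwiGLU a multiplicative (gated) nonlinearity, alongside the elementwise Squared ReLU; the weight shapes are arbitrary and no biases are included, purely for illustration.

```python
import numpy as np

def swiglu(x: np.ndarray, W: np.ndarray, V: np.ndarray) -> np.ndarray:
    """SwiGLU(x) = SiLU(xW) * (xV) -- the elementwise product acts as a gate."""
    a, b = x @ W, x @ V
    silu = a / (1.0 + np.exp(-a))   # SiLU(a) = a * sigmoid(a)
    return silu * b

def squared_relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0) ** 2

x = np.random.randn(2, 16)
W, V = np.random.randn(16, 32), np.random.randn(16, 32)
print(swiglu(x, W, V).shape, squared_relu(x).shape)  # (2, 32) (2, 16)
```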

Large-Scale Transfer Experiment

A pivotal part of the study is a large-scale transfer experiment scaling models up to 10 billion parameters. This run combines the architectural adjustments found to be both compatible with μ-Transfer and beneficial to quality, and demonstrates that the optimal learning rate identified on a 2 million parameter proxy accurately predicts the optimal learning rate for the 10 billion parameter model.

Implications and Future Directions

The findings validate μ-Transfer's efficacy in a broad scope of settings, with some noted exceptions which invite further research. Given the successful transfer to a 10 billion parameter model, the results affirm the scalability of μP and suggest its potential to substantially reduce the computational costs associated with tuning large-scale models. Future inquiries may delve into refining μP's compatibility with trainable scale parameters and exploring adaptable attention scales, possibly extending these principles beyond transformer architectures.

Conclusion

This paper substantiates the effectiveness of μP and μ-Transfer in optimizing hyperparameter settings across transformer models of vastly differing sizes, with some limitations. By navigating through various architectural and training adaptations, the research elucidates pathways and pitfalls for applying μP in practice, contributing valuable insights to the continuous evolution of large-scale model training methodologies.

The robust testing across multiple configurations underscores μP's promise in simplifying the arduous process of hyperparameter optimization for large neural networks, paving the way for more efficient and scalable AI systems.