A Large-Scale Exploration of $\mu$-Transfer (2404.05728v5)
Abstract: Large artificial neural networks have become a mainstay of language, vision, and audio processing and synthesis, yet their initializations and learning rates are often set in an unsophisticated fashion, owing to the high cost of hyperparameter sweeps at scale. The $\mu$-Parameterization ($\mu$P) offers a potential solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $\mu$P method is not yet widely adopted, perhaps due to its higher implementation complexity, its many variants, or its intricate theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models of up to 10B parameters and training budgets of up to 190B tokens, we find that $\mu$-Transfer works as intended in the majority of important cases, yet we also identify a few cases where it may not.
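For context, the sketch below illustrates one common Adam-based instantiation of the $\mu$P scaling rules referenced above: embedding parameters keep a width-independent initialization and learning rate, hidden matrices scale their initialization like $1/\sqrt{\text{fan-in}}$ and their Adam learning rate like $1/\text{fan-in}$, and the readout layer is down-scaled similarly (with attention logits typically divided by $d_\text{head}$ rather than $\sqrt{d_\text{head}}$). This is a minimal sketch under those assumptions only; the function name, the tensor grouping, and the base values are illustrative and are not taken from the paper's actual configuration.

```python
# Minimal sketch of one common Adam variant of the muP scaling rules.
# The tensor grouping and base hyperparameters are illustrative assumptions,
# not the configuration used in the paper.

import math


def mup_scales(base_width: int, width: int, base_lr: float, base_init_std: float):
    """Return (init_std, adam_lr) per tensor group under a common muP recipe.

    Groups:
      * 'embedding': input/embedding weights -- width-independent.
      * 'hidden':    hidden matrices (attention and MLP weights) whose
                     fan-in grows proportionally with `width`.
      * 'readout':   the final output projection / unembedding.
    """
    m = width / base_width  # width multiplier relative to the tuned proxy model

    return {
        # Embedding weights: init std and Adam LR stay constant with width.
        "embedding": (base_init_std, base_lr),
        # Hidden weights: init std shrinks like 1/sqrt(fan-in),
        # Adam LR shrinks like 1/fan-in (expressed here via the multiplier m).
        "hidden": (base_init_std / math.sqrt(m), base_lr / m),
        # Readout weights: commonly zero-initialized (or ~1/fan-in scaled),
        # with the Adam LR also scaled by 1/fan-in.
        "readout": (0.0, base_lr / m),
    }


if __name__ == "__main__":
    # Example: hyperparameters tuned at width 256, transferred to width 4096.
    print(mup_scales(base_width=256, width=256, base_lr=3e-3, base_init_std=0.02))
    print(mup_scales(base_width=256, width=4096, base_lr=3e-3, base_init_std=0.02))
```

In this recipe, a learning rate tuned on a narrow proxy model is reused verbatim as `base_lr`, and only the per-tensor multipliers change with width; that is the zero-shot transfer behavior whose optimality the paper evaluates at scale.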