Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations (2405.18392v3)
Abstract: Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup and future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative, a constant learning rate followed by a cooldown, and find that it scales predictably and reliably, similarly to cosine. Additionally, we show that stochastic weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales. Importantly, with these findings we demonstrate that scaling experiments can be performed with significantly reduced compute and GPU hours by utilizing fewer but reusable training runs. Our code is available at \url{https://github.com/epfml/schedules-and-scaling/}.
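As a rough illustration of the schedule discussed in the abstract, the sketch below implements a constant learning rate with linear warmup and a final linear cooldown, alongside a standard cosine schedule for comparison, plus a uniform checkpoint-averaging helper in the spirit of stochastic weight averaging. The warmup length, the 20% cooldown fraction, the linear decay shape, and all function names are illustrative assumptions rather than the paper's exact settings; see the linked repository for the authors' actual implementation.

```python
# Sketch of a constant-LR schedule with warmup and cooldown, assuming a linear
# decay shape and a 20% cooldown window (illustrative defaults, not the paper's).
import math

def constant_with_cooldown(step, total_steps, warmup_steps=100, cooldown_frac=0.2):
    """LR multiplier in [0, 1]: linear warmup, constant plateau, linear cooldown."""
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # linear warmup
    if step < cooldown_start:
        return 1.0                                    # constant plateau
    remaining = total_steps - step                    # linear decay to zero
    return max(0.0, remaining / max(1, total_steps - cooldown_start))

def cosine_schedule(step, total_steps, warmup_steps=100, min_ratio=0.1):
    """Standard cosine decay to `min_ratio` of the peak LR, for comparison."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

def average_checkpoints(state_dicts):
    """Uniformly average parameters from several checkpoints along the trajectory
    (a simple weight-averaging variant; assumes floating-point tensors/values)."""
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts)
            for k in state_dicts[0]}
```

Note that, unlike the cosine multiplier, the constant plateau does not depend on the total step budget until the cooldown begins, which is what allows a single long run to be branched with cooldowns at several lengths and reused across scaling experiments, as the abstract describes.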