
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations (2405.18392v3)

Published 28 May 2024 in cs.LG

Abstract: Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup as well as future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative -- constant learning rate and cooldowns -- and find that it scales predictably and reliably similar to cosine. Additionally, we show that stochastic weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales. Importantly, with these findings we demonstrate that scaling experiments can be performed with significantly reduced compute and GPU hours by utilizing fewer but reusable training runs. Our code is available at https://github.com/epfml/schedules-and-scaling/.


Summary

  • The paper demonstrates that replacing the cosine LR schedule with constant LR and a cooldown phase retains or improves model performance while slashing compute costs.
  • Methodology includes experiments on up to 360M-parameter models using various cooldown strategies, highlighting predictable scaling and greater training flexibility.
  • Implications for scaling law research are significant, enabling dynamic training adjustments and better resource efficiency, especially when combined with SWA.

Analyzing the Complexity of Cosine Learning Rate Schedules in Machine Learning and Proposing Solutions

In this paper, the authors critically examine the complexity imposed by the widely used cosine learning rate schedule in the training of LLMs. They argue that reliance on cosine decay forces models to be retrained from scratch for every training length under consideration, which escalates computational costs. As an alternative, they study a streamlined approach: a constant learning rate combined with a cooldown phase. This alternative reportedly matches or exceeds the performance of the cosine schedule while simplifying the training process and reducing computational expense.

Cosine Learning Rate Schedule and Its Limitations

The cosine learning rate (LR) schedule, ubiquitous in LLM training, decays the LR along half a cosine cycle spanning the planned training duration. Based on their empirical findings, the authors highlight several drawbacks:

  • Dependency on Training Duration: The optimal performance of the cosine schedule is contingent on the cycle length matching the total training duration. This means every experimental setup requires predefining the training length, leading to inefficiencies when only minor changes are needed (see the sketch after this list).
  • Inflexibility in Experimentation: To accurately estimate the quality of training for architectural adjustments or varied data mixtures, multiple models must be trained from scratch. This complexity is compounded by the necessity to match the cosine schedule to the training length in advance.
  • Suboptimal Model Performance Estimation: Partway through a cosine run, before the LR has fully decayed, the observed loss understates what the model would achieve if training concluded at that point, which complicates deciding when to halt or extend training.
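The first drawback is visible directly in the schedule's functional form: the LR at a given step depends on the planned total number of steps, so the same step gets a different LR under different run lengths. A minimal sketch, with illustrative names and values that are not taken from the paper's code:

```python
import math

def cosine_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Half-cosine decay from lr_max to lr_min over total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# The same training step yields a different LR depending on the planned run length,
# so a run cannot simply be extended or shortened after the fact:
print(cosine_lr(10_000, total_steps=20_000, lr_max=3e-4))   # halfway through a short run
print(cosine_lr(10_000, total_steps=100_000, lr_max=3e-4))  # early in a long run
```

Because the entire LR trajectory is tied to `total_steps`, evaluating several training lengths means launching a separate run from scratch for each one.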

Proposed Solution: Constant Learning Rate with Cooldown

The authors propose a simple yet effective alternative comprising a constant learning rate followed by a cooldown phase. Their main insights are:

  • Predictable Scaling: The constant LR with cooldown exhibits predictable scaling behavior similar to cosine decay. This was validated through comprehensive training runs demonstrating that constant LR, followed by a cooldown, consistently achieves comparable or better performance.
  • Conceptual Flexibility: This method eliminates the need to specify training length in advance. The cooldown can be initiated flexibly at any point, thereby making it convenient for large-scale runs and continual learning scenarios.
  • Compatibility with Stochastic Weight Averaging (SWA): The authors found that SWA synergizes well with the cooldown approach, enhancing model performance without additional computational costs. SWA specifically averages model parameters over several checkpoints within a training window, yielding improved generalization properties.
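A minimal sketch of both ingredients follows: a constant LR with a cooldown whose shape can be linear or the 1-sqrt form reported in the experiments below, plus a uniform checkpoint average in the spirit of SWA. Names and the exact cooldown parameterization are our own reading, not the paper's implementation:

```python
import copy
import math
import torch

def constant_with_cooldown(step, total_steps, cooldown_steps, lr_max, shape="linear"):
    """Constant LR, then a cooldown over the final cooldown_steps steps.

    Only the cooldown depends on the final length, so the bulk of the run
    (and its checkpoints) can be reused for any target duration.
    """
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return lr_max
    frac = (step - cooldown_start) / cooldown_steps  # goes 0 -> 1 over the cooldown
    if shape == "linear":
        return lr_max * (1.0 - frac)
    if shape == "1-sqrt":
        return lr_max * (1.0 - math.sqrt(frac))
    raise ValueError(f"unknown cooldown shape: {shape}")

@torch.no_grad()
def average_checkpoints(state_dicts):
    """Uniformly average floating-point parameters across checkpoints (SWA-style)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        if torch.is_tensor(avg[key]) and torch.is_floating_point(avg[key]):
            avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg
```

For example, `constant_with_cooldown(95_000, total_steps=100_000, cooldown_steps=10_000, lr_max=3e-4, shape="1-sqrt")` spends an (arbitrarily chosen) 10% of the budget on the cooldown; the cooldown length and shape are the knobs the paper varies.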

Empirical Validation and Experimental Results

The authors conducted experiments with models of up to 360M parameters trained on the SlimPajama dataset. Key results include:

  • Performance Metrics: Across various training durations, the constant LR with cooldown performed comparably to, or even outperformed, the cosine schedule, especially with longer cooldown phases.
  • Effectiveness of Cooldown Schedules: The authors explored different forms for the cooldown phase, including linear and 1-sqrt decays, identifying the latter as particularly effective.
  • Compute Efficiency: The alternative approach yields substantial savings in compute and GPU hours, making frequent scaling-law studies far more feasible. For instance, in a setup mirroring the Chinchilla scaling-law experiments, the proposed method reduced compute costs by more than half.
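The source of these savings is easy to see with a back-of-the-envelope count of training steps for a single model size evaluated at several lengths; the budgets and cooldown fraction below are illustrative, not the paper's accounting:

```python
# Training lengths (in steps) at which we want a "finished" model, for one model size.
lengths = [20_000, 40_000, 80_000, 160_000]
cooldown_frac = 0.1   # fraction of each budget spent on the cooldown (illustrative)

# Cosine: every target length needs its own full run from scratch.
cosine_cost = sum(lengths)

# Constant LR + cooldown: one long run, plus a short cooldown branched off per length.
constant_cost = max(lengths) + sum(cooldown_frac * n for n in lengths)

print(f"cosine schedules:    {cosine_cost:,d} steps")
print(f"constant + cooldown: {constant_cost:,.0f} steps")
print(f"saved:               {1 - constant_cost / cosine_cost:.0%}")
```

With a denser grid of lengths per model size (and many model sizes, as in a full scaling-law sweep), the relative saving grows further, which is consistent with the more-than-half reduction cited above.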

Implications for Scaling Law Research

These findings have profound implications for scaling law research, where the objective is to establish functional forms relating model performance to parameter count and number of training tokens. Traditional scaling law investigations built on cosine schedules are computationally intensive; the proposed method offers a more efficient pathway:

  • Reduced Computational Costs: By training each model size only once and branching cooldown phases or SWA from its checkpoints, researchers can fit scaling laws at a fraction of the previous computational load (a toy demonstration follows this list).
  • Flexibility in Continual Learning: Because training can continue without a predefined duration, models can be adapted to new data mixtures or architectural changes dynamically.
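As a toy demonstration of the reuse idea in the first bullet above, the snippet below runs a single constant-LR trajectory on a synthetic regression problem, saves checkpoints along the way, and branches a short linear cooldown from each one. The model, data, and step counts are invented for illustration and are far removed from the paper's LLM setting:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 16)
y = X @ torch.randn(16, 1) + 0.1 * torch.randn(512, 1)

model = nn.Linear(16, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

save_at = {100, 200, 400}      # steps at which reusable checkpoints are stored
checkpoints = {}

# Main run: constant LR throughout, no decay.
for step in range(1, 401):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()
    if step in save_at:
        checkpoints[step] = (copy.deepcopy(model.state_dict()),
                             copy.deepcopy(opt.state_dict()))

# Branch a short linear cooldown from each checkpoint: three "finished" models
# at three training lengths, without rerunning the shared trajectory.
for step, (m_state, o_state) in sorted(checkpoints.items()):
    branch = nn.Linear(16, 1)
    branch.load_state_dict(m_state)
    b_opt = torch.optim.AdamW(branch.parameters(), lr=1e-2)
    b_opt.load_state_dict(o_state)
    n_cool = step // 10                       # ~10% of the budget spent cooling down
    for s in range(n_cool):
        for group in b_opt.param_groups:
            group["lr"] = 1e-2 * (1.0 - s / n_cool)
        b_opt.zero_grad()
        loss_fn(branch(X), y).backward()
        b_opt.step()
    print(f"length {step + n_cool} steps -> loss {loss_fn(branch(X), y).item():.4f}")
```

Each branch costs only its cooldown steps, so adding another evaluation length is cheap once the constant-LR run exists.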

Limitations and Future Research

While the results are promising, the authors acknowledge limitations: their experiments are confined to models of up to 360M parameters and datasets of up to 10B tokens. Future research should verify that the approach scales to modern, larger LLMs. Additionally, since the paper focuses primarily on training and validation loss, further investigation should assess the method's impact on downstream tasks, which are the ultimate benchmarks of model efficacy.

Conclusion

The paper demonstrates that the complexities introduced by the cosine learning rate schedule in LLM training can be effectively mitigated with a constant LR followed by a cooldown. This alternative not only matches the performance of the cosine schedule but also offers greater flexibility and computational efficiency, making scaling law research more accessible. By enabling more frequent updates to scaling laws and facilitating continual learning, the proposed method holds significant potential for future advancements in AI research and model optimization.