
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations (2405.18392v3)

Published 28 May 2024 in cs.LG

Abstract: Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup as well as future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative -- constant learning rate and cooldowns -- and find that it scales predictably and reliably similar to cosine. Additionally, we show that stochastic weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales. Importantly, with these findings we demonstrate that scaling experiments can be performed with significantly reduced compute and GPU hours by utilizing fewer but reusable training runs. Our code is available at https://github.com/epfml/schedules-and-scaling/.


Summary

  • The paper demonstrates that replacing the cosine LR schedule with constant LR and a cooldown phase retains or improves model performance while slashing compute costs.
  • Methodology includes experiments on up to 360M-parameter models using various cooldown strategies, highlighting predictable scaling and greater training flexibility.
  • Implications for scaling law research are significant, enabling dynamic training adjustments and better resource efficiency, especially when combined with SWA.

Analyzing the Complexity of Cosine Learning Rate Schedules in Machine Learning and Proposing Solutions

In this paper, the authors critically examine the complexity imposed by the widely used cosine learning rate schedule in the training of LLMs. They argue that reliance on cosine decay forces models to be retrained from scratch for every training length under consideration, which escalates computational costs. As an alternative, they study a streamlined approach: a constant learning rate combined with a cooldown phase. This alternative reportedly matches or exceeds the performance of the cosine schedule while simplifying the training process and reducing computational expense.

Cosine Learning Rate Schedule and Its Limitations

The cosine learning rate (LR) schedule, ubiquitous in LLM training, decays the LR along half a cosine cycle spanning the planned training duration. Based on their empirical findings, the authors highlight several drawbacks:

  • Dependency on Training Duration: The optimal performance of the cosine schedule is contingent on the cycle length matching the total training duration. This means every experimental setup requires predefining the training length, leading to inefficiencies when only minor changes are needed (see the sketch after this list).
  • Inflexibility in Experimentation: To accurately estimate the quality of training for architectural adjustments or varied data mixtures, multiple models must be trained from scratch. This complexity is compounded by the necessity to match the cosine schedule to the training length in advance.
  • Suboptimal Model Performance Estimation: Partway through a cosine run, before the LR has fully decayed, the observed loss understates what the model would achieve if training concluded at that point, which complicates deciding when to halt or extend training.
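The first drawback is visible directly in the schedule's functional form: the LR at a given step depends on the planned total number of steps, so the same step gets a different LR under different run lengths. A minimal sketch, with illustrative names and values that are not taken from the paper's code:

```python
import math

def cosine_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Half-cosine decay from lr_max to lr_min over total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# The same training step yields a different LR depending on the planned run length,
# so a run cannot simply be extended or shortened after the fact:
print(cosine_lr(10_000, total_steps=20_000, lr_max=3e-4))   # halfway through a short run
print(cosine_lr(10_000, total_steps=100_000, lr_max=3e-4))  # early in a long run
```

Because the entire LR trajectory is tied to `total_steps`, evaluating several training lengths means launching a separate run from scratch for each one.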

Proposed Solution: Constant Learning Rate with Cooldown

The authors propose a simple yet effective alternative comprising a constant learning rate followed by a cooldown phase. Their main insights are:

  • Predictable Scaling: The constant LR with cooldown exhibits predictable scaling behavior similar to cosine decay. This was validated through comprehensive training runs demonstrating that constant LR, followed by a cooldown, consistently achieves comparable or better performance.
  • Conceptual Flexibility: This method eliminates the need to specify training length in advance. The cooldown can be initiated flexibly at any point, thereby making it convenient for large-scale runs and continual learning scenarios.
  • Compatibility with Stochastic Weight Averaging (SWA): The authors found that SWA synergizes well with the cooldown approach, enhancing model performance without additional computational costs. SWA specifically averages model parameters over several checkpoints within a training window, yielding improved generalization properties.
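A minimal sketch of both ingredients follows: a constant LR with a cooldown whose shape can be linear or the 1-sqrt form reported in the experiments below, plus a uniform checkpoint average in the spirit of SWA. Names and the exact cooldown parameterization are our own reading, not the paper's implementation:

```python
import copy
import math
import torch

def constant_with_cooldown(step, total_steps, cooldown_steps, lr_max, shape="linear"):
    """Constant LR, then a cooldown over the final cooldown_steps steps.

    Only the cooldown depends on the final length, so the bulk of the run
    (and its checkpoints) can be reused for any target duration.
    """
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return lr_max
    frac = (step - cooldown_start) / cooldown_steps  # goes 0 -> 1 over the cooldown
    if shape == "linear":
        return lr_max * (1.0 - frac)
    if shape == "1-sqrt":
        return lr_max * (1.0 - math.sqrt(frac))
    raise ValueError(f"unknown cooldown shape: {shape}")

@torch.no_grad()
def average_checkpoints(state_dicts):
    """Uniformly average floating-point parameters across checkpoints (SWA-style)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        if torch.is_tensor(avg[key]) and torch.is_floating_point(avg[key]):
            avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg
```

For example, `constant_with_cooldown(95_000, total_steps=100_000, cooldown_steps=10_000, lr_max=3e-4, shape="1-sqrt")` spends an (arbitrarily chosen) 10% of the budget on the cooldown; the cooldown length and shape are the knobs the paper varies.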

Empirical Validation and Experimental Results

The authors conducted experiments with models of up to 360M parameters trained on the SlimPajama dataset. Key results include:

  • Performance Metrics: Across various training durations, the constant LR with cooldown performed comparably to, or even outperformed, the cosine schedule, especially with longer cooldown phases.
  • Effectiveness of Cooldown Schedules: The authors explored different forms for the cooldown phase, including linear and 1-sqrt decays, identifying the latter as particularly effective.
  • Compute Efficiency: The alternative approach yields substantial savings in compute and GPU hours, making frequent scaling-law studies far more feasible. For instance, in a setup mirroring the Chinchilla scaling-law experiments, the proposed method reduced compute costs by more than half.
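The source of these savings is easy to see with a back-of-the-envelope count of training steps for a single model size evaluated at several lengths; the budgets and cooldown fraction below are illustrative, not the paper's accounting:

```python
# Training lengths (in steps) at which we want a "finished" model, for one model size.
lengths = [20_000, 40_000, 80_000, 160_000]
cooldown_frac = 0.1   # fraction of each budget spent on the cooldown (illustrative)

# Cosine: every target length needs its own full run from scratch.
cosine_cost = sum(lengths)

# Constant LR + cooldown: one long run, plus a short cooldown branched off per length.
constant_cost = max(lengths) + sum(cooldown_frac * n for n in lengths)

print(f"cosine schedules:    {cosine_cost:,d} steps")
print(f"constant + cooldown: {constant_cost:,.0f} steps")
print(f"saved:               {1 - constant_cost / cosine_cost:.0%}")
```

With a denser grid of lengths per model size (and many model sizes, as in a full scaling-law sweep), the relative saving grows further, which is consistent with the more-than-half reduction cited above.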

Implications for Scaling Law Research

These findings have profound implications for scaling law research, where the objective is to establish functional forms relating model performance to parameter count and number of training tokens. Traditional scaling law investigations built on cosine schedules are computationally intensive; the proposed method offers a more efficient pathway:

  • Reduced Computational Costs: By training each model size only once and branching cooldown phases or SWA from its checkpoints, researchers can fit scaling laws at a fraction of the previous computational load (a toy demonstration follows this list).
  • Flexibility in Continual Learning: Because training can continue without a predefined duration, models can be adapted to new data mixtures or architectural changes dynamically.
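As a toy demonstration of the reuse idea in the first bullet above, the snippet below runs a single constant-LR trajectory on a synthetic regression problem, saves checkpoints along the way, and branches a short linear cooldown from each one. The model, data, and step counts are invented for illustration and are far removed from the paper's LLM setting:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 16)
y = X @ torch.randn(16, 1) + 0.1 * torch.randn(512, 1)

model = nn.Linear(16, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

save_at = {100, 200, 400}      # steps at which reusable checkpoints are stored
checkpoints = {}

# Main run: constant LR throughout, no decay.
for step in range(1, 401):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()
    if step in save_at:
        checkpoints[step] = (copy.deepcopy(model.state_dict()),
                             copy.deepcopy(opt.state_dict()))

# Branch a short linear cooldown from each checkpoint: three "finished" models
# at three training lengths, without rerunning the shared trajectory.
for step, (m_state, o_state) in sorted(checkpoints.items()):
    branch = nn.Linear(16, 1)
    branch.load_state_dict(m_state)
    b_opt = torch.optim.AdamW(branch.parameters(), lr=1e-2)
    b_opt.load_state_dict(o_state)
    n_cool = step // 10                       # ~10% of the budget spent cooling down
    for s in range(n_cool):
        for group in b_opt.param_groups:
            group["lr"] = 1e-2 * (1.0 - s / n_cool)
        b_opt.zero_grad()
        loss_fn(branch(X), y).backward()
        b_opt.step()
    print(f"length {step + n_cool} steps -> loss {loss_fn(branch(X), y).item():.4f}")
```

Each branch costs only its cooldown steps, so adding another evaluation length is cheap once the constant-LR run exists.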

Limitations and Future Research

While the results are promising, the authors acknowledge limitations: their experiments are confined to models of up to 360M parameters and datasets of up to 10B tokens. Future research should verify that the approach scales to modern, larger LLMs. Additionally, since the paper focuses primarily on training and validation loss, further investigation should assess the method's impact on downstream tasks, which are the ultimate benchmarks of model efficacy.

Conclusion

The paper demonstrates that the complexities introduced by the cosine learning rate schedule in LLM training can be effectively mitigated with a constant LR followed by a cooldown. This alternative not only matches the performance of the cosine schedule but also offers greater flexibility and computational efficiency, making scaling law research more accessible. By enabling more frequent updates to scaling laws and facilitating continual learning, the proposed method holds significant potential for future advancements in AI research and model optimization.