A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
Pretraining large language models (LLMs) is expensive, and choosing a learning rate schedule typically requires costly trial and error. This paper, by researchers at Tsinghua University and UC Berkeley, proposes an empirical law for forecasting the evolution of LLM pretraining loss under diverse learning rate schedules, including constant, cosine, and step decay.
Overview of the Multi-Power Law
The central contribution is the Multi-Power Law (MPL), which predicts the full training loss curve from the sequence of learning rates. The MPL expresses the pretraining loss at any timestep t in a composite power-law form with two parts:
- Primary loss term: a classical power law in the accumulated learning rate S1(t), the sum of learning rates up to step t, offset by the learning rate accumulated during the warmup phase, SW.
- Loss reduction term: a supplemental term that captures the additional drop in loss attributable to learning rate decay, accounting for effects such as sharp learning rate reductions late in training.
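The two terms above can be combined into a runnable sketch. The code below follows the high-level structure just described (a power law in the accumulated learning rate minus a decay-driven reduction term); the inner saturating function G and all constants (L0, A, alpha, B, C, beta, gamma, SW) are illustrative assumptions, not values fitted in the paper:

```python
import numpy as np

def mpl_loss(lrs, L0=3.0, A=0.5, alpha=0.5, B=300.0, C=0.5,
             beta=0.4, gamma=0.6, SW=1.0):
    """Predicted loss at every step under an MPL-style form (sketch).

    lrs: per-step learning rates after warmup. All constants here are
    illustrative placeholders, not values fitted in the paper.
    """
    lrs = np.asarray(lrs, dtype=float)
    T = len(lrs)
    S1 = np.cumsum(lrs)                         # accumulated learning rate S1(t)
    base = L0 + A * (S1 + SW) ** (-alpha)       # power law in accumulated LR
    losses = np.empty(T)
    for t in range(T):
        LD = 0.0                                # loss reduction from LR decay
        for k in range(1, t + 1):
            drop = lrs[k - 1] - lrs[k]          # size of the decay at step k
            if drop <= 0.0:                     # skip non-decay steps
                continue
            S_k = S1[t] - S1[k - 1]             # LR mass from step k to t
            x = lrs[k] ** (-gamma) * S_k
            G = 1.0 - (C * x + 1.0) ** (-beta)  # saturating response to decay
            LD += drop * G
        losses[t] = base[t] - B * LD
    return losses
```

For a constant schedule the reduction term vanishes and the predicted curve is a pure power law in the accumulated learning rate; with decay, the predicted final loss falls below the constant-schedule curve.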
Validation Across Diverse Models and Architectures
The MPL adapts to a range of model sizes and configurations, exceeding the coverage of existing scaling laws, which are often limited to specific families of learning rate schedules. In validation experiments spanning multiple LLM architectures and sizes, the MPL accurately predicted loss trajectories under learning rate schedules not seen during fitting.
Implications and Future Directions
The MPL advances our understanding of LLM training dynamics in two ways. It improves prediction accuracy for unseen learning rate strategies, and it enables the search for better schedules: optimizing a schedule against the MPL's predicted loss yields schedules that outperform the widely used cosine schedule. The schedules discovered this way resemble, and improve upon, Warmup-Stable-Decay (WSD), reducing final training loss at the same compute budget.
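As a toy illustration of schedule search with such a surrogate, one can score candidate schedules by their predicted final loss and keep the best. The sketch below grid-searches the decay onset of a WSD-like (constant-then-linear-decay) schedule; the predictor's functional form and every constant are illustrative assumptions, not the paper's fitted model or its optimization method:

```python
import numpy as np

# Illustrative MPL-style constants (placeholders, not fitted values).
L0, A, ALPHA, B, C, BETA, GAMMA, SW = 3.0, 0.5, 0.5, 300.0, 0.5, 0.4, 0.6, 1.0

def predicted_final_loss(lrs):
    """MPL-style surrogate for the loss at the last step (sketch)."""
    lrs = np.asarray(lrs, dtype=float)
    S1 = np.cumsum(lrs)
    total = S1[-1]
    LD = 0.0
    for k in range(1, len(lrs)):
        drop = lrs[k - 1] - lrs[k]            # decay magnitude at step k
        if drop <= 0.0:
            continue
        S_k = total - S1[k - 1]               # LR mass from step k onward
        G = 1.0 - (C * lrs[k] ** (-GAMMA) * S_k + 1.0) ** (-BETA)
        LD += drop * G
    return L0 + A * (total + SW) ** (-ALPHA) - B * LD

def wsd(T, peak, start_frac):
    """Constant at `peak`, then linear decay toward zero from start_frac*T."""
    t = np.arange(T)
    s = int(start_frac * T)
    lrs = np.full(T, peak)
    lrs[s:] = peak * (1 - (t[s:] - s) / max(T - s, 1))
    return lrs

T, peak = 2000, 1e-3
best = min((predicted_final_loss(wsd(T, peak, f)), f)
           for f in np.linspace(0.1, 0.9, 9))
print(f"best decay onset: {best[1]:.1f} of training, "
      f"predicted loss {best[0]:.3f}")
```

Every candidate with a decay phase scores better under this surrogate than the all-constant schedule, mirroring the qualitative finding that decay reduces final loss.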
From a practical standpoint, the MPL reduces the need for exhaustive hyperparameter-tuning experiments, saving computational resources. Theoretically, the derived power-law relationships, rooted in empirical evidence and reinforced by theoretical analysis, help explain how different learning rate schedules affect final model performance.
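One way such a law can cut tuning cost is to fit its parameters on an early portion of a loss curve and extrapolate to the end of training. A minimal pure-NumPy sketch on synthetic data, covering only the constant-LR component of an MPL-style law; the constants, synthetic curve, and fit-on-a-prefix setup are illustrative assumptions, not the paper's fitting procedure:

```python
import numpy as np

# Synthetic stand-in for a real constant-LR run: losses follow
# L(t) = L0 + A * (S1(t) + SW)^(-alpha) plus small observation noise.
rng = np.random.default_rng(0)
T, lr, SW = 5000, 1e-3, 1.0
S1 = lr * np.arange(1, T + 1)                       # accumulated learning rate
true_curve = 2.8 + 0.6 * (S1 + SW) ** (-0.45)       # ground-truth loss curve
observed = true_curve + rng.normal(0.0, 0.002, T)   # noisy "measured" losses

# For fixed alpha the model is linear in (L0, A), so grid-search alpha and
# solve the linear part by least squares on the first 20% of steps only.
n_fit = T // 5
best = None
for alpha in np.linspace(0.1, 1.0, 91):
    X = np.column_stack([np.ones(n_fit), (S1[:n_fit] + SW) ** (-alpha)])
    (L0_hat, A_hat), *_ = np.linalg.lstsq(X, observed[:n_fit], rcond=None)
    sse = np.sum((X @ np.array([L0_hat, A_hat]) - observed[:n_fit]) ** 2)
    if best is None or sse < best[0]:
        best = (sse, L0_hat, A_hat, alpha)

_, L0_hat, A_hat, alpha_hat = best
pred_final = L0_hat + A_hat * (S1[-1] + SW) ** (-alpha_hat)
print(f"alpha={alpha_hat:.2f}, predicted final loss {pred_final:.3f}")
```

Fitting on 20% of the (synthetic) run recovers the exponent and extrapolates the final loss closely, illustrating how a fitted law can substitute for a full-length training run.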
Conclusions
This paper offers a practical new perspective on hyperparameter configuration for LLMs, with the MPL serving both as a predictive tool and as a guide for learning rate schedule optimization. While incremental rather than revolutionary, the approach marks clear progress toward more principled training setups, promising improved computational efficiency and model performance. Future work could extend these findings to additional aspects of LLM training and deployment, with potential impact across diverse AI applications.