- The paper presents a novel scaling law that integrates cumulative learning rates and annealing effects to accurately model training loss across different steps.
- Empirical results on constant and cosine learning rate schedulers show fits with R² above 0.999 and a mean prediction error of around 0.2%, validating the model's precision.
- The work offers practical guidance on choosing learning rate schedules, significantly reducing computational costs and enhancing training efficiency.
Scaling Law with Learning Rate Annealing
The paper entitled "Scaling Law with Learning Rate Annealing" presents a novel and empirically validated scaling law for LLMs that incorporates the effects of learning rate (LR) annealing over training steps. This work builds upon prior research establishing power-law relationships between model performance, model size, and data size, but addresses an important gap: typical scaling laws describe only the performance at the end of training and neglect the dynamics throughout the training process.
Theoretical Framework
The authors propose a modification to the standard scaling law by integrating two key components: the forward area, which is a function of the accumulated learning rate over steps, and the learning rate annealing area, which is influenced by the historical momentum of LR changes. The resulting empirical formula for the cross-entropy loss L(s) at training step s is:
$$L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2$$
where $S_1$ and $S_2$ are defined as follows:

$$S_1 = \sum_{i=1}^{s} \eta_i, \qquad S_2 = \sum_{i=1}^{s} \sum_{k=1}^{i} (\eta_{k-1} - \eta_k) \cdot \lambda^{i-k}$$
Here, $S_1$ represents the forward area, essentially capturing the cumulative impact of the learning rates over the steps, while $S_2$ incorporates the decayed influence of the learning rate changes, introducing a form of momentum analogous to physical systems. The parameters $L_0$, $A$, $C$, and $\alpha$ are constants to be determined empirically, and $\lambda$ is a decay factor typically ranging from 0.99 to 0.999.
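To make these definitions concrete, the following sketch (not the authors' code) computes $S_1$ and $S_2$ for an arbitrary per-step learning-rate schedule and evaluates the predicted loss curve. The inner sum of $S_2$ obeys the recurrence $m_i = \lambda m_{i-1} + (\eta_{i-1} - \eta_i)$, so it can be accumulated in a single pass; the choice $\eta_0 = \eta_1$ is an assumption made for illustration.

```python
import numpy as np

def predicted_loss(lrs, L0, A, C, alpha, lam=0.999):
    """Evaluate L(s) = L0 + A * S1^(-alpha) - C * S2 at every step s.

    lrs : sequence of per-step learning rates eta_1, ..., eta_S
    lam : decay factor lambda (typically 0.99 to 0.999)
    """
    lrs = np.asarray(lrs, dtype=float)
    S1 = np.cumsum(lrs)  # forward area: cumulative sum of learning rates

    # S2(s) = sum_{i<=s} m_i with m_i = sum_{k<=i} (eta_{k-1} - eta_k) * lam^(i-k),
    # where m_i follows the recurrence m_i = lam * m_{i-1} + (eta_{i-1} - eta_i).
    eta_prev = np.concatenate(([lrs[0]], lrs[:-1]))  # assume eta_0 = eta_1
    drops = eta_prev - lrs
    S2 = np.empty_like(lrs)
    m, acc = 0.0, 0.0
    for i, drop in enumerate(drops):
        m = lam * m + drop
        acc += m
        S2[i] = acc

    return L0 + A * np.power(S1, -alpha) - C * S2
```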
Numerical Results and Empirical Validation
The empirical validation of this model is thorough and compelling. By fitting their proposed equation to training curves under different learning rate schedulers (LRS), the authors demonstrate that it accurately predicts the loss at any training step across a variety of LRS. Notably, they show that the formulation describes the entire loss curve during training, capturing both the forward and annealing behaviors.
For instance, fitting experiments on constant and cosine LRS with 20K training steps yielded an almost perfect coefficient of determination ($R^2 > 0.999$). This high level of precision was also observed in prediction experiments for LRS with longer training horizons (e.g., 60K steps), where the mean prediction error was around 0.2%. The authors also extended their formulation to account for model size $N$, showing that the loss drop during LR annealing scales with model size, and confirmed this through additional experiments.
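As a rough sketch of how such a fit could be carried out (the paper does not prescribe a particular optimizer), the snippet below estimates $L_0$, $A$, $C$, and $\alpha$ from a recorded (learning rate, loss) curve with `scipy.optimize.curve_fit`, reusing the `predicted_loss` helper sketched above; the initial guesses and bounds are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_scaling_law(lrs, losses, lam=0.999):
    """Fit L0, A, C, alpha of L(s) = L0 + A*S1^(-alpha) - C*S2 to one training curve."""
    lrs = np.asarray(lrs, dtype=float)
    losses = np.asarray(losses, dtype=float)
    steps = np.arange(1, len(lrs) + 1)

    def model(_, L0, A, C, alpha):
        return predicted_loss(lrs, L0, A, C, alpha, lam=lam)

    # Illustrative initial guesses and bounds; these are placeholders, not paper values.
    p0 = [2.0, 1.0, 1.0, 0.5]
    bounds = ([0.0, 0.0, 0.0, 0.0], [10.0, 100.0, 100.0, 2.0])
    popt, _ = curve_fit(model, steps, losses, p0=p0, bounds=bounds)

    # Goodness of fit (R^2) on the same curve.
    fitted = model(steps, *popt)
    ss_res = np.sum((losses - fitted) ** 2)
    ss_tot = np.sum((losses - losses.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return dict(zip(["L0", "A", "C", "alpha"], popt)), r2
```

Once the constants are fitted, evaluating the same model on a longer or different schedule yields the extrapolated loss predictions discussed above.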
Implications and Applications
The implications of these findings are multi-faceted. Practically, this scaling law provides a powerful tool for researchers and practitioners to predict training dynamics and make informed decisions about LRS before conducting full-scale experiments. Theoretically, the incorporation of LR annealing into the scaling law framework enhances the understanding of training dynamics beyond endpoint performance.
Guidance on LRS Selection: The paper provides several practical insights into LRS selection. For example, it verifies that a cosine LRS with a cycle length equal to the total training steps and a minimal learning rate of zero generally yields the lowest loss. It also explains the improved performance of warmup-stable-decay (WSD) LRS and multi-step cosine LRS compared to the standard cosine LRS. A sketch of how the fitted law can be used to compare candidate schedules follows these points.
Verification of Experimental Findings: The paper successfully verifies many experimentally observed phenomena. For instance, it explains why constant LRS might outperform cosine LRS with limited training steps and supports the finding that a moderate annealing ratio in WSD LRS is optimal.
Computational Efficiency: A notable contribution is the significant reduction in the computational cost of fitting scaling laws. Traditional scaling laws require numerous full training runs to gather endpoint loss data, whereas the proposed method can be fitted from a single training curve, saving over 99% of the computational resources.
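To illustrate how a law fitted from a single curve might guide schedule selection before any full run, the sketch below builds illustrative cosine and warmup-stable-decay schedules and compares their predicted final losses, reusing the `predicted_loss` helper from earlier. The schedule shapes, peak learning rate, and fitted constants are hypothetical assumptions, not configurations from the paper.

```python
import numpy as np

def cosine_schedule(total_steps, peak_lr=3e-4, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr over total_steps (warmup omitted)."""
    t = np.arange(total_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + np.cos(np.pi * t / (total_steps - 1)))

def wsd_schedule(total_steps, peak_lr=3e-4, anneal_ratio=0.2):
    """Warmup-stable-decay: hold peak_lr, then anneal linearly to zero (warmup omitted)."""
    stable = int(total_steps * (1 - anneal_ratio))
    return np.concatenate([np.full(stable, peak_lr),
                           np.linspace(peak_lr, 0.0, total_steps - stable)])

# Hypothetical constants standing in for values fitted from one training curve.
params = dict(L0=1.8, A=0.6, C=1.5, alpha=0.45)

for name, lrs in [("cosine", cosine_schedule(20_000)),
                  ("wsd", wsd_schedule(20_000))]:
    print(f"{name}: predicted final loss = {predicted_loss(lrs, **params)[-1]:.4f}")
```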
Future Directions
The authors acknowledge that the delay phenomenon in LR annealing, while empirically validated, lacks a fully developed theoretical explanation. Future research may investigate the root causes of this phenomenon and pursue further refinements of the scaling-law formulation. Moreover, extending the current formulation to post-training phases, including fine-tuning and domain adaptation, could further broaden its applicability.
Conclusion
In conclusion, "Scaling Law with Learning Rate Annealing" offers a substantial advancement in the understanding and use of scaling laws for training LLMs. By integrating the dynamics of learning rate annealing into the scaling law framework, this research provides both theoretical insights and practical tools that make the prediction and optimization of LLM training far more accessible, with significant implications for computational efficiency and model performance.