- The paper presents a novel scaling law that integrates cumulative learning rates and annealing effects to accurately model training loss across different steps.
- Empirical results on constant and cosine learning rate schedulers show fits with R² above 0.999 and a mean prediction error of around 0.2%, validating the model's precision.
- The work offers practical guidance on choosing learning rate schedules, significantly reducing computational costs and enhancing training efficiency.
Scaling Law with Learning Rate Annealing
The paper entitled "Scaling Law with Learning Rate Annealing" presents a novel and empirically validated scaling law for LLMs that incorporates the effects of learning rate (LR) annealing over training steps. This work builds upon prior research establishing power-law relationships between model performance, model size, and data size, but addresses an important gap: typical scaling laws describe only the performance at the end of training and neglect the dynamics throughout the training process.
Theoretical Framework
The authors propose a modification to the standard scaling law by integrating two key components: the forward area, which is a function of the accumulated learning rate over steps, and the learning rate annealing area, which is influenced by the historical momentum of LR changes. The resulting empirical formula for the cross-entropy loss L(s) at training step s is:
$$L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2$$
where $S_1$ and $S_2$ are defined as follows:

$$S_1 = \sum_{i=1}^{s} \eta_i, \qquad S_2 = \sum_{i=1}^{s} \sum_{k=1}^{i} (\eta_{k-1} - \eta_k) \cdot \lambda^{i-k}$$
Here, $S_1$ represents the forward area, essentially capturing the cumulative impact of the learning rates over the steps, while $S_2$ incorporates the decayed influence of the learning rate changes, introducing a form of momentum analogous to physical systems. The parameters $L_0$, $A$, $C$, and $\alpha$ are constants to be determined empirically, and $\lambda$ is a decay factor typically ranging from 0.99 to 0.999.
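To make these definitions concrete, the following sketch (not the authors' code) computes $S_1$ and $S_2$ for an arbitrary per-step learning-rate schedule and evaluates the predicted loss curve. The inner sum of $S_2$ obeys the recurrence $m_i = \lambda m_{i-1} + (\eta_{i-1} - \eta_i)$, so it can be accumulated in a single pass; the choice $\eta_0 = \eta_1$ is an assumption made for illustration.

```python
import numpy as np

def predicted_loss(lrs, L0, A, C, alpha, lam=0.999):
    """Evaluate L(s) = L0 + A * S1^(-alpha) - C * S2 at every step s.

    lrs : sequence of per-step learning rates eta_1, ..., eta_S
    lam : decay factor lambda (typically 0.99 to 0.999)
    """
    lrs = np.asarray(lrs, dtype=float)
    S1 = np.cumsum(lrs)  # forward area: cumulative sum of learning rates

    # S2(s) = sum_{i<=s} m_i with m_i = sum_{k<=i} (eta_{k-1} - eta_k) * lam^(i-k),
    # where m_i follows the recurrence m_i = lam * m_{i-1} + (eta_{i-1} - eta_i).
    eta_prev = np.concatenate(([lrs[0]], lrs[:-1]))  # assume eta_0 = eta_1
    drops = eta_prev - lrs
    S2 = np.empty_like(lrs)
    m, acc = 0.0, 0.0
    for i, drop in enumerate(drops):
        m = lam * m + drop
        acc += m
        S2[i] = acc

    return L0 + A * np.power(S1, -alpha) - C * S2
```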
Numerical Results and Empirical Validation
The empirical validation of this model is thorough and compelling. By fitting their proposed equation to training curves under different learning rate schedulers (LRS), the authors demonstrate that it accurately predicts the loss at any training step across a variety of LRS. Notably, they show that the formulation describes the entire loss curve during training, capturing both the forward and annealing behaviors.
For instance, fitting experiments on constant and cosine LRS with 20K training steps yielded an almost perfect coefficient of determination ($R^2 > 0.999$). This high level of precision was also observed in prediction experiments for LRS with longer training horizons (e.g., 60K steps), where the mean prediction error was around 0.2%. The authors also extended their formulation to account for model size $N$, showing that the loss drop during LR annealing scales with model size, and confirmed this through additional experiments.
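As a rough sketch of how such a fit could be carried out (the paper does not prescribe a particular optimizer), the snippet below estimates $L_0$, $A$, $C$, and $\alpha$ from a recorded (learning rate, loss) curve with `scipy.optimize.curve_fit`, reusing the `predicted_loss` helper sketched above; the initial guesses and bounds are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_scaling_law(lrs, losses, lam=0.999):
    """Fit L0, A, C, alpha of L(s) = L0 + A*S1^(-alpha) - C*S2 to one training curve."""
    lrs = np.asarray(lrs, dtype=float)
    losses = np.asarray(losses, dtype=float)
    steps = np.arange(1, len(lrs) + 1)

    def model(_, L0, A, C, alpha):
        return predicted_loss(lrs, L0, A, C, alpha, lam=lam)

    # Illustrative initial guesses and bounds; these are placeholders, not paper values.
    p0 = [2.0, 1.0, 1.0, 0.5]
    bounds = ([0.0, 0.0, 0.0, 0.0], [10.0, 100.0, 100.0, 2.0])
    popt, _ = curve_fit(model, steps, losses, p0=p0, bounds=bounds)

    # Goodness of fit (R^2) on the same curve.
    fitted = model(steps, *popt)
    ss_res = np.sum((losses - fitted) ** 2)
    ss_tot = np.sum((losses - losses.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return dict(zip(["L0", "A", "C", "alpha"], popt)), r2
```

Once the constants are fitted, evaluating the same model on a longer or different schedule yields the extrapolated loss predictions discussed above.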
Implications and Applications
The implications of these findings are multi-faceted. Practically, this scaling law provides a powerful tool for researchers and practitioners to predict training dynamics and make informed decisions about LRS before conducting full-scale experiments. Theoretically, the incorporation of LR annealing into the scaling law framework enhances the understanding of training dynamics beyond endpoint performance.
Guidance on LRS Selection: The paper provides several practical insights into LRS selection. For example, it verifies that a cosine LRS with a cycle length equal to the total training steps and a minimal learning rate of zero generally yields the lowest loss. It also explains the improved performance of warmup-stable-decay (WSD) LRS and multi-step cosine LRS compared to the standard cosine LRS. A sketch of how the fitted law can be used to compare candidate schedules follows these points.
Verification of Experimental Findings: The paper successfully verifies many experimentally observed phenomena. For instance, it explains why constant LRS might outperform cosine LRS with limited training steps and supports the finding that a moderate annealing ratio in WSD LRS is optimal.
Computational Efficiency: A notable contribution is the significant reduction in the computational cost of fitting scaling laws. Traditional scaling laws require numerous full training runs to gather endpoint loss data, whereas the proposed method can be fitted from a single training curve, saving over 99% of the computational resources.
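To illustrate how a law fitted from a single curve might guide schedule selection before any full run, the sketch below builds illustrative cosine and warmup-stable-decay schedules and compares their predicted final losses, reusing the `predicted_loss` helper from earlier. The schedule shapes, peak learning rate, and fitted constants are hypothetical assumptions, not configurations from the paper.

```python
import numpy as np

def cosine_schedule(total_steps, peak_lr=3e-4, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr over total_steps (warmup omitted)."""
    t = np.arange(total_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + np.cos(np.pi * t / (total_steps - 1)))

def wsd_schedule(total_steps, peak_lr=3e-4, anneal_ratio=0.2):
    """Warmup-stable-decay: hold peak_lr, then anneal linearly to zero (warmup omitted)."""
    stable = int(total_steps * (1 - anneal_ratio))
    return np.concatenate([np.full(stable, peak_lr),
                           np.linspace(peak_lr, 0.0, total_steps - stable)])

# Hypothetical constants standing in for values fitted from one training curve.
params = dict(L0=1.8, A=0.6, C=1.5, alpha=0.45)

for name, lrs in [("cosine", cosine_schedule(20_000)),
                  ("wsd", wsd_schedule(20_000))]:
    print(f"{name}: predicted final loss = {predicted_loss(lrs, **params)[-1]:.4f}")
```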
Future Directions
The authors acknowledge that the delay phenomenon in LR annealing, while empirically validated, lacks a fully developed theoretical explanation. Future research may investigate the root causes of this phenomenon and pursue further refinements of the scaling-law formulation. Moreover, extending the current formulation to post-training phases, including fine-tuning and domain adaptation, could further broaden its applicability.
Conclusion
In conclusion, "Scaling Law with Learning Rate Annealing" offers a substantial advancement in the understanding and use of scaling laws for training LLMs. By integrating the dynamics of learning rate annealing into the scaling law framework, this research provides both theoretical insights and practical tools that make the prediction and optimization of LLM training far more accessible, with significant implications for computational efficiency and model performance.