
Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective (2410.05192v3)

Published 7 Oct 2024 in cs.LG, cs.CL, and stat.ML

Abstract: Training LLMs currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a proper time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but sharply declines during the decay phase. Towards explaining this phenomenon, we conjecture that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate's oscillations, moving it closer to the river and revealing true optimization progress. Therefore, the sustained high learning rate phase and fast decaying phase are responsible for progress in the river and the mountain directions respectively, and are both critical. Our analysis predicts phenomenons consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch, where we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple LLM checkpoints across various compute budgets in a single run for parameters scaling from 0.1B to 1.2B.

Summary

  • The paper introduces a novel river valley loss landscape hypothesis to explain Warmup-Stable-Decay learning dynamics in LLM training.
  • It demonstrates that the simplified WSD-S method achieves lower validation loss and efficient resource usage compared to existing schedules.
  • The findings offer actionable insights for optimizing large model training across varying compute budgets and future adaptive learning strategies.

Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

The paper "Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective" presents a detailed theoretical analysis and a practical exploration of the Warmup-Stable-Decay (WSD) learning rate schedule, with a novel interpretation of its performance dynamics based on the concept of a river valley loss landscape. The authors investigate the dynamics of LLM training and how this innovative learning rate schedule can be leveraged to optimize performance across varying compute budgets.

Theoretical Foundations

The paper introduces the river valley loss landscape hypothesis to provide a theoretical framework for understanding the behavior of the WSD learning rate schedule. This landscape is characterized by steep, sharp hillsides with a river—a relatively flat and more navigable path—at the valley's bottom. Within this framework, two key factors shape optimization: the steepness of the hillsides and the trajectory along the river.
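To make the geometry concrete, the following minimal Python sketch (an illustration of the idea, not a construction taken from the paper; the names and parameter values are chosen here purely for exposition) defines a two-dimensional toy loss with a steep hillside direction and a slowly improving river direction:

```python
import numpy as np

def river_valley_loss(x, y, sharpness=100.0):
    """Toy 2-D river-valley loss (illustrative, not the paper's construction).

    x: position along the river, the flat direction where loss improves slowly.
    y: deviation from the river bed, the steep hillside direction.
    """
    river = np.log1p(np.exp(-x))    # slowly decreasing loss along the river
    hill = 0.5 * sharpness * y**2   # steep quadratic walls on either side
    return river + hill
```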

The authors argue that during LLM training, the iterate initially bounces between the hillsides because of the high learning rate, while still progressing swiftly along the river direction. The large oscillations along the hillside directions, however, mask this progress, so the loss appears to stagnate. When the learning rate enters the decay phase and rapidly approaches zero, the oscillations shrink, pulling the iterate down toward the river and revealing the optimization progress accumulated along the river direction.
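As a rough illustration of these dynamics (a sketch under the toy-loss assumption above, not a reproduction of the paper's experiments), the snippet below runs noisy gradient descent on that toy loss with a Warmup-Stable-Decay schedule; during the stable phase the hillside coordinate keeps oscillating while the river coordinate advances, and the decay phase shrinks the oscillations and exposes the accumulated progress:

```python
import numpy as np

def wsd_lr(step, total_steps, warmup=100, decay_frac=0.1, peak=0.015):
    """Warmup-Stable-Decay schedule: linear warmup, constant plateau, linear decay."""
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup:
        return peak * step / warmup
    if step < decay_start:
        return peak
    return peak * (total_steps - step) / (total_steps - decay_start)

rng = np.random.default_rng(0)
sharpness, total_steps = 100.0, 2000
x, y = -5.0, 0.5                                  # start far up the river, off the river bed
for t in range(total_steps):
    lr = wsd_lr(t, total_steps)
    grad_x = -np.exp(-x) / (1.0 + np.exp(-x))     # gradient of the river term log(1 + e^{-x})
    grad_y = sharpness * y                        # gradient of the hill term 0.5 * sharpness * y^2
    x -= lr * (grad_x + 1.0 * rng.standard_normal())
    y -= lr * (grad_y + 1.0 * rng.standard_normal())
# During the plateau, y keeps oscillating (the loss looks stuck at an elevated level);
# during decay, y collapses toward 0 and the loss drop reveals the progress already made in x.
```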

Empirical Analysis and Developments

In alignment with their hypothesis, the authors empirically demonstrate several phenomena consistent with the river valley landscape. They employ a synthetic bi-gram dataset and real-world data to illustrate how loss curves under a constant learning rate can appear to have stopped converging until the learning rate begins to decay.
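The synthetic setting mentioned in the abstract is a simple bi-gram dataset; the sketch below shows one common way to generate such data (a first-order Markov source with a random transition matrix), offered only as an assumption since the paper's exact construction is not reproduced here:

```python
import numpy as np

def sample_bigram_data(vocab_size=16, seq_len=64, n_seqs=1000, seed=0):
    """Sample sequences from a random bi-gram (first-order Markov) source.

    Illustrative sketch: each next token depends only on the previous token,
    via a fixed row-stochastic transition matrix drawn once at the start.
    """
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(vocab_size, vocab_size))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    data = np.empty((n_seqs, seq_len), dtype=np.int64)
    data[:, 0] = rng.integers(vocab_size, size=n_seqs)
    for t in range(1, seq_len):
        for s in range(n_seqs):
            data[s, t] = rng.choice(vocab_size, p=probs[data[s, t - 1]])
    return data
```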

Furthermore, inspired by their theoretical insights, the authors propose a simplification of the WSD schedule named WSD-S. Rather than branching off the main trajectory and discarding each decay phase, WSD-S keeps a single main branch and simply resumes training from the decayed checkpoint. This reuse of decayed checkpoints makes more efficient use of compute when multiple checkpoints are needed from a single training run.
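The summary above describes WSD-S only at a high level; the following hypothetical sketch (its names and structure are assumptions, not the authors' code) shows how such a single-branch schedule might look: one initial warmup, a long stable plateau, and a short decay before each requested checkpoint, after which training resumes at the stable rate from the decayed checkpoint.

```python
def wsd_s_lr(step, checkpoint_steps, warmup=100, decay_frac=0.1, peak=1e-3):
    """Hypothetical WSD-S-style schedule with a single main branch.

    checkpoint_steps: sorted step counts at which a decayed checkpoint is wanted,
    e.g. [10_000, 20_000, 40_000]. Each segment ends with a linear decay; the
    next segment resumes at the stable rate (no re-warmup, no discarded branch).
    """
    if step < warmup:                                  # warm up only once
        return peak * step / warmup
    seg_start = warmup
    for seg_end in checkpoint_steps:
        if step < seg_end:
            decay_start = seg_end - int(decay_frac * (seg_end - seg_start))
            if step < decay_start:
                return peak                            # stable phase
            return peak * (seg_end - step) / (seg_end - decay_start)  # decay into the checkpoint
        seg_start = seg_end
    return peak                                        # keep extending the main branch
```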

Results and Comparative Analysis

The paper reports empirical results showing that WSD-S matches or exceeds other prevalent learning rate schedules, such as Cyclic-Cosine and the original WSD, when obtaining checkpoints for multiple compute budgets within a single run, for models ranging from 0.1B to 1.2B parameters. The advantage is most apparent in the validation loss measured after learning rate decay, where WSD-S attains lower values with reduced computational overhead.

Implications and Future Directions

The implications of these findings are substantial for both the practical efficiencies of machine learning training regimes and our understanding of network optimization dynamics. The decomposition of loss into river and hill components not only provides an analytical tool for understanding training dynamics but also influences practical strategies for scalable training.
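As a diagnostic, the toy loss introduced earlier can be split into the two components this decomposition refers to (a sketch under the toy-loss assumption, not the paper's formal definition): the loss of the iterate's projection onto the river bed, and the excess loss from sitting up the hillside.

```python
import numpy as np

def decompose_toy_loss(x, y, sharpness=100.0):
    """Split the toy river-valley loss into river and hill contributions.

    river_part: loss of the point projected onto the river bed (y = 0),
                i.e. the true optimization progress along the river.
    hill_part:  excess loss from oscillating up the valley walls, which the
                decay phase removes by shrinking y toward 0.
    """
    river_part = np.log1p(np.exp(-x))
    hill_part = 0.5 * sharpness * y**2
    return river_part, hill_part
```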

For future work, the research suggests several avenues, such as extending these ideas to more complex, multi-dimensional river valley landscapes beyond the simplified one-dimensional analysis. Additionally, incorporating the heterogeneity of token-level stochasticity could help design adaptive learning rate schedules that adjust dynamically to the data's uncertainty characteristics.

In summary, the investigation into WSD learning rates through the river valley landscape offers nuanced insights into learning dynamics for LLM pretraining. The WSD-S method appears particularly promising, providing a framework for adaptive, efficient, and resource-conscious training strategies.
