Optimal Linear Decay Learning Rate Schedules and Further Refinements (2310.07831v2)

Published 11 Oct 2023 in cs.LG, cs.AI, and stat.ML

Abstract: Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules. Our main technical contribution is a refined analysis of learning rate schedules for a wide class of optimization algorithms (including SGD). When considering only worst-case analysis, our theory predicts that the optimal choice is the linear decay schedule where the step-size is set proportional to 1 - t/T, where t is the current iteration and T is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to automatically yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule outperforms all commonly used default schedules including cosine annealing. Our adaptive schedule refinement method gives further improvements.

Overview of "When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement"

The paper "When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement" presents a nuanced analysis and development of learning rate schedules that seek to close the gap between theoretical recommendations and practical application in machine learning optimization. The authors introduce a method that refines learning rate schedules for optimization algorithms, focusing particularly on the performance of the last iterate, which is of significant importance since it aligns with what is routinely used in practice.

Key Contributions

The authors make several critical contributions that promote a deeper understanding and application of learning rate scheduling:

  1. Theoretical Analysis of Learning Rate Schedules: The paper rigorously analyzes learning rate schedules, with particular attention to the linear decay schedule, which is widely used in practice but has lacked a matching theoretical justification. The analysis targets the convergence of the last iterate rather than the average iterate, bridging theoretical optimality and practical efficacy.
  2. Refinement Framework: A pivotal contribution is the refinement framework, which uses gradient norms observed during an initial run to tailor the learning rate schedule to a specific task. This approach automatically recovers strategies such as learning rate warmup and rapid annealing toward the end of training (see the sketch after this list).
  3. Generality and Adaptability: The methodology applies to a wide class of optimization algorithms beyond Stochastic Gradient Descent (SGD), including modern variants such as Adam. Because schedules are adapted from empirical data, they can be customized without manual intervention.
  4. Comprehensive Empirical Evaluation: The authors validate the proposed framework across ten diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. They demonstrate that the linear decay schedule typically matches or surpasses conventional choices such as cosine annealing.
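
The paper derives its refined schedule from the observed gradient-norm sequence; the exact formula is not reproduced here. The sketch below only illustrates the two-pass workflow this implies (record gradient norms in a pilot run, then map them to per-step multipliers), using a deliberately simple stand-in heuristic in place of the paper's derivation. All names (record_grad_norms, refine_schedule, train_step) are hypothetical.

```python
import numpy as np

def record_grad_norms(train_step, num_steps):
    """Pilot run: train with a default (e.g. linear decay) schedule and
    record the gradient norm observed at every step."""
    norms = np.empty(num_steps)
    for t in range(num_steps):
        norms[t] = train_step(t)  # train_step returns ||g_t|| for step t
    return norms

def refine_schedule(grad_norms, smooth=0.05):
    """Illustrative stand-in, NOT the paper's formula: turn the recorded
    gradient norms into per-step multipliers in (0, 1]."""
    T = len(grad_norms)
    # Smooth the noisy per-step norms with a moving average.
    window = max(1, int(smooth * T))
    kernel = np.ones(window) / window
    smoothed = np.convolve(grad_norms, kernel, mode="same")
    # Down-weight steps with large gradient norms, renormalize to peak 1.
    weights = 1.0 / np.maximum(smoothed, 1e-12)
    return weights / weights.max()

# Usage sketch: multipliers[t] scales the base learning rate at step t
# in a second, "refined" training run.
```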

Numerical Results and Empirical Insights

The results reveal that linear decay schedules consistently perform at least as well as, and often better than, cosine annealing and other common schedules across a spectrum of tasks. Moreover, refined schedules derived from observed gradient norms yield additional improvements, suggesting that further gains are achievable through task-adaptive scheduling.

  • Notably, the refined schedules maintained improvements across various problem domains, including logistic regression problems and deep learning settings involving image classification and language modeling.
  • The experimental data also highlight an intriguing trait: schedules like $1/t$ or $1/\sqrt{t}$, common in theoretical discussions, fail to match the robustness of linear decay in practice, emphasizing the need for theoretical realignment (a small comparison of these schedule shapes is sketched below).
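
To make the comparison concrete, the snippet below (an illustrative sketch, not code from the paper) prints the learning-rate multipliers produced by linear decay, cosine annealing, and the theory-favored $1/\sqrt{t}$ schedule at a few points in training.

```python
import math

T = 1000  # total number of steps

def linear_decay(t):
    return 1.0 - t / T

def cosine_annealing(t):
    return 0.5 * (1.0 + math.cos(math.pi * t / T))

def inv_sqrt(t):
    return 1.0 / math.sqrt(t + 1)  # shifted to avoid division by zero at t = 0

for frac in (0.0, 0.25, 0.5, 0.75, 0.99):
    t = int(frac * T)
    print(f"t/T={frac:.2f}  linear={linear_decay(t):.3f}  "
          f"cosine={cosine_annealing(t):.3f}  1/sqrt={inv_sqrt(t):.3f}")
```

Note how $1/\sqrt{t}$ decays sharply at the start of training and never fully anneals at the end, whereas linear decay and cosine annealing both drive the step size toward zero as training finishes.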

Implications and Future Directions

Practically, this paper underscores a significant shift towards automating learning rate schedule determination, freeing practitioners from exhaustive hyper-parameter tuning. The refined method could seamlessly integrate into machine learning workflows, ensuring theoretical robustness along with empirical adaptability.

Theoretically, the alignment of practical implementations with theoretical guarantees extends the conversation around SGD and other optimizers, challenging current perspectives and encouraging analyses that account for the behavior of the last iterate.

Looking ahead, further investigation might explore fine-grained adaptation of such schedules in dynamic problem environments, or real-time modification of the schedule during training. Additionally, the refinement approach could prove influential in domains with highly irregular or evolving data.

Ultimately, the insights this paper offers can reshape standard practices, providing an efficient mechanism for enhancing learning algorithm performance across diverse machine learning applications.

Authors (4)
  1. Aaron Defazio
  2. Ashok Cutkosky
  3. Harsh Mehta
  4. Konstantin Mishchenko
Citations (9)