The Road Less Scheduled (2405.15682v4)

Published 24 May 2024 in cs.LG, cs.AI, math.OC, and stat.ML

Abstract: Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available at https://github.com/facebookresearch/schedule_free. Schedule-Free AdamW is the core algorithm behind our winning entry to the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge Self-Tuning track.

Summary

  • The paper introduces Schedule-Free learning, a novel method that replaces preset learning rate schedules with a specific iterate averaging and interpolation strategy to achieve optimal convergence.
  • The approach uses a flexible momentum-like parameter, enabling performance that matches or outperforms highly tuned schedules across 28 diverse machine learning tasks.
  • A new online-to-batch conversion theorem unifies prior averaging methods, ensuring optimal worst-case convergence rates for non-smooth convex functions without increasing computational or memory costs.

This paper, "The Road Less Scheduled" (2405.15682), addresses a long-standing practical challenge in machine learning optimization: the reliance on learning rate schedules that require knowing the total training duration ($T$) in advance. While convex optimization theory often suggests iterate averaging methods like Polyak-Ruppert achieve optimal convergence rates, empirical practice strongly favors using the last iterate of gradient descent with carefully tuned learning rate schedules (like cosine decay or step decay), despite theoretical gaps. The need to pre-specify $T$ for schedules is a significant practical limitation, as the optimal training time is often unknown.

The authors propose a novel optimization approach called Schedule-Free (SF) learning that aims to bridge this theory-practice gap. The core idea is to replace learning rate schedules entirely with a specific form of iterate averaging combined with an interpolated point for gradient evaluation. The method maintains three sequences of parameters:

  1. $z_t$: A fast-moving sequence updated similarly to standard SGD or Adam (e.g., $z_{t+1} = z_t - \gamma \nabla f(y_t, \zeta_t)$).
  2. $x_t$: An equal-weighted average of the $z_t$ sequence up to time $t$ (specifically, $x_{t+1} = (1-c_{t+1})x_t + c_{t+1}z_{t+1}$ with $c_{t+1} = 1/(t+1)$ in the basic version, or weighted by $\gamma_t^2$ during warmup).
  3. $y_t$: The point where the gradient is computed, defined as an interpolation between $z_t$ and $x_t$: $y_t = (1-\beta)z_t + \beta x_t$.

The method introduces a single momentum-like hyperparameter $\beta \in [0,1]$. Setting $\beta = 0$ recovers Polyak-Ruppert averaging (gradient evaluated at $z_t$), and $\beta = 1$ recovers Primal averaging (gradient evaluated at $x_t$). The authors find that values of $\beta$ around $0.9$ work well in practice, analogous to typical momentum values.
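To make the updates concrete, the following is a minimal sketch of the basic Schedule-Free SGD loop described above, with a constant learning rate and equal-weight averaging. The oracle `grad_fn`, the NumPy types, and all variable names are illustrative assumptions, not the reference implementation.

```python
# Minimal sketch of basic Schedule-Free SGD (no warmup weighting, no weight decay).
# `grad_fn(y)` is an assumed stochastic-gradient oracle; names are illustrative.
import numpy as np

def schedule_free_sgd(grad_fn, x0, lr=1.0, beta=0.9, steps=1000):
    z = np.asarray(x0, dtype=float).copy()  # fast-moving SGD-like sequence z_t
    x = z.copy()                            # running average x_t of the z iterates
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x     # interpolation point y_t for the gradient
        z = z - lr * grad_fn(y)             # SGD-style step on the z sequence
        c = 1.0 / (t + 1)                   # equal-weight averaging coefficient c_{t+1}
        x = (1.0 - c) * x + c * z           # x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1}
    return x                                # the averaged sequence is what gets evaluated
```

Setting `beta=0.0` makes `y` coincide with `z` (Polyak-Ruppert averaging), while `beta=1.0` makes it coincide with `x` (Primal averaging), matching the two limits described above.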

A key theoretical contribution is a new online-to-batch conversion theorem (Theorem 1 and the general Theorem 3 in Appendix A/B). This theorem shows how bounds on the "regret" of the $z_t$ sequence in online convex optimization translate directly into convergence guarantees for the averaged sequence $x_T$ in stochastic optimization. The theorem unifies previous online-to-batch results, including those corresponding to Polyak averaging, Primal averaging, and even the recent linear decay schedule theory. For Schedule-Free SGD, the theory shows that it achieves the optimal worst-case convergence rate of $\mathcal{O}(DG/\sqrt{T})$ for non-smooth convex functions, regardless of the choice of $\beta \in [0,1]$. This is a notable theoretical advantage over traditional momentum, which can worsen worst-case rates in this setting. The paper also explores generalizations using Bregman divergences (Appendix C) and shows potential for accelerated rates with optimistic gradient methods (Appendix D) and improved rates for strongly convex problems (Appendix E).
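As a rough illustration of the shape such conversions typically take (with uniform weighting; the paper's exact statement and conditions differ in its details), a regret bound for the $z_t$ iterates against linearized losses $g_t$ evaluated at $y_t$ yields a rate for the averaged point $x_T$:

```latex
% Hedged sketch of the online-to-batch shape; g_t is the stochastic gradient at y_t,
% x_star a minimizer. See the paper for the precise theorem and its conditions.
\[
  \operatorname{Regret}_T(x_\star) = \sum_{t=1}^{T} \langle g_t,\, z_t - x_\star \rangle,
  \qquad
  \mathbb{E}\bigl[f(x_T)\bigr] - f(x_\star) \;\le\; \frac{\mathbb{E}\bigl[\operatorname{Regret}_T(x_\star)\bigr]}{T},
\]
```

so the classical $\mathcal{O}(DG\sqrt{T})$ regret of online gradient descent translates into the optimal $\mathcal{O}(DG/\sqrt{T})$ rate for $x_T$.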

Empirically, Schedule-Free learning demonstrates state-of-the-art performance across 28 diverse problems, from convex logistic regression to large-scale deep learning tasks in computer vision, natural language processing, recommendation systems, and medical imaging. Experiments show that SF methods consistently match or outperform highly tuned learning rate schedules (cosine decay, linear decay) and significantly outperform traditional averaging methods. A key finding is that Schedule-Free methods track the Pareto frontier of loss versus training time during a single run, providing good performance at any stopping point without pre-specification. The experiments also show that Schedule-Free momentum ($\beta < 1$) enables the use of larger learning rates, which may contribute to faster convergence.

Practical Implementation Details:

  • Memory: Schedule-Free variants have the same memory requirements as their base optimizers. For example, Schedule-Free SGD needs to store $x$ and $z$, similar to how standard SGD with momentum stores the current parameters and a momentum buffer. The intermediate point $y_t$ does not need separate storage, as it can be computed on the fly from $x_t$ and $z_t$.
  • Batch Normalization: Models using BatchNorm layers require special handling. Since the gradient is computed at $y_t$, the running statistics (mean and variance) used by BatchNorm layers need to reflect the $x_t$ parameters at evaluation time. This can be achieved by running a small number of training batches through the model using the $x_t$ parameters before each evaluation (a PyTorch-style sketch follows this list).
  • Warmup: Learning rate warmup is still beneficial for Schedule-Free methods in deep learning. When using warmup, the authors found it improves performance to weight the averaging coefficient $c_{t+1}$ by the square of the current learning rate $\gamma_t$, using $c_{t+1} = \gamma_t^2 / \sum_{i=1}^{t} \gamma_i^2$. This weighting strategy is motivated by the general theory (Theorem 3); see the second sketch after this list.
  • Weight Decay: Weight decay can be applied to either the $y_t$ or the $z_t$ sequence. Applying decay to $y_t$ aligns with the interpretation of weight decay as an L2 regularization term added to the loss function.
  • Hyperparameters: Schedule-Free learning removes the need for schedule-specific hyperparameters (such as the total number of steps $T$ for cosine decay). It introduces the $\beta$ parameter, but the authors show that a default value of 0.9 works well across many tasks, requiring minimal tuning. However, users still need to tune the base learning rate $\gamma$ and weight decay, and the optimal values may differ from those used with scheduled optimizers.
  • An open-source implementation is provided for researchers and practitioners to use.
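The BatchNorm recalibration mentioned above can be done with a short pass over training data after the averaged parameters $x$ have been loaded into the model. Below is a hedged PyTorch-style sketch; `model`, `calib_loader`, and the prior parameter swap are illustrative assumptions, not the paper's or the library's API.

```python
# Hedged PyTorch-style sketch: refresh BatchNorm running statistics so they reflect
# the averaged parameters x before evaluation. Assumes the model's weights have
# already been set to x; `calib_loader` is an assumed DataLoader of training batches.
import torch

@torch.no_grad()
def refresh_bn_stats(model, calib_loader, num_batches=50):
    # Reset running mean/variance so stale statistics (accumulated at y_t) are discarded.
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            module.reset_running_stats()
    model.train()  # BatchNorm only updates running stats in train mode
    for i, (inputs, _targets) in enumerate(calib_loader):
        if i >= num_batches:
            break
        model(inputs)  # forward passes only; no gradients or optimizer steps
    model.eval()
```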
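And here is how the warmup-weighted averaging coefficient from the Warmup bullet can be folded into the basic loop shown earlier; the linear warmup shape, `grad_fn`, and all names are illustrative assumptions.

```python
# Hedged sketch: Schedule-Free SGD with learning-rate warmup, where the averaging
# coefficient is weighted by the squared learning rate,
# c_{t+1} = gamma_t^2 / sum_{i<=t} gamma_i^2, as described in the Warmup bullet.
import numpy as np

def schedule_free_sgd_warmup(grad_fn, x0, base_lr=1.0, beta=0.9,
                             warmup_steps=100, steps=1000):
    z = np.asarray(x0, dtype=float).copy()
    x = z.copy()
    lr_sq_sum = 0.0
    for t in range(1, steps + 1):
        lr = base_lr * min(1.0, t / warmup_steps)  # linear warmup, then constant (an assumption)
        y = (1.0 - beta) * z + beta * x            # gradient evaluation point y_t
        z = z - lr * grad_fn(y)
        lr_sq_sum += lr ** 2
        c = lr ** 2 / lr_sq_sum                    # warmup-weighted coefficient c_{t+1}
        x = (1.0 - c) * x + c * z
    return x
```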

In summary, Schedule-Free learning offers a promising alternative to learning rate schedules, providing competitive or superior empirical performance without the need to know the training duration beforehand. It is presented as a viable drop-in replacement for schedules in various machine learning tasks, with comparable computational and memory costs.
