Learning-Rate-Free Learning by D-Adaptation
(2301.07733v5)
Published 18 Jan 2023 in cs.LG, cs.AI, math.OC, and stat.ML
Abstract: D-Adaptation is an approach to automatically setting the learning rate which asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions, with no back-tracking or line searches, and no additional function value or gradient evaluations per step. Our approach is the first hyper-parameter free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. An open-source implementation is available.
The paper introduces D-Adaptation, a method that removes the need for manual learning-rate tuning by maintaining a dynamic lower bound on the distance-to-solution.
It leverages a variant of AdaGrad-Norm dual averaging and provides optimal convergence guarantees for convex Lipschitz objectives.
Empirical results across various benchmarks, including deep learning models, confirm that D-Adaptation matches or exceeds hand-tuned performance with minimal overhead.
D-Adaptation is a recipe for eliminating manual learning-rate tuning in first-order optimization.
Instead of trying to guess the optimal fixed step size
$\gamma_k = \frac{D}{G\sqrt{n}}$
(which requires the unknown distance-to-solution $D = \|x_0 - x_*\|$ and gradient bound $G$),
the method keeps a running lower bound $d_k \le D$ that is updated online from quantities that are already computed during optimization. With the weighted gradient sum $s_{k+1} = \sum_{i=0}^{k} \lambda_i g_i$, the estimate is
$\hat{d}_{k+1} = \frac{\|s_{k+1}\|^2 - \sum_{i=0}^{k} \lambda_i^2 \|g_i\|^2}{2\|s_{k+1}\|}.$
Intuition: if optimization is still far from the optimum, the numerator is positive and informative; once we get close, gradients start to cancel, the numerator may become negative, and the estimate is ignored.
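Why $\hat{d}_{k+1} \le D$ holds: a sketch of the argument for the plain subgradient form, where $x_{i+1} = x_i - \lambda_i g_i$ implies $x_0 - x_i = s_i$:
$\sum_{i=0}^{k} \lambda_i \langle g_i, s_i \rangle = \sum_{i=0}^{k} \lambda_i \langle g_i, x_0 - x_i \rangle \le \sum_{i=0}^{k} \lambda_i \langle g_i, x_0 - x_* \rangle = \langle s_{k+1}, x_0 - x_* \rangle \le \|s_{k+1}\|\, D,$
where the inequality uses convexity ($\langle g_i, x_i - x_* \rangle \ge f(x_i) - f(x_*) \ge 0$). The identity $2\lambda_i \langle g_i, s_i \rangle = \|s_{i+1}\|^2 - \|s_i\|^2 - \lambda_i^2 \|g_i\|^2$ then telescopes this into the closed-form estimate above.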
Set
$d_{k+1} = \max(d_k, \hat{d}_{k+1}).$
Because the sequence is non-decreasing and capped by the true D, it grows only when justified by progress and never explodes.
Return the weighted average iterate
$\hat{x}_n = \frac{\sum_{k=0}^{n} d_k x_k}{\sum_{k=0}^{n} d_k}.$
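Maintaining this average needs only one extra buffer; a minimal sketch in NumPy (names are illustrative, not from the paper's code):

```python
import numpy as np

def update_weighted_average(x_avg, w_sum, x, d):
    """Online update of the d_k-weighted iterate average in O(1) extra memory.

    Equivalent to recomputing sum(d_k * x_k) / sum(d_k) from scratch.
    """
    w_sum += d
    x_avg += (d / w_sum) * (x - x_avg)   # convex combination toward the new iterate
    return x_avg, w_sum

# Usage inside the training loop:
#   x_avg, w_sum = np.zeros_like(x), 0.0
#   ... after each step:  x_avg, w_sum = update_weighted_average(x_avg, w_sum, x, d)
```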
Although the proofs are for deterministic convex problems, the authors plug the same logic into stochastic optimizers and show strong empirical results. A simplified SGD-style loop (a sketch: grad and gamma below are placeholder helpers, not the paper's reference implementation):
```python
import numpy as np

d, r = d0, 0.0                     # d: running lower bound on D; r: accumulated numerator
s = np.zeros_like(x)               # weighted sum of past gradients
for k, batch in enumerate(data_loader):
    g = grad(x, batch)             # stochastic (sub)gradient at the current iterate
    lam = d * gamma(k)             # gamma(k): external LR schedule (warm-up, cosine, etc.)
    r += lam * g.dot(s)            # accumulate lam_k * <g_k, s_k> using the pre-update s
    s += lam * g
    x -= lam * g                   # SGD step; momentum handled outside
    # Distance lower-bound update (Option II)
    hat_d = r / np.abs(s).sum()    # l1 norm for the vector case
    d = max(d, hat_d)
```
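Accumulating r with the pre-update s matches the inner-product form of the bound above ($\sum_k \lambda_k \langle g_k, s_k \rangle \le \|s_{k+1}\| D$), so with the $\ell_2$ norm in the denominator $\hat{d}$ stays a certified under-estimate of $D$ in the deterministic convex setting; the $\ell_1$ denominator is the form used for the coordinate-wise (Adam-style) variants.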
Adam variant: replace plain norms by element-wise EMA-scaled ones; Appendix F shows how to derive the moving-average corrections so that the theory still holds.
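The exact moving-average corrections are in that appendix; purely as an illustration of where the element-wise scaling enters, a minimal sketch (assumptions: beta1, beta2, eps, and grad_fn are placeholder names, and this is not the paper's exact algorithm):

```python
import numpy as np

def dadapt_adam_sketch(x, grad_fn, steps, d0=1e-6, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative D-Adapted Adam-style loop, not the paper's exact algorithm."""
    m = np.zeros_like(x)              # first-moment EMA
    v = np.zeros_like(x)              # second-moment EMA
    s = np.zeros_like(x)              # weighted gradient sum
    d, r = d0, 0.0
    for _ in range(steps):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        denom = np.sqrt(v) + eps      # element-wise EMA scaling, as in Adam
        r += d * g.dot(s / denom)     # inner product taken in the scaled geometry
        s += d * g
        x = x - d * m / denom         # Adam-style step whose magnitude is the adapted d
        d = max(d, r / (np.abs(s / denom).sum() + eps))  # l1-style lower-bound update
    return x
```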
Implementation notes
Initial guess: d0 can be extremely small (1e-8 to 1e-6); experiments show the method is insensitive to this choice.
Schedule handling: treat 1.0 as the "base" LR, then let D-Adaptation scale it; warm-up and decay policies remain unchanged (see the sketch after these notes).
Cost: one extra scalar/vector addition per step plus storage of s and d; negligible compared to back-prop.
Numerical stability: when using half precision, keep d, s, r in FP32 to avoid underflow.
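To make the schedule note concrete, a small sketch (cosine_with_warmup is a hypothetical helper, not from the paper or its library):

```python
import math

def cosine_with_warmup(k, total_steps, warmup_steps=500):
    """External schedule gamma(k) peaking at 1.0, the 'base' LR that D-Adaptation scales."""
    if k < warmup_steps:
        return (k + 1) / warmup_steps                      # linear warm-up to 1.0
    progress = (k - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay to 0

# The per-step LR actually applied is the adapted magnitude times the schedule:
#   lam = d * cosine_with_warmup(k, total_steps)
```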
Across all tasks except ViT-tiny, training with the automatically found learning rate matches or exceeds carefully grid-searched baselines (Table 3 in the paper); the final adapted LR is typically within a factor of 2 of the hand-tuned value.
✓ Removes LR grid search; no new tunables except the tiny d0.
✓ No extra gradient or function evaluations; works with any external schedule.
✓ The update is cheap and easy to retrofit into existing training loops.
✗ Theory is limited to convex Lipschitz objectives; the deep-learning success is empirical.
✗ The gradient-descent form has an extra log factor, so the dual-averaging (DA) version is preferred.
✗ When schedules or architectures are extremely sensitive to the LR (e.g. ViT-tiny), performance can lag.