
Learning-Rate-Free Learning by D-Adaptation (2301.07733v5)

Published 18 Jan 2023 in cs.LG, cs.AI, math.OC, and stat.ML

Abstract: D-Adaptation is an approach to automatically setting the learning rate which asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions, with no back-tracking or line searches, and no additional function value or gradient evaluations per step. Our approach is the first hyper-parameter free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. An open-source implementation is available.

Citations (66)

Summary

  • The paper introduces D-Adaptation, a method that removes the need for manual learning-rate tuning by maintaining a dynamic lower bound on the distance-to-solution.
  • It leverages a variant of AdaGrad-Norm dual averaging and provides optimal convergence guarantees for convex Lipschitz objectives.
  • Empirical results across various benchmarks, including deep learning models, confirm that D-Adaptation matches or exceeds hand-tuned performance with minimal overhead.

D-Adaptation is a recipe for eliminating manual learning-rate tuning in first-order optimization. Instead of trying to guess the optimal fixed step size

$\displaystyle \gamma_k = \frac{D}{G\sqrt{n}}$

(which requires the unknown distance-to-solution $D = \|x_0 - x_*\|$ and the gradient bound $G$), the method keeps a running lower bound $d_k \le D$ that is updated online from quantities already computed during optimization.
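
For context, this is the step size that minimizes the textbook bound for averaged (sub)gradient descent on a convex $G$-Lipschitz objective; the following one-line derivation is standard and is included here for reference, not taken from the paper itself:

$\displaystyle f(\bar x_n) - f_* \;\le\; \frac{D^2}{2\gamma n} + \frac{\gamma G^2}{2}, \qquad \text{minimized at } \gamma^* = \frac{D}{G\sqrt{n}}, \text{ giving } f(\bar x_n) - f_* \le \frac{DG}{\sqrt{n}}.$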

=====================================================================

Key ideas

  1. Start from AdaGrad-Norm dual averaging (DA), a projection-free variant of AdaGrad that uses $\displaystyle \gamma_k = \frac{1}{\sqrt{\sum_{i=0}^{k-1}\|g_i\|^2}}$.
  2. Maintain the running sum of scaled gradients $s_{k+1} = s_k + d_k g_k$ and the iterate $x_{k+1} = x_0 - \gamma_{k+1} s_{k+1}$.
  3. After every update, derive a data-dependent lower bound on $D$:

     $\displaystyle \hat d_{k+1} = \frac{\gamma_{k+1}\|s_{k+1}\|^2 - \sum_{i \le k} \gamma_i d_i^2 \|g_i\|^2}{2\|s_{k+1}\|}$    (Option I)

     or

     $\displaystyle \hat d_{k+1} = \frac{\sum_{i \le k} \gamma_i d_i \langle g_i, s_i \rangle}{\|s_{k+1}\|}$    (Option II)

     Intuition: if optimization is still far from the optimum, the right-hand side is positive and informative; once we get close, it may become negative and is ignored.
  4. Set $d_{k+1} = \max(d_k, \hat d_{k+1})$. Because the sequence is non-decreasing and capped by the true $D$, it grows only when justified by progress and never explodes.
  5. Return the weighted average iterate $\displaystyle \hat x_n = \frac{\sum_{k=0}^{n} d_k x_k}{\sum_{k=0}^{n} d_k}$. (A sketch of the full loop follows below.)
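
To make the recipe concrete, here is a minimal NumPy sketch of the five steps above, using the Option I lower bound. It follows this summary rather than the paper's own pseudocode; the `grad` callable, the exact index conventions, and the zero-division guards are illustrative assumptions.

import numpy as np

def d_adapted_dual_averaging(grad, x0, n_steps, d0=1e-6):
    # Sketch of D-Adaptation with AdaGrad-Norm dual averaging (Option I).
    # `grad(x)` is assumed to return a subgradient of the convex objective at x.
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    s = np.zeros_like(x)          # running sum of scaled gradients s_k
    d = d0                        # current lower bound on the distance D
    grad_sq_sum = 0.0             # sum of ||g_i||^2 (AdaGrad-Norm denominator)
    correction = 0.0              # sum of gamma_i * d_i^2 * ||g_i||^2 (Option I term)
    avg_num = np.zeros_like(x)    # numerator of the d_k-weighted average iterate
    avg_den = 0.0

    for k in range(n_steps):
        g = grad(x)
        avg_num += d * x          # accumulate x_k with weight d_k
        avg_den += d

        grad_sq_sum += g @ g
        gamma = 1.0 / np.sqrt(grad_sq_sum) if grad_sq_sum > 0 else 0.0   # gamma_{k+1}
        correction += gamma * d**2 * (g @ g)
        s = s + d * g                     # s_{k+1} = s_k + d_k g_k
        x = x0 - gamma * s                # x_{k+1} = x_0 - gamma_{k+1} s_{k+1}

        s_norm = np.linalg.norm(s)
        if s_norm > 0:
            d_hat = (gamma * s_norm**2 - correction) / (2 * s_norm)   # Option I bound
            d = max(d, d_hat)             # non-decreasing, capped by the true D

    return avg_num / avg_den              # d_k-weighted average iterate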

=====================================================================

Theoretical guarantees

Convex, $G$-Lipschitz objective:

  • Asymptotic rate (dual-averaging version): $f(\hat x_n) - f_* = \mathcal{O}\bigl(DG/\sqrt{n}\bigr)$, which is optimal and parameter-free (no extra log factor).
  • Non-asymptotic bound: $\mathcal{O}\bigl(DG\sqrt{\log_2(D/d_0)}/\sqrt{n}\bigr)$.
  • A gradient-descent formulation also exists but incurs an extra $\log(n)$ factor.
  • Coordinate-wise extension (D-Adapted AdaGrad) achieves $\mathcal{O}\bigl(p\,G_\infty D_\infty/\sqrt{n}\bigr)$ without knowing $D_\infty$.
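
For concreteness, the headline asymptotic claim can be written as the limit statement below (a paraphrase of the first bullet, not the paper's exact theorem):

$\displaystyle \limsup_{n\to\infty}\; \frac{f(\hat x_n) - f_*}{DG/\sqrt{n}} \;\le\; C$

for a universal constant $C$, i.e. the same $DG/\sqrt{n}$ rate that fixed-step subgradient descent achieves with the oracle step size $\gamma = D/(G\sqrt{n})$, here obtained without knowing $D$ or $G$.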

=====================================================================

Practical variants for deep learning

Although the proofs are for deterministic convex problems, the authors plug the same logic into stochastic optimizers and show strong empirical results.

SGD variant (Algorithm 1 in the paper):

d, s, r = d0, 0.0, 0.0
for k, (x, g) in enumerate(data_loader):
    lam = d * gamma_k   # gamma_k comes from an external LR schedule (warm-up, cosine, etc.; see sketch below)
    s += lam * g
    x -= lam * g        # SGD step; momentum handled outside

    # Distance lower-bound update (Option II)
    r += lam * g.dot(s)
    hat_d = r / abs(s).sum()   # ℓ1 norm for the vector case
    d = max(d, hat_d)
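
The gamma_k factor above is whatever base schedule you would normally use. As one illustrative choice (an assumption, not the paper's prescription), a linear warm-up followed by cosine decay that peaks at the base value 1.0 could be supplied as:

import math

def gamma_schedule(k, warmup_steps=1_000, total_steps=100_000):
    # Hypothetical external schedule for gamma_k: linear warm-up, then cosine decay to 0.
    if k < warmup_steps:
        return (k + 1) / warmup_steps
    progress = (k - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

Because the schedule tops out at 1.0, D-Adaptation still controls the absolute scale through d, matching the "treat 1.0 as the base LR" convention in the notes below.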

Adam variant: replace plain norms by element-wise EMA–scaled ones; Appendix F shows how to derive the moving-average corrections so that the theory still holds.

Implementation notes

  • Initial guess: d0 can be extremely small (1e-8 – 1e-6). Experiments show the method is insensitive to this choice.
  • Schedule handling: treat 1.0 as the β€œbase” LR, then let D-Adaptation scale it β€” warm-up and decay policies remain unchanged.
  • Cost: one extra scalar/vector addition per step plus storage of s and d; negligible compared to back-prop.
  • Numerical stability: when using half precision, keep d, s, r in FP32 to avoid underflow (a small sketch follows this list).
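
A minimal PyTorch-style sketch of that bookkeeping, assuming fp16 gradients and a single flattened parameter vector (both assumptions made here for illustration; the released optimizer handles this internally):

import torch

# D-Adaptation state kept in fp32 even when parameters/gradients are fp16.
num_params = 10_000                                # placeholder size for the flattened model
s = torch.zeros(num_params, dtype=torch.float32)   # running sum of scaled gradients
r = torch.zeros((), dtype=torch.float32)           # Option II accumulator
d = torch.tensor(1e-6, dtype=torch.float32)        # d0 as in the notes above

def update_distance_bound(g_fp16, lam):
    # Upcast the gradient before accumulating so the sums cannot underflow in fp16.
    global d, r
    g = g_fp16.float()
    s.add_(lam * g)
    r = r + lam * torch.dot(g, s)
    d = torch.maximum(d, r / s.abs().sum())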

=====================================================================

Empirical evidence

Benchmarks include:

  • 12 LIBSVM logistic-regression datasets (convex)
  • CIFAR-10/100, ImageNet (ResNet/DenseNet)
  • Vision Transformer (ViT-tiny)
  • LLMs: IWSLT14 LSTM, RoBERTa-base MLM, GPT-small
  • COCO object detection (Faster-RCNN)
  • fastMRI reconstruction (VarNet)
  • DLRM recommendation (Criteo CTR)

Across all tasks except ViT-tiny, D-Adaptation matches or exceeds carefully grid-searched baselines (Table 3 in the paper), and the final adapted LR is typically within a factor of 2 of the hand-tuned value.

=====================================================================

Trade-offs & limitations

✔ Removes LR grid search; no new tunables except a tiny d0.
✔ No extra gradient or function evaluations; works with any external schedule.
✔ The update is cheap and easy to retrofit into existing training loops.

✗ Theory is limited to convex Lipschitz objectives; the deep-learning success is empirical.
✗ The gradient-descent form has an extra log factor; the DA version is preferred.
✗ When schedules or architectures are extremely sensitive to the LR (e.g. ViT-tiny), performance can lag.

=====================================================================

How to use in your codebase

  1. Drop-in PyTorch implementation is available at https://github.com/facebookresearch/dadaptation
  2. Replace your optimizer with DAdaptSGD or DAdaptAdam:

    from dadaptation import DAdaptAdam   # pip package: dadaptation

    opt = DAdaptAdam(model.parameters(),
                     lr=1.0,         # keep schedule as before
                     d0=1e-6,        # safe default
                     weight_decay=0) # use decoupled decay
  3. Keep existing LR schedulers (warm-up, cosine, milestones). D-Adaptation automatically rescales the base LR on-the-fly.
  4. Monitor opt.param_groups[0]["d"] or log the internal adapt_lr to see how quickly the method converges to the effective learning rate; a combined optimizer-plus-scheduler sketch follows below.
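
For example, a training-loop skeleton combining the optimizer with a standard PyTorch scheduler; the model, criterion, train_loader, and num_steps names are placeholders, and the logged "d" key follows the naming in item 4 (check your installed version for the exact attribute it exposes):

import torch
from dadaptation import DAdaptAdam

opt = DAdaptAdam(model.parameters(), lr=1.0, d0=1e-6)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=num_steps)

for step, (inputs, targets) in enumerate(train_loader):
    loss = criterion(model(inputs), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()    # D-Adaptation updates d internally
    sched.step()  # external schedule still scales the base LR of 1.0
    if step % 100 == 0:
        print(step, opt.param_groups[0].get("d"))   # adapted distance estimate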

=====================================================================

Take-away

D-Adaptation provides a learning-rate-free extension to AdaGrad, SGD and Adam that:

  • is theoretically optimal for convex Lipschitz problems,
  • requires virtually no tuning,
  • adds negligible overhead,
  • and matches hand-tuned performance on a wide spectrum of modern deep-learning workloads.