Learning-Rate-Free Learning by D-Adaptation
(2301.07733v5)
Published 18 Jan 2023 in cs.LG, cs.AI, math.OC, and stat.ML
Abstract: D-Adaptation is an approach to automatically setting the learning rate which asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions, with no back-tracking or line searches, and no additional function value or gradient evaluations per step. Our approach is the first hyper-parameter free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. An open-source implementation is available.
The paper introduces D-Adaptation, a method that removes the need for manual learning-rate tuning by maintaining a dynamic lower bound on the distance-to-solution.
It leverages a variant of AdaGrad-Norm dual averaging and provides optimal convergence guarantees for convex Lipschitz objectives.
Empirical results across various benchmarks, including deep learning models, confirm that D-Adaptation matches or exceeds hand-tuned performance with minimal overhead.
D-Adaptation is a recipe for eliminating manual learning-rate tuning in first-order optimization.
Instead of trying to guess the optimal fixed step size
$\gamma_k = \frac{D}{G\sqrt{n}}$
(which requires the unknown distance-to-solution $D = \|x_0 - x_*\|$ and gradient bound $G$),
the method keeps a running lower bound $d_k \le D$ that is updated online from quantities that are already computed during optimization. With the weighted gradient sum $s_{k+1} = \sum_{i=0}^{k} \lambda_i g_i$, the estimate is
$\hat{d}_{k+1} = \frac{\|s_{k+1}\|^2 - \sum_{i=0}^{k} \lambda_i^2 \|g_i\|^2}{2\|s_{k+1}\|}.$
Intuition: if optimization is still far from the optimum, the numerator is positive and informative; once we get close, gradients start to cancel, the numerator may become negative, and the estimate is ignored.
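Why $\hat{d}_{k+1} \le D$ holds: a sketch of the argument for the plain subgradient form, where $x_{i+1} = x_i - \lambda_i g_i$ implies $x_0 - x_i = s_i$:
$\sum_{i=0}^{k} \lambda_i \langle g_i, s_i \rangle = \sum_{i=0}^{k} \lambda_i \langle g_i, x_0 - x_i \rangle \le \sum_{i=0}^{k} \lambda_i \langle g_i, x_0 - x_* \rangle = \langle s_{k+1}, x_0 - x_* \rangle \le \|s_{k+1}\|\, D,$
where the inequality uses convexity ($\langle g_i, x_i - x_* \rangle \ge f(x_i) - f(x_*) \ge 0$). The identity $2\lambda_i \langle g_i, s_i \rangle = \|s_{i+1}\|^2 - \|s_i\|^2 - \lambda_i^2 \|g_i\|^2$ then telescopes this into the closed-form estimate above.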
Set
$d_{k+1} = \max(d_k, \hat{d}_{k+1}).$
Because the sequence is non-decreasing and capped by the true D, it grows only when justified by progress and never explodes.
Return the weighted average iterate
$\hat{x}_n = \frac{\sum_{k=0}^{n} d_k x_k}{\sum_{k=0}^{n} d_k}.$
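Maintaining this average needs only one extra buffer; a minimal sketch in NumPy (names are illustrative, not from the paper's code):

```python
import numpy as np

def update_weighted_average(x_avg, w_sum, x, d):
    """Online update of the d_k-weighted iterate average in O(1) extra memory.

    Equivalent to recomputing sum(d_k * x_k) / sum(d_k) from scratch.
    """
    w_sum += d
    x_avg += (d / w_sum) * (x - x_avg)   # convex combination toward the new iterate
    return x_avg, w_sum

# Usage inside the training loop:
#   x_avg, w_sum = np.zeros_like(x), 0.0
#   ... after each step:  x_avg, w_sum = update_weighted_average(x_avg, w_sum, x, d)
```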
Although the proofs are for deterministic convex problems, the authors plug the same logic into stochastic optimizers and show strong empirical results. A simplified SGD-style loop (a sketch: grad and gamma below are placeholder helpers, not the paper's reference implementation):
```python
import numpy as np

d, r = d0, 0.0                     # d: running lower bound on D; r: accumulated numerator
s = np.zeros_like(x)               # weighted sum of past gradients
for k, batch in enumerate(data_loader):
    g = grad(x, batch)             # stochastic (sub)gradient at the current iterate
    lam = d * gamma(k)             # gamma(k): external LR schedule (warm-up, cosine, etc.)
    r += lam * g.dot(s)            # accumulate lam_k * <g_k, s_k> using the pre-update s
    s += lam * g
    x -= lam * g                   # SGD step; momentum handled outside
    # Distance lower-bound update (Option II)
    hat_d = r / np.abs(s).sum()    # l1 norm for the vector case
    d = max(d, hat_d)
```
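Accumulating r with the pre-update s matches the inner-product form of the bound above ($\sum_k \lambda_k \langle g_k, s_k \rangle \le \|s_{k+1}\| D$), so with the $\ell_2$ norm in the denominator $\hat{d}$ stays a certified under-estimate of $D$ in the deterministic convex setting; the $\ell_1$ denominator is the form used for the coordinate-wise (Adam-style) variants.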
Adam variant: replace plain norms by element-wise EMA-scaled ones; Appendix F shows how to derive the moving-average corrections so that the theory still holds.
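The exact moving-average corrections are in that appendix; purely as an illustration of where the element-wise scaling enters, a minimal sketch (assumptions: beta1, beta2, eps, and grad_fn are placeholder names, and this is not the paper's exact algorithm):

```python
import numpy as np

def dadapt_adam_sketch(x, grad_fn, steps, d0=1e-6, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative D-Adapted Adam-style loop, not the paper's exact algorithm."""
    m = np.zeros_like(x)              # first-moment EMA
    v = np.zeros_like(x)              # second-moment EMA
    s = np.zeros_like(x)              # weighted gradient sum
    d, r = d0, 0.0
    for _ in range(steps):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        denom = np.sqrt(v) + eps      # element-wise EMA scaling, as in Adam
        r += d * g.dot(s / denom)     # inner product taken in the scaled geometry
        s += d * g
        x = x - d * m / denom         # Adam-style step whose magnitude is the adapted d
        d = max(d, r / (np.abs(s / denom).sum() + eps))  # l1-style lower-bound update
    return x
```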
Implementation notes
Initial guess: d0 can be extremely small (1e-8 to 1e-6); experiments show the method is insensitive to this choice.
Schedule handling: treat 1.0 as the "base" LR, then let D-Adaptation scale it; warm-up and decay policies remain unchanged (see the sketch after these notes).
Cost: one extra scalar/vector addition per step plus storage of s and d; negligible compared to back-prop.
Numerical stability: when using half precision, keep d, s, r in FP32 to avoid underflow.
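To make the schedule note concrete, a small sketch (cosine_with_warmup is a hypothetical helper, not from the paper or its library):

```python
import math

def cosine_with_warmup(k, total_steps, warmup_steps=500):
    """External schedule gamma(k) peaking at 1.0, the 'base' LR that D-Adaptation scales."""
    if k < warmup_steps:
        return (k + 1) / warmup_steps                      # linear warm-up to 1.0
    progress = (k - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay to 0

# The per-step LR actually applied is the adapted magnitude times the schedule:
#   lam = d * cosine_with_warmup(k, total_steps)
```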
Across all tasks except ViT-tiny, training with the automatically found learning rate matches or exceeds carefully grid-searched baselines (Table 3 in the paper); the final adapted LR is typically within a factor of 2 of the hand-tuned value.
✓ Removes LR grid search; no new tunables except the tiny d0.
✓ No extra gradient or function evaluations; works with any external schedule.
✓ The update is cheap and easy to retrofit into existing training loops.
✗ Theory is limited to convex Lipschitz objectives; the deep-learning success is empirical.
✗ The gradient-descent form has an extra log factor, so the dual-averaging (DA) version is preferred.
✗ When schedules or architectures are extremely sensitive to the LR (e.g. ViT-tiny), performance can lag.