Adaptive Gradient Descent Without Descent
- The paper introduces a framework that dynamically adjusts step sizes based solely on local curvature information, eliminating the need for global hyperparameters.
- It details the use of Lyapunov-type energy functions to guarantee global convergence despite non-monotonic decreases in function values.
- The study demonstrates robustness in both deterministic and stochastic settings, offering theoretical guarantees and practical advantages over traditional methods.
Adaptive gradient descent without descent refers to a spectrum of algorithms that dynamically adjust step sizes or update rules based solely on local geometric information, often without recourse to function values, line search, or manual hyperparameter selection, and in some variants without traditional gradient-based descent directions. In this paradigm, “without descent” does not imply a lack of progress towards minimization; rather, it refers to dropping the classical requirement of monotonic decrease in function value at every step (and, in some variants, the explicit use of gradient descent directions for parameter updates). These techniques span deterministic, stochastic, and meta-learned frameworks (including non-Euclidean settings), and have demonstrated theoretical and empirical efficacy for a range of convex, nonconvex, and high-dimensional problems.
1. Local Adaptivity via Gradient Differences
Adaptive gradient descent without descent circumvents the need for global smoothness constants and line search by estimating step sizes from local curvature. Central to these methods is the use of gradient differences between successive iterates,

$$L_k = \frac{\|\nabla f(x_k) - \nabla f(x_{k-1})\|}{\|x_k - x_{k-1}\|}.$$

This local Lipschitz estimate then determines a stepsize upper bound,

$$\lambda_k \le \frac{\|x_k - x_{k-1}\|}{2\,\|\nabla f(x_k) - \nabla f(x_{k-1})\|} = \frac{1}{2 L_k},$$

with an additional “growth control” constraint to avoid abrupt increases, such as

$$\lambda_k \le \sqrt{1 + \theta_{k-1}}\,\lambda_{k-1}, \qquad \theta_{k-1} = \frac{\lambda_{k-1}}{\lambda_{k-2}}.$$

These two rules, (1) not increasing the step size too quickly and (2) not overstepping the local curvature, guarantee that the method is responsive to the geometry of the function. The iterates are updated as

$$x_{k+1} = x_k - \lambda_k \nabla f(x_k), \qquad \lambda_k = \min\!\left\{\sqrt{1 + \theta_{k-1}}\,\lambda_{k-1},\ \frac{\|x_k - x_{k-1}\|}{2\,\|\nabla f(x_k) - \nabla f(x_{k-1})\|}\right\},$$

without any direct use of function values or expensive line search procedures (Malitsky et al., 2019, Malitsky et al., 2023).
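A minimal Python sketch of this update rule is given below; the toy quadratic objective, the initial step size, and the large initial value of $\theta$ are illustrative choices rather than prescriptions from the cited papers.

```python
import numpy as np

def adgd(grad, x0, iters=500, lam0=1e-6):
    """Adaptive gradient descent without descent (sketch of the two rules above).

    Returns the list of iterates and the step sizes used, so convergence and
    step-size behavior can be inspected afterwards.
    """
    xs, lams = [x0.copy()], [lam0]
    g_prev = grad(xs[0])
    xs.append(xs[0] - lam0 * g_prev)   # first step with a tiny initial step size
    theta_prev = 1e12                  # effectively disables the growth cap at the first step
    for _ in range(iters):
        x, x_prev = xs[-1], xs[-2]
        g = grad(x)
        diff_x = np.linalg.norm(x - x_prev)
        diff_g = np.linalg.norm(g - g_prev)
        # Rule (2): stay below the local curvature bound 1 / (2 L_k).
        curv_bound = 0.5 * diff_x / diff_g if diff_g > 0 else np.inf
        # Rule (1): do not let the step size grow too quickly.
        growth_bound = np.sqrt(1.0 + theta_prev) * lams[-1]
        lam = min(growth_bound, curv_bound)
        xs.append(x - lam * g)
        theta_prev = lam / lams[-1]
        lams.append(lam)
        g_prev = g
    return xs, lams

# Toy usage on an ill-conditioned quadratic f(x) = 0.5 * x^T A x (illustrative only).
A = np.diag([1.0, 10.0, 100.0])
xs, _ = adgd(lambda x: A @ x, x0=np.ones(3))
print("final iterate:", xs[-1])
```

Note that no function values are ever computed: the step size is driven entirely by iterate and gradient differences.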
2. Energy Functions and Non-Monotone Descent
A defining aspect of these adaptive methods is reliance on Lyapunov-type energy functions, rather than strict monotonic decrease in function value. Instead of ensuring that $f(x_{k+1}) \le f(x_k)$, a typical guarantee is for a compound measure, e.g.,

$$\Phi_k = \|x_k - x^\star\|^2 + \tfrac{1}{2}\|x_k - x_{k-1}\|^2 + 2\lambda_{k-1}\theta_{k-1}\bigl(f(x_{k-1}) - f(x^\star)\bigr),$$

to decrease at each iteration, where $x^\star$ is a minimizer. This telescoping structure ensures global convergence and permits occasional non-monotonic behavior of $f(x_k)$ while maintaining global progress (Malitsky et al., 2019, Aujol et al., 18 Sep 2025).
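The following sketch, which reuses the `adgd` routine and quadratic from the sketch above, tracks both the raw objective and a compound energy of the representative form displayed here; the constants in the energy are illustrative and may differ from the exact Lyapunov functions established in the cited works.

```python
import numpy as np

# Reuses `adgd` and the quadratic A from the previous sketch; there x* = 0 and f(x*) = 0.
f = lambda x: 0.5 * x @ (A @ x)
xs, lams = adgd(lambda x: A @ x, x0=np.ones(3), iters=60)

for k in range(2, len(xs), 10):
    theta = lams[k - 1] / lams[k - 2]
    energy = (np.linalg.norm(xs[k]) ** 2                       # ||x_k - x*||^2
              + 0.5 * np.linalg.norm(xs[k] - xs[k - 1]) ** 2   # (1/2) ||x_k - x_{k-1}||^2
              + 2.0 * lams[k - 1] * theta * f(xs[k - 1]))      # 2 lam_{k-1} theta_{k-1} (f(x_{k-1}) - f*)
    # The compound energy, not f itself, is the quantity monitored for decrease.
    print(f"k={k:3d}  f(x_k)={f(xs[k]):.3e}  energy={energy:.3e}")
```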
In stochastic variants, the Lyapunov function extends to expectation over the oracle randomness, with additional error control terms due to gradient noise, ensuring that the decay (in expectation) drives convergence (Aujol et al., 18 Sep 2025).
3. Stochastic Adaptation and Oracle-Driven Step-Size
Recent advances adapt these principles to stochastic optimization, where only unbiased gradient estimators from a first-order oracle are accessible. The key idea is to update the step-size using (i) the difference between current and previous iterates and (ii) the difference of their stochastic gradients, both evaluated using the same mini-batch realization:

$$\lambda_k = \min\!\left\{\sqrt{1 + \theta_{k-1}}\,\lambda_{k-1},\ \frac{\|x_k - x_{k-1}\|}{2\,\|g_k(x_k) - g_k(x_{k-1})\|}\right\},$$

where $g_k$ denotes the stochastic gradient for the fixed batch or sample drawn at iteration $k$ (Aujol et al., 18 Sep 2025). Additional variants incorporate decay factors to ensure diminishing step-sizes in accordance with theoretical convergence of stochastic approximation, yielding rates of $\mathcal{O}(1/k)$ in the strongly convex setting.
This fully removes dependence on any global parameter or hyperparameter (e.g., the Lipschitz constant), as all adaptation is local and data-driven.
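A minimal stochastic sketch is shown below, assuming a finite-sum least-squares objective; the mini-batch size, decay factor, and synthetic data are illustrative choices, and the routine illustrates only the same-batch adaptation principle rather than the exact algorithm of the cited paper.

```python
import numpy as np

def stochastic_adgd(grad_batch, x0, n_samples, iters=2000, batch=32,
                    lam0=1e-6, decay=0.0, seed=0):
    """Stochastic sketch: the local curvature estimate uses the SAME mini-batch at
    x_k and x_{k-1}; `decay` optionally enforces a diminishing step-size."""
    rng = np.random.default_rng(seed)
    x_prev = x0.copy()
    idx = rng.integers(n_samples, size=batch)
    x = x_prev - lam0 * grad_batch(x_prev, idx)
    lam_prev, theta_prev = lam0, 1e12
    for k in range(1, iters + 1):
        idx = rng.integers(n_samples, size=batch)                  # one fixed batch per iteration
        g, g_prev = grad_batch(x, idx), grad_batch(x_prev, idx)    # same batch, two points
        diff_x = np.linalg.norm(x - x_prev)
        diff_g = np.linalg.norm(g - g_prev)
        curv_bound = 0.5 * diff_x / diff_g if diff_g > 0 else np.inf
        growth_bound = np.sqrt(1.0 + theta_prev) * lam_prev
        lam = min(growth_bound, curv_bound) / (1.0 + decay * k)    # optional decay factor
        x_prev, x = x, x - lam * g
        theta_prev, lam_prev = lam / lam_prev, lam
    return x

# Toy usage: mini-batch least squares on synthetic data (illustrative only).
rng = np.random.default_rng(1)
X_data, y_data = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
gb = lambda w, idx: X_data[idx].T @ (X_data[idx] @ w - y_data[idx]) / len(idx)
w_hat = stochastic_adgd(gb, np.zeros(5), n_samples=1000, decay=1e-3)
```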
4. Robustness, Complexity, and Benchmark Comparisons
Empirical and theoretical results consistently demonstrate that adaptive gradient descent without descent is robust to initialization, insensitive to hyperparameter mis-specification, and competitive with or superior to tuned baselines. In convex and non-convex real-data tasks (e.g., logistic regression, matrix factorization), the method achieves O(1/k) convergence for convex objectives with only local smoothness assumptions, and can show linear convergence in locally strongly convex settings (Malitsky et al., 2019, Qiao, 2020).
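As a usage illustration, the `adgd` sketch from Section 1 can be applied to a regularized logistic-regression problem without selecting a learning rate; the synthetic data and regularization strength below are illustrative.

```python
import numpy as np

# Reuses the `adgd` function from the sketch in Section 1.
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 10))                                   # synthetic features
y = (X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(500) > 0).astype(float)

def logreg_grad(w, reg=1e-3):
    """Gradient of the L2-regularized logistic loss."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y) + reg * w

ws, _ = adgd(logreg_grad, x0=np.zeros(10), iters=300)                # no step size to tune
print("final gradient norm:", np.linalg.norm(logreg_grad(ws[-1])))
```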
In stochastic optimization tasks, the performance of these adaptive schemes approximates or outperforms standard Stochastic Gradient Descent (SGD) and even approaches that require oracle knowledge of global parameters, while avoiding manual tuning and hyperparameter schedules (Aujol et al., 18 Sep 2025).
5. Extensions: Non-Euclidean and Parameter-Free Settings
The adaptive local curvature paradigm extends beyond standard Euclidean space. On Riemannian manifolds with nonnegative curvature, the method adapts by computing local curvature from parallel-transported gradients between points, utilizing the Riemannian exponential map and its differential:

$$x_{k+1} = \exp_{x_k}\!\bigl(-\lambda_k\,\operatorname{grad} f(x_k)\bigr), \qquad L_k = \frac{\bigl\|\operatorname{grad} f(x_k) - \mathcal{P}_{x_{k-1}\to x_k}\,\operatorname{grad} f(x_{k-1})\bigr\|_{x_k}}{d(x_{k-1}, x_k)},$$

with $\lambda_k$ adaptively chosen via local geometric comparisons. This achieves automatic step-size selection even in spaces without a global linear structure and provides convergence guarantees matched to the manifold context (Ansari-Önnestam et al., 23 Apr 2025).
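A toy illustration on the unit sphere is sketched below, using the closed-form exponential map and parallel transport there; the Rayleigh-quotient-style objective and the constants mirror the Euclidean sketch and are not taken from the cited paper.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the geodesic from x in tangent direction v."""
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def sphere_transport(x, v, w):
    """Parallel transport of the tangent vector w from x to exp_x(v) along the geodesic."""
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return w
    u = v / nv
    return w + np.dot(u, w) * ((np.cos(nv) - 1.0) * u - np.sin(nv) * x)

def sphere_grad(x, A):
    """Riemannian gradient of f(x) = 0.5 x^T A x on the sphere (project out the normal component)."""
    g = A @ x
    return g - np.dot(g, x) * x

A = np.diag([1.0, 2.0, 10.0])
x_prev = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
g_prev = sphere_grad(x_prev, A)
lam_prev, theta_prev = 1e-6, 1e12
x = sphere_exp(x_prev, -lam_prev * g_prev)
for _ in range(200):
    g = sphere_grad(x, A)
    # Geodesic distance and the transported gradient difference play the roles of
    # ||x_k - x_{k-1}|| and ||grad f(x_k) - grad f(x_{k-1})|| in the Euclidean rule.
    dist = np.arccos(np.clip(np.dot(x, x_prev), -1.0, 1.0))
    g_moved = sphere_transport(x_prev, -lam_prev * g_prev, g_prev)
    diff_g = np.linalg.norm(g - g_moved)
    curv_bound = 0.5 * dist / diff_g if diff_g > 0 else np.inf
    lam = min(np.sqrt(1.0 + theta_prev) * lam_prev, curv_bound)
    x_prev, g_prev = x, g
    x = sphere_exp(x_prev, -lam * g_prev)
    theta_prev, lam_prev = lam / lam_prev, lam
print("final point (should approach +/- the first coordinate axis):", x)
```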
In high-dimensional convex scenarios, parameter-free AdaGrad-inspired schemes remove reliance on the domain diameter or smoothness by using adaptive scaling and automatic doubling strategies, yielding comparable regret and convergence rates to optimally tuned methods (Chzhen et al., 2023, Khaled et al., 2023).
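For intuition, the following generic sketch combines AdaGrad-norm scaling with a doubling estimate of the distance to the solution; it illustrates the flavor of such parameter-free strategies and is not the specific algorithm of either cited work.

```python
import numpy as np

def adagrad_norm_doubling(grad, x0, iters=1000, d0=1e-3):
    """Generic sketch: AdaGrad-norm step sizes d / sqrt(sum of squared gradient norms),
    with d a doubling estimate of the distance from x0 to the solution."""
    x, d = x0.copy(), d0
    g_sq_sum = 0.0
    for _ in range(iters):
        g = grad(x)
        g_sq_sum += float(np.dot(g, g))
        step = d / np.sqrt(g_sq_sum) if g_sq_sum > 0 else 0.0
        x = x - step * g
        if np.linalg.norm(x - x0) > d:    # the distance guess was too small: double it
            d *= 2.0
    return x

# Toy usage on an ill-conditioned quadratic (illustrative only).
A = np.diag([1.0, 10.0, 100.0])
x_hat = adagrad_norm_doubling(lambda x: A @ x, x0=np.ones(3))
print("final iterate:", x_hat)
```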
6. Implementation and Practical Trade-Offs
Unlike traditional gradient descent frameworks, these adaptive schemes require only local iterate and gradient storage across iterations, no function value comparisons, and a fixed number of gradient evaluations per iteration, except in stochastic settings, where one extra stochastic gradient evaluation per iteration is needed because the current mini-batch is evaluated at both the current and previous iterates.
Their automatic step-size adaptation makes them suitable for automated machine learning, large-scale empirical risk minimization, and stochastic approximation problems where global parameter-tuning is costly or impractical.
However, the lack of monotonic function decrease at each step can complicate progress monitoring; practitioners must track energy functions or ergodic means to ensure convergence. In high-noise stochastic regimes, proper decay or control variants may be needed to regularize the step-size and guarantee robustness.
7. Significance and Future Directions
Adaptive gradient descent without descent offers a theoretically grounded pathway toward fully automated, plug-and-play first-order optimization in a broad range of convex, nonconvex, and stochastic settings. Its reliance on local curvature information, avoidance of global hyperparameter selection, and ability to operate with first-order oracles make it especially attractive in data-driven and automated learning pipelines.
Ongoing research aims to extend these methodologies to even broader settings, including distributed and federated learning, non-Euclidean geometries, and problems with additional structural constraints. Current challenges include adapting to nonstationarity in streaming data, further reducing variance amplification in high-noise settings, and integrating these principles into large-scale deep learning frameworks.
Key references: (Malitsky et al., 2019, Malitsky et al., 2023, Aujol et al., 18 Sep 2025, Ansari-Önnestam et al., 23 Apr 2025, Chzhen et al., 2023, Khaled et al., 2023).