Adaptive Armijo Line-Search (AdaSLS)

Updated 26 November 2025
  • Adaptive Armijo Line-Search (AdaSLS) is a family of step-size selection methods that adaptively adjust based on optimization progress, ensuring convergence across varied regimes.
  • It generalizes classical backtracking by using an adaptive multiplier for step-size shrinkage, significantly reducing function evaluations and computational cost.
  • AdaSLS guarantees convergence in convex, nonconvex, and stochastic settings, with demonstrated practical advantages in large-scale and deep learning optimization.

Adaptive Armijo Line-Search (AdaSLS) is a family of step-size selection algorithms generalizing the classical Armijo backtracking line-search by making the search procedure explicitly adaptive to observed optimization progress. AdaSLS is notable for its rigorously guaranteed convergence in convex, nonconvex, and stochastic regimes, broad applicability across deterministic and stochastic optimization, practical superiority over tuned fixed-step methods, and foundational innovations in both theoretical and applied optimization.

1. Core Principle: Armijo Condition and AdaSLS Mechanism

The Armijo condition is the foundational step-size rule in line-search methods:

$$F(x_k + \alpha_k d_k) \leq F(x_k) + c\,\alpha_k\,\langle \nabla F(x_k), d_k \rangle,$$

where $F: \mathbb{R}^n \to \mathbb{R}$ is differentiable, $x_k$ is the current iterate, $d_k$ is a descent direction ($\langle \nabla F(x_k), d_k \rangle < 0$), $\alpha_k > 0$ is the step size, and $c \in (0,1)$ is a scaling parameter. This condition enforces sufficient descent: the actual decrease is at least a $c$-fraction of the linearized prediction.

Classical backtracking line-search initializes with a trial step-size $\alpha_0$ and repeatedly contracts $\alpha \leftarrow \rho\,\alpha$ with $\rho \in (0,1)$ until the Armijo condition is met. AdaSLS generalizes this by replacing the fixed shrinkage factor $\rho$ with an adaptive multiplier that depends on the degree of Armijo violation. The violation metric is

$$v(\alpha) \coloneqq \frac{F(x_k+\alpha d_k) - F(x_k)}{c\,\alpha\,\langle \nabla F(x_k), d_k \rangle}$$

with the adaptive shrink

$$\hat{\rho}(v) = \max\bigl\{\epsilon,\, \rho \cdot \frac{1-c}{1-c v} \bigr\}, \quad \epsilon > 0.$$

This causes more aggressive shrinkage when the violation $v(\alpha)$ is large and avoids spurious reductions when the condition is nearly met, yielding larger average step-sizes and fewer backtracking trials (Cavalcanti et al., 23 Aug 2024).
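A minimal sketch of this adaptive backtracking loop in Python, assuming the violation-based multiplier above; the function names, default constants, and the quadratic usage example are illustrative rather than a reference implementation:

```python
import numpy as np

def adaptive_armijo_step(F, grad_F, x, d, alpha0=1.0, c=1e-4, rho=0.5,
                         eps=1e-3, max_backtracks=50):
    """AdaSLS-style backtracking sketch: shrink the trial step by an adaptive
    factor that depends on how badly the Armijo condition is violated."""
    fx = F(x)
    slope = float(np.dot(grad_F(x), d))   # negative for a descent direction
    assert slope < 0.0, "d must be a descent direction"

    alpha = alpha0
    for _ in range(max_backtracks):
        f_trial = F(x + alpha * d)
        if f_trial <= fx + c * alpha * slope:
            return alpha                   # Armijo sufficient decrease holds
        # Violation metric v(alpha); the condition holds iff v >= 1,
        # so at this point v < 1 and 1 - c*v > 1 - c > 0.
        v = (f_trial - fx) / (c * alpha * slope)
        # Adaptive shrink: close to rho near satisfaction, smaller for severe violations.
        alpha *= max(eps, rho * (1.0 - c) / (1.0 - c * v))
    return alpha

# Usage on a toy quadratic with the steepest-descent direction.
F = lambda x: 0.5 * float(np.dot(x, x))
grad_F = lambda x: x
x0 = np.array([3.0, -4.0])
step = adaptive_armijo_step(F, grad_F, x0, -grad_F(x0))
```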

2. Algorithmic Variants and Bracketing Accelerations

While most line-search routines employ a geometric grid, bracketing-based AdaSLS can reduce function evaluations by orders of magnitude. Given the Armijo residual function

$$g(\alpha) = f(x + \alpha d) - \bigl[\,f(x) + c\,\alpha\,\nabla f(x)^T d\,\bigr],$$

the objective is to find $\hat{\alpha} \in (\beta \alpha^*, \alpha^*]$, where $\alpha^*$ is the largest step size satisfying the Armijo residual condition $g(\alpha) \le 0$ and $\beta \in (0,1)$ is a fixed relative tolerance.

Bracketing methods (geometric bisection, superlinear ITP/Ridders) maintain an interval $[a,b]$ with $a \leq \alpha^* < b$, repeatedly bisecting until $a > \beta b$. Complexity is reduced from $O(\log(\alpha_0/\epsilon))$ to $O(\log\log(\alpha_0/\epsilon))$ for geometric bisection and to $O(\log\log\log(\alpha_0/\epsilon))$ for superlinear rules (Oliveira et al., 2021), which translates into a 50–80% reduction in function evaluations compared to classical geometric backtracking.

| Method | Complexity (func. evals) | Practical savings |
|---|---|---|
| Classical backtracking | $O(\log(\alpha_0/\epsilon))$ | Baseline |
| Geometric bisection | $O(\log\log(\alpha_0/\epsilon))$ | 50%+ |
| Superlinear (ITP) | $O(\log\log\log(\alpha_0/\epsilon))$ | up to 80% |
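A sketch of the geometric-bisection variant in Python, assuming the Armijo set $\{\alpha : g(\alpha) \le 0\}$ is an interval $[0, \alpha^*]$ and that a tiny lower step always qualifies; the function name, bracket initialization, and guard limits are illustrative rather than the procedure of Oliveira et al. (2021):

```python
import math

def bracketing_armijo(g, alpha0=1.0, alpha_min=1e-10, beta=0.5, max_iters=100):
    """Geometric bisection over the Armijo residual g(alpha).

    Assumes {alpha : g(alpha) <= 0} is an interval [0, alpha_star] and that
    alpha_min is small enough to satisfy g(alpha_min) <= 0.  Returns a step in
    (beta * alpha_star, alpha_star], or alpha0 if it already qualifies."""
    if g(alpha0) <= 0.0:
        return alpha0                  # initial trial already satisfies Armijo

    a, b = alpha_min, alpha0           # bracket with g(a) <= 0 < g(b)
    for _ in range(max_iters):
        if a > beta * b:               # a now lies in (beta*alpha_star, alpha_star]
            break
        m = math.sqrt(a * b)           # geometric midpoint: bisection in log space
        if g(m) <= 0.0:
            a = m                      # midpoint satisfies Armijo: raise lower endpoint
        else:
            b = m                      # midpoint violates Armijo: lower upper endpoint
    return a
```

Each iteration halves $\log(b/a)$, so the loop terminates after $O(\log\log(\alpha_0/\alpha_{\min}))$ evaluations of $g$, matching the geometric-bisection row of the table (with $\epsilon = \alpha_{\min}$).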

3. Theoretical Guarantees in Convex and Nonconvex Regimes

Convex Smooth Functions

In the convex, $L$-smooth case, if $d_k$ is “gradient-related,” there exists a uniform threshold $\bar{\alpha}$ such that all $\alpha \in (0, \bar{\alpha})$ satisfy Armijo. AdaSLS delivers the same per-iteration complexity as classical backtracking (GD: $O(1/k)$, AGD: $O(1/k^2)$), with no additional adjustment steps (Cavalcanti et al., 23 Aug 2024, Vaswani et al., 28 Feb 2025, Jiang et al., 2023).

AdaSLS also exhibits linear (geometric) convergence in strongly convex and interpolation settings (e.g., separable logistic regression), regimes where standard fixed-step GD attains only sublinear rates (Vaswani et al., 28 Feb 2025, Vaswani et al., 2019).

Nonconvex and Gradient-Dominated Functions

For $C^1$ functions with $L$-Lipschitz gradient and gradient-related $d_k$, all $\alpha < 1/L$ are guaranteed to satisfy the Armijo rule. Correspondingly, AdaSLS ensures the returned step-size satisfies $\alpha_k \geq \min\{\alpha_0, \rho/L\}$, and gradient-descent-type methods achieve the standard nonconvex complexity $\min_{0 \leq i < k}\|\nabla F(x_i)\| = O(1/\sqrt{k})$ (Cavalcanti et al., 23 Aug 2024, Wu, 25 Nov 2025, Jiang et al., 2023).
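A short derivation of why sufficiently small steps always pass the Armijo test, sketched here for the steepest-descent direction $d_k = -\nabla F(x_k)$ and under the additional assumption $c \le 1/2$:

```latex
% Descent lemma for L-smooth F, applied with d_k = -\nabla F(x_k):
\begin{align*}
F(x_k + \alpha d_k)
  \;\le\; F(x_k) - \alpha\,\|\nabla F(x_k)\|^2
          + \tfrac{L\alpha^2}{2}\,\|\nabla F(x_k)\|^2 .
\end{align*}
% The Armijo condition requires
% F(x_k + \alpha d_k) \le F(x_k) - c\,\alpha\,\|\nabla F(x_k)\|^2,
% which the bound above implies whenever
\begin{align*}
1 - \tfrac{L\alpha}{2} \;\ge\; c
  \quad\Longleftrightarrow\quad
  \alpha \;\le\; \tfrac{2(1-c)}{L}.
\end{align*}
% For c \le 1/2 this range contains every \alpha \le 1/L, so backtracking with
% shrink factor \rho halts at a step no smaller than \min\{\alpha_0,\ \rho/L\}.
```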

Under stronger properties such as the Polyak-Łojasiewicz (PL) condition or gradient domination, AdaSLS recovers linear rates in the nonconvex regime, and in strict-saddle problems, finds second-order stationary points with optimal complexity up to logarithmic terms (O'Neill et al., 2020).

4. Stochastic, Adaptive, and Composite Extensions

AdaSLS generalizes to stochastic regimes, handling mini-batch and private gradients:

  • Stochastic Armijo line-search integrates noisy oracles, testing the condition on mini-batch subsamples or privatized gradients and adaptively adjusting sampling accuracy to ensure high-probability descent (Paquette et al., 2018, Dvinskikh et al., 2019, Chen et al., 2020); a minimal mini-batch sketch follows this list.
  • Rényi Differential Privacy (RDP) AdaSLS incorporates sparse vector tricks and subsampling amplification to control cumulative privacy loss, dynamically reallocating the privacy budget based on the observed reliability of the noisy oracles (Chen et al., 2020).
  • Composite and Nonsmooth Problems: AdaSLS frameworks extend via generalized regularizations (e.g., $\varphi$-regularization) and nonmonotone line-search windows, yielding global convergence for nonsmooth convex objectives, Q-superlinear convergence under BD-regularity, and no requirement of semismoothness (Zhang, 2013).
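A minimal sketch of the mini-batch variant referenced in the first bullet, assuming (as in interpolation-style stochastic line searches) that the Armijo test is evaluated on the same mini-batch used for the gradient; the oracle names and defaults are illustrative:

```python
import numpy as np

def stochastic_armijo_step(batch_loss, batch_grad, w, alpha0=1.0,
                           c=0.1, rho=0.5, alpha_min=1e-6):
    """Backtracking on a noisy oracle: loss, gradient, and the Armijo test all
    use the same sampled mini-batch, so descent holds with high probability
    rather than deterministically."""
    loss = batch_loss(w)
    grad = batch_grad(w)
    g_sq = float(np.dot(grad, grad))

    alpha = alpha0
    while alpha > alpha_min:
        if batch_loss(w - alpha * grad) <= loss - c * alpha * g_sq:
            break                      # mini-batch Armijo condition satisfied
        alpha *= rho                   # backtrack against the same mini-batch
    return w - alpha * grad, alpha
```

In a privatized setting, `batch_loss` and `batch_grad` would be replaced by noisy, privacy-preserving oracles and the acceptance test adjusted for the injected noise.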

5. Practical Implementations, Momentum, and Large-Scale Neural Optimization

AdaSLS has been incorporated into several high-performance optimizers for deep learning and large-scale problems, notably:

  • Momentum and Preconditioning: Recent works integrate the search direction of Adam-type methods directly into the Armijo test, using momentum vectors for reliable descent and finite-difference directionality to avoid vanishing step-sizes. This overcomes severe failures of vanilla line-search on large models and enables hyperparameter-free regimes (Kenneweg et al., 27 Mar 2024, Kenneweg et al., 30 Jul 2024).
  • Exponential Moving Average-Based Armijo: Algorithms such as SALSA stabilize the Armijo criterion against mini-batch noise by exponential moving averages over both loss and squared gradient terms, further reducing the frequency of expensive line-searches while maintaining theoretical and empirical guarantees (Kenneweg et al., 30 Jul 2024); a rough sketch of this EMA-smoothed criterion follows below.
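The following illustrates the idea of smoothing the Armijo test with exponential moving averages; it is a hedged sketch, not the SALSA reference implementation, and the use of the squared gradient norm and the specific update form are assumptions:

```python
class EMAArmijoCheck:
    """Smooth the quantities entering the Armijo test with exponential moving
    averages so that a single noisy mini-batch does not trigger unnecessary
    backtracking.  Illustrative sketch only."""

    def __init__(self, beta=0.9, c=0.1):
        self.beta = beta              # EMA decay factor
        self.c = c                    # Armijo constant
        self.loss_ema = None          # running estimate of the mini-batch loss
        self.gsq_ema = None           # running estimate of ||grad||^2

    def update(self, loss, grad_sq_norm):
        """Fold the current mini-batch statistics into the running averages."""
        if self.loss_ema is None:
            self.loss_ema, self.gsq_ema = loss, grad_sq_norm
        else:
            b = self.beta
            self.loss_ema = b * self.loss_ema + (1 - b) * loss
            self.gsq_ema = b * self.gsq_ema + (1 - b) * grad_sq_norm

    def accepts(self, new_loss, alpha):
        """Armijo-style test against the smoothed loss and gradient statistics."""
        return new_loss <= self.loss_ema - self.c * alpha * self.gsq_ema
```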

Empirical benchmarks show state-of-the-art performance across vision (CNNs, ResNets) and NLP (Transformers, BERT/GPT-2) tasks, with training time and the number of function and gradient evaluations reduced by 30–80% relative to classical backtracking or fixed learning-rate schedules. These methods require minimal or no hand-tuning, and their performance is robust to the choice of the remaining parameters (Kenneweg et al., 27 Mar 2024, Kenneweg et al., 30 Jul 2024).

6. Connections to Variance Reduction, Generalization, and Advanced Sample Complexity

AdaSLS, when combined with variance reduction (e.g., loopless proxy functions for stochastic finite-sum optimization), achieves optimal sample complexity bounds matching the best known for SVRG and SARAH, while eliminating inner–outer loop nesting and any explicit dependency on the desired accuracy $\epsilon$ (Jiang et al., 2023). The transition from $O(1/\epsilon^2)$ for generic SGD/SPS to $O(n+\widetilde{O}(1/\epsilon))$ for AdaSLS-based VR variants demonstrates the adaptivity and scalability of the approach.

7. Impact, Extensions, and Outlook

AdaSLS has established itself as a canonical improvement over classical backtracking line-search across deterministic, stochastic, large-scale, private, and nonconvex optimization. Its blend of theoretical guarantees, algorithmic simplicity, and empirical advantage positions it as a foundational component in modern optimization toolkits.

Ongoing research explores:

  • Extension to progressively structured nonsmooth and conditionally smooth problems, notably constrained and compositional settings.
  • Enhanced compatibility with adaptive preconditioning, second-order methods (e.g., Hessian-AdaSLS), and decentralized/federated architectures.
  • Rigorous characterization of generalization benefits in overparameterized and interpolating regimes, particularly in deep learning.

Key references: (Cavalcanti et al., 23 Aug 2024, Oliveira et al., 2021, Vaswani et al., 28 Feb 2025, Wu, 25 Nov 2025, Jiang et al., 2023, Kenneweg et al., 27 Mar 2024, Kenneweg et al., 30 Jul 2024, Dvinskikh et al., 2019, Paquette et al., 2018, Zhang, 2013, Chen et al., 2020, O'Neill et al., 2020, Vaswani et al., 2019).
