
AdaHesScale: Hessian-Aware Adaptive Gradient Descent

Updated 8 February 2026
  • AdaHesScale is a Hessian-aware, adaptively scaled gradient descent method that uses local curvature information to modulate steps for unconstrained optimization.
  • It computes a scalar from inner products between gradients and Hessian-vector products, ensuring local unit steps and global convergence even in nonconvex settings.
  • Empirical results on tasks such as logistic regression and deep neural networks demonstrate robust performance at a per-iteration cost comparable to first-order methods.

AdaHesScale is a Hessian-aware, adaptively scaled variant of gradient descent designed for large-scale unconstrained optimization where the objective is twice differentiable and potentially nonconvex. It modifies the standard gradient descent scheme by introducing a scalar scaling for the gradient direction derived from local curvature information, maintaining low per-iteration cost and enhancing robustness to the choice of step size. AdaHesScale preserves the simplicity of first-order methods while integrating second-order information to provide a local unit step size guarantee and global convergence under notably weaker smoothness requirements than traditional gradient descent approaches (Smee et al., 6 Feb 2025).

1. Problem Formulation and Notation

AdaHesScale addresses optimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x),$$

where $f$ is twice continuously differentiable and bounded below, $g(x) = \nabla f(x) \in \mathbb{R}^d$ is the gradient, and $H(x) = \nabla^2 f(x) \in \mathbb{R}^{d \times d}$ is the Hessian. At the $k$th iteration, $x_k$ has associated gradient $g_k$ and Hessian $H_k$.

2. Hessian-Aware Scaling Mechanics

Instead of adapting the search direction, AdaHesScale introduces a positive scalar scaling $s_k$ that modulates the gradient:

$$p_k = -s_k\, g_k, \qquad D_k = s_k\, I.$$

The search direction $p_k$ is required to satisfy the second-order descent condition

$$\langle g_k, p_k \rangle + \langle p_k, H_k\, p_k \rangle \le 0,$$

mirroring descent properties of Newton-type schemes. Three canonical scalings are defined from one-dimensional curvature information:

$$s^{\rm CG}_k = \frac{\|g_k\|^2}{\langle g_k, H_k\, g_k \rangle}, \qquad s^{\rm MR}_k = \frac{\langle g_k, H_k\, g_k \rangle}{\|H_k\, g_k\|^2}, \qquad s^{\rm GM}_k = \sqrt{s^{\rm CG}_k\, s^{\rm MR}_k}.$$

In situations with negative curvature ($\langle g_k, H_k\, g_k \rangle < 0$), the method permits larger scalings; for small positive curvature, the cap $s_k \le 1/\sigma$ is enforced, where $\sigma > 0$ is a small tolerance parameter.
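The three scalings above can be sketched numerically. The quadratic $f(x) = \tfrac12 x^\top A x$ below is an illustrative choice (not from the paper), for which the Hessian-vector product $H g$ is simply $A g$:

```python
import numpy as np

# Illustrative SPD Hessian with spread-out curvature, and a test point.
A = np.diag([1.0, 4.0, 9.0])
x = np.array([1.0, 1.0, 1.0])

g = A @ x                        # gradient of the quadratic f(x) = 0.5 x^T A x
v = A @ g                        # Hessian-vector product H g
gamma = g @ v                    # curvature along the gradient direction

s_cg = (g @ g) / gamma           # CG (Cauchy) scaling
s_mr = gamma / (v @ v)           # MR (minimal-residual) scaling
s_gm = np.sqrt(s_cg * s_mr)      # GM (geometric-mean) scaling

# For positive curvature the scalings are ordered s_MR <= s_GM <= s_CG.
assert s_mr <= s_gm <= s_cg
```

The ordering checked at the end follows from the Kantorovich inequality; it shows GM always interpolates between the two extreme scalings.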

3. Update Rule, Line Search, and Algorithmic Structure

The update is performed as

$$x_{k+1} = x_k - \alpha_k\, s_k\, g_k = x_k + \alpha_k\, p_k,$$

with the step size $\alpha_k > 0$ chosen to ensure sufficient decrease through Armijo-type backtracking or forward tracking:

$$f(x_k + \alpha_k\, p_k) \le f(x_k) + \rho\, \alpha_k\, \langle g_k, p_k \rangle, \qquad \rho \in (0, \tfrac12).$$

Algorithmically, each iteration involves:

  • Gradient evaluation.
  • Hessian-vector product via reverse-mode autodifferentiation.
  • Scalar operations and possible function evaluations during line search.
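The Armijo backtracking step can be sketched as follows. The function, point, and direction at the bottom are placeholder choices for illustration; any smooth objective and descent direction with $\langle g, p \rangle < 0$ would do:

```python
import numpy as np

def armijo_backtrack(f, x, g, p, rho=0.25, alpha0=1.0, shrink=0.5, max_iter=50):
    """Return alpha with f(x + alpha p) <= f(x) + rho * alpha * <g, p>."""
    fx = f(x)
    slope = g @ p                  # directional derivative; must be negative
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + rho * alpha * slope:
            return alpha
        alpha *= shrink            # backtrack
    return alpha

# Example on a simple quadratic: with a well-scaled descent direction,
# the unit step alpha = 1 is accepted immediately.
f = lambda x: 0.5 * x @ x
x = np.array([3.0, -4.0])
g = x.copy()                       # gradient of 0.5 ||x||^2
alpha = armijo_backtrack(f, x, g, -g)
```

Starting from $\alpha_0 = 1$ matters for AdaHesScale: the local theory in Section 4 says the unit step is eventually always accepted, so backtracking becomes a no-op near a minimizer.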

Summary Table: Core Steps and Computations

| Step | Operation | Computational Element |
|------|-----------|-----------------------|
| Compute | $g_k$, $v_k = H_k g_k$ | Gradient, Hessian-vector product |
| Test curvature | $\gamma = \langle g_k, v_k \rangle$ | Inner product |
| Select $s_k$ | Curvature-guided, fixed or adaptive | Scalar selection (CG/MR/GM) |
| Line search | Satisfy Armijo rule | Function evaluation(s) |

4. Local and Global Convergence Properties

The method attains a local unit-step guarantee near a local minimizer $x^\star$ where the second-order sufficient conditions are met:

$$g(x^\star) = 0, \qquad \mu I \preceq H(x) \preceq M I \quad \forall x \in B_r(x^\star),$$

with $\mu > 0$ and $M < \infty$. Once $x_k$ is close to $x^\star$, the method accepts $\alpha_k = 1$ at every step and achieves Q-linear convergence:

$$f(x_{k+1}) - f(x^\star) \le (1 - \tau)\left(f(x_k) - f(x^\star)\right), \qquad \tau \in (0, 1).$$
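The Q-linear contraction can be observed numerically. The sketch below runs unit steps with the CG scaling on a strongly convex quadratic with $\mu = 1$, $M = 10$ (an illustrative setup, not the paper's experiments); for this case the classical steepest-descent bound gives $1 - \tau \le \left(\frac{M-\mu}{M+\mu}\right)^2 \approx 0.669$:

```python
import numpy as np

A = np.diag([1.0, 10.0])           # mu = 1, M = 10
x = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ (A @ x)    # minimizer x* = 0, f(x*) = 0

vals = [f(x)]
for _ in range(20):
    g = A @ x
    s = (g @ g) / (g @ (A @ g))    # CG scaling
    x = x - s * g                  # unit step: x_{k+1} = x_k - s_k g_k
    vals.append(f(x))

# Every iteration contracts f(x_k) - f(x*) by at least the factor
# ((M - mu)/(M + mu))^2 = (9/11)^2 < 0.67.
assert all(vals[i + 1] <= 0.67 * vals[i] + 1e-15 for i in range(20))
```

In practice the observed per-step ratio is often far smaller than the worst-case bound, as this run shows.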

Global convergence is demonstrated under weakened smoothness compared to classical gradient descent. The following directional conditions are imposed:

  • Hessian directional smoothness: for some $L_2 \ge 0$,

$$\|H(x - t\, g(x)) - H(x)\| \le t\, L_2\, \|g(x)\| \qquad \forall x,\ t \ge 0.$$

  • Hessian-gradient directional smoothness: for some $L_1 \ge 0$, if $\langle g(x), H(x)\, g(x) \rangle > 0$, then

$$\|H(x)\, g(x)\| \le L_1\, \|g(x)\|.$$

With these, Algorithm 2 requires at most $K = \mathcal{O}(\varepsilon^{-2})$ iterations to achieve $\|g_k\| \le \varepsilon$.

5. Handling Inexact Hessian Information

Recognizing that exact Hessian computation may not be feasible, AdaHesScale allows a Hessian approximation $\widetilde H(x)$ satisfying the error bound

$$\bigl|\langle g(x),\, (H(x) - \widetilde H(x))\, g(x) \rangle\bigr| \le \Delta_H\, \|g(x)\|^2.$$

Mild inexactness along the gradient direction compromises neither the local unit-step acceptance nor the global $\mathcal{O}(\varepsilon^{-2})$ convergence rate, provided an analogous smoothness assumption holds for $\widetilde H(x)\, g(x)$.
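As one concrete source of such an approximation, a finite-difference Hessian-vector product can stand in for the exact one; the sketch below checks the error bound along the gradient direction for an arbitrary smooth test function (the function and step size $h$ are illustrative choices, not from the paper):

```python
import numpy as np

def grad(x):
    # gradient of f(x) = 0.25 * sum(x^4) + 0.5 * ||x||^2
    return x**3 + x

def hvp_exact(x, v):
    # the Hessian of this separable f is diagonal: diag(3 x^2 + 1)
    return (3 * x**2 + 1) * v

def hvp_fd(x, v, h=1e-5):
    # central finite difference of the gradient along v: ~O(h^2) error
    return (grad(x + h * v) - grad(x - h * v)) / (2 * h)

x = np.array([0.5, -1.0, 2.0])
g = grad(x)

# |<g, (H - H_tilde) g>| <= Delta_H * ||g||^2 with a tiny Delta_H
err = abs(g @ (hvp_exact(x, g) - hvp_fd(x, g)))
assert err <= 1e-6 * (g @ g)
```

Because the bound only constrains the error *along the gradient*, even fairly crude approximations (diagonal estimates, subsampled Hessians) can qualify, which is the practical point of this section.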

6. Algorithmic Pseudocode and Computational Overhead

The AdaHesScale method consists of the following core pseudocode:

Require initial x0, tolerance εg, ρ∈(0,½), σ>0
For k=0,1,2,...
    Evaluate gk=∇f(xk)
    If ∥gk∥≤εg: stop
    Compute vk=Hk gk (Hessian-vector product)
    γ←⟨gk, vk⟩
    If γ > σ∥gk∥^2 (SPC): sk←∥gk∥^2/γ (CG scaling or MR/GM variant)
    Else if γ≥0 (LPC): sk←1/σ
    Else (NC): sk←1/σ or larger
    pk=−sk gk
    Choose αk by Armijo line search
    xk+1←xk+αk pk

Each iteration involves a gradient and a Hessian-vector product, a few inner products and scalar operations, and a small number of function evaluations attributable to the line search.
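The pseudocode above can be turned into a minimal runnable sketch. This version uses the CG scaling and takes the objective, gradient, and Hessian-vector product as callables; the quadratic test problem at the bottom is an illustrative choice, not the paper's benchmark:

```python
import numpy as np

def adahesscale(f, grad, hvp, x0, eps_g=1e-8, rho=0.25, sigma=1e-8,
                max_iter=200):
    """Minimal sketch of the AdaHesScale loop with CG scaling."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        v = hvp(x, g)                  # Hessian-vector product H_k g_k
        gamma = g @ v                  # curvature along the gradient
        if gamma > sigma * (g @ g):    # SPC: sufficient positive curvature
            s = (g @ g) / gamma        # CG scaling
        else:                          # LPC/NC: low or negative curvature
            s = 1.0 / sigma
        p = -s * g
        # Armijo backtracking starting from the unit step
        alpha, fx, slope = 1.0, f(x), g @ p
        while f(x + alpha * p) > fx + rho * alpha * slope:
            alpha *= 0.5
        x = x + alpha * p
    return x

# Quadratic test problem: the unique minimizer is the origin.
A = np.diag([1.0, 4.0, 9.0])
x_star = adahesscale(lambda x: 0.5 * x @ (A @ x),
                     lambda x: A @ x,
                     lambda x, v: A @ v,
                     np.array([1.0, 1.0, 1.0]))
```

On this convex quadratic the SPC branch always fires and every unit step passes the Armijo test with $\rho = 0.25 < \tfrac12$, so the inner while-loop never backtracks; in an autodiff framework the `hvp` callable would be supplied by reverse-over-reverse differentiation rather than an explicit matrix.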

7. Empirical Performance and Practical Findings

AdaHesScale was empirically validated on:

  • Convex $\ell_2$-regularized multiclass logistic regression (CIFAR-10)
  • Nonconvex two-layer MLP (FashionMNIST)
  • Deep ResNet-18 (Imagenette)

Baselines included fixed-step and line-search gradient descent, Heavy-Ball, Nesterov acceleration, and Adam, assessed by oracle call counts. The alternating CG/MR scaling variant ("MRCG") produced the best monotonic decrease in $f$, with unit steps accepted almost everywhere. AdaHesScale matched or outperformed well-tuned first-order methods without the need for hand-tuned step sizes. The MR variant particularly favored monotonic reduction in $\|g_k\|$ under unit step sizes, showing a strong built-in bias toward gradient-norm reduction.

These results indicate that AdaHesScale delivers the robustness and low per-iteration cost of plain gradient descent, combined with curvature adaptation that enables locally aggressive steps and globally reliable convergence, even under substantially weakened smoothness assumptions (Smee et al., 6 Feb 2025).
