
AdaHesScale: Hessian-Aware Adaptive Gradient Descent

Updated 8 February 2026
  • AdaHesScale is a Hessian-aware, adaptively scaled gradient descent method that uses local curvature information to modulate steps for unconstrained optimization.
  • It computes a scalar from inner products between gradients and Hessian-vector products, ensuring local unit steps and global convergence even in nonconvex settings.
  • Empirical results on tasks such as logistic regression and deep neural networks demonstrate robust performance at a per-iteration cost comparable to first-order methods.

AdaHesScale is a Hessian-aware, adaptively scaled variant of gradient descent designed for large-scale unconstrained optimization where the objective is twice differentiable and potentially nonconvex. It modifies the standard gradient descent scheme by introducing a scalar scaling for the gradient direction derived from local curvature information, maintaining low per-iteration cost and enhancing robustness to the choice of step size. AdaHesScale preserves the simplicity of first-order methods while integrating second-order information to provide a local unit step size guarantee and global convergence under notably weaker smoothness requirements than traditional gradient descent approaches (Smee et al., 6 Feb 2025).

1. Problem Formulation and Notation

AdaHesScale addresses optimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x),$$

where $f$ is twice continuously differentiable and bounded below, $g(x) = \nabla f(x) \in \mathbb{R}^d$ is the gradient, and $H(x) = \nabla^2 f(x) \in \mathbb{R}^{d \times d}$ is the Hessian. At the $k$th iteration, $x_k$ has associated gradient $g_k$ and Hessian $H_k$.

2. Hessian-Aware Scaling Mechanics

Instead of adapting the search direction, AdaHesScale introduces a positive scalar scaling $s_k$ that modulates the gradient:

$$p_k = -s_k\, g_k, \qquad D_k = s_k\, I.$$

The search direction $p_k$ is required to satisfy the second-order descent condition

$$\langle g_k, p_k \rangle + \langle p_k, H_k\, p_k \rangle \le 0,$$

mirroring descent properties of Newton-type schemes. Three canonical scalings are defined from one-dimensional curvature information:

$$s^{\rm CG}_k = \frac{\|g_k\|^2}{\langle g_k, H_k\, g_k \rangle}, \qquad s^{\rm MR}_k = \frac{\langle g_k, H_k\, g_k \rangle}{\|H_k\, g_k\|^2}, \qquad s^{\rm GM}_k = \sqrt{s^{\rm CG}_k\, s^{\rm MR}_k}.$$

In situations with negative curvature ($\langle g_k, H_k\, g_k \rangle < 0$), the method permits larger scalings; for small positive curvature, the cap $s_k \le 1/\sigma$ is enforced, where $\sigma > 0$ is a small tolerance parameter.
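The three scalings above can be sketched numerically. The quadratic $f(x) = \tfrac12 x^\top A x$ below is an illustrative choice (not from the paper), for which the Hessian-vector product $H g$ is simply $A g$:

```python
import numpy as np

# Illustrative SPD Hessian with spread-out curvature, and a test point.
A = np.diag([1.0, 4.0, 9.0])
x = np.array([1.0, 1.0, 1.0])

g = A @ x                        # gradient of the quadratic f(x) = 0.5 x^T A x
v = A @ g                        # Hessian-vector product H g
gamma = g @ v                    # curvature along the gradient direction

s_cg = (g @ g) / gamma           # CG (Cauchy) scaling
s_mr = gamma / (v @ v)           # MR (minimal-residual) scaling
s_gm = np.sqrt(s_cg * s_mr)      # GM (geometric-mean) scaling

# For positive curvature the scalings are ordered s_MR <= s_GM <= s_CG.
assert s_mr <= s_gm <= s_cg
```

The ordering checked at the end follows from the Kantorovich inequality; it shows GM always interpolates between the two extreme scalings.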

3. Update Rule, Line Search, and Algorithmic Structure

The update is performed as

$$x_{k+1} = x_k - \alpha_k\, s_k\, g_k = x_k + \alpha_k\, p_k,$$

with the step size $\alpha_k > 0$ chosen to ensure sufficient decrease through Armijo-type backtracking or forward tracking:

$$f(x_k + \alpha_k\, p_k) \le f(x_k) + \rho\, \alpha_k\, \langle g_k, p_k \rangle, \qquad \rho \in (0, \tfrac12).$$

Algorithmically, each iteration involves:

  • Gradient evaluation.
  • Hessian-vector product via reverse-mode autodifferentiation.
  • Scalar operations and possible function evaluations during line search.
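The Armijo backtracking step can be sketched as follows. The function, point, and direction at the bottom are placeholder choices for illustration; any smooth objective and descent direction with $\langle g, p \rangle < 0$ would do:

```python
import numpy as np

def armijo_backtrack(f, x, g, p, rho=0.25, alpha0=1.0, shrink=0.5, max_iter=50):
    """Return alpha with f(x + alpha p) <= f(x) + rho * alpha * <g, p>."""
    fx = f(x)
    slope = g @ p                  # directional derivative; must be negative
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + rho * alpha * slope:
            return alpha
        alpha *= shrink            # backtrack
    return alpha

# Example on a simple quadratic: with a well-scaled descent direction,
# the unit step alpha = 1 is accepted immediately.
f = lambda x: 0.5 * x @ x
x = np.array([3.0, -4.0])
g = x.copy()                       # gradient of 0.5 ||x||^2
alpha = armijo_backtrack(f, x, g, -g)
```

Starting from $\alpha_0 = 1$ matters for AdaHesScale: the local theory in Section 4 says the unit step is eventually always accepted, so backtracking becomes a no-op near a minimizer.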

Summary Table: Core Steps and Computations

| Step | Operation | Computational Element |
|------|-----------|-----------------------|
| Compute | $g_k$, $v_k = H_k g_k$ | Gradient, Hessian-vector product |
| Test curvature | $\gamma = \langle g_k, v_k \rangle$ | Inner product |
| Select $s_k$ | Curvature-guided, fixed or adaptive | Scalar selection (CG/MR/GM) |
| Line search | Satisfy Armijo rule | Function evaluation(s) |

4. Local and Global Convergence Properties

The method attains a local unit-step guarantee near a local minimizer $x^\star$ where the second-order sufficient conditions are met:

$$g(x^\star) = 0, \qquad \mu I \preceq H(x) \preceq M I \quad \forall x \in B_r(x^\star),$$

with $\mu > 0$ and $M < \infty$. Once $x_k$ is close to $x^\star$, the method accepts $\alpha_k = 1$ at every step and achieves Q-linear convergence:

$$f(x_{k+1}) - f(x^\star) \le (1 - \tau)\left(f(x_k) - f(x^\star)\right), \qquad \tau \in (0, 1).$$
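The Q-linear contraction can be observed numerically. The sketch below runs unit steps with the CG scaling on a strongly convex quadratic with $\mu = 1$, $M = 10$ (an illustrative setup, not the paper's experiments); for this case the classical steepest-descent bound gives $1 - \tau \le \left(\frac{M-\mu}{M+\mu}\right)^2 \approx 0.669$:

```python
import numpy as np

A = np.diag([1.0, 10.0])           # mu = 1, M = 10
x = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ (A @ x)    # minimizer x* = 0, f(x*) = 0

vals = [f(x)]
for _ in range(20):
    g = A @ x
    s = (g @ g) / (g @ (A @ g))    # CG scaling
    x = x - s * g                  # unit step: x_{k+1} = x_k - s_k g_k
    vals.append(f(x))

# Every iteration contracts f(x_k) - f(x*) by at least the factor
# ((M - mu)/(M + mu))^2 = (9/11)^2 < 0.67.
assert all(vals[i + 1] <= 0.67 * vals[i] + 1e-15 for i in range(20))
```

In practice the observed per-step ratio is often far smaller than the worst-case bound, as this run shows.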

Global convergence is demonstrated under weakened smoothness compared to classical gradient descent. The following directional conditions are imposed:

  • Hessian directional smoothness: for some $L_2 \ge 0$,

$$\|H(x - t\, g(x)) - H(x)\| \le t\, L_2\, \|g(x)\| \qquad \forall x,\ t \ge 0.$$

  • Hessian-gradient directional smoothness: for some $L_1 \ge 0$, if $\langle g(x), H(x)\, g(x) \rangle > 0$, then

$$\|H(x)\, g(x)\| \le L_1\, \|g(x)\|.$$

With these, Algorithm 2 requires at most $K = \mathcal{O}(\varepsilon^{-2})$ iterations to achieve $\|g_k\| \le \varepsilon$.

5. Handling Inexact Hessian Information

Recognizing that exact Hessian computation may not be feasible, AdaHesScale allows a Hessian approximation $\widetilde H(x)$ satisfying the error bound

$$\bigl|\langle g(x),\, (H(x) - \widetilde H(x))\, g(x) \rangle\bigr| \le \Delta_H\, \|g(x)\|^2.$$

Mild inexactness along the gradient direction compromises neither the local unit-step acceptance nor the global $\mathcal{O}(\varepsilon^{-2})$ convergence rate, provided an analogous smoothness assumption holds for $\widetilde H(x)\, g(x)$.
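As one concrete source of such an approximation, a finite-difference Hessian-vector product can stand in for the exact one; the sketch below checks the error bound along the gradient direction for an arbitrary smooth test function (the function and step size $h$ are illustrative choices, not from the paper):

```python
import numpy as np

def grad(x):
    # gradient of f(x) = 0.25 * sum(x^4) + 0.5 * ||x||^2
    return x**3 + x

def hvp_exact(x, v):
    # the Hessian of this separable f is diagonal: diag(3 x^2 + 1)
    return (3 * x**2 + 1) * v

def hvp_fd(x, v, h=1e-5):
    # central finite difference of the gradient along v: ~O(h^2) error
    return (grad(x + h * v) - grad(x - h * v)) / (2 * h)

x = np.array([0.5, -1.0, 2.0])
g = grad(x)

# |<g, (H - H_tilde) g>| <= Delta_H * ||g||^2 with a tiny Delta_H
err = abs(g @ (hvp_exact(x, g) - hvp_fd(x, g)))
assert err <= 1e-6 * (g @ g)
```

Because the bound only constrains the error *along the gradient*, even fairly crude approximations (diagonal estimates, subsampled Hessians) can qualify, which is the practical point of this section.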

6. Algorithmic Pseudocode and Computational Overhead

The AdaHesScale method consists of the following core pseudocode:

Require initial x0, tolerance εg, ρ∈(0,½), σ>0
For k=0,1,2,...
    Evaluate gk=∇f(xk)
    If ∥gk∥≤εg: stop
    Compute vk=Hk gk (Hessian-vector product)
    γ←⟨gk, vk⟩
    If γ > σ∥gk∥^2 (SPC): sk←∥gk∥^2/γ (CG scaling or MR/GM variant)
    Else if γ≥0 (LPC): sk←1/σ
    Else (NC): sk←1/σ or larger
    pk=−sk gk
    Choose αk by Armijo line search
    xk+1←xk+αk pk

Each iteration involves a gradient and a Hessian-vector product, a few inner products and scalar operations, and a small number of function evaluations attributable to the line search.
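The pseudocode above can be turned into a minimal runnable sketch. This version uses the CG scaling and takes the objective, gradient, and Hessian-vector product as callables; the quadratic test problem at the bottom is an illustrative choice, not the paper's benchmark:

```python
import numpy as np

def adahesscale(f, grad, hvp, x0, eps_g=1e-8, rho=0.25, sigma=1e-8,
                max_iter=200):
    """Minimal sketch of the AdaHesScale loop with CG scaling."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        v = hvp(x, g)                  # Hessian-vector product H_k g_k
        gamma = g @ v                  # curvature along the gradient
        if gamma > sigma * (g @ g):    # SPC: sufficient positive curvature
            s = (g @ g) / gamma        # CG scaling
        else:                          # LPC/NC: low or negative curvature
            s = 1.0 / sigma
        p = -s * g
        # Armijo backtracking starting from the unit step
        alpha, fx, slope = 1.0, f(x), g @ p
        while f(x + alpha * p) > fx + rho * alpha * slope:
            alpha *= 0.5
        x = x + alpha * p
    return x

# Quadratic test problem: the unique minimizer is the origin.
A = np.diag([1.0, 4.0, 9.0])
x_star = adahesscale(lambda x: 0.5 * x @ (A @ x),
                     lambda x: A @ x,
                     lambda x, v: A @ v,
                     np.array([1.0, 1.0, 1.0]))
```

On this convex quadratic the SPC branch always fires and every unit step passes the Armijo test with $\rho = 0.25 < \tfrac12$, so the inner while-loop never backtracks; in an autodiff framework the `hvp` callable would be supplied by reverse-over-reverse differentiation rather than an explicit matrix.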

7. Empirical Performance and Practical Findings

AdaHesScale was empirically validated on:

  • Convex $\ell_2$-regularized multiclass logistic regression (CIFAR-10)
  • Nonconvex two-layer MLP (FashionMNIST)
  • Deep ResNet-18 (Imagenette)

Baselines included fixed-step and line-search gradient descent, Heavy-Ball, Nesterov acceleration, and Adam, assessed by oracle call counts. The alternating CG/MR scaling variant ("MRCG") produced the best monotonic decrease in $f$, with unit steps accepted almost everywhere. AdaHesScale matched or outperformed well-tuned first-order methods without the need for hand-tuned step sizes. The MR variant particularly favored monotonic reduction in $\|g_k\|$ under unit step sizes, showing a strong built-in bias toward gradient-norm reduction.

These results indicate that AdaHesScale delivers the robustness and low per-iteration cost of plain gradient descent, combined with curvature adaptation that enables locally aggressive steps and globally reliable convergence, even under substantially weakened smoothness assumptions (Smee et al., 6 Feb 2025).
