AdaHesScale: Hessian-Aware Adaptive Gradient Descent
- AdaHesScale is a Hessian-aware, adaptively scaled gradient descent method that uses local curvature information to modulate steps for unconstrained optimization.
- It computes a scalar step scaling from inner products between gradients and Hessian-vector products, yielding locally accepted unit steps and global convergence even in nonconvex settings.
- Empirical results on tasks like logistic regression and deep neural networks demonstrate robust performance at a per-iteration cost comparable to that of first-order methods.
AdaHesScale is a Hessian-aware, adaptively scaled variant of gradient descent designed for large-scale unconstrained optimization where the objective is twice differentiable and potentially nonconvex. It modifies the standard gradient descent scheme by introducing a scalar scaling for the gradient direction derived from local curvature information, maintaining low per-iteration cost and enhancing robustness to the choice of step size. AdaHesScale preserves the simplicity of first-order methods while integrating second-order information to provide a local unit step size guarantee and global convergence under notably weaker smoothness requirements than traditional gradient descent approaches (Smee et al., 6 Feb 2025).
1. Problem Formulation and Notation
AdaHesScale addresses optimization problems of the form

$$\min_{x \in \mathbb{R}^n} f(x),$$

where $f: \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable and bounded below, $\nabla f(x)$ is the gradient, and $\nabla^2 f(x)$ is the Hessian. At the $k$th iteration, the iterate $x_k$ has associated gradient $g_k = \nabla f(x_k)$ and Hessian $H_k = \nabla^2 f(x_k)$.
2. Hessian-Aware Scaling Mechanics
Instead of adapting the search direction itself, AdaHesScale introduces a positive scalar scaling $s_k > 0$ that modulates the gradient:

$$p_k = -s_k\, g_k.$$

The scaling is chosen from the curvature along the gradient direction, $\gamma_k = \langle g_k, H_k g_k \rangle$, so that the step mirrors the descent properties of Newton-type schemes on the one-dimensional subspace spanned by $g_k$. When $\gamma_k > \sigma \|g_k\|^2$ (strong positive curvature), the canonical scalings for this one-dimensional problem are

$$s_k^{\mathrm{CG}} = \frac{\|g_k\|^2}{\langle g_k, H_k g_k\rangle}, \qquad s_k^{\mathrm{MR}} = \frac{\langle g_k, H_k g_k\rangle}{\|H_k g_k\|^2}, \qquad s_k^{\mathrm{GM}} = \frac{\|g_k\|}{\|H_k g_k\|},$$

i.e. the Cauchy (CG), minimal-residual (MR), and geometric-mean (GM) scalings. In situations with negative curvature ($\gamma_k < 0$), the method permits larger scalings; for small or vanishing positive curvature, the cap $s_k = 1/\sigma$ is enforced, where $\sigma > 0$ is a small tolerance parameter.
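The regime-dependent choice of the scalar can be sketched in a few lines of numpy. This is an illustrative reading of the three curvature cases described above (the function name, the `variant` parameter, and the defaults are this sketch's own, not from the paper):

```python
import numpy as np

def select_scaling(g, Hg, sigma=1e-4, variant="cg"):
    """Choose the positive scalar s for the step p = -s*g from the curvature
    gamma = <g, H g> along the gradient direction. Sketch of the three
    curvature regimes; parameter names are illustrative."""
    gamma = float(g @ Hg)
    gnorm2 = float(g @ g)
    if gamma > sigma * gnorm2:            # strong positive curvature (SPC)
        if variant == "cg":               # Cauchy scaling
            return gnorm2 / gamma
        if variant == "mr":               # minimal-residual scaling
            return gamma / float(Hg @ Hg)
        return float(np.sqrt(gnorm2 / (Hg @ Hg)))  # "gm": geometric mean
    return 1.0 / sigma                    # LPC cap, or NC (1/sigma or larger)
```

Note that the GM scaling is exactly the geometric mean of the CG and MR scalings, so all three coincide when $g_k$ is an eigenvector of $H_k$.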
3. Update Rule, Line Search, and Algorithmic Structure
The update is performed as

$$x_{k+1} = x_k + \alpha_k p_k, \qquad p_k = -s_k g_k,$$

with the step size $\alpha_k$ chosen to ensure sufficient decrease through Armijo-type backtracking or forward tracking:

$$f(x_k + \alpha_k p_k) \le f(x_k) + \rho\, \alpha_k \langle g_k, p_k \rangle, \qquad \rho \in (0, \tfrac{1}{2}).$$

Algorithmically, each iteration involves:
- Gradient evaluation.
- Hessian-vector product via reverse-mode autodifferentiation.
- Scalar operations and possible function evaluations during line search.
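The backtracking variant of the line search can be sketched as follows (forward tracking is omitted; the function name and defaults are illustrative, not the paper's):

```python
import numpy as np

def armijo(f, x, p, g, rho=1e-4, alpha0=1.0, shrink=0.5, max_iter=50):
    """Backtracking line search: shrink alpha until the Armijo
    sufficient-decrease condition
        f(x + a*p) <= f(x) + rho * a * <g, p>
    holds. Assumes p is a descent direction, i.e. <g, p> < 0."""
    fx = f(x)
    slope = float(g @ p)      # directional derivative along p
    a = alpha0
    for _ in range(max_iter):
        if f(x + a * p) <= fx + rho * a * slope:
            return a
        a *= shrink
    return a

# Illustrative use on f(x) = 0.5*||x||^2 with the plain gradient step p = -g:
x = np.array([1.0, 1.0])
alpha = armijo(lambda z: 0.5 * float(z @ z), x, -x, x)  # unit step accepted
```

Because the CG scaling already solves the one-dimensional quadratic model along $g_k$, the unit step $\alpha_k = 1$ is typically accepted immediately, so the line search adds few function evaluations in practice.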
Summary Table: Core Steps and Computations
| Step | Operation | Computational Element |
|---|---|---|
| Compute $g_k$, $H_k g_k$ | Gradient and Hessian-vector product | Reverse-mode autodiff sweeps |
| Test curvature | Compare $\langle g_k, H_k g_k\rangle$ against $\sigma\|g_k\|^2$ | Inner product |
| Select $s_k$ | Curvature-guided, fixed or adaptive | Scalar selection (CG/MR/GM) |
| Line search | Satisfy Armijo rule | Function evaluation(s) |
4. Local and Global Convergence Properties
The method attains a local unit-step guarantee near a local minimizer $x^*$ at which the second-order sufficient conditions hold, i.e. $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$. In a neighborhood of $x^*$, the method accepts the unit step $\alpha_k = 1$ at every iteration and the iterates converge Q-linearly to $x^*$.
Global convergence is demonstrated under smoothness assumptions weaker than those of classical gradient descent. The following directional conditions are imposed:
- Hessian directional smoothness: for some constant $L_1 > 0$, the Hessian varies Lipschitz-continuously along the gradient direction.
- Hessian-gradient directional smoothness: for some constant $L_2 > 0$, if $\langle g_k, H_k g_k \rangle > 0$, the Hessian-gradient product varies Lipschitz-continuously along the gradient direction.
With these, Algorithm 2 requires at most $O(\varepsilon_g^{-2})$ iterations to achieve $\|\nabla f(x_k)\| \le \varepsilon_g$.
5. Handling Inexact Hessian Information
Recognizing that exact Hessian computation may not be feasible, AdaHesScale allows for a Hessian approximation $\tilde{H}_k$ satisfying an error bound along the gradient direction of the form $\|(\tilde{H}_k - H_k) g_k\| \le \nu \|g_k\|$ for some $\nu \ge 0$. Mild inexactness along the gradient direction compromises neither the local unit-step-size acceptance nor the global convergence rate, provided an analogous directional smoothness assumption holds for $\tilde{H}_k$.
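One common source of such inexactness is approximating the Hessian-vector product by finite differences of the gradient, which needs only two gradient calls and incurs an $O(\varepsilon)$ directional error. A minimal sketch (function name and defaults are illustrative):

```python
import numpy as np

def hvp_fd(grad, x, v, eps=1e-6):
    """Forward-difference approximation of the Hessian-vector product
    H(x) v via two gradient evaluations:
        (grad(x + eps*v) - grad(x)) / eps  =  H(x) v + O(eps).
    This is the kind of mild directional inexactness tolerated above."""
    return (grad(x + eps * v) - grad(x)) / eps
```

For a quadratic $f(x) = \tfrac{1}{2} x^\top A x$, the gradient $Ax$ is linear, so the approximation recovers $Av$ up to roundoff.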
6. Algorithmic Pseudocode and Computational Overhead
The AdaHesScale method consists of the following core pseudocode:
```
Require: initial x0, tolerance εg, ρ ∈ (0, ½), σ > 0
for k = 0, 1, 2, ...
    evaluate gk = ∇f(xk)
    if ∥gk∥ ≤ εg: stop
    compute vk = Hk gk                  (Hessian-vector product)
    γ ← ⟨gk, vk⟩
    if γ > σ∥gk∥²   (SPC): sk ← ∥gk∥²/γ (CG scaling, or MR/GM variant)
    else if γ ≥ 0   (LPC): sk ← 1/σ
    else            (NC):  sk ← 1/σ or larger
    pk ← −sk gk
    choose αk by Armijo line search
    xk+1 ← xk + αk pk
```
Each iteration involves a gradient and a Hessian-vector product, a few inner products and scalar operations, and a small number of function evaluations attributable to the line search.
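A minimal numpy rendering of the loop above, using the CG scaling in the strongly positive curvature case and Armijo backtracking, is shown on a small strongly convex quadratic. The test problem, its matrices, and all parameter defaults are illustrative choices of this sketch, not from the paper:

```python
import numpy as np

def adahesscale(f, grad, hess, x0, eps_g=1e-6, sigma=1e-4,
                rho=1e-4, max_iter=1000):
    """Sketch of the pseudocode above: CG scaling under strong positive
    curvature, the 1/sigma cap otherwise, and Armijo backtracking."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            return x, k
        v = hess(x) @ g                        # Hessian-vector product
        gamma = float(g @ v)
        gnorm2 = float(g @ g)
        s = gnorm2 / gamma if gamma > sigma * gnorm2 else 1.0 / sigma
        p = -s * g
        fx, slope, a = f(x), float(g @ p), 1.0
        while f(x + a * p) > fx + rho * a * slope:  # Armijo backtracking
            a *= 0.5
        x = x + a * p
    return x, max_iter

# Illustrative strongly convex quadratic: f(x) = 0.5 x^T A x - b^T x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * float(x @ A @ x) - float(b @ x)
grad = lambda x: A @ x - b
hess = lambda x: A

x_star, iters = adahesscale(f, grad, hess, np.zeros(2))
# x_star approaches the minimizer A^{-1} b = (0.6, -0.8)
```

On a quadratic, the CG scaling is the exact one-dimensional minimizer along $-g_k$, so the Armijo condition holds at $\alpha_k = 1$ on every iteration and the loop reduces to steepest descent with exact line search.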
7. Empirical Performance and Practical Findings
AdaHesScale was empirically validated on:
- Convex $\ell_2$-regularized multiclass logistic regression (CIFAR-10)
- Nonconvex two-layer MLP (FashionMNIST)
- Deep ResNet-18 (Imagenette)
Baselines included fixed-step and line-search gradient descent, Heavy-Ball momentum, Nesterov acceleration, and Adam, assessed by oracle call counts. The alternating CG/MR scaling variant ("MRCG") produced the best monotonic decrease in the objective, with unit steps accepted almost everywhere. AdaHesScale matched or outperformed well-tuned first-order methods without any hand-tuning of step sizes. The MR variant particularly favored monotonic reduction of $\|\nabla f\|$ under unit step sizes, showing a strong built-in bias toward gradient-norm reduction.
These results indicate that AdaHesScale delivers the robustness and low per-iteration cost of plain gradient descent, combined with curvature adaptation that enables locally aggressive steps and globally reliable convergence, even under substantially weakened smoothness assumptions (Smee et al., 6 Feb 2025).