HesScale: Scalable Hessian Diagonals

Updated 21 December 2025
  • HesScale is a computational methodology for efficiently estimating the diagonal of the Hessian matrix in deep neural networks, enabling scalable second-order optimization.
  • It employs a recursive backward propagation of diagonal estimates, seeded by an exact final-layer computation, with a Gauss–Newton variant for piecewise-linear activations, maintaining linear complexity.
  • Empirical benchmarks demonstrate that HesScale reduces computational overhead compared to Monte Carlo methods while improving convergence in both supervised and reinforcement learning tasks.

HesScale is a computational methodology and family of approximations for efficiently estimating the diagonal of the Hessian matrix in deep learning models. It enables scalable, accurate incorporation of second-order (curvature) information into optimization and reinforcement learning, while maintaining linear computational and memory complexity, similar to standard backpropagation. HesScale refines earlier approaches (notably, Becker and LeCun 1989) by providing a principled backward recursion for diagonal Hessian entries, utilizing exact computation in the last layer when possible, and propagating these diagonals through the network with minimal additional cost (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024).

1. Mathematical Foundations

The Hessian matrix, $H(\theta) = \nabla^2_\theta L(\theta)$, encapsulates the local curvature of the loss $L$ with respect to the parameters $\theta$. For models with $n$ parameters, direct computation and storage of $H$ scales as $O(n^2)$, which is infeasible for modern deep networks. Practical implementations require efficient approximations to the Hessian or its diagonal. The diagonal entries, $H_{ii}(\theta)$, provide per-parameter curvature estimates critical for preconditioning updates and adaptive step-size selection.
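
For contrast with the scalable recursion introduced below, the following sketch (not part of HesScale) computes the exact Hessian diagonal of a tiny model's loss with PyTorch autograd; it materializes the full $n \times n$ Hessian, which is precisely the $O(n^2)$ cost that HesScale avoids. The loss and dimensions are illustrative.

```python
import torch

torch.manual_seed(0)
n = 20                                   # tiny parameter count; exact Hessians do not scale
theta = torch.randn(n)
x, y = torch.randn(8, n), torch.randn(8)

def loss_fn(theta):
    # a small nonlinear least-squares loss standing in for a network loss
    pred = torch.tanh(x @ theta)
    return ((pred - y) ** 2).mean()

# Materializes the full n x n Hessian: O(n^2) memory and many backward-like sweeps.
H = torch.autograd.functional.hessian(loss_fn, theta)
exact_diag = torch.diagonal(H)           # the quantity HesScale approximates in O(n)
print(exact_diag[:5])
```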

HesScale builds upon the principle that, for a feed-forward network, the Hessian diagonal can be recursively propagated layer by layer if all off-diagonal terms are ignored, following the scalable approximation of Becker and LeCun (1989), hereafter BL89. The major advancement is the replacement of the approximate final-layer diagonal (for standard losses) with its analytic, exact form (e.g., for softmax cross-entropy, $[\nabla^2 C]_{ii} = q_i(1-q_i)$, where $q$ is the output probability vector) (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024).
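
As a quick check of this closed form, the sketch below compares $q_i(1-q_i)$ against the autograd Hessian of softmax cross-entropy with respect to the logits; the number of classes and the target label are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(10)
target = torch.tensor(3)

loss_fn = lambda z: F.cross_entropy(z.unsqueeze(0), target.unsqueeze(0))
H = torch.autograd.functional.hessian(loss_fn, logits)   # 10 x 10 Hessian w.r.t. logits

q = F.softmax(logits, dim=0)
closed_form = q * (1 - q)                                # exact diagonal q_i (1 - q_i)
print(torch.allclose(torch.diagonal(H), closed_form, atol=1e-5))  # expected: True
```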

2. HesScale Algorithmic Formulation

Let $a_l$ denote pre-activations, $h_l = \sigma(a_l)$ activations, $W_l$ weights, and $g_{a_l,i} = \partial L/\partial a_{l,i}$.

The HesScale diagonal recursion for the pre-activations at layer $l$ is

$$\widehat{H}_{a_l,i} = \sigma'(a_{l,i})^2 \sum_{k} W_{l+1,k,i}^2\,\widehat{H}_{a_{l+1},k} + \sigma''(a_{l,i}) \sum_{k} W_{l+1,k,i}\,g_{a_{l+1},k},$$

initialized at the output layer ($l = L$) by its closed-form diagonal. For the weights,

$$\widehat{H}_{W_l,i,j} = \widehat{H}_{a_l,i}\,(h_{l-1,j})^2.$$

A Gauss–Newton variant, HesScaleGN, omits the $\sigma''$ term for piecewise-linear activations (e.g., ReLU), yielding

$$\widehat{H}^{\mathrm{GN}}_{a_l,i} = \sigma'(a_{l,i})^2 \sum_k W_{l+1,k,i}^2\,\widehat{H}_{a_{l+1},k}.$$

This iterative scheme matches the per-layer computational structure and cost of standard backpropagation (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024).
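
To make the recursion concrete, here is a minimal sketch for a small fully connected network with tanh hidden activations, a linear output layer, and softmax cross-entropy; the exact last-layer diagonal from Section 1 seeds the backward sweep. Layer sizes, initialization, and variable names are illustrative, and the sketch follows the equations above rather than any reference implementation.

```python
import torch

torch.manual_seed(0)
sizes = [6, 8, 5, 3]                          # input, two hidden layers, output classes
Ws = [torch.randn(sizes[l + 1], sizes[l]) * 0.5 for l in range(len(sizes) - 1)]
x = torch.randn(sizes[0])
y = 1                                         # target class index

# Forward pass, storing pre-activations a_l and activations h_l (hs[0] is the input).
hs, as_ = [x], []
for l, W in enumerate(Ws):
    a = W @ hs[-1]
    as_.append(a)
    hs.append(torch.tanh(a) if l < len(Ws) - 1 else a)   # linear output layer

q = torch.softmax(as_[-1], dim=0)

# Backward sweep: gradient g_a and HesScale diagonal H_a per layer.
g_a = q.clone(); g_a[y] -= 1.0                # dL/da_L for softmax cross-entropy
H_a = q * (1 - q)                             # exact last-layer diagonal
H_W = [None] * len(Ws)

for l in reversed(range(len(Ws))):
    H_W[l] = torch.outer(H_a, hs[l] ** 2)     # Ĥ_{W_l,i,j} = Ĥ_{a_l,i} (h_{l-1,j})^2
    if l == 0:
        break
    W = Ws[l]
    t = torch.tanh(as_[l - 1])
    sp, spp = 1 - t ** 2, -2 * t * (1 - t ** 2)           # tanh' and tanh''
    H_a = sp ** 2 * ((W ** 2).T @ H_a) + spp * (W.T @ g_a)
    g_a = sp * (W.T @ g_a)

print([h.shape for h in H_W])                 # per-weight Hessian-diagonal estimates
```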

3. Computational Complexity and Practical Implementation

HesScale achieves $O(n)$ time and space complexity per example, in stark contrast to $O(n^2)$ for the exact Hessian. At each layer, the additional computation beyond backpropagation comprises a matrix-vector product, elementwise squares, and sums with the same dimensions as the usual gradient backward pass. Empirical timings indicate that HesScale (AdaHesScale) incurs approximately $2\times$ the computational cost of Adam, and HesScaleGN approximately $1.25\times$ (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024). By comparison, Monte Carlo-based unbiased approximations (e.g., AdaHessian) cost $3\times$ or more (Elsayed et al., 5 Jun 2024).

The HesScale backward sweep can be implemented as an augmentation to the standard backward pass, requiring only the storage of diagonal Hessian estimates and gradients per parameter. For convolutional networks, the per-parameter complexity remains linear, though with larger constants due to the nature of convolutional layers.
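
A minimal illustration of the memory claim, assuming one extra buffer per parameter tensor with the same shape as the gradient; the dict-based bookkeeping mirrors common per-parameter optimizer state, and the model is illustrative.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(6, 8), torch.nn.Tanh(), torch.nn.Linear(8, 3))
# one diagonal-estimate buffer per parameter tensor, same shape as the gradient
hess_diag = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
extra = sum(b.numel() for b in hess_diag.values())
total = sum(p.numel() for p in model.parameters())
print(extra, total)   # extra storage equals the parameter count, i.e. O(n)
```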

4. Integration with Optimization and Reinforcement Learning

Incorporation into optimization is direct: the diagonal Hessian estimates serve as adaptive preconditioners for Newton-style updates,

$$\theta_i \leftarrow \theta_i - \alpha\,\frac{g_i}{\widehat{H}_{ii} + \epsilon}.$$

Improved stability and generalization are empirically observed when integrated into Adam-like schemes (AdaHesScale), where exponential moving averages of gradients and squared Hessian diagonals are used:

$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,(\widehat{H}_t)^2, \\ \hat{m}_t &= m_t/(1-\beta_1^t), \qquad \hat{v}_t = v_t/(1-\beta_2^t), \\ \theta_{t+1} &= \theta_t - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \end{aligned}$$

(Elsayed et al., 2022, Elsayed et al., 5 Jun 2024)
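
A sketch of a single AdaHesScale-style step under the assumption, per the equations above, that the squared HesScale diagonal replaces Adam's squared gradient in the second moment; `grad` and `hess_diag` are assumed to come from the augmented backward pass, and all other names are illustrative.

```python
import torch

def adahesscale_step(param, grad, hess_diag, state, lr=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    state["t"] += 1
    m, v, t = state["m"], state["v"], state["t"]
    m.mul_(beta1).add_(grad, alpha=1 - beta1)              # first moment of gradients
    v.mul_(beta2).add_(hess_diag ** 2, alpha=1 - beta2)    # second moment of Hessian diagonals
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (v_hat.sqrt() + eps)

# usage with dummy tensors
p = torch.zeros(5)
state = {"m": torch.zeros(5), "v": torch.zeros(5), "t": 0}
adahesscale_step(p, grad=torch.randn(5), hess_diag=torch.rand(5), state=state)
print(p)
```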

For reinforcement learning, HesScale enables efficient step-size scaling and robust trust-region-style updating. The nominal parameter update $u = -\hat{m}/(\sqrt{\hat{v}} + \epsilon)$ can be scaled to enforce a quadratic constraint $\Delta\theta^\top H\,\Delta\theta \leq \delta^2$, with minimal additional computational overhead:

$$\Delta\theta = \min\!\left(1, \frac{\delta}{\sqrt{u^\top \tilde{H} u}}\right) u.$$

Empirical evidence indicates significant improvements in stability and insensitivity to learning-rate choices in both simulated and real-world RL tasks (Elsayed et al., 5 Jun 2024).
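
A sketch of the scaling rule above using the diagonal estimate: the absolute value of the HesScale diagonal stands in for $\tilde{H}$ as a positive curvature surrogate, which is an assumption of this sketch rather than a prescription from the papers; `delta` and the tensors are illustrative.

```python
import torch

def scaled_update(m_hat, v_hat, hess_diag, delta=0.01, eps=1e-8):
    u = -m_hat / (v_hat.sqrt() + eps)                 # nominal Adam-style direction
    quad = torch.sum(u * hess_diag.abs() * u)         # u^T diag(|Ĥ|) u, assumed curvature surrogate
    scale = torch.clamp(delta / (quad.sqrt() + eps), max=1.0)
    return scale * u                                  # approximately enforces Δθ^T Ĥ Δθ ≤ δ²

delta_theta = scaled_update(torch.randn(5), torch.rand(5), torch.rand(5))
print(delta_theta)
```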

5. Empirical Results and Benchmarking

Extensive experiments compare HesScale, HesScaleGN, BL89, GGN-diagonal, AdaHessian (MC sampling), and first-order methods. Key findings include:

  • Approximation accuracy: HesScale achieves the lowest $L^1$ error to the true Hessian diagonal among scalable methods, outperforming BL89 and even high-sample MC methods (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024).
  • Supervised classification: On DeepOBS benchmarks (MNIST-MLP, CIFAR-10/100-CNNs), AdaHesScale and AdaHesScaleGN converge faster and reach lower test loss than first-order (Adam, SGD) and stochastic second-order methods (Elsayed et al., 2022).
  • Reinforcement learning: In MuJoCo and real-robot environments, AdaHesScale achieves higher final returns and faster learning in several tasks compared to Adam and AdaHessian. Step-size scaling with HesScale leads to robust, tuning-free optimization across wide learning rate ranges (Elsayed et al., 5 Jun 2024).
| Method | Relative Cost (Adam = 1) | Diagonal Approx. Error (normalized, HesScale = 1.0) |
|---|---|---|
| AdaHesScale | 2.0 | 1.0 |
| AdaHesScaleGN | 1.25 | 1.0 |
| AdaHessian (MC1) | 3.0 | >6.5 |
| BL89 | 1.8 | 1.8 |

6. Extensions, Limitations, and Future Directions

Limitations of HesScale include the neglect of all off-diagonal Hessian structure, which may affect performance in settings with strong parameter coupling or highly non-linear architectures. Dependence on analytic second derivatives $\sigma''$ restricts some activation choices, though the GN variant mitigates this for piecewise-linear activations (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024).

Potential extensions:

  • Block-diagonal HesScale, propagating small blocks of Hessian diagonals for richer curvature.
  • Hybrid stochastic–deterministic diagonals, mixing the HesScale recursion with MC-based traces (e.g., Hutchinson; a minimal sketch of such an estimator follows this list).
  • Application to natural gradient/trust region methods and as a pruning criterion (e.g., optimal brain surgeon).
  • Integration into non-standard architectures (RNNs, transformers, GNNs).
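
For comparison with the deterministic recursion, here is a minimal sketch of a Hutchinson-style Monte Carlo diagonal estimate, $\operatorname{diag}(H) \approx \mathbb{E}[z \odot Hz]$ with Rademacher $z$, the kind of stochastic estimator used by methods such as AdaHessian; the loss, dimensions, and sample count are illustrative.

```python
import torch

torch.manual_seed(0)
theta = torch.randn(10, requires_grad=True)
loss = torch.sum(torch.tanh(theta) ** 2)              # stand-in loss

grad = torch.autograd.grad(loss, theta, create_graph=True)[0]
est = torch.zeros_like(theta)
num_samples = 100
for _ in range(num_samples):
    z = torch.randint(0, 2, theta.shape).float() * 2 - 1          # Rademacher ±1 vector
    hz = torch.autograd.grad(grad, theta, grad_outputs=z, retain_graph=True)[0]  # Hessian-vector product
    est += z * hz / num_samples                                   # average of z ⊙ (Hz)
print(est[:5])
```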

Ongoing research areas involve automatic switching between HesScale and Gauss–Newton variants by layer, optimizing implementations for large-scale convolutions, and theoretical analysis of sample complexity and convergence in highly nonconvex optimization (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024).

7. Significance and Impact

HesScale addresses a fundamental bottleneck in incorporating second-order information into large neural network optimization, offering a practical compromise between computational feasibility and approximation fidelity. Published empirical validation demonstrates superior performance over both traditional and stochastic diagonal approximations, with minimal additional compute in standard learning workflows.

By enabling accurate, scalable Hessian diagonal estimation, HesScale supports improved optimization speed, adaptive step-size scaling, and enhanced stability in both supervised and reinforcement learning domains. Its lightweight implementation and extensibility position it for adoption in future optimization frameworks and architectures that rely on fine-grained curvature adaptation (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024).
