Hessian Approximation Techniques
- Hessian approximation is a method to construct a surrogate for the true Hessian, reducing computational and storage costs in high-dimensional optimization.
- Matrix-free, diagonal, and hierarchical low-rank techniques preserve critical curvature and spectral properties while enabling scalable Newton-type methods.
- Derivative-free and Bayesian approaches offer robust Hessian estimates with limited function evaluations, enhancing performance in machine learning and inverse problems.
A Hessian approximation is any computational technique or formula for constructing a surrogate to the exact Hessian matrix ∇²f(x) of a scalar function f(x), especially when ∇²f(x) is too costly or infeasible to compute directly. Hessian approximation is central to high-dimensional nonlinear optimization, PDE-constrained inverse problems, stochastic and derivative-free optimization, and large-scale machine learning. Techniques include structured analytic surrogates, matrix-free vector products, hierarchical low-rank representations, diagonal or block-diagonal heuristics, randomized finite-difference estimators, and Bayesian learning of the Hessian from noisy data.
1. Theoretical Foundations of Hessian Approximation
The Hessian ∇²f(x) contains all second partial derivatives of a function f: ℝⁿ → ℝ. In many optimization algorithms—and for uncertainty quantification in inverse and machine learning problems—the Hessian governs local curvature, Newton directions, and posterior variances. However, for n ≫ 1, ∇²f(x) is both expensive to form (O(n²) storage, O(n³) factorization) and, for PDE-constrained or implicit functions, never explicitly available.
The central objective is therefore to build an approximation H̃(x) ≈ ∇²f(x) that:
- Is much cheaper to form, store, or invert,
- Preserves spectral or structural properties (positive definiteness, locality, low rank, or sparsity),
- Retains convergence guarantees or practical efficacy in optimization or inference.
Approximations are tailored to the structure of f(x), the computational constraints (e.g., only matrix-free Hessian-vector products), and the statistical noise regime (deterministic, stochastic, or black-box).
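As a concrete illustration of the matrix-free access pattern assumed throughout, the sketch below approximates a Hessian-vector product by differencing gradients; the Rosenbrock test function, the step size eps, and the helper names are illustrative choices, not part of any cited method.

```python
import numpy as np

def rosenbrock_grad(x):
    """Gradient of the 2-D Rosenbrock function (illustrative test problem)."""
    g = np.zeros_like(x)
    g[0] = -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2)
    g[1] = 200.0 * (x[1] - x[0] ** 2)
    return g

def hess_vec(grad, x, v, eps=1e-6):
    """Matrix-free Hessian-vector product: H v ≈ (∇f(x + eps·v) − ∇f(x)) / eps."""
    return (grad(x + eps * v) - grad(x)) / eps

x = np.array([1.2, 1.0])
v = np.array([1.0, 0.0])
print(hess_vec(rosenbrock_grad, x, v))  # ≈ first column of ∇²f(x)
```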
2. Structured and Matrix-Free Diagonal Hessian Approximations
In nonlinear least-squares, f(x) = ½‖F(x)‖², the Hessian decomposes as ∇²f(x) = J(x)ᵀJ(x) + C(x), where J(x) is the Jacobian of F and C(x) is a sum of second-derivative terms (Awwal et al., 2020). Directly forming J(x)ᵀJ(x) is O(n²m), and the second term is usually neglected in practice (Gauss–Newton, GN). The iterative diagonal Hessian approximation (“ASDH”) constructs a matrix-free diagonal Dₖ ≈ ∇²f(xₖ) satisfying
Dₖ sₖ₋₁ ≈ JₖᵀJₖ sₖ₋₁ + (Jₖ − Jₖ₋₁)ᵀFₖ,
where all quantities are computed as matrix-vector products. Diagonal entries dᵢ are set to the componentwise ratios yᵢ/sᵢ, truncated at zero, and then safeguarded to ensure strict positive definiteness. The resulting surrogate yields scalable, globally convergent search directions and is competitive on large-scale least-squares problems. Compared to dense BFGS (which is storage-intensive and does not exploit the JᵀJ + C structure), such diagonal formulas cost O(n) per iteration and are well suited to matrix-free contexts (Awwal et al., 2020).
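A minimal sketch of such a diagonal update, assuming matrix-free callbacks for the Gauss–Newton product JₖᵀJₖv and the Jacobian-transpose products (the callback names, the assembly of the right-hand side, and the safeguarding floor are illustrative, not the exact formula of the cited work):

```python
import numpy as np

def diag_hessian_update(jtjv_k, jt_k, jt_km1, F_k, s_km1, floor=1e-8):
    """
    Matrix-free diagonal Hessian surrogate for f(x) = 0.5 * ||F(x)||^2.

    jtjv_k : callable v -> J_k^T (J_k v)   (Gauss-Newton product at x_k)
    jt_k   : callable r -> J_k^T r
    jt_km1 : callable r -> J_{k-1}^T r
    F_k    : residual vector F(x_k)
    s_km1  : previous step x_k - x_{k-1}
    """
    # Curvature information along the last step (secant-style right-hand side).
    y = jtjv_k(s_km1) + jt_k(F_k) - jt_km1(F_k)
    # Componentwise ratio gives the diagonal; safeguard for strict positive definiteness.
    d = np.full_like(s_km1, floor)
    nz = s_km1 != 0.0
    d[nz] = y[nz] / s_km1[nz]
    return np.maximum(d, floor)

# A quasi-Newton step then costs O(n):  p_k = -grad_k / d  (componentwise).
```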
3. Hierarchical and Low-Rank Hessian Approximations in PDE Inverse Problems
In PDE-constrained large-scale inverse problems, the Hessian is often approximated as a sum of low-rank or hierarchical off-diagonal low-rank (HODLR) matrices to enable fast Newton-type steps, posterior sampling, and Bayesian quantification (Hartland et al., 2023, Ambartsumyan et al., 2020). A symmetric HODLR matrix is constructed by block-partitioning and recursively compressing all off-diagonal blocks to rank ≪ n. The overall format is
A = Q L Qᵀ + ∑_{levels,blocks} U Σ Vᵀ,
where each UΣVᵀ represents a compressed off-diagonal block. The HODLR (and broader hierarchical matrix—ℋ-matrix) structure allows O(n log n) matrix-vector products, O(n log²n) direct solutions, and adaptive error control.
Building such approximations requires only matrix-free Hessian-vector products (available via forward–adjoint PDE solves or adjoint-based methods), so storage and computational costs are massively reduced compared to dense Hessians. Empirical results on ice-sheet and geophysical models show that HODLR outperforms global low-rank (truncated SVD, randomized SVD) when data informativeness is high or the sensitivity is localized (Hartland et al., 2023, Ambartsumyan et al., 2020).
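The compression of a single off-diagonal block from Hessian-vector products alone can be sketched with a randomized range finder; the index sets, oversampling, and explicit use of symmetry below are generic choices rather than the specific HODLR construction of the cited works.

```python
import numpy as np

def offdiag_block_lowrank(hess_vec, rows, cols, n, rank, oversample=5, seed=0):
    """
    Randomized low-rank factorization of the off-diagonal block H[rows, cols]
    using only matrix-free products v -> H v (each product typically costs one
    forward/adjoint PDE solve pair in the inverse-problem setting).
    """
    rng = np.random.default_rng(seed)
    k = rank + oversample
    # Probe the block: place Gaussian test vectors on 'cols', observe on 'rows'.
    Y = np.zeros((len(rows), k))
    for j in range(k):
        v = np.zeros(n)
        v[cols] = rng.standard_normal(len(cols))
        Y[:, j] = hess_vec(v)[rows]
    Q, _ = np.linalg.qr(Y)  # orthonormal basis for the block's approximate range
    # B = Q^T H[rows, cols]; since H is symmetric, row j of B is (H w_j)[cols]
    # with w_j equal to Q[:, j] embedded on the 'rows' indices.
    B = np.zeros((Q.shape[1], len(cols)))
    for j in range(Q.shape[1]):
        w = np.zeros(n)
        w[rows] = Q[:, j]
        B[j, :] = hess_vec(w)[cols]
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small[:, :rank]
    return U, s[:rank], Vt[:rank, :]  # H[rows, cols] ≈ U @ np.diag(s) @ Vt
```

Recursing this compression over the block hierarchy, and storing the diagonal blocks densely at the finest level, yields the HODLR format described above.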
4. Derivative-Free and Black-Box Hessian Approximations
When gradients and Hessian-vector products are unavailable, the oracle may provide only function evaluations. Contemporary derivative-free schemes therefore rely on finite-difference analogues, randomized perturbations, or interpolating polynomials:
- Stein’s identity approaches: In black-box settings, the second-order Stein identity enables an unbiased symmetric Hessian estimate with only three zeroth-order (function-value) queries per iteration (Zhu, 2021). For x ∈ ℝⁿ, u ∼ N(0, I), and smoothing radius σ > 0, a representative estimator of this type is
Ĥ = [f(x + σu) + f(x − σu) − 2 f(x)] / (2σ²) · (u uᵀ − I),
which is unbiased for the Hessian of the Gaussian-smoothed objective (a numerical sketch appears after this list). This matches the optimal O(k^{−1/3}) convergence rate for stochastic approximation while reducing query complexity relative to 2SPSA.
- Nested-set and simplex-based approximations: Matrix-algebraic schemes, such as the nested-set Hessian approximation and the (generalized) simplex Hessians, estimate either full or partial Hessians via structured sets of function evaluations and pseudoinverse-based least-squares systems. For full n×n Hessians, (n+1)(n+2)/2 function evaluations suffice for order-O(Δ) or O(Δ²) accuracy (Hare et al., 2020, Hare et al., 2023, Jarry-Bolduc, 2021, Jarry-Bolduc et al., 2023); a generic interpolation-based sketch also follows this list.
These methods can be specialized to recover only the diagonal, selected entries, or projected subblocks, thus controlling evaluation cost. They are suitable for integration into derivative-free trust-region or model-based optimization.
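A numerical sketch of a three-query, Stein-identity-style Hessian estimate follows; the scaling and the averaging loop are representative choices, and the cited estimator may differ in detail.

```python
import numpy as np

def stein_hessian_estimate(f, x, sigma=1e-2, seed=None):
    """
    Three zeroth-order queries -> symmetric Hessian estimate.  The symmetric
    second difference along a Gaussian direction is combined with the Stein
    weight (u u^T - I); the result is unbiased for the Gaussian-smoothed Hessian.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(x.size)
    curv = (f(x + sigma * u) + f(x - sigma * u) - 2.0 * f(x)) / (2.0 * sigma ** 2)
    return curv * (np.outer(u, u) - np.eye(x.size))

# Averaging many estimates only illustrates unbiasedness; the online algorithm
# uses one (or a few) per iteration.  For a quadratic, the Hessian is recovered.
f = lambda z: 0.5 * z @ np.diag([1.0, 4.0]) @ z
H = np.mean([stein_hessian_estimate(f, np.zeros(2), seed=i) for i in range(20000)], axis=0)
print(H)  # ≈ diag(1, 4)
```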
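For the interpolation-based family, the generic sketch below fits a determined quadratic model to (n+1)(n+2)/2 function values and reads the Hessian off its coefficients; the coordinate-aligned sample set is an illustrative choice, not the nested-set or simplex constructions themselves.

```python
import numpy as np
from itertools import combinations

def interpolation_hessian(f, x0, delta=1e-3):
    """Full Hessian estimate from (n+1)(n+2)/2 function values at radius delta."""
    n = x0.size
    E = np.eye(n)
    # Sample set: center, +/- coordinate steps, and pairwise sums of coordinate steps.
    steps = [np.zeros(n)]
    steps += [delta * E[i] for i in range(n)] + [-delta * E[i] for i in range(n)]
    steps += [delta * (E[i] + E[j]) for i, j in combinations(range(n), 2)]
    # Quadratic model m(s) = c + g^T s + 0.5 s^T H s, parametrized by the upper
    # triangle of H: basis terms are 1, s_i, 0.5*s_i^2 (i = j), and s_i*s_j (i < j).
    quad_idx = [(i, j) for i in range(n) for j in range(i, n)]
    def basis_row(s):
        quad = [0.5 * s[i] * s[i] if i == j else s[i] * s[j] for i, j in quad_idx]
        return np.concatenate(([1.0], s, quad))
    A = np.array([basis_row(s) for s in steps])
    b = np.array([f(x0 + s) for s in steps])
    coef = np.linalg.solve(A, b)  # square, poised interpolation system
    H = np.zeros((n, n))
    for c, (i, j) in zip(coef[1 + n:], quad_idx):
        H[i, j] = H[j, i] = c
    return H
```

Restricting the sample set (e.g., to coordinate directions only) recovers diagonal or partial Hessian estimates at proportionally lower evaluation cost.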
5. Scalable Hessian Diagonal Approximations in Large-Scale Machine Learning
Deep learning and modern reinforcement learning require scalable diagonal Hessian approximations for efficient second-order optimization and step-size adaptation. In this context, the Hessian diagonal can be approximated via layerwise backpropagation recursions:
- The HesScale algorithm runs a backward pass through the network, estimating the diagonal of the pre-activation Hessian at each layer using only local connections and the diagonal already propagated from the layer above; off-diagonal curvature terms in the layer-to-layer recursion are dropped (a minimal sketch follows this list).
The final layer is initialized with its exact Hessian diagonal (for a softmax output with cross-entropy loss this is pᵢ(1 − pᵢ), where p denotes the softmax output). The algorithm matches the per-iteration cost of classical backpropagation (O(n)), in sharp contrast to finite-difference or Hutchinson-style Monte Carlo schemes (e.g., AdaHessian), which exhibit significantly higher computational cost and variance (Elsayed et al., 2022, Elsayed et al., 2024).
- Empirical evaluations show that HesScale achieves state-of-the-art fidelity to the true diagonal, outperforming MC- and GGN-based alternatives in both supervised (CIFAR, MNIST) and RL (MuJoCo, real-robot) settings, and that it remains robust across all tested architectures (Elsayed et al., 2024).
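A minimal sketch of a HesScale-style diagonal backward pass for a small tanh MLP with a softmax cross-entropy output; the layer structure, activation, and variable names are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def hesscale_diag(Ws, bs, x, y_onehot):
    """Per-layer estimates of diag(d²L/dW²) from one forward and one backward pass."""
    # Forward pass, caching pre-activations and activations (bias curvature omitted).
    h, acts, pre = x, [x], []
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = W @ h + b
        h = np.tanh(a)
        pre.append(a)
        acts.append(h)
    logits = Ws[-1] @ h + bs[-1]
    p = np.exp(logits - logits.max())
    p /= p.sum()

    # Output layer: exact gradient and exact Hessian diagonal of cross-entropy w.r.t. logits.
    g = p - y_onehot                         # dL/d(logits)
    q = p * (1.0 - p)                        # diag of d²L/d(logits)²
    diag_W = [np.outer(q, acts[-1] ** 2)]    # diag curvature of the last weight matrix

    # Backward recursion keeping only diagonal curvature terms (HesScale-style).
    for l in range(len(Ws) - 2, -1, -1):
        W_next = Ws[l + 1]
        t = np.tanh(pre[l])
        dphi, d2phi = 1.0 - t ** 2, -2.0 * t * (1.0 - t ** 2)
        gh = W_next.T @ g                    # dL/dh_l
        qh = (W_next ** 2).T @ q             # propagated diagonal curvature w.r.t. h_l
        q = dphi ** 2 * qh + d2phi * gh      # diag of d²L/da_l² (off-diagonals dropped)
        g = dphi * gh                        # dL/da_l
        diag_W.insert(0, np.outer(q, acts[l] ** 2))
    return diag_W
```

These per-parameter curvature estimates can then scale gradient steps in the style of diagonal second-order optimizers.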
6. Advanced Analytic and Bayesian Approaches to Hessian Approximation
Several problem classes support further analytic or data-driven reduction of the Hessian:
- Polarization tensor-based approximations: In PDE inverse problems with well-separated inclusions, the asymptotic polarization tensor describes nonlinearity saturation. Differentiating the tensor analytically yields a highly efficient diagonal Hessian that captures critical curvature and saturation effects missed by the naive Gauss–Newton diagonal (Watson et al., 2021). These diagonal Hessians dramatically accelerate BFGS-style convergence and enhance the separation of close inclusions in imaging tasks.
- Bayesian learning of Hessian: In stochastic/quasi-Newton settings, the Hessian can be learned as the MAP estimate under a likelihood based on the noisy secant equations (matching curvature pairs sₖ and yₖ) and a prior encoding the spectral constraints (e.g., barriers on extreme eigenvalues) and proximity to the previous iterate. The resulting Hessian subproblem is solved by an interior-point method (Newton–CG) and empirically ensures stable, condition-controlled preconditioning for stochastic gradient descent, even in highly ill-conditioned scenarios (Carlon et al., 2022).
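A deliberately simplified stand-in for this idea is sketched below: it fits a symmetric matrix to noisy secant pairs with a Frobenius proximity term and then clips eigenvalues, in place of the log-barrier prior and interior-point (Newton–CG) solve of the cited work.

```python
import numpy as np

def fit_hessian_from_secants(S, Y, H_prev, reg=1.0, lam_min=1e-3, lam_max=1e3):
    """
    Minimize ||H S - Y||_F^2 + reg * ||H - H_prev||_F^2 over H, then symmetrize
    and project the spectrum into [lam_min, lam_max].

    S, Y : (n, m) arrays whose columns are curvature pairs s_k, y_k.
    """
    n = H_prev.shape[0]
    A = S @ S.T + reg * np.eye(n)            # normal-equations matrix
    B = Y @ S.T + reg * H_prev
    H = np.linalg.solve(A, B.T).T            # H = B A^{-1} (A is symmetric)
    H = 0.5 * (H + H.T)                      # symmetrize
    lam, V = np.linalg.eigh(H)
    lam = np.clip(lam, lam_min, lam_max)     # crude spectral control
    return V @ np.diag(lam) @ V.T
```

The resulting matrix (or its inverse) can serve as a preconditioner for stochastic gradient steps, mirroring the role of the MAP Hessian estimate described above.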
7. Applications and Impact Across Scientific Computing and Learning
Hessian approximations pervade areas where second-order information is needed but forming or storing the full matrix is infeasible:
- In large-scale nonlinear least-squares and inverse PDE problems, fast Hessian approximations (hierarchical, diagonal, polarization-tensor, or adjoint-based) deliver tractable Newton or quasi-Newton steps and enable Bayesian uncertainty quantification by making posterior sampling practical (Awwal et al., 2020, Watson et al., 2021, Hartland et al., 2023, Ambartsumyan et al., 2020, Ito et al., 2019).
- In stochastic zeroth-order and reinforcement learning, diagonal (e.g., HesScale), secant-equation–based Bayesian Hessians, or fast Stein-identity estimators enable robust convergence and stability of optimizers at minimal computational overhead (Elsayed et al., 2022, Elsayed et al., 2024, Carlon et al., 2022, Zhu, 2021).
- In derivative-free optimization, nested-set and simplex-based Hessian approximations offer systematic error control and adaptability for high dimensionality and partial estimation regimes (Hare et al., 2020, Hare et al., 2023, Jarry-Bolduc, 2021, Jarry-Bolduc et al., 2023).
- In deep learning, scalable Hessian approximations are critical for uncertainty quantification (e.g., Laplace/free Laplace variances via regularization perturbation (McInerney et al., 2024)), adaptive step-size scaling, and controlling the geometry of optimization trajectories.
Empirical and theoretical results across these domains demonstrate that properly designed Hessian approximations can simultaneously deliver efficiency, robustness, spectral (eigenvalue) control, and problem-specific structural exploitation—without sacrificing global convergence guarantees in practical regimes.