Hessian Vector Approximation Techniques
- Hessian Vector Approximation is a set of numerical and algorithmic techniques that compute Hessian and inverse Hessian operations without forming the full matrix.
- Methods such as Pearlmutter’s trick, finite differences, and hierarchical low-rank approximations enable efficient directional derivative computations with reduced computational cost.
- These techniques are pivotal for scalable second-order optimization, Bayesian inference, and deep learning model compression, balancing accuracy with computational efficiency.
Hessian vector approximation refers to the collection of numerical, algorithmic, and matrix-approximation techniques that enable the application of a Hessian operator (or its inverse or square root) to a vector, without explicitly forming or storing the full Hessian matrix. This is a fundamental building block for second-order optimization, scalable Bayesian inference, scientific inverse problems, deep learning model compression, and uncertainty quantification in large parameter spaces. Owing to the complexity of the explicit Hessian ($O(n^2)$ storage, $O(n^3)$ factorization), modern methods exploit structure, directional sampling, matrix-free products, and compression to obtain computationally tractable Hessian–vector products (HvP), inverse-Hessian–vector products (iHvP), or Hessian-root–vector products.
1. Core Algorithms for Hessian–Vector Products
Direct evaluation of $Hv$ via explicit differentiation or matrix assembly is rarely practical for modern applications, prompting a range of alternative schemes:
| Method | Cost per Product | Storage |
|---|---|---|
| Pearlmutter’s trick | $\approx 2$–$3\times$ one gradient evaluation | $O(n)$ |
| Finite difference | 2 gradient evaluations | $O(n)$ |
| Forward-mode AD (multivariate dual numbers) | one function sweep per probe direction | $O(n)$ per chunk of directions |
| Matrix-free adjoint/forward | 1 tangent + 1 second-order adjoint solve | checkpointed trajectories |
| Low-/hierarchical-rank matrix | $O(rn)$ or $O(n \log n)$ | $O(rn)$ or $O(n \log n)$ |
- In deep learning, $Hv$ is commonly evaluated using Pearlmutter’s trick (forward-over-reverse AD), finite differences of gradients, or hierarchical/block approximation (Wang et al., 2020, Ranjan et al., 29 Oct 2024, Mathieu et al., 2014); a minimal autodiff sketch follows this list.
- In ODE/PDE-constrained optimization, adjoint-based approaches compute $Hv$ via a sequence of variational, adjoint, and second-order adjoint integrations, often using memory-efficient checkpointing (Ito et al., 2019, Ambartsumyan et al., 2020).
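The sketch below illustrates the forward-over-reverse composition behind Pearlmutter’s trick using JAX (`jax.jvp` applied to `jax.grad`); the toy objective and its parameters are illustrative assumptions, not taken from the cited works.

```python
# Hessian-vector product via Pearlmutter's trick (forward-over-reverse AD).
# Minimal sketch on a toy objective with a non-constant Hessian.
import jax
import jax.numpy as jnp

def loss(theta):
    return jnp.sum(theta ** 4) + 0.5 * jnp.sum(theta ** 2)

def hvp(f, theta, v):
    # Forward-mode JVP of the reverse-mode gradient: d/dt grad f(theta + t*v) at t=0.
    return jax.jvp(jax.grad(f), (theta,), (v,))[1]

theta = jnp.arange(4.0)      # toy parameters [0, 1, 2, 3]
v = jnp.ones(4)              # probe direction
print(hvp(loss, theta, v))   # equals (12*theta**2 + 1) * v for this objective
```

The same `hvp` composition works for any differentiable scalar loss, at a cost of roughly two to three gradient evaluations per product.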
Finite-difference schemes approximate $Hv \approx \left(\nabla f(\theta + \varepsilon v) - \nabla f(\theta)\right)/\varepsilon$ for small $\varepsilon$, requiring two gradient evaluations and providing a practical route where second derivatives are not directly available (Uddin et al., 9 Nov 2025, Balboni et al., 2023).
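A gradient-only finite-difference sketch of the same product, with an illustrative analytic gradient standing in for a backpropagated one:

```python
# Finite-difference Hessian-vector product from two gradient evaluations.
# Sketch only: grad_f is an illustrative analytic gradient of sum(x^4) + 0.5*sum(x^2).
import numpy as np

def fd_hvp(grad_f, theta, v, eps=1e-5):
    # H v ~= (grad f(theta + eps*v) - grad f(theta)) / eps, with O(eps) bias.
    return (grad_f(theta + eps * v) - grad_f(theta)) / eps

grad_f = lambda x: 4.0 * x ** 3 + x
theta, v = np.arange(4.0), np.ones(4)
print(fd_hvp(grad_f, theta, v))   # close to (12*theta**2 + 1) * v
```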
In matrix-free PDE contexts, $Hv$ is obtained by repeated application of tangent-linear and adjoint solvers, with high-level structure (sparsity, local low rank, or operator separability) exploited for efficiency (Ambartsumyan et al., 2020, Hu et al., 14 Jul 2025).
2. Matrix Approximation, Compression, and Parallelization
To overcome scaling bottlenecks for large $n$, a variety of structure-exploiting schemes are used to compress the Hessian and accelerate operations:
- Hierarchical/Block low-rank (H-matrix, HODLR):
Hierarchical, off-diagonal low-rank (HODLR) (Hartland et al., 2023) and general H-matrices (Ambartsumyan et al., 2020) partition the Hessian recursively, compress off-diagonal blocks by local low-rank factorization ($UV^\top$), and enable Hessian–vector products and even inverse-Hessian solves in $O(n \log n)$ or $O(n \log^2 n)$ time for fixed rank $r$. These methods rely on geometric partitioning and randomized SVD for block basis construction; a global randomized low-rank sketch of the same flavor appears after this list.
- Product-convolution and pseudodifferential symbol methods:
For high-rank PDE Hessians, point spread function (PSF) and pseudo-differential operator (PDO) probing sample the local spatial and frequency response, interpolate in phase space, and express the Hessian as sums of convolutions or localized FFTs (Hu et al., 14 Jul 2025).
- FFT-style sparse rotation factorization:
Fast rotation learning composes linearithmically many layers of Givens rotations (FFT-like “butterfly” networks) with a diagonal core, $H \approx Q D Q^\top$, achieving $O(n \log n)$ cost per $Hv$ after training. The factorization is learned by minimizing a reconstruction objective over training pairs (Mathieu et al., 2014).
- Block-diagonal and Kronecker product (KFAC/EKFAC) preconditioners:
In deep learning, Kronecker-factored approximate curvature (KFAC/EKFAC) constructs blockwise eigen-decompositions of curvature matrices (sometimes the Fisher, sometimes Gauss-Newton or Hessian approximations) for each layer and provides cheap blockwise or diagonal pseudo-inverses for iHvP (Wang et al., 19 Jul 2025).
- Forward-mode AD chunking and GPU parallelization:
The CHESSFAD library (Ranjan et al., 29 Oct 2024) chunks the forward-mode dual number representation so that each chunk computes a small batch of Hessian entries; this enables massive parallelism across Hessian rows and chunks, especially suited for GPUs.
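The hierarchical and low-rank schemes above build their block bases with randomized linear algebra; the sketch below shows the global (single-block) version of that step, a randomized range finder driven purely by matrix-free Hessian–vector products, with the rank and oversampling values as illustrative assumptions.

```python
# Randomized low-rank Hessian approximation H ~= Q (Q^T H Q) Q^T built from HvPs only.
# Global (single-block) sketch of the randomized-SVD step used for block basis construction.
import numpy as np

def randomized_lowrank_hessian(hvp, n, rank, oversample=5, seed=0):
    rng = np.random.default_rng(seed)
    k = rank + oversample
    # Probe the Hessian with random directions to capture its dominant range.
    Omega = rng.standard_normal((n, k))
    Y = np.column_stack([hvp(Omega[:, j]) for j in range(k)])
    Q, _ = np.linalg.qr(Y)                         # orthonormal basis for the range
    HQ = np.column_stack([hvp(Q[:, j]) for j in range(Q.shape[1])])
    core = Q.T @ HQ                                # small k-by-k projected Hessian
    return Q, core                                 # H ~= Q @ core @ Q.T

# Illustrative low-rank-plus-damping Hessian, accessed only through products.
n, r = 200, 10
U = np.random.default_rng(1).standard_normal((n, r))
hvp = lambda v: U @ (U.T @ v) + 1e-3 * v
Q, core = randomized_lowrank_hessian(hvp, n, rank=r)
v = np.ones(n)
err = np.linalg.norm(Q @ (core @ (Q.T @ v)) - hvp(v)) / np.linalg.norm(hvp(v))
print(err)   # small, since the dominant range is captured by the probes
```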
3. Stochastic and Black-Box Hessian Approximation
When only function or noisy gradient evaluations are available (“black-box”), Hessian or Hessian-vector approximations must be constructed by stochastic probing, Stein identities, or derivative-free quadratic modeling:
- Stein identity three-query Hessian estimation:
Using Gaussian probe directions $u \sim \mathcal{N}(0, I)$, Stein's identity enables the construction of an estimator of $\nabla^2 f(x)$ (unbiased up to a smoothing bias controlled by the sampling radius $\sigma$) via three queries $f(x+\sigma u)$, $f(x)$, $f(x-\sigma u)$, contracting $(uu^\top - I)$ under a suitable $1/\sigma^2$ scaling (Zhu, 2021); see the sketch after this list.
- Matrix algebra approach to Hessian approximation:
Quadratic interpolation or difference-based matrix algebra forms an approximate Hessian from pointwise function evaluations over structured samples, and extracts $Hv$ directly from a linear solve, making it suitable for derivative-free optimization (Hare et al., 2023).
- Stochastic quasi-Newton methods as Hessian proxies:
In adaptive SG-MCMC, limited-memory L-BFGS approximations store curvature pairs $(s_k, y_k)$ to approximate the inverse Hessian recursively via the two-loop algorithm, with positive-definite constraints and stochastic Robbins–Monro updates controlling the bias due to minibatching and truncation (Wang et al., 2020).
- Bias–variance trade-offs in stochastic settings:
Black-box methods exhibit a bias that shrinks as the step-size or sampling radius is reduced, and a variance that blows up as the same parameter becomes small. Efficiency thus depends on balancing step-sizes, batch sizes, and sample complexity (Zhu, 2021, Wang et al., 2020).
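A minimal sketch of the three-query estimator under the Gaussian-probe assumption follows; the symmetric combination of function values keeps the expectation unchanged while reducing variance, and the toy objective, $\sigma$, and sample count are illustrative choices.

```python
# Zeroth-order Hessian-vector estimate from Stein's identity (three function queries per probe).
# Sketch assuming Gaussian probes u ~ N(0, I); sigma trades smoothing bias against variance.
import numpy as np

def stein_hvp(f, x, v, sigma=1e-2, num_samples=20000, seed=0):
    rng = np.random.default_rng(seed)
    n = x.size
    acc = np.zeros(n)
    f0 = f(x)                                  # central query, shared across probes
    for _ in range(num_samples):
        u = rng.standard_normal(n)
        # Second-order central difference of function values along u.
        c = (f(x + sigma * u) + f(x - sigma * u) - 2.0 * f0) / (2.0 * sigma ** 2)
        # Contract the (u u^T - I) factor against the direction v.
        acc += c * (u * (u @ v) - v)
    return acc / num_samples

f = lambda x: np.sum(x ** 4) + 0.5 * np.sum(x ** 2)   # illustrative objective
x, v = np.arange(4.0), np.ones(4)
print(stein_hvp(f, x, v))   # noisy estimate of (12*x**2 + 1) * v
```

As the bullet above notes, shrinking $\sigma$ reduces the smoothing bias but inflates the variance, so the sample count must grow accordingly.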
4. Inverse Hessian–Vector Product (iHvP) and Preconditioning
Inverse Hessian–vector computation ($H^{-1}v$) is central to influence-function computation, training-data attribution, Newton-CG, and refined Bayesian MCMC proposals:
- Neumann series and stochastic recursion:
Using a damped curvature $H + \lambda I$, scaled so that its spectrum lies in $(0, 2)$, the Neumann series and its stochastic recursion implement the fixed-point iteration $x_{k+1} = v + \left(I - (H + \lambda I)\right) x_k$, converging to $(H + \lambda I)^{-1} v$; the iteration is typically terminated after a fixed number of steps or by a residual criterion (Wang et al., 19 Jul 2025). A plain (unpreconditioned) sketch follows this list.
- Preconditioned Neumann (ASTRA):
Preconditioning by EKFAC (eigenvalue-corrected Kronecker blocks) improves the condition number, so the recursion runs on the preconditioned system $P^{-1}(H + \lambda I)\,x = P^{-1}v$ with $P$ the EKFAC approximation. This reduces the number of required iterations by orders of magnitude versus the unpreconditioned stochastic recursion for iHvP (Wang et al., 19 Jul 2025).
- Quasi-Newton and L-BFGS smoothed inverse updates:
In sampling or optimization, positive-definite L-BFGS updates for Hessian and inverse Hessian are merged with stochastic smoothing and bias control via step-sizes, as in preconditioned Riemann-manifold Langevin MCMC (Wang et al., 2020).
- HODLR and H-matrix inverse and square-root:
Once built, hierarchical and HODLR matrices support inversion and square-root application, enabling Laplace proposal sampling for MCMC, and fast preconditioned conjugate-gradient solves where low-rank or banded structure is not applicable (Hartland et al., 2023, Ambartsumyan et al., 2020).
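The following sketch implements the damped, scaled Neumann recursion against an HvP oracle; the scaling $\alpha$, damping $\lambda$, and iteration count are illustrative assumptions, and in a preconditioned variant such as ASTRA the identity preconditioner used here would be replaced by EKFAC blocks.

```python
# Truncated Neumann-series inverse-Hessian-vector product (iHvP) from HvPs only.
# Sketch: alpha is chosen so that the spectrum of alpha*(H + lam*I) lies in (0, 2).
import numpy as np

def neumann_ihvp(hvp, v, lam=1e-3, alpha=0.1, num_steps=2000):
    # Fixed point of  x = v + (I - alpha*(H + lam*I)) x  is  x* = (alpha*(H + lam*I))^{-1} v,
    # so alpha * x* approximates (H + lam*I)^{-1} v.
    x = v.copy()
    for _ in range(num_steps):
        x = v + x - alpha * (hvp(x) + lam * x)
    return alpha * x

# Illustrative SPD curvature matrix, accessed only through products.
n = 50
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
H = A @ A.T / n + np.eye(n)
hvp = lambda x: H @ x
v = rng.standard_normal(n)
x = neumann_ihvp(hvp, v)
exact = np.linalg.solve(H + 1e-3 * np.eye(n), v)
print(np.linalg.norm(x - exact) / np.linalg.norm(exact))   # small relative error
```

In stochastic practice each `hvp` call would use a fresh minibatch, turning the fixed-point iteration into the stochastic recursion described above.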
5. Applications in Optimization, Bayesian Inference, and Deep Learning
Hessian vector approximation methodology is integral to a spectrum of applications:
- Second-order stochastic optimization and preconditioned SGD:
Adaptive learning-rate selection (ADLER) leverages fast Hessian–vector products to build a local quadratic model and solve for its minimizing step size in closed form, matching the accuracy of grid search with only about twice the cost of SGD (Balboni et al., 2023).
- Marginal likelihood estimation and Bayesian posterior sampling:
In Bayesian inverse problems and uncertainty quantification, both Gauss–Newton and high-rank Hessian approximations (PSF, PDO, or PSF+) underpin scalable Laplace approximations and Hessian-informed proposal distributions for MCMC. High-rank approximations are essential when the Hessian spectrum is extended due to highly informative data (Hu et al., 14 Jul 2025).
- SG-MCMC sampling with geometry-adapted proposals:
Adaptive Hessian-preconditioned MCMC exploits limited-memory BFGS and stochastic updating to incorporate local geometric information, controlling bias and computational overhead. This yields improved posterior recovery and reduced autocorrelation in high-dimensional problems (Wang et al., 2020).
- Model pruning and compression:
In neural network pruning, Hessian-vector products approximate the top eigenvalue/eigenvector (dominant curvature direction) via power iteration (see the sketch after this list), guiding cyclic pair-merging strategies that preserve sensitive weights and enable high sparsity with minimal accuracy loss, as evidenced in CAMP-HiVe (Uddin et al., 9 Nov 2025).
- Training-data attribution:
Influence functions and related algorithms depend on accurate iHvP in deep (large-scale) models. Preconditioned Neumann methods (e.g., ASTRA) using EKFAC are shown to substantially improve training-data attribution scores relative to blockwise direct inversion and to reduce sensitivity to hyperparameters (Wang et al., 19 Jul 2025).
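The curvature-probing step behind such pruning criteria reduces to power iteration on the HvP oracle; the sketch below is a generic version under illustrative assumptions, not the specific CAMP-HiVe procedure.

```python
# Top Hessian eigenvalue/eigenvector via power iteration on a HvP oracle.
# Generic sketch of the curvature probe used in Hessian-aware pruning and sensitivity analysis.
import numpy as np

def top_eigenpair(hvp, n, num_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    for _ in range(num_iters):
        w = hvp(u)
        u = w / np.linalg.norm(w)
    return u @ hvp(u), u           # Rayleigh quotient and eigenvector estimate

# Illustrative SPD curvature matrix accessed only through products.
n = 100
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
H = A @ A.T / n
eigval, eigvec = top_eigenpair(lambda x: H @ x, n)
print(eigval, np.max(np.linalg.eigvalsh(H)))   # the two values should agree closely
```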
6. Computational Complexity and Implementation Trade-offs
Key trade-offs across techniques include:
| Scheme | Time per $Hv$ | Memory | Accuracy/bias | Parallelism |
|---|---|---|---|---|
| Pearlmutter/AD | $\approx 2$–$3\times$ gradient cost | $O(n)$ | Machine precision | Moderate |
| Finite difference | 2 gradient evaluations | $O(n)$ | $O(\varepsilon)$ bias | High |
| H-matrix/HODLR | $O(n \log n)$ | $O(rn \log n)$ | Controllable, global | Yes |
| PSF/PDO (FWI) | $O(n \log n)$ (FFT) | Sampled PSFs/symbols | Local, phase-space interp. | High, FFT-based |
| KFAC/EKFAC precond. | Blockwise matvecs | Per-layer factors | Block approximation | Blockwise |
| Quasi-Newton (L-BFGS) | $O(mn)$ | $O(mn)$ | Minibatch/truncation bias | Moderate |
| Stochastic black-box | A few function queries per probe | $O(n)$ | Smoothing bias, high variance | High |
| Forward-mode chunked | One dual sweep per chunk | Chunk-sized dual numbers | Exact up to AD arithmetic | Massive (GPU) |
Low-rank or H-matrix schemes excel when the Hessian exhibits compressibility in spatially separated blocks (e.g., elliptic or parabolic PDEs, inverse problems), while high-rank PSF/PDO hybrid schemes are preferred for strongly informative, non-compressible operators (Hu et al., 14 Jul 2025, Hartland et al., 2023, Ambartsumyan et al., 2020). Power-iteration eigenvector methods provide scalable access to dominant directions for sensitivity and pruning. Stochastic and zero-order approaches offer practical Hessian-vector products when gradients or Hessians are unavailable.
Practical implementation usually includes empirical tuning of approximation parameters: chunk size for chunked AD (Ranjan et al., 29 Oct 2024), preconditioner damping for Neumann preconditioning (Wang et al., 19 Jul 2025), or the memory length $m$ for quasi-Newton schemes (Wang et al., 2020). Parallelization is critical; fine-grained decomposition across instances, rows, and chunks is directly exploited on GPUs in modern libraries.
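For reference alongside the quasi-Newton entries above, the standard two-loop recursion reconstructs an inverse-Hessian–vector product from the $m$ most recent curvature pairs; the memory length and toy quadratic below are illustrative assumptions.

```python
# L-BFGS two-loop recursion: apply an approximate inverse Hessian to a vector
# using only the m most recent curvature pairs (s_k, y_k), ordered oldest to newest.
import numpy as np

def lbfgs_two_loop(v, s_list, y_list):
    # s_k = theta_{k+1} - theta_k,  y_k = grad_{k+1} - grad_k.
    q = v.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # Initial scaling H0 = gamma * I taken from the most recent pair.
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r   # limited-memory approximation of H^{-1} v

# Illustrative quadratic: gradients are H @ theta, so y_k = H @ s_k exactly.
n, m = 20, 10
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
H = A @ A.T / n + np.eye(n)
s_list = [rng.standard_normal(n) for _ in range(m)]
y_list = [H @ s for s in s_list]
v = rng.standard_normal(n)
approx = lbfgs_two_loop(v, s_list, y_list)
exact = np.linalg.solve(H, v)
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))   # error of the m-pair approximation
```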
7. Numerical Performance and Empirical Benchmarks
Quantitative results from the literature demonstrate that:
- Limited-memory quasi-Newton Hessian-vector approximations lead to lower covariance estimation error and reduced sampling autocorrelation in SG-MCMC (Wang et al., 2020).
- In CHESSFAD, forward-mode AD with chunking delivers 5–50% speedup compared to standard autodiff libraries on CPUs, and significant (sometimes order-of-magnitude) acceleration on batched large-scale GPU tasks (Ranjan et al., 29 Oct 2024).
- Hierarchical/HODLR and PSF+/PDO schemes yield reductions of $2\times$ or more in PDE-solve cost, or corresponding effective-sample-size improvements, compared to global low-rank approximations for PDE inversion, seismic FWI, or UQ (Hartland et al., 2023, Hu et al., 14 Jul 2025, Ambartsumyan et al., 2020).
- In data-driven pruning and model compression, curvature-informed Hessian-vector approximations retain substantially higher accuracy after significant sparsification and greatly improve feature-map preservation metrics (Uddin et al., 9 Nov 2025).
These results emphasize that the combination of structured approximation, algebraic and stochastic estimation, and matrix-free product techniques yields scalable Hessian vector methods applicable to extreme-scale models and scientific applications.