Diagonal Hessian Estimation Techniques
- Diagonal Hessian estimation is the process of approximating a function's Hessian diagonal using gradient, function, or limited Hessian-vector evaluations, enabling efficient curvature analysis.
- Methods such as finite differences, the CSHD scheme, and stochastic algorithms like curvature propagation balance computational cost with estimation accuracy in optimization and machine learning.
- Advanced approaches including HesScale, quantum spectral methods, and block-diagonal approximations offer robust solutions for high-dimensional inference and deep neural network training.
Diagonal Hessian estimation is the process of approximating or computing the diagonal entries of the Hessian matrix $\nabla^2 f(x)$ of a scalar function $f : \mathbb{R}^n \to \mathbb{R}$, either exactly or approximately, using only gradient, function, or limited Hessian-vector (or matrix-vector) information. The diagonal often serves as a computationally tractable proxy for the full Hessian in optimization, machine learning, numerical PDEs, and statistical estimation, where access to the entire matrix is prohibitive due to dimensionality, cost, or structural constraints.
1. Foundational Methods: Finite Differences, Function Evaluation, and Sample Set Design
Diagonal entries can be individually approximated by classical second-order finite difference schemes in the coordinate directions,
$$[\nabla^2 f(x)]_{ii} \approx \frac{f(x + h e_i) - 2 f(x) + f(x - h e_i)}{h^2},$$
where $e_i$ is the $i$-th standard basis vector and $h > 0$ a small step. This approach requires $\mathcal{O}(n)$ function evaluations and is widely adopted in blackbox/derivative-free optimization and negative curvature detection (Hare et al., 2022). In settings where only function values are available, such as black-box optimization or simulation-based models, the diagonal can be systematically estimated for all $i = 1, \dots, n$ with $2n+1$ evaluations.
More advanced approaches, such as the Generalized Centered Simplex Hessian Diagonal (CSHD) (Jarry-Bolduc, 2021), use a set of sample directions $S = [s_1 \; \cdots \; s_m] \in \mathbb{R}^{n \times m}$ and combine evaluations at $x \pm s_j$ using an explicit formula,
$$\widehat{\operatorname{diag}}\big(\nabla^2 f(x)\big) = \big((S \odot S)^{\top}\big)^{\dagger}\, \delta_f(x; S),$$
where $S \odot S$ denotes the elementwise (Hadamard) square and $\delta_f(x; S)$ encodes the symmetrized quadratic increments $[\delta_f(x; S)]_j = f(x + s_j) + f(x - s_j) - 2 f(x)$. Achieving optimal quadratic accuracy, i.e., $\mathcal{O}(\Delta^2)$ error in the sampling radius $\Delta$, requires the sample set to be a full-row-rank "lonely matrix": each column contains a single nonzero entry, so each $s_j$ is aligned with a single coordinate direction. If the off-diagonal (strictly upper triangular) terms in the error bound are nonzero due to the sampling geometry, accuracy and convergence rates degrade regardless of the number of evaluations or step size (Jarry-Bolduc, 2021, Hare et al., 2023).
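To make the construction concrete, the following NumPy sketch estimates the Hessian diagonal from the symmetrized increments described above; with a lonely sample set $S = h I$ it reduces to the classical centered finite-difference formula. The helper name `cshd_diagonal`, the test function, and the step size are illustrative choices, not code from the cited papers.

```python
import numpy as np

def cshd_diagonal(f, x, S):
    """Estimate diag(Hessian of f at x) from symmetrized quadratic increments.

    S is an (n, m) matrix whose columns are sample directions s_j; for a
    full-row-rank "lonely" S (one nonzero per column) the estimate has
    O(Delta^2) error in the sampling radius Delta.
    """
    fx = f(x)
    # delta_j = f(x + s_j) + f(x - s_j) - 2 f(x)  ~  s_j^T H s_j
    delta = np.array([f(x + s) + f(x - s) - 2.0 * fx for s in S.T])
    # For lonely S, s_j^T H s_j = sum_i H_ii s_{ij}^2, i.e. delta = (S*S)^T diag(H);
    # solve the (least-squares) linear system for the diagonal.
    return np.linalg.lstsq((S * S).T, delta, rcond=None)[0]

if __name__ == "__main__":
    # Quadratic test: f(x) = 0.5 x^T A x has Hessian A.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    A = 0.5 * (A + A.T)
    f = lambda x: 0.5 * x @ A @ x
    x0 = rng.standard_normal(5)
    S = 1e-3 * np.eye(5)          # lonely sample set: classical centered FD
    print(np.allclose(cshd_diagonal(f, x0, S), np.diag(A), atol=1e-6))
```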
| Method | Complexity (function evaluations) | Accuracy | Sample Set Requirement |
|---|---|---|---|
| Centered FD | $2n+1$ | $\mathcal{O}(h^2)$ | Axis-aligned, equally spaced |
| CSHD | $2m+1$ for $m \ge n$ directions | $\mathcal{O}(\Delta^2)$ | Lonely matrix, full row rank |
| GSH/GCSH | $\mathcal{O}(n^2)$ (full Hessian) | $\mathcal{O}(\Delta)$ / $\mathcal{O}(\Delta^2)$ | Arbitrary/projected for GSH |
2. Stochastic and Algorithmic Approaches in Machine Learning
In large-scale machine learning models, especially deep neural networks, exact computation or explicit storage of full Hessians is prohibitive. Multiple scalable, approximate methods for diagonal Hessian estimation have been developed.
Curvature Propagation (CP)
Curvature Propagation (Martens et al., 2012) is a stochastic algorithm that backpropagates random probe vectors through the computational graph. Each pass provides an unbiased rank-one estimate of the Hessian, and its diagonal is read off from the elementwise product of the propagated vectors, which result from a recursion seeded with random starting vectors at each node. A single sample costs about two reverse-mode (gradient) passes, and the variance decays as $1/S$ when averaging $S$ samples. CP achieves significantly lower variance on the diagonal compared to Hessian-vector-product-based outer-product estimators.
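Curvature Propagation itself requires a custom recursion at each node of the computational graph. As a generic point of comparison, the sketch below implements a plain Hutchinson-style stochastic diagonal estimator with Hessian-vector products in PyTorch (the kind of outer-product estimator CP is compared against), with illustrative function and variable names.

```python
import torch

def hutchinson_hessian_diag(loss_fn, params, n_samples=200):
    """Stochastic estimate of diag(H) via E[z * (H z)] with Rademacher z."""
    loss = loss_fn(params)
    # First backward pass with create_graph=True so we can differentiate again.
    (grad,) = torch.autograd.grad(loss, params, create_graph=True)
    estimate = torch.zeros_like(params)
    for _ in range(n_samples):
        z = torch.empty_like(params).bernoulli_(0.5) * 2.0 - 1.0  # +/-1 entries
        # Hessian-vector product H z via a second backward pass.
        (hz,) = torch.autograd.grad(grad, params, grad_outputs=z, retain_graph=True)
        estimate += z * hz  # E[z_i * (H z)_i] = H_ii
    return estimate / n_samples

if __name__ == "__main__":
    # Quadratic check: loss = 0.5 x^T A x has Hessian A.
    torch.manual_seed(0)
    A = torch.randn(6, 6)
    A = 0.5 * (A + A.T)
    x = torch.randn(6, requires_grad=True)
    diag_est = hutchinson_hessian_diag(lambda p: 0.5 * p @ A @ p, x)
    print((diag_est - torch.diag(A)).abs().max())  # small, shrinks with n_samples
```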
HesScale and Modern Diagonal Recursions
HesScale (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024) revisits and refines deterministic diagonal backpropagation methods (e.g., Becker & LeCun, 1989). The key insight is to use the exact Hessian diagonal for the output layer (when available, e.g., for softmax cross-entropy), which seeds a layerwise recursion of the form
$$\frac{\partial^2 L}{\partial (z_i^{l})^2} \approx \big(\sigma'(z_i^{l})\big)^2 \sum_j \big(W_{ji}^{l+1}\big)^2\, \frac{\partial^2 L}{\partial (z_j^{l+1})^2} + \sigma''(z_i^{l})\, \frac{\partial L}{\partial a_i^{l}}, \qquad \frac{\partial^2 L}{\partial (W_{ij}^{l})^2} \approx \big(a_j^{l-1}\big)^2\, \frac{\partial^2 L}{\partial (z_i^{l})^2},$$
where $z^{l} = W^{l} a^{l-1}$ are pre-activations and $a^{l} = \sigma(z^{l})$ the corresponding activations. The method ignores off-diagonal second-derivative terms, yielding linear scaling and high diagonal accuracy. Empirical studies show that HesScale provides the best approximation to the true Hessian diagonal compared to stochastic estimators (AdaHessian, Hutchinson/MC-GGN, etc.), and its efficiency enables use in large-scale settings.
This family of methods is supported by theoretical and empirical analyses of layerwise diagonal dominance in DNN Hessians, especially for wide/deep architectures (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024).
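To make the recursion concrete, the following NumPy sketch runs a Becker-LeCun-style diagonal backward pass for a one-hidden-layer tanh network with softmax cross-entropy, seeded with the exact output-layer diagonal. It is a minimal illustration under these architectural assumptions, not the released HesScale implementation, and all names are illustrative.

```python
import numpy as np

def diag_hessian_mlp(x, y, W1, b1, W2, b2):
    """Diagonal Hessian approximation for a 1-hidden-layer tanh MLP with
    softmax cross-entropy, via a Becker-LeCun-style layerwise recursion."""
    # Forward pass.
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    z2 = W2 @ a1 + b2                        # logits
    p = np.exp(z2 - z2.max()); p /= p.sum()  # softmax probabilities

    # Output layer: exact gradient and exact Hessian diagonal of CE wrt logits.
    g_z2 = p.copy(); g_z2[y] -= 1.0          # dL/dz2
    h_z2 = p * (1.0 - p)                     # d^2L/dz2^2 (exact)

    # Back through the second linear layer, dropping off-diagonal terms.
    g_a1 = W2.T @ g_z2
    h_a1 = (W2 ** 2).T @ h_z2

    # Through tanh: h_z1 ~= sigma'(z1)^2 * h_a1 + sigma''(z1) * g_a1.
    s1 = 1.0 - a1 ** 2                       # tanh'
    s2 = -2.0 * a1 * s1                      # tanh''
    h_z1 = s1 ** 2 * h_a1 + s2 * g_a1

    # Parameter diagonals: z = W a + b is linear in (W, b).
    h_W2, h_b2 = np.outer(h_z2, a1 ** 2), h_z2
    h_W1, h_b1 = np.outer(h_z1, x ** 2), h_z1
    return h_W1, h_b1, h_W2, h_b2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, H, C = 4, 8, 3
    W1, b1 = rng.standard_normal((H, D)), np.zeros(H)
    W2, b2 = rng.standard_normal((C, H)), np.zeros(C)
    h_W1, h_b1, h_W2, h_b2 = diag_hessian_mlp(rng.standard_normal(D), 1, W1, b1, W2, b2)
    print(h_W1.shape, h_W2.shape)
```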
| Estimator | Unbiased | Complexity (per estimate) | Diag. Error (ML) | Comments |
|---|---|---|---|---|
| CP (stochastic) | Yes | $\approx 2\times$ grad per sample | Low | Needs averaging over samples |
| HesScale | No | $\approx 2\times$ grad | Very low | Deterministic; best observed accuracy |
| BL89 | No | $\approx 2\times$ grad | Medium | Older approximation |
| AdaHessian | Yes | $\approx 2\times$ grad (one Hvp) per sample | High unless heavily averaged | Stochastic |
3. Structural and Block-diagonal Hessian Approximations
Recent theoretical work (Dong et al., 5 May 2025) has illuminated why diagonal and block-diagonal Hessian estimators are particularly effective for modern neural networks, particularly those with many output classes (as in LLMs). Both empirical and rigorous analyses show that, as the number of output classes $C$ grows, off-diagonal block norms decay at quantified rates in $C$ for the output-layer Hessian and for hidden-layer blocks. This "static force" arises from architectural design (layer and class decomposition) and is supplemented by further "dynamic" suppression during training. Thus, block-diagonal or even diagonal approximations become asymptotically near-exact for large classification models and LLMs.
Block-diagonal estimators, as instantiated in Block Diagonal Hessian-Free optimization (Zhang et al., 2017), generalize the diagonal approach by modeling intra-block curvature while ignoring inter-block interactions. This framework yields robustness, improved optimization, and efficient parallelization, particularly for large minibatch training. The diagonal is recovered as the limiting case of $1 \times 1$ blocks.
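As a minimal illustration of the idea (not the Block Diagonal Hessian-Free algorithm itself, which uses matrix-free conjugate-gradient solves per block), the sketch below forms a block-diagonal approximation of a given curvature matrix and uses it as a Newton-style preconditioner; the block sizes and damping value are illustrative.

```python
import numpy as np

def block_diagonal_step(H, g, block_sizes, damping=1e-3):
    """Newton-like step using a block-diagonal approximation of H.

    Inter-block curvature is ignored; each block is solved independently
    (and could be solved in parallel). Blocks of size 1 recover a plain
    diagonal preconditioner.
    """
    step = np.empty_like(g)
    start = 0
    for size in block_sizes:
        end = start + size
        H_bb = H[start:end, start:end] + damping * np.eye(size)
        step[start:end] = np.linalg.solve(H_bb, g[start:end])
        start = end
    return step

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = rng.standard_normal((6, 6)); H = H @ H.T + np.eye(6)  # SPD curvature proxy
    g = rng.standard_normal(6)
    print(block_diagonal_step(H, g, block_sizes=[3, 2, 1]))
```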
| Structure | Decay of off-diagonal/block norms | Implication |
|---|---|---|
| Output layer ($C$ classes) | Decays as $C$ grows | Large $C$: nearly exact |
| Hidden layer | Decays as $C$ grows | Large $C$: block-diagonal suffices |
4. Derivative-Free, Blackbox, and Quantum Settings
In derivative-free optimization (DFO), or when only blackbox function evaluations are possible, diagonal Hessian estimation must avoid gradients/Hessian-vector products. Matrix algebra (Hare et al., 2023) and generalized simplex formulations (Jarry-Bolduc, 2021) enable explicit, order-optimal diagonal estimation using only targeted function values. Accuracy is critically determined by the geometry and rank of the sample set; axis-aligned or "lonely" (one nonzero per column) sample sets achieve quadratic accuracy in the sample set radius.
For high-efficiency detection of negative curvature, algorithms first check Hessian diagonals (via finite differences or exact values) and proceed to build off-diagonal elements only as needed; negative diagonal entries immediately certify negative eigenvalues (Hare et al., 2022).
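A negative diagonal entry is an immediate certificate because $H_{ii} = e_i^{\top} H e_i$, so $H_{ii} < 0$ forces $\lambda_{\min}(H) < 0$. The sketch below is a minimal derivative-free version of this first-stage check, with illustrative names and a quadratic test function.

```python
import numpy as np

def negative_curvature_from_diag(f, x, h=1e-4, tol=0.0):
    """First-stage negative-curvature check via centered-FD Hessian diagonal.

    Returns (i, e_i) for the most negative estimated diagonal entry if one
    falls below -tol (a certificate direction, since e_i^T H e_i = H_ii);
    otherwise returns None. Off-diagonal entries are never built.
    """
    n = x.size
    fx = f(x)
    diag = np.empty(n)
    for i in range(n):
        e = np.zeros(n); e[i] = h
        diag[i] = (f(x + e) - 2.0 * fx + f(x - e)) / h ** 2
    i = int(np.argmin(diag))
    if diag[i] < -tol:
        e_i = np.zeros(n); e_i[i] = 1.0
        return i, e_i
    return None

if __name__ == "__main__":
    # Saddle-shaped quadratic: negative curvature along the last coordinate.
    A = np.diag([2.0, 1.0, -0.5])
    f = lambda x: 0.5 * x @ A @ x
    print(negative_curvature_from_diag(f, np.ones(3)))
```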
Quantum algorithms have achieved exponential speedup for diagonal and sparse Hessian estimation in settings where function access is via a phase oracle (Zhang et al., 4 Jul 2024). For diagonal Hessian estimation, the quantum spectral method attains a query complexity exponentially smaller in the dimension $n$ than that of classical approaches, provided the Hessian is diagonal. For generic (dense) Hessians, quantum methods achieve at most a quadratic speedup.
| Approach | Query Complexity (diag) | Assumptions |
|---|---|---|
| Finite difference | $\mathcal{O}(n)$ function evaluations | Classical; step size adjusted to target accuracy |
| Quantum spectral | Exponentially smaller than classical in $n$ | Quantum phase oracle; diagonal/sparse case |
| Quantum finite diff | At most quadratic speedup over classical | Quantum; dense case |
5. Applications in Optimization, Quantization, and High-dimensional Inference
Diagonal Hessian estimation is broadly used for:
- Second-order optimization: Diagonal preconditioners support adaptive step sizes (e.g., Adam, RMSProp, AdaHesScale) and stabilize convergence in deep learning (Elsayed et al., 2022, Elsayed et al., 5 Jun 2024, Zhang et al., 2017); a minimal update sketch follows after this list.
- Blackbox optimization/derivative-free algorithms: Directional curvature estimation in the Hessian Estimation Evolution Strategy (HE-ES) (Glasmachers et al., 2020) uses finite-difference directional probing to tune covariance matrices and adapt search distributions.
- Quantization and compression: In data-free post-training quantization (DFQ), diagonal and block-diagonal Hessian approximations inform sensitivity-aware quantizer placement, as in SQuant (Guo et al., 2022). Data-free, architecture-agnostic algorithms leverage a progression of diagonal/kernel/global channel estimates to optimize quantization objectives without fine-tuning or data access.
- High-dimensional statistics: Diagonal elements of sparse precision (inverse covariance) matrices are critical for inference and structure learning. Residual variance, maximum likelihood, and symmetry-enforcing estimators provide risk-optimal or robust diagonal estimates, depending on accuracy of regression coefficient estimation (Balmand et al., 2015).
- Distributed optimization: Newton-like methods with diagonal correction balance communication and performance by (block-)diagonal Hessian inversion and efficient local updates (Bajovic et al., 2015).
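To illustrate how a diagonal estimate is consumed by an optimizer (as referenced in the first bullet above), the sketch below applies a generic diagonally preconditioned update $\theta \leftarrow \theta - \alpha\, g / (|\hat h| + \lambda)$. It is a schematic update rule under these assumptions, not the AdaHesScale or AdaHessian implementations, and all names are illustrative.

```python
import numpy as np

def diag_preconditioned_step(params, grad, hess_diag, lr=0.1, damping=1e-8):
    """Generic second-order-style update with a diagonal preconditioner.

    Each coordinate is scaled by the damped absolute value of its estimated
    curvature; any diagonal estimator discussed above (FD/CSHD, CP, HesScale,
    Hutchinson) can supply `hess_diag`.
    """
    return params - lr * grad / (np.abs(hess_diag) + damping)

if __name__ == "__main__":
    # Quadratic bowl f(x) = 0.5 x^T diag(d) x: with lr=1 the preconditioned
    # step jumps (nearly) to the minimum regardless of conditioning.
    d = np.array([100.0, 1.0, 0.01])
    x = np.ones(3)
    grad = d * x
    print(diag_preconditioned_step(x, grad, hess_diag=d, lr=1.0))
```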
6. Limitations, Error Bounds, and Best Practices
Theoretical analyses (Jarry-Bolduc, 2021, Hare et al., 2023) show that:
- Accuracy of diagonal estimation depends critically on sample set geometry: With "lonely" (coordinate) directions, diagonal finite-difference/CSHD achieves $\mathcal{O}(\Delta^2)$ error, matching the optimal order in the sampling radius.
- Off-diagonal coupling is a major source of error: If the sampling scheme or approximation discards high-magnitude off-diagonal Hessian terms, error does not vanish with smaller sample radii.
- Stochastic estimators (CP, AdaHessian, Hutchinson-type) are unbiased but may require many samples to reach the accuracy of deterministic, structure-aware recursions like HesScale.
- Block/diagonal approximations become near-exact in large-class, wide, or deep neural network regimes, a direct consequence of architectural decoupling at scale (Dong et al., 5 May 2025).
For robust implementation in DFO or automated optimization code:
- Use symmetric, coordinate-based or well-poised sample sets to guarantee quadratic accuracy in finite-difference frameworks (Jarry-Bolduc, 2021, Hare et al., 2023).
- For large-scale deep learning, use HesScale for Hessian diagonals, or block-diagonal approximations if grouped parameter structure is significant (Elsayed et al., 2022, Zhang et al., 2017).
- In quantum or blackbox settings, adopt the quantum spectral method for functions where phase oracles are available and sparsity permits, otherwise use classic finite difference diagonals (Zhang et al., 4 Jul 2024).
- In high-dimensional statistical inference, when regression coefficients are accurately estimated, use symmetry-enforcing maximum likelihood; otherwise use the residual variance estimator for safety (Balmand et al., 2015).
References
- (Jarry-Bolduc, 2021) Approximating the diagonal of a Hessian: which sample set of points should be used
- (Elsayed et al., 2022) HesScale: Scalable Computation of Hessian Diagonals
- (Elsayed et al., 5 Jun 2024) Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning
- (Martens et al., 2012) Estimating the Hessian by Back-propagating Curvature
- (Hare et al., 2023) A matrix algebra approach to approximate Hessians
- (Zhang et al., 2017) Block-diagonal Hessian-free Optimization for Training Neural Networks
- (Dong et al., 5 May 2025) Towards Quantifying the Hessian Structure of Neural Networks
- (Hare et al., 2022) Detecting negative eigenvalues of exact and approximate Hessian matrices in optimization
- (Zhang et al., 4 Jul 2024) Quantum spectral method for gradient and Hessian estimation
- (Balmand et al., 2015) On estimation of the diagonal elements of a sparse precision matrix
- (Glasmachers et al., 2020) The Hessian Estimation Evolution Strategy
- (Awwal et al., 2020) Iterative algorithm with structured diagonal Hessian approximation for solving nonlinear least squares problems
- (Bajovic et al., 2015) Newton-like method with diagonal correction for distributed optimization
- (Guo et al., 2022) SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation
- (Fernandez et al., 28 Sep 2025) Sketching Low-Rank Plus Diagonal Matrices