
Hessian-Corrected Momentum (HCM)

Updated 22 December 2025
  • Hessian-Corrected Momentum is a second-order method that incorporates Hessian-vector products to correct classical momentum updates.
  • It improves convergence rates and reduces bias and variance by integrating local curvature information into momentum adjustments.
  • HCM is applied in various domains including stochastic optimization, distributed learning, reinforcement learning, and MCMC sampling.

Hessian-Corrected Momentum (HCM) is a class of second-order algorithms and Markov Chain Monte Carlo (MCMC) proposal mechanisms that enhance the performance of momentum-based optimization and sampling methods by explicitly incorporating local curvature information via the Hessian of the objective or log-posterior. HCM systematically corrects the classical momentum or proposal update by adding terms involving Hessian-vector products, yielding substantial improvements in convergence rate, variance reduction, and adaptation to high-curvature regions. This approach has been instantiated in stochastic nonconvex optimization, distributed learning, policy gradient reinforcement learning, MCMC, and accelerated gradient flows.

1. Core Principles of Hessian-Corrected Momentum

The defining feature of HCM is the explicit use of the Hessian or its approximations to correct the bias or inefficiency present in first-order momentum updates. The classical Polyak or Nesterov momentum methods track a moving average of previous gradients, leading to an $O(\|x_t - x_{t-1}\|)$ discrepancy between the momentum and the actual gradient at the new point. HCM injects a Hessian-vector product term to transport the momentum more accurately:

$$m_{t} = (1-\alpha)m_{t-1} + \alpha \nabla f\big(x_t + k(x_t - x_{t-1})\big)$$

$$\text{Hessian correction:}\quad m_t \leftarrow m_{t} + \nabla^2 f(x_t)(x_t - x_{t-1})$$

or, for stochastic methods:

$$\hat g_t = (1-\alpha)\left[\hat g_{t-1} + \nabla^2 f(x_t, z_t)(x_t - x_{t-1})\right] + \alpha \nabla f(x_t, z_t)$$

This ensures the transported momentum approximates the gradient at $x_t$ up to $O(\|x_t-x_{t-1}\|^2)$, yielding faster bias decay and variance reduction (Tran et al., 2021).
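As a concrete illustration, below is a minimal Python sketch of one such stochastic update; the names `grad_fn`, `hvp_fn`, and `alpha`, and the toy quadratic in the usage example, are illustrative placeholders rather than a specific published implementation.

```python
import numpy as np

def hcm_update(g_hat, x_new, x_old, grad_fn, hvp_fn, alpha):
    """One Hessian-corrected momentum step, following the stochastic update above."""
    # Transport the previous estimator to the new iterate with a Hessian-vector
    # product, then mix in the fresh stochastic gradient at x_new.
    transported = g_hat + hvp_fn(x_new, x_new - x_old)
    return (1 - alpha) * transported + alpha * grad_fn(x_new)

# Toy usage on a deterministic quadratic f(x) = 0.5 * x^T A x (noise-free for clarity).
A = np.diag([1.0, 10.0])
grad_fn = lambda x: A @ x          # exact gradient of the quadratic
hvp_fn = lambda x, v: A @ v        # exact Hessian-vector product
x_old, x_new = np.array([1.0, 1.0]), np.array([0.9, 0.5])
g_hat = hcm_update(grad_fn(x_old), x_new, x_old, grad_fn, hvp_fn, alpha=0.1)
```

A surrounding optimizer loop would then take a step such as `x = x - lr * g_hat` and repeat the update with fresh samples.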

The correction appears in various forms: as a discrete finite-difference in optimization flows (Hadjisavvas et al., 18 Jun 2025); as a variance-reduction term in stochastic cubic Newton methods (Chayti et al., 25 Oct 2024, Yang et al., 17 Jul 2025); as a quadratic Taylor approximation in Langevin-based MCMC (House, 2015); or as a curvature-adaptive mass matrix in HMC (Karimi et al., 2023, House, 2017, Jin et al., 2019).

2. HCM in Stochastic and Deterministic Optimization

HCM methodology improves convergence rates for smooth, possibly nonconvex, objectives. The second-order momentum update achieves an $O(K^{-1/3})$ convergence rate in the expected gradient norm, which matches the lower bound for nonconvex optimization with second-order oracles and surpasses the $O(K^{-1/4})$ rate of first-order Polyak momentum (Khirirat et al., 15 Dec 2025, Sadiev et al., 18 Nov 2025). The general update, stated in an arbitrary norm, is

$$m_{k+1} = (1-\alpha_k)\left(m_k + \nabla^2 f_{\xi_{k+1}}(\hat x_{k+1})(x_{k+1} - x_k)\right) + \alpha_k \nabla f_{\xi_{k+1}}(x_{k+1})$$

where $\hat x_{k+1}$ is a convex combination of $x_k$ and $x_{k+1}$; this form generalizes across Euclidean and non-Euclidean geometries.

Key theoretical guarantees are established for broad classes of smooth nonconvex functions with Lipschitz Hessian or under relaxed smoothness in arbitrary norms (Khirirat et al., 15 Dec 2025). HCM remains robust even under high Hessian noise and maintains parameter-agnostic convergence if step-sizes decay appropriately (Sadiev et al., 18 Nov 2025).

The incorporation of the Hessian correction eliminates the $O(\eta)$ transport error of classical momentum by replacing it with an $O(\eta^2)$ or $O(\eta\|\nabla f\|)$ error. This justifies the improved theoretical rates and practical gains in deep learning models such as MLPs and LSTMs (Khirirat et al., 15 Dec 2025).
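The mechanism is a standard first-order Taylor argument: expanding the gradient around the previous iterate gives

$$\nabla f(x_t) = \nabla f(x_{t-1}) + \nabla^2 f(x_{t-1})(x_t - x_{t-1}) + O(\|x_t - x_{t-1}\|^2),$$

so adding the Hessian-vector product $\nabla^2 f(x_{t-1})(x_t - x_{t-1})$ (or the same product with the Hessian taken at $x_t$ or a midpoint, which differs only at second order for Lipschitz Hessians) to the stored momentum cancels the first-order mismatch. The residual transport error then scales with $\|x_t - x_{t-1}\|^2 = O(\eta^2)$ rather than $O(\eta)$.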

3. HCM in Stochastic Cubic Regularized Newton Methods

Cubic-regularized Newton (CRN) and its stochastic variants (SCRN) achieve stronger complexity bounds compared to first-order methods but suffer from high Hessian estimation costs. HCM dramatically improves efficiency in this context by maintaining a momentum-corrected Hessian estimator $M_k$ using two schemes (a minimal sketch of both updates follows the list):

  • Polyak-style Hessian-momentum:

$$M_k = (1-\theta_{k-1})M_{k-1} + \theta_{k-1}\, H(x^k; \xi^k)$$

  • Recursive (SPIDER-type) Hessian-momentum:

$$M_k = (1-\theta_{k-1})M_{k-1} + H(x^k; \xi^k) - (1-\theta_{k-1})\, H(x^{k-1}; \xi^k)$$
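Here is a minimal NumPy sketch of the two estimator updates; `H_new` stands for the stochastic Hessian $H(x^k;\xi^k)$ and `H_old_same_sample` for $H(x^{k-1};\xi^k)$ evaluated on the same sample, and dense matrices are assumed purely for illustration.

```python
import numpy as np

def polyak_hessian_momentum(M_prev, H_new, theta):
    """Polyak-style scheme: exponential moving average of stochastic Hessians."""
    return (1 - theta) * M_prev + theta * H_new

def recursive_hessian_momentum(M_prev, H_new, H_old_same_sample, theta):
    """SPIDER-type recursive scheme: H_new and H_old_same_sample must be computed
    on the same minibatch xi^k, evaluated at x^k and x^{k-1} respectively."""
    return (1 - theta) * M_prev + H_new - (1 - theta) * H_old_same_sample
```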

A cubic-regularized subproblem is solved with the gradient and Hessian-momentum estimators, yielding improved convergence rates:

| Scheme | Iteration Complexity | Reference |
| --- | --- | --- |
| SCRN + Polyak HCM | $\widetilde O(\max\{\epsilon_g^{-7/4},\,\epsilon_H^{-7}\})$ | (Yang et al., 17 Jul 2025) |
| SCRN + Recursive HCM | $\widetilde O(\max\{\epsilon_g^{-5/3},\,\epsilon_H^{-5}\})$ | (Yang et al., 17 Jul 2025) |
| SCN + HCM (mini-batch = 1) | Global second-order stationary point, $T^{-3/7}$ gradient decay | (Chayti et al., 25 Oct 2024) |

Empirical results demonstrate that HCM-based SCRN achieves iteration efficiency similar to full-batch CRN at substantially lower per-iteration cost, outperforming state-of-the-art first-order methods in nonconvex regimes (Yang et al., 17 Jul 2025, Chayti et al., 25 Oct 2024).

4. HCM in Markov Chain Monte Carlo and Sampling Methods

HCM offers an algorithmic bridge between first-order MCMC methods (e.g., MALA, vanilla HMC) and full manifold-adaptive samplers (e.g., Riemannian Manifold HMC). In Langevin-type algorithms (HMALA), local quadratic Taylor expansion of the log-posterior leads to a Gaussian proposal:

$$q(x' \mid x) = \mathcal{N}(x' ; x + m, S)$$

$$m = \left(e^{\frac{1}{2} H\delta} - I\right) H^{-1} v,\qquad S = \left(e^{H\delta} - I\right)H^{-1},\qquad v = \nabla L(x)$$

A Metropolis–Hastings correction restores exactness (House, 2015). This produces higher effective sample sizes and rapid mixing, especially in multimodal or high-curvature regimes.
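A minimal sketch of such a proposal is given below, assuming access to gradient and Hessian callables for the log-posterior and using dense `scipy.linalg.expm` for the matrix exponentials (for large $d$ one would switch to the Krylov techniques noted in Section 7); the symmetrization line is a numerical-safety assumption, not part of the cited derivation.

```python
import numpy as np
from scipy.linalg import expm, solve

def hmala_proposal(x, grad_log_post, hess_log_post, delta, rng):
    """Draw x' ~ N(x + m, S) with m = (e^{H*delta/2} - I) H^{-1} v and
    S = (e^{H*delta} - I) H^{-1}, where v and H are the gradient and
    Hessian of the log-posterior at x."""
    d = x.shape[0]
    v = grad_log_post(x)
    H = hess_log_post(x)
    I = np.eye(d)
    m = solve(H, (expm(0.5 * delta * H) - I) @ v)   # H^{-1} commutes with e^{cH}
    S = solve(H, expm(delta * H) - I)
    S = 0.5 * (S + S.T)                             # enforce symmetry numerically
    return rng.multivariate_normal(x + m, S)

# Usage: x_prop = hmala_proposal(x, grad_fn, hess_fn, delta=0.1,
#                                rng=np.random.default_rng(0)),
# followed by the Metropolis-Hastings accept/reject step (not shown).
```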

Hessian-corrected HMC (H-HMC, HCM-HMC) variants (House, 2017, Karimi et al., 2023, Jin et al., 2019) replace the standard mass matrix with an inverse Hessian, either "locally" (updated per trajectory) or "nonlocally" (fixed at a MAP point or from an ensemble L-BFGS estimate). This curvature-informed kinetic energy accelerates exploration of narrow directions, improves acceptance rates (up to 0.92 versus 0.86 in vanilla HMC), and drastically reduces autocorrelation in posterior samples, while avoiding the computational expense and complexity of RMHMC.

| Method | Hessian Usage | Cost | Advantages | Reference |
| --- | --- | --- | --- | --- |
| HMALA | Local Hessian, proposal | $O(d^3)$ | Ambitious moves, geometric ergodicity | (House, 2015) |
| HCM-HMC (House) | Hessian at trajectory ends | $2 \times O(d^2)$ | Preconditioned momentum, fast mixing | (House, 2017) |
| HCM-HMC (Karimi) | Local or MAP Hessian, kinetic | $O(d^3)$ setup | Robustness, efficient in high dimensions | (Karimi et al., 2023) |
| Ensemble QN-HMC | L-BFGS, ensemble Hessian | $O(LV)$ | Data-driven, scalable, ensemble exchange | (Jin et al., 2019) |
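To make the "nonlocal" variant concrete, here is a minimal sketch of a single HMC transition whose mass matrix `M` is a fixed positive-definite curvature matrix (e.g., the Hessian of the negative log-posterior at a MAP point, so position updates are scaled by its inverse); the step size, leapfrog length, and the assumption that `M` is already positive definite are illustrative choices, not prescriptions from the cited papers.

```python
import numpy as np

def hmc_step(x, neg_logpost, grad_neg_logpost, M, step=0.05, n_steps=20, rng=None):
    """One HMC transition with kinetic energy 0.5 * p^T M^{-1} p."""
    rng = rng or np.random.default_rng()
    L_chol = np.linalg.cholesky(M)
    M_inv = np.linalg.inv(M)
    p0 = L_chol @ rng.standard_normal(x.shape)   # momentum draw p ~ N(0, M)
    x_new, p = x.copy(), p0.copy()

    # Leapfrog integration of the Hamiltonian dynamics with mass matrix M.
    g = grad_neg_logpost(x_new)
    p = p - 0.5 * step * g
    for _ in range(n_steps):
        x_new = x_new + step * (M_inv @ p)
        g = grad_neg_logpost(x_new)
        p = p - step * g
    p = p + 0.5 * step * g                       # undo the extra half kick

    # Metropolis-Hastings acceptance on the joint (position, momentum) energy.
    h_old = neg_logpost(x) + 0.5 * p0 @ M_inv @ p0
    h_new = neg_logpost(x_new) + 0.5 * p @ M_inv @ p
    return x_new if np.log(rng.uniform()) < h_old - h_new else x
```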

5. Mechanisms for Variance Reduction and Stability

HCM mechanisms provide both bias correction and variance reduction in stochastic settings. In distributed or compressed settings (EF21-HM), the update

$$v_i^{t+1} = (1-\eta_t)\left(v_i^t + \nabla^2 f_i(x^{t+1}; \xi_i)(x^{t+1} - x^t)\right) + \eta_t \nabla f_i(x^{t+1}; \xi_i)$$

matches the lower bound $O(1/T^{1/3})$ for $L$-smooth nonconvex optimization, even under communication constraints (Sadiev et al., 18 Nov 2025). In SGD-based methods (SGDHess), the correction yields $O(\epsilon^{-3})$ sample complexity to $\epsilon$-stationarity without large batches (Tran et al., 2021).
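For intuition, here is a simplified single-worker sketch combining the estimator update above with a compressed-difference message; the top-$k$ compressor, the mirrored server-state variable `g_i`, and the parameter names are illustrative assumptions rather than the exact EF21-HM protocol of the cited work.

```python
import numpy as np

def topk(v, k):
    """Keep the k largest-magnitude entries (a common contractive compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def worker_step(v_i, g_i, x_new, x_old, grad_i, hvp_i, eta, k):
    """Hessian-corrected momentum estimator plus a compressed-difference message."""
    # Local estimator update, matching the v_i^{t+1} formula above
    # (grad_i and hvp_i should use the same stochastic sample xi_i).
    v_i = (1 - eta) * (v_i + hvp_i(x_new, x_new - x_old)) + eta * grad_i(x_new)
    # Only the compressed residual is communicated; g_i mirrors the server's copy.
    msg = topk(v_i - g_i, k)
    g_i = g_i + msg
    return v_i, g_i, msg
```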

In the context of strongly quasiconvex functions, the continuous-time ODE

$$\ddot{x} + \alpha \dot{x} + \beta \nabla^2 h(x)\, \dot{x} + \nabla h(x) = 0$$

discretizes to an HCM scheme whose Hessian correction term, $-\theta(\nabla h(x_k) - \nabla h(x_{k-1}))$, suppresses oscillations and ensures linear convergence to the minimizer, outperforming classical Heavy Ball and Nesterov accelerations (Hadjisavvas et al., 18 Jun 2025).
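Below is a minimal sketch of one explicit discretization in this spirit, where the gradient difference plays the role of the Hessian-driven damping term; the step size `s`, momentum weight `gamma`, and correction weight `theta` are illustrative values, not the parameter schedules analyzed in the cited work.

```python
import numpy as np

def hessian_damped_heavy_ball(grad_h, x0, s=1e-2, gamma=0.9, theta=1e-2, iters=500):
    """Heavy-ball iteration with a gradient-difference (Hessian-damping) correction."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad_h(x_prev)
    x = x_prev - s * g_prev                      # plain gradient first step
    for _ in range(iters):
        g = grad_h(x)
        x_next = x + gamma * (x - x_prev) - s * g - theta * (g - g_prev)
        x_prev, g_prev, x = x, g, x_next
    return x
```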

6. Applications and Empirical Results

Empirical studies demonstrate HCM's effectiveness in several domains:

  • MCMC and Bayesian inference: In high-dimensional inverse problems (e.g., log-normal permeability with $d=936$), HCM-HMC achieves rapid decorrelation (autocorrelation drops to zero by lag $\approx 5$) and delivers accurate posterior intervals after far fewer samples than standard HMC (Karimi et al., 2023).
  • Nonconvex stochastic optimization: HCM-based SCRN and LMO methods deliver faster convergence in training deep architectures (MLPs, LSTMs), consistently achieving lower loss and gradient norm than first-order and classical momentum baselines (Khirirat et al., 15 Dec 2025).
  • Distributed and compressed SGD: HCM variants in error-feedback frameworks outperform alternative momentum approaches in both theory and practice (Sadiev et al., 18 Nov 2025).
  • Reinforcement learning: NPG-HM, which uses HCM for variance reduction in natural policy gradient estimates, attains the optimal $O(\epsilon^{-2})$ global last-iterate sample complexity and empirically surpasses MNPG, PPO, and other advanced baselines on MuJoCo continuous control tasks (Feng et al., 2 Jan 2024).

7. Computational Considerations and Trade-offs

Hessian-corrected momentum variants typically incur greater per-iteration costs due to Hessian or Hessian-vector product evaluations. For moderate problem sizes ($d \lesssim 50$), direct computation is practical; for larger $d$, efficient approaches include the following (a matrix-free Hessian-vector product sketch follows the list):

  • Krylov subspace methods for $\phi$-functions and Hessian-vector products (House, 2015)
  • Low-rank or diagonal approximations of the Hessian
  • Memory-efficient L-BFGS approximations in ensemble MCMC (Jin et al., 2019)
  • Batched or approximate Hessian operations in stochastic settings (Tran et al., 2021)
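As one illustration of the matrix-free option, a central finite difference of two gradient evaluations approximates a Hessian-vector product without ever forming the Hessian; reverse-over-forward automatic differentiation gives an exact alternative at similar cost. The tolerance `eps` and the toy quadratic below are illustrative choices.

```python
import numpy as np

def hvp_finite_difference(grad_fn, x, v, eps=1e-5):
    """Approximate Hess f(x) @ v via (grad f(x + eps*v) - grad f(x - eps*v)) / (2*eps)."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2.0 * eps)

# Quick check on f(x) = 0.5 * x^T A x, where the exact product is A @ v.
A = np.diag([1.0, 10.0])
x, v = np.ones(2), np.array([0.5, -0.2])
approx = hvp_finite_difference(lambda z: A @ z, x, v)   # close to [0.5, -2.0]
```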

Despite the increased cost, HCM often yields a net reduction in computational effort due to faster convergence and improved mixing, particularly in stiff, anisotropic, or multimodal regimes where first-order momentum methods stagnate (House, 2015, Karimi et al., 2023, Khirirat et al., 15 Dec 2025).


In summary, Hessian-Corrected Momentum provides a principled and broadly applicable second-order extension of momentum-based algorithms for optimization and sampling, combining bias correction, variance reduction, and local curvature adaptation. Its variants consistently improve theoretical guarantees and empirical efficiency across optimization, distributed learning, policy gradient methods, and MCMC, subject to the cost of accessing second-order information. The ongoing trend is toward scalable implementations leveraging Hessian-vector products and approximate curvature, making HCM a key technique in the toolset for large-scale machine learning and Bayesian computation.
