
Stochastic Gradient Hamiltonian Monte Carlo

Updated 29 November 2025
  • SGHMC is a scalable Bayesian inference algorithm that leverages stochastic gradients, momentum dynamics, and noise injection for efficient sampling.
  • It utilizes the underdamped Langevin diffusion framework to achieve rapid exploration and robust convergence in high-dimensional, nonconvex problems.
  • Practical enhancements such as variance reduction, parallel tempering, and low-precision implementations improve performance in deep learning and distributed inference.

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) is a scalable Markov Chain Monte Carlo (MCMC) algorithm designed for efficient approximate Bayesian inference in high-dimensional and large-data regimes. SGHMC generalizes Hamiltonian Monte Carlo (HMC) by replacing exact gradients with stochastic estimates, typically via minibatching, and augments these with carefully calibrated momentum, friction, and noise-injection mechanisms. This approach preserves rapid, non-random-walk exploration, while maintaining tractable per-iteration complexity, making SGHMC central to Bayesian deep learning and modern stochastic optimization.

1. Mathematical Foundations and Algorithm

SGHMC is built from the continuous stochastic dynamics of underdamped Langevin diffusion. For a target distribution $\pi(\theta) \propto \exp(-U(\theta))$, where $U(\theta)$ encodes the negative log-posterior, SGHMC introduces an auxiliary momentum $p$ and samples from the extended distribution
$$\pi(\theta, p) \propto \exp\bigl(-U(\theta) - K(p)\bigr), \qquad K(p) = \tfrac{1}{2} p^\top M^{-1} p,$$
with $M$ the mass matrix.

The continuous dynamics are
$$d\theta_t = M^{-1} p_t \, dt, \qquad dp_t = -\nabla U(\theta_t) \, dt - \gamma p_t \, dt + \sqrt{2\gamma} \, dW_t,$$
where $\gamma$ is the friction coefficient and $W_t$ is a standard Brownian motion.

In practical large-scale applications, the full gradient $\nabla U(\theta)$ is replaced by an unbiased stochastic estimate $\nabla \tilde U(\theta)$, typically computed on a minibatch. The discretized SGHMC updates (symplectic Euler–Maruyama scheme) are
$$\begin{aligned} p_{k+1} &= (1 - \epsilon \gamma)\, p_k - \epsilon \nabla \tilde U(\theta_k) + \sqrt{2\gamma \epsilon}\, \xi_k, \qquad \xi_k \sim \mathcal{N}(0, I), \\ \theta_{k+1} &= \theta_k + \epsilon M^{-1} p_{k+1}, \end{aligned}$$
where $\epsilon$ is the step size.
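To make the discretized update concrete, here is a minimal NumPy sketch of the iteration above with the mass matrix fixed to $M = I$; the function name `grad_U_minibatch`, the default hyperparameters, and the toy usage are illustrative assumptions rather than a reference implementation from the cited papers.

```python
import numpy as np

def sghmc(grad_U_minibatch, theta0, n_iters=10_000, eps=1e-3, gamma=0.1, rng=None):
    """Minimal SGHMC sampler with identity mass matrix M = I (illustrative sketch).

    grad_U_minibatch(theta) should return an unbiased estimate of grad U(theta),
    e.g. computed on a random minibatch of the data.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float).copy()
    p = np.zeros_like(theta)                      # auxiliary momentum
    samples = []
    for _ in range(n_iters):
        xi = rng.standard_normal(theta.shape)     # injected Gaussian noise
        # Momentum update: friction, stochastic gradient, and noise injection.
        p = (1.0 - eps * gamma) * p \
            - eps * grad_U_minibatch(theta) \
            + np.sqrt(2.0 * gamma * eps) * xi
        # Position update with M = I.
        theta = theta + eps * p
        samples.append(theta.copy())
    return np.array(samples)
```

For instance, with `grad_U_minibatch = lambda th: th + 0.1 * np.random.randn(*th.shape)` (a noisy gradient of a standard Gaussian potential), the chain samples approximately from $\mathcal{N}(0, I)$, up to the residual bias discussed in the next paragraph.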

A friction term with a matching diffusion term corrects the violation of detailed balance introduced by noisy gradients (Chen et al., 2014). If the injected noise does not account for the gradient noise, bias is introduced; ideally the friction matrix $C$ is set to match the minibatch gradient-noise covariance.

2. Theoretical Guarantees and Convergence Rates

Recent non-asymptotic analyses provide rates for SGHMC under nonconvex, non-log-concave, and even discontinuous-gradient settings. The SGHMC Euler–Maruyama discretization reads
$$\begin{aligned} V_{n+1}^\eta &= V_n^\eta - \eta \bigl[\gamma V_n^\eta + H(\theta_n^\eta, X_{n+1})\bigr] + \sqrt{2\gamma\eta}\, \xi_{n+1}, \\ \theta_{n+1}^\eta &= \theta_n^\eta + \eta V_n^\eta, \end{aligned}$$
with step size $\eta$ and unbiased stochastic gradient $H$.

The Wasserstein-2 distance between the law of the iterates and the invariant law $\pi_\beta$ decays as
$$W_2\bigl(\mathrm{Law}(\theta_n^\eta, V_n^\eta), \pi_\beta\bigr) \leq C_1 \eta^{1/2} + C_2 \eta^{1/4} + C_3 e^{-C_4 \eta n}$$
for explicit constants $C_i$ (Akyildiz et al., 2020; Liang et al., 25 Sep 2024). This bound, uniform in $n$, shows that the error from discretization and gradient noise can be controlled to arbitrary precision by taking $\eta$ small enough.

SGHMC thus admits explicit excess-risk bounds for nonconvex problems:
$$\mathbb{E}\bigl[U(\theta_n^\eta)\bigr] - U_* \leq \bar C_1 \eta^{1/2} + \bar C_2 \eta^{1/4} + \bar C_3 e^{-\bar C_4 \eta n} + \text{statistical term},$$
with the statistical term scaling as $O((d/\beta) \log(\cdots))$ (Akyildiz et al., 2020; Liang et al., 25 Sep 2024).

Momentum-based acceleration in SGHMC leads to a spectral-gap improvement: the underdamped (momentum) dynamics mix at a rate $\mu_*$ that can be on the order of $\sqrt{\lambda_*}$, i.e., faster than the overdamped Langevin dynamics (SGLD) with rate $\lambda_*$ (Gao et al., 2018). This yields an $\epsilon$-dependence of $O(\epsilon^{-2})$ versus $O(\epsilon^{-4})$ for the number of gradient evaluations in nonconvex settings.

3. Algorithmic Extensions and Variants

High-order Integrators

SGHMC can be integrated using symmetric-splitting (second-order) integrators, such as ABOBA/BAOAB, giving improved local truncation error $O(\epsilon^3)$ and global bias $O(\epsilon^2)$ (Chen et al., 2016):
$$\begin{aligned} \theta &\leftarrow \theta + \tfrac{\epsilon}{2} p, \\ p &\leftarrow e^{-\gamma\epsilon/2} p, \\ p &\leftarrow p - \epsilon \nabla \tilde U(\theta) + \sqrt{2\gamma\epsilon}\,\xi, \\ p &\leftarrow e^{-\gamma\epsilon/2} p, \\ \theta &\leftarrow \theta + \tfrac{\epsilon}{2} p. \end{aligned}$$
Second-order integrators achieve mean-squared error bounds of $O(L^{-4/5})$ in $L$ iterations, faster than the $O(L^{-2/3})$ rate of the Euler method (Chen et al., 2016; Li et al., 2018).
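As a sketch, one such symmetric-splitting step can be written as follows (assuming $M = I$; the function name and argument layout are illustrative, not the integrator interface of the cited works):

```python
import numpy as np

def splitting_step(theta, p, grad_U_minibatch, eps, gamma, rng):
    """One symmetric-splitting SGHMC step: half drift, damping, kick, damping, half drift."""
    theta = theta + 0.5 * eps * p                     # half position update (M = I)
    p = np.exp(-0.5 * gamma * eps) * p                # half friction (damping)
    xi = rng.standard_normal(theta.shape)
    p = p - eps * grad_U_minibatch(theta) + np.sqrt(2.0 * gamma * eps) * xi  # gradient/noise kick
    p = np.exp(-0.5 * gamma * eps) * p                # half friction (damping)
    theta = theta + 0.5 * eps * p                     # half position update
    return theta, p
```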

Variance Reduction

Variance-reduced SGHMC (incorporating SVRG or SAGA frameworks) reduces estimator variance in stochastic gradients, accelerating convergence and improving finite-sample accuracy (Li et al., 2018).
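Below is a hedged sketch of an SVRG-style control-variate gradient estimator that could be plugged into the SGHMC momentum update; the names `grad_term`, `theta_snap`, and `full_grad_snap` are illustrative, and the snapshot gradient is assumed to be recomputed every few epochs as in standard SVRG.

```python
def svrg_gradient(theta, theta_snap, full_grad_snap, grad_term, batch_idx, n_data):
    """Variance-reduced estimate of grad U(theta), where U(theta) = sum_i U_i(theta).

    grad_term(theta, i) returns grad U_i(theta); full_grad_snap is the exact
    gradient at the snapshot point theta_snap.
    """
    scale = n_data / len(batch_idx)
    # Minibatch difference of gradients at the current point and the snapshot,
    # corrected by the full gradient at the snapshot (unbiased for grad U(theta)).
    correction = scale * sum(grad_term(theta, i) - grad_term(theta_snap, i)
                             for i in batch_idx)
    return full_grad_snap + correction
```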

Parallel and Distributed SGHMC

Parallel-Tempered SGHMC (PT-SGHMC) uses $R$ temperature-stratified replicas, each with an adaptive Nosé–Hoover thermostat, and occasional replica-exchange steps to enable exploration of multimodal posteriors. This enhances mixing beyond what is possible with any single SGHMC chain, especially under mini-batch noise (Luo et al., 2018).
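A minimal sketch of the replica-exchange (swap) step between two chains at inverse temperatures $\beta_i$ and $\beta_j$ is given below; the per-replica SGHMC updates and the Nosé–Hoover thermostat are omitted, and the swap uses the full-data potential $U$ for clarity, whereas the cited work must additionally account for minibatch noise in the exchange step.

```python
import numpy as np

def maybe_swap(theta_i, theta_j, beta_i, beta_j, U, rng):
    """Metropolis swap of replica states; returns the (possibly exchanged) pair."""
    log_accept = (beta_i - beta_j) * (U(theta_i) - U(theta_j))
    if np.log(rng.uniform()) < log_accept:
        return theta_j, theta_i   # swap accepted
    return theta_i, theta_j       # swap rejected
```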

Elastic-Coupled SGHMC runs multiple asynchronously coupled SGHMC chains tied to a dynamically evolving "center" variable. This enables efficient distributed or parallel MCMC with improved mixing and robustness to communication delays or stale gradients (Springenberg et al., 2016).

Decentralized SGHMC (DE-SGHMC) extends the algorithm to distributed inference across multiple agents, each maintaining local variables and consensus updates, with provable linear convergence in Wasserstein-2 distance under strongly convex objectives (Gürbüzbalaban et al., 2020).

4. Robustness, Practical Implementations, and Diagnostics

Hyperparameter Selection

Key parameters include the following (an illustrative configuration is sketched after the list):

  • Step size ($\epsilon$ or $\eta$): Small enough to ensure numerical stability and control discretization error; typically $\epsilon \sim 10^{-4}$–$10^{-2}$.
  • Friction coefficient ($\gamma$): Moderate values (e.g., $0.05$–$1$) balance exploration and noise damping.
  • Mass matrix ($M$): Often diagonal, sometimes tuned to parameter variance.
  • Minibatch size: Must be large enough that the CLT approximation for gradient noise holds; commonly $B \geq 100$.
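The configuration below is consistent with the ranges above; the specific values are assumptions for demonstration, not recommended defaults from the cited papers.

```python
# Illustrative SGHMC configuration (values are placeholders; tune per problem).
sghmc_config = {
    "step_size": 1e-3,          # epsilon: within the typical 1e-4 to 1e-2 range
    "friction": 0.1,            # gamma: moderate damping
    "mass_matrix": "identity",  # M: diagonal or identity unless variances are tuned
    "minibatch_size": 256,      # large enough for the CLT approximation of gradient noise
    "num_burn_in": 1_000,
    "num_samples": 10_000,
}
```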

Adaptive friction (e.g., via the Nosé–Hoover mechanism) and dynamically tuned temperature ladders (in parallel-tempered variants) further improve exploration and bias correction (Luo et al., 2018).

Low-Precision Computation

SGHMC is robust to low-precision quantization: momentum-based updates low-pass filter gradient and quantization noise, achieving faster convergence and greater error tolerance than SGLD in reduced-precision regimes. Empirically, full-precision and low-precision variants of SGHMC match or outperform SGLD under aggressive quantization, even on large-scale deep learning tasks (Wang et al., 2023).
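The following is a hedged sketch of a low-precision SGHMC step in which parameters are stored on a fixed-point grid via unbiased stochastic rounding while the momentum stays in full precision; the quantizer, grid spacing `delta`, and function names are illustrative assumptions and do not reproduce the specific schemes of Wang et al. (2023).

```python
import numpy as np

def stochastic_round(x, delta, rng):
    """Round x to a multiple of delta, unbiasedly (E[round(x)] = x)."""
    scaled = x / delta
    lower = np.floor(scaled)
    return delta * (lower + (rng.uniform(size=x.shape) < (scaled - lower)))

def low_precision_sghmc_step(theta_q, p, grad_U_minibatch, eps, gamma, delta, rng):
    """One SGHMC step with low-precision parameter storage (illustrative sketch)."""
    xi = rng.standard_normal(theta_q.shape)
    # Full-precision momentum acts as a low-pass filter on gradient and quantization noise.
    p = (1.0 - eps * gamma) * p \
        - eps * grad_U_minibatch(theta_q) \
        + np.sqrt(2.0 * gamma * eps) * xi
    # Store the updated parameters back on the low-precision grid.
    theta_q = stochastic_round(theta_q + eps * p, delta, rng)
    return theta_q, p
```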

Bias and Correctness

SGHMC is not generally correctable by Metropolis–Hastings (MH) accept/reject steps due to the lack of microscopic reversibility in the common Euler–Maruyama integrators, meaning its asymptotic bias cannot be eliminated by standard MH corrections (Garriga-Alonso et al., 2021). Alternatives such as the reversible BAOAB/OBABO integrators or occasionally applied MH corrections (e.g., AMAGOLD) can restore exactness at increased computational cost (Garriga-Alonso et al., 2021, Zhang et al., 2020).

Nonsmooth and Discontinuous Gradients

SGHMC convergence extends to settings with discontinuous stochastic gradients, as arise in ReLU neural network training or quantile estimation. Under average Lipschitz continuity (in expectation), SGHMC retains its non-asymptotic convergence rates in Wasserstein distance and excess risk, enabling theoretical control for nonsmooth deep learning tasks (Liang et al., 25 Sep 2024).

5. Applications and Empirical Results

SGHMC is widely deployed in Bayesian neural networks, variational deep Gaussian processes, stochastic nonconvex learning, and distributed inference, with the cited empirical studies consistently reporting gains from momentum, tempering, and variance reduction over their overdamped or single-chain counterparts.

Advanced extensions such as meta-learned SGHMC ("NNSGHMC") further improve adaptation and generalization by learning state-dependent drifts and diffusions (Gong et al., 2018).

6. Limitations and Prospects

  • Bias Control: Without properly tuned noise-injection or corrective schemes (e.g., BAOAB with MH), bias in SGHMC cannot be entirely eliminated at finite step size. Asymptotic correctness requires either vanishing step sizes or exact reversibility, which may impact speed (Garriga-Alonso et al., 2021, Zhang et al., 2020).
  • Hyperparameter Sensitivity: Choice of $\epsilon$, $\gamma$, and the mass matrix $M$ is critical and requires careful calibration.
  • Dimension Dependence: Theoretical constants can scale exponentially with problem dimension and inverse temperature in worst-case settings (Akyildiz et al., 2020, Gao et al., 2018).
  • Model-Specific Adaptation: Generic SGHMC may be suboptimal for specific models; meta-learned dynamics can mitigate this (Gong et al., 2018).
  • Mixing Efficiency: For highly multimodal, isolated energy basins, parallel tempering and adaptive thermostats are effective, but at increased computational or communication cost (Luo et al., 2018).

Ongoing developments include high-order and adaptive integrators, variance-reduction techniques, distributed and low-precision algorithms, and meta-learned or geometry-aware stochastic dynamics.


Table: SGHMC Variant Summary

| Variant | Key Feature | Reference |
|---|---|---|
| Standard SGHMC | Euler–Maruyama, fixed friction | (Chen et al., 2014) |
| Symmetric splitting | Higher-order ABOBA/BAOAB integrator | (Chen et al., 2016) |
| Variance-reduced SGHMC | SVRG/SAGA control variates | (Li et al., 2018) |
| Parallel-tempered | Nosé–Hoover + replica exchange | (Luo et al., 2018) |
| Elastic-coupled | Asynchronous, distributed coupling | (Springenberg et al., 2016) |
| Low-precision | Quantized weights/gradients | (Wang et al., 2023) |
| Meta-learned (NNSGHMC) | Neural, state-dependent dynamics | (Gong et al., 2018) |
| Decentralized | Network-averaged MCMC | (Gürbüzbalaban et al., 2020) |
| Amortized Metropolis | Deferred MH correction | (Zhang et al., 2020) |
| Nonsmooth gradients | Discontinuity, ReLU networks | (Liang et al., 25 Sep 2024) |

Stochastic Gradient Hamiltonian Monte Carlo remains a cornerstone for scalable Bayesian computation, with rigorous non-asymptotic theory, algorithmic flexibility, and broad applicability from deep learning to nonconvex optimization and distributed inference.
