
Stochastic Gradient Langevin Dynamics (SGLD)

Updated 22 April 2026
  • SGLD is a scalable MCMC algorithm combining stochastic optimization and Langevin diffusion to perform Bayesian inference on large models.
  • It replaces full-data gradients with mini-batch approximations, introducing trade-offs between discretization bias, gradient noise, and iteration variance.
  • Numerous variants, including control variates and Laplacian smoothing, have been developed to improve its convergence and mitigate its computational limitations.

Stochastic Gradient Langevin Dynamics (MCMC)

Stochastic Gradient Langevin Dynamics (SGLD) is a Markov chain Monte Carlo (MCMC) algorithm designed for scalable Bayesian inference in large-scale models. SGLD combines concepts from stochastic optimization with continuous-time Langevin diffusions, using data subsampling to approximate gradients, enabling efficient sampling from posterior distributions that are computationally prohibitive for traditional MCMC. SGLD is the archetype of the broader stochastic gradient MCMC (SGMCMC) family, with numerous theoretical developments, algorithmic variants, and applied extensions. Despite its efficiency, SGLD introduces subtle statistical and computational trade-offs, particularly with regard to stationary bias, gradient noise, and scalability.

1. Algorithmic Foundations of SGLD

SGLD approximates the posterior distribution by discretizing the overdamped Langevin diffusion,

$$d\theta_t = \frac{1}{2} \nabla_\theta \log p(\theta|x) \, dt + dW_t,$$

where $\theta$ denotes the model parameters and $W_t$ is standard Brownian motion. The Euler–Maruyama discretization yields the unadjusted Langevin algorithm (ULA) for sampling:

$$\theta_{k+1} = \theta_k + \frac{\epsilon}{2} \nabla_\theta \log p(\theta_k|x) + \sqrt{\epsilon}\,\xi_k,$$

with $\xi_k \sim \mathcal{N}(0, I)$ and $\epsilon$ the step size. Traditional ULA necessitates a full-data gradient evaluation per iteration.

SGLD replaces the full gradient with a mini-batch‐based unbiased stochastic gradient:

$$\hat{\nabla}_\theta \log p(\theta_k|x) = \nabla_\theta \log p(\theta_k) + \frac{N}{n} \sum_{i\in\mathcal{B}_k} \nabla_\theta \log p(x_i|\theta_k),$$

where $N$ is the data size, $n$ the mini-batch size, and $\mathcal{B}_k$ a randomly sampled mini-batch at iteration $k$ (Nemeth et al., 2019). The SGLD step is:

$$\theta_{k+1} = \theta_k + \frac{\epsilon}{2} \hat{\nabla}_\theta \log p(\theta_k|x) + \sqrt{\epsilon}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I).$$

The omission of a Metropolis–Hastings accept/reject correction is justified asymptotically, provided the step size is annealed to zero and the stochastic gradients remain unbiased (Teh et al., 2014, Vollmer et al., 2015).
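
As a concrete illustration, the following is a minimal sketch of the SGLD update in Python/NumPy. The gradient callbacks `grad_log_prior` and `grad_log_lik` and the toy Gaussian example are illustrative placeholders under the assumptions above, not taken from any of the cited implementations.

```python
import numpy as np

def sgld_step(theta, X, grad_log_prior, grad_log_lik, eps, batch_size, rng):
    """One SGLD update: mini-batch gradient estimate plus injected Gaussian noise."""
    N = X.shape[0]
    idx = rng.choice(N, size=batch_size, replace=False)        # random mini-batch B_k
    # Unbiased mini-batch estimate of grad log p(theta | x)
    grad_hat = grad_log_prior(theta) + (N / batch_size) * sum(
        grad_log_lik(x_i, theta) for x_i in X[idx]
    )
    noise = rng.standard_normal(theta.shape)                    # xi_k ~ N(0, I)
    return theta + 0.5 * eps * grad_hat + np.sqrt(eps) * noise

# Toy example: unknown mean of a unit-variance Gaussian, N(0, 10) prior on the mean.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=(1000, 1))
grad_log_prior = lambda th: -th / 10.0           # d/d(theta) log N(theta; 0, 10)
grad_log_lik = lambda x, th: x - th              # d/d(theta) log N(x; theta, 1)

theta, samples = np.zeros(1), []
for k in range(5000):
    theta = sgld_step(theta, X, grad_log_prior, grad_log_lik,
                      eps=1e-4, batch_size=32, rng=rng)
    samples.append(theta.copy())
```

Replacing the fixed `eps` with a decreasing schedule in the same loop yields the annealed variant discussed above.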

2. Statistical Efficiency and Bias-Variance Trade-Offs

SGLD introduces an additional stochastic noise term due to mini-batch subsampling. The error in the resulting Markov chain averages arises from three main sources:

  • Discretization bias: of order $O(\epsilon)$, from Euler–Maruyama integration absent a Metropolis correction.
  • Gradient noise bias: an additional bias induced by the mini-batch gradient estimate, unique to stochastic gradients, whose constant grows with the gradient-noise variance.
  • Variance: of order $O(1/(K\epsilon))$ for $K$ steps with constant step size $\epsilon$ (Vollmer et al., 2015, Nemeth et al., 2019).

If the step size $\epsilon_k$ is decreased appropriately, such that $\sum_k \epsilon_k = \infty$ and $\sum_k \epsilon_k^2 < \infty$, SGLD averages are strongly consistent and satisfy a central limit theorem, with optimally tuned error decay of $O(K^{-1/3})$ under power-law scheduling ($\epsilon_k \propto k^{-1/3}$) (Teh et al., 2014, Nemeth et al., 2019).

However, in the fixed step size regime typical in practical machine learning,

$$\mathrm{MSE} = O(\epsilon^2) + O\!\left(\frac{1}{K\epsilon}\right),$$

and balancing at $\epsilon \propto K^{-1/3}$ yields optimal $O(K^{-2/3})$ MSE scaling (Vollmer et al., 2015, Chen et al., 2016). The constant factor for the bias term is dictated by the variance of the stochastic gradient and increases when the mini-batch size is reduced. This MSE balancing is nontrivial in large-scale regimes and is sensitive to both discretization and gradient noise.
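
The balance above can be checked numerically. The snippet below is purely illustrative: it evaluates only the leading-order terms with all unknown constants set to one, and confirms that the $\epsilon \propto K^{-1/3}$ choice makes the squared bias and the variance decay together at the $O(K^{-2/3})$ rate.

```python
def leading_order_mse(K, eps):
    """Leading-order MSE model: squared bias O(eps^2) plus variance O(1/(K*eps))."""
    return eps**2 + 1.0 / (K * eps)

for K in [10**3, 10**4, 10**5, 10**6]:
    eps = K ** (-1.0 / 3.0)                      # balancing choice: eps ∝ K^{-1/3}
    print(f"K={K:>8}  mse≈{leading_order_mse(K, eps):.2e}  "
          f"K^(-2/3)={K ** (-2.0 / 3.0):.2e}")
```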

3. Scalability and the “No Free Lunch” Phenomenon

Despite its lower per-iteration cost, SGLD cannot fundamentally evade the cost-to-accuracy barrier inherent in MCMC for large data. Theoretical lower bounds (Pillai et al., 2024, Nagapetyan et al., 2017) demonstrate that unless the mini-batch size scales with the dataset size, speed-ups in per-iteration cost from reducing the mini-batch size $n$ are offset by a corresponding increase in the number of iterations required to reach a fixed total variation distance from the posterior. Formally, for strongly log-concave posteriors and smooth likelihoods:

  • To achieve a given total variation distance from the posterior, the total number of individual data-point gradient evaluations matches the computational complexity of full-gradient MCMC.
  • Any “speed-up” from using mini-batches of size $n \ll N$ cannot beat this lower bound; reducing $n$ simply increases the mixing time required to reach the same statistical accuracy (Pillai et al., 2024).

Thus, SGLD offers practical wall-clock advantages only in regimes where full passes over the data are unaffordable, or when per-iteration wall-time is the principal bottleneck.

4. Algorithmic Variants and Enhancements

Many variants have been developed to reduce the variance and bias of SGLD and to improve its mixing:

  • Control variates: SGLD-CV and SGLD Fixed Point (SGLDFP) employ precomputed per-datum gradients at an approximate posterior mode. Centering the mini-batch estimate at the mode sharply reduces the leading-order variance of the stochastic gradients, enabling per-iteration costs independent of $N$ while retaining accuracy comparable to full-gradient methods under strong convexity assumptions (Baker et al., 2017, Brosse et al., 2018); a sketch of this estimator appears after this list.
  • Preferential (importance) subsampling: Adaptive non-uniform minibatch selection, sometimes coupled with adaptive batch-size scheduling, targets high-variance data points and further reduces variance (Putcha et al., 2022).
  • Laplacian smoothing: LS-SGLD applies a circulant (FFT-efficient) Laplacian matrix to the minibatch gradients, reducing noise in the update and yielding smaller discretization errors in Wasserstein-2 distance with only modest impact on mixing (Wang et al., 2019).
  • High-order integrators: Symmetric splitting integrators (second order) for variants of SGLD/SGHMC can achieve improved convergence rates, with MSE of order $O(K^{-4/5})$ for $K$ steps versus $O(K^{-2/3})$ for first-order Euler integrators (Chen et al., 2016, Matthews et al., 2018, Garriga-Alonso et al., 2021).
  • Preconditioning and geometric approaches: Quasi-Newton preconditioners (HAMCMC), Riemannian metrics, and anisotropic step-size matrices (e.g., STANLEY) address mis-specified geometry, leading to better scaling for highly-correlated targets (Şimşekli et al., 2016, Karimi et al., 2023).
  • Distributed and federated extensions: DE-SGLD and FSGLD adapt SGLD for decentralized or non-i.i.d. data and correct for local gradient heterogeneity via control variates or consensus schemes (Gürbüzbalaban et al., 2020, Mekkaoui et al., 2020, Chen et al., 2016).
  • Structured dependency-breaking: Self-averaged energy functions and blocking/dropout strategies reduce mixing times in high-dimensional neural posteriors (Alexos et al., 2021).
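
As referenced in the control-variates item above, here is a sketch of an SGLD-CV style gradient estimator, assuming the per-datum gradients at an approximate posterior mode `theta_hat` have been precomputed; all names are illustrative rather than drawn from the cited implementations.

```python
def sgld_cv_gradient(theta, X, idx, grad_log_prior, grad_log_lik,
                     grads_at_mode, grads_at_mode_sum):
    """Control-variate estimate of grad log p(theta | x) (SGLD-CV / SGLDFP style).

    idx               : indices of the current mini-batch B_k.
    grads_at_mode[i]  : grad_theta log p(x_i | theta_hat), precomputed once at the mode.
    grads_at_mode_sum : sum of grads_at_mode over the full data set.
    """
    N, n = X.shape[0], len(idx)
    # Mini-batch estimate of the *difference* from the gradients at the mode;
    # its variance shrinks as theta concentrates near theta_hat.
    diff = sum(grad_log_lik(X[i], theta) - grads_at_mode[i] for i in idx)
    return grad_log_prior(theta) + grads_at_mode_sum + (N / n) * diff
```

The estimator remains unbiased for the full-data gradient because the precomputed mode terms cancel in expectation, while each summand is small whenever $\theta$ is close to the mode, which is where the posterior concentrates for large $N$.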

5. Theoretical Guarantees and Limitations

Consistency and Central Limit Theorem

Under Lyapunov and drift conditions, SGLD ergodically converges (in the law of empirical averages) to the true posterior when run with decreasing step sizes (Teh et al., 2014, Vollmer et al., 2015). For constant step size:

  • The discretized SGLD chain has a stationary distribution, but this distribution can be significantly biased with respect to the posterior unless the step size and batch size are tuned to drive both discretization and stochastic-gradient errors to zero (Brosse et al., 2018).
  • The leading-order bias for functional averages is of order $O(\epsilon)$ (Vollmer et al., 2015, Nagapetyan et al., 2017), while the variance falls as $O(1/(K\epsilon))$.

Scaling Laws

  • Error bounds in Wasserstein-2 and total variation distances propagate the impact of both Euler discretization and stochastic gradient noise (Nemeth et al., 2019).
  • The SGLD mean squared error matches the minimax rate only if the step size $\epsilon$ and iteration count $K$ are balanced so that both the bias and variance terms are of the same order, typically with $\epsilon \propto K^{-1/3}$ and the mini-batch size $n$ chosen so that stochastic-gradient noise does not dominate the discretization error.
  • Nonasymptotic error coupling arguments (Pillai et al., 2024, Jin et al., 2023) show that pure subsampling error in diffusion-based MCMC has an algebraic convergence rate in the waiting time between mini-batch switches or in the batch size. Discrete-time SGLD accumulates both discretization and mini-batch error.

Limitations

  • For constant step sizes and mini-batch sizes that do not grow with $N$, SGLD’s equilibrium distribution can be much more diffuse than the true posterior; its covariance does not contract at the posterior’s $O(1/N)$ rate, and its mean can be biased (Brosse et al., 2018, Nagapetyan et al., 2017).
  • High-dimension, strongly coupled, or multimodal targets may yield slow mixing due to geometry mismatch. This can be mitigated (but not eliminated) with geometric preconditioning (Şimşekli et al., 2016, Karimi et al., 2023).

6. Applications, Implementation, and Tuning Guidelines

SGLD and variants excel in contexts where per-iteration cost, memory scaling, or wall-clock time are principal constraints. Benchmarks confirm order-of-magnitude speedups versus full-data MCMC (e.g., Hamiltonian Monte Carlo) for large $N$ in Bayesian logistic regression, deep Bayesian neural networks (e.g., MNIST, CIFAR-10), and probabilistic matrix factorization, with negligible loss in predictive accuracy under proper tuning (Nemeth et al., 2019, Wang et al., 2019, Karimi et al., 2023, Alexos et al., 2021).

Tuning Recommendations:

  • Step size $\epsilon$: should be small relative to the smoothness and log-concavity constants of the target, and should be pilot-tuned using diagnostics such as the kernel Stein discrepancy (a minimal sketch follows this list), as effective sample size (ESS) and other standard mixing diagnostics can miss bias (Nemeth et al., 2019).
  • Mini-batch size $n$: to keep the stochastic-gradient noise subdominant to the injected Langevin noise, decrease $n$ only as the step size $\epsilon$ shrinks.
  • Control variate/reference point: Precompute a mode for SGLDFP/CV, and store differences for variance reduction (Baker et al., 2017, Brosse et al., 2018).
  • Diagnostic checks: Prefer kernel-based discrepancies to ESS for MCMC error assessment in SGLD (Nemeth et al., 2019).
  • Distributed/async settings: With bounded gradient staleness, bias and MSE remain controlled provided the step size is scaled down to compensate for the maximum staleness, yielding variance reduction that is linear in the number of workers (Chen et al., 2016).
  • Implementation: Exact pseudocode for LS-SGLD (with FFT-efficient Laplacian smoothing) or preference-sampling SGLD extensions is available and requires negligible memory/computational overhead beyond standard mini-batch processing (Wang et al., 2019, Putcha et al., 2022).
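
To complement the diagnostic recommendation above, the following is a minimal V-statistic estimator of the kernel Stein discrepancy with an inverse multiquadric kernel. The score function `score_fn` (the full-data gradient of the log-posterior, typically applied to a thinned set of samples) and the kernel parameters are user-supplied assumptions; this is a sketch, not the implementation used in the cited work.

```python
import numpy as np

def imq_ksd(samples, score_fn, c=1.0, beta=-0.5):
    """V-statistic kernel Stein discrepancy with an inverse multiquadric kernel.

    samples  : (M, d) array of (thinned) SGLD draws.
    score_fn : callable returning grad_theta log p(theta | x) at a point of shape (d,).
    """
    M, d = samples.shape
    scores = np.stack([score_fn(t) for t in samples])           # (M, d)
    total = 0.0
    for i in range(M):
        for j in range(M):
            r = samples[i] - samples[j]
            r2 = float(r @ r)
            base = c**2 + r2                                     # k(x, y) = base**beta
            grad_x_k = 2.0 * beta * base**(beta - 1) * r         # grad wrt first argument
            grad_y_k = -grad_x_k                                 # grad wrt second argument
            trace_term = (-2.0 * beta * d * base**(beta - 1)
                          - 4.0 * beta * (beta - 1) * base**(beta - 2) * r2)
            total += (float(scores[i] @ scores[j]) * base**beta
                      + float(scores[i] @ grad_y_k)
                      + float(scores[j] @ grad_x_k)
                      + trace_term)
    return np.sqrt(max(total, 0.0) / M**2)
```

Values that remain large relative to a full-gradient baseline indicate residual bias of the kind that ESS-style diagnostics would not detect.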

7. Current Research Directions and Outlook

SGLD remains a subject of active investigation, with current directions focused on:

  • Precise trade-offs in scaling regimes, including the exploration of the SGLDiff limit and the optimal batching/time-sharing curves in practice (Jin et al., 2023, Pillai et al., 2024).
  • Theory and implementation of geometric ergodicity in extensions such as STANLEY for energy-based models, where anisotropic preconditioning and gradient-informed covariance enable improved mixing and fast convergence in high-dimensional, non-Euclidean parameter spaces (Karimi et al., 2023).
  • High-order, reversible stochastic integrators (symmetric splitting, GGMC), which allow for Metropolis-adjusted corrections with stochastic gradients, guaranteeing exactness with positive MH acceptance even in the presence of subsampling (Garriga-Alonso et al., 2021, Matthews et al., 2018).
  • Adaptive control over mini-batch selection, dynamic batch size scheduling, and distributed/federated learning variants that maintain statistical efficiency under challenging non-i.i.d. or private data scenarios (Mekkaoui et al., 2020, Gürbüzbalaban et al., 2020, Putcha et al., 2022).
  • Structured dependency-breaking and block-factorized Langevin approaches for mixing acceleration in high-dimensional posterior landscapes, particularly in deep Bayesian neural networks (Alexos et al., 2021).

The theoretical guarantees, lower bounds on error-vs-computation trade-offs, and the diversity of algorithmic enhancements position SGLD as a foundational tool for scalable MCMC in modern Bayesian computation, with continued empirical successes in both statistical learning and high-dimensional modeling (Nemeth et al., 2019, Karimi et al., 2023, Alexos et al., 2021, Pillai et al., 2024).
