
SA-SGLD: Adaptive MCMC Sampling

Updated 19 November 2025
  • SA-SGLD is an adaptive MCMC method that employs time-rescaling to adjust stepsizes based on local gradient norms without introducing adaptation bias into posterior sampling.
  • It enhances stability and mixing in high-dimensional models, effectively managing varying curvature in Bayesian neural networks.
  • The algorithm guarantees ergodicity with controlled discretization bias, offering a robust alternative to conventional SGLD and pSGLD.

SA-SGLD (Sundman-Adapted Stochastic Gradient Langevin Dynamics) refers to an adaptive, time-rescaled Markov chain Monte Carlo (MCMC) algorithm designed for efficient Bayesian posterior sampling in high-dimensional parameter spaces such as Bayesian neural networks (BNNs). SA-SGLD leverages time rescaling, adaptively modulating the discretization stepsize based on the local geometry of the posterior landscape as measured by stochastic gradients. This mechanism enables robust approximation of the correct invariant measure, addressing challenges of stability, mixing, and tuning complexity inherent to classical SGLD and preconditioned variants (Rajpal et al., 11 Nov 2025).

1. Foundations and Motivation

Stochastic Gradient Langevin Dynamics (SGLD) is a method for sampling from posteriors of the form $p(\theta \mid D) \propto \exp\{-U(\theta)\}$, where $U(\theta)$ comprises both negative log-likelihood and log-prior terms. The overdamped Langevin diffusion is

$$d\theta = -\nabla U(\theta)\,dt + \sqrt{2\beta^{-1}}\,dW(t)$$

with invariant density proportional to $e^{-\beta U(\theta)}$. In practice, SGLD applies an Euler–Maruyama discretization:

$$\theta_{n+1} = \theta_n - \epsilon_n\,\nabla\tilde U(\theta_n) + \sqrt{2\epsilon_n\beta^{-1}}\,\xi_{n+1}$$

where $\xi_{n+1} \sim \mathcal{N}(0,I)$ and $\nabla\tilde U$ is a mini-batch stochastic gradient. Most applications forgo vanishing stepsizes in favor of a fixed $\epsilon$, risking a tradeoff between poor mixing in flat regions and instability in high-curvature regions.
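
For concreteness, a minimal NumPy sketch of a single fixed-stepsize SGLD update follows; the function name sgld_step and the callable grad_U_hat are conventions of this sketch, not notation from the paper.

import numpy as np

def sgld_step(theta, grad_U_hat, eps, beta=1.0, rng=None):
    """One Euler-Maruyama SGLD update with a fixed stepsize eps.

    grad_U_hat(theta) should return an unbiased mini-batch estimate of grad U(theta).
    """
    rng = np.random.default_rng() if rng is None else rng
    xi = rng.standard_normal(theta.shape)          # xi ~ N(0, I)
    return theta - eps * grad_U_hat(theta) + np.sqrt(2.0 * eps / beta) * xi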

Preconditioned SGLD (pSGLD) seeks to address curvature sensitivity by using a diagonal metric $G_n$ (often RMSprop-style), replacing $\epsilon I$ with $\epsilon G_n^{-1}$. However, omitting the necessary divergence correction term $\Gamma(\theta)_i = \sum_j \partial_j G_{ij}^{-1}(\theta)$ in high dimensions breaks detailed balance and induces bias, an issue that is insurmountable at scale due to the term's $O(d^2)$ cost.
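
A hedged sketch of the corresponding pSGLD update, using an RMSprop-style running second-moment estimate for the diagonal metric and omitting the divergence correction exactly as described above; the parameter names (decay, lam) are assumptions of this illustration.

import numpy as np

def psgld_step(theta, v, grad_U_hat, eps, beta=1.0, decay=0.99, lam=1e-5, rng=None):
    """One pSGLD update with an RMSprop-style diagonal preconditioner.

    The divergence correction term Gamma(theta) is omitted, as is common in
    practice; as noted above, this omission is what introduces sampling bias.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = grad_U_hat(theta)
    v = decay * v + (1.0 - decay) * g * g          # running second-moment estimate
    G_inv = 1.0 / (lam + np.sqrt(v))               # diagonal of the inverse metric G^{-1}
    xi = rng.standard_normal(theta.shape)
    theta = theta - eps * G_inv * g + np.sqrt(2.0 * eps * G_inv / beta) * xi
    return theta, v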

2. SA-SGLD: Time Rescaling and Algorithmic Design

SA-SGLD is derived via a Sundman-type time-rescaling mechanism wherein the stepsize adapts according to a monitored scalar function—conventionally the instantaneous squared norm of the local mini-batch stochastic gradient,

$$g(\theta_n) := \|\nabla\tilde U(\theta_n)\|^2 + \delta, \qquad \delta > 0.$$

An auxiliary variable (“clock”) $\zeta_n$ tracks a running average of $g$:

$$\zeta_{n+1} = \rho\,\zeta_n + (1-\rho)\,\frac{g(\theta_n)}{\alpha}, \qquad \rho = e^{-\alpha h}$$

where $h$ is a base time step and $\alpha > 0$ controls memory decay. The overall time rescaling is governed by a user-specified bounded, Lipschitz map $\psi(\cdot)$ (e.g., $\psi(\zeta) = m + (M-m)/(1+\zeta^r)$):

$$\Delta t_{n+1} = \psi(\zeta_{n+1})\,h.$$

This leads to the SA-SGLD update:

$$\theta_{n+1} = \theta_n - \Delta t_{n+1}\,\nabla\tilde U(\theta_n) + \sqrt{2\beta^{-1}\Delta t_{n+1}}\,\xi_{n+1}.$$

Because the time-rescaled SDE is a reparameterization, the invariant measure remains unchanged, eliminating the bias incurred by naive adaptation schemes.
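
The adaptation machinery above can be written in a few lines. The sketch below mirrors the formulas in this section; the function names are this sketch's own, and the default values of m, M, and r are taken from the ranges suggested later in Section 6.

import numpy as np

def psi(zeta, m=0.5, M=2.0, r=0.5):
    """Bounded, Lipschitz rescaling map psi(zeta) = m + (M - m) / (1 + zeta^r)."""
    return m + (M - m) / (1.0 + zeta ** r)

def adapt_stepsize(grad, zeta, h, alpha, delta=1e-6):
    """Advance the clock zeta by one step and return (Delta_t, updated zeta)."""
    g = float(np.dot(grad, grad)) + delta        # monitor: ||grad||^2 + delta
    rho = np.exp(-alpha * h)                     # EMA decay factor rho = exp(-alpha h)
    zeta = rho * zeta + (1.0 - rho) * g / alpha  # clock update
    return psi(zeta) * h, zeta                   # adaptive stepsize Delta_t = psi(zeta) h

Since $\psi(\zeta) \in (m, M]$ for $\zeta \ge 0$, the adaptive stepsize always lies in $(m h, M h]$: it shrinks toward $m h$ when the gradient norm (and hence the clock) is large, and relaxes toward $M h$ in flat regions.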

3. Theoretical Guarantees

Under standard assumptions ($U \in C^4$, Lipschitz $\nabla U$, dissipativity, unbiased stochastic gradients with bounded variance, and bounded, Lipschitz $\psi(\cdot)$), SA-SGLD enjoys uniform moment bounds for the chain and ergodicity to the correct invariant measure. Key properties include:

  • Uniform moment stability: There exists $h_{\max}$ such that for $h < h_{\max}$, $\sup_n \mathbb{E}[\|\theta_n\|^2] < \infty$.
  • Ergodicity and bias: Weighted averages with respect to the variable stepsizes converge almost surely (see the estimator written out below), with discretization bias $|\pi_h(f) - \pi(f)| = O(h)$ for test functions $f$, and no additional bias is introduced by the adaptation mechanism itself.
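
One natural way to write the stepsize-weighted estimator referenced in the second bullet (the precise normalization below is an assumption of this sketch, consistent with that bullet):

$$\hat\pi_N(f) = \frac{\sum_{n=1}^{N} \Delta t_n\, f(\theta_n)}{\sum_{n=1}^{N} \Delta t_n} \xrightarrow{\text{a.s.}} \pi_h(f), \qquad |\pi_h(f) - \pi(f)| = O(h).$$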

This distinguishes SA-SGLD from pSGLD and similar variants, whose failure to correct the metric change introduces persistent bias in the samples (Rajpal et al., 11 Nov 2025).

4. Algorithmic Workflow

The SA-SGLD algorithm proceeds as follows:

for n = 0, 1, 2, ...
    # 1. Stochastic mini-batch gradient
    G_n ← ∇Ũ(θ_n)
    # 2. Monitor
    g_n = ||G_n||^2 + δ
    # 3. Update exponential moving average (clock)
    ζ_{n+1} = ρ ζ_n + (1 − ρ) (g_n / α),   ρ = exp(−αh)
    # 4. Adaptive stepsize
    Δt = ψ(ζ_{n+1}) · h
    # 5. Gaussian noise
    ξ ~ N(0, I)
    # 6. Parameter update
    θ_{n+1} = θ_n − Δt·G_n + sqrt(2 β^{-1} Δt)·ξ
end

The computational overhead relative to SGLD is negligible: only one additional running scalar $\zeta_n$, one norm, and a few arithmetic operations per iteration.
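
To make the workflow concrete, here is a minimal, self-contained NumPy sketch that runs SA-SGLD on a toy anisotropic Gaussian target. The target, the full-batch gradient standing in for a mini-batch gradient, and the hyperparameter values are illustrative choices of this sketch, not the paper's experimental setup.

import numpy as np

# Toy target: anisotropic Gaussian with U(theta) = 0.5 * (theta_1^2 + 50 * theta_2^2),
# so the posterior is N(0, diag(1, 1/50)).
curvature = np.array([1.0, 50.0])

def grad_U(theta):
    """Exact gradient of U for the toy target (stands in for a mini-batch gradient)."""
    return curvature * theta

def sa_sgld(grad_U, theta0, n_steps=20_000, h=0.01, alpha=1.0, delta=1e-6,
            m=0.5, M=2.0, r=0.5, beta=1.0, seed=0):
    """Minimal SA-SGLD loop following the update rules summarized above."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    zeta = 0.0
    rho = np.exp(-alpha * h)
    samples, weights = [], []
    for _ in range(n_steps):
        grad = grad_U(theta)                            # stochastic gradient (here: exact)
        g = grad @ grad + delta                         # monitor g = ||grad||^2 + delta
        zeta = rho * zeta + (1.0 - rho) * g / alpha     # clock: EMA of the monitor
        dt = (m + (M - m) / (1.0 + zeta ** r)) * h      # adaptive stepsize psi(zeta) * h
        xi = rng.standard_normal(theta.shape)           # xi ~ N(0, I)
        theta = theta - dt * grad + np.sqrt(2.0 * dt / beta) * xi
        samples.append(theta)
        weights.append(dt)
    return np.array(samples), np.array(weights)

samples, w = sa_sgld(grad_U, theta0=np.array([2.0, 2.0]))
keep, wts = samples[5_000:], w[5_000:]                  # discard burn-in
print("posterior mean estimate:    ", np.average(keep, axis=0, weights=wts))
print("posterior variance estimate:", np.average(keep**2, axis=0, weights=wts))

The final averages are weighted by the stepsizes $\Delta t_n$, matching the weighted ergodic averages discussed in Section 3.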

5. Empirical Performance

Performance is evaluated on both synthetic high-curvature potentials and practical BNNs:

  • 2D toy examples (Müller–Brown, Star potential): SA-SGLD dynamically shrinks stepsizes in narrow, high-curvature regions, enabling the sampler to cross barriers and traverse “funnels” that stall fixed-ε SGLD (a code sketch of the Müller–Brown surface appears at the end of this section).
  • BNNs on MNIST:
    • Architecture: Fully-connected, 784–1200–1200–10, trained for 200 epochs (100 burn-in), reporting NLL, test accuracy, and expected calibration error (ECE).
    • Under Gaussian prior, SA-SGLD matches SGLD performance; under the sharper Horseshoe prior, SA-SGLD yields lower NLL, higher accuracy, and better calibration.
    • Stability is retained for large base stepsizes in SA-SGLD, with SGLD diverging beyond its stability threshold.
Prior       Method    NLL (↓)         Accuracy (↑)     ECE (↓)
Gaussian    SGLD      0.192 ± 0.005   95.23% ± 0.06    5.72% ± 0.27
Gaussian    SA-SGLD   0.193 ± 0.004   95.25% ± 0.03    5.76% ± 0.22
Horseshoe   SGLD      0.086 ± 0.004   98.03% ± 0.04    3.64% ± 0.11
Horseshoe   SA-SGLD   0.080 ± 0.003   98.12% ± 0.03    3.49% ± 0.09

SA-SGLD improves on SGLD in posterior quality, mixing, and stability under sharp priors (Rajpal et al., 11 Nov 2025).
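
For reference, the Müller–Brown surface mentioned in the first bullet of this section is a standard 2D test potential; the sketch below uses its commonly cited parameter values, which may differ from the exact scaling and temperature used in the paper's experiments.

import numpy as np

# Commonly used Muller-Brown parameters (illustrative; the paper's setup may differ).
A  = np.array([-200.0, -100.0, -170.0, 15.0])
a  = np.array([-1.0, -1.0, -6.5, 0.7])
b  = np.array([0.0, 0.0, 11.0, 0.6])
c  = np.array([-10.0, -10.0, -6.5, 0.7])
x0 = np.array([1.0, 0.0, -0.5, -1.0])
y0 = np.array([0.0, 0.5, 1.5, 1.0])

def muller_brown(theta):
    """Muller-Brown potential: a sum of four anisotropic exponential wells/barriers."""
    dx, dy = theta[0] - x0, theta[1] - y0
    return np.sum(A * np.exp(a * dx**2 + b * dx * dy + c * dy**2))

def grad_muller_brown(theta):
    """Analytic gradient, usable as grad_U in the SA-SGLD sketch above."""
    dx, dy = theta[0] - x0, theta[1] - y0
    e = A * np.exp(a * dx**2 + b * dx * dy + c * dy**2)
    return np.array([np.sum(e * (2 * a * dx + b * dy)),
                     np.sum(e * (b * dx + 2 * c * dy))])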

6. Practical Guidance and Implementation Details

Recommendations for hyperparameters include the following (collected into a small configuration sketch after this list):

  • Base step $h$: As large as feasible in flat regions, typically $0.1$–$1.0$.
  • Monitor function: $g(\theta) = \|\nabla\tilde U\|^2 + \delta$, with $\delta \approx 10^{-6}$–$10^{-4}$.
  • Exponential decay $\alpha$: Set so that $\rho = \exp(-\alpha h) \approx 0.9$–$0.99$ (medium memory).
  • Rescaling map $\psi(\cdot)$: A common choice is $\psi(\zeta) = m + (M-m)/(1+\zeta^r)$ with $r = 0.25$–$1$, $m = 0.5$, $M = 2$.
  • Overhead: Minimal, dominated by backpropagation cost in gradient computation.
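
As noted above, these recommendations can be collected into a small set of defaults. The dictionary below is an illustrative configuration sketch; the names, and the specific values chosen inside the recommended ranges, are this example's own rather than taken from the paper.

# Illustrative defaults drawn from the recommended ranges above.
SA_SGLD_DEFAULTS = {
    "h": 0.1,       # base time step (flat-region stepsize), typically 0.1-1.0
    "delta": 1e-6,  # monitor offset in g(theta) = ||grad||^2 + delta
    "alpha": 1.0,   # chosen so that rho = exp(-alpha * h) ~ 0.9 for h = 0.1
    "m": 0.5,       # lower bound of the rescaling map psi
    "M": 2.0,       # upper bound of the rescaling map psi
    "r": 0.5,       # exponent in psi(zeta) = m + (M - m) / (1 + zeta^r)
}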

A plausible implication is that SA-SGLD presents a scalable and robust approach for large-scale BNN posterior sampling, particularly beneficial when the geometry of the posterior exhibits regions of disparate curvature.

In the literature, "SA" (stochastic approximation) also refers to temporal averaging in Langevin Monte Carlo for multi-armed bandit problems, as exemplified by TS-SA. In that context, stochastic approximation smooths iterates to achieve improved posterior sampling and regret bounds in the non-stationary setting (Wang et al., 6 Oct 2025).

The core distinction is that SA-SGLD specifically denotes the Sundman-adapted SGLD via time rescaling for posterior sampling in continuous parameter spaces, as constructed in Leimkuhler, Lohmann, and Whalley (2025) (Rajpal et al., 11 Nov 2025). In contrast, stochastic approximation in the TS-SA framework is applied to averaging in bandit decision processes. Both exploit adaptivity, but the formalism and objectives differ. Coordination of stepsize to posterior geometry through time rescaling, as in SA-SGLD, addresses long-standing instability and bias issues in stochastic gradient MCMC.

For SGMCMC in high dimensions, SA-SGLD supplies a provably correct, easily tuned, and computationally efficient alternative to existing adaptive schemes.
