Adaptive Weighted SGLD
- Adaptive Weighted SGLD is a class of scalable SG-MCMC algorithms that dynamically modulate gradient updates and injected noise to enhance posterior sampling in high-dimensional, non-convex problems.
- They employ techniques such as time-rescaling, state-dependent preconditioning, and adaptive drift mechanisms to overcome poor mixing, pathological curvature, and saddle points.
- Empirical evaluations show these methods achieve faster convergence, improved effective sample size, and superior calibration in applications like Bayesian deep learning and large-scale probabilistic modeling.
Adaptive Weighted Stochastic Gradient Langevin Dynamics (SGLD) refers to a class of scalable stochastic gradient Markov Chain Monte Carlo (SG-MCMC) algorithms where both the magnitude and direction of gradient-driven updates and injected noise are dynamically modulated, typically to address poor mixing, pathological curvature, and saddle points in high-dimensional, non-convex Bayesian inference problems. These algorithms encompass time-rescaling (stepsize adaptation), state-dependent preconditioning, adaptive drift mechanisms, and data-dependent importance weighting, aiming to improve posterior sampling efficiency, stability, and bias control in domains such as Bayesian deep learning and large-scale probabilistic modeling.
1. Langevin Diffusion, Discretization, and Gradient Noise
At the core of SGLD is the discretization of the overdamped Langevin diffusion targeting a posterior law $\pi(\theta) \propto \exp(-U(\theta))$, where $U$ encodes the energy landscape. The continuous dynamics,
$$d\theta_t = -\nabla U(\theta_t)\,dt + \sqrt{2}\,dW_t,$$
are replaced by the Euler-Maruyama scheme using noisy stochastic gradients $\nabla \tilde{U}(\theta_k)$:
$$\theta_{k+1} = \theta_k - \epsilon_k \nabla \tilde{U}(\theta_k) + \sqrt{2\epsilon_k}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I).$$
Adaptive methods alter the stepsize $\epsilon_k$, employ curvature-dependent preconditioners, or introduce state-dependent modifications to this standard iterative update (Rajpal et al., 11 Nov 2025).
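For reference in the sections that follow, here is a minimal NumPy sketch of this baseline update on a toy Gaussian-mean problem; the quadratic energy, batch size, stepsize, and iteration counts are illustrative assumptions rather than settings from any cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_U_hat(theta, data, batch_size=32):
    """Minibatch stochastic gradient of U(theta) = (1/(2N)) * sum_i ||x_i - theta||^2
    (an illustrative Gaussian-mean energy, so the target is N(mean(data), I))."""
    idx = rng.choice(len(data), size=batch_size, replace=False)
    return np.mean(theta - data[idx], axis=0)

def sgld_step(theta, data, eps):
    """One Euler-Maruyama step: theta <- theta - eps*grad + sqrt(2*eps)*xi."""
    xi = rng.standard_normal(theta.shape)
    return theta - eps * grad_U_hat(theta, data) + np.sqrt(2.0 * eps) * xi

# Usage: sample the posterior over the mean of synthetic 2-D data.
data = rng.standard_normal((1000, 2)) + np.array([1.0, -2.0])
theta = np.zeros(2)
samples = []
for k in range(2000):
    theta = sgld_step(theta, data, eps=1e-2)
    samples.append(theta.copy())
print("posterior mean estimate:", np.mean(samples[500:], axis=0))
```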
2. Adaptive Time-Rescaling and Stepsize Selection
SA-SGLD, built upon the SamAdams framework (Rajpal et al., 11 Nov 2025), utilizes adaptive time-rescaling to modulate the stepsize based on a monitor function, typically the exponential average of the local gradient norm:
- An auxiliary time $\tau$ and a state-dependent Sundman transform $dt = g(\rho)\,d\tau$ control the rescaling of physical time.
- The monitor $\rho$ aggregates local curvature information via an exponential average of the local gradient norm, $\rho_{k+1} = (1-\lambda)\rho_k + \lambda\,\|\nabla \tilde{U}(\theta_k)\|$.
- The rescaled stepsize is $\epsilon_k = \epsilon\, g(\rho_k)$, with $g$ bounded above and below.
- The update becomes $\theta_{k+1} = \theta_k - \epsilon_k \nabla \tilde{U}(\theta_k) + \sqrt{2\epsilon_k}\,\xi_k$.
This adaptive shrinking/enlargement of step length prevents overshooting in steep regions and improves mixing in flat regions, avoiding bias since only physical time is rescaled, not the geometry of the parameter space (Rajpal et al., 11 Nov 2025).
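The sketch below illustrates the general mechanics of monitor-driven time rescaling on a 1-D double-well; the specific monitor update, the bounding function `g`, and all constants are illustrative assumptions and do not reproduce the exact SA-SGLD/SamAdams formulas of Rajpal et al.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_U(theta):
    """Gradient of a 1-D double-well energy U(theta) = (theta**2 - 1)**2 (illustrative)."""
    return 4.0 * theta * (theta**2 - 1.0)

def g(rho, g_min=0.1, g_max=1.0):
    """Bounded rescaling function of the monitor: shrink steps where gradients are large."""
    return np.clip(1.0 / (1.0 + rho), g_min, g_max)

def sa_sgld_like(theta0, eps=5e-3, lam=0.1, n_steps=20000):
    """Monitor-driven time rescaling: eps_k = eps * g(rho_k), where rho_k is an
    exponential average of the local gradient norm (illustrative form)."""
    theta, rho = theta0, 0.0
    samples = np.empty(n_steps)
    for k in range(n_steps):
        grad = grad_U(theta)
        rho = (1.0 - lam) * rho + lam * abs(grad)        # monitor update
        eps_k = eps * g(rho)                             # rescaled stepsize
        theta = theta - eps_k * grad + np.sqrt(2.0 * eps_k) * rng.standard_normal()
        samples[k] = theta
    return samples

samples = sa_sgld_like(theta0=1.0)
print("time spent in each well (theta<0 vs theta>0):",
      np.mean(samples < 0), np.mean(samples > 0))
```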
3. State-Dependent Preconditioning and Gradient Weighting
Preconditioned or weighted SGLD variants employ matrix-valued or diagonal preconditioners to adaptively scale both the gradient and noise:
- RMSProp/Adam-style diagonal preconditioning uses running moment estimates of the stochastic gradient, $V_k = \alpha V_{k-1} + (1-\alpha)\,\nabla \tilde{U}(\theta_k) \odot \nabla \tilde{U}(\theta_k)$ and $G_k = \mathrm{diag}\!\big(1 \oslash (\lambda + \sqrt{V_k})\big)$, to scale both the drift and the injected noise (Bhardwaj, 2019, Li et al., 2015); see the sketch after this list.
- More sophisticated blockwise or quasi-diagonal approximations (e.g., K-FAC, QDOP) further exploit model structure for efficient curvature adaptation (Palacci et al., 2018, Marceau-Caron et al., 2017).
- Importance-weighted gradient sampling assigns non-uniform inclusion probabilities to data points according to their gradient magnitude or Lipschitz constant, and adaptively tunes batch size for variance control (Putcha et al., 2022).
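A minimal sketch of RMSProp-style diagonal preconditioning in the spirit of pSGLD (Li et al., 2015) on an ill-conditioned quadratic target; the decay rate, damping constant, and toy energy are assumptions, and the curvature-correction term $\Gamma$ of the full pSGLD algorithm is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)

H = np.diag([100.0, 1.0])    # ill-conditioned quadratic: very different curvature per axis

def grad_U_hat(theta):
    """Noisy gradient of U(theta) = 0.5 * theta @ H @ theta (illustrative target)."""
    return H @ theta + 0.1 * rng.standard_normal(2)   # synthetic gradient noise

def psgld_like(theta0, eps=1e-3, alpha=0.99, lam=1e-5, n_steps=20000):
    """Diagonal RMSProp-style preconditioned SGLD:
    V_k = alpha*V_{k-1} + (1-alpha)*g*g,  G_k = 1/(lam + sqrt(V_k)),
    theta <- theta - eps*G_k*g + sqrt(2*eps*G_k)*xi   (Gamma correction omitted)."""
    theta = np.array(theta0, dtype=float)
    V = np.zeros_like(theta)
    samples = np.empty((n_steps, theta.size))
    for k in range(n_steps):
        g = grad_U_hat(theta)
        V = alpha * V + (1.0 - alpha) * g * g
        G = 1.0 / (lam + np.sqrt(V))                  # diagonal preconditioner
        theta = theta - eps * G * g \
                + np.sqrt(2.0 * eps * G) * rng.standard_normal(theta.size)
        samples[k] = theta
    return samples

samples = psgld_like([1.0, 1.0])
print("per-coordinate sample std (target 0.1 and 1.0):", samples[5000:].std(axis=0))
```

The preconditioner equalizes the effective stepsize across the two axes, which is the mechanism behind the reported gains in mixing and effective sample size.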
4. Adaptive Drift, Dynamic Biases, and Escape from Saddles
Adaptive drift-infused SGLD methods introduce a learnable bias term, which is dynamically adjusted according to running gradient statistics:
- Momentum-SGLD (MSGLD) and Adam-style SGLD (ASGLD) maintain exponential running averages $m_k = \beta_1 m_{k-1} + (1-\beta_1)\,\nabla \tilde{U}(\theta_k)$ and second moments $v_k = \beta_2 v_{k-1} + (1-\beta_2)\,\nabla \tilde{U}(\theta_k)^{\odot 2}$, which enter the update as an additional drift (bias) term (Kim et al., 2020, Sang et al., 2018); a sketch of the momentum variant follows this list.
- This explicit drift modification enhances rapid escape from saddle points, improves multimodal traversal, and provably preserves ergodicity under suitable smoothness/dissipativity assumptions.
- Escape from strict saddle points occurs within a number of iterations that is essentially independent of the problem dimension in such adaptive schemes (Sang et al., 2018).
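The following sketch illustrates an adaptive momentum drift in the spirit of MSGLD (Kim et al., 2020) on a 1-D double-well; the bias weight `a`, the smoothing factor `beta`, the energy, and the stepsize are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)

def grad_U(theta):
    """Gradient of a 1-D double-well energy U(theta) = 0.1*(theta**2 - 4)**2 (illustrative)."""
    return 0.4 * theta * (theta**2 - 4.0)

def msgld_like(theta0, eps=2e-3, beta=0.9, a=1.0, n_steps=50000):
    """SGLD with an adaptive momentum drift:
    m_k = beta*m_{k-1} + (1-beta)*grad,
    theta <- theta - eps*(grad + a*m_k) + sqrt(2*eps)*xi."""
    theta, m = theta0, 0.0
    samples = np.empty(n_steps)
    for k in range(n_steps):
        grad = grad_U(theta)
        m = beta * m + (1.0 - beta) * grad               # running-average drift
        theta = theta - eps * (grad + a * m) + np.sqrt(2.0 * eps) * rng.standard_normal()
        samples[k] = theta
    return samples

samples = msgld_like(theta0=2.0)
print("mode occupancy (theta<0 vs theta>0):", np.mean(samples < 0), np.mean(samples > 0))
```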
5. Contour Flattening, Importance Sampling, and Weighted Ergodic Estimation
CSGLD and related contour/adaptive weighting methods flatten the energy landscape by dynamically learning a piecewise-exponential weight function on the energy axis, transforming the sampling density:
- The target $\pi(\theta) \propto \exp(-U(\theta))$ is sampled from the flattened density $\varpi_w(\theta) \propto \pi(\theta)/w(U(\theta))$, with the piecewise-exponential weight function $w$ adapted via a stochastic approximation procedure.
- Ergodic averages of test functions must be reweighted to correct for the bias introduced by contour flattening (Deng et al., 2020): $\widehat{\mathbb{E}}_\pi[f] = \sum_k w(U(\theta_k))\, f(\theta_k) \,\big/\, \sum_k w(U(\theta_k))$; a sketch of this reweighting step follows this list.
- Under stability and mean-field convergence theory, the learned weights approach their true bin probabilities, yielding provably small bias in posterior expectation estimation.
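The sketch below demonstrates only the reweighted ergodic estimator on a double-well toy problem; the stochastic-approximation learning of the weight function is elided, and a fixed single-piece exponential weight stands in for the piecewise-exponential weights of CSGLD, so all constants here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def U(theta):
    """Double-well energy; the target is pi(theta) proportional to exp(-U(theta))."""
    return 0.1 * (theta**2 - 4.0)**2

def grad_U(theta):
    return 0.4 * theta * (theta**2 - 4.0)

# Stand-in for an already-learned weight function: a single exponential piece
# w(u) = exp(-lam*u), so the flattened density pi/w(U) ~ exp(-(1-lam)*U(theta)),
# i.e. the energy barrier is scaled down by (1 - lam).
lam = 0.6
def w(u):
    return np.exp(-lam * u)

def sample_flattened(theta0, eps=2e-3, n_steps=100000):
    """Langevin sampling of the flattened density exp(-(1-lam)*U)."""
    theta = theta0
    samples = np.empty(n_steps)
    for k in range(n_steps):
        theta = theta - eps * (1.0 - lam) * grad_U(theta) \
                + np.sqrt(2.0 * eps) * rng.standard_normal()
        samples[k] = theta
    return samples

samples = sample_flattened(theta0=2.0)[10000:]    # discard burn-in
weights = w(U(samples))                           # importance weights w(U(theta_k))

# Reweighted ergodic estimate of E_pi[f] for f(theta) = theta**2,
# versus the naive average that ignores the contour flattening.
f = samples**2
print("reweighted:", np.sum(weights * f) / np.sum(weights),
      " naive (biased):", np.mean(f))
```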
6. Thermostats, Friction Adaptation, and Superconvergence
The use of auxiliary thermostat variables (Nosé–Hoover or SGNHT) provides robust temperature control for SG-MCMC, dynamically offsetting unknown gradient noise:
- The friction variable is tuned so that the system's kinetic energy matches the target thermostat temperature; no prior estimate of gradient noise variance is required (Leimkuhler et al., 2015).
- Symmetric splitting schemes such as BADODAB achieve fourth-order “superconvergence” for configurational averages in the high-friction regime, enabling much larger stable timesteps than standard SGLD.
- For Bayesian inference, this adaptation yields exact marginal sampling in the small-stepsize limit $\Delta t \to 0$, with dramatically reduced numerical overhead and improved efficiency; a minimal thermostat sketch follows this list.
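Below is a minimal sketch of a stochastic-gradient Nosé–Hoover thermostat on a standard-normal target with synthetic gradient noise; it uses a simple Euler-style update rather than the BADODAB splitting, and the noise level, stepsize, and diffusion constant `A` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def grad_U_hat(theta):
    """Noisy gradient of U(theta) = 0.5*||theta||^2 (standard-normal target, illustrative)."""
    return theta + 0.3 * rng.standard_normal(theta.shape)   # synthetic gradient noise

def sgnht_like(theta0, eps=1e-2, A=1.0, n_steps=50000):
    """Euler-style stochastic-gradient Nose-Hoover thermostat (not the BADODAB splitting).
    The friction xi adapts so the kinetic energy matches temperature T = 1,
    absorbing the unknown gradient-noise variance."""
    d = theta0.size
    theta = theta0.astype(float)
    p = rng.standard_normal(d)
    xi = A
    samples = np.empty((n_steps, d))
    for k in range(n_steps):
        p = p - eps * grad_U_hat(theta) - eps * xi * p \
            + np.sqrt(2.0 * A * eps) * rng.standard_normal(d)
        theta = theta + eps * p
        xi = xi + eps * (p @ p / d - 1.0)     # drive mean kinetic energy toward T = 1
        samples[k] = theta
    return samples

samples = sgnht_like(np.zeros(10))
print("per-coordinate variance (target 1.0):", samples[10000:].var(axis=0).mean())
```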
7. Empirical Evaluation, Mixing, Generalization, and Calibration
Adaptive weighted SGLD methods have been systematically benchmarked across curated toy problems and modern architectures:
- In high-curvature potentials, adaptive stepsize SGLD (SA-SGLD) shows unbiased well traversal, stable contraction in steep regions, and superior mixing in plateaux (Rajpal et al., 11 Nov 2025).
- On image classification with Bayesian neural networks (BNNs), adaptive weighting outperforms SGLD under sharp horseshoe priors, achieving lower negative log-likelihood, higher accuracy, reduced expected calibration error, and greater robustness to stepsize escalation (Rajpal et al., 11 Nov 2025).
- Preconditioned SGLD demonstrates up to 3× improvement in effective sample size, accelerated convergence vs. unweighted SGLD, and competitive or superior generalization accuracy relative to SGD, Adam, and AdaGrad (Bhardwaj, 2019, Li et al., 2015).
- Adaptive drift and contour-flattening methods provide substantially enhanced multimodal exploration and reliable uncertainty quantification, especially in highly non-convex and multimodal landscapes (Kim et al., 2020, Deng et al., 2020, Putcha et al., 2022).
| Variant | Mechanism | Key Empirical Findings |
|---|---|---|
| SA-SGLD (Rajpal et al., 11 Nov 2025) | Time-rescaled stepsize | Superior mixing, unbiased sampling |
| pSGLD (Li et al., 2015) | Diag. preconditioning | Faster convergence, ESS up to 3× SGLD |
| ASGLD (Bhardwaj, 2019) | 1st/2nd moment cumulants | Speed matches Adam/AdaGrad, gen. ≈ SGD |
| CSGLD (Deng et al., 2020) | Adaptive contour weighting | Avoids multimodal local traps |
| SGNHT (Leimkuhler et al., 2015) | Friction/thermostat | O(Δt⁴) bias, stepsizes 2–10× vs. SGLD |
In summary, adaptive weighted SGLD methodologies combine curvature sensing, state-dependent stepsize modulation, adaptive gradient and noise preconditioning, and advanced dynamical control to yield scalable, robust, and low-bias posterior sampling in high-dimensional Bayesian inference, enabling marked improvements in mixing, exploration, and downstream predictive performance over classical SGLD.