Natural Gradient SGLD for Scalable Bayesian Inference

Updated 19 May 2026

Natural Gradient SGLD is a Bayesian sampling method that preconditions both gradient updates and noise using Fisher information to capture the local geometry of high-dimensional models.
Scalable approximations, including diagonal, quasi-diagonal, and K-FAC, balance computational efficiency with accurate adaptation to parameter curvature in deep neural networks.
While NG-SGLD improves uncertainty estimation and out-of-distribution detection, it demands careful hyperparameter tuning and ensembling to fully realize its Bayesian benefits.

Natural Gradient Stochastic Gradient Langevin Dynamics (NG-SGLD) is a class of stochastic differential equation-based optimization and sampling algorithms used for Bayesian inference in high-dimensional models, particularly deep neural networks. It generalizes the standard SGLD by replacing isotropic noise and Euclidean-gradient updates with updates preconditioned by an approximation to the local geometry of the parameter space, typically captured by the Fisher information matrix. NG-SGLD thus fuses the principles of Amari's natural gradient descent and Langevin Monte Carlo, enabling scalable Bayesian posterior sampling that more accurately reflects parameter sensitivity and curvature than standard SGLD.

1. Foundations: SGLD and the Motivation for Natural Gradients

Standard SGLD, introduced by Welling & Teh (2011), targets the posterior $p(\theta|D) \propto p(\theta) \prod_i p(y_i|x_i; \theta)$ by taking stochastic gradient steps on the log posterior and injecting Gaussian noise at each step. The classic update is

$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \nabla_\theta \left[ \log p(\theta_t) + \frac{n}{J} \sum_{(x, y) \in B_t} \log p(y|x; \theta_t) \right] + \xi_t,\quad \xi_t \sim \mathcal{N}(0, \epsilon_t I).$

Here, $B_t$ is a mini-batch of size $J$, $n$ is the number of training examples, and $\epsilon_t$ is the step size. Under appropriately decaying step sizes, the Markov chain defined by these updates converges to the posterior distribution.
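
As a minimal sketch of the update above, the following NumPy snippet runs vanilla SGLD on a toy model that is not taken from the cited papers: observations $y_i \sim \mathcal{N}(\theta, 1)$ with a Gaussian prior on $\theta$. The function name `sgld_step`, the prior variance, and the constant step size are illustrative assumptions; a constant step is used only for brevity and introduces a small bias relative to a decaying schedule.

```python
import numpy as np

def sgld_step(theta, batch, n, eps, rng, prior_var=100.0):
    """One vanilla SGLD step for y_i ~ N(theta, 1), prior theta ~ N(0, prior_var)."""
    J = len(batch)
    grad_log_prior = -theta / prior_var
    grad_log_lik = (n / J) * np.sum(batch - theta)  # rescaled mini-batch gradient
    noise = rng.normal(0.0, np.sqrt(eps))           # xi_t ~ N(0, eps)
    return theta + 0.5 * eps * (grad_log_prior + grad_log_lik) + noise

rng = np.random.default_rng(0)
n, J = 1000, 50
data = rng.normal(2.0, 1.0, size=n)  # synthetic data (assumption)

theta, samples = 0.0, []
for t in range(5000):
    eps = 1e-4                       # constant step size (simplifying assumption)
    batch = rng.choice(data, size=J, replace=False)
    theta = sgld_step(theta, batch, n, eps, rng)
    if t >= 1000:                    # discard burn-in
        samples.append(theta)
samples = np.array(samples)

# Closed-form posterior mean of this conjugate model, for comparison.
post_mean = data.sum() / (n + 1.0 / 100.0)
print(samples.mean(), post_mean)
```

Because the toy model is conjugate, the chain average can be compared directly with the exact posterior mean; for a neural network no such reference is available.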

However, in high-dimensional models typical of modern deep learning, parameters can vary substantially in scale and be highly correlated. Isotropic noise, as in vanilla SGLD, cannot adapt to these anisotropies, resulting in inefficient exploration and poor mixing properties. Amari's natural gradient methodology addresses these shortcomings by endowing parameter space with a Riemannian metric given by the Fisher information, conferring invariance to reparameterization and tailoring both gradient updates and noise to local geometry (Palacci et al., 2018, Marceau-Caron et al., 2017, Bhardwaj, 2019).

2. Theoretical Formulation of Natural Gradient SGLD

NG-SGLD replaces the isotropic metric with the Fisher information or its scalable approximations. The Fisher information at $\theta$ is

$F(\theta) = \mathbb{E}_{x, y \sim p(\cdot|\theta)} \left[ \nabla_\theta \log p(y|x; \theta) \nabla_\theta \log p(y|x; \theta)^T \right].$
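
To make the definition concrete, here is a small sketch that estimates $F(\theta)$ by Monte Carlo for logistic regression, drawing labels from the model itself as the expectation requires, and compares the result with the closed form available for this particular model. The model, the synthetic inputs, and all variable names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0])
X = rng.standard_normal((200_000, 2))   # synthetic inputs (assumption)
p = sigmoid(X @ theta)                  # model probabilities p(y=1 | x; theta)

# Monte Carlo Fisher: sample y from the model, then average score outer products.
y = (rng.random(len(p)) < p).astype(float)
scores = (y - p)[:, None] * X           # grad_theta log p(y|x; theta) for logistic regression
F_mc = scores.T @ scores / len(X)

# Closed form for logistic regression: E_x[ p (1 - p) x x^T ].
F_exact = (X * (p * (1 - p))[:, None]).T @ X / len(X)

print(np.round(F_mc, 3))
print(np.round(F_exact, 3))
```

Replacing the sampled labels with the observed training labels would give the so-called empirical Fisher, which is a different quantity in general.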

The NG-SGLD update uses a preconditioner $G(\theta) \approx F(\theta)$:

$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}G(\theta_t)^{-1} \nabla_\theta \log p(\theta_t|D) + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \epsilon_t G(\theta_t)^{-1}).$

In continuous time,

$d\theta_t = -G(\theta_t)^{-1} \nabla_\theta U(\theta_t) dt + \sqrt{2 G(\theta_t)^{-1}} dW_t,$

for the potential $U(\theta) = -\log p(\theta|D)$; discretizing with step $\epsilon_t/2$ recovers the update above. The preconditioning matches the local Gaussian structure of the posterior (Bernstein–von Mises theorem), yielding theoretically justified noise scaling (Marceau-Caron et al., 2017, Bhardwaj, 2019). When $G$ varies with $\theta$, an exact Riemannian Langevin diffusion additionally carries a drift correction involving derivatives of $G(\theta)^{-1}$; the form above omits it, a common simplification in scalable variants.
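
The discrete NG-SGLD update above can be sketched with a dense preconditioner on a two-dimensional toy target. Everything here is an illustrative assumption: the target is a zero-mean Gaussian with precision matrix `P`, the exact gradient is used instead of a mini-batch estimate, and $G$ is held constant at `P`, so no metric-derivative correction term arises. Dense inversion and Cholesky factorization are only viable for tiny dimensions.

```python
import numpy as np

def ng_sgld_step(theta, grad_log_post, G, eps, rng):
    """theta + (eps/2) G^{-1} grad + eta, with eta ~ N(0, eps * G^{-1})."""
    G_inv = np.linalg.inv(G)              # acceptable only for very small d
    L = np.linalg.cholesky(eps * G_inv)   # L @ L.T == eps * G^{-1}
    eta = L @ rng.standard_normal(theta.shape)
    return theta + 0.5 * eps * G_inv @ grad_log_post + eta

# Toy target: U(theta) = 0.5 * theta^T P theta, so grad log p = -P theta.
P = np.array([[100.0, 9.0],
              [9.0, 1.0]])                # anisotropic and correlated
rng = np.random.default_rng(0)
theta = np.array([1.0, 1.0])
samples = []
for t in range(20000):
    theta = ng_sgld_step(theta, -P @ theta, G=P, eps=0.1, rng=rng)
    samples.append(theta)
samples = np.array(samples[2000:])        # discard burn-in

emp_cov = np.cov(samples.T)
true_cov = np.linalg.inv(P)
print(np.round(emp_cov, 3))
print(np.round(true_cov, 3))
```

With the preconditioner matched to the target precision, both coordinates contract at the same rate despite the hundredfold difference in scale, which is the behavior isotropic SGLD cannot reproduce with a single step size.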

3. Scalable Approximations: Diagonal, Quasi-Diagonal, and K-FAC

The exact Fisher information is intractable for modern networks ($O(d^2)$ memory for $d$ parameters). NG-SGLD thus relies on structured approximations:

Diagonal (pSGLD, DOP, RMSProp-style): $G(\theta) \approx \mathrm{diag}(F(\theta))$, estimated from coordinate-wise second moments of gradients; $O(d)$ storage and cost for $d$ parameters (Palacci et al., 2018, Marceau-Caron et al., 2017, Bhardwaj, 2019).
Quasi-diagonal (QDOP): Stores the diagonal and first row of the Fisher block for each neuron's incoming weights plus bias, preserving bias-weight correlations within blocks, at $O(d)$ cost with better adaptation to curvature (Marceau-Caron et al., 2017).
Block-diagonal, K-FAC (Kronecker-factored): For a fully connected layer $\ell$ with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs, estimates the Fisher block as $F_\ell \approx A_\ell \otimes S_\ell$ (Kronecker product of activation and gradient moment matrices), with $A_\ell = \mathbb{E}[a a^T]$ and $S_\ell = \mathbb{E}[\delta \delta^T]$ (where $a$ and $\delta$ are the layer's input activations and backpropagated gradients). This captures within-layer correlations while only the two factors need to be stored and inverted, which is tractable for deep nets ($O(n_{\text{in}}^2 + n_{\text{out}}^2)$ memory per layer, compared to $O(d^2)$ for the full Fisher) (Palacci et al., 2018).
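
The Kronecker structure is what makes K-FAC cheap to invert and to sample from. The sketch below, for a single fully connected layer, checks the identity $(A \otimes S)^{-1} \mathrm{vec}(V) = \mathrm{vec}(S^{-1} V A^{-1})$ against a dense solve and draws preconditioned noise from the two small factors only. The random activations and gradients, the damping constant, the symbol `S` for the gradient factor, and the helper name `kfac_noise` are all assumptions for illustration, not the notation or code of the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 4, 3, 256
a = rng.standard_normal((batch, n_in))        # layer inputs (synthetic)
delta = rng.standard_normal((batch, n_out))   # backpropagated gradients (synthetic)

damping = 1e-3                                # hypothetical damping for invertibility
A = a.T @ a / batch + damping * np.eye(n_in)            # activation second moments
S = delta.T @ delta / batch + damping * np.eye(n_out)   # gradient second moments

V = rng.standard_normal((n_out, n_in))        # stand-in for a gradient w.r.t. W (n_out x n_in)

# Kronecker route: only the small factors are inverted.
nat_grad = np.linalg.solve(S, V) @ np.linalg.inv(A)

# Dense route for comparison (column-major vec convention).
F_dense = np.kron(A, S)
dense = np.linalg.solve(F_dense, V.flatten(order="F")).reshape((n_out, n_in), order="F")

def kfac_noise(A, S, eps, rng):
    """Draw a matrix whose column-major vec is N(0, eps * (A kron S)^{-1})."""
    L_A = np.linalg.cholesky(A)
    L_S = np.linalg.cholesky(S)
    Z = rng.standard_normal((S.shape[0], A.shape[0]))
    # vec(L_S^{-T} Z L_A^{-1}) has covariance A^{-1} kron S^{-1}
    return np.sqrt(eps) * np.linalg.solve(L_S.T, Z) @ np.linalg.inv(L_A)

eta = kfac_noise(A, S, eps=1e-2, rng=rng)

kfac_memory = n_in**2 + n_out**2      # entries stored for the two factors
dense_memory = (n_in * n_out) ** 2    # entries in the full layer block
print(np.allclose(nat_grad, dense), kfac_memory, dense_memory)
```

For this tiny layer the factors hold 25 numbers against 144 for the dense block; the gap widens quadratically with layer width.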

The following table summarizes these approximations:

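Preconditioner	Structure	Complexity (memory)
Diagonal (pSGLD/DOP)	Per-parameter	$O(d)$
Quasi-Diagonal (QDOP)	Blockwise/first row	$O(d)$
K-FAC	Kronecker-factored	$O(n_{\text{in}}^2 + n_{\text{out}}^2)$ per layer with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs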

Natural-Gradient SGLD preconditions both drift and noise, while some adaptive approximations (e.g., ASGLD) precondition only the injected noise (Bhardwaj, 2019).

4. Algorithmic Implementation and Practical Considerations

The NG-SGLD loop proceeds as follows (Palacci et al., 2018, Marceau-Caron et al., 2017, Bhardwaj, 2019):

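Sample a mini-batch $B_t$ and compute the stochastic gradient $\hat{g}_t = \nabla_\theta \left[ \log p(\theta_t) + \frac{n}{J} \sum_{(x, y) \in B_t} \log p(y|x; \theta_t) \right]$.
Update the preconditioner $G_t \approx F(\theta_t)$ (diagonal, quasi-diagonal, or K-FAC, using exponential moving averages).
Form the preconditioner inverse $G_t^{-1}$.
Sample noise $\eta_t \sim \mathcal{N}(0, \epsilon_t G_t^{-1})$.
Parameter update: $\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} G_t^{-1} \hat{g}_t + \eta_t$.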
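
The loop above can be sketched end to end with a diagonal preconditioner. The example is a toy under stated assumptions, not the procedure of any cited paper: the model is Bayesian linear regression with unit noise variance and badly scaled features, the diagonal Fisher is estimated by drawing residuals from the model and tracked with an exponential moving average, and the step size is constant. The names `fisher_ema`, `alpha`, and `lam` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J, d = 500, 50, 2
X = rng.standard_normal((n, d)) * np.array([10.0, 1.0])  # badly scaled features
w_true = np.array([1.0, -2.0])
y = X @ w_true + rng.standard_normal(n)                  # unit noise variance
prior_var, eps, alpha, lam = 100.0, 2e-4, 0.99, 1e-8     # illustrative settings

w, fisher_ema, samples = np.zeros(d), None, []
for t in range(6000):
    idx = rng.choice(n, size=J, replace=False)
    Xb, yb = X[idx], y[idx]
    # 1. Stochastic gradient of the log posterior.
    grad = -w / prior_var + (n / J) * Xb.T @ (yb - Xb @ w)
    # 2. Diagonal Fisher estimate: with y drawn from the model, the score is r * x, r ~ N(0, 1).
    r = rng.standard_normal(J)
    fisher_batch = np.mean((r[:, None] * Xb) ** 2, axis=0)
    fisher_ema = fisher_batch if fisher_ema is None else alpha * fisher_ema + (1 - alpha) * fisher_batch
    # 3. Inverse of the diagonal preconditioner.
    G_inv = 1.0 / (fisher_ema + lam)
    # 4. Noise with covariance eps * G^{-1}.
    eta = np.sqrt(eps * G_inv) * rng.standard_normal(d)
    # 5. Preconditioned Langevin update.
    w = w + 0.5 * eps * G_inv * grad + eta
    if t >= 2000:                                        # discard burn-in
        samples.append(w.copy())
samples = np.array(samples)

# Exact posterior of this conjugate model, for comparison.
post_prec = X.T @ X + np.eye(d) / prior_var
post_mean = np.linalg.solve(post_prec, X.T @ y)
post_std = np.sqrt(np.diag(np.linalg.inv(post_prec)))
print(samples.mean(axis=0), post_mean)
```

Because every operation is elementwise, the per-step cost is linear in the number of parameters, comparable to an RMSProp update plus one Gaussian draw per coordinate.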

Practical recommendations:

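Step size scheduling is critical; asymptotically, use a Robbins–Monro decay ($\sum_t \epsilon_t = \infty$ and $\sum_t \epsilon_t^2 < \infty$), e.g., $\epsilon_t \propto t^{-\gamma}$ with $\gamma \in (1/2, 1]$ (Marceau-Caron et al., 2017).
For quasi-diagonal and diagonal preconditioners, the update and the noise sampling decompose over coordinates or small parameter blocks, so a full step takes $O(d)$ time.
Initialization and decay rates of both the step size and the preconditioner’s running averages (e.g., the exponential-moving-average coefficient) require tuning.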
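
A polynomially decaying schedule of the form $\epsilon_t = a (b + t)^{-\gamma}$ is one standard way to meet the Robbins–Monro conditions. The sketch below is illustrative only: the function name `step_size` and the default values of `a`, `b`, and `gamma` are assumptions, not recommendations from the cited papers.

```python
import numpy as np

def step_size(t, a=1e-3, b=10.0, gamma=0.55):
    """Polynomial decay eps_t = a * (b + t)^(-gamma).

    The sum of eps_t diverges for gamma <= 1 and the sum of eps_t^2
    converges for gamma > 1/2, so gamma must lie in (0.5, 1].
    """
    if not 0.5 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0.5, 1]")
    return a * (b + np.asarray(t, dtype=float)) ** (-gamma)

t = np.arange(100_000)
eps = step_size(t)
print(eps[0], eps[-1])                 # first and last step sizes
print(eps.sum(), (eps ** 2).sum())     # partial sums over the horizon
```

In practice the decay exponent trades off exploration against discretization bias: values near 1/2 keep the steps large for longer, values near 1 shrink them quickly.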

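Complexity is $O(d)$ for diagonal and quasi-diagonal preconditioners and $O(n_{\text{in}}^2 + n_{\text{out}}^2)$ memory per layer (with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs) for K-FAC, and is generally similar to that of adaptive optimizers such as Adam or RMSProp (Palacci et al., 2018, Marceau-Caron et al., 2017, Bhardwaj, 2019).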

5. Empirical Evaluation: Mixing, Regularization, OOD Detection, Robustness

Multiple empirical studies compared SGLD variants under criteria including mixing times, regularization effect, covariate shift detection, and adversarial resistance (Palacci et al., 2018, Marceau-Caron et al., 2017, Bhardwaj, 2019):

Mixing times: All methods (vanilla SGLD, diagonal, K-FAC) display poor chain mixing rates (low effective sample size, ESS) even after many epochs; this is attributed to data subsampling fundamentally breaking detailed balance (Palacci et al., 2018).
Regularization (small data): On reduced-size MNIST (SmallMNIST), SGLD variants underperform SGD due to the dominance of Langevin noise in early convergence, which prevents reaching sharp minima efficiently (Palacci et al., 2018).
Covariate shift / Out-of-distribution detection: On MNIST/notMNIST, all Langevin variants (including NG-SGLD) yield broader, lower-confidence predictive distributions on OOD data, outperforming standard SGD in detection capability (Palacci et al., 2018).
Adversarial robustness: No significant advantage is observed for NG-SGLD over SGD under Fast Gradient Sign Method (FGSM) adversarial attacks; e.g., KSGLD (the K-FAC-preconditioned variant) achieves 2.0% test accuracy on adversarial MNIST vs. 2.9% for SGD (Palacci et al., 2018).
Regularization vs. dropout: On standard MNIST, ensemble predictions from QDOP NG-SGLD nearly match dropout in test accuracy (98.38% vs. 98.61%), but only if full ensembling (not single parameter averaging) is performed (Marceau-Caron et al., 2017).
Comparison to adaptive methods: ASGLD, using a fast diagonal preconditioner, matches the convergence speed of Adam/AdaGrad and achieves generalization equal to SGD on deep architectures, while retaining the efficiency of first-order methods (Bhardwaj, 2019).
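
Effective sample size, the mixing diagnostic referred to above, can be estimated from a chain's autocorrelations as $N / (1 + 2 \sum_k \rho_k)$. The sketch below uses one common heuristic, truncating the sum at the first non-positive autocorrelation; this truncation rule, the function names, and the synthetic AR(1) chain standing in for a sampler trace are assumptions, and the cited studies may use a different estimator.

```python
import numpy as np

def autocorr(x):
    """Normalized autocorrelation of a 1-D chain, computed via FFT."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, n=2 * n)                 # zero-pad to avoid circular wrap-around
    acov = np.fft.irfft(f * np.conj(f))[:n] / n
    return acov / acov[0]

def effective_sample_size(x):
    """ESS = N / (1 + 2 * sum of autocorrelations up to the first non-positive lag)."""
    rho = autocorr(x)
    nonpos = np.where(rho[1:] <= 0)[0]
    cutoff = nonpos[0] + 1 if len(nonpos) else len(rho)
    tau = 1.0 + 2.0 * rho[1:cutoff].sum()
    return len(x) / tau

def ar1_chain(phi, n, rng):
    """Synthetic correlated chain x_t = phi * x_{t-1} + noise."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

rng = np.random.default_rng(0)
N = 20000
ess_corr = effective_sample_size(ar1_chain(0.9, N, rng))
ess_iid = effective_sample_size(rng.standard_normal(N))
print(ess_corr, ess_iid)
```

For the AR(1) chain the theoretical ratio is $(1 - \phi)/(1 + \phi)$, roughly 5% of the nominal sample count at $\phi = 0.9$, which illustrates how strongly correlated traces inflate the apparent number of samples.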

6. Limitations, Practical Guidelines, and Application Domains

Principal points established in the literature include:

Mixing limitations: Improvement in posterior mixing is minimal even using sophisticated preconditioners. Poor chain behavior is attributable to minibatch stochasticity and discretization error, limiting Bayesian guarantees in practice, especially in deep networks (Palacci et al., 2018).
Tuning sensitivity: Performance is highly sensitive to learning rate and metric update scheduling; optimal empirical results demand careful tuning (Marceau-Caron et al., 2017, Palacci et al., 2018).
Ensembling requirement: The Bayesian regularization effect of NG-SGLD is only realized when posterior predictions are averaged across many sampled parameter sets; the posterior mean alone is insufficient, particularly for negative log-likelihood performance (Marceau-Caron et al., 2017).
Cost-benefit analysis: Quasi-diagonal and K-FAC preconditioners enable exploitation of local parameter correlations at increased memory and computational cost; however, diagonal methods scale more gracefully to large models (Marceau-Caron et al., 2017, Palacci et al., 2018, Bhardwaj, 2019). ASGLD exploits this tradeoff, retaining $O(d)$ scaling and matching SGD generalization (Bhardwaj, 2019).
Use cases: SGLD-type methods, including NG-SGLD and ASGLD, are most useful for uncertainty estimation and out-of-distribution detection in large-scale models. For adversarial robustness or improved mixing, these methodologies do not provide practical gains over deterministic optimization (Palacci et al., 2018).
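
The difference between ensembling and predicting from the posterior mean can be illustrated with a one-parameter logistic model. The weight samples below are synthetic Gaussian draws standing in for sampler output, and the input `x_far` is a hypothetical point far from the training range; neither comes from the cited experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) prediction."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

rng = np.random.default_rng(0)
w_samples = rng.normal(2.0, 1.5, size=1000)   # synthetic stand-in for posterior samples

x_far = 5.0                                   # hypothetical out-of-range input
# Ensemble: average the predictions of every sampled parameter.
p_ensemble = sigmoid(w_samples * x_far).mean()
# Point estimate: predict once from the averaged parameter.
p_point = sigmoid(w_samples.mean() * x_far)

print(p_ensemble, binary_entropy(p_ensemble))
print(p_point, binary_entropy(p_point))
```

The averaged parameter yields a near-certain prediction, while the ensemble retains the disagreement among samples and reports higher predictive entropy, which is the signal used for out-of-distribution detection.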

The following table summarizes key findings:

Application	NG-SGLD Benefit	Note
Posterior mixing	Minimal	Low ESS across variants
OOD detection	Substantial	Broader, lower-confidence predictive distributions
Regularization	Mixed	Needs ensembling for Bayesian benefits
Adversarial defense	Negligible	No practical improvement

7. Synthesis and Outlook

Natural-Gradient SGLD presents a principled extension of SGLD to account for parameter-space geometry via (approximate) Fisher preconditioning, with well-established theoretical motivation in both Bayesian inference and information geometry. Recent works provide scalable instantiations—diagonal, quasi-diagonal, and Kronecker-factored (K-FAC)—with practical cost profiles suitable for modern deep learning. The principal advantages are seen in softened confidence estimates and OOD detection. However, the anticipated improvement in sampling quality and adversarial robustness does not materialize in practice, due to limitations inherent to minibatch-based simulation of continuous Langevin dynamics.

Contemporary trends, such as adaptively preconditioned SGLD (ASGLD), bridge the practical efficiency of adaptive optimizers with geometric noise-scaling, achieving accelerated training and strong generalization on standard architectures (Bhardwaj, 2019). Nevertheless, key limitations—sensitivity to hyperparameters, dependence on ensembling for Bayesian effect, and mixing inefficiency—define the current landscape and delimit the practical domain of NG-SGLD techniques (Palacci et al., 2018, Marceau-Caron et al., 2017, Bhardwaj, 2019).