SLANG: Scalable Bayesian Inference

Updated 27 April 2026
  • SLANG is a scalable Bayesian variational inference algorithm that uses a diagonal-plus-low-rank covariance structure for efficient uncertainty quantification in deep neural networks.
  • It employs natural gradient updates and randomized SVD to maintain tractable computations while preserving key posterior correlations.
  • Empirical evaluations on logistic regression, UCI benchmarks, and MNIST classification demonstrate that SLANG outperforms mean-field methods in accuracy and uncertainty calibration.

The acronym SLANG has been used for several distinct computational frameworks targeting different tasks across NLP and machine learning. The most technically prominent among these, "SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient", is a scalable Bayesian variational inference method for deep neural networks that addresses uncertainty quantification with non-diagonal, structured covariance approximations. Formal structure, algorithmic details, complexity, and empirical evidence for this framework are summarized below. For frameworks in slang generation, word formation, and interpretation, see the corresponding references, as the term "SLANG" is overloaded in the literature.

1. Problem Definition and Motivation

The SLANG algorithm (Mishkin et al., 2018) is a method for variational inference (VI) in Bayesian deep learning. Given a large neural model parameterized by $\theta \in \mathbb{R}^D$ and a dataset $\mathcal{D}$, the goal is to approximate the intractable posterior $p(\theta \mid \mathcal{D})$. Standard VI approaches use a Gaussian approximation $q(\theta) = \mathcal{N}(\theta \mid m, \Sigma)$, but a full covariance $\Sigma$ becomes computationally infeasible ($O(D^2)$ storage, $O(D^3)$ updates). Mean-field VI (diagonal $\Sigma$) provides tractability but produces miscalibrated uncertainties. SLANG introduces a structured, tractable posterior with a "diagonal plus low-rank" covariance structure, achieving efficient uncertainty representation and propagation.
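
To make the storage gap concrete, here is a minimal back-of-the-envelope sketch in Python; the network size $D$ and rank $L$ below are illustrative choices, not values from the paper.

```python
# Storage (number of stored values) for a Gaussian posterior over D parameters.
# D and L are hypothetical, illustrative values.
D, L = 1_000_000, 10

full_cov     = D * (D + 1) // 2   # full covariance: ~5e11 values, infeasible
mean_field   = D                  # diagonal covariance: one variance per parameter
diag_lowrank = D + D * L          # diagonal + rank-L factor, as in SLANG

print(f"full covariance    : {full_cov:>16,}")
print(f"mean-field diagonal: {mean_field:>16,}")
print(f"diag + rank-{L}     : {diag_lowrank:>16,}")
```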

2. Mathematical Formulation and Algorithm

The evidence lower bound (ELBO) optimized in SLANG is:

$$\mathcal{L}(m,\Sigma) = \mathbb{E}_{q}\left[\log p(\mathcal{D}\mid\theta)\right] - \mathrm{KL}\left(q(\theta)\,\|\,p(\theta)\right)$$

with Gaussian prior $p(\theta) = \mathcal{N}(\theta \mid 0, \lambda^{-1} I)$. The core variational update applies natural gradients on the Gaussian manifold, recursively updating the mean $m_t$ and the structured precision $\Sigma_t^{-1}$.

The structured precision update at iteration $t$ is:

$$\Sigma_{t+1}^{-1} = (1-\beta_t)\,\Sigma_t^{-1} + \beta_t\left[\hat{G}(\theta_t) + \lambda I\right]$$

where $\hat{G}(\theta)$ is the empirical Fisher matrix:

$$\hat{G}(\theta) = \frac{N}{M}\sum_{i\in\mathcal{M}} g_i(\theta)\, g_i(\theta)^\top$$

for per-example gradients $g_i(\theta)$ over a minibatch $\mathcal{M}$ of size $M$, evaluated at a sample $\theta_t \sim q_t$, with $N$ the dataset size. The approximation $\Sigma_t^{-1} \approx U_t U_t^\top + D_t$ (diagonal $D_t$ plus a low-rank factor $U_t \in \mathbb{R}^{D\times L}$) is maintained at every iteration. The low-rank term is kept at rank $L$ via a truncated randomized eigendecomposition (randomized SVD) of the structured part $(1-\beta_t)\,U_t U_t^\top + \beta_t\,\hat{G}(\theta_t)$, and the diagonal is corrected so that the diagonal (marginal precision) of the approximation matches the unapproximated update:

$$D_{t+1} = (1-\beta_t)\,D_t + \beta_t\,\lambda I + \Delta_{t+1}$$

where $\Delta_{t+1} = \mathrm{diag}\!\left[(1-\beta_t)\,U_t U_t^\top + \beta_t\,\hat{G}(\theta_t) - U_{t+1} U_{t+1}^\top\right]$. The mean update is:

$$m_{t+1} = m_t - \alpha_t\left(U_{t+1} U_{t+1}^\top + D_{t+1}\right)^{-1}\left[\hat{g}(\theta_t) + \lambda\, m_t\right]$$

with step sizes $\alpha_t, \beta_t > 0$ and stochastic gradient $\hat{g}(\theta) = \frac{N}{M}\sum_{i\in\mathcal{M}} g_i(\theta)$. Efficient Woodbury identities and randomized SVD are used for sampling and updates (Mishkin et al., 2018).
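
The update above can be sketched in a few lines of NumPy. The following is a minimal, illustrative single-step implementation, not the authors' reference code: it assumes $D$ is small enough that the structured part can be eigendecomposed densely (the paper replaces this with a randomized routine), and the function and variable names are our own.

```python
import numpy as np

def slang_step(m, U, d, grads, N, lam, alpha, beta):
    """One illustrative diagonal-plus-low-rank natural-gradient step (SLANG-style).

    m     : (D,)   current mean
    U     : (D, L) low-rank factor of the precision
    d     : (D,)   diagonal part of the precision
    grads : (M, D) per-example gradients of the negative log-likelihood at a
                   sample theta ~ N(m, (U U^T + diag(d))^{-1})
    N     : dataset size; lam : prior precision; alpha, beta : step sizes
    """
    L = U.shape[1]
    M = grads.shape[0]

    # Empirical-Fisher factor: G_hat = (N/M) * grads^T grads = B @ B.T
    B = np.sqrt(N / M) * grads.T                                  # (D, M)

    # Structured part (1-beta) U U^T + beta G_hat has rank <= L + M.
    # A dense eigendecomposition stands in for the randomized routine used at scale.
    W = np.hstack([np.sqrt(1.0 - beta) * U, np.sqrt(beta) * B])   # (D, L+M)
    evals, evecs = np.linalg.eigh(W @ W.T)
    top = np.argsort(evals)[::-1][:L]
    U_new = evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))  # keep rank L

    # Diagonal correction: keep the diagonal of the structured precision exact.
    delta = np.einsum("ij,ij->i", W, W) - np.einsum("ij,ij->i", U_new, U_new)
    d_new = (1.0 - beta) * d + beta * lam + delta

    # Mean update: apply (U U^T + diag(d))^{-1} via the Woodbury identity.
    v = (N / M) * grads.sum(axis=0) + lam * m                     # g_hat + lam * m
    Dinv_v = v / d_new
    Dinv_U = U_new / d_new[:, None]
    K = np.eye(L) + U_new.T @ Dinv_U                              # (L, L)
    m_new = m - alpha * (Dinv_v - Dinv_U @ np.linalg.solve(K, U_new.T @ Dinv_v))

    return m_new, U_new, d_new
```

At scale, the $D \times D$ matrix is never formed; the factored matrix $W$ is passed directly to a randomized eigenvalue solver, as sketched in the next section.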

3. Computational Complexity and Implementation

SLANG offers a favorable tradeoff between accuracy and computational cost, summarized in the table below:

Method              Memory      Time per update      Covariance
Mean-field          $O(D)$      $O(DM)$              Diagonal
Full-covariance VI  $O(D^2)$    $O(D^2 M + D^3)$     Full
SLANG (rank $L$)    $O(DL)$     $O(DL^2 + DLM)$      Diagonal + rank-$L$

Here $D$ is the parameter dimension, $M$ the minibatch size, and $L$ the chosen rank; per-update costs also scale linearly in the number of Monte Carlo samples $S$. The rank $L$ is typically small relative to $D$ (single digits to a few tens); bottlenecks are mitigated by per-example/minibatch gradient tricks and randomized SVD (Mishkin et al., 2018).
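
As an illustration of how the randomized step avoids any $O(D^2)$ object, the following generic randomized range-finder sketch (not the paper's exact routine; names are illustrative) computes the top-$L$ eigenpairs of a matrix supplied in factored form $W W^\top$:

```python
import numpy as np

def top_l_eig_factored(W, L, oversample=5, seed=0):
    """Approximate top-L eigenpairs of A = W @ W.T (W is D x K with K << D)
    without ever forming the D x D matrix A; cost is roughly O(D * K * L)."""
    rng = np.random.default_rng(seed)
    D, K = W.shape
    k = min(L + oversample, K)

    # Randomized range finder: sample the column space of A.
    Omega = rng.standard_normal((D, k))
    Y = W @ (W.T @ Omega)                      # = A @ Omega, never forms A
    Q, _ = np.linalg.qr(Y)                     # (D, k) orthonormal basis

    # Project to a small k x k problem, then lift the eigenvectors back.
    C = W.T @ Q                                # (K, k)
    evals, evecs = np.linalg.eigh(C.T @ C)     # eigendecomposition of Q^T A Q
    order = np.argsort(evals)[::-1][:L]
    return evals[order], Q @ evecs[:, order]
```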

4. Empirical Performance

SLANG was evaluated on:

  • Bayesian logistic regression (LIBSVM datasets): SLANG outperforms mean-field VI in ELBO, test log-loss, and symmetric KL divergence to the full-covariance Gaussian approximation. As the rank $L$ increases, performance closely tracks full-covariance VI.
  • UCI regression benchmarks (one-hidden-layer BNN): SLANG with a small rank (as low as $L = 1$) outperforms Bayes by Backprop (BBB) on 7/8 datasets in RMSE and on 5/8 in log-likelihood.
  • MNIST classification (two hidden layers): SLANG attains lower test error than BBB at comparable computational cost.

SLANG combines improved uncertainty quantification (over mean-field) with practical scalability on large, high-dimensional problems (Mishkin et al., 2018).

5. Theoretical Insights and Algorithmic Design

SLANG uses the empirical Fisher matrix as a positive semi-definite approximation to the Hessian (in the spirit of the generalized Gauss-Newton method), which yields natural-gradient updates over the Gaussian variational family. The diagonal plus low-rank structure targets the dominant posterior correlations, which are often few in number compared to the model dimensionality, thus preserving the key uncertainty directions without incurring quadratic or cubic scaling.

Randomized SVD is used to extract the leading eigenmodes at each update. A diagonal correction keeps the diagonal of the structured precision exact, so calibration is not lost to the low-rank truncation. Woodbury-based sampling provides tractable parameter draws for propagating the predictive posterior. SLANG also allows separate learning rates for the mean and precision, and supports classic step decay and momentum on the mean (Mishkin et al., 2018).
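
The Woodbury-based sampling mentioned above can be sketched as follows. This is an illustrative NumPy version (the names and exact factorization details are assumptions, not the paper's reference implementation) that draws from $\mathcal{N}(m, \Sigma)$ with $\Sigma^{-1} = U U^\top + \mathrm{diag}(d)$ at cost linear in $D$:

```python
import numpy as np

def sample_diag_plus_lowrank(m, U, d, n_samples=1, seed=0):
    """Draw theta ~ N(m, Sigma) with Sigma^{-1} = U U^T + diag(d),
    using O(D L^2) work instead of an O(D^3) Cholesky of Sigma."""
    rng = np.random.default_rng(seed)
    D, L = U.shape

    # Whiten by the diagonal: Sigma^{-1} = diag(d)^{1/2} (I + V V^T) diag(d)^{1/2}.
    V = U / np.sqrt(d)[:, None]                        # (D, L)

    # Thin SVD of V gives (I + V V^T)^{-1/2} = I + P ((I + S^2)^{-1/2} - I) P^T.
    P, S, _ = np.linalg.svd(V, full_matrices=False)    # P: (D, L), S: (L,)
    scale = 1.0 / np.sqrt(1.0 + S**2) - 1.0            # (L,)

    eps = rng.standard_normal((n_samples, D))
    corrected = eps + ((eps @ P) * scale) @ P.T        # apply (I + V V^T)^{-1/2}
    return m + corrected / np.sqrt(d)                  # then diag(d)^{-1/2}, shift by m
```

For each batch of samples the dominant cost is the thin SVD of $V$, which is $O(DL^2)$.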

6. Relation to Other SLANG Frameworks and Model Extensions

Other algorithms under the "SLANG" label include:

  • SLANG for Slang Generation (Sun et al., 2021): A probabilistic (Bayesian) model of slang word choice in context, combining a contrastive semantic encoder with syntactic and contextual priors; not related to structured covariance VI.
  • SLANG for Slang Word Formation (Kulkarni et al., 2018): Generative models for blends, clippings, and reduplicatives in English, focusing on structural and phonological word-formation—distinct from Bayesian VI.
  • SLANG Benchmark for LLM Concept Comprehension (Mei et al., 2024): Autonomous evaluation suite for emergent concept comprehension in LLMs, unrelated to variational inference.
  • Semantically Informed Slang Interpretation (SSI) (Sun et al., 2022): Neural models for interpreting slang via semantic extensions; again, distinct in scope and method.

SLANG (covariance approximation) is most directly related to scalable Bayesian inference and uncertainty quantification, providing unique advantages over existing methods in computational tractability and calibration for deep networks (Mishkin et al., 2018).

7. Summary and Societal Impact

In summary, SLANG (Mishkin et al., 2018) addresses the challenge of representing posterior uncertainty in large-scale deep Bayesian models, improving over mean-field approximations by incorporating low-rank corrections at near-linear cost in parameter dimension. Empirically, SLANG provides accurate, robust performance on multiple benchmarks. A plausible implication is that advanced structured VI methods such as SLANG may set a new standard for scalable uncertainty-aware learning, particularly as model scales continue to increase in practice.
