SLANG: Scalable Bayesian Inference
- SLANG is a scalable Bayesian variational inference algorithm that uses a diagonal-plus-low-rank covariance structure for efficient uncertainty quantification in deep neural networks.
- It employs natural gradient updates and randomized SVD to maintain tractable computations while preserving key posterior correlations.
- Empirical evaluations on logistic regression, UCI benchmarks, and MNIST classification demonstrate that SLANG outperforms mean-field methods in accuracy and uncertainty calibration.
Several distinct computational frameworks share the acronym SLANG, targeting different tasks across NLP and machine learning. The most technically prominent, "SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient" (Mishkin et al., 2018), is a scalable Bayesian variational inference method for deep neural networks that addresses uncertainty quantification with non-diagonal, structured covariance approximations. Its formal structure, algorithmic details, complexity, and empirical evidence are summarized below. For frameworks in slang generation, word formation, and interpretation, see the corresponding references; the term "SLANG" is overloaded in the literature.
1. Problem Definition and Motivation
The SLANG algorithm (Mishkin et al., 2018) is a method for variational inference (VI) in Bayesian deep learning. Given a large neural model with parameters $\theta \in \mathbb{R}^D$ and dataset $\mathcal{D}$, the goal is to approximate the intractable posterior $p(\theta \mid \mathcal{D})$. Standard VI approaches use a Gaussian approximation $q(\theta) = \mathcal{N}(\theta \mid \mu, \Sigma)$, but a full covariance $\Sigma$ becomes computationally infeasible ($O(D^2)$ storage, $O(D^3)$ updates). Mean-field VI (diagonal $\Sigma$) is tractable but produces miscalibrated uncertainties. SLANG introduces a structured, tractable posterior whose precision (inverse covariance) has a "diagonal plus low-rank" form, achieving efficient uncertainty representation and propagation.
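To make the scaling concrete, here is a back-of-the-envelope storage comparison for a million-parameter model in float32; the rank $L = 32$ is one illustrative choice, and the byte counts are rough estimates, not figures from the paper:

```python
# Rough storage comparison for a 1-million-parameter network, float32 (4 bytes).
D, L = 10**6, 32

full_cov_bytes = 4 * D * D           # dense D x D covariance
mean_field_bytes = 4 * D             # one variance per parameter
slang_bytes = 4 * (D + D * L)        # diagonal plus a D x L low-rank factor

print(f"full covariance : {full_cov_bytes / 1e12:.1f} TB")
print(f"mean-field      : {mean_field_bytes / 1e6:.1f} MB")
print(f"SLANG (L=32)    : {slang_bytes / 1e6:.1f} MB")
```

The dense covariance needs terabytes of memory, while the structured alternative stays within a single GPU's budget.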
2. Mathematical Formulation and Algorithm
The evidence lower bound (ELBO) optimized in SLANG is:

$$\mathcal{L}(\mu, \Sigma) = \mathbb{E}_{q}\!\left[\sum_{i=1}^{N} \log p(\mathcal{D}_i \mid \theta)\right] - \mathrm{KL}\!\left[\,q(\theta)\,\|\,p(\theta)\,\right],$$

with Gaussian prior $p(\theta) = \mathcal{N}(\theta \mid 0, \lambda^{-1} I)$. The core variational update is a natural-gradient step on the Gaussian manifold, recursively updating the mean $\mu_t$ and the structured precision $\Sigma_t^{-1} = U_t U_t^\top + D_t$.

The structured precision update at iteration $t$ is:

$$\Sigma_{t+1}^{-1} = (1 - \beta_t)\,\Sigma_t^{-1} + \beta_t \left(\hat{G}(\theta_t) + \lambda I\right),$$

where $\hat{G}(\theta)$ is the empirical Fisher matrix:

$$\hat{G}(\theta) = \frac{N}{M} \sum_{i \in \mathcal{M}} g_i(\theta)\, g_i(\theta)^\top,$$

for minibatch gradients $g_i(\theta) = \nabla_\theta \log p(\mathcal{D}_i \mid \theta)$ evaluated at a sample $\theta_t \sim q_t$. The approximation $\Sigma_{t+1}^{-1} \approx U_{t+1} U_{t+1}^\top + D_{t+1}$ (diagonal plus low-rank, rank $L$) is maintained at each iteration. The low-rank term is kept at rank $L$ via a truncated randomized eigendecomposition of the structured part $(1 - \beta_t)\, U_t U_t^\top + \beta_t\, \hat{G}(\theta_t)$, and the diagonal is corrected to match the marginals of the exact update:

$$D_{t+1} = (1 - \beta_t)\, D_t + \beta_t\, \lambda I + \Delta_t,$$

where $\Delta_t = \operatorname{diag}\!\left[(1 - \beta_t)\, U_t U_t^\top + \beta_t\, \hat{G}(\theta_t)\right] - \operatorname{diag}\!\left[U_{t+1} U_{t+1}^\top\right]$. The mean update is:

$$\mu_{t+1} = \mu_t - \alpha_t\, \Sigma_{t+1} \left(\hat{g}(\theta_t) + \lambda \mu_t\right),$$

where $\hat{g}(\theta_t) = -\tfrac{N}{M} \sum_{i \in \mathcal{M}} g_i(\theta_t)$ is the stochastic gradient of the expected negative log-likelihood, with step sizes $\alpha_t, \beta_t > 0$. Efficient Woodbury identities and randomized SVD are used for sampling and updates (Mishkin et al., 2018).
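The steps above can be sketched in NumPy on a toy linear-regression likelihood. This is an illustrative reconstruction, not the paper's code: all variable names and the toy model are ours, the rank-$L$ truncation uses an exact thin SVD in place of the paper's randomized eigendecomposition, and sampling inverts the precision directly (fine at toy scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative only): D params, N data points, minibatch M, rank L.
D, N, M, L = 20, 200, 8, 3
lam, alpha, beta = 1.0, 0.05, 0.1        # prior precision and step sizes

# Synthetic linear-regression data.
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# Variational state: mean mu, precision Sigma^{-1} = U U^T + diag(d).
mu = np.zeros(D)
U = 0.1 * rng.normal(size=(D, L))
d = np.ones(D)

def woodbury_solve(U, d, v):
    """Apply (diag(d) + U U^T)^{-1} to v without forming a D x D matrix."""
    Dinv_U = U / d[:, None]
    S = np.eye(U.shape[1]) + U.T @ Dinv_U          # small L x L system
    return v / d - Dinv_U @ np.linalg.solve(S, Dinv_U.T @ v)

def slang_step(mu, U, d):
    # 1) Sample theta ~ q_t (direct inversion here; the paper instead
    #    samples through Woodbury-type identities).
    Sigma = np.linalg.inv(U @ U.T + np.diag(d))
    theta = rng.multivariate_normal(mu, Sigma)

    # 2) Per-example minibatch gradients of the negative log-likelihood.
    idx = rng.choice(N, size=M, replace=False)
    resid = X[idx] @ theta - y[idx]
    G = X[idx] * resid[:, None]                    # M x D, row i is g_i^T
    A = np.sqrt(N / M) * G.T                       # D x M factor: Ghat = A A^T

    # 3) Rank-L truncation of (1-beta) U U^T + beta Ghat via SVD of the
    #    thin D x (L+M) factor B, so no D x D matrix is built.
    B = np.hstack([np.sqrt(1 - beta) * U, np.sqrt(beta) * A])
    W, s, _ = np.linalg.svd(B, full_matrices=False)
    U_new = W[:, :L] * s[:L]

    # 4) Diagonal correction: keep the diagonal of the approximate
    #    precision equal to the diagonal of the exact update.
    delta = (B ** 2).sum(axis=1) - (U_new ** 2).sum(axis=1)
    d_new = (1 - beta) * d + beta * lam + delta

    # 5) Natural-gradient mean step through the new precision.
    ghat = (N / M) * G.sum(axis=0)
    mu_new = mu - alpha * woodbury_solve(U_new, d_new, ghat + lam * mu)
    return mu_new, U_new, d_new

for _ in range(100):
    mu, U, d = slang_step(mu, U, d)
```

Note that the diagonal correction `delta` is nonnegative by construction (it is the diagonal of a positive-semidefinite residual), so the diagonal `d` stays positive throughout.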
3. Computational Complexity and Implementation
SLANG occupies a favorable point on the accuracy-cost tradeoff, summarized in the table below:

| Method | Memory | Time per update | Covariance structure |
|---|---|---|---|
| Mean-field | $O(D)$ | $O(DM)$ | Diagonal |
| Full-cov. VI | $O(D^2)$ | $O(D^3)$ | Full |
| SLANG (rank $L$) | $O(DL)$ | $O\big(DL(L+M)\big)$ | Diagonal + rank-$L$ |

Here $D$ is the parameter dimension, $M$ the minibatch size, $S$ the number of MC samples, and $L$ the chosen rank. The rank $L$ is kept small relative to $D$; remaining bottlenecks are mitigated by per-example minibatch-gradient tricks and randomized SVD (Mishkin et al., 2018).
4. Empirical Performance
SLANG was evaluated on:
- Bayesian logistic regression (LIBSVM datasets): SLANG outperforms mean-field VI in ELBO, test log-loss, and symmetric KL to the full-covariance posterior. As the rank $L$ increases, performance closely tracks full-covariance VI.
- UCI regression benchmarks (one-hidden-layer BNN): SLANG, even at low rank, outperforms Bayes by Backprop (BBB) on 7/8 datasets in RMSE and 5/8 in log-likelihood.
- MNIST classification (two hidden layers): SLANG attains lower test error than BBB.
SLANG combines improved uncertainty quantification (over mean-field) with practical scalability on large, high-dimensional problems (Mishkin et al., 2018).
5. Theoretical Insights and Algorithmic Design
SLANG leverages the empirical Fisher matrix as a positive-definite approximation to the Hessian (Generalized Gauss-Newton), which aligns with the natural-gradient direction in parameter space. The diagonal plus low-rank structure targets the dominant posterior correlations—often few in number compared to model dimensionality—thus preserving key uncertainty directions without incurring quadratic or cubic scaling.
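One structural reason the low-rank approximation is natural: a minibatch empirical Fisher built from $M$ per-example gradient outer products is positive semi-definite and has rank at most $M$, regardless of $D$. A quick check with stand-in random gradients (the matrix `G` here is illustrative, not from a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 50, 8, 400                    # params, minibatch size, dataset size

# Stand-in per-example gradients g_i (rows); in SLANG these come from backprop.
G = rng.normal(size=(M, D))
G_hat = (N / M) * G.T @ G               # empirical Fisher from the minibatch

# Positive semi-definite by construction, and rank at most M << D.
eigvals = np.linalg.eigvalsh(G_hat)
rank = np.linalg.matrix_rank(G_hat)
```

Since each update mixes a rank-$M$ Fisher estimate into the running precision, truncating back to a modest rank $L$ discards little of the per-step information.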
Randomized SVD extracts the leading eigenmodes at each update. A diagonal correction maintains the marginals of the precision, ensuring calibration is not lost in the truncation. Woodbury-based sampling provides tractable parameter draws for predictive posterior propagation. SLANG also permits separate learning rates for the mean and precision, and supports classic step decay and momentum on the mean (Mishkin et al., 2018).
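The Woodbury-based sampling can be sketched as follows. Assuming $\Sigma^{-1} = U U^\top + \operatorname{diag}(d)$, samples from $\mathcal{N}(\mu, \Sigma)$ can be drawn via a small $L \times L$ eigenproblem at roughly $O(DL)$ cost per draw, never forming a $D \times D$ matrix. The function name and the small-eigenvalue guard are ours; this is a sketch of the identity, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
D, L = 30, 4                             # illustrative sizes
mu = rng.normal(size=D)
U = rng.normal(size=(D, L))
d = 1.0 + rng.random(size=D)             # positive diagonal of the precision

def sample_lowrank_gaussian(mu, U, d, n, rng):
    """Draw n samples from N(mu, Sigma) with Sigma^{-1} = U U^T + diag(d),
    via a Woodbury-style matrix square root; no D x D matrix is formed."""
    V = U / np.sqrt(d)[:, None]                      # whitened factor, D x L
    lam_, Q = np.linalg.eigh(V.T @ V)                # small L x L eigenproblem
    # Per-eigenvalue scaling chosen so the implied factor squares to Sigma;
    # the maximum() guards the (harmless) lam_ = 0 case.
    f = (1.0 / np.sqrt(1.0 + lam_) - 1.0) / np.maximum(lam_, 1e-12)
    eps = rng.normal(size=(n, D))                    # standard normal draws
    T = ((eps @ V) @ Q) * f                          # n x L coefficients
    return mu + (eps + (T @ Q.T) @ V.T) / np.sqrt(d)

samples = sample_lowrank_gaussian(mu, U, d, 5, rng)
```

The key identity is $\Sigma = D^{-1/2}\big(I - V (I + V^\top V)^{-1} V^\top\big) D^{-1/2}$ with $V = D^{-1/2} U$, whose symmetric square root only requires the eigendecomposition of the $L \times L$ matrix $V^\top V$.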
6. Relation to Other SLANG Frameworks and Model Extensions
Other algorithms under the "SLANG" label include:
- SLANG for Slang Generation (Sun et al., 2021): Probabilistic Bayesian choice with contrastive semantic encoder, syntactic and contextual priors for slang word choice in context, not related to structured covariance VI.
- SLANG for Slang Word Formation (Kulkarni et al., 2018): Generative models for blends, clippings, and reduplicatives in English, focusing on structural and phonological word-formation—distinct from Bayesian VI.
- SLANG Benchmark for LLM Concept Comprehension (Mei et al., 2024): Autonomous evaluation suite for emergent concept comprehension in LLMs, unrelated to variational inference.
- Semantically Informed Slang Interpretation (SSI) (Sun et al., 2022): Neural models for interpreting slang via semantic extensions; again, distinct in scope and method.
SLANG (the covariance-approximation algorithm) is most directly related to scalable Bayesian inference and uncertainty quantification, offering a favorable combination of computational tractability and calibration for deep networks relative to existing methods (Mishkin et al., 2018).
7. Summary and Societal Impact
In summary, SLANG (Mishkin et al., 2018) addresses the challenge of representing posterior uncertainty in large-scale deep Bayesian models, improving over mean-field approximations by incorporating low-rank corrections at near-linear cost in parameter dimension. Empirically, SLANG provides accurate, robust performance on multiple benchmarks. A plausible implication is that advanced structured VI methods such as SLANG may set a new standard for scalable uncertainty-aware learning, particularly as model scales continue to increase in practice.