Bayesian Computation in Deep Learning (2502.18300v3)
Abstract: This review paper is intended for the 2nd edition of the Handbook of Markov chain Monte Carlo. We provide an introduction to approximate inference techniques as Bayesian computation methods applied to deep learning models. We organize the chapter by presenting popular computational methods for Bayesian neural networks and deep generative models, explaining their unique challenges in posterior inference as well as the solutions.
Summary
- The paper surveys Bayesian computation techniques in deep learning, focusing on addressing uncertainty in Bayesian Neural Networks and training Deep Generative Models via approximate inference methods.
- It details approximate inference for Bayesian Neural Networks, covering methods like Stochastic Gradient MCMC and Variational Inference, explaining their optimization and approaches for improved posterior approximation.
- The survey discusses applying Bayesian computation to Deep Generative Models such as Energy-Based Models, score-based models, diffusion models, and deep latent variable models, outlining relevant training and sampling techniques.
The paper provides an overview of Bayesian computation techniques within deep learning, addressing uncertainty quantification in deep neural networks (DNNs) and training deep generative models (DGMs). It emphasizes the use of approximate inference methods to handle the challenges posed by high-dimensional data, non-linear likelihood functions, and large datasets.
The paper is organized into two main sections: Bayesian Neural Networks (BNNs) and Deep Generative Models.
Bayesian Neural Networks
The section on BNNs discusses Markov Chain Monte Carlo (MCMC) and Variational Inference (VI) techniques applied to BNNs, highlighting the use of stochastic gradient-based optimization for scalability.
Key points covered include:
- Introduction of BNNs as a principled framework for obtaining reliable uncertainty estimates in DNNs, contrasting them with standard DNNs.
- Discussion of the intractability of the exact posterior distribution in BNNs and the need for approximate inference techniques.
- Explanation of Stochastic Gradient MCMC (SG-MCMC) methods, including Stochastic Gradient Langevin Dynamics (SGLD), Cyclical SG-MCMC, and Stochastic Gradient Hamiltonian Monte Carlo.
- Detailed explanation of VI, including the Evidence Lower Bound (ELBO) optimization, mean-field Gaussian approximation, and the reparameterization trick. Mean-field VI for BNNs doubles the number of parameters to be optimized (a mean and a variance per weight).
- Exploration of alternative divergences like α-divergence to mitigate underestimation of posterior uncertainty and improve posterior mass coverage, where the α-divergence is defined as:
$D_\alpha[p(\theta\mid D)\,\|\,q_\phi(\theta)] = \frac{1}{\alpha-1}\log\int p(\theta\mid D)^{\alpha}\, q_\phi(\theta)^{1-\alpha}\, d\theta$
where:
- Dα[p(θ∣D)∥qϕ(θ)] is the α-divergence between the posterior distribution p(θ∣D) and the approximate distribution qϕ(θ).
- α is a hyper-parameter that controls the behavior of the divergence.
- p(θ∣D) is the posterior distribution of the model parameters θ given the data D.
- qϕ(θ) is the approximate posterior distribution parameterized by ϕ.
- Discussion of distribution families for VI, including low-rank + diagonal covariance structures, matrix normal distributions, and non-Gaussian posterior approximations using DNNs as flexible transformations.
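The SG-MCMC recipe above can be illustrated with a minimal SGLD sketch. The example below is a toy illustration, not the paper's method: it runs SGLD on Bayesian linear regression (a zero-hidden-layer stand-in for a BNN) with a standard Gaussian prior, minibatch likelihood gradients rescaled by N/batch, and injected noise whose variance equals the step size. All constants and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X @ w_true + observation noise (sigma = 0.1).
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

def grad_log_post(w, xb, yb, scale):
    # Unbiased estimate of the log-posterior gradient:
    # rescaled minibatch likelihood term plus the prior term.
    resid = yb - xb @ w
    grad_lik = scale * (xb.T @ resid) / 0.1**2  # Gaussian likelihood
    grad_prior = -w                              # standard Gaussian prior
    return grad_lik + grad_prior

w = np.zeros(D)
samples = []
batch, eps = 32, 1e-5  # fixed small step size for the sketch
for k in range(2000):
    idx = rng.choice(N, size=batch, replace=False)
    g = grad_log_post(w, X[idx], y[idx], scale=N / batch)
    # SGLD update: half-step along the stochastic gradient of the
    # log posterior, plus Gaussian noise with variance eps.
    w = w + 0.5 * eps * g + np.sqrt(eps) * rng.normal(size=D)
    if k >= 1000:  # discard burn-in
        samples.append(w.copy())

w_mean = np.mean(samples, axis=0)  # posterior-mean estimate
```

With a well-specified model the posterior concentrates around `w_true`, so the sample mean lands close to it; in a real BNN the same update is applied to all network weights, with the step size typically decayed over iterations.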
Deep Generative Models
The section on DGMs discusses the application of Bayesian computation to energy-based models, score-based models, diffusion models, and deep latent variable models.
Key points covered include:
- Overview of DGMs and their categorization into models directly parameterizing the data distribution and models using latent variables.
- Discussion of Energy-Based Models (EBMs) and the challenges of maximum likelihood estimation due to the intractable normalizing constant Zθ: the gradient of log pθ(x) contains the intractable term ∇θ log Zθ, since $\nabla_\theta\log p_\theta(\bm{x})=-\nabla_\theta E_\theta(\bm{x})-\nabla_\theta\log Z_\theta$.
- Explanation of Langevin sampling for generating samples from EBMs, where the update rule for x is:
$\bm{x}_{k+1}\leftarrow\bm{x}_k-\alpha_k\nabla_{\bm{x}_k}E_\theta(\bm{x}_k)+\sqrt{2\alpha_k}\,\bm{\epsilon}_k$,
where:
- xk+1 is the updated sample at iteration k+1.
- xk is the current sample at iteration k.
- αk is the step size at iteration k.
- ∇xkEθ(xk) is the gradient of the energy function Eθ with respect to xk.
- ϵk∼N(0,I) is a standard Gaussian noise vector at iteration k.
- Discussion of score matching for training score-based models, including denoising score matching and noise conditional score networks.
- Introduction of diffusion models and their connection to score-based models and stochastic differential equations (SDEs).
- Explanation of reverse-time SDEs for data generation in diffusion models, which is:
$d\bm{x}=[\bm{f}(\bm{x},t)-g(t)^2\nabla_{\bm{x}}\log q_t(\bm{x})]dt+g(t)d\bar{\bm{w}}$,
where:
- dx is the infinitesimal change in x.
- f(x,t) is the drift function.
- g(t) is the diffusion function.
- $\nabla_{\bm{x}}\log q_t(\bm{x})$ is the score function.
- dt is an infinitesimal negative time step.
- $d\bar{\bm{w}}$ is a standard Wiener process when time flows backward.
- Discussion of deep latent variable models (DLVMs), variational autoencoders (VAEs), and amortized inference. The model parameters θ of a DLVM are also estimated via maximum likelihood.
- Explanation of the ELBO objective in VAEs and the use of the reparameterization trick for training.
- Discussion of combining VI and MCMC in DLVMs to improve the accuracy of posterior samples.
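The Langevin update used for EBM sampling can be sketched in a few lines. The example assumes a simple quadratic energy as a stand-in for a learned Eθ, chosen so that the stationary distribution is a known 2D Gaussian and the result can be checked; the step size and iteration counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def energy_grad(x):
    # Quadratic stand-in energy: E(x) = ||x - mu||^2 / (2 s^2), s = 0.5,
    # so exp(-E) is a Gaussian with mean mu. grad E = (x - mu) / s^2.
    mu = np.array([1.0, -1.0])
    return (x - mu) / 0.5**2

x = rng.normal(size=2)          # initialize the chain from noise
alpha = 1e-3                    # step size alpha_k (constant here)
samples = []
for k in range(20000):
    eps_k = rng.normal(size=2)  # standard Gaussian noise epsilon_k
    # Langevin update: x_{k+1} <- x_k - alpha * grad E(x_k)
    #                           + sqrt(2 * alpha) * eps_k
    x = x - alpha * energy_grad(x) + np.sqrt(2 * alpha) * eps_k
    if k >= 5000:               # discard burn-in
        samples.append(x.copy())

mean_est = np.mean(samples, axis=0)  # should approach mu = [1, -1]
```

In an actual EBM, `energy_grad` is a backward pass through the energy network, and the small but nonzero step size introduces the discretization bias that motivates Metropolis-adjusted variants.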
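Likewise, the reparameterization trick behind VAE training can be illustrated without any deep learning framework: writing z = μ + σε with ε ~ N(0, I) moves the expectation under qϕ onto ε, so gradients pass through the samples. The integrand f(z) = z² below is an illustrative stand-in for the ELBO integrand, with known analytic gradients to compare against.

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma = 1.5, 0.8
eps = rng.normal(size=100_000)
z = mu + sigma * eps  # reparameterized samples from N(mu, sigma^2)

# Pathwise (reparameterization) gradient estimates of E_q[z^2]:
# df/dmu = 2z * dz/dmu = 2z, and df/dsigma = 2z * dz/dsigma = 2z * eps.
grad_mu = np.mean(2 * z)           # analytic value: 2 * mu
grad_sigma = np.mean(2 * z * eps)  # analytic value: 2 * sigma
```

Since E[z²] = μ² + σ², the true gradients are 2μ and 2σ; the Monte Carlo estimates match them up to sampling noise, which is what lets VAEs backpropagate the ELBO through the encoder's sampling step.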
The paper concludes by discussing research challenges, such as addressing weight symmetries in DNNs and understanding the bias in training deep generative models with approximate posterior inference.