
Bayesian Computation in Deep Learning (2502.18300v3)

Published 25 Feb 2025 in cs.LG and stat.ML

Abstract: This review paper is intended for the 2nd edition of the Handbook of Markov chain Monte Carlo. We provide an introduction to approximate inference techniques as Bayesian computation methods applied to deep learning models. We organize the chapter by presenting popular computational methods for Bayesian neural networks and deep generative models, explaining their unique challenges in posterior inference as well as the solutions.

Summary

  • The paper surveys Bayesian computation techniques in deep learning, focusing on addressing uncertainty in Bayesian Neural Networks and training Deep Generative Models via approximate inference methods.
  • It details approximate inference for Bayesian Neural Networks, covering methods like Stochastic Gradient MCMC and Variational Inference, explaining their optimization and approaches for improved posterior approximation.
  • The survey discusses applying Bayesian computation to Deep Generative Models such as Energy-Based Models, score-based models, diffusion models, and deep latent variable models, outlining relevant training and sampling techniques.

The paper provides an overview of Bayesian computation techniques within deep learning, addressing uncertainty quantification in deep neural networks (DNNs) and training deep generative models (DGMs). It emphasizes the use of approximate inference methods to handle the challenges posed by high-dimensional data, non-linear likelihood functions, and large datasets.

The paper is organized into two main sections: Bayesian Neural Networks (BNNs) and Deep Generative Models.

Bayesian Neural Networks

The section on BNNs discusses Markov Chain Monte Carlo (MCMC) and Variational Inference (VI) techniques applied to BNNs, highlighting the use of stochastic gradient-based optimization for scalability.

Key points covered include:

  • Introduction of BNNs as a principled framework for obtaining reliable uncertainty estimates in DNNs, contrasting them with standard DNNs.
  • Discussion of the intractability of the exact posterior distribution in BNNs and the need for approximate inference techniques.
  • Explanation of Stochastic Gradient MCMC (SG-MCMC) methods, including Stochastic Gradient Langevin Dynamics (SGLD), Cyclical SG-MCMC, and Stochastic Gradient Hamiltonian Monte Carlo; a minimal SGLD sketch appears after this list.
  • Detailed explanation of VI, including Evidence Lower Bound (ELBO) optimization, the mean-field Gaussian approximation, and the reparameterization trick (see the ELBO sketch after this list). Mean-field VI for BNNs doubles the number of parameters to be optimized, since each weight requires both a mean and a variance.
  • Exploration of alternative divergences such as the $\alpha$-divergence to mitigate underestimation of posterior uncertainty and improve posterior mass coverage, where the $\alpha$-divergence is defined as:

    $$D_{\alpha}[\, p(\theta|D) \,\|\, q_{\phi}(\theta)\,] = \frac{1}{\alpha-1} \log \int p(\theta|D)^{\alpha}\, q_{\phi}(\theta)^{1-\alpha}\, d\theta$$

    where:

    • $D_{\alpha}[\, p(\theta|D) \,\|\, q_{\phi}(\theta)\,]$ is the $\alpha$-divergence between the posterior distribution $p(\theta|D)$ and the approximate distribution $q_{\phi}(\theta)$.
    • $\alpha$ is a hyperparameter that controls the behavior of the divergence.
    • $p(\theta|D)$ is the posterior distribution of the model parameters $\theta$ given the data $D$.
    • $q_{\phi}(\theta)$ is the approximate posterior distribution parameterized by $\phi$.
  • Discussion of distribution families for VI, including low-rank + diagonal covariance structures, matrix normal distributions, and non-Gaussian posterior approximations using DNNs as flexible transformations.
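
The SGLD update adds Gaussian noise, scaled to the step size, to a stochastic gradient ascent step on the log-posterior, with the minibatch log-likelihood rescaled to be an unbiased estimate of the full-data term. Below is a minimal PyTorch sketch of one SGLD step; `log_prior` and `minibatch_log_lik` are hypothetical callables standing in for the model-specific prior and likelihood.

```python
import torch

def sgld_step(params, log_prior, minibatch_log_lik, step_size, num_data, batch_size):
    """One SGLD update on a list of parameter tensors (requires_grad=True).

    The minibatch log-likelihood is rescaled by num_data / batch_size so its
    gradient is an unbiased estimate of the full-data log-likelihood gradient.
    """
    log_post = log_prior(params) + (num_data / batch_size) * minibatch_log_lik(params)
    grads = torch.autograd.grad(log_post, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            noise = torch.randn_like(p) * (2.0 * step_size) ** 0.5
            # theta <- theta + alpha * grad log p(theta | D) + sqrt(2 * alpha) * eps
            p.add_(step_size * g + noise)
```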
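
The reparameterization trick writes a sample from the factorized Gaussian approximation as $\theta = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(\bm{0}, \bm{I})$, so a Monte Carlo ELBO estimate can be differentiated with respect to the variational parameters. The sketch below, with an illustrative dimension and a placeholder `log_joint`, also makes concrete why mean-field VI doubles the parameter count: each weight gets a mean and a log-standard-deviation.

```python
import torch

dim = 1000                                        # illustrative number of network weights
mu = torch.zeros(dim, requires_grad=True)         # variational means
log_sigma = torch.zeros(dim, requires_grad=True)  # variational log-stds (second copy of the weights)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-3)

def elbo_estimate(log_joint):
    """Single-sample ELBO estimate via the reparameterization trick.

    log_joint(theta) is a placeholder returning log p(D | theta) + log p(theta).
    """
    sigma = log_sigma.exp()
    eps = torch.randn(dim)                        # eps ~ N(0, I)
    theta = mu + sigma * eps                      # reparameterized sample from q_phi
    log_q = torch.distributions.Normal(mu, sigma).log_prob(theta).sum()
    return log_joint(theta) - log_q               # ELBO = E_q[log p(D, theta)] - E_q[log q(theta)]

# One gradient step (log_joint assumed defined elsewhere):
# opt.zero_grad(); (-elbo_estimate(log_joint)).backward(); opt.step()
```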

Deep Generative Models

The section on DGMs discusses the application of Bayesian computation to energy-based models, score-based models, diffusion models, and deep latent variable models.

Key points covered include:

  • Overview of DGMs and their categorization into models directly parameterizing the data distribution and models using latent variables.
  • Discussion of Energy-Based Models (EBMs) and the challenges in maximum likelihood estimation due to the intractable normalizing constant $Z_{\theta}$: the gradient of $\log p_{\theta}(\bm{x})$ contains an intractable term, since $\nabla_{\theta}\log p_{\theta}(\bm{x}) = -\nabla_{\theta}E_{\theta}(\bm{x}) - \nabla_{\theta}\log Z_{\theta}$.
  • Explanation of Langevin sampling for generating samples from EBMs (see the sketch after this list), with update rule $\bm{x}_{k+1} \gets \bm{x}_k - \alpha_k\nabla_{\bm{x}_k}E_{\theta}(\bm{x}_k) + \sqrt{2\alpha_k}\cdot\bm{\epsilon}_k$, where:
    • $\bm{x}_{k+1}$ is the updated sample at iteration $k+1$.
    • $\bm{x}_k$ is the current sample at iteration $k$.
    • $\alpha_k$ is the step size at iteration $k$.
    • $\nabla_{\bm{x}_k}E_{\theta}(\bm{x}_k)$ is the gradient of the energy function $E_{\theta}$ with respect to $\bm{x}_k$.
    • $\bm{\epsilon}_k \sim \mathcal{N}(\bm{0}, \bm{I})$ is a standard Gaussian noise vector at iteration $k$.
  • Discussion of score matching for training score-based models, including denoising score matching (sketched after this list) and noise conditional score networks.
  • Introduction of diffusion models and their connection to score-based models and stochastic differential equations (SDEs).
  • Explanation of reverse-time SDEs for data generation in diffusion models (a sampling sketch follows this list): $d\bm{x} = [\bm{f}(\bm{x},t) - g(t)^2\nabla_{\bm{x}}\log q_t(\bm{x})]\,dt + g(t)\,d\bar{\bm{w}}$, where:
    • $d\bm{x}$ is the infinitesimal change in $\bm{x}$.
    • $\bm{f}(\bm{x},t)$ is the drift function.
    • $g(t)$ is the diffusion function.
    • $\nabla_{\bm{x}}\log q_t(\bm{x})$ is the score function.
    • $dt$ is an infinitesimal negative time step.
    • $d\bar{\bm{w}}$ is a standard Wiener process when time flows backward.
  • Discussion of deep latent variable models (DLVMs), variational autoencoders (VAEs), and amortized inference (see the VAE sketch after this list). The model parameters $\theta$ in a DLVM are also estimated via maximum likelihood.
  • Explanation of the ELBO objective in VAEs and the use of the reparameterization trick for training.
  • Discussion of combining VI and MCMC in DLVMs to improve the accuracy of posterior samples.
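
The Langevin update above can be implemented directly by differentiating the energy with respect to the input. A minimal PyTorch sketch, where `energy_fn` is a hypothetical network returning $E_{\theta}(\bm{x})$ for a batch of inputs:

```python
import torch

def langevin_sample(energy_fn, x_init, num_steps=100, step_size=1e-2):
    """Approximate sampling from p_theta(x) proportional to exp(-E_theta(x)) via Langevin dynamics."""
    x = x_init.clone().requires_grad_(True)
    for _ in range(num_steps):
        grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
        noise = torch.randn_like(x)
        # x_{k+1} <- x_k - alpha_k * grad_x E_theta(x_k) + sqrt(2 * alpha_k) * eps_k
        x = (x - step_size * grad + (2.0 * step_size) ** 0.5 * noise).detach().requires_grad_(True)
    return x.detach()
```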
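
Denoising score matching avoids the intractable score of the data distribution by regressing the network onto the score of a Gaussian perturbation kernel. A short sketch at a single noise level, with `score_fn` a hypothetical score network:

```python
import torch

def dsm_loss(score_fn, x, sigma):
    """Denoising score matching loss at noise level sigma.

    The regression target is the score of the perturbation kernel,
    grad_{x_tilde} log N(x_tilde; x, sigma^2 I) = -(x_tilde - x) / sigma^2.
    """
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise                     # perturbed data
    target = -(x_tilde - x) / sigma ** 2
    return ((score_fn(x_tilde, sigma) - target) ** 2).sum(dim=-1).mean()
```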
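
Generation with the reverse-time SDE amounts to integrating it backward from $t = T$ to $t = 0$ with a learned score in place of $\nabla_{\bm{x}}\log q_t(\bm{x})$. A minimal Euler-Maruyama sketch, where `score_fn`, `f`, and `g` are hypothetical callables for the trained score network and the forward SDE's drift and diffusion:

```python
import torch

def reverse_sde_sample(score_fn, f, g, x_T, num_steps=1000, T=1.0):
    """Euler-Maruyama integration of dx = [f(x,t) - g(t)^2 * score(x,t)] dt + g(t) dw_bar."""
    dt = -T / num_steps                  # negative time step: integrate from T down to 0
    x = x_T
    for i in range(num_steps):
        t = T + i * dt
        drift = f(x, t) - g(t) ** 2 * score_fn(x, t)
        x = x + drift * dt + g(t) * abs(dt) ** 0.5 * torch.randn_like(x)
    return x
```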
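
In a VAE, amortized inference replaces per-datapoint variational parameters with an encoder network that maps $\bm{x}$ to the parameters of $q_{\phi}(\bm{z}|\bm{x})$, and the ELBO is again optimized with the reparameterization trick. A compact sketch with illustrative layer sizes, assuming binary data modeled with a Bernoulli decoder:

```python
import torch
from torch import nn

class TinyVAE(nn.Module):
    """Minimal VAE: amortized Gaussian encoder q_phi(z|x) and Bernoulli decoder p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=16, h_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # reparameterization trick
        logits = self.dec(z)
        log_lik = -nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction="none").sum(-1)
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1)  # KL(q || N(0, I)) in closed form
        return (log_lik - kl).mean()        # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
```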

The paper concludes by discussing research challenges, such as addressing weight symmetries in DNNs and understanding the bias in training deep generative models with approximate posterior inference.
