
Mean-field Variational Inference

Updated 30 December 2025
  • Mean-field Variational Inference is a technique that approximates the full Bayesian posterior by assuming a fully factorized (product-form) distribution for computational tractability.
  • It employs coordinate-ascent (CAVI) updates to optimize the evidence lower bound (ELBO), with closed-form iterations and convergence guarantees under convexity assumptions.
  • Recent advances integrate geometric tools like gradient flows, Fokker–Planck PDEs, and interacting diffusions to provide theoretical guarantees and inspire new algorithmic variants.

Mean-field variational inference (MFVI) is a foundational technique in Bayesian inference, where the posterior distribution is approximated by restricting to fully factorized (product-form) probability measures. MFVI has emerged as the workhorse of scalable variational inference due to its algorithmic simplicity and the tractability of its coordinate-ascent updates. Recent research has provided a geometric, analytic, and computational unification of MFVI using the language of gradient flows, partial differential equations (PDEs), and interacting particle systems, placing the classical approach on a rigorous foundation and enabling new algorithmic variants (Ghosh et al., 2022). This article presents a comprehensive account of these representations, theoretical guarantees, and algorithmic implications.

1. MFVI: Formulation, Objective, and Coordinate-Ascent

Given data $x\in\mathbb{R}^n$ and latent variables $\theta\in\mathbb{R}^d$ with prior $\pi(\theta)$ and likelihood $p(x|\theta)$, the exact posterior is $p(\theta|x) = \pi(\theta)\,p(x|\theta)/Z$. The Bayesian inference problem is recast as minimizing the Kullback–Leibler divergence over all probability measures $\nu$:
$$p = \operatorname*{arg\,min}_{\nu\in\mathcal{P}(\mathbb{R}^d)} D(\nu\,\|\,p).$$
Alternatively, one may equivalently optimize the functional

$$J(\nu) = \mathbb{E}_\nu[-\log p(x,\theta)] - H(\nu),$$

where $H(\nu) = -\mathbb{E}_\nu[\log \nu]$ is the Shannon entropy. This functional is the negative evidence lower bound (ELBO).
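As a concrete check of this objective, consider a one-dimensional quadratic negative log-joint over Gaussian candidates, where the negative ELBO has a closed form and is minimized exactly at the posterior. The target parameters below are illustrative assumptions, not from the source:

```python
import math

# Negative ELBO J(nu) = E_nu[-log p] - H(nu) for an illustrative 1-D quadratic
# -log p(x, theta) = (theta - m)^2 / (2 * s^2) (normalizer dropped; m, s assumed)
# over Gaussian candidates nu = N(a, t^2); a coarse grid search recovers the
# exact posterior N(m, s^2) as the minimizer.
m, s = 0.5, 1.2

def neg_elbo(a, t):
    expected_energy = ((a - m) ** 2 + t ** 2) / (2.0 * s ** 2)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * t ** 2)
    return expected_energy - entropy

best = min(((neg_elbo(a, t), a, t)
            for a in (i / 10 for i in range(-20, 21))
            for t in (j / 10 for j in range(1, 31))), key=lambda r: r[0])
print(best[1], best[2])  # grid minimizer (0.5, 1.2): the exact posterior (m, s)
```

Because the objective separates into a term in $a$ and a term in $t$, the grid minimum coincides with the continuous minimizer here.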

MFVI restricts $\nu$ to the mean-field family
$$\mathcal{M} = \Big\{ \nu(\theta) = \prod_{i=1}^d \nu_i(\theta_i) \Big\} \subset \mathcal{P}(\mathbb{R}^d).$$
For a product-form $\nu$, the objective separates coordinate-wise up to terms independent of $\nu_i$: for each $i$,
$$J(\nu) = J_i(\nu_i; \nu_{-i}) - \sum_{j \neq i} H(\nu_j), \qquad J_i(\nu_i; \nu_{-i}) = \mathbb{E}_{\nu_i}[\Psi_i(\cdot\,; \nu_{-i})] - H(\nu_i),$$
where $\Psi_i(\theta_i; \nu_{-i}) = \mathbb{E}_{\nu_{-i}}[-\log p(x,\theta)]$; minimizing $J_i$ over $\nu_i$ therefore minimizes $J$ in that coordinate.

Coordinate-ascent variational inference (CAVI) cyclically minimizes each $J_i$ while keeping the other factors fixed:
$$\nu_i^{k} = \operatorname*{arg\,min}_{\nu_i} J_i(\nu_i; \nu_{-i}^{k-1}).$$
This yields the closed-form update
$$\nu_i(\theta_i) \propto \exp\big(-\Psi_i(\theta_i; \nu_{-i})\big),$$
and the cyclic sweeps are repeated until convergence.
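The CAVI recursion can be made concrete for a bivariate Gaussian target, where each update $\nu_i \propto \exp(-\Psi_i)$ is itself Gaussian with precision $A_{ii}$ and a mean depending linearly on the other factor's mean. The precision matrix and mean below are illustrative assumptions:

```python
# CAVI for an illustrative bivariate Gaussian target (precision A and mean mu
# are assumptions, not from the source):
# -log p(x, theta) = 0.5 * (theta - mu)^T A (theta - mu) + const.
# Each factor update nu_i = N(m_i, 1/A_ii) is the closed-form minimizer
# nu_i propto exp(-Psi_i), with Psi_i quadratic in theta_i.
A = [[2.0, 0.8], [0.8, 1.5]]   # precision matrix of the target
mu = [1.0, -2.0]               # mean of the target

m = [0.0, 0.0]                 # means of the two variational factors
for _ in range(50):            # cyclic coordinate-ascent sweeps
    m[0] = mu[0] - (A[0][1] / A[0][0]) * (m[1] - mu[1])
    m[1] = mu[1] - (A[1][0] / A[1][1]) * (m[0] - mu[0])

variances = [1.0 / A[0][0], 1.0 / A[1][1]]
print(m, variances)  # means converge to mu = [1.0, -2.0]
```

The factor variances $1/A_{ii}$ understate the true marginal variances $(A^{-1})_{ii}$, illustrating the well-known variance underestimation of mean-field approximations.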

2. Geometric Representations: Gradient Flows, PDEs, and Diffusions

Three analytic and probabilistic representations of MFVI are established (Ghosh et al., 2022):

a. Gradient Flow on Product Wasserstein Space.

Let $\mathcal{P}_2(\mathbb{R})$ denote the probability measures on $\mathbb{R}$ with finite second moment, equipped with the 2-Wasserstein metric $W_2$. The mean-field space is the product

$$\mathcal{M}_2 = \prod_{i=1}^d \mathcal{P}_2(\mathbb{R}),$$

with product metric $d^2(\nu, \mu) = \sum_{i=1}^d W_2^2(\nu_i, \mu_i)$. The MFVI energy $\Phi(\nu) = (J_1(\nu_1; \nu_{-1}), \dots, J_d(\nu_d; \nu_{-d}))$ induces a gradient flow
$$\partial_t \nu(t) = - \nabla_W \Phi(\nu(t)),$$
where, for each $i$,

$$\partial_t \nu_i(t) + \nabla_{W_2,i}\, J_i(\nu_i(t); \nu_{-i}(t)) = 0.$$

b. Fokker–Planck–Type PDEs.

Writing $\nu_i(t) = \rho_i(t, \theta_i)\, d\theta_i$, the marginal densities satisfy the coupled quasilinear parabolic PDE system
$$\partial_t \rho_i(t, \theta_i) = \partial_{\theta_i} \big[ \rho_i(t, \theta_i)\, \partial_{\theta_i} \Psi_i(\theta_i; \rho_{-i}(t)) \big] + \partial^2_{\theta_i} \rho_i(t, \theta_i).$$
This is interpreted as a continuity (transport) equation plus isotropic diffusion.

c. McKean–Vlasov Interacting Diffusion Process.

The PDE system above arises as the forward Kolmogorov equation for the interacting diffusion system
$$d\theta_i(t) = -\partial_{\theta_i} \Psi_i(\theta_i(t); \rho_{-i}(t))\, dt + \sqrt{2}\, dW_t^i,$$
where the $W^i$ are independent Brownian motions; the noise scale $\sqrt{2}$ matches the unit diffusion coefficient $\partial^2_{\theta_i}\rho_i$ in the PDE and makes $\exp(-\Psi_i)$ the stationary factor. Under sufficient regularity, the time-marginal laws of this SDE are exactly the solutions of the MFVI PDE.
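A minimal particle discretization of this interacting system (Euler–Maruyama in time, with the law $\rho_{-i}(t)$ replaced by the empirical particle cloud) can be sketched as follows. The quadratic target is an illustrative assumption, and the $\sqrt{2}$ noise scale is chosen so the stationary factors are $\propto \exp(-\Psi_i)$, consistent with the Fokker–Planck equation above:

```python
import math
import random

random.seed(0)

# Euler-Maruyama particle discretization of the McKean-Vlasov system for an
# illustrative 2-D Gaussian target (precision A and mean mu are assumptions):
# -log p(x, theta) = 0.5 * (theta - mu)^T A (theta - mu) + const.
# Each drift sees the other coordinate's law only through its mean, which is
# estimated from the particle cloud.
A = [[2.0, 0.8], [0.8, 1.5]]
mu = [1.0, -2.0]
N, h, steps = 500, 0.02, 1500          # particles, step size, horizon T = 30

th1 = [random.gauss(0.0, 1.0) for _ in range(N)]
th2 = [random.gauss(0.0, 1.0) for _ in range(N)]
for _ in range(steps):
    m1 = sum(th1) / N                  # empirical surrogate for the mean of rho_1(t)
    m2 = sum(th2) / N
    th1 = [x - h * (A[0][0] * (x - mu[0]) + A[0][1] * (m2 - mu[1]))
           + math.sqrt(2 * h) * random.gauss(0.0, 1.0) for x in th1]
    th2 = [x - h * (A[1][1] * (x - mu[1]) + A[1][0] * (m1 - mu[0]))
           + math.sqrt(2 * h) * random.gauss(0.0, 1.0) for x in th2]

mean1, mean2 = sum(th1) / N, sum(th2) / N
print(mean1, mean2)  # particle means settle near the target mean (1.0, -2.0)
```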

3. Discretized Algorithms: Proximal-JKO Scheme and CAVI Convergence

The time-discretized version of the MFVI gradient flow corresponds to a proximal-point (JKO) step in the product Wasserstein metric:
$$\nu_{h,i}^k = \operatorname*{arg\,min}_{\nu_i \in \mathcal{P}_2(\mathbb{R})} \Big\{ \tfrac{1}{2} W_2^2(\nu_i, \nu_{h,i}^{k-1}) + h\, J_i(\nu_i; \nu_{h,-i}^{k-1}) \Big\},$$
for step size $h>0$. Piecewise-constant interpolation between iterates converges (as $h\to0$) to the continuous Wasserstein gradient-flow solution [(Ghosh et al., 2022), Theorem 4.3]. The proof combines tightness of the interpolants (obtained from energy dissipation), the energy-dissipation inequality, and uniqueness via geodesic convexity.
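Restricted to one-dimensional Gaussians $N(a, s^2)$, where $W_2^2(N(a,s^2), N(a_0,s_0^2)) = (a-a_0)^2 + (s-s_0)^2$ in closed form, a single coordinate's JKO step can be solved exactly for a quadratic $\Psi_i$. This is an illustrative sketch under that assumption, not the paper's general algorithm:

```python
import math

# One coordinate's JKO step restricted to 1-D Gaussians N(a, s^2), where
# W2^2(N(a, s^2), N(a0, s0^2)) = (a - a0)^2 + (s - s0)^2 in closed form.
# Psi_i(t) = 0.5 * c * (t - m_star)^2 is an illustrative quadratic assumption, so
# J_i(a, s) = 0.5 * c * ((a - m_star)^2 + s^2) - log(s) + const.
def jko_step(a0, s0, c, m_star, h):
    # stationarity in a of 0.5*(a - a0)^2 + h * 0.5*c*(a - m_star)^2:
    a = (a0 + h * c * m_star) / (1.0 + h * c)
    # stationarity in s: (s - s0) + h*(c*s - 1/s) = 0,
    # i.e. (1 + h*c)*s^2 - s0*s - h = 0; take the positive root:
    s = (s0 + math.sqrt(s0 ** 2 + 4.0 * h * (1.0 + h * c))) / (2.0 * (1.0 + h * c))
    return a, s

a, s = 0.0, 2.0
for _ in range(400):           # repeated proximal steps with step size h = 0.05
    a, s = jko_step(a, s, c=1.5, m_star=1.0, h=0.05)
print(a, s)  # converges to the stationary point a = m_star, s = 1/sqrt(c)
```

The fixed point of this proximal map is exactly $(m_\star, 1/\sqrt{c})$, the minimizer of $J_i$, for any $h>0$.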

4. Theoretical Guarantees: PDE Limits, Global Convergence, and Geometric Conditions

The continuous-time limit of parametric or particle-based MFVI is described by a system of coupled one-dimensional parabolic PDEs,
$$\partial_t \rho_i = \partial_{\theta_i}\big[ \rho_i\, \partial_{\theta_i} \Psi_i(\theta_i; \rho_{-i}) \big] + \partial^2_{\theta_i} \rho_i, \qquad i=1,\dots,d.$$
Convergence of the time-discretized JKO/CAVI iterates to this flow, and to the associated SDE, is guaranteed under standard convexity conditions.

The key assumption is $\lambda$-convexity of $-\log p(x, \theta)$ in each variable (i.e., the negative log-joint has Hessian $\succeq \lambda I$ along each coordinate). This ensures geodesic convexity of each $J_i$ on $\mathcal{P}_2(\mathbb{R})$, and hence uniqueness and exponential contractivity of the mean-field flow:
$$W_2(\nu(t), \nu'(t)) \leq e^{-\lambda t}\, W_2(\nu(0), \nu'(0)).$$
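The contraction estimate can be checked numerically in the Gaussian restriction of the flow: for a quadratic potential with curvature $c$ (so $\lambda = c$), two Gaussian solutions approach each other at least as fast as $e^{-\lambda t}$. The potential and initial conditions below are illustrative assumptions:

```python
import math

# Gaussian restriction of the 1-D mean-field flow under an illustrative quadratic
# potential Psi(t) = 0.5 * c * (t - m_star)^2, so lambda = c. Gaussian solutions
# N(a, s^2) follow a' = -c*(a - m_star), s' = -c*s + 1/s, and
# W2(N(a1, s1^2), N(a2, s2^2))^2 = (a1 - a2)^2 + (s1 - s2)^2.
c, m_star = 1.5, 0.0
dt, steps = 1e-3, 2000                 # forward Euler to time T = 2

def step(a, s):
    return a + dt * (-c * (a - m_star)), s + dt * (-c * s + 1.0 / s)

(a1, s1), (a2, s2) = (3.0, 0.5), (-1.0, 2.0)
d0 = math.hypot(a1 - a2, s1 - s2)      # initial W2 distance
for _ in range(steps):
    a1, s1 = step(a1, s1)
    a2, s2 = step(a2, s2)
d = math.hypot(a1 - a2, s1 - s2)
print(d / d0, math.exp(-c * dt * steps))  # observed ratio stays below e^{-lambda*T}
```

The standard-deviation gap contracts strictly faster than $e^{-\lambda t}$ here, so the bound is dominated by the mean gap.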

These conditions yield both the correctness of the gradient-flow and SDE representations and the global convergence of practical algorithms (CAVI and its proximal-JKO variants) (Ghosh et al., 2022).

5. Practical Algorithmic Frameworks

The geometric perspective yields several classes of implementable MFVI algorithms:

  • Parametric MFVI: If variational factors are chosen from exponential families (e.g., Gaussian, Gaussian mixtures), the JKO coordinate-wise subproblems reduce to closed-form or tractable proximal updates.
  • Particle-Based MFVI: Each variational factor $\nu_i$ is represented empirically using particles; proximal-Wasserstein (JKO) updates are performed via interacting particle systems. In the continuous-time limit, one recovers the McKean–Vlasov SDE described above, permitting analysis of ergodicity and convergence rates.
  • Discretization and Numerical Realization: Depending on the chosen space, one may discretize either the measure dynamics (e.g., via particles or parametric surrogates) or the coupled PDEs (e.g., by finite-difference or finite-element methods). Algorithm design principles and convergence checks are inherited from the Wasserstein gradient flow literature.
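As one numerical-realization sketch (under the simplifying assumptions of a single coordinate, a frozen quadratic potential $\Psi_i(x) = x^2/2$, zero-flux boundaries, and an explicit scheme chosen for readability), the Fokker–Planck system can be discretized in conservative finite-difference form:

```python
# Conservative explicit finite-difference scheme for one coordinate's PDE
# d_t rho = d_x(rho * Psi') + d_x^2 rho, with frozen Psi(x) = x^2 / 2 (assumed)
# and zero-flux boundaries, so the discrete dynamics conserve mass and relax
# toward the stationary density rho ~ exp(-Psi), i.e. N(0, 1).
n, L = 101, 5.0
dx = 2 * L / (n - 1)
xs = [-L + i * dx for i in range(n)]
rho = [1.0 / (2 * L)] * n              # start from a flat density on [-L, L]
dt, steps = 0.002, 3000                # explicit Euler; 2*dt/dx^2 = 0.4 < 1

def psi_prime(x):
    return x                           # Psi'(x) for the assumed Psi(x) = x^2 / 2

for _ in range(steps):
    flux = [0.0] * (n + 1)             # flux rho*Psi' + d_x rho at cell interfaces
    for i in range(n - 1):
        xm = 0.5 * (xs[i] + xs[i + 1])
        flux[i + 1] = (0.5 * (rho[i] + rho[i + 1]) * psi_prime(xm)
                       + (rho[i + 1] - rho[i]) / dx)
    rho = [rho[i] + dt * (flux[i + 1] - flux[i]) / dx for i in range(n)]

mass = sum(rho) * dx
var = sum(r * x * x for r, x in zip(rho, xs)) * dx / mass
print(mass, var)                       # mass is conserved; variance relaxes to ~1
```

Because the update telescopes over interface fluxes and the boundary fluxes are zero, the discrete mass is conserved exactly, mirroring the continuity-equation structure of the continuous flow.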

6. Extensions, Connections, and Future Directions

The framework described in (Ghosh et al., 2022) enables several new theoretical and practical avenues:

  • Alternative Metric and Divergence Choices: By replacing the Wasserstein metric or regularizer (e.g., with Hellinger or Stein discrepancies), new gradient-flow and diffusion representations are derived, broadening the scope of tractable VI.
  • Accelerated and Higher-Order Schemes: The geometric viewpoint suggests importing inertial (momentum) or higher-order splitting schemes, promising faster convergence and improved exploration.
  • Weakening Convexity and Infinite-Dimensional Models: Ongoing work includes the analysis of cases with only displacement semiconvexity, extension to infinite-dimensional latent spaces (e.g., Gaussian-process models), and quantification of convergence rates that account for the model dimension $d$.
  • Rigorous Unification of MFVI Algorithms: The equivalence between coordinate ascent updates, gradient flows in product Wasserstein spaces, Fokker–Planck–type PDEs, and McKean–Vlasov diffusions provides a unified geometric and probabilistic foundation, enabling systematic derivation and analysis of MFVI algorithms and their numerical approximations.

Overall, this analytic and geometric unification of MFVI not only justifies the standard coordinate-ascent protocol but also invites importing powerful techniques from stochastic analysis, PDE theory, and optimal transport into the study and numerical implementation of scalable variational inference (Ghosh et al., 2022).

References (1)
