Expectation Propagation (EP)
- Expectation Propagation (EP) is an iterative algorithm for approximate inference that constructs tractable Gaussian approximations for complex, intractable posteriors.
- It refines local approximations via cavity and tilted distributions, employing moment matching to update each site factor consistently.
- EP connects to smoothed Newton methods and variational frameworks, offering scalable, accurate Bayesian inference with systematic perturbative corrections.
Expectation Propagation (EP) is an iterative algorithm for approximate inference which constructs and refines a tractable approximation to a complex posterior distribution by locally projecting intractable factors onto a computationally convenient exponential family. Historically, EP was introduced as a generalization of assumed-density filtering and belief propagation, providing broader applicability to hybrid discrete-continuous models and the flexibility to propagate correlated (non-factorized) approximations. The algorithm is notably well-suited for Bayesian inference and statistical learning, especially when the exact posterior is intractable due to non-conjugate or high-dimensional factors. EP’s empirical success, especially when using Gaussian approximations, has motivated intensive efforts to characterize its fixed points, convergence, theoretical exactness, and connections to variational and optimization-based frameworks.
1. Bayesian Approximate Inference and EP’s Construction
Consider a posterior density

$$p(x) = \frac{1}{Z}\,\exp\big(-E(x)\big),$$

where $E(x)$ is twice differentiable and typically a sum of local terms, $E(x) = \sum_{i=1}^{N} E_i(x)$, with each factor $f_i(x) = \exp\big(-E_i(x)\big)$. Direct computation of $Z$ or of the marginals of $p$ is usually intractable.
EP introduces an approximating exponential-family density $q(x)$, commonly chosen to be multivariate Gaussian:

$$q(x) = \frac{1}{Z_q} \prod_i \tilde f_i(x),$$

with each site $\tilde f_i$ Gaussian. Each $\tilde f_i(x)$ approximates the original factor $f_i(x)$.
To refine each site factor $\tilde f_i$, EP proceeds as follows:
- Cavity Distribution: Remove $\tilde f_i$ from $q$, forming $q_{-i}(x) \propto q(x)/\tilde f_i(x)$.
- Tilted Distribution: Reintroduce the true factor: $\hat p_i(x) \propto q_{-i}(x)\, f_i(x)$.
- Moment Matching/Projection: Project $\hat p_i$ back to the exponential family by matching its moments (typically mean and covariance for Gaussians).
- Update: Set the new $\tilde f_i$ such that the moments of $q_{-i}\,\tilde f_i$ align with those of $\hat p_i$.
This local moment matching iteratively refines all sites $\tilde f_i$ until global consistency, i.e., convergence, is achieved (Minka, 2013).
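The cavity/tilted/moment-matching loop above can be sketched for a one-dimensional model, with tilted moments computed by simple grid quadrature. This is a minimal illustrative implementation; the function name, the quadrature grid, and the truncated-Gaussian example are my own choices, not taken from the cited papers:

```python
import numpy as np

def ep_1d(site_fns, prior_mean=0.0, prior_var=1.0, sweeps=50):
    """Minimal 1D EP with Gaussian sites; tilted moments via grid quadrature."""
    grid = np.linspace(-10.0, 10.0, 2001)
    n = len(site_fns)
    r = np.zeros(n)  # site precisions
    b = np.zeros(n)  # site precision-times-mean parameters
    r0, b0 = 1.0 / prior_var, prior_mean / prior_var
    for _ in range(sweeps):
        for i in range(n):
            # cavity: global Gaussian with site i removed (natural parameters)
            r_cav = r0 + r.sum() - r[i]
            b_cav = b0 + b.sum() - b[i]
            m_cav, v_cav = b_cav / r_cav, 1.0 / r_cav
            # tilted distribution: cavity times the true factor
            tilted = np.exp(-0.5 * (grid - m_cav) ** 2 / v_cav) * site_fns[i](grid)
            Z = tilted.sum()  # uniform grid, so spacing cancels in the ratios
            m = (grid * tilted).sum() / Z
            v = ((grid - m) ** 2 * tilted).sum() / Z
            # moment matching: new site = matched Gaussian divided by cavity
            r[i] = 1.0 / v - r_cav
            b[i] = m / v - b_cav
    r_post, b_post = r0 + r.sum(), b0 + b.sum()
    return b_post / r_post, 1.0 / r_post

# Example: truncate a standard normal to (-1, 1) using two step-function sites
mean, var = ep_1d([lambda x: (x > -1).astype(float),
                   lambda x: (x < 1).astype(float)])
```

With the two step-function sites, the exact posterior is a standard normal truncated to $(-1, 1)$ (mean $0$, variance $\approx 0.29$), and the EP fixed point closely tracks these moments.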
2. EP as Smoothed Newton’s Method and Variational Characterization
Each EP site update can be interpreted as a smoothed Newton (second-order) step on a smoothed energy landscape. Instead of performing Newton's method directly on the raw energy $E(x)$, EP operates on a convolution of $E$ with the current Gaussian approximation, amounting to the update

$$\mu_{\text{new}} = \mu - \big(\mathbb{E}_q[\nabla^2 E(x)]\big)^{-1}\, \mathbb{E}_q[\nabla E(x)].$$

The gradient and Hessian used in Newton's method are replaced by their averages under $q$, i.e., under the current Gaussian approximating measure. Each EP site update is effectively a Newton step along this locally averaged (smoothed) energy (Dehaene, 2016).
The fixed points of EP coincide with solutions to the problem of minimizing the "reverse" Kullback–Leibler divergence (i.e., $\mathrm{KL}(q \,\|\, p)$) within the chosen exponential family; specifically, the stationary equations for a Gaussian $q = \mathcal{N}(\mu, \Sigma)$:
- $\mathbb{E}_q[\nabla E(x)] = 0$
- $\Sigma^{-1} = \mathbb{E}_q[\nabla^2 E(x)]$, where $\nabla^2 E$ denotes the Hessian of the energy and $\Sigma$ the covariance of $q$.
This equivalence explains why, in convex and log-concave cases, EP converges to the unique solution minimizing $\mathrm{KL}(q \,\|\, p)$ among Gaussians (Dehaene, 2016).
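The smoothed Newton fixed-point conditions above can be demonstrated numerically in one dimension: averaging the gradient and Hessian under the current Gaussian via Gauss–Hermite quadrature and iterating until the stationary equations hold. A sketch under my own choices of energy and quadrature order (not from the cited papers):

```python
import numpy as np

def smoothed_newton(dE, d2E, mu=1.0, var=1.0, iters=100):
    """Newton iteration on a Gaussian-smoothed energy: the gradient and
    Hessian of E are replaced by their expectations under q = N(mu, var)."""
    z, w = np.polynomial.hermite_e.hermegauss(40)  # probabilists' Hermite rule
    w = w / w.sum()                                # normalize to a probability measure
    for _ in range(iters):
        x = mu + np.sqrt(var) * z
        g = np.dot(w, dE(x))    # E_q[E'(x)]
        H = np.dot(w, d2E(x))   # E_q[E''(x)]
        mu = mu - g / H         # Newton step on the smoothed energy
        var = 1.0 / H           # precision matched to the averaged Hessian
    return mu, var

# Convex energy E(x) = x^2/2 + x^4/4: the stationary conditions
# E_q[E'] = 0 and 1/var = E_q[E''] give mu = 0 and 3*var^2 + var - 1 = 0.
mu, var = smoothed_newton(lambda x: x + x ** 3, lambda x: 1 + 3 * x ** 2)
```

At convergence the returned $(\mu, \sigma^2)$ satisfies exactly the two stationary equations listed above, illustrating the equivalence between the smoothed Newton iteration and the reverse-KL fixed point.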
3. Corrections, Exactness, and Cumulant Expansions
Though EP is exact only under certain conditions (e.g., when the true posterior lies in the approximating family or each tilted distribution is Gaussian), in general it neglects higher-order cumulants (third and above). This neglect can be systematically corrected via a perturbative expansion:

$$Z = Z_{\mathrm{EP}}\,(1 + c_1 + c_2 + \cdots),$$

where $Z$ is the normalization of $p$, $Z_{\mathrm{EP}}$ is the estimate from the EP approximation, and the $c_k$ are corrections involving cumulant differences between tilted and EP marginals. Notably, the first-order term $c_1$ vanishes and the leading correction $c_2$ is quadratic in site-pair cumulant differences, computable in $O(N^2)$ time for $N$ sites (Opper et al., 2013). If all higher cumulants vanish, EP is exact. These corrections allow the error of the EP approximation to be quantified at polynomial cost in $N$.
4. Convergence and Stability Properties
Convergence guarantees are available in several settings:
- For convex, log-concave posteriors, a full parallel EP sweep converges to a unique fixed point, viewed as block-coordinate descent on a concave lower bound (Dehaene, 2016).
- In the large-data (large-$n$) and large-cavity (high-precision) limit, both standard EP and averaged-EP (aEP) behave as Newton's method, performing local quadratic updates on the log-posterior. Near modes of the posterior, EP's fixed point converges in total variation to the canonical Gaussian approximation, under mild regularity and identifiability conditions (Dehaene et al., 2015).
- However, EP can be globally unstable in multimodal or poorly initialized scenarios, mirroring the known potential for Newton’s method to diverge far from a mode. In practice, using damping, Laplace initialization, or contraction diagnostics is essential for robust performance.
Because EP site updates involve Newton-like steps on smoothed objectives, they typically accelerate convergence in directions of gentle curvature and naturally recover the Laplace approximation in the limit of many weak factors (Dehaene, 2016).
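The damping mentioned above is typically applied in natural-parameter space, taking a convex combination of the old and proposed site parameters. A hedged sketch; the function name and step size $\delta$ are illustrative, not from any cited implementation:

```python
def damped_update(old_nat, new_nat, delta=0.5):
    """Damped EP site update: a convex combination of old and proposed
    natural parameters, which preserves the exponential-family form."""
    return tuple((1 - delta) * o + delta * n for o, n in zip(old_nat, new_nat))

# Halfway step from site parameters (0, 0) toward a proposed (2.0, 4.0)
step = damped_update((0.0, 0.0), (2.0, 4.0), delta=0.5)
```

Mixing in natural-parameter space (rather than in moments) keeps each damped site a valid exponential-family factor, which is why it is the standard stabilization for EP.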
5. Implementation, Numerical Considerations, and Variants
Efficient implementation of EP, especially in models with Gaussian-approximated posteriors, relies on the following:
- Each site update requires computing the mean and covariance of the local hybrid (tilted) distribution $\hat p_i$. For univariate or low-dimensional likelihoods, these can be computed via quadrature or local MCMC. The complexity per full sweep is $O(N d^3)$ for $N$ sites and a $d$-dimensional $x$.
- For very large models, sparse reparameterizations and distributed algorithms allow full EP inference to scale linearly with the number of sites, as well as support distributed memory architectures (see (Zhou et al., 2024) for mixed-effects regression and distributed EP frameworks).
- Variants such as $\alpha$-EP (power EP) implement alternative divergences by adjusting the smoothing kernel (for example, using power-$\alpha$ kernels recovers reverse-KL VB in the limit $\alpha \to 0$ and Hellinger-distance EP at $\alpha = 1/2$) (Dehaene, 2016).
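When a Gaussian site contributes a rank-one term $c\,uu^\top$ to the global precision, the cavity covariance can be formed without a fresh $O(d^3)$ inversion via the Sherman–Morrison identity, which is one source of the per-sweep savings noted above. A sketch; the function name is my own:

```python
import numpy as np

def cavity_covariance(Sigma, u, c):
    """Covariance of the cavity q_{-i} when site i contributes c * u u^T
    to the global precision; Sherman-Morrison makes this O(d^2)."""
    Su = Sigma @ u
    return Sigma + np.outer(Su, Su) * (c / (1.0 - c * (u @ Su)))

# Removing a precision contribution 0.1 * e1 e1^T from Sigma = 2 * I
Sigma_cav = cavity_covariance(2 * np.eye(3), np.array([1.0, 0.0, 0.0]), 0.1)
```

In the example, the global precision $0.5\,I$ minus $0.1\,e_1 e_1^\top$ has covariance $\mathrm{diag}(2.5, 2, 2)$, which the $O(d^2)$ update reproduces exactly.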
6. Applications, Empirical Performance, and Extensions
EP has been applied to a broad range of models including generalized linear models, Gaussian process classification, sparse regression, and feature selection in high-dimensional classifiers. Empirically, EP produces accurate (often state-of-the-art) marginal likelihood and posterior moment estimates, outperforming Laplace, variational Bayes, and certain sampling approaches in terms of both accuracy and computational efficiency, especially in large-scale and non-conjugate settings (Minka, 2013, Dehaene, 2016, Opper et al., 2013).
Additionally, higher-order perturbative corrections can improve log-marginal likelihood accuracy by orders of magnitude, as validated in Gaussian process and Ising model test cases (Opper et al., 2013).
7. Summary and Theoretical Implications
Expectation Propagation provides a unifying framework that connects Newton-type optimization, variational inference, and message-passing. Rigorous analysis shows:
- EP is precisely Newton’s method applied to a locally averaged energy landscape.
- Its fixed points are the unique Gaussian minimizers of reverse-KL divergence to the posterior.
- In high-data or high-cavity-limit regimes, EP is asymptotically exact.
- Perturbative expansions yield systematic corrections and quantifiable error measures.
- EP’s convergence and accuracy are explained by these unifying principles, demystifying both its empirical effectiveness and limitations (Dehaene, 2016, Opper et al., 2013, Dehaene et al., 2015).
References:
- “Expectation Propagation performs a smoothed gradient descent” (Dehaene, 2016)
- “Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models” (Opper et al., 2013)
- “Expectation Propagation in the large-data limit” (Dehaene et al., 2015)
- “Expectation Propagation for approximate Bayesian inference” (Minka, 2013)