Stochastic Expectation Propagation (SEP)
- Stochastic Expectation Propagation (SEP) is an approximate Bayesian inference method that replaces N per-datapoint factors with a single global factor, achieving constant memory usage.
- SEP employs stochastic moment-matching updates and Robbins–Monro step-size schedules to iteratively refine the shared factor, maintaining high-quality posterior estimates.
- SEP enables scalable learning on large datasets and has been extended for deep Gaussian processes and differential privacy, although it may trade off flexibility in heterogeneous data settings.
Stochastic Expectation Propagation (SEP) is an approximate Bayesian inference algorithm that merges the local-refinement advantages of classical Expectation Propagation (EP) with the memory efficiency of variational methods. It achieves this by replacing the set of per-datapoint factors used in EP with a single, shared global factor, enabling constant-memory, scalable inference while largely retaining the quality of posterior uncertainty quantification characteristic of EP. SEP was first comprehensively described by Li, Hernández-Lobato, and Turner (Li et al., 2015) and has since seen numerous theoretical and practical extensions (Hernández-Lobato et al., 2015, Bui et al., 2015, Vinaroz et al., 2021).
1. Principles and Motivation
Expectation Propagation (EP) approximates an intractable posterior density
$$p(\theta \mid \mathcal{D}) \;\propto\; p_0(\theta) \prod_{n=1}^{N} p(x_n \mid \theta)$$
with a tractable distribution
$$q(\theta) \;\propto\; p_0(\theta) \prod_{n=1}^{N} f_n(\theta),$$
where each $f_n(\theta)$ is an exponential-family site whose parameters are refined by local moment matching. This yields accurate posterior variances but incurs $O(N)$ memory, as all $N$ site factors must be stored.
Variational Inference (VI), by contrast, posits a single global approximate posterior $q(\theta)$ and fits it by minimizing the (exclusive) KL divergence $\mathrm{KL}[q \,\|\, p]$. VI’s memory cost is $O(1)$ in $N$, but it often underestimates posterior uncertainty due to factorized (mean-field) approximations.
SEP enforces a tied parameterization: it maintains a single average site factor $f(\theta)$, so that $q(\theta) \propto p_0(\theta) f(\theta)^N$, and refines $f(\theta)$ through stochastic, EP-like moment-matching updates. This achieves the $O(1)$ memory footprint of VI while preserving the local update mechanism of EP, and empirically delivers posterior calibration competitive with full EP (Li et al., 2015, Bui et al., 2015, Vinaroz et al., 2021).
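In natural-parameter form, the two approximating families and the tie between them can be written as follows (writing $\lambda_0$, $\lambda_f$, and $\lambda_q$ for the natural parameters of the prior, the tied site, and the SEP approximation; this notation is introduced here for convenience):
$$q_{\mathrm{EP}}(\theta) \;\propto\; p_0(\theta)\prod_{n=1}^{N} f_n(\theta), \qquad q_{\mathrm{SEP}}(\theta) \;\propto\; p_0(\theta)\, f(\theta)^{N}, \qquad \lambda_q = \lambda_0 + N\,\lambda_f.$$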
2. Algorithmic Structure and Updates
SEP operates by iteratively sampling datapoints (or minibatches) and performing update steps analogous to EP but aggregated into the global site. The canonical single-datapoint SEP update proceeds as follows:
- Cavity Distribution: Remove one copy of the global factor, $q_{-1}(\theta) \propto q(\theta) / f(\theta)$.
- Tilted Distribution: Form the local posterior for datapoint $x_n$, $\tilde{p}_n(\theta) \propto q_{-1}(\theta)\, p(x_n \mid \theta)$.
- Moment Matching: Project the tilted distribution back onto the exponential family, yielding an intermediate site $f_n(\theta)$ by matching moments: $f_n(\theta) \propto \mathrm{proj}[\tilde{p}_n(\theta)] / q_{-1}(\theta)$.
- Damped Global Update: Update the shared factor by moving it towards the intermediate site, $f(\theta) \leftarrow f(\theta)^{1-\epsilon} f_n(\theta)^{\epsilon}$, where $\epsilon = 1/N$ is typical, but can be adapted or annealed.
In practice, minibatch updates generalize this scheme by aggregating the intermediate sites of a minibatch $\mathcal{B}$, e.g. $f(\theta) \leftarrow f(\theta)^{1-|\mathcal{B}|\epsilon} \prod_{n \in \mathcal{B}} f_n(\theta)^{\epsilon}$. The framework also accommodates Robbins–Monro step-size schedules and other stochastic optimization techniques, supporting scalable learning on large datasets (Li et al., 2015, Bui et al., 2015). A minimal single-datapoint sketch is given below.
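As a concrete illustration of the four steps above, here is a minimal, self-contained sketch for a toy 1-D Bayesian probit model with a Gaussian approximation stored in natural parameters. The model, the variable names, and the fixed $\epsilon = 1/N$ damping are illustrative choices for exposition, not the reference implementation of Li et al. (2015).

```python
# SEP for a 1-D Bayesian probit model: p(theta) = N(0, 1), p(y_n | theta) = Phi(y_n * theta).
# The Gaussian q and the tied site f are stored as natural parameters [r, beta] of
# exp(r*theta - 0.5*beta*theta^2), so cavity removal and damping are additive.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_true, N = 1.5, 500
y = np.where(rng.random(N) < norm.cdf(theta_true), 1.0, -1.0)   # synthetic labels in {-1, +1}

prior_nat = np.array([0.0, 1.0])   # N(0, 1) prior
site_nat = np.array([0.0, 0.0])    # tied site f(theta), initially uniform
eps = 1.0 / N                      # typical damping choice

for _ in range(20):                                    # passes over the data
    for n in rng.permutation(N):
        q_nat = prior_nat + N * site_nat               # q ∝ p0 * f^N
        cav_nat = q_nat - site_nat                     # cavity: remove one copy of f
        v_cav = 1.0 / cav_nat[1]
        m_cav = cav_nat[0] * v_cav
        # Tilted moments for the probit likelihood (standard Gaussian-probit results).
        z = y[n] * m_cav / np.sqrt(1.0 + v_cav)
        ratio = norm.pdf(z) / norm.cdf(z)
        m_til = m_cav + y[n] * v_cav * ratio / np.sqrt(1.0 + v_cav)
        v_til = v_cav - v_cav**2 * ratio * (ratio + z) / (1.0 + v_cav)
        f_n_nat = np.array([m_til / v_til, 1.0 / v_til]) - cav_nat   # intermediate site
        site_nat = (1.0 - eps) * site_nat + eps * f_n_nat            # damped geometric-mean update

q_nat = prior_nat + N * site_nat
print("posterior mean ~", q_nat[0] / q_nat[1], " variance ~", 1.0 / q_nat[1])
```

Working in natural parameters turns cavity removal and the damped inclusion into simple subtractions and convex combinations, which is also how the minibatch aggregation above is typically implemented.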
3. Theoretical Characteristics and Convergence
SEP is formally a stochastic approximation to the fixed-point equations of Averaged-EP (AEP), which itself shares fixed points with full EP in the large-$N$ regime under mild regularity conditions (e.g., log-concave sites, slow parameter evolution). Under diminishing step sizes satisfying the standard Robbins–Monro conditions, SEP converges almost surely to a stationary point of the EP energy function (Li et al., 2015).
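For completeness, the Robbins–Monro conditions referred to here are the usual stochastic-approximation requirements on the step sizes $\epsilon_t$:
$$\sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty,$$
satisfied, for instance, by $\epsilon_t = c/t$ for a constant $c > 0$.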
The use of a single shared factor means SEP approximates the geometric mean (average effect) of the per-datapoint likelihood contributions, with $f(\theta)^N$ standing in for $\prod_n f_n(\theta)$; in the limit of an infinitesimal learning rate, each SEP moment-matching step, like EP's, minimizes the inclusive KL divergence between the tilted distribution (the local proxy for the true posterior) and $q(\theta)$.
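In natural parameters, this geometric-mean view reads (with $\lambda_{f_n}$ denoting the natural parameters of the per-datapoint EP sites that the tied factor emulates; a standard exponential-family identity):
$$f(\theta) \;\approx\; \Big(\prod_{n=1}^{N} f_n(\theta)\Big)^{1/N} \quad\Longleftrightarrow\quad \lambda_f \;\approx\; \frac{1}{N}\sum_{n=1}^{N} \lambda_{f_n}.$$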
A notable limitation is that the tied global factor introduces a slight loss in modeling flexibility relative to per-datapoint factors, which can be observed in pathological cases (e.g., highly heterogeneous likelihoods). However, empirical results show that SEP retains near-EP calibration for posterior uncertainty on a wide spectrum of canonical models (probit regression, mixture models, Bayesian neural nets) (Li et al., 2015, Hernández-Lobato et al., 2015).
4. Computational Benefits and Scaling
SEP’s major computational advantage is its $O(1)$ memory cost with respect to the number of datapoints $N$, in contrast to EP's $O(N)$. For Gaussian factors with parameter dimension $D$, SEP stores only the parameters of a single global factor (e.g., a mean and covariance for a Gaussian $f(\theta)$, i.e., $O(D^2)$ numbers), while EP must maintain $N$ such factors ($O(ND^2)$).
Per-iteration computational complexity for online updates is $O(B)$ moment-matching steps for minibatch size $B$, which matches that of EP amortized over the dataset but without the memory overhead. For large-scale latent Gaussian models, such as sparse GP classification with $M$ inducing points, this translates to $O(M^2)$ memory and an $O(M^3)$-dominant computational step (inverting a covariance of dimension $M \times M$) per iteration, enabling application to datasets with millions of data points (Hernández-Lobato et al., 2015, Bui et al., 2015).
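To make the scaling concrete, here is a throwaway float-count comparison for full-covariance Gaussian sites; the values of $N$ and $D$ below are arbitrary, illustrative choices:

```python
# Back-of-the-envelope storage comparison for full-covariance Gaussian sites of
# dimension D (mean: D floats, covariance: D*D floats); purely illustrative.
def gaussian_site_floats(D: int) -> int:
    return D + D * D

def ep_memory(N: int, D: int) -> int:   # EP keeps one site per datapoint
    return N * gaussian_site_floats(D)

def sep_memory(D: int) -> int:          # SEP keeps a single tied site, independent of N
    return gaussian_site_floats(D)

N, D = 1_000_000, 100
print(f"EP : {ep_memory(N, D):,} floats (~{ep_memory(N, D) * 8 / 1e9:.0f} GB at float64)")
print(f"SEP: {sep_memory(D):,} floats (~{sep_memory(D) * 8 / 1e3:.0f} KB at float64)")
```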
5. Extensions and Practical Variants
Significant algorithmic and practical refinements have built upon SEP’s foundation:
- Differentially Private SEP (DP-SEP): By clipping the norm of the intermediate sites and injecting Gaussian noise into the global factor's parameter updates, SEP can be made $(\epsilon, \delta)$-differentially private. The analysis demonstrates that the privacy-induced approximation error vanishes as the dataset size $N$ increases, allowing strong privacy guarantees with minimal loss in posterior accuracy (Vinaroz et al., 2021); a generic clip-and-noise sketch follows this list.
- Deep Gaussian Processes and Probabilistic Backpropagation: SEP has been used to construct scalable Bayesian learning algorithms for deep GP models. Here, the site updates are integrated with probabilistic backpropagation, exploiting SEP's memory efficiency and stochastic updates for layer-wise uncertainty propagation (Bui et al., 2015).
- Minibatched/Distributed Implementations: SEP’s update structure supports efficient parallelization via minibatches. Its global factor update admits distributed architectures, fitting streaming or cluster settings (Bui et al., 2015).
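The core privatization step in DP-SEP-style schemes can be sketched generically as clipping the natural parameters of each intermediate site and adding Gaussian-mechanism noise before the damped inclusion. The helper below is a hypothetical illustration; the clip bound `C`, noise scale `sigma`, and their calibration are placeholders, not the calibration derived in Vinaroz et al. (2021).

```python
# Hypothetical clip-and-noise step for a DP-SEP-style update (Gaussian mechanism).
# `site_nat` and `f_n_nat` are flat arrays of natural parameters, as in the SEP
# sketch above; C and sigma are placeholder privacy hyperparameters.
import numpy as np

def dp_site_update(site_nat, f_n_nat, eps_step, C=1.0, sigma=0.5,
                   rng=np.random.default_rng()):
    norm = np.linalg.norm(f_n_nat)
    clipped = f_n_nat * min(1.0, C / max(norm, 1e-12))                # bound the sensitivity
    noisy = clipped + rng.normal(0.0, sigma * C, size=clipped.shape)  # Gaussian noise
    return (1.0 - eps_step) * site_nat + eps_step * noisy             # damped SEP update
```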
Empirically, SEP closely tracks EP in test log-likelihoods, KL to NUTS ground truth, and posterior coverage metrics, across synthetic and real datasets (e.g., Bayesian probit regression, mixture models, Bayesian neural networks, deep GPs, sparse GP classification tasks on large UCI and MNIST datasets) (Li et al., 2015, Hernández-Lobato et al., 2015, Bui et al., 2015).
6. Best Practices and Limitations
Numerical stability is critical: site updates should be damped, jitter added to covariance matrices, and natural parameters checked (or projected) so that precisions remain positive definite. The global-factor update should use an appropriately annealed step size or batch-damped inclusion, especially for high-dimensional or ill-conditioned factors.
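A hedged sketch of these safeguards for a full-covariance Gaussian site in natural parameters (precision-mean `h`, precision `Lam`); the damping, jitter, and eigenvalue-floor values are illustrative defaults, not recommendations from the cited papers:

```python
# Illustrative damped update with jitter and a positive-definiteness safeguard
# for a full-covariance Gaussian site in natural parameters (h, Lambda).
import numpy as np

def damped_site_update(h, Lam, h_new, Lam_new, eps=0.01, jitter=1e-6, floor=1e-8):
    h_out = (1.0 - eps) * h + eps * h_new                  # damped inclusion
    Lam_out = (1.0 - eps) * Lam + eps * Lam_new
    Lam_out = 0.5 * (Lam_out + Lam_out.T)                  # symmetrize
    Lam_out += jitter * np.eye(Lam.shape[0])               # jitter for conditioning
    w, V = np.linalg.eigh(Lam_out)                         # clip eigenvalues so the
    Lam_out = (V * np.maximum(w, floor)) @ V.T             # precision stays PD
    return h_out, Lam_out
```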
In deep or hierarchical Gaussian models, per-iteration overhead can be dominated by $O(D^3)$ operations (for full-covariance transformations). Practical implementations often utilize low-rank approximations or parallel aggregation.
SEP is less flexible in modeling highly heterogeneous data than full EP due to the tied global factor. However, in regimes where the per-data likelihoods are relatively homogeneous, or in high-data settings where averaging is effective, SEP achieves strong uncertainty quantification with minimal resources (Li et al., 2015, Hernández-Lobato et al., 2015).
7. Relationship with Other Stochastic EP Variants
SEP is complementary to stochastic natural-gradient EP (SNEP) (Hasenclever et al., 2015), which instead retains per-site parameters but employs natural-gradient updates with stochastic moment estimates and an explicit convergence theory under Monte Carlo noise. While SNEP offers theoretical guarantees and supports distributed architectures, SEP is preferred when memory is the limiting factor and per-site storage is prohibitive.
Recent advances in natural-gradient perspectives on EP further refine stochastic updates of site parameters, improving stability and ease of tuning by performing natural-gradient descent directly on natural parameters or mean parameters, as in the EP-η and EP-μ schemes (So et al., 2024). These methods enable stable single-sample Monte Carlo estimation for each site, trading a small bias against noisy progress, and can be integrated into SEP-style frameworks for further memory reduction.
Key References
| Paper Title | Reference | Main Contribution |
|---|---|---|
| Stochastic Expectation Propagation | (Li et al., 2015) | Introduced SEP, theory, and empirical validation |
| Stochastic EP for Large Scale Gaussian Process Classification | (Hernández-Lobato et al., 2015) | Scalable SEP in GP settings, memory/performance analysis |
| DP-SEP | (Vinaroz et al., 2021) | Differentially private SEP, theory and experiments |
| Training DGPs with SEP and Probabilistic Backpropagation | (Bui et al., 2015) | SEP-based deep GP learning, implementation insights |
| Fearless Stochasticity in Expectation Propagation | (So et al., 2024) | Natural-gradient EP, robust single-sample updates |
| Distributed Bayesian Learning with SNEP | (Hasenclever et al., 2015) | Stochastic natural-gradient EP, theory, distributed arch |