MISELBO: Multiple Importance Sampling ELBO

Updated 8 December 2025
  • The paper demonstrates that MISELBO yields a strictly tighter lower bound than the average of the ensemble members' ELBOs, with the gap given by the Jensen–Shannon divergence that quantifies ensemble diversity.
  • It employs a mixture of variational proposals together with dedicated estimators (A2A, and the cheaper S2A and S2S variants) to reduce estimator variance and computational cost in both classical and amortized inference settings.
  • Empirical evaluations on benchmarks like MNIST and Bayesian phylogenetics reveal significant improvements in negative log-likelihood and runtime efficiency.

The Multiple Importance Sampling Evidence Lower Bound (MISELBO) is a framework for variational inference (VI) that leverages ensembles of independently trained variational approximations to provide a strictly tighter lower bound on the marginal log-likelihood than standard approaches. MISELBO arises by combining variational proposals into a mixture and applying multiple importance sampling strategies, yielding improved estimator tightness, variance reduction, and empirical gains in density estimation and Bayesian inference tasks in both classical and amortized VI contexts (Kviman et al., 2022, Hotti et al., 11 Jun 2024).

1. Background: Variational Bounds and Ensembles

In VI, the marginal likelihood $\log p_\theta(x)$ of an observed variable $x$ under a latent variable model $p_\theta(x, z)$ is typically lower bounded by the evidence lower bound (ELBO):

$$L_{\rm ELBO} = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \leq \log p_\theta(x).$$

The importance-weighted ELBO (IWELBO) generalizes this using $L$ samples:

$$L_L = \mathbb{E}_{z_{1:L} \sim q_\phi}\left[ \log\left( \frac{1}{L} \sum_{\ell=1}^L w_\ell\right)\right], \quad w_\ell = \frac{p_\theta(x, z_\ell)}{q_\phi(z_\ell|x)}.$$

Both bounds suffer limitations when the variational approximation $q_\phi(z|x)$ is unimodal or mismatched to the true posterior, motivating the use of ensembles of variational distributions and mixtures (Kviman et al., 2022, Hotti et al., 11 Jun 2024).
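The following is a minimal sketch of these two estimators (assuming NumPy/SciPy and a hypothetical conjugate-Gaussian toy model, not taken from the cited papers), chosen so that the exact $\log p_\theta(x)$ is available for comparison:

```python
# Minimal sketch: Monte Carlo ELBO and IWELBO on a toy model where log p(x) is known.
# Assumptions (not from the papers): z ~ N(0,1), x|z ~ N(z,1), Gaussian proposal q(z|x).
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x = 1.5                                               # a single observation
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))  # exact marginal: x ~ N(0, 2)

mu_q, sd_q = 0.5, 1.0                                 # deliberately mismatched proposal q(z|x)

def log_w(z):
    """Log importance weights: log p(x, z) - log q(z|x)."""
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
    return log_joint - norm.logpdf(z, mu_q, sd_q)

L, n_rep = 16, 2000
lw = log_w(rng.normal(mu_q, sd_q, size=(n_rep, L)))

elbo = lw.mean()                                      # E_q[log w], the standard ELBO
iwelbo = (logsumexp(lw, axis=1) - np.log(L)).mean()   # E[log (1/L) sum_l w_l]
print(f"ELBO {elbo:.3f} <= IWELBO(L={L}) {iwelbo:.3f} <= log p(x) {log_px:.3f}")
```

With the deliberately mismatched proposal, the printed values illustrate the ordering $L_{\rm ELBO} \leq L_L \leq \log p_\theta(x)$.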

2. Definition and Derivation of MISELBO

Suppose an ensemble of $S$ independently inferred variational approximations $\mathcal{Q}_S = \{q_{\phi_s}(z|x)\}_{s=1}^S$ is available, with mixture proposal

$$q_{\rm mix}(z|x) = \frac{1}{S} \sum_{s=1}^S q_{\phi_s}(z|x).$$

MISELBO replaces the individual proposal $q_\phi$ in the IWELBO with the mixture, yielding (for $L$ samples per member)

$$\mathcal{L}^L_{\rm MIS} = \frac{1}{S}\sum_{s=1}^S \mathbb{E}_{z_{s,1:L} \sim q_{\phi_s}} \left[ \log \frac{1}{L} \sum_{\ell=1}^L \frac{p_\theta(x, z_{s,\ell})}{q_{\rm mix}(z_{s,\ell}|x)} \right].$$

For single samples ($L=1$), this reduces to

$$\mathcal{L}^1_{\rm MIS} = \frac{1}{S}\sum_{s=1}^S \mathbb{E}_{q_{\phi_s}} \left[ \log \frac{p_\theta(x, z)}{q_{\rm mix}(z|x)} \right].$$

This formulation is directly aligned with the deterministic mixture weights, or balance heuristic, of the multiple importance sampling (MIS) literature and generalizes to mixture families in variational inference settings (Kviman et al., 2022, Hotti et al., 11 Jun 2024).
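As a concrete illustration, here is a minimal sketch of $\mathcal{L}^1_{\rm MIS}$ (assuming NumPy/SciPy and the same hypothetical toy model as above, with three hand-picked Gaussian proposals standing in for independently trained ensemble members):

```python
# Minimal sketch: single-sample MISELBO with a uniform mixture (balance-heuristic) proposal.
# Assumptions (not from the papers): toy model z ~ N(0,1), x|z ~ N(z,1); S = 3 Gaussian proposals.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x, n = 1.5, 20000
params = [(-0.5, 0.8), (0.5, 1.0), (1.2, 0.6)]        # (mean, sd) of each ensemble member q_{phi_s}
S = len(params)

def log_joint(z):
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)

def log_qmix(z):
    # log q_mix(z|x) = log (1/S) sum_s q_{phi_s}(z|x), evaluated stably
    comps = np.stack([norm.logpdf(z, m, s) for m, s in params])
    return logsumexp(comps, axis=0) - np.log(S)

miselbo = avg_elbo = 0.0
for m, s in params:                                   # outer average over ensemble members
    z = rng.normal(m, s, size=n)                      # z ~ q_{phi_s}(z|x)
    miselbo  += np.mean(log_joint(z) - log_qmix(z)) / S
    avg_elbo += np.mean(log_joint(z) - norm.logpdf(z, m, s)) / S
print(f"avg ELBO {avg_elbo:.3f} <= MISELBO {miselbo:.3f}  (gap estimates the JSD)")
```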

3. Tightness and Theoretical Guarantees

Let $\overline{L}_L = \frac{1}{S} \sum_{s=1}^S L_L(q_{\phi_s})$ denote the average of the per-member IWELBOs. For $L=1$,

$$\Delta_1 = \mathcal{L}^1_{\rm MIS} - \overline{L}_1 = \mathrm{JSD}(\mathcal{Q}_S) \in [0, \log S],$$

where $\mathrm{JSD}$ is the Jensen–Shannon divergence across $\mathcal{Q}_S$. Thus, MISELBO is provably tighter than the average ELBO unless all proposals coincide, and the gap directly quantifies their diversity. For $L>1$, empirical results uniformly find $\mathcal{L}^L_{\rm MIS} \geq \overline{L}_L$. Moreover,

$$\mathcal{L}^L_{\rm MIS} \geq \mathcal{L}^{L-1}_{\rm MIS} \geq \cdots \geq \mathcal{L}^1_{\rm MIS},$$

and $\mathcal{L}^L_{\rm MIS} \nearrow \log p_\theta(x)$ as $L \rightarrow \infty$. Thus, MISELBO supplies a valid, monotonically tightening lower bound whose improvement (at $L=1$) is exactly quantified by the JSD (Kviman et al., 2022).
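The $L=1$ identity can be verified in a few lines from the definitions above, since the $\log p_\theta(x, z)$ terms cancel:

$$\begin{aligned} \Delta_1 &= \frac{1}{S}\sum_{s=1}^S \mathbb{E}_{q_{\phi_s}}\left[\log \frac{p_\theta(x,z)}{q_{\rm mix}(z|x)}\right] - \frac{1}{S}\sum_{s=1}^S \mathbb{E}_{q_{\phi_s}}\left[\log \frac{p_\theta(x,z)}{q_{\phi_s}(z|x)}\right] \\ &= \frac{1}{S}\sum_{s=1}^S \mathbb{E}_{q_{\phi_s}}\left[\log \frac{q_{\phi_s}(z|x)}{q_{\rm mix}(z|x)}\right] = \frac{1}{S}\sum_{s=1}^S \mathrm{KL}\left(q_{\phi_s} \,\|\, q_{\rm mix}\right) = \mathrm{JSD}(\mathcal{Q}_S). \end{aligned}$$

The range $[0, \log S]$ follows because each KL term is nonnegative and $q_{\rm mix}(z|x) \geq q_{\phi_s}(z|x)/S$ pointwise, so $\mathrm{KL}(q_{\phi_s} \,\|\, q_{\rm mix}) \leq \log S$.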

4. Practical Implementation: Deep Ensembles and Efficient Estimators

Classical and Amortized VI

  • Classical VI: Independently optimized variational parameters $\{\phi_s\}$ for each ensemble member, yielding diversely supported proposals.
  • Amortized VI: Multiple independent encoders are trained for a fixed decoder, e.g., in a VAE, by re-initializing each encoder separately and optimizing only encoder parameters, producing a "deep ensemble" with shared generation (Kviman et al., 2022); a minimal training sketch follows below.
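Below is a minimal training sketch of this recipe (assuming PyTorch; the toy two-dimensional data, architectures, and unit-variance Gaussian likelihood are illustrative stand-ins, not the paper's NVAE setup), in which one fixed decoder is paired with $S$ independently re-initialized and independently optimized encoders:

```python
# Minimal sketch of the amortized "deep ensemble" recipe: S encoders trained independently
# against a single fixed decoder. A randomly initialized decoder stands in for a pre-trained one.
import torch
import torch.nn as nn

D_X, D_Z, S = 2, 2, 3                              # data dim, latent dim, ensemble size

decoder = nn.Sequential(nn.Linear(D_Z, 32), nn.Tanh(), nn.Linear(32, D_X))
decoder.requires_grad_(False)                      # p_theta is shared and kept fixed

def make_encoder():
    return nn.Sequential(nn.Linear(D_X, 32), nn.Tanh(), nn.Linear(32, 2 * D_Z))

encoders = [make_encoder() for _ in range(S)]      # independent re-initializations
x = torch.randn(128, D_X)                          # stand-in mini-batch

for enc in encoders:                               # each member optimized on its own ELBO
    opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
    for _ in range(200):
        mu, log_var = enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()        # reparameterization
        log_q = -0.5 * ((z - mu) ** 2 / log_var.exp() + log_var).sum(-1)
        log_prior = -0.5 * (z ** 2).sum(-1)
        log_lik = -0.5 * ((x - decoder(z)) ** 2).sum(-1)              # unit-variance Gaussian
        elbo = (log_lik + log_prior - log_q).mean()                   # additive constants dropped
        opt.zero_grad(); (-elbo).backward(); opt.step()
# At evaluation time, the S encoders define q_mix and are plugged into the MISELBO above.
```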

Black-Box VI and Mixtures

Given a mixture of $A$ components,

$$q_\phi(z|x) = \frac{1}{A} \sum_{a=1}^A q_{\phi_a}(z|x),$$

the naive "All-to-All" (A2A) estimator for MISELBO requires $O(A^2)$ density evaluations:

$$\widetilde{\mathcal{L}}_{\mathrm{A2A}} = \frac{1}{A} \sum_{a=1}^A \log \frac{p_\theta(x, z_a)}{\frac{1}{A} \sum_{a'=1}^A q_{\phi_{a'}}(z_a|x)}, \quad z_a \sim q_{\phi_a}(z|x).$$

To address this inefficiency, sub-sampling based estimators are introduced:

  • Some-to-All (S2A): Samples are drawn from a random subset of $S \leq A$ components while the full mixture is evaluated in the denominator, yielding unbiased estimates at $O(SA)$ cost.
  • Some-to-Some (S2S): The denominator summation is restricted to the same subset, reducing the cost to $O(S^2)$ but inducing a downward bias; its expectation is strictly less than $\mathcal{L}_{\mathrm{MIS}}$ (Hotti et al., 11 Jun 2024).

Complexity Table

Estimator | Cost per data point | Bias
--- | --- | ---
A2A | $A C_p + A^2 C_q$ | Unbiased
S2A | $S C_p + S A C_q$ | Unbiased
S2S | $S C_p + S^2 C_q$ | Downward

Here $C_p$ is the cost of evaluating $\log p_\theta(x, z)$ and $C_q$ the cost of evaluating $\log q_{\phi_a}(z|x)$.

Scaling $A$ is thus practical with S2A/S2S, unlocking the approximation power of large ensembles with minimal parameter overhead, especially with amortized Mixture VAEs (Hotti et al., 11 Jun 2024).
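The following is a minimal sketch contrasting the three estimators (assuming NumPy/SciPy, a hypothetical one-dimensional conjugate-Gaussian model, and a uniform mixture of $A$ unit-variance Gaussian proposal components, none of which come from the papers):

```python
# Minimal sketch contrasting A2A, S2A, and S2S estimators of the MISELBO.
# Assumptions (not from the papers): z ~ N(0,1), x|z ~ N(z,1); mixture proposal with A components.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x, A, S = 1.5, 100, 10                             # mixture size A, sub-sampled size S

mus = rng.normal(0.0, 1.0, size=A)                 # component means of q_{phi_a}

def log_joint(z):
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)

def log_mix(z, idx):
    """Log of the uniform mixture over components idx, evaluated at each z."""
    comps = norm.logpdf(z[None, :], mus[idx][:, None], 1.0)
    return logsumexp(comps, axis=0) - np.log(len(idx))

# A2A: one sample per component, full mixture in the denominator -> O(A^2) q-evaluations.
z_all = rng.normal(mus, 1.0)
a2a = np.mean(log_joint(z_all) - log_mix(z_all, np.arange(A)))

# S2A: samples from a random subset, still the full mixture denominator -> O(SA), unbiased.
sub = rng.choice(A, size=S, replace=False)
z_sub = rng.normal(mus[sub], 1.0)
s2a = np.mean(log_joint(z_sub) - log_mix(z_sub, np.arange(A)))

# S2S: denominator restricted to the sub-sampled components -> O(S^2), downward biased per the text.
s2s = np.mean(log_joint(z_sub) - log_mix(z_sub, sub))

print(f"A2A {a2a:.3f}   S2A {s2a:.3f}   S2S {s2s:.3f}")
```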

5. Empirical Performance

Density Estimation

  • Toy multimodal densities: Ensembles of $S=2$ variational approximations capture all modes, yielding substantially lower KL divergence to the true density compared to a single ELBO or IWELBO (Kviman et al., 2022).
  • MNIST with NVAE (Nouveau VAE): MISELBO with $S=2$, $L=1000$ achieves NLL $\approx 77.77$ versus $78.21$ for a single-model IWELBO, a $0.44$ nat improvement. MISELBO with $L=50$ outperforms IWELBO with $L=1000$ (90% fewer samples), and further gains are seen with $S=3$ (Kviman et al., 2022). In settings with $A=800$ (with $S=1$ S2A), NLL $\approx 74.07$ is achieved with only a slight increase in compute (Hotti et al., 11 Jun 2024).
  • FashionMNIST and robust likelihoods: S2A/S2S allow continued NLL improvement as the mixture size grows, and S2S keeps wall-clock time constant irrespective of $A$ (Hotti et al., 11 Jun 2024).

Bayesian Phylogenetics

  • VBPI-NF and VBPI-mixtures evaluated on six and eight real datasets show consistent substantial gains (0.3–0.6 nat improvements in marginal log-likelihood) and up to a $4\times$ reduction in inference runtime via S2A while matching prior ELBOs (Kviman et al., 2022, Hotti et al., 11 Jun 2024).

6. Connections to Importance Sampling and Broader Context

MISELBO is derived by applying the balance-heuristic (deterministic mixture) weights from MIS theory (Veach & Guibas 1995; Elvira et al. 2019), which are optimal for reducing estimator variance when using diverse proposals. This connection elucidates how advances in the theory of importance sampling, such as adaptive proposal learning and variance reduction schemes, can be ported to VI via the MISELBO formalism. The gap between standard ensemble ELBOs and MISELBO is exactly the Jensen–Shannon divergence, providing an operational measure of mixture utility (Kviman et al., 2022). Following this perspective, recent works have introduced efficient amortized mixture learning (MISVAE) and highly scalable mixture estimators (S2A, S2S), increasing the practical impact on state-of-the-art density estimation and black-box variational inference (Hotti et al., 11 Jun 2024).

7. Summary and Significance

MISELBO provides a rigorous unification of variational inference with multiple importance sampling, enabling the deployment of large, diverse ensembles (including deep ensembles in amortized VI) and offering principled estimator constructions whose statistical efficiency and tightness are quantifiable. Empirically, MISELBO delivers improved bounds, sample efficiency, and estimation speed across a range of models and tasks, including density estimation (MNIST, FashionMNIST), and Bayesian phylogenetic inference. By incorporating advances such as S2A and S2S estimators, MISELBO enables scalable generalization to large mixture families with minimal computational penalty and lays methodological groundwork for further integration of importance sampling theory with VI (Kviman et al., 2022, Hotti et al., 11 Jun 2024).
