MISELBO: Multiple Importance Sampling ELBO
- The paper demonstrates that MISELBO yields a strictly tighter lower bound than the average of the ensemble members' ELBOs (whenever the members differ), with the gap quantified exactly by the Jensen–Shannon divergence across members.
- It employs a mixture of variational proposals with balance-heuristic weights, together with estimators (A2A, S2A, S2S) that trade off computational cost against bias, in both classical and amortized inference settings.
- Empirical evaluations on benchmarks such as MNIST and Bayesian phylogenetics show significant improvements in negative log-likelihood and runtime efficiency.
The Multiple Importance Sampling Evidence Lower Bound (MISELBO) is a framework for variational inference (VI) that leverages ensembles of independently trained variational approximations to provide a strictly tighter lower bound on the marginal log-likelihood than the average of the members' individual bounds. MISELBO arises by combining the variational proposals into a mixture and applying multiple importance sampling weights, yielding improved bound tightness, variance reduction, and empirical gains in density estimation and Bayesian inference tasks in both classical and amortized VI contexts (Kviman et al., 2022, Hotti et al., 11 Jun 2024).
1. Background: Variational Bounds and Ensembles
In VI, the marginal log-likelihood of an observed variable $x$ under a latent variable model $p_\theta(x, z)$ is typically lower bounded by the evidence lower bound (ELBO):

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] =: \mathcal{L}(\theta, \phi).$$

The importance-weighted ELBO (IWELBO) generalizes this using $L$ samples:

$$\log p_\theta(x) \geq \mathbb{E}_{z_{1:L} \sim q_\phi(z|x)}\left[\log \frac{1}{L}\sum_{l=1}^{L} \frac{p_\theta(x, z_l)}{q_\phi(z_l|x)}\right].$$

Both bounds suffer limitations when the variational approximation is unimodal or otherwise mismatched to the true posterior, motivating the use of ensembles of variational distributions and mixture proposals (Kviman et al., 2022, Hotti et al., 11 Jun 2024).
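As a concrete illustration (not from the papers), the following minimal Python sketch estimates the ELBO ($L=1$) and IWELBO by Monte Carlo for a one-dimensional conjugate Gaussian model where $\log p_\theta(x)$ is known in closed form; the model, the deliberately mismatched Gaussian proposal, and all variable names are illustrative assumptions.

```python
# Minimal sketch: ELBO and IWELBO Monte Carlo estimates for a 1-D conjugate
# Gaussian model, where the exact log marginal likelihood is available.
import numpy as np

rng = np.random.default_rng(0)

# Model: z ~ N(0, 1), x | z ~ N(z, 1); observe a single x. Marginal is N(0, 2).
x = 1.5
log_px = -0.5 * (np.log(2 * np.pi * 2.0) + x**2 / 2.0)

# A (deliberately mismatched) Gaussian variational approximation q(z|x) = N(mu, sigma^2).
mu, sigma = 0.5, 1.0

def log_joint(z):
    return -0.5 * (np.log(2 * np.pi) + z**2) - 0.5 * (np.log(2 * np.pi) + (x - z) ** 2)

def log_q(z):
    return -0.5 * (np.log(2 * np.pi * sigma**2) + (z - mu) ** 2 / sigma**2)

def iwelbo(L, n_rep=20000):
    z = rng.normal(mu, sigma, size=(n_rep, L))     # L importance samples per replicate
    logw = log_joint(z) - log_q(z)                 # log importance weights
    # log (1/L) sum_l w_l, averaged over replicates (L = 1 recovers the ELBO)
    return np.mean(np.logaddexp.reduce(logw, axis=1) - np.log(L))

print(f"log p(x)      = {log_px:.4f}")
print(f"ELBO   (L=1)  = {iwelbo(1):.4f}")
print(f"IWELBO (L=10) = {iwelbo(10):.4f}")
```

Both estimates stay below $\log p_\theta(x)$, with the IWELBO tightening as $L$ grows.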
2. Definition and Derivation of MISELBO
Suppose an ensemble of $S$ independently inferred variational approximations $\{q_{\phi_s}(z|x)\}_{s=1}^{S}$, with mixture proposal

$$\bar{q}_{\Phi}(z|x) = \frac{1}{S}\sum_{s=1}^{S} q_{\phi_s}(z|x).$$

MISELBO replaces the individual proposal in IWELBO by the mixture, yielding (for $L$ samples per member):

$$\mathcal{L}^{\mathrm{MIS}} = \frac{1}{S}\sum_{s=1}^{S} \mathbb{E}_{z_{s,1:L} \sim q_{\phi_s}(z|x)}\left[\log \frac{1}{L}\sum_{l=1}^{L} \frac{p_\theta(x, z_{s,l})}{\frac{1}{S}\sum_{j=1}^{S} q_{\phi_j}(z_{s,l}|x)}\right].$$

For single samples ($L = 1$), this reduces to:

$$\mathcal{L}^{\mathrm{MIS}} = \frac{1}{S}\sum_{s=1}^{S} \mathbb{E}_{z_s \sim q_{\phi_s}(z|x)}\left[\log \frac{p_\theta(x, z_s)}{\bar{q}_{\Phi}(z_s|x)}\right].$$

This formulation is directly aligned with the deterministic mixture weights, or balance heuristic, of the multiple importance sampling (MIS) literature and generalizes to mixture families in variational inference settings (Kviman et al., 2022, Hotti et al., 11 Jun 2024).
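Continuing the same illustrative toy model as above (not the papers' architectures), a minimal sketch of the MISELBO estimator with an ensemble of $S = 3$ Gaussian proposals; the member means and scales are arbitrary stand-ins for independently trained approximations, and SciPy's `logsumexp` is assumed available.

```python
# Minimal sketch of the MISELBO estimator with L samples per ensemble member.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x = 1.5

def log_joint(z):                      # z ~ N(0,1), x|z ~ N(z,1)
    return -0.5 * (np.log(2 * np.pi) + z**2) - 0.5 * (np.log(2 * np.pi) + (x - z) ** 2)

def log_normal(z, mu, sigma):
    return -0.5 * (np.log(2 * np.pi * sigma**2) + (z - mu) ** 2 / sigma**2)

# Ensemble of S "independently trained" proposals q_{phi_s}(z|x) = N(mu_s, sigma_s^2).
mus    = np.array([0.3, 0.7, 1.0])
sigmas = np.array([0.8, 1.0, 1.2])
S = len(mus)

def miselbo(L, n_rep=20000):
    total = 0.0
    for s in range(S):                                        # outer average over members
        z = rng.normal(mus[s], sigmas[s], size=(n_rep, L))    # z_{s,1:L} ~ q_{phi_s}
        # log of the mixture density (1/S) sum_j q_{phi_j}(z|x) at every sample
        log_mix = logsumexp(
            np.stack([log_normal(z, mus[j], sigmas[j]) for j in range(S)]), axis=0
        ) - np.log(S)
        logw = log_joint(z) - log_mix                         # balance-heuristic log weights
        total += np.mean(logsumexp(logw, axis=1) - np.log(L)) # IW average over L samples
    return total / S

print(f"MISELBO (S={S}, L=1)  = {miselbo(1):.4f}")
print(f"MISELBO (S={S}, L=10) = {miselbo(10):.4f}")
```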
3. Tightness and Theoretical Guarantees
Let $\bar{\mathcal{L}}$ denote the average of the per-member IWELBOs. For $L = 1$:

$$\mathcal{L}^{\mathrm{MIS}} = \bar{\mathcal{L}} + \mathrm{JSD}(q_{\phi_1}, \dots, q_{\phi_S}),$$

where $\mathrm{JSD}$ is the (generalized) Jensen–Shannon divergence across the ensemble members. Thus, MISELBO is provably tighter than the average ELBO unless all proposals coincide, and the gap directly quantifies their diversity. For $L > 1$, empirical results uniformly find $\mathcal{L}^{\mathrm{MIS}} \geq \bar{\mathcal{L}}$. Moreover,

$$\mathcal{L}^{\mathrm{MIS}} \leq \log p_\theta(x),$$

and $\mathcal{L}^{\mathrm{MIS}} \to \log p_\theta(x)$ as $L \to \infty$. Thus, MISELBO supplies a valid, monotonically tightening lower bound whose improvement is quantified by the JSD (Kviman et al., 2022).
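The $L = 1$ identity can be seen by splitting the log-ratio inside the MISELBO expectation; the average KL divergence from each member to the mixture is exactly the generalized Jensen–Shannon divergence with uniform weights:

$$\begin{aligned}
\mathcal{L}^{\mathrm{MIS}}
&= \frac{1}{S}\sum_{s=1}^{S} \mathbb{E}_{q_{\phi_s}}\left[\log \frac{p_\theta(x,z)}{q_{\phi_s}(z|x)} + \log \frac{q_{\phi_s}(z|x)}{\bar{q}_{\Phi}(z|x)}\right] \\
&= \bar{\mathcal{L}} + \frac{1}{S}\sum_{s=1}^{S} \mathrm{KL}\left(q_{\phi_s}(z|x)\,\|\,\bar{q}_{\Phi}(z|x)\right)
= \bar{\mathcal{L}} + \mathrm{JSD}(q_{\phi_1}, \dots, q_{\phi_S}).
\end{aligned}$$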
4. Practical Implementation: Deep Ensembles and Efficient Estimators
Classical and Amortized VI
- Classical VI: Independently optimized variational parameters for each ensemble member, yielding diversely supported proposals.
- Amortized VI: Multiple independent encoders are trained for a fixed decoder, e.g., in a VAE, by re-initializing each encoder separately and optimizing only the encoder parameters, producing a "deep ensemble" with shared generation (Kviman et al., 2022); see the sketch below.
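A minimal PyTorch sketch of this deep-ensemble construction, assuming a small MLP encoder/decoder with a Bernoulli likelihood rather than the papers' NVAE setup; all module and variable names, dimensions, and the single training step are illustrative.

```python
# Minimal sketch: a "deep ensemble" of independently initialized encoders
# optimized against one shared, frozen decoder.
import torch
import torch.nn as nn

D, H, Z, S = 784, 200, 16, 3   # data dim, hidden units, latent dim, ensemble size

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D, H), nn.Tanh(), nn.Linear(H, 2 * Z))
    def forward(self, x):
        mu, log_var = self.net(x).chunk(2, dim=-1)
        return mu, log_var

decoder = nn.Sequential(nn.Linear(Z, H), nn.Tanh(), nn.Linear(H, D))  # shared, frozen
for p in decoder.parameters():
    p.requires_grad_(False)

encoders = [Encoder() for _ in range(S)]   # independent initializations -> diverse members

def elbo(encoder, x):
    """Single-member ELBO with a standard-normal prior and Bernoulli likelihood."""
    mu, log_var = encoder(x)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()        # reparameterized sample
    log_px_z = -nn.functional.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="none").sum(-1)
    kl = 0.5 * (mu**2 + log_var.exp() - 1.0 - log_var).sum(-1)   # KL(q || N(0, I))
    return (log_px_z - kl).mean()

# Each encoder is optimized independently against the same frozen decoder.
x = torch.rand(64, D).bernoulli()                                # stand-in for an MNIST batch
for enc in encoders:
    opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
    loss = -elbo(enc, x)
    opt.zero_grad(); loss.backward(); opt.step()                 # one illustrative step
```

At evaluation time the $S$ encoders define the mixture proposal entering MISELBO, while generation is shared through the single decoder.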
Black-Box VI and Mixtures
Given a mixture of $S$ components,

$$\bar{q}_{\Phi}(z|x) = \frac{1}{S}\sum_{s=1}^{S} q_{\phi_s}(z|x), \qquad \Phi = \{\phi_1, \dots, \phi_S\},$$

the naive "All-to-All" (A2A) estimator of MISELBO draws samples from every member and scores each sample under all $S$ component densities in the mixture denominator, requiring $\mathcal{O}(S^2)$ density evaluations per data point. To address this inefficiency, sub-sampling based estimators are introduced (see the code sketch after the list):
- Some-to-All (S2A): only a subset of $S' < S$ members is sampled, while the full mixture is kept in the denominator, yielding unbiased estimates at $\mathcal{O}(S'S)$ cost.
- Some-to-Some (S2S): the denominator summation is also restricted to the subset, reducing the cost to $\mathcal{O}(S'^2)$ but inducing a downward bias; its expectation is strictly less than the full MISELBO (Hotti et al., 11 Jun 2024).
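The following sketch (same illustrative toy model as above; the uniform subsampling of a subset of size $S'$ is an assumption about the general scheme, not the papers' exact implementation) contrasts how the three estimators select samples and construct the mixture denominator for $L = 1$.

```python
# Minimal sketch contrasting A2A, S2A and S2S estimators of MISELBO (L = 1).
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x = 1.5

def log_joint(z):                       # z ~ N(0,1), x|z ~ N(z,1)
    return -0.5 * (np.log(2 * np.pi) + z**2) - 0.5 * (np.log(2 * np.pi) + (x - z) ** 2)

def log_normal(z, mu, sigma):
    return -0.5 * (np.log(2 * np.pi * sigma**2) + (z - mu) ** 2 / sigma**2)

mus    = np.linspace(-1.0, 2.0, 8)      # S = 8 illustrative mixture members
sigmas = np.full(8, 1.0)
S, S_sub, n_rep = 8, 2, 10000

def estimate(mode):
    vals = []
    for _ in range(n_rep):
        # A2A uses every member; S2A and S2S draw a uniform subset of size S_sub
        idx = np.arange(S) if mode == "A2A" else rng.choice(S, size=S_sub, replace=False)
        # A2A and S2A keep the full mixture in the denominator; S2S truncates it to the subset
        denom_idx = np.arange(S) if mode in ("A2A", "S2A") else idx
        member_terms = []
        for s in idx:
            z = rng.normal(mus[s], sigmas[s])                   # one sample per chosen member
            log_mix = logsumexp([log_normal(z, mus[j], sigmas[j]) for j in denom_idx]) \
                - np.log(len(denom_idx))
            member_terms.append(log_joint(z) - log_mix)
        vals.append(np.mean(member_terms))
    return np.mean(vals)

for mode in ("A2A", "S2A", "S2S"):
    print(f"{mode}: {estimate(mode):.4f}")  # S2A matches A2A in expectation; S2S is biased down
```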
Complexity Table
| Estimator | Cost per Data Point | Bias |
|---|---|---|
| A2A | $\mathcal{O}(S^2)$ | Unbiased |
| S2A | $\mathcal{O}(S'S)$ | Unbiased |
| S2S | $\mathcal{O}(S'^2)$ | Downward |
Here costs are counted in evaluations of a single component density $q_{\phi_s}(z|x)$, and $S' \leq S$ is the size of the sampled subset of components used by S2A and S2S.
Scaling to large mixtures is thus practical with S2A/S2S, unlocking the approximation power of large ensembles with minimal parameter overhead, especially with amortized mixture VAEs such as MISVAE (Hotti et al., 11 Jun 2024).
5. Empirical Performance
Density Estimation
- Toy multimodal densities: Ensembles of variational approximations capture all modes, yielding substantially lower KL divergence to the true density than a single approximation trained with the ELBO or IWELBO (Kviman et al., 2022).
- MNIST with NVAE (Nouveau VAE): MISELBO over an ensemble of NVAE encoders achieves an NLL $0.44$ nats better than the $78.21$ of a single-model IWELBO. MISELBO also outperforms IWELBO while using 90% fewer importance samples, with further gains at larger ensemble sizes (Kviman et al., 2022). Evaluating larger mixtures with S2A yields additional NLL reductions at only a slight increase in compute (Hotti et al., 11 Jun 2024).
- FashionMNIST and robust likelihoods: S2A/S2S allow continued NLL improvement as the mixture size grows, and S2S maintains essentially constant wall-clock time irrespective of $S$ (Hotti et al., 11 Jun 2024).
Bayesian Phylogenetics
- VBPI-NF and VBPI-Mixtures evaluated on six and eight real datasets show consistent gains of $0.3$–$0.6$ nats in marginal log-likelihood, and S2A yields large reductions in inference runtime while matching previously reported ELBOs (Kviman et al., 2022, Hotti et al., 11 Jun 2024).
6. Connections to Importance Sampling and Broader Context
MISELBO is derived by applying the balance-heuristic (deterministic mixture) weights from MIS theory (Veach & Guibas, 1995; Elvira et al., 2019), which come with strong variance-reduction guarantees when combining diverse proposals. This connection elucidates how advances in the theory of importance sampling, such as adaptive proposal learning and variance reduction schemes, can be ported to VI via the MISELBO formalism. The gap between the average of the ensemble members' ELBOs and MISELBO is exactly the Jensen–Shannon divergence, providing an operational measure of mixture utility (Kviman et al., 2022). Following this perspective, recent works have introduced efficient amortized mixture learning (MISVAE) and highly scalable mixture estimators (S2A, S2S), increasing the practical impact on state-of-the-art density estimation and black-box variational inference (Hotti et al., 11 Jun 2024).
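To make the importance-sampling connection explicit, the balance-heuristic MIS estimator underlying MISELBO is an unbiased estimator of the marginal likelihood,

$$\hat{p}_\theta(x) = \frac{1}{S}\sum_{s=1}^{S} \frac{1}{L}\sum_{l=1}^{L} \frac{p_\theta(x, z_{s,l})}{\bar{q}_{\Phi}(z_{s,l}|x)}, \qquad z_{s,l} \sim q_{\phi_s}(z|x), \qquad \mathbb{E}\left[\hat{p}_\theta(x)\right] = p_\theta(x),$$

and MISELBO is obtained by taking the logarithm of each member's inner importance-sample average before averaging over members, so that Jensen's inequality delivers the lower-bound property.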
7. Summary and Significance
MISELBO provides a rigorous unification of variational inference with multiple importance sampling, enabling the deployment of large, diverse ensembles (including deep ensembles in amortized VI) and offering principled estimator constructions whose statistical efficiency and tightness are quantifiable. Empirically, MISELBO delivers improved bounds, sample efficiency, and estimation speed across a range of models and tasks, including density estimation (MNIST, FashionMNIST), and Bayesian phylogenetic inference. By incorporating advances such as S2A and S2S estimators, MISELBO enables scalable generalization to large mixture families with minimal computational penalty and lays methodological groundwork for further integration of importance sampling theory with VI (Kviman et al., 2022, Hotti et al., 11 Jun 2024).