
Importance-Weighted ELBO (IWELBO)

Updated 8 December 2025
  • Importance-Weighted ELBO is a variational inference objective that leverages K-sample Monte Carlo importance sampling to yield increasingly tight lower bounds on the marginal log-likelihood.
  • The method systematically reduces the gap to the true log-likelihood, with the difference decaying as O(1/K), enhancing the accuracy of latent variable models.
  • IWELBO underpins advanced models like Variational Autoencoders and deep state-space models, with extensions such as VR-IWAE and Hierarchical IWELBO further optimizing inference.

The Importance-Weighted Evidence Lower Bound (IWELBO), or importance-weighted ELBO, is a core variational inference objective that generalizes the classic ELBO through Monte Carlo importance sampling. By replacing the single-sample variational expectation in the standard ELBO with a $K$-sample unbiased importance-sampling estimator, IWELBO yields a sequence of provably tighter lower bounds on the marginal log-likelihood as $K$ increases. This methodology is foundational in modern learning of latent variable models, including Variational Autoencoders (VAEs), deep state-space models, and deep Gaussian processes, and it serves as the central objective of Importance Weighted Autoencoders (IWAE).

1. Definition and Jensen-Based Derivation

Let $x$ denote observed data and $z$ latent variables. For a probabilistic model $p(x, z)$ and a variational approximation $q(z|x)$, draw $K$ i.i.d. samples $z_1, \dots, z_K \sim q(z|x)$ and define the importance weights $w_k = p(x, z_k)/q(z_k|x)$. The $K$-sample importance-weighted ELBO is then

$$L_K(x) := \mathbb{E}_{z_{1:K}\sim q} \left[ \log\left( \frac{1}{K} \sum_{k=1}^K w_k \right) \right].$$

The lower-bound property follows directly from Jensen's inequality, exploiting the concavity of the logarithm and the unbiasedness of the importance-sampling estimator of $p(x)$. That is,

$$\log p(x) = \log \mathbb{E}_{q(z|x)}[w(z)] \geq \mathbb{E}_{z_{1:K}\sim q} \left[ \log \left( \frac{1}{K} \sum_{k=1}^K w_k \right) \right] = L_K(x).$$

For $K = 1$, $L_1(x)$ recovers the classical ELBO. As $K \to \infty$, $L_K(x) \to \log p(x)$ from below, yielding monotonic tightening with increasing $K$ (Domke et al., 2018).
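
In code, the bound is typically evaluated from per-sample log-weights with a numerically stable log-mean-exp. The following is a minimal Python sketch, assuming user-supplied callables `sample_q`, `log_joint`, and `log_q` (illustrative names, not a specific library API):

```python
import math
import torch

def iwelbo(x, sample_q, log_joint, log_q, K=10):
    """K-sample IWELBO: L_K(x) = E[ log (1/K) sum_k p(x, z_k) / q(z_k|x) ].

    Assumed interfaces (all user-supplied):
      sample_q(x, K)  -> z of shape (K, ...), K samples from q(z|x)
      log_joint(x, z) -> shape (K,), log p(x, z_k)
      log_q(z, x)     -> shape (K,), log q(z_k|x)
    """
    z = sample_q(x, K)
    log_w = log_joint(x, z) - log_q(z, x)                # per-sample log importance weights
    return torch.logsumexp(log_w, dim=0) - math.log(K)   # stable log-mean-exp of the weights
```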

2. Theoretical Properties and Tightness

The IWELBO sequence satisfies $L_1(x) \leq L_2(x) \leq \cdots \leq \log p(x)$. The gap $\log p(x) - L_K(x)$ decays asymptotically as $O(1/K)$, governed by the variance of the single-sample importance weight $R = p(x, z)/q(z|x)$:

$$\log p(x) - L_K(x) \approx \frac{\operatorname{Var}[R]}{2\,p(x)^2} \cdot \frac{1}{K}, \quad K \to \infty.$$

The bound attains equality with $\log p(x)$ if and only if $q(z|x)$ recovers the true posterior $p(z|x)$, i.e., the weight $R$ is almost surely constant (Domke et al., 2018, Finke et al., 2019, Rainforth et al., 2018).
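
The $O(1/K)$ decay can be checked numerically on a toy conjugate Gaussian model where $\log p(x)$ is available in closed form; the model, the mismatched proposal, and all numbers below are illustrative assumptions, not taken from the cited papers:

```python
# Toy check of the O(1/K) gap: z ~ N(0, 1), x | z ~ N(z, 1), so log p(x) = log N(x; 0, 2).
# The proposal q(z|x) = N(x/2, 1) is deliberately over-dispersed relative to the true
# posterior N(x/2, 1/2), so the bound is strictly loose for every finite K.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.5
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))          # exact log-marginal

def iwelbo_mc(K, n_outer=100_000):
    """Monte Carlo estimate of L_K(x) = E[ log (1/K) sum_k w_k ]."""
    z = rng.normal(loc=x / 2, scale=1.0, size=(n_outer, K))    # z ~ q(z|x)
    log_w = (norm.logpdf(z, loc=0.0, scale=1.0)                # log p(z)
             + norm.logpdf(x, loc=z, scale=1.0)                # + log p(x|z)
             - norm.logpdf(z, loc=x / 2, scale=1.0))           # - log q(z|x)
    return np.mean(logsumexp(log_w, axis=1) - np.log(K))

for K in [1, 2, 4, 8, 16, 32]:
    gap = log_px - iwelbo_mc(K)
    print(f"K={K:2d}  gap={gap:.4f}  K*gap={K * gap:.4f}")     # K*gap should level off
```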

3. Augmented Variational Inference Interpretation

IWELBO is naturally reinterpreted as augmented variational inference in a product space of $K$ samples. Specifically, it minimizes a joint KL divergence between an "augmented" proposal $q_K(z_{1:K}|x)$, constructed via sampling and normalized importance weighting, and a corresponding product joint $p_K(z_{1:K}, x) = p(x, z_1) \prod_{i=2}^K q(z_i|x)$. The decomposition is

$$\log p(x) = L_K(x) + \mathrm{KL}\left[q_K \,\|\, p_K\right].$$

Thus, maximizing $L_K$ corresponds to minimizing the joint divergence over the $K$-sample augmented space. The marginal of $z_1$ under $q_K$ approximates the self-normalized importance-sampling posterior, clarifying the precise "variational gap" and the remaining looseness (Domke et al., 2018, Cremer et al., 2017).

4. Gradient Estimation and Signal-to-Noise Ratio

Optimization of IWELBO with respect to both generative ($\theta$) and inference ($\phi$) parameters employs two principal gradient estimators:

(a) Pathwise (Reparameterization) Estimator:

If $z_j = T(\epsilon_j; \phi)$, with $\epsilon_j$ independent noise, then

$$\nabla_\phi L_K(x) = \mathbb{E}_{\epsilon_{1:K}} \left[ \sum_{j=1}^K \widetilde w_j \, \nabla_\phi \log w_j \right], \qquad \widetilde w_j = \frac{w_j}{\sum_{i=1}^K w_i}.$$
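
A hedged sketch of this estimator for a diagonal Gaussian $q_\phi(z|x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, with `log_joint` again an assumed callable returning $\log p_\theta(x, z_j)$ per sample; backpropagating through the log-mean-exp reproduces the self-normalized weights $\widetilde w_j$ automatically:

```python
import math
import torch

def iwelbo_pathwise(x, mu, log_sigma, log_joint, K=10):
    """Reparameterized K-sample bound; its gradient w.r.t. (mu, log_sigma)
    realizes sum_j w~_j * grad_phi log w_j via autodiff through logsumexp."""
    eps = torch.randn(K, *mu.shape)                        # eps_j ~ N(0, I)
    sigma = torch.exp(log_sigma)
    z = mu + sigma * eps                                   # z_j = T(eps_j; phi)
    log_q = torch.distributions.Normal(mu, sigma).log_prob(z).sum(-1)
    log_w = log_joint(x, z) - log_q                        # shape (K,)
    return torch.logsumexp(log_w, dim=0) - math.log(K)

# usage: loss = -iwelbo_pathwise(x, mu, log_sigma, log_joint); loss.backward()
```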

(b) Score-Function/Pathwise Hybrid and Variance Issues:

The signal-to-noise ratio (SNR) of the gradient estimator with respect to $\phi$ decays as $O(1/\sqrt{K})$, while for $\theta$ it improves as $O(\sqrt{K})$ (Rainforth et al., 2018, Finke et al., 2019, Daudel et al., 15 Oct 2024). Notably, as $K$ increases, the expected $\phi$-gradient shrinks toward zero faster than its standard deviation, implying an SNR collapse that can hinder amortized inference.

Variance Reduction Techniques:

Doubly-reparameterized (DReG) estimators (Finke et al., 2019, Daudel et al., 15 Oct 2024), combination objectives (CIWAE), and multiply-IS estimators (MIWAE, PIWAE) mitigate SNR degradation, with DReG gradients eliminating high-variance score-function terms and maintaining stable updates for large $K$.
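
The sketch below illustrates the common implementation pattern for a DReG-style surrogate under the same illustrative Gaussian-encoder setup as above (detach the variational parameters inside $\log q$ and square the detached normalized weights); it is a sketch of the technique, not the reference implementation of any particular paper:

```python
import math
import torch

def iwelbo_dreg_surrogate(x, mu, log_sigma, log_joint, K=10):
    """Surrogate whose phi-gradient matches the DReG estimator
    sum_j w~_j^2 * (d log w_j / d z_j) * (d z_j / d phi)."""
    eps = torch.randn(K, *mu.shape)
    sigma = torch.exp(log_sigma)
    z = mu + sigma * eps                                    # reparameterized samples
    # Detach the variational parameters inside log q so gradients reach phi only
    # through the sample path z, removing the high-variance score-function term.
    q_stop = torch.distributions.Normal(mu.detach(), sigma.detach())
    log_w = log_joint(x, z) - q_stop.log_prob(z).sum(-1)
    w_tilde = torch.softmax(log_w, dim=0).detach()          # self-normalized weights
    return (w_tilde ** 2 * log_w).sum()

# usage: (-iwelbo_dreg_surrogate(x, mu, log_sigma, log_joint)).backward() updates the
# inference network; the generative parameters are typically trained on the plain bound.
```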

5. Extensions: VR-IWAE, Hierarchical IWELBO, and Deep Ensembles

VR-IWAE:

IWELBO is recovered as a special case ($\alpha = 0$) of the more general VR-IWAE bound:

$$L_{K,\alpha} = \frac{1}{1-\alpha}\, \mathbb{E}_{z_{1:K}} \left[ \log \left( \frac{1}{K}\sum_{j=1}^K w_j^{1-\alpha} \right) \right].$$

VR-IWAE interpolates between IWAE (for $\alpha = 0$), Rényi-VI (for $K = 1$), and the ELBO (as $\alpha \to 1$), providing a continuous bias-variance trade-off and restoring SNR scaling of $O(\sqrt{K})$ for $\phi$ when $\alpha > 0$ (Daudel et al., 15 Oct 2024).
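
A minimal sketch of the VR-IWAE objective computed from the same per-sample log-weights as in the earlier sketches; the $\alpha \to 1$ limit (the ELBO) would need separate handling:

```python
import math
import torch

def vr_iwae(log_w, alpha=0.5):
    """VR-IWAE bound L_{K,alpha} from per-sample log-weights log_w of shape (K,).

    alpha = 0 recovers the K-sample IWELBO; values in (0, 1) trade bound
    tightness for gradient SNR; alpha = 1 (the ELBO limit) is excluded here.
    """
    K = log_w.shape[0]
    scaled = (1.0 - alpha) * log_w
    return (torch.logsumexp(scaled, dim=0) - math.log(K)) / (1.0 - alpha)
```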

Hierarchical IWELBO:

Introducing structured correlation between the $K$ samples (via a meta-latent $z_0$) induces negative correlation among the importance weights, further reducing estimator variance and speeding up convergence to $\log p(x)$. This approach, called Hierarchical IWAE (H-IWAE), empirically outperforms i.i.d. proposals in density estimation and exhibits strictly superior estimator variance properties (Huang et al., 2019).

Multiple-IS ELBO (MISELBO) and Deep Ensembles:

An ensemble of variational approximations, coordinated with multiple importance sampling, further tightens the bound compared to the average IWELBO or ELBO. This approach, MISELBO, achieves consistently better test log-likelihoods in high-dimensional image and phylogenetic inference tasks, leveraging proposal diversity as quantified by the Jensen–Shannon divergence between proposal ensembles (Kviman et al., 2022).
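
A hedged sketch of an ensemble bound in the spirit of MISELBO, assuming each proposal object exposes `sample(x, K)` and `log_prob(z, x)` (an illustrative interface, not the authors' code): each proposal's weights are computed against the uniform mixture of all $S$ proposals, and the resulting $K$-sample bounds are averaged.

```python
import math
import torch

def ensemble_iwelbo(x, proposals, log_joint, K=10):
    """Multiple-IS ensemble bound: for each proposal q_s, the importance weights use
    the mixture (1/S) sum_j q_j(z|x) as the denominator; the S bounds are averaged."""
    S = len(proposals)
    total = 0.0
    for q_s in proposals:
        z = q_s.sample(x, K)                                         # (K, ...) from q_s(z|x)
        log_mix = torch.logsumexp(
            torch.stack([q_j.log_prob(z, x) for q_j in proposals]), dim=0
        ) - math.log(S)                                              # log of the proposal mixture
        log_w = log_joint(x, z) - log_mix
        total = total + torch.logsumexp(log_w, dim=0) - math.log(K)
    return total / S
```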

6. Applications and Practical Recommendations

IWELBO-based objectives are widely adopted in VAEs, deep Markov models, deep Gaussian processes, and semi-supervised generative models. For deep sequential models such as the deep Kalman filter, extending IWELBO (IW-DKF) yields substantial improvements in test log-likelihood and state estimation accuracy, empirically reducing RMSE by 30–40% for practical $K = 5$–$10$, at moderate additional computational cost (Calatrava et al., 2023).

Algorithmic Considerations:

  • Use moderate $K$ (5–20) to balance bound tightness, computational overhead, and gradient SNR.
  • Employ DReG gradients or VR-IWAE with $\alpha > 0$ to maintain reliable learning signals for the inference network.
  • Consider hierarchical proposals or ensembles to maximize estimator efficiency, especially for complex or high-dimensional latent spaces.
  • For semi-supervised VAEs, importance-weighted objectives (such as PIWO/SSPIWO) enable fine-grained control of the balance between observed and unobserved latent variable inference (Felhi et al., 2020).

7. Limitations, Open Challenges, and Future Directions

While IWELBO achieves a monotonic tightening toward the true log-marginal and provides practical improvements in variational inference, several limitations persist:

  • The decay of the $\phi$-gradient SNR with $K$ necessitates variance reduction or VR-IWAE/DReG alternatives for stable amortized inference (Rainforth et al., 2018, Daudel et al., 15 Oct 2024).
  • In high-dimensional models, $K$ must grow rapidly to avoid "weight collapse," making naive importance sampling impractical unless combined with heavy-tailed or adaptive proposals (Domke et al., 2018).
  • The choice of variational family (e.g., elliptical vs. Gaussian) and proposal diversity critically impacts bound tightness, variance, and convergence (Domke et al., 2018, Kviman et al., 2022).
  • The interplay between bound tightness, training stability, and downstream task performance remains nontrivial—tighter bounds are not always optimal for inference network learning or amortized posteriors (Rainforth et al., 2018, Finke et al., 2019).

Continued research focuses on adaptive multiple-IS, hybrid surrogate objectives, controlled use of non-i.i.d. proposals, and enhanced proposal families for scalability and robustness in deep probabilistic models.
