
Importance-Weighted ELBO (IWELBO)

Updated 8 December 2025
  • Importance-Weighted ELBO is a variational inference objective that leverages K-sample Monte Carlo importance sampling to yield increasingly tight lower bounds on the marginal log-likelihood.
  • The method systematically reduces the gap to the true log-likelihood, with the difference decaying as O(1/K), enhancing the accuracy of latent variable models.
  • IWELBO underpins advanced models like Variational Autoencoders and deep state-space models, with extensions such as VR-IWAE and Hierarchical IWELBO further optimizing inference.

The Importance-Weighted Evidence Lower Bound (IWELBO), or importance-weighted ELBO, is a core variational inference objective that generalizes the classic ELBO through Monte Carlo importance sampling. By replacing the single-sample variational expectation in the standard ELBO with a $K$-sample unbiased importance-sampling estimator, IWELBO yields a sequence of provably tighter lower bounds on the marginal log-likelihood as $K$ increases. This methodology is foundational in modern learning of latent variable models, including Variational Autoencoders (VAEs), deep state-space models, and deep Gaussian processes, and it serves as the central objective of Importance Weighted Autoencoders (IWAE).

1. Definition and Jensen-Based Derivation

Let $x$ denote observed data and $z$ latent variables. For a probabilistic model $p(x, z)$ and a variational approximation $q(z|x)$, draw $K$ i.i.d. samples $z_1, \dots, z_K \sim q(z|x)$ and define the importance weights $w_k = p(x, z_k)/q(z_k|x)$. The $K$-sample importance-weighted ELBO is then

$$L_K(x) := \mathbb{E}_{z_{1:K}\sim q} \left[ \log\left( \frac{1}{K} \sum_{k=1}^K w_k \right) \right].$$

The lower-bound property follows directly from Jensen's inequality, exploiting the concavity of the logarithm and the unbiasedness of the importance-sampling estimator of $p(x)$. That is,

$$\log p(x) = \log \mathbb{E}_{q(z|x)}[w(z)] \geq \mathbb{E}_{z_{1:K}\sim q} \left[ \log \left( \frac{1}{K} \sum_{k=1}^K w_k \right) \right] = L_K(x).$$

For $K = 1$, $L_1(x)$ recovers the classical ELBO. As $K \to \infty$, $L_K(x) \to \log p(x)$ from below, yielding monotonic tightening with increasing $K$ (Domke et al., 2018).
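
In code, the bound is typically evaluated from per-sample log-weights with a numerically stable log-mean-exp. The following is a minimal Python sketch, assuming user-supplied callables `sample_q`, `log_joint`, and `log_q` (illustrative names, not a specific library API):

```python
import math
import torch

def iwelbo(x, sample_q, log_joint, log_q, K=10):
    """K-sample IWELBO: L_K(x) = E[ log (1/K) sum_k p(x, z_k) / q(z_k|x) ].

    Assumed interfaces (all user-supplied):
      sample_q(x, K)  -> z of shape (K, ...), K samples from q(z|x)
      log_joint(x, z) -> shape (K,), log p(x, z_k)
      log_q(z, x)     -> shape (K,), log q(z_k|x)
    """
    z = sample_q(x, K)
    log_w = log_joint(x, z) - log_q(z, x)                # per-sample log importance weights
    return torch.logsumexp(log_w, dim=0) - math.log(K)   # stable log-mean-exp of the weights
```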

2. Theoretical Properties and Tightness

The IWELBO sequence satisfies $L_1(x) \leq L_2(x) \leq \cdots \leq \log p(x)$. The gap $\log p(x) - L_K(x)$ decays asymptotically as $O(1/K)$, governed by the variance of the single-sample importance weight $R = p(x, z)/q(z|x)$:

$$\log p(x) - L_K(x) \approx \frac{\operatorname{Var}[R]}{2\,p(x)^2} \cdot \frac{1}{K}, \quad K \to \infty.$$

The bound attains equality with $\log p(x)$ if and only if $q(z|x)$ recovers the true posterior $p(z|x)$, i.e., the weight $R$ is almost surely constant (Domke et al., 2018, Finke et al., 2019, Rainforth et al., 2018).
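
The $O(1/K)$ decay can be checked numerically on a toy conjugate Gaussian model where $\log p(x)$ is available in closed form; the model, the mismatched proposal, and all numbers below are illustrative assumptions, not taken from the cited papers:

```python
# Toy check of the O(1/K) gap: z ~ N(0, 1), x | z ~ N(z, 1), so log p(x) = log N(x; 0, 2).
# The proposal q(z|x) = N(x/2, 1) is deliberately over-dispersed relative to the true
# posterior N(x/2, 1/2), so the bound is strictly loose for every finite K.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.5
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))          # exact log-marginal

def iwelbo_mc(K, n_outer=100_000):
    """Monte Carlo estimate of L_K(x) = E[ log (1/K) sum_k w_k ]."""
    z = rng.normal(loc=x / 2, scale=1.0, size=(n_outer, K))    # z ~ q(z|x)
    log_w = (norm.logpdf(z, loc=0.0, scale=1.0)                # log p(z)
             + norm.logpdf(x, loc=z, scale=1.0)                # + log p(x|z)
             - norm.logpdf(z, loc=x / 2, scale=1.0))           # - log q(z|x)
    return np.mean(logsumexp(log_w, axis=1) - np.log(K))

for K in [1, 2, 4, 8, 16, 32]:
    gap = log_px - iwelbo_mc(K)
    print(f"K={K:2d}  gap={gap:.4f}  K*gap={K * gap:.4f}")     # K*gap should level off
```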

3. Augmented Variational Inference Interpretation

IWELBO is naturally reinterpreted as augmented variational inference in a product space of $K$ samples. Specifically, it minimizes a joint KL divergence between an "augmented" proposal $q_K(z_{1:K}|x)$, constructed via sampling and normalized importance weighting, and a corresponding product joint $p_K(z_{1:K}, x) = p(x, z_1) \prod_{i=2}^K q(z_i|x)$. The decomposition is

$$\log p(x) = L_K(x) + \mathrm{KL}\left[q_K \,\|\, p_K\right].$$

Thus, maximizing $L_K$ corresponds to minimizing the joint divergence over the $K$-sample augmented space. The marginal of $z_1$ under $q_K$ approximates the self-normalized importance-sampling posterior, clarifying the precise "variational gap" and the remaining looseness (Domke et al., 2018, Cremer et al., 2017).

4. Gradient Estimation and Signal-to-Noise Ratio

Optimization of IWELBO with respect to both generative ($\theta$) and inference ($\phi$) parameters employs two principal gradient estimators:

(a) Pathwise (Reparameterization) Estimator:

If $z_j = T(\epsilon_j; \phi)$, with $\epsilon_j$ independent noise, then

$$\nabla_\phi L_K(x) = \mathbb{E}_{\epsilon_{1:K}} \left[ \sum_{j=1}^K \widetilde w_j \, \nabla_\phi \log w_j \right], \qquad \widetilde w_j = \frac{w_j}{\sum_{i=1}^K w_i}.$$
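
A hedged sketch of this estimator for a diagonal Gaussian $q_\phi(z|x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, with `log_joint` again an assumed callable returning $\log p_\theta(x, z_j)$ per sample; backpropagating through the log-mean-exp reproduces the self-normalized weights $\widetilde w_j$ automatically:

```python
import math
import torch

def iwelbo_pathwise(x, mu, log_sigma, log_joint, K=10):
    """Reparameterized K-sample bound; its gradient w.r.t. (mu, log_sigma)
    realizes sum_j w~_j * grad_phi log w_j via autodiff through logsumexp."""
    eps = torch.randn(K, *mu.shape)                        # eps_j ~ N(0, I)
    sigma = torch.exp(log_sigma)
    z = mu + sigma * eps                                   # z_j = T(eps_j; phi)
    log_q = torch.distributions.Normal(mu, sigma).log_prob(z).sum(-1)
    log_w = log_joint(x, z) - log_q                        # shape (K,)
    return torch.logsumexp(log_w, dim=0) - math.log(K)

# usage: loss = -iwelbo_pathwise(x, mu, log_sigma, log_joint); loss.backward()
```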

(b) Score-Function/Pathwise Hybrid and Variance Issues:

The signal-to-noise ratio (SNR) of the gradient estimator with respect to $\phi$ decays as $O(1/\sqrt{K})$, while for $\theta$ it improves as $O(\sqrt{K})$ (Rainforth et al., 2018, Finke et al., 2019, Daudel et al., 15 Oct 2024). Notably, as $K$ increases, the expected $\phi$-gradient shrinks toward zero faster than its standard deviation, implying an SNR collapse that can hinder amortized inference.

Variance Reduction Techniques:

Doubly-reparameterized (DReG) estimators (Finke et al., 2019, Daudel et al., 15 Oct 2024), combination objectives (CIWAE), and multiply-IS estimators (MIWAE, PIWAE) mitigate SNR degradation, with DReG gradients eliminating high-variance score-function terms and maintaining stable updates for large $K$.
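
The sketch below illustrates the common implementation pattern for a DReG-style surrogate under the same illustrative Gaussian-encoder setup as above (detach the variational parameters inside $\log q$ and square the detached normalized weights); it is a sketch of the technique, not the reference implementation of any particular paper:

```python
import math
import torch

def iwelbo_dreg_surrogate(x, mu, log_sigma, log_joint, K=10):
    """Surrogate whose phi-gradient matches the DReG estimator
    sum_j w~_j^2 * (d log w_j / d z_j) * (d z_j / d phi)."""
    eps = torch.randn(K, *mu.shape)
    sigma = torch.exp(log_sigma)
    z = mu + sigma * eps                                    # reparameterized samples
    # Detach the variational parameters inside log q so gradients reach phi only
    # through the sample path z, removing the high-variance score-function term.
    q_stop = torch.distributions.Normal(mu.detach(), sigma.detach())
    log_w = log_joint(x, z) - q_stop.log_prob(z).sum(-1)
    w_tilde = torch.softmax(log_w, dim=0).detach()          # self-normalized weights
    return (w_tilde ** 2 * log_w).sum()

# usage: (-iwelbo_dreg_surrogate(x, mu, log_sigma, log_joint)).backward() updates the
# inference network; the generative parameters are typically trained on the plain bound.
```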

5. Extensions: VR-IWAE, Hierarchical IWELBO, and Deep Ensembles

VR-IWAE:

IWELBO is recovered as a special case ($\alpha = 0$) of the more general VR-IWAE bound:

$$L_{K,\alpha} = \frac{1}{1-\alpha}\, \mathbb{E}_{z_{1:K}} \left[ \log \left( \frac{1}{K}\sum_{j=1}^K w_j^{1-\alpha} \right) \right].$$

VR-IWAE interpolates between IWAE (for $\alpha = 0$), Rényi-VI (for $K = 1$), and the ELBO (as $\alpha \to 1$), providing a continuous bias-variance trade-off and restoring SNR scaling of $O(\sqrt{K})$ for $\phi$ when $\alpha > 0$ (Daudel et al., 15 Oct 2024).
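
A minimal sketch of the VR-IWAE objective computed from the same per-sample log-weights as in the earlier sketches; the $\alpha \to 1$ limit (the ELBO) would need separate handling:

```python
import math
import torch

def vr_iwae(log_w, alpha=0.5):
    """VR-IWAE bound L_{K,alpha} from per-sample log-weights log_w of shape (K,).

    alpha = 0 recovers the K-sample IWELBO; values in (0, 1) trade bound
    tightness for gradient SNR; alpha = 1 (the ELBO limit) is excluded here.
    """
    K = log_w.shape[0]
    scaled = (1.0 - alpha) * log_w
    return (torch.logsumexp(scaled, dim=0) - math.log(K)) / (1.0 - alpha)
```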

Hierarchical IWELBO:

Introducing structured correlation between the $K$ samples (via a meta-latent $z_0$) induces negative correlation among the importance weights, further reducing estimator variance and speeding up convergence to $\log p(x)$. This approach, called Hierarchical IWAE (H-IWAE), empirically outperforms i.i.d. proposals in density estimation and exhibits strictly superior estimator variance properties (Huang et al., 2019).

Multiple-IS ELBO (MISELBO) and Deep Ensembles:

An ensemble of variational approximations, coordinated with multiple importance sampling, further tightens the bound compared to the average IWELBO or ELBO. This approach, MISELBO, achieves consistently better test log-likelihoods in high-dimensional image and phylogenetic inference tasks, leveraging proposal diversity as quantified by the Jensen–Shannon divergence between proposal ensembles (Kviman et al., 2022).
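
A hedged sketch of an ensemble bound in the spirit of MISELBO, assuming each proposal object exposes `sample(x, K)` and `log_prob(z, x)` (an illustrative interface, not the authors' code): each proposal's weights are computed against the uniform mixture of all $S$ proposals, and the resulting $K$-sample bounds are averaged.

```python
import math
import torch

def ensemble_iwelbo(x, proposals, log_joint, K=10):
    """Multiple-IS ensemble bound: for each proposal q_s, the importance weights use
    the mixture (1/S) sum_j q_j(z|x) as the denominator; the S bounds are averaged."""
    S = len(proposals)
    total = 0.0
    for q_s in proposals:
        z = q_s.sample(x, K)                                         # (K, ...) from q_s(z|x)
        log_mix = torch.logsumexp(
            torch.stack([q_j.log_prob(z, x) for q_j in proposals]), dim=0
        ) - math.log(S)                                              # log of the proposal mixture
        log_w = log_joint(x, z) - log_mix
        total = total + torch.logsumexp(log_w, dim=0) - math.log(K)
    return total / S
```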

6. Applications and Practical Recommendations

IWELBO-based objectives are widely adopted in VAEs, deep Markov models, deep Gaussian processes, and semi-supervised generative models. For deep sequential models such as the deep Kalman filter, extending IWELBO (IW-DKF) yields substantial improvements in test log-likelihood and state estimation accuracy, empirically reducing RMSE by 30–40% for practical $K = 5$–$10$, at moderate additional computational cost (Calatrava et al., 2023).

Algorithmic Considerations:

  • Use moderate $K$ (5–20) to balance bound tightness, computational overhead, and gradient SNR.
  • Employ DReG gradients or VR-IWAE with $\alpha > 0$ to maintain reliable learning signals for the inference network.
  • Consider hierarchical proposals or ensembles to maximize estimator efficiency, especially for complex or high-dimensional latent spaces.
  • For semi-supervised VAEs, importance-weighted objectives (such as PIWO/SSPIWO) enable fine-grained control of the balance between observed and unobserved latent variable inference (Felhi et al., 2020).

7. Limitations, Open Challenges, and Future Directions

While IWELBO achieves a monotonic tightening toward the true log-marginal and provides practical improvements in variational inference, several limitations persist:

  • The decay of the $\phi$-gradient SNR with $K$ necessitates variance reduction or VR-IWAE/DReG alternatives for stable amortized inference (Rainforth et al., 2018, Daudel et al., 15 Oct 2024).
  • In high-dimensional models, $K$ must grow rapidly to avoid "weight collapse," making naive importance sampling impractical unless combined with heavy-tailed or adaptive proposals (Domke et al., 2018).
  • The choice of variational family (e.g., elliptical vs. Gaussian) and proposal diversity critically impacts bound tightness, variance, and convergence (Domke et al., 2018, Kviman et al., 2022).
  • The interplay between bound tightness, training stability, and downstream task performance remains nontrivial—tighter bounds are not always optimal for inference network learning or amortized posteriors (Rainforth et al., 2018, Finke et al., 2019).

Continued research focuses on adaptive multiple-IS, hybrid surrogate objectives, controlled use of non-i.i.d. proposals, and enhanced proposal families for scalability and robustness in deep probabilistic models.
