Importance Weighted Autoencoders (IWAE)

Updated 7 June 2026

Importance Weighted Autoencoders (IWAE) extend VAEs by using multiple samples to yield a provably tighter lower bound on the marginal log-likelihood.
They employ efficient reparameterization gradient estimators like REP and DReG to optimize deep latent variable models effectively.
While IWAE improves generative performance and posterior expressiveness, it encounters trade-offs such as gradient noise and weight collapse in high dimensions.

An Importance-Weighted Autoencoder (IWAE) is a generalization of the Variational Autoencoder (VAE) that yields a provably tighter lower bound on the marginal log-likelihood by leveraging the principles of importance sampling within variational inference. By increasing the number of importance samples $K$, IWAEs encourage inference networks to approximate more expressive, multi-modal posterior distributions and improve generative modeling performance under the standard variational framework. The IWAE methodology has produced significant advances in both theoretical understanding and practical applications of deep generative models, has inspired numerous algorithmic generalizations, and has motivated new approaches to variance reduction, high-dimensional approximate inference, and robust autoencoding.

1. Foundations: IWAE Objective and Statistical Interpretation

Given an observed datum $x$, latent variable $z$, generative model $p_\theta(x, z)$, and proposal $q_\phi(z\,|\,x)$, the goal is to maximize the intractable marginal log-likelihood $\log p_\theta(x) = \log \int p_\theta(x, z)\,dz$. The standard one-sample VAE ELBO is

$\mathcal{L}_{\rm ELBO}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[ \log p_\theta(x,z) - \log q_\phi(z|x) ].$

Burda et al. (2015) introduced the IWAE bound, which, for $K>1$ i.i.d. samples $z_{1:K} \sim q_\phi(z|x)$, is

$\mathcal{L}_{\rm IWAE}^K(\theta, \phi; x) = \mathbb{E}_{z_{1:K} \sim q_\phi} \left[ \log \left( \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \right) \right ].$

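This bound satisfies $\mathcal{L}_{\rm ELBO} = \mathcal{L}_{\rm IWAE}^{1} \le \mathcal{L}_{\rm IWAE}^{K} \le \mathcal{L}_{\rm IWAE}^{K+1} \le \log p_\theta(x)$, converging monotonically to the exact log-marginal as $K \to \infty$ provided the importance weights are bounded (Burda et al., 2015). The upper bound follows from Jensen's inequality, because the average importance weight is an unbiased estimator of $p_\theta(x)$.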
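The ordering of the bounds can be checked numerically on a model where $\log p_\theta(x)$ is known. The following is a minimal sketch, not code from the cited papers: it assumes a toy model $p(z) = \mathcal{N}(0, 1)$, $p(x\,|\,z) = \mathcal{N}(z, 1)$ (so that $p(x) = \mathcal{N}(0, 2)$) and a deliberately mismatched proposal $q(z\,|\,x) = \mathcal{N}(\mu, \sigma^2)$. All function names are illustrative.

```python
import math
import random

def log_normal(v, mean, var):
    """Log-density of a univariate Gaussian N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)

def log_weight(x, z, mu, sigma):
    """log w = log p(x, z) - log q(z | x) for the assumed toy model."""
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    return log_joint - log_normal(z, mu, sigma ** 2)

def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def iwae_bound(x, mu, sigma, K, n_rep, rng):
    """Monte Carlo estimate of the K-sample bound, averaged over n_rep draws of z_{1:K}."""
    total = 0.0
    for _ in range(n_rep):
        log_w = [log_weight(x, rng.gauss(mu, sigma), mu, sigma) for _ in range(K)]
        total += log_mean_exp(log_w)
    return total / n_rep

rng = random.Random(0)
x, mu, sigma = 1.0, 0.0, 1.0          # q is deliberately not the true posterior
exact = log_normal(x, 0.0, 2.0)       # closed-form log p(x) for this toy model
bounds = {K: iwae_bound(x, mu, sigma, K, 4000, rng) for K in (1, 5, 50)}
for K, value in bounds.items():
    print(f"K={K:3d}  bound ~ {value:.4f}   log p(x) = {exact:.4f}")
```

With $K = 1$ the estimate is the ordinary ELBO; as $K$ grows the estimates should rise toward the closed-form value. The log-sum-exp form of `log_mean_exp` avoids underflow when individual weights are tiny.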

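The key property is that the IWAE objective is identical to the standard ELBO evaluated under a random, nonparametric, importance-weighted implicit posterior $\tilde{q}_{\rm IW}(z\,|\,x, z_{2:K})$ that becomes more expressive as $K$ increases (Cremer et al., 2017):

$\tilde{q}_{\rm IW}(z\,|\,x, z_{2:K}) = \frac{p_\theta(x, z)}{\frac{1}{K} \left( \frac{p_\theta(x, z)}{q_\phi(z|x)} + \sum_{k=2}^{K} \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \right)}.$

Averaging this distribution over the auxiliary samples $z_{2:K} \sim q_\phi(z|x)$ yields $q_{\rm EW}(z\,|\,x) = \mathbb{E}_{z_{2:K}}[\tilde{q}_{\rm IW}(z\,|\,x, z_{2:K})]$, and the ELBO evaluated under $q_{\rm EW}$ is at least as tight as both the IWAE and VAE objectives.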

2. Algorithms: Gradient Estimation, Training, and Generalizations

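Efficient unbiased gradient estimators for IWAE can be constructed using the reparameterization trick. With $\epsilon_k \sim p(\epsilon)$ and $z_k = g_\phi(\epsilon_k, x)$, the stochastic gradient estimator for the bound with respect to both $\theta$ and $\phi$ is

$\nabla_{\theta, \phi} \mathcal{L}_{\rm IWAE}^K \approx \sum_{k=1}^{K} \tilde{w}_k \, \nabla_{\theta, \phi} \log w_k,$

where $w_k = p_\theta(x, z_k) / q_\phi(z_k|x)$ and $\tilde{w}_k = w_k / \sum_{j=1}^{K} w_j$. In practice, implementations draw the $K$ samples in parallel, compute the log-weights, stabilize the average with log-sum-exp, and optimize over mini-batches (Burda et al., 2015, Dieng et al., 2019).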

Two main unbiased gradient estimators exist for IWAE and its generalizations (Daudel et al., 2024):

Reparameterized (REP) estimator: weights each $\nabla \log w_k$ by the normalized importance weight $\tilde{w}_k = w_k / \sum_j w_j$, where $w_k = p_\theta(x, z_k) / q_\phi(z_k|x)$.
Doubly reparameterized (DReG) estimator: scales as $\tilde{w}_k^2$ and eliminates high-variance score terms, resulting in lower-variance and unbiased gradients for $\phi$ (Finke et al., 2019). DReG estimators are now standard in many applications.
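The difference between the two weightings can be made concrete on a one-dimensional example. The sketch below is illustrative only and rests on assumptions not in the source: a toy model $p(z) = \mathcal{N}(0, 1)$, $p(x\,|\,z) = \mathcal{N}(z, 1)$, a Gaussian proposal $q(z\,|\,x) = \mathcal{N}(\mu, \sigma^2)$ reparameterized as $z = \mu + \sigma\epsilon$, and hand-derived derivatives with respect to $\mu$. The function names are hypothetical.

```python
import math
import random

def log_normal(v, mean, var):
    """Log-density of a univariate Gaussian N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)

def normalized_weights(log_w):
    """Self-normalized importance weights, computed stably from log-weights."""
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def grad_mu_estimators(x, mu, sigma, K, rng):
    """One draw of the REP and DReG estimators of the gradient of the K-sample bound in mu."""
    z = [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(K)]
    log_w = [log_normal(zk, 0.0, 1.0) + log_normal(x, zk, 1.0)
             - log_normal(zk, mu, sigma ** 2) for zk in z]
    w_tilde = normalized_weights(log_w)
    # Total derivative of log w_k in mu along z = mu + sigma * eps: log q(z_k | x)
    # depends only on eps_k on that path, so only log p(x, z_k) contributes.
    dlogw_dmu = [(x - zk) - zk for zk in z]
    # Partial derivative of log w_k in z_k with mu held fixed (and dz/dmu = 1).
    dlogw_dz = [(x - zk) - zk + (zk - mu) / sigma ** 2 for zk in z]
    rep = sum(wt * g for wt, g in zip(w_tilde, dlogw_dmu))
    dreg = sum(wt ** 2 * g for wt, g in zip(w_tilde, dlogw_dz))
    return rep, dreg

rng = random.Random(0)
x = 1.0
# At the exact posterior N(x/2, 1/2), log w is constant in z, so DReG vanishes.
rep_opt, dreg_opt = grad_mu_estimators(x, x / 2.0, math.sqrt(0.5), 8, rng)
print("at the exact posterior:", rep_opt, dreg_opt)
```

At the exact posterior the DReG draw is zero up to rounding, while the REP draw is still a noisy quantity whose expectation is zero. Away from the optimum, averaging many draws of either estimator should give approximately the same value, reflecting that both are unbiased.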

IWAE extends seamlessly to settings where the inference network is implicit (i.e., $q_\phi(z\,|\,x)$ is a sampler without a tractable density), by using density-ratio estimation via adversarial techniques (Importance-Weighted Adversarial VAEs) (Im et al., 2019). Further extensions exploit graphical model factorization via Tensor Monte Carlo (TMC), exponentially increasing sample combinations to obtain much tighter bounds in deep latent hierarchies with only moderate additional cost (Aitchison, 2018).

3. Signal-to-Noise, High-Dimensional Limits, and Variational Rényi Generalizations

One limitation of IWAE is that, for large $K$, the signal-to-noise ratio (SNR) of the REP gradient estimator for $\phi$ decays as $O(1/\sqrt{K})$: increased bound tightness comes at the cost of noisier gradients, ultimately stalling the optimization of the recognition network (Daudel et al., 2024, Finke et al., 2019). DReG estimators, by contrast, exhibit SNR $O(\sqrt{K})$, providing robust learning for larger $K$.
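The SNR behaviour can be observed empirically by drawing many gradient estimates and dividing the absolute sample mean by the sample standard deviation. The following sketch assumes the same kind of toy setup as is common for such checks ($p(z) = \mathcal{N}(0, 1)$, $p(x\,|\,z) = \mathcal{N}(z, 1)$, $q(z\,|\,x) = \mathcal{N}(\mu, \sigma^2)$, gradient taken in $\mu$ only); it is an illustration, not a reproduction of the cited analyses.

```python
import math
import random
import statistics

def rep_and_dreg(x, mu, sigma, K, rng):
    """One draw of the REP and DReG estimators of the gradient in mu (toy model)."""
    z = [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(K)]
    # log w_k up to an additive constant, which cancels after normalization
    log_w = [-0.5 * zk ** 2 - 0.5 * (x - zk) ** 2 + 0.5 * ((zk - mu) / sigma) ** 2
             for zk in z]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    w_tilde = [v / total for v in w]
    rep = sum(wt * (x - 2.0 * zk) for wt, zk in zip(w_tilde, z))
    dreg = sum(wt ** 2 * (x - 2.0 * zk + (zk - mu) / sigma ** 2)
               for wt, zk in zip(w_tilde, z))
    return rep, dreg

def empirical_snr(samples):
    """|sample mean| divided by the sample standard deviation."""
    return abs(statistics.fmean(samples)) / statistics.stdev(samples)

rng = random.Random(0)
snr = {}
for K in (1, 8, 64):
    draws = [rep_and_dreg(1.0, 0.0, 1.0, K, rng) for _ in range(4000)]
    snr[K] = (empirical_snr([d[0] for d in draws]),
              empirical_snr([d[1] for d in draws]))
    print(f"K={K:3d}  SNR(REP) ~ {snr[K][0]:.3f}   SNR(DReG) ~ {snr[K][1]:.3f}")
```

In this setup the REP column should shrink as $K$ grows while the DReG column should not, which is the qualitative pattern described above. The estimates are themselves noisy, so only the trend across widely separated values of $K$ is meaningful.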

High-dimensional settings (latent dimension $d \to \infty$) introduce "weight collapse": the variance of the log-importance weights grows, causing a single weight $w_k$ to dominate, collapsing the bound back to the ELBO regardless of $K$ unless $K$ grows exponentially with dimension (Daudel et al., 2022, Daudel et al., 2024). This effect caps the achievable improvement of IWAE in high-dimensional latent spaces, motivating alternative computational strategies (e.g., TMC (Aitchison, 2018)).
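Weight collapse is easy to reproduce in a stand-in problem. The sketch below assumes a target $\mathcal{N}(0, I_d)$ and a slightly too wide proposal $\mathcal{N}(0, s^2 I_d)$, both chosen purely for illustration, and reports the average largest normalized weight among $K$ samples as $d$ grows.

```python
import math
import random

def normalized_weights(log_w):
    """Self-normalized importance weights, computed stably from log-weights."""
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def mean_max_weight(d, K, s, n_trials, rng):
    """Average largest normalized weight for target N(0, I_d), proposal N(0, s^2 I_d)."""
    acc = 0.0
    for _ in range(n_trials):
        log_w = []
        for _ in range(K):
            sq = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))   # ||eps||^2
            # log N(z; 0, I) - log N(z; 0, s^2 I) evaluated at z = s * eps
            log_w.append(d * math.log(s) - 0.5 * (s ** 2 - 1.0) * sq)
        acc += max(normalized_weights(log_w))
    return acc / n_trials

rng = random.Random(0)
collapse = {d: mean_max_weight(d, K=64, s=1.5, n_trials=50, rng=rng)
            for d in (1, 10, 100)}
for d, value in collapse.items():
    print(f"d={d:4d}  mean max normalized weight ~ {value:.3f}")
```

Because the log-weight is a sum of $d$ independent terms, its variance grows linearly in $d$. With $K$ fixed, the largest normalized weight moves from roughly $1/K$ toward 1, at which point the $K$-sample average is effectively a single-sample estimate.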

Generalizations of IWAE via $\alpha$-divergence objectives, known as the VR-IWAE bounds, interpolate between IWAE ( $\alpha = 0$ ) and the ELBO ( $\alpha \to 1$ ):

$\mathcal{L}_{\rm VR\text{-}IWAE}^{K, \alpha}(\theta, \phi; x) = \frac{1}{1-\alpha} \mathbb{E}_{z_{1:K} \sim q_\phi} \left[ \log \left( \frac{1}{K} \sum_{k=1}^{K} \left( \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \right)^{1-\alpha} \right) \right ].$

For $\alpha \in (0, 1)$, VR-IWAE offers a continuum of bias-variance tradeoffs. Choosing larger $\alpha$ gives higher gradient SNR for $\phi$ (scaling as $\sqrt{K}$) at the expense of introducing bias (Daudel et al., 2022, Daudel et al., 2024, Jiang et al., 2026). DReG estimators remain effective in these settings.
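The interpolation can be verified directly on a fixed set of log-weights. The helper below is a minimal sketch of the single-draw integrand of the VR-IWAE bound; the function name and the example log-weights are hypothetical.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def vr_iwae_estimate(log_w, alpha):
    """Single-draw estimate of the VR-IWAE bound from K log-weights, for 0 <= alpha < 1."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    scale = 1.0 - alpha
    return log_mean_exp([scale * v for v in log_w]) / scale

log_w = [-1.0, -2.0, -0.5, -3.0]            # hypothetical log-weights, K = 4
iwae = vr_iwae_estimate(log_w, 0.0)          # alpha = 0 gives the IWAE integrand
mid = vr_iwae_estimate(log_w, 0.5)
near_elbo = vr_iwae_estimate(log_w, 0.999)   # alpha close to 1
elbo = sum(log_w) / len(log_w)               # K-sample average of log w
print(iwae, mid, near_elbo, elbo)
```

For any fixed set of weights the estimate is non-increasing in $\alpha$ (a power-mean inequality), so the four printed values should appear in decreasing order, with the last two nearly equal.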

4. Variance Reduction, Hierarchical, and Geometric Approaches

Several variance reduction and gradient stabilization techniques have emerged:

"Sticking the Landing" (STL) heuristically omits score-function terms to reduce gradient variance at the cost of bias (Finke et al., 2019).
Hierarchical IWAE (H-IWAE) introduces structured, negatively correlated proposals, further reducing variance and maintaining bound tightness as $K$ increases (Huang et al., 2019).
Optimal transport and geometric methods: Formulating the optimization of the IWELBO or VR-IWAE on the Bures–Wasserstein manifold for Gaussians yields gradient estimators whose SNR provably does not decay with $K$, and offers improved robustness and mass covering in multimodal settings (Jiang et al., 2026).

5. Applications and Empirical Insights

IWAEs and their extensions underpin state-of-the-art results in deep latent generative modeling:

Experiments on MNIST and OMNIGLOT show that IWAE (with $K > 1$) improves test log-likelihood and learns richer, higher-dimensional latent representations than VAEs ( $K = 1$ ) (Burda et al., 2015, Dieng et al., 2019).
In discrete latent models, continuous relaxations enable IWAE training with Boltzmann machine priors, outperforming earlier discrete VAEs (Vahdat et al., 2018).
In multiple imputation of missing-not-at-random data, IWAE's robust multi-sample objective yields improved accuracy versus single-sample VAE methods (Lim et al., 2021).
Adversarial and implicit inference variants (IW-AVAE, IW-AAE) yield strong generative models and effective posterior estimation, even when the density $q_\phi(z\,|\,x)$ is intractable (Im et al., 2019).
For MCMC-variational hybrids, annealed importance sampling recovers IWAE when no intermediate annealing steps are used and strictly generalizes it otherwise, bridging VI and MCMC (Ding et al., 2019).

Empirical studies consistently find that moderate $K$ yields most of the practical gains, with little benefit from larger $K$ due to increased cost and gradient noise (Burda et al., 2015, Dieng et al., 2019, Daudel et al., 2024).

6. Limitations, Best Practices, and Ongoing Research

IWAE's main limitations stem from signal-to-noise degradation at large $K$ for standard REP estimators, the collapse in high latent dimension, and the persistent amortization gap due to limited expressivity in inference networks (Dieng et al., 2019, Daudel et al., 2024, Cremer et al., 2017). Recent work recommends:

Using moderate $K$ (2–16) to balance tightness and stability.
Employing DReG or Wasserstein-based gradients to avoid SNR decay (Daudel et al., 2024, Jiang et al., 2026).
Adopting more flexible $q_\phi(z\,|\,x)$ (e.g., normalizing flows) to minimize weight variance and delay collapse.
Monitoring empirical SNR and reverting to ELBO-like regimes in high dimension or when collapse is observed.
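One inexpensive way to watch for collapse during training is to track the effective sample size (ESS) of the normalized weights alongside any SNR estimate. The sketch below uses the Kish ESS, $1 / \sum_k \tilde{w}_k^2$, as a proxy; the threshold `min_fraction` and both function names are hypothetical choices for illustration, not recommendations from the cited work.

```python
import math

def effective_sample_size(log_w):
    """Kish effective sample size 1 / sum_k w_tilde_k^2, which lies between 1 and K."""
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return 1.0 / sum((v / total) ** 2 for v in w)

def looks_collapsed(log_w, min_fraction=0.1):
    """Flag a set of K log-weights whose ESS is below min_fraction * K (hypothetical threshold)."""
    return effective_sample_size(log_w) < min_fraction * len(log_w)

healthy = [-1.0, -1.1, -0.9, -1.05] * 4      # 16 similar log-weights
degenerate = [0.0] + [-30.0] * 15            # one weight dominates the other 15
print(effective_sample_size(healthy), looks_collapsed(healthy))
print(effective_sample_size(degenerate), looks_collapsed(degenerate))
```

An ESS near $K$ indicates that all samples contribute to the bound, while an ESS near 1 indicates that the $K$-sample objective is behaving like a single-sample one, which is the situation in which falling back to an ELBO-like regime is suggested above.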

Exploration of connections to $\alpha$-divergence VI, adaptive importance sampling (e.g., AISLE (Finke et al., 2019)), hierarchical and geometric inference, and further advances in variational objective design remain the subject of current research.

7. Summary Table: Core IWAE Variational Bounds

Bound Type	Formula	Limiting Case
ELBO	$\mathbb{E}_{z \sim q_\phi(z|x)}[ \log p_\theta(x,z) - \log q_\phi(z|x) ]$	IWAE with $K = 1$
IWAE	$\mathbb{E}_{z_{1:K} \sim q_\phi} \left[ \log \left( \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \right) \right]$	$K = 1$: ELBO, $K \to \infty$: $\log p_\theta(x)$
VR-IWAE ( $\alpha \in [0, 1)$ )	$\frac{1}{1-\alpha} \mathbb{E}_{z_{1:K} \sim q_\phi} \left[ \log \left( \frac{1}{K} \sum_{k=1}^{K} \left( \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \right)^{1-\alpha} \right) \right]$	$\alpha = 0$: IWAE, $\alpha \to 1$: ELBO

The IWAE framework, its extensions, and variance-reduced estimators have transformed deep generative modeling by tightening variational bounds, enabling richer posterior approximations, and providing a principled foundation for future developments in probabilistic machine learning (Burda et al., 2015, Daudel et al., 2022, Daudel et al., 2024, Jiang et al., 2026).


References (13)

1. Importance Weighted Autoencoders (2015)
2. Reinterpreting Importance-Weighted Autoencoders (2017)
3. Reweighted Expectation Maximization (2019)
4. Learning with Importance Weighted Variational Inference: Asymptotics for Gradient Estimators of the VR-IWAE Bound (2024)
5. On importance-weighted autoencoders (2019)
6. Importance Weighted Adversarial Variational Autoencoders for Spike Inference from Calcium Imaging Data (2019)
7. Tensor Monte Carlo: particle methods for the GPU era (2018)
8. Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics (2022)
9. Bures-Wasserstein Importance-Weighted Evidence Lower Bound: Exposition and Applications (2026)
10. Hierarchical Importance Weighted Autoencoders (2019)
11. DVAE#: Discrete Variational Autoencoders with Relaxed Boltzmann Priors (2018)
12. Unsupervised Imputation of Non-ignorably Missing Data Using Importance-Weighted Autoencoders (2021)
13. Learning Deep Generative Models with Annealed Importance Sampling (2019)
