Variational Rejection Sampling (VRS)
- Variational Rejection Sampling is a hybrid method that refines variational proposals with an acceptance-rejection mechanism to better approximate complex posterior distributions.
- The algorithm employs both soft and hard acceptance thresholds alongside α-divergence objectives to balance computational cost and approximation fidelity.
- Empirical results show improved log-likelihood and reduced gradient variance, demonstrating VRS’s efficacy in enhancing latent variable model inference.
Variational Rejection Sampling (VRS) is a hybrid framework for approximate inference that refines a parametric variational proposal by integrating a rejection sampling mechanism. The approach systematically enhances the fidelity of variational approximations to complex target distributions, particularly in latent-variable models. By combining properties of traditional rejection sampling with variational inference objectives (often α-divergence minimization or ELBO maximization), VRS bridges the gap between efficient proposal construction and principled sample-based correction. Extensions such as Refined α-Divergence Rejection Sampling (α-DRS) and Reparameterized Variational Rejection Sampling (RVRS) further broaden the practical and theoretical impact of this scheme in probabilistic modeling (Grover et al., 2018, Sharma et al., 2019, Jankowiak et al., 2023).
1. Theoretical Foundations and Model Setup
VRS addresses inference for latent variable models, where the joint distribution takes the form $p_\theta(x, z)$, with $x$ observed and $z$ latent. The key challenge is constructing an efficient, high-fidelity posterior approximation $q_\phi(z \mid x)$, especially when the true posterior $p_\theta(z \mid x)$ exhibits complex, multimodal, or heavy-tailed structure that standard variational families fail to capture (Grover et al., 2018, Jankowiak et al., 2023).
The essential idea is to build an improved variational family by "resampling" or "refining" $q_\phi(z \mid x)$ via an accept-reject process. This yields a new density $r(z \mid x) \propto q_\phi(z \mid x)\, a(z \mid x)$, where the acceptance function $a(z \mid x) \in [0, 1]$ depends on both the model and proposal densities, ensuring that as the acceptance criterion is tightened, the refined $r(z \mid x)$ approaches $p_\theta(z \mid x)$ (Grover et al., 2018).
2. Core Algorithm and Mathematical Framework
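The canonical VRS acceptance probability follows the (hard) rejection sampling prescription: $a(z \mid x) = \frac{p_\theta(x, z)}{M\, q_\phi(z \mid x)}$, where $M$ is a normalization constant bounding $p_\theta(x, z) / q_\phi(z \mid x)$. In practice, to circumvent the need for a strict bound, VRS employs a "soft" acceptance threshold $T$, yielding
$$a_{\theta,\phi}(z \mid x, T) = \exp\!\big(-[\,l_{\theta,\phi}(z \mid x, T)\,]_+\big)$$
with $l_{\theta,\phi}(z \mid x, T) = -\log p_\theta(x, z) + \log q_\phi(z \mid x) - T$ and $[\xi]_+ = \log(1 + e^{\xi})$ (softplus). The resulting resampled proposal is then normalized as
$$r_{\theta,\phi}(z \mid x, T) = \frac{q_\phi(z \mid x)\, a_{\theta,\phi}(z \mid x, T)}{Z_R(x, T)}, \qquad Z_R(x, T) = \mathbb{E}_{q_\phi(z \mid x)}\big[a_{\theta,\phi}(z \mid x, T)\big].$$
This procedure ensures a continuous, tunable trade-off between computational cost and closeness to the true posterior (Grover et al., 2018, Jankowiak et al., 2023).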
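The soft accept-reject step can be illustrated on a one-dimensional toy problem. The following is a minimal sketch, not the authors' implementation: the bimodal `log_joint`, the Gaussian proposal, and all function names are hypothetical, and the sign convention for the threshold (larger `T` accepts more, smaller `T` rejects more) is an assumption chosen so that lowering `T` tightens the approximation.

```python
import math
import random

MU, SIGMA = 0.0, 2.0  # hypothetical Gaussian proposal q(z | x) = N(MU, SIGMA^2)

def log_joint(z):
    """Toy unnormalized log p(x, z): two Gaussian bumps at -2 and +2 (assumption)."""
    a = -0.5 * ((z + 2.0) / 0.5) ** 2
    b = -0.5 * ((z - 2.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_q(z):
    """Log density of the Gaussian proposal."""
    return -0.5 * ((z - MU) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2.0 * math.pi))

def accept_prob(z, T):
    """Soft acceptance a = exp(-softplus(l)), with l = -log p + log q - T."""
    l = -log_joint(z) + log_q(z) - T
    softplus = max(l, 0.0) + math.log1p(math.exp(-abs(l)))  # numerically stable
    return math.exp(-softplus)

def sample_refined(T, rng):
    """Draw one sample from the resampled proposal; also return the number of proposals used."""
    n_proposals = 0
    while True:
        n_proposals += 1
        z = rng.gauss(MU, SIGMA)
        if rng.random() < accept_prob(z, T):
            return z, n_proposals

rng = random.Random(0)
for T in (50.0, 0.0, -2.0):
    draws = [sample_refined(T, rng) for _ in range(2000)]
    mean_tries = sum(n for _, n in draws) / len(draws)
    print(f"T={T:+.1f}  mean proposals per accepted sample = {mean_tries:.2f}")
```

With a large threshold almost every proposal is accepted and the samples follow the plain Gaussian; with a small threshold the accepted samples concentrate around the two modes at the price of more proposals per sample.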
3. Relation to α-Divergence and the α-DRS Scheme
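The "Refined α-Divergence Variational Inference via Rejection Sampling" framework (α-DRS) generalizes VRS by introducing Rényi α-divergence as an objective. For $\alpha > 0$, $\alpha \neq 1$, the Rényi divergence between target $p$ and proposal $q$ is
$$D_\alpha(p \,\|\, q) = \frac{1}{\alpha - 1} \log \int p(z)^{\alpha}\, q(z)^{1-\alpha}\, dz = \frac{1}{\alpha - 1} \log \mathbb{E}_{q}\!\left[\left(\frac{p(z)}{q(z)}\right)^{\alpha}\right].$$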
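A Monte Carlo estimate of the Rényi divergence can be checked against a case with a known answer. The sketch below assumes the standard form $D_\alpha(p \,\|\, q) = \frac{1}{\alpha-1}\log \mathbb{E}_q[(p/q)^\alpha]$ and uses two unit-variance Gaussians whose means differ by `delta`, for which the divergence equals $\alpha\,\delta^2/2$. The function names and the test distributions are illustrative choices, not taken from the cited papers.

```python
import math
import random

def log_normal_pdf(z, mu, sigma):
    """Log density of N(mu, sigma^2)."""
    return -0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def renyi_mc(log_p, log_q, sample_q, alpha, n, rng):
    """Estimate D_alpha(p || q) = log E_q[(p/q)^alpha] / (alpha - 1) from n draws of q."""
    terms = []
    for _ in range(n):
        z = sample_q(rng)
        terms.append(alpha * (log_p(z) - log_q(z)))
    m = max(terms)  # log-sum-exp shift for numerical stability
    log_mean = m + math.log(sum(math.exp(t - m) for t in terms) / n)
    return log_mean / (alpha - 1.0)

rng = random.Random(0)
alpha, delta = 2.0, 0.5
estimate = renyi_mc(
    log_p=lambda z: log_normal_pdf(z, delta, 1.0),
    log_q=lambda z: log_normal_pdf(z, 0.0, 1.0),
    sample_q=lambda r: r.gauss(0.0, 1.0),
    alpha=alpha,
    n=100_000,
    rng=rng,
)
closed_form = alpha * delta ** 2 / 2.0  # equal-variance Gaussian case
print(f"Monte Carlo estimate: {estimate:.3f}, closed form: {closed_form:.3f}")
```

The estimator's variance grows quickly with $\alpha$ and with the mismatch between $p$ and $q$, which is one reason large $\alpha$ is handled with care in practice.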
The α-DRS algorithm proceeds in two stages (Sharma et al., 2019):
- Stage 1: Optimize $q_\phi$ by minimizing a Monte Carlo estimate of $D_\alpha(p \,\|\, q_\phi)$.
- Stage 2: Use learned $q_\phi$ and an (approximate) optimal RS constant (or quantile-based surrogate) to perform rejection sampling, generating a refined sample-based approximation.
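The key theoretical link is that as $\alpha \to \infty$, $D_\alpha(p \,\|\, q_\phi) \to \log M^\star(\phi)$, with $M^\star(\phi) = \sup_z p(z) / q_\phi(z)$ the tightest rejection sampling constant for $q_\phi$. Crucially, it is established that the rejection step cannot increase $D_\alpha$, i.e., for the distribution $r$ of accepted samples,
$$D_\alpha(p \,\|\, r) \le D_\alpha(p \,\|\, q_\phi),$$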
which guarantees improvement (or at least non-degradation) in the variational approximation after rejection sampling (Sharma et al., 2019).
4. Variational Objectives, Gradient Estimation, and Reparameterization
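VRS can be cast within a variational inference framework, where the Evidence Lower Bound (ELBO) under the resampled proposal is
$$\log p_\theta(x) \ge \mathbb{E}_{z \sim r_{\theta,\phi}(z \mid x, T)}\big[\log p_\theta(x, z) - \log r_{\theta,\phi}(z \mid x, T)\big].$$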
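The bound under the resampled proposal can be unpacked into terms that are easier to estimate. As a sketch, write the resampled proposal as $r = q_\phi\, a / Z_R$, where $a$ is the acceptance probability and $Z_R = \mathbb{E}_{q_\phi}[a]$ is its mean under the proposal (the symbol $Z_R$ is used here for illustration). Substituting $\log r = \log q_\phi + \log a - \log Z_R$ gives:

$$\begin{aligned}
\mathbb{E}_{r}\big[\log p_\theta(x, z) - \log r(z \mid x, T)\big]
  &= \mathbb{E}_{r}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x) - \log a(z \mid x, T)\big] + \log Z_R(x, T), \\
\log p_\theta(x) - \mathbb{E}_{r}\big[\log p_\theta(x, z) - \log r(z \mid x, T)\big]
  &= \mathrm{KL}\big(r(z \mid x, T) \,\|\, p_\theta(z \mid x)\big) \ge 0.
\end{aligned}$$

The second line is the usual variational gap, so the bound is tight exactly when the resampled proposal matches the true posterior.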
Taking advantage of the structure of the resampled proposal $r$, Grover et al. derive low-variance gradient estimators, involving covariances under the resampled proposal. For reparameterizable base proposals (e.g., Gaussian), Jankowiak & Phan introduce a "pathwise" (low-variance) gradient for the parameters $\phi$ of $q_\phi$ via an identity that combines the (smooth) acceptance probability $a$, a function of the log-density ratios, and the Jacobian from reparameterization (Jankowiak et al., 2023). This estimator exhibits substantially reduced variance compared to REINFORCE-style alternatives, enabling scalable and robust training.
5. Cost–Fidelity Trade-offs and Algorithmic Structure
The expected cost per accepted sample is inversely proportional to the mean acceptance probability, $Z_R(x, T) = \mathbb{E}_{q_\phi(z \mid x)}[a(z \mid x, T)]$: the number of proposals needed per accepted sample is geometrically distributed with mean $1 / Z_R(x, T)$. Lowering the acceptance threshold $T$ tightens the approximation (reducing bias/KL-divergence) but concomitantly reduces $Z_R(x, T)$, thus increasing computational cost. Jankowiak et al. (2023) bound the variational gap $\mathrm{KL}(r \,\|\, p_\theta(z \mid x))$ under heavy-tail assumptions, reinforcing that accuracy improves at the expense of sampling effort. The full algorithm typically involves an inner sample-reject loop embedded within standard SGD updates, with threshold $T$ either dynamically tuned (e.g., to match a target quantile acceptance) or set via theoretical criteria.
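The cost side of this trade-off can be made concrete by tuning the threshold to hit a target mean acceptance rate. The sketch below is an illustrative heuristic (bisection on the threshold), not the quantile rule used in the cited papers. It assumes a sigmoid-form soft acceptance `sigmoid(log p - log q + T)`, a toy density pair, and hypothetical function names.

```python
import math
import random

def log_ratio(z):
    """Toy log p(x, z) - log q(z | x): unnormalized N(1, 0.5^2) kernel vs. N(0, 1) proposal."""
    log_p = -0.5 * ((z - 1.0) / 0.5) ** 2
    log_q = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    return log_p - log_q

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def mean_acceptance(T, ratios):
    """Monte Carlo estimate of the mean acceptance probability at threshold T."""
    return sum(sigmoid(r + T) for r in ratios) / len(ratios)

def tune_threshold(target, ratios, lo=-50.0, hi=50.0, iters=50):
    """Bisection on T; valid because mean acceptance increases monotonically with T."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_acceptance(mid, ratios) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
ratios = [log_ratio(rng.gauss(0.0, 1.0)) for _ in range(4000)]  # draws from the proposal
T_star = tune_threshold(0.3, ratios)
expected_proposals = 1.0 / mean_acceptance(T_star, ratios)
print(f"T = {T_star:.3f}, expected proposals per accepted sample = {expected_proposals:.2f}")
```

A target acceptance of 0.3 corresponds to roughly 3.3 proposals per accepted sample on average; pushing the target lower buys a tighter approximation at proportionally higher cost.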
Pseudocode Structure for α-DRS
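The two-stage structure can be sketched in plain Python under strong simplifying assumptions. This is not the authors' algorithm as published: Stage 1 uses a coarse grid search over a Gaussian proposal's mean and scale as a stand-in for gradient-based optimization, Stage 2 approximates the rejection constant by the largest density ratio seen in a pilot sample, and the unnormalized target and all function names are hypothetical.

```python
import math
import random

def log_normal_pdf(z, mu, sigma):
    """Log density of N(mu, sigma^2)."""
    return -0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_p_tilde(z):
    """Hypothetical unnormalized target: 5 * [0.7 N(0, 1) + 0.3 N(2, 0.5^2)]."""
    a = math.log(0.7) + log_normal_pdf(z, 0.0, 1.0)
    b = math.log(0.3) + log_normal_pdf(z, 2.0, 0.5)
    m = max(a, b)
    return math.log(5.0) + m + math.log(math.exp(a - m) + math.exp(b - m))

def renyi_objective(mu, sigma, alpha, eps):
    """Monte Carlo estimate of D_alpha(p || q), up to a constant that does not depend on (mu, sigma)."""
    terms = []
    for e in eps:
        z = mu + sigma * e  # reparameterized draw from q
        terms.append(alpha * (log_p_tilde(z) - log_normal_pdf(z, mu, sigma)))
    m = max(terms)
    return (m + math.log(sum(math.exp(t - m) for t in terms) / len(terms))) / (alpha - 1.0)

def stage1_fit(alpha, rng, n=4000):
    """Stage 1 (stand-in): grid search for the proposal minimizing the estimated divergence."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]  # common random numbers across candidates
    grid = [(mu, s) for mu in (0.0, 0.5, 1.0, 1.5) for s in (1.0, 1.25, 1.5, 2.0, 2.5)]
    return min(grid, key=lambda th: renyi_objective(th[0], th[1], alpha, eps))

def stage2_sample(mu, sigma, rng, n_pilot=4000, n_proposals=20000):
    """Stage 2: rejection sampling with a pilot-sample estimate of the RS constant."""
    def log_w(z):
        return log_p_tilde(z) - log_normal_pdf(z, mu, sigma)
    log_m = max(log_w(rng.gauss(mu, sigma)) for _ in range(n_pilot))  # approximate log M
    accepted = []
    for _ in range(n_proposals):
        z = rng.gauss(mu, sigma)
        if rng.random() < math.exp(min(0.0, log_w(z) - log_m)):
            accepted.append(z)
    return accepted, len(accepted) / n_proposals

rng = random.Random(0)
mu, sigma = stage1_fit(alpha=2.0, rng=rng)
samples, acc_rate = stage2_sample(mu, sigma, rng)
print(f"proposal N({mu}, {sigma}^2), acceptance rate {acc_rate:.2f}, "
      f"sample mean {sum(samples) / len(samples):.2f}")
```

Because the pilot maximum can underestimate the true supremum of the density ratio, the accepted samples are only approximately distributed as the target; a quantile-based surrogate trades a little more of that bias for a higher acceptance rate.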
6. Empirical Results and Applications
Empirical evidence demonstrates significant improvements using VRS-based methods:
- Grover et al. report that on sigmoid belief networks trained on MNIST, VRS yields average improvements of 3.71 nats (single-sample) and 0.21 nats (multi-sample) in marginal log-likelihood over state-of-the-art baselines (Grover et al., 2018).
- α-DRS provides substantial reductions in the divergence from the target, with marked fidelity improvements (e.g., posterior mode recovery in mixture models, improvement in Bayesian neural network regression) (Sharma et al., 2019).
- Jankowiak & Phan observe that RVRS achieves lower gradient variance and superior or competitive posterior fidelity compared to normalizing flows, importance weighting, and hybrid MCMC/variational schemes. For instance, RVRS with a simple Gaussian proposal outpaces IWAE and normalizing flow VI on several inference tasks, with empirical speedups and improved negative ELBOs (Jankowiak et al., 2023).
7. Practical Considerations, Limitations, and Extensions
Principal algorithmic considerations include the tuning of acceptance thresholds (either through quantile-based rules or theoretical approximations), selection of the divergence order $\alpha$ (α-DRS), and proposal family expressivity. Overly aggressive rejection increases variance and computational cost, while soft acceptance thresholds facilitate efficient trade-offs.
Limitations include:
- Increased complexity from rejection loops, leading to variable per-sample compute.
- Added hyperparameter tuning (thresholds, quantiles, or $\alpha$).
- For non-reparameterizable proposals, standard VRS relies on higher-variance gradient estimators.
VRS variants extend to richer variational families (normalizing flows, hierarchical structures), per-layer factorized resampling, and potentially reinforcement learning/policy search (Grover et al., 2018, Jankowiak et al., 2023).
VRS and its descendants stand as a flexible, model-agnostic enhancement to variational inference pipelines, explicitly utilizing model densities to refine approximate inference and offering principled mechanisms to balance fidelity and cost in probabilistic modeling (Grover et al., 2018, Sharma et al., 2019, Jankowiak et al., 2023).