
Single-Representation Rejection Sampling

Updated 7 December 2025
  • The paper introduces single-representation rejection sampling, unifying variational inference with rejection sampling through a shared proposal density and envelope constant.
  • It refines posterior approximations by leveraging divergence minimization, balancing computational efficiency with statistical accuracy.
  • Empirical results show improved performance in Bayesian neural networks and sigmoid belief networks, reducing divergence metrics and enhancing predictive accuracy.

Single-representation rejection sampling is a principled methodology that synthesizes variational inference and classical rejection sampling via a unified proposal distribution and envelope constant. This approach achieves more accurate posterior approximations in probabilistic modeling, particularly in settings where conventional variational approximations exhibit significant bias. Central to the method is the use of a single proposal family and a single scalar envelope, streamlining both optimization and sampling without iterative re-estimation or region-specific envelopes. Key instantiations include the frameworks presented in "Variational Rejection Sampling" (Grover et al., 2018) and "Refined α-Divergence Variational Inference via Rejection Sampling" (Sharma et al., 2019).

1. Formal Definition and Foundational Concepts

A "single representation" is defined by a proposal density $q_\theta(x)$ and an envelope constant $M(\theta)$ such that $M q_\theta(x) \geq \tilde{p}(x)$ for all $x$, where $\tilde{p}(x)$ is the unnormalized target (e.g., posterior) density. The same proposal $q_\theta$ and envelope $M(\theta)$ are used both within variational inference and the subsequent rejection sampling stage. This contrasts with approaches that adapt proposal or envelope parameters during or after sampling phases.
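The envelope condition can be checked numerically. The sketch below uses an assumed toy setup (a two-component Gaussian-mixture target and a single Gaussian proposal, neither taken from the cited papers) and verifies $M q_\theta(x) \geq \tilde{p}(x)$ on a dense grid:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1D setup: unnormalized target p~ (a two-component Gaussian
# mixture) and a single Gaussian proposal q_theta.
def p_tilde(x):
    return 0.6 * norm.pdf(x, -2.0, 1.0) + 0.4 * norm.pdf(x, 2.0, 1.0)

def q_theta(x):
    return norm.pdf(x, 0.0, 3.0)

# A single scalar envelope M(theta): the tightest constant on a dense grid,
# a numerical stand-in for the "for all x" requirement.
grid = np.linspace(-10.0, 10.0, 10001)
M = np.max(p_tilde(grid) / q_theta(grid))
assert np.all(M * q_theta(grid) >= p_tilde(grid) - 1e-12)
print(f"envelope constant M(theta) = {M:.3f}")
```

Because the proposal's tails (std 3) dominate both mixture components (std 1), the ratio $\tilde{p}/q_\theta$ peaks at a finite point and a single scalar envelope suffices.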

A central theoretical insight is the equivalence between the optimal RS envelope and the Rényi divergence at $\alpha = \infty$:

$$D_\infty(p \parallel q_\theta) = \log \sup_x \frac{p(x)}{q_\theta(x)} = \log M(\theta)$$

This establishes a direct link between the tightness of the rejection sampling envelope and the worst-case proposal-target mismatch (Sharma et al., 2019).
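This identity is easy to verify for a pair where the supremum has a closed form. The sketch below assumes normalized Gaussians $p = N(0, 1)$ and $q = N(0, 2)$ (a toy pair, not from the papers), for which the ratio $p(x)/q(x)$ peaks at $x = 0$ with value $\sigma_q/\sigma_p$:

```python
import numpy as np
from scipy.stats import norm

# Toy check of D_inf(p || q) = log sup_x p(x)/q(x) = log M
# for p = N(0, sigma_p) and q = N(0, sigma_q) with sigma_q > sigma_p.
sigma_p, sigma_q = 1.0, 2.0
grid = np.linspace(-20.0, 20.0, 200001)
ratio = norm.pdf(grid, 0.0, sigma_p) / norm.pdf(grid, 0.0, sigma_q)

log_M_numeric = np.log(ratio.max())        # log sup over the grid
log_M_closed = np.log(sigma_q / sigma_p)   # closed form: ratio peaks at x = 0
print(log_M_numeric, log_M_closed)
```

The tightest usable envelope is therefore $M = \exp(D_\infty(p \parallel q)) = 2$ for this pair.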

2. Two-Stage Algorithm: Refined α-Divergence and Rejection Sampling

The canonical algorithm proceeds in two stages:

Stage 1: Refined α-Divergence Variational Inference (RDVI)

  • Minimize the Rényi α-divergence $D_\alpha(p \parallel q_\theta)$ for $\alpha > 0$.
  • Objective:

$$J(\theta) = \mathbb{E}_{q_\theta}\left[\left(\frac{\tilde{p}(x)}{q_\theta(x)}\right)^\alpha\right]$$

  • Gradient estimation leverages the score function:

$$\nabla_\theta J(\theta) = -\alpha\,\mathbb{E}_{q_\theta}\left[w(x)\,\nabla_\theta\log q_\theta(x)\right],\quad w(x) = \left(\frac{\tilde{p}(x)}{q_\theta(x)}\right)^\alpha$$

  • Update $\theta$ using SGD-type rules until convergence, yielding an optimized $q_{\hat\theta}$.
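The effect of Stage 1 can be illustrated without the score-function machinery. The sketch below assumes a 1D toy problem and fits a Gaussian proposal to a Gaussian target by directly minimizing the Rényi α-divergence over a parameter grid, rather than running the SGD updates described above:

```python
import numpy as np
from scipy.stats import norm

# Minimal numerical sketch of Stage 1, assuming a toy 1D problem:
# fit q_theta = N(mu, sigma) to the target p = N(1, 0.7) by grid search
# over theta = (mu, sigma), minimizing the Renyi alpha-divergence directly
# instead of via the score-function SGD described above.
alpha = 2.0
x = np.linspace(-8.0, 10.0, 4001)
dx = x[1] - x[0]
log_p = norm.logpdf(x, 1.0, 0.7)

def renyi_div(mu, sigma):
    # D_alpha(p || q) = 1/(alpha-1) * log( integral p^alpha q^(1-alpha) dx )
    log_integrand = alpha * log_p + (1.0 - alpha) * norm.logpdf(x, mu, sigma)
    return np.log(np.sum(np.exp(log_integrand)) * dx) / (alpha - 1.0)

mus = np.linspace(-1.0, 3.0, 81)      # grid over proposal means
sigmas = np.linspace(0.3, 2.0, 69)    # grid over proposal std devs
best = min((renyi_div(m, s), m, s) for m in mus for s in sigmas)
print(f"best (mu, sigma) = ({best[1]:.2f}, {best[2]:.2f}), D_alpha = {best[0]:.5f}")
```

Since the target lies inside the proposal family here, the divergence-minimizing $q_{\hat\theta}$ recovers the target's own parameters with $D_\alpha \approx 0$; in the realistic case the optimum is a biased $q_{\hat\theta}$ that Stage 2 then refines.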

Stage 2: Single-Representation Rejection Sampling

  • Determine $\hat{M} = \exp(D_\infty(p \parallel q_{\hat\theta})) \approx \exp(D_\alpha(p \parallel q_{\hat\theta}))$ or an equivalent model-specific estimator.
  • For each draw $x \sim q_{\hat\theta}$, accept with probability:

$$a_{\hat\theta}(x \mid \hat{T}) = \min\left\{1,\ e^{\hat{T}}\,\frac{\tilde{p}(x)}{q_{\hat\theta}(x)}\right\}$$

where $\hat{T} = -\log \hat{M}$.

  • Accepted samples form the refined distribution:

$$r(x) = \frac{q_{\hat\theta}(x)\,a_{\hat\theta}(x \mid \hat{T})}{Z_R(\hat\theta, \hat{T})}, \quad Z_R = \int q_{\hat\theta}(x)\,a_{\hat\theta}(x \mid \hat{T})\,dx$$

This algorithmic structure is explicit in both (Sharma et al., 2019) and (Grover et al., 2018).
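Stage 2 can be sketched end to end on an assumed toy pair (a normalized target $\tilde{p} = N(1, 0.7)$ and fitted proposal $q_{\hat\theta} = N(0.8, 1.0)$, neither from the papers), with $\hat{M}$ estimated on a grid:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed toy pair: target p~ = N(1, 0.7) (already normalized here for
# simplicity) and a Stage-1 fitted proposal q = N(0.8, 1.0).
p_tilde = lambda x: norm.pdf(x, 1.0, 0.7)
q_pdf = lambda x: norm.pdf(x, 0.8, 1.0)

# Envelope from a grid estimate of sup p~/q; then T_hat = -log M_hat.
grid = np.linspace(-10.0, 12.0, 100001)
M_hat = np.max(p_tilde(grid) / q_pdf(grid))
T_hat = -np.log(M_hat)

# Accept each draw x ~ q with probability min{1, e^{T_hat} p~(x)/q(x)}.
xs = rng.normal(0.8, 1.0, size=200_000)
accept_prob = np.minimum(1.0, np.exp(T_hat) * p_tilde(xs) / q_pdf(xs))
accepted = xs[rng.random(xs.size) < accept_prob]

print(f"acceptance rate = {accepted.size / xs.size:.3f}")  # ~ 1 / M_hat
print(f"refined mean = {accepted.mean():.3f}, std = {accepted.std():.3f}")
```

With $\hat{T} = -\log\hat{M}$ the acceptance test reduces to exact rejection sampling, so the accepted draws follow the target (mean 1.0, std 0.7) and the acceptance rate approaches $1/\hat{M}$.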

3. Acceptance Probability and Trade-Offs

The acceptance function $a_{\theta,\phi}(x \mid T)$ mediates the trade-off between computational cost and approximation accuracy. As $T \rightarrow +\infty$, all proposals are accepted and the method reduces to variational inference with $q_\theta$; as $T \rightarrow -\infty$, only proposals aligning closely with the target density are accepted, converging to exact posterior samples at the expense of more rejections. The threshold $T$, or its quantile-based stabilization, controls this trade-off, yielding a spectrum between computational efficiency and statistical fidelity (Grover et al., 2018; Sharma et al., 2019).
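The trade-off is visible in a simple sweep over $T$. The sketch below assumes a toy proposal $q = N(0, 2)$ and target $\tilde{p} = N(1, 1)$ (illustrative choices, not from the papers) and reports the expected acceptance rate at several thresholds:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Assumed toy pair: large T accepts nearly everything (plain VI behaviour),
# small T accepts only where p~/q is large (closer to exact sampling,
# at the price of many rejections).
p_tilde = lambda x: norm.pdf(x, 1.0, 1.0)
q_pdf = lambda x: norm.pdf(x, 0.0, 2.0)
xs = rng.normal(0.0, 2.0, size=100_000)

rates = []
for T in (2.0, 0.0, -2.0, -4.0):
    a = np.minimum(1.0, np.exp(T) * p_tilde(xs) / q_pdf(xs))
    rates.append(a.mean())
    print(f"T = {T:+.0f}: expected acceptance rate = {rates[-1]:.3f}")
```

Once $T$ is small enough that $e^T \tilde{p}/q$ never reaches 1, the acceptance rate decays as $e^T$, which is the computational price of pushing the refined distribution toward the exact posterior.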

4. Gradient Estimation and Theoretical Guarantees

In both the unnormalized proposal context of (Grover et al., 2018) and the acceptance-weighted refinement of (Sharma et al., 2019), efficient covariance-style gradient estimators are derived. For instance, the R-ELBO gradient with respect to variational parameters $\phi$ under the resampled proposal is:

$$\nabla_\phi \text{R-ELBO} = \mathrm{Cov}_{z\sim R}\left[A(z),\ \left(1 - \sigma(l_{\theta,\phi}(z \mid x, T))\right)\nabla_\phi \log q_\phi(z \mid x)\right]$$

where $A(z)$ is an explicit learning signal and $\sigma(\cdot)$ denotes the logistic function (Grover et al., 2018).

A key theoretical guarantee is that the rejection sampling refinement does not increase the α-divergence:

$$D_\alpha(p \parallel r) \leq D_\alpha(p \parallel q_\theta) \quad \forall\,\alpha > 0,$$

and the divergence decreases strictly as the envelope tightens ($T$ decreases), except in the degenerate case where $q_\theta$ already matches $p$ (Sharma et al., 2019).
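The guarantee can be checked numerically on a grid. The sketch below assumes a toy pair $p = N(1, 0.7)$, $q_\theta = N(0, 1.5)$ with $\alpha = 2$ and a moderately tight threshold $T$ (all illustrative choices), builds the refined density $r \propto q_\theta\, a(\cdot \mid T)$, and compares divergences by numerical integration:

```python
import numpy as np
from scipy.stats import norm

# Grid-based check that refinement does not increase the alpha-divergence,
# for an assumed toy pair p = N(1, 0.7), q = N(0, 1.5) and alpha = 2.
x = np.linspace(-10.0, 12.0, 200001)
dx = x[1] - x[0]
p = norm.pdf(x, 1.0, 0.7)
q = norm.pdf(x, 0.0, 1.5)

def d_alpha(f, alpha=2.0):
    # Renyi alpha-divergence D_alpha(p || f) via numerical integration.
    return np.log(np.sum(p**alpha * f**(1.0 - alpha)) * dx) / (alpha - 1.0)

T = -1.0
a = np.minimum(1.0, np.exp(T) * p / q)  # acceptance function a(x|T)
r = q * a / np.sum(q * a * dx)          # refined density, renormalized

print(f"D_2(p||q) = {d_alpha(q):.4f}, D_2(p||r) = {d_alpha(r):.4f}")
```

The refined $r$ sits between $q_\theta$ (as $T \to +\infty$) and $p$ (as $T \to -\infty$), so its divergence to $p$ is bounded above by that of the unrefined proposal.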

5. Empirical Evaluations

Empirical results demonstrate the efficacy of single-representation rejection sampling across several domains:

  • Synthetic Gaussian Mixture: When the initial variational proposal $q_\theta$ misses multiple modes, the RS refinement enables recovery of a multimodal sample-based approximation aligning with the true mixture, as evidenced by a significant decrease in divergence metrics (Sharma et al., 2019).
  • Bayesian Neural Network Regression: In fully-factorized Gaussian variational families for regression with one hidden layer (50 ReLU units), applying single-representation RS consistently reduces root-mean-squared error (e.g., from 0.70 for RDVI to 0.66 for α-DRS) and improves held-out log-likelihood (from −1.25 to −1.10) compared to state-of-the-art baselines (Sharma et al., 2019).
  • Sigmoid Belief Networks on MNIST: Variational rejection sampling outperforms both single-sample and multi-sample baselines in marginal log-likelihood evaluation, e.g., achieving average test negative log-likelihoods of ≈91.7 nats for a 3×200 network compared to ≈91.9 nats for VIMCO with 50 samples (Grover et al., 2018).

The table summarizes selected results:

| Task | Method | RMSE / NLL (test) | Divergence Drop |
|---|---|---|---|
| Bayesian NN Reg. | RDVI | 0.70 / −1.25 | Baseline |
| Bayesian NN Reg. | α-DRS (SRRS) | 0.66 / −1.10 | Substantial |
| MNIST SBN | VIMCO (k=50) | – / ≈91.9 | Baseline |
| MNIST SBN | VRS (SRRS) | – / ≈91.7 | Moderate |

6. Extensions, Limitations, and Comparisons

Single-representation rejection sampling provides a generic recipe for augmenting any parametric variational posterior with a tractable, adjustable accept-reject mechanism. The approach is compatible with quantile-thresholding adaptations to handle high-dimensional sampling issues (Sharma et al., 2019), and connects to the broader literature on divergence correction and proposal refinement in inference.

A limitation is the potential computational overhead for tight envelopes (small $T$), as the average number of proposals before acceptance can grow rapidly. Empirical practice requires careful tuning of threshold or quantile hyperparameters to balance accuracy and efficiency.

Notably, compared to variational rejection sampling methods employing differentiable ("soft") acceptance functions (Grover et al., 2018), the α-DRS procedure of (Sharma et al., 2019) employs a hard minimum and is explicitly tied to the optimal Rényi divergence constant.

7. Conclusions

Single-representation rejection sampling constitutes a robust strategy for variational inference refinement, characterized by its use of a unified proposal and envelope for both optimization and sampling. The method leverages a theoretical synergy between rejection sampling constants and worst-case divergence, enabling provable and empirical improvements in posterior approximation. It represents a key development in the intersection of divergence-minimizing inference and classical simulation algorithms, supported by reproducible experimental benefits and strong theoretical guarantees (Grover et al., 2018; Sharma et al., 2019).

