REINFORCE Leave-One-Out Gradients

Updated 13 June 2026

The paper introduces REINFORCE leave-one-out gradients as a variance reduction technique that integrates leave-one-out control variates with score-function estimators to yield unbiased gradients.
It reframes ELBO gradient estimation using a log-variance loss to derive the VarGrad estimator, which uses near-optimal baseline coefficients for enhanced performance.
Extensions with Stein-based and DoubleCV methods further lower gradient variance, improving convergence in discrete latent variable models and variational autoencoders.

REINFORCE leave-one-out (RLOO) gradients refer to a class of variance-reduced gradient estimators for Monte Carlo integration of expectations with respect to parameterized probability distributions, particularly within variational inference (VI) using the score function (REINFORCE) method. These estimators combine the score-function approach with leave-one-out control variates, yielding unbiased but lower-variance gradient estimates, especially suited for discrete latent variable models and variational autoencoders.

1. Score-Function (REINFORCE) Estimators and Their Variants

The standard REINFORCE estimator targets gradients of the evidence lower bound (ELBO)

$\mathrm{ELBO}(\theta) = \mathbb{E}_{q_\theta(z)} \left[ \log p(x,z) - \log q_\theta(z) \right].$

By the score-function (SF) identity,

$\nabla_\theta \, \mathrm{ELBO}(\theta) = \mathbb{E}_{q_\theta(z)}\left[ \left(\log p(x,z) - \log q_\theta(z)\right) \nabla_\theta \log q_\theta(z) \right].$

Naive Monte Carlo score-function gradients typically exhibit high variance, especially for discrete variables, because each sample's score is weighted by the raw, uncentered value of the integrand. To mitigate this, baseline control variates are often subtracted. The leave-one-out (LOO) REINFORCE variant, also known as RLOO, uses a sample-specific baseline:

$\widehat{G}_{\mathrm{RLOO}} = \frac1K \sum_{k=1}^K \left( f(x_k) - \bar{f}_{-k} \right) \nabla_\eta \log q_\eta(x_k),$

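where $x_1, \dots, x_K$ are i.i.d. draws from $q_\eta$, $f$ is the integrand (for the ELBO, $f(z) = \log p(x,z) - \log q_\theta(z)$, with $z$ and $\theta$ in place of $x_k$ and $\eta$), and $\bar{f}_{-k} = \frac{1}{K-1} \sum_{j \neq k} f(x_j)$ is the average of $f(x_j)$ over all $j \neq k$ (Titsias et al., 2021).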
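The leave-one-out baseline can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the cited papers: it assumes a factorized Bernoulli distribution parameterized by logits, and the helper names (`sigmoid`, `score`, `rloo_gradient`, `rloo_estimate`) are hypothetical.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def score(x, logits):
    # Gradient of log Bernoulli(x_i; sigmoid(logit_i)) w.r.t. logit_i is x_i - sigmoid(logit_i).
    return [xi - sigmoid(a) for xi, a in zip(x, logits)]


def rloo_gradient(f_vals, scores):
    """(1/K) * sum_k (f_k - mean of the other K-1 values) * score_k."""
    K = len(f_vals)
    total = sum(f_vals)
    grad = [0.0] * len(scores[0])
    for f_k, s_k in zip(f_vals, scores):
        baseline = (total - f_k) / (K - 1)  # leave-one-out mean, excludes sample k
        for i, s in enumerate(s_k):
            grad[i] += (f_k - baseline) * s / K
    return grad


def rloo_estimate(logits, f, K, rng):
    """Draw K samples (assumed setup: independent Bernoulli logits) and form the RLOO estimate."""
    xs = [[1.0 if rng.random() < sigmoid(a) else 0.0 for a in logits] for _ in range(K)]
    return rloo_gradient([f(x) for x in xs], [score(x, logits) for x in xs])
```

Because the baseline for sample $k$ is built only from the other samples, it is independent of $x_k$, which is what keeps the estimate unbiased. Note that $K \geq 2$ is required.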

2. Log-Variance Loss and the VarGrad Estimator

VarGrad, introduced by Richter et al. (2020), generalizes RLOO by reframing ELBO gradient estimation as the gradient of a "log-variance loss." For a reference density $r(z)$ ,

$L_r(\theta) = \frac{1}{2} \operatorname{Var}_r \left[ \log \frac{q_\theta(z)}{p(z|x)} \right ].$

When $r = q_\theta$ , this loss is a divergence that vanishes if the approximate and true posteriors coincide. Differentiating $L_r$ with the reference $r$ held fixed yields a score-function term plus a mean correction:

$\nabla_\theta L_r(\theta) = \mathbb{E}_{r}\left[ \log \frac{q_\theta(z)}{p(z|x)} \, \nabla_\theta \log q_\theta(z) \right] - \mathbb{E}_{r}\left[ \log \frac{q_\theta(z)}{p(z|x)} \right] \mathbb{E}_{r}\left[ \nabla_\theta \log q_\theta(z) \right].$

At $r = q_\theta$, the mean term cancels, recovering $\nabla_\theta \, \mathrm{KL}\left( q_\theta(z) \,\|\, p(z|x) \right) = -\nabla_\theta \, \mathrm{ELBO}(\theta)$ (Richter et al., 2020).
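The cancellation rests on the score having zero mean under $q_\theta$. A short derivation, assuming that differentiation and integration over $z$ can be interchanged (with the integral read as a sum for discrete $z$):

$\mathbb{E}_{q_\theta(z)}\left[ \nabla_\theta \log q_\theta(z) \right] = \int q_\theta(z) \, \frac{\nabla_\theta q_\theta(z)}{q_\theta(z)} \, dz = \nabla_\theta \int q_\theta(z) \, dz = \nabla_\theta 1 = 0.$

The same identity is what allows any baseline that does not depend on the sample to be subtracted from a score-function estimator without introducing bias.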

The empirical approximation, built from $K$ i.i.d. samples $z_1, \dots, z_K \sim q_\theta$ that are held fixed under differentiation,

$\widehat{L}(\theta) = \frac{1}{2(K-1)} \sum_{k=1}^K \left( f_\theta(z_k) - \bar{f}_\theta \right)^2,$

yields, after differentiation, the VarGrad estimator:

$\widehat{G}_{\mathrm{VarGrad}} = \frac{1}{K-1} \sum_{k=1}^K \left( f_\theta(z_k) - \bar{f}_\theta \right) \nabla_\theta \log q_\theta(z_k),$

where $f_\theta(z) = \log q_\theta(z) - \log p(x,z)$ and $\bar{f}_\theta = \frac1K \sum_{k=1}^K f_\theta(z_k)$ (Richter et al., 2020). The joint $p(x,z)$ can replace the posterior $p(z|x)$ because the two differ by the constant $\log p(x)$, which does not affect the variance. This form precisely recovers a leave-one-out baseline with near-optimal coefficient and rescaling, and it estimates $-\nabla_\theta \, \mathrm{ELBO}(\theta)$.
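The link to the leave-one-out baseline can be checked in two lines. The sketch assumes the VarGrad estimate has the form $\frac{1}{K-1} \sum_k (f_k - \bar{f}) s_k$, where $f_k$ is the integrand at sample $k$, $s_k$ its score, and $\bar{f}$ the sample mean (shorthand introduced here for brevity, not taken from the source). Since $\bar{f}_{-k} = \frac{K \bar{f} - f_k}{K-1}$,

$f_k - \bar{f}_{-k} = \frac{K}{K-1} \left( f_k - \bar{f} \right),$

and therefore

$\frac1K \sum_{k=1}^K \left( f_k - \bar{f}_{-k} \right) s_k = \frac{1}{K-1} \sum_{k=1}^K \left( f_k - \bar{f} \right) s_k.$

The left side is the RLOO form and the right side the VarGrad form, so the two agree sample by sample once the same integrand is used. With $f = \log q_\theta - \log p$ the common value estimates the negative ELBO gradient.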

3. Stein-Based and Double Control Variate Extensions

Further variance reduction is possible by introducing additional control variates. Stein operators, as developed in the "RODEO" framework, yield flexible zero-mean corrections in discrete spaces. For a discrete distribution $q$ and Markov kernel $P$ that leaves $q$ invariant, the Stein operator $A = P - I$, acting as $(Ah)(x) = \sum_y P(x,y) h(y) - h(x)$, satisfies $\mathbb{E}_{q}\left[ (Ah)(x) \right] = 0$ for any test function $h$. The RLOO estimator can thus be augmented with local and global Stein control variates without bias:

$\widehat{G}_{\mathrm{RODEO}} = \frac1K \sum_{k=1}^K \left[ \left( f(x_k) - \frac{1}{K-1} \sum_{j \neq k} \left( f(x_j) + (A \tilde{H}_k)(x_j) \right) \right) \nabla_\eta \log q_\eta(x_k) + (A \tilde{h}_k)(x_k) \right],$

with $\tilde{H}_k$ a scalar-valued and $\tilde{h}_k$ a vector-valued auxiliary function, each constructed from the samples $x_j$ with $j \neq k$. The baseline then depends only on samples other than $x_k$, and the added term $(A \tilde{h}_k)(x_k)$ has zero mean under $q_\eta$, so unbiasedness is preserved.
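The zero-mean property of a Stein operator of the form $A = P - I$ can be checked numerically on a small state space. The sketch below is an illustration under stated assumptions: the Metropolis kernel with uniform proposals is chosen here only because it is easy to verify that it leaves $q$ invariant, and it is not claimed to be the kernel used in RODEO; all function names are hypothetical.

```python
def metropolis_kernel(q):
    """Transition matrix with uniform proposals and Metropolis acceptance; leaves q invariant."""
    n = len(q)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                P[x][y] = min(1.0, q[y] / q[x]) / n
        P[x][x] = 1.0 - sum(P[x])  # remaining mass stays at x
    return P


def stein_apply(P, h):
    """(A h)(x) = sum_y P[x][y] * h[y] - h[x], i.e. A = P - I."""
    n = len(h)
    return [sum(P[x][y] * h[y] for y in range(n)) - h[x] for x in range(n)]


def expectation(q, g):
    return sum(qx * gx for qx, gx in zip(q, g))


q = [0.5, 0.3, 0.2]        # toy distribution on three states
h = [1.0, -2.0, 4.0]       # arbitrary test function
P = metropolis_kernel(q)
cv = stein_apply(P, h)     # candidate control variate
stein_mean = expectation(q, cv)  # zero up to floating-point error
```

Any choice of `h` gives a zero-mean term, which is why the test function can be tuned freely to reduce variance without affecting the expectation of the estimator.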

Empirically, RODEO significantly lowers gradient variance in generative modeling, achieving state-of-the-art ELBOs and faster convergence with the same computational budget as standard RLOO (Shi et al., 2022).

Double control variate (DoubleCV) methods further exploit auxiliary functions—typically constructed from first-order Taylor expansions—to form additional, sample-specific corrections atop the leave-one-out baseline. When optimally combined, these yield strictly lower variance than RLOO, sometimes even surpassing estimators using the unattainable "true mean" baseline (Titsias et al., 2021).

4. Theoretical Properties and Variance Analysis

The variance improvement gained by RLOO and its extensions is rooted in the optimal baseline selection problem. Writing $f_\theta(z) = \log q_\theta(z) - \log p(x,z)$, the optimal coefficient for a control variate $a_i \, \partial_{\theta_i} \log q_\theta(z)$ in dimension $i$ is

$a_i^* = \frac{\operatorname{Cov}_{q_\theta}\left( f_\theta \, \partial_{\theta_i} \log q_\theta, \; \partial_{\theta_i} \log q_\theta \right)}{\operatorname{Var}_{q_\theta}\left( \partial_{\theta_i} \log q_\theta \right)}.$

VarGrad uses the sample mean $\bar{f}_\theta$ as its baseline coefficient, which is close to optimal under broad conditions: specifically, when $\mathrm{KL}\left( q_\theta(z) \,\|\, p(z|x) \right)$ is either very large (early in training) or very small (late in training), and moments of the score are bounded (Richter et al., 2020).
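For intuition, the optimal coefficient $\operatorname{Cov}(f s, s) / \operatorname{Var}(s)$, with $s$ the score, can be computed exactly for a single Bernoulli variable by enumerating its two states and then compared with the mean baseline $\mathbb{E}[f]$. This is a toy sketch: the integrand `toy_f` and all helper names are hypothetical, and the score is taken with respect to the logit.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def expect(p, g):
    """Exact expectation of g(x) for x ~ Bernoulli(p), by enumerating both states."""
    return p * g(1.0) + (1.0 - p) * g(0.0)


def optimal_coefficient(logit, f):
    """a* = Cov(f * s, s) / Var(s), with s the score of log q with respect to the logit."""
    p = sigmoid(logit)

    def s(x):
        return x - p

    cov = expect(p, lambda x: f(x) * s(x) ** 2) - expect(p, lambda x: f(x) * s(x)) * expect(p, s)
    var = expect(p, lambda x: s(x) ** 2) - expect(p, s) ** 2
    return cov / var


def toy_f(x):
    # Hypothetical integrand, chosen only for illustration.
    return 3.0 + 2.0 * x


a_star_balanced = optimal_coefficient(0.0, toy_f)   # p = 0.5
mean_balanced = expect(0.5, toy_f)                  # E[f] at p = 0.5
a_star_skewed = optimal_coefficient(2.0, toy_f)     # p is roughly 0.88
mean_skewed = expect(sigmoid(2.0), toy_f)
```

In the balanced case the mean baseline coincides with the optimal coefficient, while in the skewed case the two differ. That gap is the correction term whose relative size the near-optimality conditions are meant to control.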

In high-dimensional regimes (dimension $D$ large) and with sufficiently many samples $K$, the per-coordinate variance bound

$\operatorname{Var}\left( \widehat{G}_{\mathrm{VarGrad},i} \right) \leq \operatorname{Var}\left( \widehat{G}_{\mathrm{REINFORCE},i} \right)$

provably holds. Stein-augmented estimators (RODEO) and DoubleCV, when optimally tuned, analogously achieve variance lower than RLOO (Shi et al., 2022; Titsias et al., 2021).
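A small Monte Carlo experiment makes the variance gap concrete. The sketch below compares the plain score-function estimator with the leave-one-out version on a toy problem chosen for illustration: a single Bernoulli(0.5) variable and the integrand `offset + x`, where the large constant offset inflates the variance of the unbaselined estimator. The setup and names are assumptions, not an experiment from the cited papers.

```python
import random


def sample_estimates(K, n_rep, offset, seed=0):
    """Draw n_rep REINFORCE and RLOO estimates for x ~ Bernoulli(0.5), f(x) = offset + x."""
    rng = random.Random(seed)
    plain, rloo = [], []
    for _ in range(n_rep):
        xs = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(K)]
        fs = [offset + x for x in xs]
        ss = [x - 0.5 for x in xs]  # score with respect to the logit at p = 0.5
        total = sum(fs)
        plain.append(sum(f * s for f, s in zip(fs, ss)) / K)
        rloo.append(sum((f - (total - f) / (K - 1)) * s for f, s in zip(fs, ss)) / K)
    return plain, rloo


def variance(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)


plain, rloo = sample_estimates(K=4, n_rep=5000, offset=10.0)
var_plain, var_rloo = variance(plain), variance(rloo)
mean_rloo = sum(rloo) / len(rloo)  # exact gradient for this toy problem is p * (1 - p) = 0.25
```

Both estimators target the same gradient, but the leave-one-out baseline removes the constant offset from every term, so its empirical variance is far smaller on this example.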

5. Practical Implementation and Algorithmic Details

VarGrad and its extensions operate as black-box, parameter-free gradient estimators requiring only standard score-function gradient machinery. The implementation follows these steps, exemplified for VarGrad (Richter et al., 2020):

Sample $K$ i.i.d. draws $z_1, \dots, z_K \sim q_\theta$ (samples detached for “stop-gradient” semantics).
Compute $f_k = \log q_\theta(z_k) - \log p(x, z_k)$ for $k = 1, \dots, K$.
Assemble sample mean $\bar{f} = \frac1K \sum_{k=1}^K f_k$ and gradients $g_k = \nabla_\theta \log q_\theta(z_k)$.
Return

$\widehat{G}_{\mathrm{VarGrad}} = \frac{1}{K-1} \sum_{k=1}^K \left( f_k - \bar{f} \right) g_k.$
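The four steps can be sketched without an autodiff library when the score is available in closed form. The example below assumes a factorized Bernoulli $q_\theta$ parameterized by logits and a hypothetical unnormalized log joint `toy_log_joint`; it is an illustration of the recipe, not the authors' implementation. With $f = \log q_\theta - \log p$, the returned vector estimates the gradient of the KL divergence, that is, the negative ELBO gradient.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def log_q(z, logits):
    """Log-probability of a binary vector under independent Bernoulli(sigmoid(logit_i))."""
    return sum(math.log(sigmoid(a)) if zi == 1.0 else math.log(1.0 - sigmoid(a))
               for zi, a in zip(z, logits))


def vargrad(logits, log_joint, K, rng):
    """VarGrad-style estimate of the gradient of KL(q || p) with respect to the logits."""
    # Step 1: draw K samples; they are plain floats, so nothing is differentiated through them.
    zs = [[1.0 if rng.random() < sigmoid(a) else 0.0 for a in logits] for _ in range(K)]
    # Step 2: f_k = log q(z_k) - log p(x, z_k).
    fs = [log_q(z, logits) - log_joint(z) for z in zs]
    # Step 3: sample mean and per-sample score vectors (closed form for Bernoulli logits).
    f_bar = sum(fs) / K
    gs = [[zi - sigmoid(a) for zi, a in zip(z, logits)] for z in zs]
    # Step 4: centered combination with the 1 / (K - 1) rescaling.
    return [sum((f - f_bar) * g[i] for f, g in zip(fs, gs)) / (K - 1)
            for i in range(len(logits))]


def toy_log_joint(z):
    # Hypothetical unnormalized target that favors z = (1, 0).
    return 2.0 * z[0] - 1.0 * z[1]
```

A gradient-descent step on the logits with this estimate moves $q_\theta$ toward the target. In an autodiff framework the closed-form scores in step 3 would typically be replaced by differentiating a surrogate loss with the samples and $f_k - \bar{f}$ treated as constants.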

Algorithmic enhancements for VAEs and other models leverage autodiff to obtain required derivatives, incur negligible additional computational cost, and introduce no extra passes through the decoder or sampling distribution (Titsias et al., 2021).

6. Empirical Evaluation and Comparison

Empirical results demonstrate that RLOO, VarGrad, RODEO, and DoubleCV provide favorable variance-computation trade-offs in both synthetic and real-world settings:

In discrete VAEs (Bernoulli latent models, e.g., Omniglot), VarGrad (with 4 samples) achieves learning curves nearly matching those of REBAR, RELAX, and ARM, but at reduced per-step computational cost.
RODEO achieves up to an order-of-magnitude variance reduction relative to RLOO and DoubleCV, matching or exceeding ELBOs of competing estimators at fixed budget (Shi et al., 2022).
DoubleCV converges in fewer steps and achieves uniformly lower variance and higher ELBOs than RLOO, as illustrated in benchmark studies on MNIST, Fashion-MNIST, and Omniglot (Titsias et al., 2021).

A summary table of ELBO comparisons on Bernoulli-likelihood VAEs (the sample-count setting and the numeric entries are missing from this copy):

Estimator	MNIST	Fashion-MNIST	Omniglot
RLOO	[missing]	[missing]	[missing]
DoubleCV	[missing]	[missing]	[missing]
DisARM	[missing]	[missing]	[missing]

7. Assumptions, Limitations, and Regimes of Use

These methods assume that the support of the reference density $r$ contains the support of $q_\theta$ to ensure well-definedness of the control variate and gradient terms. All estimators require that sampling and baseline construction do not differentiate through the samples (“stop-gradient” on draws). Tight variance guarantees hold under additional tail-regularity and bounded kurtosis conditions on the score, with the most significant improvements observed in large-sample (large $K$) and high-dimensional (large $D$) settings (Richter et al., 2020).

RODEO’s efficiency may be limited in moderate-to-high dimensional discrete spaces when full evaluation of neighbor states for the Stein operators is infeasible, though surrogate strategies mitigate this. DoubleCV, when adapted as recommended, incurs no additional autodiff or sampling cost in standard VAE implementations.

The development and refinement of REINFORCE leave-one-out gradient estimators—including VarGrad, Stein-augmented, and double control variate extensions—represent key advances in low-variance, unbiased gradient estimation for variational inference in discrete latent variable frameworks (Richter et al., 2020, Shi et al., 2022, Titsias et al., 2021).