REINFORCE Leave-One-Out Gradients
- The paper introduces REINFORCE leave-one-out gradients as a variance reduction technique that integrates leave-one-out control variates with score-function estimators to yield unbiased gradients.
- It reframes ELBO gradient estimation using a log-variance loss to derive the VarGrad estimator, which uses near-optimal baseline coefficients for enhanced performance.
- Extensions with Stein-based and DoubleCV methods further lower gradient variance, improving convergence in discrete latent variable models and variational autoencoders.
REINFORCE leave-one-out (RLOO) gradients refer to a class of variance-reduced gradient estimators for Monte Carlo integration of expectations with respect to parameterized probability distributions, particularly within variational inference (VI) using the score function (REINFORCE) method. These estimators combine the score-function approach with leave-one-out control variates, yielding unbiased but lower-variance gradient estimates, especially suited for discrete latent variable models and variational autoencoders.
1. Score-Function (REINFORCE) Estimators and Their Variants
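The standard REINFORCE estimator targets gradients of the evidence lower bound (ELBO)

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(x, z) - \log q_\phi(z \mid x)\big] = -\mathbb{E}_{q_\phi(z \mid x)}\big[f_\phi(z)\big], \qquad f_\phi(z) = \log \frac{q_\phi(z \mid x)}{p(x, z)}.$$

By the score-function (SF) identity, and because $\mathbb{E}_{q_\phi}[\nabla_\phi \log q_\phi(z \mid x)] = 0$,

$$-\nabla_\phi \mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[f_\phi(z)\, \nabla_\phi \log q_\phi(z \mid x)\big].$$

Naive Monte Carlo score-function gradients typically exhibit high variance, especially for discrete variables, because each score is weighted by the raw value of $f_\phi$. To mitigate this, baseline control variates are often subtracted. The leave-one-out (LOO) REINFORCE variant, also known as RLOO, uses a sample-specific baseline with $S$ i.i.d. samples $z^{(1)}, \dots, z^{(S)} \sim q_\phi(z \mid x)$:

$$\hat g_{\mathrm{RLOO}} = \frac{1}{S} \sum_{s=1}^{S} \big(f_\phi(z^{(s)}) - \bar f_{-s}\big)\, \nabla_\phi \log q_\phi(z^{(s)} \mid x), \qquad \bar f_{-s} = \frac{1}{S-1} \sum_{j \neq s} f_\phi(z^{(j)}),$$

where $\bar f_{-s}$ is the average of $f_\phi(z^{(j)})$ over all $j \neq s$ (Titsias et al., 2021). Because $\bar f_{-s}$ is independent of $z^{(s)}$, the baseline leaves the expectation unchanged, so $\hat g_{\mathrm{RLOO}}$ is an unbiased estimate of $-\nabla_\phi \mathcal{L}(\phi)$.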
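The effect of the leave-one-out baseline can be illustrated numerically. The sketch below is not taken from the cited papers: it assumes a factorized Bernoulli distribution parameterized by logits (so the score has the closed form `z - sigmoid(phi)`), and `toy_f` is a hypothetical integrand with a large constant offset chosen to make the variance gap visible.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_and_score(phi, S, rng):
    """Draw S samples from a factorized Bernoulli with logits phi; return samples and scores."""
    p = sigmoid(phi)
    z = (rng.random((S, phi.size)) < p).astype(float)
    score = z - p  # gradient of log q_phi(z) with respect to the logits
    return z, score

def reinforce_grad(f_vals, score):
    """Naive score-function estimate: average of f(z_s) * score_s."""
    return (f_vals[:, None] * score).mean(axis=0)

def rloo_grad(f_vals, score):
    """RLOO estimate: each sample is centered by the mean of the other S - 1 values."""
    S = f_vals.shape[0]
    loo_mean = (f_vals.sum() - f_vals) / (S - 1)
    return ((f_vals - loo_mean)[:, None] * score).mean(axis=0)

def toy_f(z):
    # Hypothetical integrand (an assumption for illustration); the offset of 100
    # inflates the variance of the naive estimator but not of RLOO.
    return 100.0 + ((z - 0.45) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
phi = np.zeros(5)
naive, rloo = [], []
for _ in range(2000):
    z, score = sample_and_score(phi, S=4, rng=rng)
    f_vals = toy_f(z)
    naive.append(reinforce_grad(f_vals, score))
    rloo.append(rloo_grad(f_vals, score))
naive, rloo = np.array(naive), np.array(rloo)

# Total empirical variance across the 5 gradient coordinates.
naive_var = naive.var(axis=0).sum()
rloo_var = rloo.var(axis=0).sum()
print(f"naive variance: {naive_var:.3f}, RLOO variance: {rloo_var:.5f}")
```

Both estimators target the same gradient (0.025 per coordinate for this toy problem at `phi = 0`), but the RLOO baseline removes the constant offset from every term, which is where most of the naive estimator's variance comes from.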
2. Log-Variance Loss and the VarGrad Estimator
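VarGrad, introduced by Richter et al. (2020), generalizes RLOO by reframing ELBO gradient estimation as the gradient of a "log-variance loss." For a reference density $r(z \mid x)$,

$$\mathcal{L}_r\big(q_\phi \,\|\, p\big) = \frac{1}{2} \operatorname{Var}_{z \sim r}\!\left[\log \frac{q_\phi(z \mid x)}{p(z \mid x)}\right].$$

When the support of $r$ contains the supports of $q_\phi(z \mid x)$ and $p(z \mid x)$, this loss is a divergence that vanishes exactly when the approximate and true posteriors coincide. Write $f_\phi(z) = \log q_\phi(z \mid x) - \log p(x, z)$, which differs from the log-ratio above only by the constant $\log p(x)$ and therefore has the same variance. With $r$ held fixed, the gradient of $\mathcal{L}_r$ yields a score-function term plus a mean correction:

$$\nabla_\phi \mathcal{L}_r = \mathbb{E}_{r}\big[f_\phi(z)\, \nabla_\phi \log q_\phi(z \mid x)\big] - \mathbb{E}_{r}\big[f_\phi(z)\big]\, \mathbb{E}_{r}\big[\nabla_\phi \log q_\phi(z \mid x)\big].$$

At $r = q_\phi$, the mean term cancels because $\mathbb{E}_{q_\phi}[\nabla_\phi \log q_\phi(z \mid x)] = 0$, recovering $\nabla_\phi \mathrm{KL}(q_\phi \,\|\, p)$, the gradient of the negative ELBO (Richter et al., 2020).

The empirical approximation from samples $z^{(1)}, \dots, z^{(S)} \sim r$,

$$\hat{\mathcal{L}}_r = \frac{1}{2(S-1)} \sum_{s=1}^{S} \big(f_\phi(z^{(s)}) - \bar f_\phi\big)^2,$$

yields, after differentiation with the samples held fixed and $r = q_\phi$, the VarGrad estimator:

$$\hat g_{\mathrm{VarGrad}} = \frac{1}{S-1} \sum_{s=1}^{S} \big(f_\phi(z^{(s)}) - \bar f_\phi\big)\, \nabla_\phi \log q_\phi(z^{(s)} \mid x),$$

where $\bar f_\phi = \frac{1}{S} \sum_{s=1}^{S} f_\phi(z^{(s)})$ (Richter et al., 2020). This form is algebraically a leave-one-out baseline with a near-optimal coefficient and rescaling.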
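The link between the sample-mean baseline with $1/(S-1)$ rescaling and a leave-one-out baseline follows from a short calculation. The shorthand below is introduced only for this sketch: $f_s$ denotes the integrand at sample $s$, $g_s$ the score at sample $s$, $\bar f$ the mean over all $S$ samples, and $\bar f_{-s}$ the mean over the other $S-1$ samples.

```latex
\begin{align*}
f_s - \bar f
  &= f_s - \frac{1}{S}\Big(f_s + \sum_{j \neq s} f_j\Big)
   = \frac{S-1}{S}\Big(f_s - \frac{1}{S-1}\sum_{j \neq s} f_j\Big)
   = \frac{S-1}{S}\,\big(f_s - \bar f_{-s}\big), \\
\frac{1}{S-1}\sum_{s=1}^{S} \big(f_s - \bar f\big)\, g_s
  &= \frac{1}{S}\sum_{s=1}^{S} \big(f_s - \bar f_{-s}\big)\, g_s .
\end{align*}
```

The left-hand side of the second line uses one shared baseline and the $1/(S-1)$ normalization, while the right-hand side uses a per-sample leave-one-out baseline and the usual $1/S$ average; the two expressions coincide for every set of samples.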
3. Stein-Based and Double Control Variate Extensions
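Further variance reduction is possible by introducing additional control variates. Stein operators, as developed in the "RODEO" framework, yield flexible zero-mean corrections in discrete spaces. For a discrete distribution $q$ and a Markov kernel $K$ that leaves $q$ invariant, the Stein operator $(Ah)(z) = \sum_{y} K(y \mid z)\, h(y) - h(z)$ satisfies $\mathbb{E}_{q}[(Ah)(z)] = 0$ for any function $h$. The RLOO estimator can thus be augmented with local and global Stein control variates without bias:

$$\hat g_{\mathrm{RODEO}} = \frac{1}{S} \sum_{s=1}^{S} \Big[\big(f(z^{(s)}) - b_s\big)\, \nabla_\phi \log q_\phi(z^{(s)}) + (A h_s)(z^{(s)})\Big]$$

with

$$b_s = \frac{1}{S-1} \sum_{j \neq s} \Big(f(z^{(j)}) + (A H_j)(z^{(j)})\Big),$$

where $h_s$ and $H_j$ denote the auxiliary functions that define the Stein control variates.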
Empirically, RODEO significantly lowers gradient variance in generative modeling, achieving state-of-the-art ELBO and convergence trends with the same computational budget as standard RLOO (Shi et al., 2022).
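The zero-mean property that makes Stein control variates bias-free can be checked directly on a small state space. The sketch below is an illustration under stated assumptions, not the RODEO construction itself: it uses an arbitrary distribution over three binary variables and a random-scan Gibbs kernel (one possible kernel that leaves `q` invariant), and applies the operator $(Ah)(z) = \sum_y K(y \mid z)\,h(y) - h(z)$ to a random test function.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D = 3
states = np.array(list(itertools.product([0, 1], repeat=D)))  # all 2^D binary states
q = rng.random(len(states))
q /= q.sum()  # arbitrary (non-factorized) distribution over the states
index = {tuple(s): n for n, s in enumerate(states)}

# Random-scan Gibbs kernel: pick a coordinate uniformly at random, then resample
# it from its conditional distribution given the remaining coordinates.
K = np.zeros((len(states), len(states)))
for n, s in enumerate(states):
    for i in range(D):
        flipped = s.copy()
        flipped[i] = 1 - flipped[i]
        m = index[tuple(flipped)]
        p_flip = q[m] / (q[n] + q[m])  # conditional probability of the flipped value
        K[n, m] += p_flip / D
        K[n, n] += (1.0 - p_flip) / D

def stein(h_vals):
    """Apply (A h)(z) = sum_y K(y | z) h(y) - h(z) to a table of h values."""
    return K @ h_vals - h_vals

h = rng.normal(size=len(states))  # arbitrary test function, one value per state
stein_mean = q @ stein(h)         # E_q[(A h)(z)]; zero up to floating-point error
print(stein_mean)
```

Enumerating all neighbor states is only feasible here because the space has eight states; the expectation vanishes for any `h` because `q @ K` equals `q`.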
Double control variate (DoubleCV) methods further exploit auxiliary functions, typically constructed from first-order Taylor expansions, to form additional, sample-specific corrections atop the leave-one-out baseline. When optimally combined, these yield strictly lower variance than RLOO, sometimes even surpassing estimators using the unattainable "true mean" baseline (Titsias et al., 2021).
4. Theoretical Properties and Variance Analysis
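The variance improvement gained by RLOO and its extensions is rooted in the optimal baseline selection problem. With $f_\phi(z) = \log q_\phi(z \mid x) - \log p(x, z)$ and all moments taken under $q_\phi$, the optimal coefficient for the control variate $\partial_{\phi_i} \log q_\phi(z \mid x)$ in dimension $i$ is

$$a_i^{*} = \frac{\operatorname{Cov}\big(f_\phi\, \partial_{\phi_i} \log q_\phi,\; \partial_{\phi_i} \log q_\phi\big)}{\operatorname{Var}\big(\partial_{\phi_i} \log q_\phi\big)}.$$

VarGrad uses the sample mean $\bar f_\phi$ as its baseline coefficient, which is close to optimal under broad conditions: specifically, when $\mathrm{KL}(q_\phi \,\|\, p)$ is either very large (early in training) or very small (late in training), and moments of the score are bounded (Richter et al., 2020).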
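In high-dimensional regimes (latent dimension $D$ large) and with sufficiently many samples, the variance bound

$$\operatorname{Var}\big(\hat g_{\mathrm{VarGrad},\, i}\big) \le \operatorname{Var}\big(\hat g_{\mathrm{REINFORCE},\, i}\big)$$

provably holds. Stein-augmented estimators (RODEO) and DoubleCV, when optimally tuned, analogously achieve variance lower than RLOO (Shi et al., 2022; Titsias et al., 2021).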
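The relation between the optimal baseline coefficient and the plain mean of the integrand can be made concrete by exact enumeration. The sketch below assumes a three-dimensional factorized Bernoulli with hypothetical logits and a hypothetical integrand `f`; it uses the standard control-variate result $a_i^{*} = \mathbb{E}[f s_i^2] / \mathbb{E}[s_i^2]$ for a zero-mean score $s_i$, which can be rewritten as $\mathbb{E}[f] + \operatorname{Cov}(f, s_i^2) / \mathbb{E}[s_i^2]$.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

phi = np.array([0.3, -1.0, 0.8])  # hypothetical logits
p = sigmoid(phi)
states = np.array(list(itertools.product([0.0, 1.0], repeat=phi.size)))
q = np.prod(np.where(states == 1.0, p, 1.0 - p), axis=1)  # q_phi(z) for every state
score = states - p                                         # rows: gradient of log q_phi(z)
f = ((states - 0.45) ** 2).sum(axis=1) + 2.0               # hypothetical integrand f(z)

mean_f = q @ f
# Optimal coefficient per dimension: Cov(f * s_i, s_i) / Var(s_i), using E[s_i] = 0.
a_star = (q @ (f[:, None] * score**2)) / (q @ score**2)
# Correction separating the optimal coefficient from the mean: Cov(f, s_i^2) / E[s_i^2].
delta = (q @ (f[:, None] * score**2) - mean_f * (q @ score**2)) / (q @ score**2)

def variance(a):
    """Per-dimension variance of the single-sample estimator (f(z) - a_i) * s_i(z)."""
    second_moment = q @ ((f[:, None] - a) ** 2 * score**2)
    first_moment = q @ ((f[:, None] - a) * score)
    return second_moment - first_moment**2

print("E[f]     :", mean_f)
print("a_star   :", a_star)
print("var(0)   :", variance(0.0))
print("var(E[f]):", variance(mean_f))
print("var(a*)  :", variance(a_star))
```

When the correction `delta` is small relative to `mean_f`, the mean of the integrand is nearly as good a baseline as the optimal coefficient, which is the regime in which a sample-mean baseline is close to optimal.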
5. Practical Implementation and Algorithmic Details
VarGrad and its extensions operate as black-box, parameter-free gradient estimators requiring only standard score-function gradient machinery. The implementation follows the steps below, shown for VarGrad (Richter et al., 2020):
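- Sample $S$ i.i.d. draws $z^{(1)}, \dots, z^{(S)} \sim q_\phi(z \mid x)$ (samples detached for “stop-gradient” semantics).
- Compute $f_\phi(z^{(s)}) = \log q_\phi(z^{(s)} \mid x) - \log p(x, z^{(s)})$ for each sample.
- Assemble sample mean $\bar f_\phi = \frac{1}{S} \sum_{s=1}^{S} f_\phi(z^{(s)})$ and gradients $\nabla_\phi \log q_\phi(z^{(s)} \mid x)$.
- Return

$$\hat g_{\mathrm{VarGrad}} = \frac{1}{S-1} \sum_{s=1}^{S} \big(f_\phi(z^{(s)}) - \bar f_\phi\big)\, \nabla_\phi \log q_\phi(z^{(s)} \mid x).$$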
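The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the variational family is a factorized Bernoulli with logits `phi`, so the score is available in closed form instead of through autodiff, and `log_joint` is a hypothetical stand-in for $\log p(x, z)$. The exact gradient is computed by enumeration only to check the estimate.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_q(z, phi):
    """log q_phi(z) for a factorized Bernoulli with logits phi."""
    p = sigmoid(phi)
    return (z * np.log(p) + (1.0 - z) * np.log1p(-p)).sum(axis=-1)

def log_joint(z):
    # Hypothetical stand-in for log p(x, z); a real model would evaluate its decoder here.
    w = np.array([1.0, -2.0, 0.5])
    return z @ w

def vargrad(phi, S, rng):
    p = sigmoid(phi)
    # Step 1: draw S i.i.d. samples; they are plain arrays, so no gradient flows through them.
    z = (rng.random((S, phi.size)) < p).astype(float)
    # Step 2: evaluate f_phi(z) = log q_phi(z) - log p(x, z) at every sample.
    f = log_q(z, phi) - log_joint(z)
    # Step 3: sample mean of f and per-sample scores (closed form for Bernoulli logits).
    f_bar = f.mean()
    score = z - p
    # Step 4: combine with the 1 / (S - 1) normalization.
    return ((f - f_bar)[:, None] * score).sum(axis=0) / (S - 1)

def exact_grad(phi):
    """Exact E_q[f_phi(z) * score(z)] by enumerating all states (feasible only for tiny D)."""
    states = np.array(list(itertools.product([0.0, 1.0], repeat=phi.size)))
    q = np.exp(log_q(states, phi))
    f = log_q(states, phi) - log_joint(states)
    return q @ (f[:, None] * (states - sigmoid(phi)))

rng = np.random.default_rng(0)
phi = np.array([0.2, -0.4, 0.1])
estimates = np.array([vargrad(phi, S=4, rng=rng) for _ in range(4000)])
mc_grad = estimates.mean(axis=0)   # average of many VarGrad estimates
true_grad = exact_grad(phi)        # gradient of the negative ELBO for this toy model
print(mc_grad, true_grad)
```

In an autodiff framework the same estimate is usually obtained by differentiating the empirical variance of `f` with the samples treated as constants, which avoids writing the score by hand.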
Algorithmic enhancements for VAEs and other models leverage autodiff to obtain required derivatives, incur negligible additional computational cost, and introduce no extra passes through the decoder or sampling distribution (Titsias et al., 2021).
6. Empirical Evaluation and Comparison
Empirical results demonstrate that RLOO, VarGrad, RODEO, and DoubleCV provide favorable variance-computation trade-offs in both synthetic and real-world settings:
- In discrete VAEs (Bernoulli latent models, e.g., Omniglot), VarGrad (with 4 samples) achieves learning curves nearly matching those of REBAR, RELAX, and ARM, but at reduced per-step computational cost.
- RODEO achieves up to an order-of-magnitude variance reduction relative to RLOO and DoubleCV, matching or exceeding ELBOs of competing estimators at fixed budget (Shi et al., 2022).
- DoubleCV converges in fewer steps and achieves uniformly lower variance and higher ELBOs than RLOO, as illustrated in benchmark studies on MNIST, Fashion-MNIST, and Omniglot (Titsias et al., 2021).
A summary table of ELBO comparisons for 5 on Bernoulli-likelihood VAEs:
| Estimator | MNIST | Fashion-MNIST | Omniglot |
|---|---|---|---|
| RLOO | 6 | 7 | 8 |
| DoubleCV | 9 | 0 | 1 |
| DisARM | 2 | 3 | 4 |
7. Assumptions, Limitations, and Regimes of Use
These methods assume that the support of the reference distribution $r$ contains the support of $q_\phi$ to ensure well-definedness of the control variate and gradient terms. All estimators require that sampling and baseline construction do not differentiate through the samples (“stop-gradient” on draws). Tight variance guarantees hold under additional tail-regularity and bounded kurtosis conditions on the score, with the most significant improvements observed in large-sample (large $S$) and high-dimensional (large $D$) settings (Richter et al., 2020).
RODEO’s efficiency may be limited in moderate-to-high dimensional discrete spaces when full evaluation of neighbor states for the Stein operators is infeasible, though surrogate strategies mitigate this. DoubleCV, when adapted as recommended, incurs no additional autodiff or sampling cost in standard VAE implementations.
The development and refinement of REINFORCE leave-one-out gradient estimators, including VarGrad, Stein-augmented, and double control variate extensions, represent key advances in low-variance, unbiased gradient estimation for variational inference in discrete latent variable frameworks (Richter et al., 2020; Shi et al., 2022; Titsias et al., 2021).