Rao-Blackwellised Gradient Estimator
- The Rao-Blackwellised gradient estimator is a variance reduction technique that analytically marginalizes over latent variables to produce lower-variance, unbiased stochastic gradients.
- It leverages conditional expectation to replace Monte Carlo sampling in parts of the computation, which enhances efficiency in variational inference, reinforcement learning, and probabilistic deep learning.
- Reported experiments show large variance reductions, exceeding 90% in some highly concentrated discrete settings, compared to traditional estimators, with only moderate additional computational cost.
A Rao-Blackwellised gradient estimator is a stochastic gradient estimator that leverages the Rao-Blackwell theorem to reduce variance by analytically marginalizing over one or more latent random variables, rather than relying purely on Monte Carlo sampling. The approach is used across variational inference, discrete and continuous stochastic optimization, probabilistic deep learning, and reinforcement learning. It leaves the bias of the underlying estimator unchanged (so an unbiased estimator stays unbiased) while lowering its variance, often at only moderate additional computation. It subsumes and generalizes several classical techniques and is particularly effective in high-dimensional or structured latent-variable models where naive score-function estimators or standard reparameterization gradients suffer from excessive variance.
1. Definition and Fundamental Principle
The core principle behind Rao-Blackwellised gradient estimators is to reduce the variance of an unbiased (or otherwise useful) Monte Carlo gradient estimator by conditioning on a convenient sufficient statistic, which amounts to integrating exactly over a tractable subset of latent variables (or labels). In a stochastic gradient setting, suppose the gradient estimator $\hat g(z)$ is unbiased for the gradient $\nabla_\theta F(\theta)$, with $z$ a random variable sampled from a distribution $q_\theta(z)$. For a statistic $T = T(z)$, the Rao-Blackwell theorem guarantees that the conditional expectation $\tilde g = \mathbb{E}[\hat g(z) \mid T]$ satisfies $\mathbb{E}[\tilde g] = \mathbb{E}[\hat g] = \nabla_\theta F(\theta)$ but with lower variance: $\operatorname{Var}(\tilde g) \le \operatorname{Var}(\hat g)$, with strict inequality unless $\hat g$ is almost surely a function of $T$ only.
In practice, this is realized by drawing a full sample $z \sim q_\theta(z)$, then, for the subset of variables most closely associated with a given parameter block or computational bottleneck, replacing Monte Carlo expectations with analytically tractable marginalizations (such as exact summation over all values of a discrete random variable or one-dimensional numerical quadrature in continuous cases) (Titsias, 2015).
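
The principle can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not code from any of the cited papers: it takes two independent Bernoulli variables, a hypothetical objective `f`, and a single logit parameter `theta` for the first variable, then compares the plain score-function estimator with the version that sums the first variable out exactly given the second.

```python
import numpy as np

def f(z1, z2):
    # Hypothetical objective, chosen only for illustration.
    return (z1 + 2.0 * z2 - 1.0) ** 2

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def score_function_grad(theta, rng, n):
    """n score-function samples of d/dtheta E[f(z1, z2)],
    with z1 ~ Bernoulli(sigmoid(theta)) and z2 ~ Bernoulli(0.5)."""
    p = sigmoid(theta)
    z1 = (rng.random(n) < p).astype(float)
    z2 = (rng.random(n) < 0.5).astype(float)
    # For a logit parameterisation, d/dtheta log Bernoulli(z1; p) = z1 - p.
    return f(z1, z2) * (z1 - p)

def rao_blackwell_grad(theta, rng, n):
    """Same target, but z1 is summed out exactly given the sampled z2."""
    p = sigmoid(theta)
    z2 = (rng.random(n) < 0.5).astype(float)
    # E[f(z1, z2) * (z1 - p) | z2] = p * (1 - p) * (f(1, z2) - f(0, z2)).
    return p * (1.0 - p) * (f(1.0, z2) - f(0.0, z2))

rng = np.random.default_rng(0)
sf = score_function_grad(0.3, rng, 100_000)
rb = rao_blackwell_grad(0.3, rng, 100_000)
print("means:    ", sf.mean(), rb.mean())
print("variances:", sf.var(), rb.var())
```

Both sample means estimate the same gradient; the conditioned version only retains the randomness of `z2`, which is why its variance cannot exceed that of the score-function samples.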
2. Applications Across Latent Variable Models
2.1. Variational Inference—Local Expectation Gradients
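In variational inference, particularly for latent-variable models $p(x, z)$ and structured variational families $q_\theta(z)$, stochastic optimization of the evidence lower bound (ELBO) typically employs either the log-derivative (score-function) gradient estimator or the reparameterization trick. The local expectation gradient estimator, explicitly a Rao-Blackwellised score-function gradient, achieves lower variance by integrating over the latent variable $z_i$ most directly associated with parameter $\theta_i$:
$$\nabla_{\theta_i} \mathcal{L}(\theta) = \mathbb{E}_{q(z_{\setminus i})}\Big[\mathbb{E}_{q(z_i \mid \mathrm{mb}_i)}\big[f(z)\,\nabla_{\theta_i} \log q_{\theta_i}(z_i \mid \mathrm{pa}_i)\big]\Big], \qquad f(z) = \log p(x, z) - \log q_\theta(z),$$
where $\mathrm{pa}_i$ and $\mathrm{mb}_i$ denote the parents and the Markov blanket of $z_i$ in $q_\theta$. The outer expectation is estimated with a single sample and the inner one is computed exactly. This estimator is applicable for both discrete and continuous $z_i$ (using summation or quadrature), is unbiased, and empirically lowers variance by about an order of magnitude versus standard reparameterization at fixed computational budget. Local expectations can be computed for all parameters in parallel, at a per-iteration cost of order $nK$ evaluations of $f$ for $n$ latent variables and $K$ summation or quadrature points each, yielding fast, stable convergence in high-dimensional and structured latent variable models (Titsias, 2015).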
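
For the continuous case, the inner one-dimensional expectation can be approximated by Gauss-Hermite quadrature. The sketch below is a simplified illustration rather than the reference implementation: it assumes a mean-field Gaussian $q(z) = \prod_i \mathcal{N}(z_i; \mu_i, \sigma_i^2)$ (so the conditional of each coordinate is just its own marginal), differentiates only with respect to the means, and treats the integrand `f` as an arbitrary user-supplied function. The function name and arguments are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def local_expectation_grad_mu(f, mu, sigma, rng, num_nodes=20):
    """One-sample local expectation gradient of E_q[f(z)] w.r.t. mu for a
    mean-field Gaussian q(z) = prod_i N(z_i; mu_i, sigma_i^2)."""
    nodes, weights = hermegauss(num_nodes)        # rule for the weight exp(-x^2 / 2)
    weights = weights / np.sqrt(2.0 * np.pi)      # normalise to N(0, 1) expectations
    z = mu + sigma * rng.standard_normal(mu.shape)  # pivot sample
    grad = np.zeros_like(mu)
    for i in range(mu.size):
        total = 0.0
        for x, w in zip(nodes, weights):
            z_i = z.copy()
            z_i[i] = mu[i] + sigma[i] * x         # move coordinate i to a quadrature node
            # Score of N(z_i; mu_i, sigma_i^2) w.r.t. mu_i at this node is x / sigma_i.
            total += w * f(z_i) * x / sigma[i]
        grad[i] = total
    return grad

# Usage with a hypothetical integrand.
rng = np.random.default_rng(0)
mu = np.array([0.3, -0.7, 1.1])
sigma = np.array([0.5, 0.5, 0.5])
print(local_expectation_grad_mu(lambda z: np.sum(z ** 2) + z[0] * z[1], mu, sigma, rng))
```

Each coordinate costs `num_nodes` evaluations of `f`, so one call uses `mu.size * num_nodes` evaluations; the loop over coordinates is independent and could be batched.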
2.2. Discrete Distributions and "Top-K" Summing Out
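For gradients involving expectations over large discrete spaces (e.g., categorical variables with a large number of categories $K$), the Rao-Blackwellized estimator can be implemented by summing out the $k$ categories with the largest mass in $q_\theta(z)$ and sampling over the complement:
$$\hat g_k = \sum_{z \in C_k} q_\theta(z)\, g(z) + \big(1 - q_\theta(C_k)\big)\, g(\tilde z), \qquad \tilde z \sim q_\theta(z \mid z \notin C_k),$$
where $C_k$ is the set of the $k$ most probable categories, $q_\theta(C_k) = \sum_{z \in C_k} q_\theta(z)$, and $g(z) = f(z)\,\nabla_\theta \log q_\theta(z)$ is the per-sample score-function term.
This technique preserves unbiasedness and exploits concentration of probability mass (low-entropy settings) to drive variance down rapidly as $k$ increases, at little extra cost over vanilla REINFORCE. In highly concentrated cases, even $k = 1$ yields a variance reduction of more than 90% (Liu et al., 2018).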
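
A minimal sketch of the top-k construction follows, assuming a softmax-parameterised categorical distribution and gradients taken with respect to the logits. The function names and the choice of parameterisation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def topk_rb_grad(logits, f, k, rng):
    """One-sample estimate of d/dlogits E_{z ~ q}[f(z)] with q = softmax(logits):
    the k most probable categories are summed exactly, the rest is sampled."""
    q = softmax(logits)
    n = q.size
    order = np.argsort(-q)
    top, rest = order[:k], order[k:]

    def g(z):
        # Score-function term f(z) * d/dlogits log q(z) = f(z) * (onehot(z) - q).
        return f(z) * (np.eye(n)[z] - q)

    grad = sum(q[z] * g(z) for z in top)            # exact part over the top-k set
    if rest.size > 0:
        tail_mass = q[rest].sum()
        z_tail = rng.choice(rest, p=q[rest] / tail_mass)  # sample from the complement
        grad = grad + tail_mass * g(z_tail)
    return grad

# Usage with hypothetical values: a peaked distribution over five categories.
logits = np.array([3.0, 0.5, 0.0, -1.0, -0.5])
fvals = np.array([1.0, 2.0, 0.5, 3.0, 1.5])
rng = np.random.default_rng(0)
print(topk_rb_grad(logits, lambda z: fvals[z], k=1, rng=rng))
```

Setting `k=0` recovers plain REINFORCE and setting `k` to the number of categories gives the exact gradient, so `k` trades computation against variance.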
2.3. Rao-Blackwellisation with Sampling Without Replacement
Variants based on sampling unordered sets without replacement (e.g., using Plackett-Luce samplers) further lower variance in expectations over discrete random variables. By conditioning on the unordered set and analytically computing leave-one-out ratios for each element, unbiased estimators with substantially reduced variance—especially in low-entropy or peaky distributions—are obtained, often outperforming independent-sampling baselines (Kool et al., 2020).
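
The leave-one-out construction can be sketched for small sample sizes as follows. This is an illustration based on the unordered set estimator as described by Kool et al. (2020), written here as $\sum_{s \in S} p(s)\,R(S, s)\,f(s)$ with $R(S, s)$ the probability of $S \setminus \{s\}$ under $p$ restricted to the domain without $s$, divided by the probability of $S$. The helper names are hypothetical, and the set probability is computed by brute-force enumeration of orderings, which is only practical for small sets.

```python
import itertools
import numpy as np

def sample_without_replacement(p, k, rng):
    """Gumbel top-k trick: the k largest perturbed log-probabilities form a
    Plackett-Luce sample without replacement."""
    perturbed = np.log(p) + rng.gumbel(size=p.size)
    return np.argsort(-perturbed)[:k]

def ordered_prob(p, seq):
    """Probability of drawing `seq` in this order without replacement."""
    prob, remaining = 1.0, 1.0
    for s in seq:
        prob *= p[s] / remaining
        remaining -= p[s]
    return prob

def set_prob(p, subset):
    """Probability of the unordered set: sum over all of its orderings."""
    return sum(ordered_prob(p, perm) for perm in itertools.permutations(subset))

def unordered_set_estimate(p, f, subset):
    """Estimate of E_{s ~ p}[f(s)] from an unordered without-replacement sample."""
    p_set = set_prob(p, subset)
    total = 0.0
    for s in subset:
        others = [t for t in subset if t != s]
        p_without_s = p / (1.0 - p[s])   # renormalised; the entry for s is never used
        ratio = set_prob(p_without_s, others) / p_set   # leave-one-out ratio
        total += p[s] * ratio * f(s)
    return total

# Usage with hypothetical values.
p = np.array([0.6, 0.25, 0.1, 0.05])
fvals = np.array([1.0, -2.0, 3.0, 0.5])
rng = np.random.default_rng(0)
S = sample_without_replacement(p, 2, rng)
print(S, unordered_set_estimate(p, lambda s: fvals[s], S))
```

Passing `f(s)` multiplied by the gradient of `log p(s)` in place of `f(s)` turns the same function into a gradient estimator, since `p[s]` times that product is `f(s)` times the gradient of `p(s)`.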
2.4. Continuous Latent Variables—Reparameterisation Gradients
For continuous latent Gaussian variables with linear or neural parameterizations, Rao-Blackwellising the reparameterization gradient (e.g., as in the R2-G2 estimator) analytically marginalizes over sampled pre-activation noise, yielding gradients of the form
5
where 6 is a linear function of the Gaussian variable.
This construction strictly reduces variance compared to standard reparameterization, preserves unbiasedness, and generalizes the local reparameterization trick for BNNs (Lam et al., 9 Jun 2025).
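
The connection to local reparameterization can be seen in the simplest possible case. The sketch below is not the R2-G2 estimator itself; it assumes a single input vector `x`, a factorised Gaussian weight vector $w = \mu + \sigma \odot \varepsilon$, and a scalar pre-activation $a = x^\top w$, and it conditions the reparameterization gradient with respect to $\sigma$ on $a$ using the Gaussian identity $\mathbb{E}[\varepsilon_j \mid a] = x_j \sigma_j (a - x^\top \mu) / \sum_l x_l^2 \sigma_l^2$. All names are hypothetical.

```python
import numpy as np

def grads_wrt_sigma(x, mu, sigma, dloss, rng, n):
    """n gradient samples of E[loss(a)] w.r.t. sigma, where a = x @ w and
    w = mu + sigma * eps with eps ~ N(0, I). Returns (standard, conditioned)."""
    eps = rng.standard_normal((n, x.size))
    a = (mu + sigma * eps) @ x                  # pre-activation, linear in eps
    da = dloss(a)[:, None]                      # loss'(a), shape (n, 1)
    standard = da * x * eps                     # chain rule: da/dsigma_j = x_j * eps_j
    # Conditioning on a replaces eps_j by E[eps_j | a].
    s2 = np.sum((x * sigma) ** 2)               # Var(a)
    conditioned = da * x * (x * sigma) * (a - x @ mu)[:, None] / s2
    return standard, conditioned

# Usage with hypothetical values and a squared loss, loss(a) = a^2.
rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
mu = np.array([0.2, -0.1, 0.4])
sigma = np.array([0.5, 0.3, 0.8])
std, cond = grads_wrt_sigma(x, mu, sigma, lambda a: 2.0 * a, rng, 100_000)
print("means:    ", std.mean(axis=0), cond.mean(axis=0))
print("variances:", std.var(axis=0), cond.var(axis=0))
```

The conditioned samples depend on the noise only through the scalar `a`, which is the same quantity that local reparameterization samples directly.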
2.5. Augment-REINFORCE-Swap-Merge (ARSM) Estimator
The ARSM estimator applies Rao-Blackwellisation to categorical expectations, re-expressing the gradient as an expectation over a Dirichlet distribution, and leveraging permutation symmetry (swap/merge) to construct variance-reduced estimators. The resulting estimator is unbiased, matching the true gradient up to Monte Carlo error, with gradient variance reported to be 10 to 1,000 times lower than REINFORCE or RELAX (Yin et al., 2019).
3. Algorithmic Implementations
Across settings, the general recipe for a Rao-Blackwellised gradient estimator is:
- Draw a full or partial pivot sample from the variational or target distribution.
- For each sub-block/parameter/latent variable of interest, analytically compute the expectation of the gradient estimator over the variable(s) most directly coupled to the parameter, conditioned on the rest of the sample.
- Aggregate those conditional expectations—typically one per parameter block—in a parallelizable or batched fashion.
- Update parameters using the resulting low-variance gradient blocks.
If the conditional expectation is analytic (finite sum for discrete, tractable integral for continuous, or closed-form Dirichlet/conditional Gumbel, etc.), then no additional stochasticity is introduced, and the computational cost is only marginally higher than the naive estimator. Example implementations include:
- Local expectation gradients for variational inference (Titsias, 2015)
- Top-K category summing for large discrete support (Liu et al., 2018)
- Sampling-without-replacement/leave-one-out methods (Kool et al., 2020)
- R2-G2 for latent Gaussians (Lam et al., 9 Jun 2025)
- Gumbel-Rao or Rao-Blackwellised ST-Gumbel-Softmax (Paulus et al., 2020)
- ReinMax-Rao for categorical/straight-through gradients (Wang et al., 9 Mar 2026)
4. Variance Reduction, Bias, and Theoretical Properties
The variance reduction property of Rao-Blackwellisation is guaranteed by the law of total variance: conditioning on an analytically tractable sufficient statistic always yields an estimator with variance no larger than the original, with equality only if the original estimator is already deterministic given the statistic. Multiple studies quantify specific reductions; for instance, summing out one category in a highly concentrated discrete distribution can cut REINFORCE variance by over 90% (Liu et al., 2018); local expectation gradients are empirically about an order of magnitude lower in variance than vanilla reparameterization in both synthetic and real-world examples (Titsias, 2015).
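
The argument can be written out in two lines. The notation here is assumed for illustration: $\hat g$ is the original estimator, $T$ the statistic being conditioned on, and $\tilde g = \mathbb{E}[\hat g \mid T]$ the Rao-Blackwellised estimator.

$$
\begin{aligned}
\operatorname{Var}(\hat g) &= \mathbb{E}\big[\operatorname{Var}(\hat g \mid T)\big] + \operatorname{Var}\big(\mathbb{E}[\hat g \mid T]\big) = \mathbb{E}\big[\operatorname{Var}(\hat g \mid T)\big] + \operatorname{Var}(\tilde g) \;\ge\; \operatorname{Var}(\tilde g), \\
\mathbb{E}[\tilde g] &= \mathbb{E}\big[\mathbb{E}[\hat g \mid T]\big] = \mathbb{E}[\hat g].
\end{aligned}
$$

The gap between the two variances is exactly $\mathbb{E}[\operatorname{Var}(\hat g \mid T)]$, the average variability that remains in $\hat g$ once $T$ is known, which is zero only when $\hat g$ is almost surely a function of $T$.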
Rao-Blackwellisation preserves the bias of the estimator it is applied to: by the tower property, the conditional expectation has the same mean as the original, so an unbiased estimator stays unbiased. Some constructions (e.g., Rao-Blackwellised straight-through Gumbel-Softmax or ReinMax-Rao) operate on already biased surrogate estimators; there the bias is unchanged and the variance is still reduced (Paulus et al., 2020, Wang et al., 9 Mar 2026).
5. Empirical Performance and Application Case Studies
Extensive empirical studies consistently demonstrate the effectiveness of Rao-Blackwellised estimators:
- High-dimensional Gaussian variational inference: order-of-magnitude variance improvements, faster and more stable ELBO convergence, and the ability to use a small quadrature rule with a single sample where naive estimators need thousands of MC samples (Titsias, 2015).
- Discrete-latent VAEs: top-K RB estimators achieve rapid ELBO convergence and high classification accuracy with minimal computational overhead (Liu et al., 2018).
- Categorical VAEs, RL policy gradients, and structured prediction: ARSM and unordered-set RB estimators match or surpass all prior baselines, show 10–1,000× gradient variance reduction, and avoid additional control-variate or baseline tuning (Yin et al., 2019, Kool et al., 2020).
- Risk-sensitive control: Rao-Blackwellised score climbing with particle filters yields unbiased, low-variance policy gradients within SMC frameworks for non-Gaussian, nonlinear stochastic dynamical systems (Abdulsamad et al., 2023).
6. Comparisons with Alternative Gradient Estimators
The Rao-Blackwellised approach generalizes and often strictly dominates classical methods:
| Estimator | Applicability | Variance | Unbiasedness | Structure / Notes |
|---|---|---|---|---|
| Score-function | Arbitrary (discrete, continuous) | High | Unbiased | Simple; needs a control variate |
| Reparameterized | Continuous, differentiable $f$ | Moderate | Unbiased | Low variance when applicable |
| Rao-Blackwellised | Discrete/cont./graph-factored | Lowest | Unbiased | Tractable marginalization |
In many settings, Rao-Blackwellised estimators inherit the generality of score-function methods but match or outperform reparameterized estimators when the latter are available. Rao-Blackwellisation can also be applied to various straight-through, surrogate, or policy-gradient estimators as a meta-optimization step (Titsias, 2015, Paulus et al., 2020, Wang et al., 9 Mar 2026, Yin et al., 2019).
7. Limitations and Practical Considerations
While the computational cost of marginalizing over a single variable per parameter block is typically moderate (e.g., $O(K)$ for $K$ quadrature points per iteration), applications to high-arity or high-dimensional discrete variables can be challenging unless strong structural factorization or probability mass concentration is present (Liu et al., 2018, Kool et al., 2020). In Bayesian neural networks and deep VAEs, efficient matrix solves or blockwise analytic marginalization suffice for practical deployment (Lam et al., 9 Jun 2025). For particle-based sequential models, per-iteration cost scales with the number of particles but the variance reduction justifies the additional computation (Abdulsamad et al., 2023).
References
- "Local Expectation Gradients for Doubly Stochastic Variational Inference" (Titsias, 2015)
- "Rao-Blackwellized Stochastic Gradients for Discrete Distributions" (Liu et al., 2018)
- "Estimating Gradients for Discrete Random Variables by Sampling without Replacement" (Kool et al., 2020)
- "Rao-Blackwellised Reparameterisation Gradients" (Lam et al., 9 Jun 2025)
- "Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator" (Paulus et al., 2020)
- "Beyond ReinMax: Low-Variance Gradient Estimators for Discrete Latent Variables" (Wang et al., 9 Mar 2026)
- "ARSM: Augment-REINFORCE-Swap-Merge Estimator for Gradient Backpropagation Through Categorical Variables" (Yin et al., 2019)
- "Risk-Sensitive Stochastic Optimal Control as Rao-Blackwellized Markovian Score Climbing" (Abdulsamad et al., 2023)