
Variance Reduced Policy Gradient

Updated 9 March 2026
  • Variance Reduced Policy Gradient methods are a set of reinforcement learning techniques that systematically lower gradient variance while preserving unbiasedness and convergence.
  • They integrate methods such as control variates, off-policy experience replay, and stochastic optimization to substantially enhance sample efficiency and stability.
  • Empirical studies show these methods can reduce gradient variance by 10^2–10^3 times, leading to faster learning and more reliable long-horizon performance.

Variance reduced policy gradient (VRPG) methods comprise a family of algorithms in reinforcement learning (RL) that address the statistical inefficiency of policy gradient estimators by systematically controlling or minimizing the variance of the gradient estimate without sacrificing unbiasedness or convergence to optimality. High variance in standard stochastic policy gradients propagates as slow learning, suboptimal sample efficiency, and instability. The last decade has seen intensive development of variance-reduction techniques tailored for RL, spanning analytically structured control variates, off-policy experience replay, stochastic optimization adaptations (SVRG, SARAH, STORM), action-dependent and trajectory-level baselines, and advanced replay-buffer selection and weighting frameworks. These innovations have delivered quantifiable reduction in sample complexity and improved stability, leading VRPG to become a central paradigm underlying modern scalable policy optimization.

1. High Variance in Policy Gradients: Core Problem Statement

Standard policy gradient methods, such as REINFORCE and actor-critic, estimate the gradient of the expected return

J(\theta) = \mathbb{E}\Big[\sum_{t=0}^{T-1}\gamma^t R(S_t, A_t)\Big],

via the Monte Carlo estimator

\tilde\nabla J(X) = \sum_t \gamma^t G_t(X)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t),

where G_t is the empirical return-to-go (Kaledin et al., 2022). This estimator exhibits high variance because both G_t and the score function \nabla_\theta \log \pi are random. The excess variance compels the use of exceptionally small step sizes or large gradient-batch sizes, directly impeding learning efficiency. Variance-reduction techniques aim to retain unbiasedness while introducing structure (e.g., control variates, replay methods, recursive estimators) to systematically lower this variance.
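To make the variance concrete, here is a minimal numerical sketch (not from the cited papers) of the Monte Carlo estimator on a one-step, three-armed bandit; the reward means and softmax parameterization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit: a one-step MDP, so G_t is just the immediate reward.
theta = np.zeros(3)                      # policy logits
true_means = np.array([1.0, 1.5, 2.0])   # hypothetical arm reward means

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_sample(theta):
    """One REINFORCE gradient sample: G * grad_theta log pi(a)."""
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = true_means[a] + rng.normal()     # noisy return
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                # gradient of log-softmax w.r.t. logits
    return r * grad_log_pi

samples = np.stack([reinforce_sample(theta) for _ in range(5000)])
print("mean gradient:", samples.mean(axis=0))
print("per-coordinate variance:", samples.var(axis=0))
```

Even in this one-step problem the per-coordinate variance is large relative to the mean gradient, which is exactly what forces small step sizes or large batches in practice.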

2. Control Variates, Baselines, and the Limits of Classic Techniques

The use of control variates in RL traces to the baseline trick: subtracting a value b(S_t) from returns before multiplying by the policy score preserves unbiasedness while often meaningfully lowering variance (Kaledin et al., 2022, Zhong et al., 2021). In practice, baselines may be:

  • State-dependent (classic critic): fit by least-squares (A2C-style),
  • Action-dependent: constructed via Stein's identity or using the structure of factorized policies, leading to stronger variance reductions, especially for high-dimensional or continuous-action policies (Liu et al., 2017, Wu et al., 2018).

Coordinate-wise and layer-wise vector baselines further reduce variance by optimizing the baseline at each parameter coordinate or layer (Zhong et al., 2021). These vector-structured baselines empirically and theoretically dominate standard scalar ones in variance reduction, particularly for deep/large policies.
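A toy comparison, under assumed arm rewards and a fixed softmax policy (none of this is from the cited works), shows the baseline trick in action: subtracting the value baseline b = E[R] leaves the mean gradient essentially unchanged while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([0.2, 0.3, 0.5])            # fixed policy, for illustration only
true_means = np.array([1.0, 1.5, 2.0])    # hypothetical arm reward means

def grad_log_pi(a):
    g = -pi.copy()
    g[a] += 1.0                           # gradient of log-softmax w.r.t. logits
    return g

def grad_sample(baseline):
    """One score-function gradient sample with the baseline subtracted."""
    a = rng.choice(3, p=pi)
    r = true_means[a] + rng.normal()
    return (r - baseline) * grad_log_pi(a)

b = float(pi @ true_means)                # value baseline: b = E[R] under pi
plain = np.stack([grad_sample(0.0) for _ in range(20000)])
based = np.stack([grad_sample(b) for _ in range(20000)])

print("max gap between mean gradients:", np.abs(plain.mean(0) - based.mean(0)).max())
print("variance ratio (with / without):", based.var(0).sum() / plain.var(0).sum())
```

The gap between the two mean gradients is sampling noise only (the baseline is exactly unbiased), while the total variance drops by a clear factor.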

Recent work also exploits trajectory-wise control variates: by recursively expanding the law of total variance along the full trajectory, these methods eliminate not only the immediate (state-action) variance but also the variance arising from future-trajectory realizations. Theoretically, trajectory-wise CVs provably achieve the minimum residual variance among all admissible CVs that depend on current policy noise, and empirically speed up long-horizon RL (Cheng et al., 2019).

3. Off-Policy Sample Reuse: Experience Replay with Variance Reduction

Off-policy methods seek to further improve sample efficiency by reusing experience not just from the current policy but also from past (behavior) policies. Variance Reduction Experience Replay (VRER) frameworks—both for full trajectories and per-step (partial trajectory) reuse—introduce sophisticated weighting, selection, and mixture strategies (Zheng et al., 5 Feb 2026, Zheng et al., 2022, Zheng et al., 2021). The key elements are:

  • Importance sampling correction: Each reused sample is weighted by a likelihood ratio w_{i,k}(s,a) = \pi_{\theta_k}(a|s) / \pi_{\theta_i}(a|s) to ensure unbiasedness.
  • Mixture likelihood ratio and selective reuse: Instead of uniform replay, samples are weighted/admitted based on their estimated relevance (via variance, KL-divergence, or a variance upper-bound selection rule). This guarantees that only samples sufficiently similar to the current policy are reused, avoiding catastrophic variance inflation (Zheng et al., 2021, Zheng et al., 2022).
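The weighting-and-selection idea above can be sketched on the same kind of toy bandit; the KL threshold of 0.1, the policies, and the reward means are illustrative assumptions, not values from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(2)
true_means = np.array([1.0, 1.5, 2.0])    # hypothetical arm reward means

def grad_log_pi(pi, a):
    g = -pi.copy()
    g[a] += 1.0
    return g

def pg_samples(pi_target, pi_behavior, n):
    """Gradient samples drawn from pi_behavior, reweighted toward pi_target."""
    grads = []
    for _ in range(n):
        a = rng.choice(3, p=pi_behavior)
        r = true_means[a] + rng.normal()
        w = pi_target[a] / pi_behavior[a]          # importance weight
        grads.append(w * r * grad_log_pi(pi_target, a))
    return np.stack(grads)

pi_k   = np.array([0.2, 0.3, 0.5])                # current policy
pi_old = np.array([0.25, 0.35, 0.40])             # stale behavior policy

# Selection rule: reuse the old batch only if it is close to the current policy.
kl = float(np.sum(pi_k * np.log(pi_k / pi_old)))
on_policy = pg_samples(pi_k, pi_k, 40000)
reused = pg_samples(pi_k, pi_old, 40000) if kl < 0.1 else on_policy

print("on-policy mean gradient:", on_policy.mean(0))
print("reused-IS mean gradient:", reused.mean(0))  # agrees: IS keeps it unbiased
```

The reweighted off-policy estimate matches the on-policy one in expectation; the selection gate is what keeps the importance weights, and hence the variance, from blowing up when the behavior policy drifts too far.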

VRER achieves O(1/|U_k|) reductions in variance compared to on-policy estimators, modulo a selection constant c, while careful buffer and selection management bounds the bias from policy drift and Markov mixing (Zheng et al., 5 Feb 2026). The finite-time convergence theory explicitly quantifies buffer size, sample “age,” selection threshold, and their bias-variance trade-off.

4. Variance Reduction via Stochastic Optimization: SVRG, SARAH, STORM, and Loopless Protocols

A distinct line of work adapts stochastic variance-reduction schemes from optimization—namely SVRG, SARAH, and STORM—to the RL context (Xu et al., 2017, Yuan et al., 2020, Gargiani et al., 2022, Xu et al., 2019, Zhang et al., 2021). The essence is to construct a recursively updated auxiliary gradient estimator that blends a high-accuracy snapshot (“anchor”) with fast stochastic increments:

  • SVRG-style: Large-batch gradient computed at a “snapshot” parameter \tilde\theta; in the inner loop, small-batch updates correct the current estimate by adding the difference between gradients at \theta and \tilde\theta, with importance sampling used for off-policy correction (Xu et al., 2017, Xu et al., 2019).
  • SARAH/STORM: Recursive momentum-style updates, where the gradient estimator at each iteration is an exponential moving average of the previous estimator and the current stochastic estimate (potentially with Hessian-vector corrections), obviating the need for periodic “restarts” (Yuan et al., 2020, Salehkaleybar et al., 2022).
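The cancellation mechanism behind SVRG-style estimators can be seen on a deliberately simple stochastic quadratic, an illustrative stand-in for the RL objective that omits the importance-sampling correction: evaluating the same sample at both the current iterate and the snapshot makes the noise in the two terms cancel:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stochastic objective: f(theta) = E_x[0.5*(theta - x)^2], x ~ N(mu, 1).
# A one-sample gradient estimate is theta - x; the true gradient is theta - mu.
mu = 2.0

def stoch_grad(theta, x):
    return theta - x

snapshot = 1.9                      # anchor parameter theta_tilde
theta = 2.1                         # current iterate, close to the snapshot
anchor_grad = snapshot - mu         # "large batch" gradient at the snapshot (exact here)

xs = rng.normal(mu, 1.0, size=50000)
plain = stoch_grad(theta, xs)
svrg  = stoch_grad(theta, xs) - stoch_grad(snapshot, xs) + anchor_grad

print("plain var:", plain.var())    # about 1: the raw noise level
print("svrg var :", svrg.var())     # about 0: the shared noise cancels
```

Because the gradient is linear in the noise here, the cancellation is exact; in the RL setting the two terms are only correlated, so the variance shrinks rather than vanishes, and an importance weight is needed because the snapshot policy generated the data.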

Loopless methods such as PAGE-PG randomize between large-batch and small-batch updates via a probabilistic switch, maintaining unbiasedness and obtaining sharp O(\epsilon^{-3}) or better sample complexity (Gargiani et al., 2022).
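A schematic of the PAGE-style probabilistic switch, again on an assumed toy quadratic rather than a real policy-gradient objective (the refresh probability, step size, and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
mu = 2.0            # gradient of 0.5*(theta - x)^2 is theta - x, with x ~ N(mu, 1)
p_switch = 0.1      # probability of a large-batch refresh at each step

def large_batch_grad(theta, n=4096):
    """High-accuracy 'anchor' gradient estimate from a large batch."""
    return float(np.mean(theta - rng.normal(mu, 1.0, size=n)))

theta_prev, theta = 3.0, 3.0
v = large_batch_grad(theta)         # initial anchor estimate
lr = 0.1
for _ in range(200):
    theta_prev, theta = theta, theta - lr * v
    if rng.random() < p_switch:
        v = large_batch_grad(theta)                 # occasional full refresh
    else:
        x = rng.normal(mu, 1.0)                     # one fresh sample
        v = v + (theta - x) - (theta_prev - x)      # PAGE correction: noise cancels

print("theta after PAGE-style updates:", theta)     # approaches the optimum mu
```

No fixed epoch length is needed: the cheap recursive correction runs most of the time, and the probabilistic switch supplies just enough large-batch refreshes to keep the estimator anchored.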

Truncation-based protocols (TSIVR-PG) further address the critical bottleneck of uncontrolled importance-weight variance by incorporating trust-region-style parameter updates. This provides rigorous control over IS variances and enables global sample-complexity results without unverifiable assumptions (Zhang et al., 2021).

5. Theoretical Guarantees and Sample Complexity

A central metric for VRPG techniques is sample complexity: the number of environment interactions needed to achieve \epsilon-stationarity (i.e., \mathbb{E}\|\nabla J(\theta)\|^2 \leq \epsilon^2). The following table summarizes the leading results, where K is the number of updates, \epsilon the target accuracy, and N, B denote large and small batch sizes:

| Algorithm | Batch/epoch structure | Sample complexity | Notable properties |
| --- | --- | --- | --- |
| REINFORCE/GPOMDP | On-policy, SGD | O(\epsilon^{-4}) | Baseline only, no variance reduction |
| SVRPG | SVRG-epoch | O(\epsilon^{-10/3}) | Improved by tighter IS analysis (Xu et al., 2019) |
| TSIVR-PG | SVRG/truncation | O(\epsilon^{-3}) | IS variance controlled, global rates, nonlinear objectives (Zhang et al., 2021) |
| SRVRPG/PAGE-PG | Loopless, recursive | O(\epsilon^{-3}) | No epoch tuning; minimal storage (Gargiani et al., 2022) |
| STORM-PG | SARAH-momentum | O(\epsilon^{-3}) | Exponential averaging; single-iteration loop (Yuan et al., 2020) |
| SHARP | Hessian-aided, momentum | O(\epsilon^{-3}) | Checkpoint-free, IS-free (Salehkaleybar et al., 2022) |
| VRER (experience replay) | Buffer-based, offline | O(\epsilon^{-3}) (asymptotic) | Selection rule bounds bias-variance (Zheng et al., 5 Feb 2026) |
| MO-TSIVR-PG (multi-objective, nonlinear) | SVRG, nonlinear f | O(M^2 \epsilon^{-2}) | M objectives, improves M-dependence (Guidobene et al., 14 Aug 2025) |

Variance-reduced approaches can dominate standard policy gradients by up to one or two orders of magnitude in sample complexity, with O(\epsilon^{-2}) rates feasible under stringent conditions such as global concavity and overparameterization (Zhang et al., 2021).

6. Empirical Analysis and Practical Recommendations

Empirically, VRPG techniques consistently outperform baselines (vanilla policy gradient, A2C, on-policy PPO/TRPO) across benchmarks. Key findings include:

  • Experience replay with variance-based selection accelerates convergence and reduces policy and gradient-variance in both simple and high-dimensional tasks: PPO-VRER converges up to 50% faster and achieves higher asymptotic rewards across environments (e.g., CartPole, Hopper, Inverted Pendulum) (Zheng et al., 5 Feb 2026, Zheng et al., 2021).
  • Empirical variance minimization of the control variate (EV-based approaches) can achieve up to 10^2–10^3× reduction in gradient variance relative to standard A2C, yielding stable rewards and compressed learning curves (Kaledin et al., 2022).
  • Stochastic recursive, loopless, or Hessian-aided VRPG (e.g., STORM-PG, SHARP) demonstrate superior sample efficiency and stability, eliminating the need for large checkpoint batches or IS, with stability improvements verified over many random seeds (Yuan et al., 2020, Salehkaleybar et al., 2022, Gargiani et al., 2022).
  • Action-dependent and vector-structured baselines substantially reduce variance and increase sample efficiency, especially in continuous control and high-dimensional action spaces (Liu et al., 2017, Zhong et al., 2021, Wu et al., 2018).
  • Buffer size and selection thresholds are critical: Small buffers limit reuse; large ones introduce bias unless controlled via variance/KL-based criteria. Empirically, selection constants c ≈ 1.02–1.06 and buffer sizes of roughly 300–500 optimize the bias-variance trade-off (Zheng et al., 5 Feb 2026).
  • Adaptive shrinkage baselines (e.g., James–Stein) further improve training stability in large-scale RL from human feedback (RLHF) and LLM fine-tuning, with measurable reductions in gradient variance and improved final task performance (Zeng et al., 5 Nov 2025).

7. Advanced Directions and Open Problems

Recent advances extend VRPG frameworks to:

  • Average-reward infinite-horizon MDPs: Implicit Gradient Transport and Hessian-based algorithms now achieve order-optimal regret bounds, \tilde{O}(\sqrt{T}), closing the gap to theoretical lower bounds for model-free RL in this setting (Ganesh et al., 2024).
  • Multi-objective RL (MORL): Variance reduction with control variates admits sample complexity that scales only quadratically in M, the number of objectives, independent of state/action space dimension (Guidobene et al., 14 Aug 2025).
  • Generic policy parameterizations and global optimality: Combining VRPG with natural policy gradients and function-approximation theory yields global convergence guarantees (modulo function-approximation error) and pushes practical algorithms closer to minimax-optimal sample efficiency (Liu et al., 2022).

Notable open questions include tightening the dependence on the horizon H and the effective horizon 1/(1-\gamma), generalizing beyond strong IS-variance assumptions, and extending provably optimal VRPG schemes to partially observed or multi-agent domains.


References (18)
