Variance Reduced Policy Gradient
- Variance Reduced Policy Gradient methods are a set of reinforcement learning techniques that systematically lower gradient variance while preserving unbiasedness and convergence.
- They integrate methods such as control variates, off-policy experience replay, and stochastic optimization to substantially enhance sample efficiency and stability.
- Empirical studies show these methods can reduce gradient variance by factors of $10^2$–$10^3$, leading to faster learning and more reliable long-horizon performance.
Variance reduced policy gradient (VRPG) methods comprise a family of reinforcement learning (RL) algorithms that address the statistical inefficiency of policy gradient estimators by systematically controlling or minimizing the variance of the gradient estimate without sacrificing unbiasedness or convergence to optimality. High variance in standard stochastic policy gradients manifests as slow learning, poor sample efficiency, and instability. The last decade has seen intensive development of variance-reduction techniques tailored to RL, spanning analytically structured control variates, off-policy experience replay, stochastic-optimization adaptations (SVRG, SARAH, STORM), action-dependent and trajectory-level baselines, and advanced replay-buffer selection and weighting frameworks. These innovations have delivered quantifiable reductions in sample complexity and improved stability, making VRPG a central paradigm underlying modern scalable policy optimization.
1. High Variance in Policy Gradients: Core Problem Statement
Standard policy gradient methods, such as REINFORCE and actor-critic, estimate the gradient of the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_{t=0}^{T-1} \gamma^t r_t\big]$ via the Monte Carlo estimator
$$\hat{\nabla}_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, \hat{G}_t^{(i)},$$
where $\hat{G}_t^{(i)}$ is the empirical return-to-go (Kaledin et al., 2022). This estimator exhibits high variance because both $\hat{G}_t^{(i)}$ and the score function $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ are random. The excess variance compels the use of exceptionally small step sizes or large gradient-batch sizes, directly impeding learning efficiency. Variance-reduction techniques aim to retain unbiasedness while introducing structure (e.g., control variates, replay methods, recursive estimators) to systematically lower this variance.
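To make the estimator concrete, the following minimal NumPy sketch computes the single-trajectory REINFORCE estimate with discounted return-to-go; the per-step score vectors are assumed to be supplied by whatever policy implementation is in use (names here are illustrative).

```python
import numpy as np

def reinforce_gradient(grad_log_probs, rewards, gamma=0.99):
    """Monte Carlo policy-gradient estimate for a single trajectory.

    grad_log_probs: list of arrays, each grad_theta log pi(a_t | s_t)
    rewards:        list of scalar rewards r_t
    Returns sum_t grad log pi(a_t | s_t) * G_t, where G_t is the
    discounted return-to-go from step t (the REINFORCE estimator).
    """
    T = len(rewards)
    returns_to_go = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):              # G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns_to_go[t] = running
    return sum(g * G for g, G in zip(grad_log_probs, returns_to_go))
```

Averaging this quantity over $N$ trajectories gives the batch estimator above; its variance is what the techniques in the following sections target.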
2. Control Variates, Baselines, and the Limits of Classic Techniques
The use of control variates in RL traces back to the baseline trick: subtracting a baseline $b(s_t)$ from the return-to-go before multiplying by the policy score preserves unbiasedness while often meaningfully lowering variance (Kaledin et al., 2022, Zhong et al., 2021). In practice, baselines may be:
- State-dependent (classic critic): fit by least-squares (A2C-style),
- Action-dependent: constructed via Stein's identity or using the structure of factorized policies, leading to stronger variance reductions, especially for high-dimensional or continuous-action policies (Liu et al., 2017, Wu et al., 2018).
Coordinate-wise and layer-wise vector baselines further reduce variance by optimizing the baseline at each parameter coordinate or layer (Zhong et al., 2021). These vector-structured baselines empirically and theoretically dominate standard scalar ones in variance reduction, particularly for deep/large policies.
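As an illustration of the baseline idea and its coordinate-wise refinement, the sketch below estimates variance-minimizing baselines empirically from sampled score vectors and returns (the variable names and the small regularizer are illustrative assumptions, not taken from the cited papers).

```python
import numpy as np

def optimal_baselines(scores, returns):
    """Empirical variance-minimizing baselines for g = score * (G - b).

    scores:  (N, d) array, one score vector per sampled trajectory
    returns: (N,)   array, one return per sampled trajectory
    Returns (scalar_b, coordinate_b), where coordinate_b has shape (d,).
    """
    sq = scores ** 2                                            # elementwise s_j^2
    # b_j = E[s_j^2 G] / E[s_j^2] minimizes the variance of coordinate j.
    coordinate_b = (sq * returns[:, None]).mean(0) / (sq.mean(0) + 1e-8)
    # The best single scalar aggregates the same ratio over all coordinates.
    scalar_b = (sq * returns[:, None]).mean() / (sq.mean() + 1e-8)
    return scalar_b, coordinate_b
```

Because the expected score is zero, subtracting either baseline leaves the gradient estimator unbiased; the coordinate-wise version simply solves the variance-minimization problem separately per parameter.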
Recent work also exploits trajectory-wise control variates: by recursively expanding the law of total variance along the full trajectory, these methods eliminate not only the immediate (state-action) variance but also the variance arising from future-trajectory realizations. Theoretically, trajectory-wise CVs provably achieve the minimum residual variance among all admissible CVs that depend on current policy noise, and empirically speed up long-horizon RL (Cheng et al., 2019).
3. Off-Policy Sample Reuse: Experience Replay with Variance Reduction
Off-policy methods seek to further improve sample efficiency by reusing experience not just from the current policy but also from past (behavior) policies. Variance Reduction Experience Replay (VRER) frameworks—both for full trajectories and per-step (partial trajectory) reuse—introduce sophisticated weighting, selection, and mixture strategies (Zheng et al., 5 Feb 2026, Zheng et al., 2022, Zheng et al., 2021). The key elements are:
- Importance sampling correction: Each reused sample is weighted by a likelihood ratio to ensure unbiasedness.
- Mixture likelihood ratio and selective reuse: Instead of uniform replay, samples are weighted/admitted based on their estimated relevance (via variance, KL-divergence, or a variance upper-bound selection rule). This guarantees that only samples sufficiently similar to the current policy are reused, avoiding catastrophic variance inflation (Zheng et al., 2021, Zheng et al., 2022).
VRER achieves provable variance reductions relative to purely on-policy estimators, up to a selection constant $c$, while careful buffer and selection management bounds the bias from policy drift and Markov mixing (Zheng et al., 5 Feb 2026). The finite-time convergence theory explicitly quantifies buffer size, sample “age,” selection threshold, and their bias-variance trade-off.
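The core mechanics can be sketched as follows; this is a simplified illustration of likelihood-ratio reweighting with a variance-based admission rule, not the exact VRER algorithm of the cited papers, and the constant `c` and data layout are assumptions.

```python
import numpy as np

def select_and_reweight(on_policy_grads, replay_batches, c=1.05):
    """Selective experience replay with importance-sampling correction.

    on_policy_grads: (N, d) per-trajectory gradient terms from the current policy
    replay_batches:  list of dicts with
        'grads'     (M, d): grad log pi_current(tau) * R(tau) evaluated on
                            trajectories collected under an old behavior policy
        'log_ratio' (M,):   log pi_current(tau) - log pi_behavior(tau)
    c: selection constant; a batch is reused only if its reweighted gradients
       are not much noisier than the on-policy ones.
    """
    on_var = on_policy_grads.var(axis=0).sum()          # total on-policy variance
    kept = [on_policy_grads]
    for batch in replay_batches:
        w = np.exp(batch['log_ratio'])[:, None]         # importance weights
        reweighted = w * batch['grads']                 # unbiased off-policy terms
        if reweighted.var(axis=0).sum() <= c * on_var:  # variance-based admission
            kept.append(reweighted)
        # else: skip the batch; reusing it would inflate the variance
    return np.vstack(kept).mean(axis=0)
```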
4. Variance Reduction via Stochastic Optimization: SVRG, SARAH, STORM, and Loopless Protocols
A distinct line of work adapts stochastic variance-reduction schemes from optimization—namely SVRG, SARAH, and STORM—to the RL context (Xu et al., 2017, Yuan et al., 2020, Gargiani et al., 2022, Xu et al., 2019, Zhang et al., 2021). The essence is to construct a recursively updated auxiliary gradient estimator that blends a high-accuracy snapshot (“anchor”) with fast stochastic increments:
- SVRG-style: A large-batch gradient is computed at a “snapshot” parameter $\tilde{\theta}$; in the inner loop, small-batch updates correct the current estimate by adding the difference between gradients at the current iterate $\theta_t$ and at $\tilde{\theta}$, with importance sampling used for off-policy correction (Xu et al., 2017, Xu et al., 2019).
- SARAH/STORM: Recursive momentum-style updates, where the gradient estimator at each iteration is an exponential moving average of the previous estimator and the current stochastic estimate (potentially with Hessian-vector corrections), obviating the need for periodic “restarts” (Yuan et al., 2020, Salehkaleybar et al., 2022).
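The recursive estimators share a single update rule; the sketch below gives a generic STORM/SARAH-style step, with the importance-sampling (or Hessian-based) correction of the old-parameter gradient assumed to have been applied by the caller.

```python
def recursive_vr_update(grad_new, grad_old_corrected, prev_estimate, momentum=0.9):
    """One STORM/SARAH-style recursive variance-reduced gradient step.

    grad_new:           stochastic gradient at the current parameters
    grad_old_corrected: stochastic gradient at the previous parameters on the
                        same trajectories (importance-sampling corrected in RL)
    prev_estimate:      previous variance-reduced estimator d_{t-1}
    momentum:           mixing coefficient (1 - a in STORM notation)

    Returns d_t = grad_new + momentum * (prev_estimate - grad_old_corrected).
    momentum = 0 recovers plain SGD; momentum = 1 is the SARAH recursion.
    """
    return grad_new + momentum * (prev_estimate - grad_old_corrected)
```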
Loopless methods such as PAGE-PG randomize between large-batch and small-batch updates via a probabilistic switch, maintaining unbiasedness and obtaining sharp $\mathcal{O}(\epsilon^{-3})$ or better sample complexity (Gargiani et al., 2022).
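The following sketch shows the loopless switch itself, under the assumption that the two gradient oracles (full refresh and small-batch difference, with any required importance-sampling correction) are provided by the caller; the probability `p` and helper names are illustrative.

```python
import numpy as np

def page_pg_estimator(theta, large_batch_grad, small_batch_diff,
                      prev_estimate, p=0.2, rng=None):
    """PAGE-style probabilistic switch between a full refresh and a cheap update.

    large_batch_grad(theta): gradient from a large fresh batch (the "anchor")
    small_batch_diff(theta): small-batch gradient difference between the current
                             and previous parameters on a shared batch
    """
    rng = rng or np.random.default_rng()
    if rng.random() < p:
        return large_batch_grad(theta)              # full refresh
    return prev_estimate + small_batch_diff(theta)  # recursive correction
```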
Truncation-based protocols (TSIVR-PG) further address the critical bottleneck of uncontrolled importance-weight variance by incorporating trust-region-style parameter updates. This provides rigorous control over IS variances and enables global sample-complexity results without unverifiable assumptions (Zhang et al., 2021).
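One simple way to realize the truncation idea is to clip the per-trajectory importance weights to a bounded interval, as sketched below; the interval form and clipping value are illustrative assumptions rather than the exact rule of TSIVR-PG.

```python
import numpy as np

def truncated_is_weights(log_ratio, clip=0.2):
    """Clip per-trajectory importance weights into [1 - clip, 1 + clip].

    log_ratio: log pi_theta(tau) - log pi_theta_old(tau) per trajectory.
    Clipping bounds the variance contributed by the weights at the cost of a
    controlled bias; pairing it with trust-region-style parameter updates
    keeps successive policies close enough that the truncation rarely activates.
    """
    return np.clip(np.exp(log_ratio), 1.0 - clip, 1.0 + clip)
```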
5. Theoretical Guarantees and Sample Complexity
A central metric for VRPG techniques is sample complexity: the number of environment interactions needed to reach an $\epsilon$-stationary point (i.e., $\mathbb{E}\,\|\nabla_\theta J(\theta)\| \le \epsilon$). The following table summarizes the leading results, where $\epsilon$ is the target accuracy; epoch lengths and large/small batch sizes are tuned per algorithm:
| Algorithm | Batch/epoch structure | Sample complexity | Notable properties |
|---|---|---|---|
| REINFORCE/GPOMDP | On-policy, SGD | $\mathcal{O}(\epsilon^{-4})$ | Baseline only, no variance reduction |
| SVRPG | SVRG-epoch | $\mathcal{O}(\epsilon^{-10/3})$ | Improved by tighter IS analysis (Xu et al., 2019) |
| TSIVR-PG | SVRG/truncation | $\tilde{\mathcal{O}}(\epsilon^{-3})$ | IS variance controlled, global rates, nonlinear objectives (Zhang et al., 2021) |
| SRVRPG/PAGE-PG | Loopless, recursive | $\mathcal{O}(\epsilon^{-3})$ | No epoch tuning; minimal storage (Gargiani et al., 2022) |
| STORM-PG | SARAH-momentum | $\tilde{\mathcal{O}}(\epsilon^{-3})$ | Exponential averaging; single loop (Yuan et al., 2020) |
| SHARP | Hessian-aided, momentum | $\mathcal{O}(\epsilon^{-3})$ | Checkpoint-free, IS-free (Salehkaleybar et al., 2022) |
| VRER (experience replay) | Buffer-based, off-policy reuse | Asymptotic variance reduction (see text) | Selection rule bounds bias-variance trade-off (Zheng et al., 5 Feb 2026) |
| MO-TSIVR-PG (multi-objective) | SVRG, nonlinear objectives | — | Handles multiple objectives; improves $\epsilon$-dependence (Guidobene et al., 14 Aug 2025) |
Variance-reduced approaches can dominate standard policy gradients by up to one or two orders of magnitude in sample complexity, with even faster $\tilde{\mathcal{O}}(\epsilon^{-2})$ global rates feasible under stringent conditions such as global concavity and overparameterization (Zhang et al., 2021).
6. Empirical Analysis and Practical Recommendations
Empirically, VRPG techniques consistently outperform baselines (vanilla policy gradient, A2C, on-policy PPO/TRPO) across benchmarks. Key findings include:
- Experience replay with variance-based selection accelerates convergence and reduces policy-gradient variance in both simple and high-dimensional tasks: PPO-VRER converges up to 50% faster and achieves higher asymptotic rewards across environments (e.g., CartPole, Hopper, Inverted Pendulum) (Zheng et al., 5 Feb 2026, Zheng et al., 2021).
- Empirical variance minimization of the control variate (EV-based approaches) can achieve up to $10^2$–$10^3\times$ reductions in gradient variance relative to standard A2C, yielding stable rewards and compressed learning curves (Kaledin et al., 2022).
- Stochastic recursive, loopless, and Hessian-aided VRPG methods (e.g., STORM-PG, SHARP) demonstrate superior sample efficiency and stability, eliminating the need for large checkpoint batches or importance sampling, with stability improvements verified over many random seeds (Yuan et al., 2020, Salehkaleybar et al., 2022, Gargiani et al., 2022).
- Action-dependent and vector-structured baselines substantially reduce variance and increase sample efficiency, especially in continuous control and high-dimensional action spaces (Liu et al., 2017, Zhong et al., 2021, Wu et al., 2018).
- Buffer size and selection thresholds are critical: small buffers limit reuse, while large ones introduce bias unless controlled via variance- or KL-based criteria. Empirically, selection constants up to about $1.06$ and buffer sizes of 300–500 optimize the bias-variance trade-off (Zheng et al., 5 Feb 2026).
- Adaptive shrinkage baselines (e.g., James–Stein) further improve training stability in large-scale RL from human feedback (RLHF) and LLM fine-tuning, with measurable reductions in gradient variance and improved final task performance (Zeng et al., 5 Nov 2025).
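A minimal sketch of the shrinkage idea follows; it uses the classic James–Stein construction for per-prompt mean-reward baselines and is an illustrative assumption, not necessarily the exact estimator of the cited work.

```python
import numpy as np

def shrinkage_baselines(rewards_per_prompt):
    """James-Stein-style shrinkage of per-prompt baselines toward the grand mean.

    rewards_per_prompt: (P, K) array, K sampled completions for each of P prompts.
    The naive baseline is each prompt's own mean reward; shrinking these means
    toward the overall mean reduces the variance of the baseline when K is small.
    """
    P, K = rewards_per_prompt.shape
    prompt_means = rewards_per_prompt.mean(axis=1)                   # naive baselines
    grand_mean = prompt_means.mean()
    noise_var = rewards_per_prompt.var(axis=1, ddof=1).mean() / K    # variance of a mean
    spread = ((prompt_means - grand_mean) ** 2).sum()
    shrink = max(0.0, 1.0 - (P - 3) * noise_var / (spread + 1e-12))  # JS factor, clipped
    return grand_mean + shrink * (prompt_means - grand_mean)
```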
7. Advanced Directions and Open Problems
Recent advances extend VRPG frameworks to:
- Average-reward infinite-horizon MDPs: Implicit Gradient Transport and Hessian-based algorithms now achieve order-optimal $\tilde{\mathcal{O}}(\sqrt{T})$ regret bounds, closing the gap to theoretical lower bounds for model-free RL in this setting (Ganesh et al., 2024).
- Multi-objective RL (MORL): Variance reduction with control variates admits sample complexity that scales only quadratically in the number of objectives, independent of state/action space dimension (Guidobene et al., 14 Aug 2025).
- Generic policy parameterizations and global optimality: Combining VRPG with natural policy gradients and function-approximation theory yields global convergence guarantees (modulo function-approximation error) and pushes practical algorithms closer to minimax-optimal sample efficiency (Liu et al., 2022).
Notable open questions include tightening the dependence on the horizon $H$ and discount factor $\gamma$, generalizing beyond strong IS-variance assumptions, and extending provably optimal VRPG schemes to partially observed or multi-agent domains.
References
- (Zheng et al., 5 Feb 2026) Variance Reduction Based Experience Replay for Policy Optimization
- (Kaledin et al., 2022) Variance Reduction for Policy-Gradient Methods via Empirical Variance Minimization
- (Zheng et al., 2021) Variance Reduction based Experience Replay for Policy Optimization
- (Xu et al., 2017) Stochastic Variance Reduction for Policy Gradient Estimation
- (Salehkaleybar et al., 2022) Momentum-Based Policy Gradient with Second-Order Information
- (Yuan et al., 2020) Stochastic Recursive Momentum for Policy Gradient Methods
- (Gargiani et al., 2022) PAGE-PG: A Simple and Loopless Variance-Reduced Policy Gradient Method with Probabilistic Gradient Estimation
- (Zhang et al., 2021) On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method
- (Ganesh et al., 2024) Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs
- (Guidobene et al., 14 Aug 2025) Variance Reduced Policy Gradient Method for Multi-Objective Reinforcement Learning
- (Zeng et al., 5 Nov 2025) Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards
- (Zhong et al., 2021) Coordinate-wise Control Variates for Deep Policy Gradients
- (Liu et al., 2017) Action-dependent Control Variates for Policy Optimization via Stein's Identity
- (Wu et al., 2018) Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines
- (Liu et al., 2022) An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods
- (Cheng et al., 2019) Trajectory-wise Control Variates for Variance Reduction in Policy Gradient Methods