Sandwiched Policy Gradient (SPG)
- Sandwiched Policy Gradient (SPG) is a reinforcement learning framework that integrates multiple gradient objectives to mitigate bias and variance in policy updates.
- It sandwiches otherwise intractable objectives between tractable lower and upper bounds to stabilize learning in off-policy control, continuous action spaces, and diffusion models.
- Extensions such as Sinkhorn SPG and diffusion fine-tuning demonstrate its versatility in achieving improved convergence and performance in diverse RL applications.
Sandwiched Policy Gradient (SPG) denotes a class of reinforcement learning algorithms characterized by the combination or "sandwiching" of multiple gradient-based objectives within the policy update, typically aiming to correct value estimation bias, exploit multiple bounds (as in likelihood estimation), or stabilize policy optimization in challenging settings such as off-policy control, continuous action spaces, or diffusion models. Over the past decade, several research trajectories have converged on this concept under technically precise frameworks, unifying off-policy control, stochastic/deterministic policy gradients, entropy regularization, and modern RL for generative models.
1. Conceptual Foundations and Definitions
The SPG framework is motivated by the observation that standard policy gradient estimators often suffer from bias, high variance, or instability due to distribution shift, the intractability of true objective functions, or one-sided approximations. Canonical examples include off-policy control (Lehnert et al., 2015), where changing the policy alters the underlying data distribution, and diffusion LLMs (Wang et al., 10 Oct 2025), where the log-likelihood is not tractable for direct gradient computation.
Mathematically, SPG involves policy updates of the form
$$\theta_{t+1} \;=\; \theta_t + \alpha_t \sum_{i} \lambda_i\, g_i(\theta_t),$$
where each component $g_i$ may represent the gradient of a lower or upper bound, a correction for distribution drift, or a term enforcing entropy or bias constraints, and the weights $\lambda_i$ control how the components are combined.
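As a deliberately generic illustration, the sketch below combines several weighted gradient components into a single update; the component names, weights, and step size are hypothetical and not taken from any specific paper.

```python
import numpy as np

def sandwiched_update(theta, grad_components, weights, lr=1e-2):
    """Combine several gradient estimates (e.g. a lower-bound gradient, an
    upper-bound gradient, a drift correction) into one policy update."""
    combined = sum(w * g for w, g in zip(weights, grad_components))
    return theta + lr * combined

# Toy usage for a 3-parameter policy: two bound gradients plus a correction term.
theta = np.zeros(3)
g_lower = np.array([0.20, -0.10, 0.05])   # gradient of a tractable lower bound
g_upper = np.array([0.15, -0.12, 0.08])   # gradient of an upper bound (sign handled by its weight)
g_corr  = np.array([0.01,  0.00, -0.02])  # distribution-drift correction
theta = sandwiched_update(theta, [g_lower, g_upper, g_corr], weights=[1.0, 1.0, 0.5])
```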
Key motivations:
- Correction for nonstationarity arising when the policy under evaluation is contemporaneously being improved (Lehnert et al., 2015).
- Construction of tighter surrogate objectives by sandwiching intractable targets between tractable lower and upper bounds (Wang et al., 10 Oct 2025).
- Variance reduction through integration over action space or adaptive exploration (Ciosek et al., 2017, Ciosek et al., 2018).
2. SPG in Off-policy Control and Value Function Correction
The foundational work (Lehnert et al., 2015) presents the first policy gradient algorithm for off-policy control with function approximation, extending gradient TD methods (GTD/TDC/GQ) to settings where the target policy is itself changing. The crux lies in correctly differentiating the Mean Squared Projected Bellman Error (MSPBE) when both the stationary state–action distribution and the Bellman operator depend on the policy parameters.
The update for the value-function parameters takes the schematic form
$$\theta_{t+1} \;=\; \theta_t + \alpha_t\Big[\delta_t\,\phi_t \;-\; \gamma\,\big(\phi_t^{\top} w_t\big)\,\phi_{t+1} \;+\; c_t(\theta_t)\Big],$$
where $\delta_t$ is the TD error, $w_t$ is an auxiliary weight vector updated on a second timescale, and $c_t(\theta_t)$ collects the policy-gradient corrections that "sandwich" the TD error to ensure consistency with the evolving data distribution.
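A minimal sketch of such an update is given below, assuming linear value features; the gradient-TD portion follows the standard TDC/GQ form, while `grad_correction` is a placeholder for the paper's policy-gradient correction terms rather than the exact expression from Lehnert et al. (2015).

```python
import numpy as np

def pgq_style_update(theta, w, phi, phi_next, reward, gamma, alpha, beta, grad_correction):
    """One off-policy update with linear features: TDC/GQ-style terms plus an
    explicit correction (passed in as `grad_correction`) that accounts for the
    policy's effect on the data distribution."""
    delta = reward + gamma * phi_next @ theta - phi @ theta          # TD error
    theta = theta + alpha * (delta * phi - gamma * (phi @ w) * phi_next + grad_correction)
    w = w + beta * (delta - phi @ w) * phi                           # auxiliary weight vector
    return theta, w

# Toy usage with 4-dimensional features and a zero correction term.
theta, w = np.zeros(4), np.zeros(4)
phi, phi_next = np.eye(4)[0], np.eye(4)[1]
theta, w = pgq_style_update(theta, w, phi, phi_next, reward=1.0,
                            gamma=0.9, alpha=0.1, beta=0.05, grad_correction=np.zeros(4))
```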
Convergence theorem: With linear function approximation, suitable step-size schedules, and ergodicity assumptions, the method converges to a fixed point corresponding to a stably improved value function even as the policy is continuously adapted.
Empirical results: On the Baird counterexample, the proposed PGQ algorithm converges stably in terms of the MSPBE and outperforms classic Q-learning and prior TDC/GQ variants under both uniform and trajectory-based sampling regimes.
3. SPG in Expected Policy Gradients and Variance Reduction
SPG is closely related to the “expected policy gradients” (EPG) framework, which unifies stochastic and deterministic policy gradient methods by analytically integrating over the action space rather than relying solely on Monte Carlo samples (Ciosek et al., 2017, Ciosek et al., 2018). The key result is the general policy gradient theorem
$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s\sim\rho^{\pi}}\!\left[\int_{\mathcal{A}} \nabla_\theta\,\pi_\theta(a\mid s)\,\hat{Q}(s,a)\,\mathrm{d}a\right],$$
in which the derivative operator is "sandwiched" inside the state expectation.
This result subsumes both the classical stochastic policy gradient (sampling-based) and DPG (Dirac-delta policies) as special cases. Analytical integration (when feasible) or numerical quadrature reduces estimator variance compared to single-sample estimates, enabling larger learning rates and improved sample efficiency.
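To make the variance-reduction point concrete, the sketch below computes the gradient with respect to the mean of a one-dimensional Gaussian policy by quadrature over actions instead of a single Monte Carlo sample; the function and parameter names are illustrative.

```python
import numpy as np

def expected_policy_gradient_1d(mu, sigma, q_fn, n_points=401, width=6.0):
    """Gradient of J w.r.t. the Gaussian mean: integrate
    pi(a) * d(log pi)/d(mu) * Q(a) over actions with a simple quadrature rule."""
    a = np.linspace(mu - width * sigma, mu + width * sigma, n_points)
    da = a[1] - a[0]
    pi = np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    dlogpi_dmu = (a - mu) / sigma ** 2
    return np.sum(pi * dlogpi_dmu * q_fn(a)) * da

# Quadratic critic with maximum at a = 1: the expected gradient pushes mu upward.
grad_mu = expected_policy_gradient_1d(mu=0.0, sigma=0.5, q_fn=lambda a: -(a - 1.0) ** 2)
print(grad_mu)  # ~2.0, and deterministic for a fixed quadrature grid (no Monte Carlo noise)
```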
Exploration strategies: For Gaussian policies and (locally) quadratic critics, the optimal exploration covariance is derived from the matrix exponential of the critic's action Hessian,
$$\Sigma(s) \;\propto\; \exp\!\big(c\,H(s)\big), \qquad H(s) \;=\; \nabla_a \nabla_a^{\top}\,\hat{Q}(s,a)\Big|_{a=\mu_\theta(s)},$$
where $H(s)$ is the action Hessian of $\hat{Q}$ and $c$ is a scaling constant. This curvature-adaptive approach yields superior performance on MuJoCo control benchmarks compared to heuristic Ornstein–Uhlenbeck noise.
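A brief sketch of this covariance rule using scipy's matrix exponential; the scaling constants are illustrative assumptions rather than tuned values from the papers.

```python
import numpy as np
from scipy.linalg import expm

def exploration_covariance(action_hessian, c=1.0, sigma0=1.0):
    """Curvature-adaptive exploration covariance Sigma ~ sigma0^2 * exp(c * H),
    where H is the critic's Hessian w.r.t. the action at the policy mean."""
    return sigma0 ** 2 * expm(c * action_hessian)

# Critic that curves down sharply in the first action dimension, gently in the second.
H = np.array([[-4.0, 0.0],
              [ 0.0, -0.5]])
print(np.diag(exploration_covariance(H)))  # less exploration noise along the sharply curved direction
```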
4. SPG Extensions: Sinkhorn SPG and Sampling-based Actor Updates
Domain-specific instantiations of SPG include the Sinkhorn Policy Gradient for permutations (Emami et al., 2018), which relaxes the discrete space of permutation matrices to continuous doubly-stochastic matrices via the Sinkhorn operator. Gradients are backpropagated through the continuous representation, while rewards are computed over rounded hard permutations. An auxiliary critic penalty helps match values for discrete and relaxed actions, debiasing updates.
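A minimal sketch of this two-track computation is shown below: the Sinkhorn operator produces the differentiable relaxed matrix, while a simple greedy rounding (an assumed stand-in for the Hungarian assignment) produces the hard permutation used for the reward. The actor, critic, and auxiliary penalty are omitted.

```python
import numpy as np

def sinkhorn(log_scores, n_iters=50, temperature=1.0):
    """Map a real-valued score matrix to a (near) doubly-stochastic matrix by
    alternating row/column normalisation in log space (the Sinkhorn operator)."""
    log_alpha = log_scores / temperature
    for _ in range(n_iters):
        log_alpha -= np.log(np.sum(np.exp(log_alpha), axis=1, keepdims=True))  # normalise rows
        log_alpha -= np.log(np.sum(np.exp(log_alpha), axis=0, keepdims=True))  # normalise columns
    return np.exp(log_alpha)

def round_to_permutation(soft_perm):
    """Greedily round the relaxed matrix to a hard permutation (most confident rows first)."""
    n = soft_perm.shape[0]
    perm, cols_taken = np.full(n, -1), set()
    for i in np.argsort(-soft_perm.max(axis=1)):
        j = int(np.argmax([soft_perm[i, j] if j not in cols_taken else -np.inf
                           for j in range(n)]))
        perm[i] = j
        cols_taken.add(j)
    return perm

S = np.random.randn(4, 4)
P_soft = sinkhorn(S)                  # differentiable object used for the gradient
perm = round_to_permutation(P_soft)   # discrete permutation used for the reward
```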
In continuous domains, sampled policy gradient (SPG) variants sample multiple candidate actions per state and update the actor toward the action with the highest Q-value (Wiehe et al., 2018, Holubar et al., 2020), as sketched after this list. This strategy:
- Facilitates global search in Q-space, reducing the risk of local optima.
- Can be extended with action prioritization and weighted updates for improved stability (e.g., SPG-p).
- Experience replay (ER) enhances critic robustness and accelerates training; with ER, sampled SPG outperforms on-policy baselines such as PPO (Holubar et al., 2020).
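The following is a minimal sketch of the actor-target selection under assumed Gaussian perturbations around the current actor output; the candidate count and noise scale are illustrative, and the cited papers add prioritisation and weighted updates on top of this.

```python
import numpy as np

def sampled_policy_gradient_target(actor_action, q_fn, state, n_samples=16, noise_std=0.3):
    """Perturb the actor's action, score every candidate with the critic, and
    return the best candidate as the regression target for the actor."""
    candidates = actor_action + noise_std * np.random.randn(n_samples, actor_action.shape[0])
    candidates = np.vstack([actor_action, candidates])       # keep the current action as a candidate
    q_values = np.array([q_fn(state, a) for a in candidates])
    return candidates[np.argmax(q_values)]

# Toy critic with optimum at a = (1, -1): the target moves toward that point.
q = lambda s, a: -np.sum((a - np.array([1.0, -1.0])) ** 2)
target = sampled_policy_gradient_target(np.zeros(2), q, state=None)
```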
5. SPG for Diffusion LLMs and Likelihood Bounds
Recent advances adapt SPG to reinforcement learning fine-tuning of diffusion LLMs (Wang et al., 10 Oct 2025). Here the sequence log-likelihood is intractable, and the ELBO serves as a tractable lower bound; however, using that single bound both to reward good outputs (maximization) and to penalize undesirable completions (minimization) yields a one-sided, biased gradient that limits RL effectiveness.
SPG improves upon this by “sandwiching” the true log-likelihood between a tractable ELBO and a Rényi-based evidence upper bound (EUBO):
- For positively rewarded outputs, SPG maximizes ELBO.
- For negatively rewarded outputs, SPG minimizes EUBO (or a mixture of ELBO/EUBO).
- Block-wise masking reduces the variance of the bound estimates.
The policy optimization objective takes the schematic form
$$\mathcal{J}_{\mathrm{SPG}}(\theta) \;=\; \mathbb{E}_{x}\Big[\,\mathbb{1}[r(x)>0]\;\mathcal{L}_{\mathrm{ELBO}}(x;\theta)\;-\;\mathbb{1}[r(x)\le 0]\;\mathcal{L}_{\mathrm{EUBO}}(x;\theta)\Big],$$
so that positively rewarded completions are pushed up through the lower bound while negatively rewarded completions are pushed down through the upper bound.
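A hedged sketch of this reward-gated combination of the two bounds is given below; `elbo` and `eubo` are assumed to be per-sequence bound estimates computed elsewhere, and the exact weighting, baselines, and mixing used by Wang et al. (2025) are not reproduced.

```python
import torch

def sandwiched_surrogate_loss(elbo, eubo, rewards):
    """Minimise -(ELBO of positively rewarded samples) + (EUBO of negatively
    rewarded samples): raise the lower bound for good completions and push
    down the upper bound for bad ones."""
    pos = (rewards > 0).float()
    return (-(pos * elbo) + (1.0 - pos) * eubo).mean()

# Toy usage: three completions with per-sequence bound estimates and signed rewards.
elbo = torch.tensor([-12.0, -15.0, -9.0], requires_grad=True)
eubo = torch.tensor([-11.0, -13.5, -8.0], requires_grad=True)
loss = sandwiched_surrogate_loss(elbo, eubo, rewards=torch.tensor([1.0, -1.0, 1.0]))
loss.backward()
```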
Experimental results show that SPG achieves superior accuracy on GSM8K (+3.6%), MATH500 (+2.6%), Countdown (+18.4%), and Sudoku (+27.0%) compared to RL baselines.
6. Related Theoretical Developments: Bias, Entropy, and Momentum
SPG concepts have informed developments in entropy-regularized RL (Liu et al., 2019), where the policy gradient takes the soft form
$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s,a}\Big[\nabla_\theta \log \pi_\theta(a\mid s)\,\big(\hat{Q}(s,a) - \alpha \log \pi_\theta(a\mid s)\big)\Big],$$
incorporating entropy directly into the gradient update for stabilized exploration, improved representation capacity (via local action variance), and enhanced scalability.
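As a generic illustration of this estimator (not the exact form used by Liu et al., 2019), the sketch below computes an entropy-regularized REINFORCE gradient for a softmax policy over discrete actions.

```python
import numpy as np

def entropy_regularized_pg(logits, actions, q_values, alpha=0.1):
    """Batch gradient w.r.t. the logits: grad log pi(a|s) * (Q(s,a) - alpha * log pi(a|s))."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    n, k = logits.shape
    onehot = np.eye(k)[actions]
    logpi = np.log(probs[np.arange(n), actions])
    weights = q_values - alpha * logpi                 # soft, entropy-adjusted weight
    grad_logits = (onehot - probs) * weights[:, None]  # d log pi / d logits, weighted
    return grad_logits.mean(axis=0)

# Toy usage: two states, three actions.
logits = np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]])
g = entropy_regularized_pg(logits, actions=np.array([0, 2]), q_values=np.array([1.0, -0.5]))
```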
Other extensions include bias–gain optimization: methods seek to optimize not only long-run average reward (gain) but also the bias for superior transient performance, employing "sandwiched" objectives and logarithmic barrier functions to maintain gain-optimality while maximizing bias (Dewanto et al., 2021).
Momentum-based acceleration for SPG has also emerged: SPG-NM integrates a negative momentum term, keeping at each step the better of the plain gradient update and the momentum-adjusted update, which leads to faster convergence and improved robustness across bandit and MDP tasks (Zhang et al., 8 May 2024).
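A heavily hedged sketch of this "better of two candidates" rule follows; the step size, the momentum coefficient, and the use of an objective estimate to pick the candidate are assumptions, not the exact SPG-NM procedure.

```python
import numpy as np

def negative_momentum_step(theta, theta_prev, grad, objective, lr=0.05, beta=0.2):
    """Compare a plain gradient-ascent step with a negative-momentum-adjusted
    step and keep whichever scores higher under an (estimated) objective."""
    plain = theta + lr * grad
    adjusted = plain - beta * (theta - theta_prev)   # subtract a fraction of the last displacement
    return max((plain, adjusted), key=objective)

# Toy usage: maximise J(theta) = -||theta - 1||^2 starting near the origin.
J = lambda th: -np.sum((th - 1.0) ** 2)
theta_prev, theta = np.zeros(2), np.array([0.1, 0.1])
grad = -2.0 * (theta - 1.0)
theta_prev, theta = theta, negative_momentum_step(theta, theta_prev, grad, objective=J)
```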
7. SPG, Importance Correction, and Bias–Variance Tradeoff
SPG also connects to the bias–variance tradeoff in Monte Carlo policy gradient updates. For example, in bandit and online learning settings, SPG minimizes variance by nullifying importance corrections for sampled actions, trading unbiasedness for stability (Morrill et al., 2022, Tosatto et al., 2022). Extensions such as NeuRD-CIX interpolate between high-variance unbiased updates and low-variance biased SPG updates via capped implicit exploration.
Regret bounds: NeuRD-CIX achieves sublinear regret with high probability in sequential decision settings, demonstrating the utility of “sandwiched” importance correction for robust learning in non-stationary environments.
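A minimal sketch of the kind of capped importance weight this interpolation relies on; the cap parameter and its limiting behaviour describe the general recipe under stated assumptions, not the precise NeuRD-CIX estimator.

```python
def capped_importance_weight(prob_sampled, cap):
    """Clip the unbiased correction 1/pi(a) at `cap`: cap -> infinity recovers the
    unbiased importance-sampled update, while cap = 1 removes the correction
    entirely (the low-variance, biased SPG-style update)."""
    return min(1.0 / prob_sampled, cap)

# Interpolating between the two regimes for an action sampled with probability 0.05.
print(capped_importance_weight(0.05, cap=float("inf")))  # 20.0  (unbiased)
print(capped_importance_weight(0.05, cap=5.0))           # 5.0   (variance-capped)
print(capped_importance_weight(0.05, cap=1.0))           # 1.0   (no correction)
```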
Table: SPG Method Variants and Primary Mechanism
| Variant | Domain/Application | Main Mechanism |
|---|---|---|
| PGQ SPG (Lehnert et al., 2015) | Off-policy control | TD update “sandwiched” with policy-gradient drift corrections |
| EPG (Ciosek et al., 2017, 2018) | Stochastic/deterministic PG | Analytical integration over actions, variance reduction |
| Sinkhorn SPG (Emami et al., 2018) | Permutation learning | Continuous relaxation via the Sinkhorn operator, actor-critic “bypass” |
| Sampled SPG (Wiehe et al., 2018, Holubar et al., 2020) | Continuous-action RL | Actor update by action sampling and Q-value maximization |
| Diffusion SPG (Wang et al., 10 Oct 2025) | Diffusion LLMs (dLLMs) | ELBO/EUBO bounds “sandwiching” the intractable log-likelihood |
| Entropic SPG (Liu et al., 2019) | Maximum-entropy RL | Policy gradient with entropy regularization |
| Momentum SPG (Zhang et al., 8 May 2024) | Accelerated RL | Negative-momentum sequence in gradient ascent |
| NeuRD-CIX (Morrill et al., 2022) | Bandit/sequential learning | Capped importance weighting for bias–variance tuning |
References
- Policy Gradient Methods for Off-policy Control (Lehnert et al., 2015)
- Expected Policy Gradients (Ciosek et al., 2017)
- Expected Policy Gradients for Reinforcement Learning (Ciosek et al., 2018)
- Learning Permutations with Sinkhorn Policy Gradient (Emami et al., 2018)
- Sampled Policy Gradient for Learning to Play the Game Agar.io (Wiehe et al., 2018)
- Policy Optimization Reinforcement Learning with Entropy Regularization (Liu et al., 2019)
- Continuous-action Reinforcement Learning for Playing Racing Games: Comparing SPG to PPO (Holubar et al., 2020)
- A nearly Blackwell-optimal policy gradient method (Dewanto et al., 2021)
- A Temporal-Difference Approach to Policy Gradient Estimation (Tosatto et al., 2022)
- Interpolating Between Softmax Policy Gradient and Neural Replicator Dynamics with Capped Implicit Exploration (Morrill et al., 2022)
- Model-free Reinforcement Learning of Semantic Communication by Stochastic Policy Gradient (Beck et al., 2023)
- Fast Stochastic Policy Gradient: Negative Momentum for Reinforcement Learning (Zhang et al., 8 May 2024)
- SPG: Sandwiched Policy Gradient for Masked Diffusion LLMs (Wang et al., 10 Oct 2025)
Summary
Sandwiched Policy Gradient encompasses a heterogeneous set of techniques unified by their multi-component gradient estimators, which combine and correct value and policy gradients using bounding functions, adaptive corrections, or structured relaxations to address intractability, bias, or instability. Empirical and theoretical evidence consistently shows that the “sandwiching” mechanism yields superior convergence and alignment performance in challenging RL and generative modeling contexts. The framework continues to evolve, integrating novel estimation, exploration, and acceleration schemes, thereby serving as a foundational paradigm in modern reinforcement learning.