Sandwiched Policy Gradient (SPG)

Updated 13 October 2025
  • Sandwiched Policy Gradient (SPG) is a reinforcement learning framework that integrates multiple gradient objectives to mitigate bias and variance in policy updates.
  • Depending on the setting, it sandwiches intractable objectives between tractable lower and upper bounds, or combines TD updates with drift-correction terms, to stabilize learning in off-policy control, continuous action spaces, and diffusion models.
  • Extensions such as Sinkhorn SPG and diffusion fine-tuning demonstrate its versatility in achieving improved convergence and performance in diverse RL applications.

Sandwiched Policy Gradient (SPG) denotes a class of reinforcement learning algorithms characterized by the combination or "sandwiching" of multiple gradient-based objectives within the policy update, typically aiming to correct value estimation bias, exploit multiple bounds (as in likelihood estimation), or stabilize policy optimization in challenging settings such as off-policy control, continuous action spaces, or diffusion models. Over the past decade, several research trajectories have converged on this concept under technically precise frameworks, unifying off-policy control, stochastic/deterministic policy gradients, entropy regularization, and modern RL for generative models.

1. Conceptual Foundations and Definitions

The SPG framework is motivated by the observation that standard policy gradient estimators often suffer from bias, high variance, or instability due to distribution shift, the intractability of true objective functions, or one-sided approximations. Canonical examples include off-policy control (Lehnert et al., 2015), where changing the policy alters the underlying data distribution, and diffusion LLMs (Wang et al., 10 Oct 2025), where the log-likelihood is not tractable for direct gradient computation.

Mathematically, SPG involves policy updates of the form:

$$\Delta\theta = \alpha_1 \, G_1(\theta) + \alpha_2 \, G_2(\theta) + \cdots$$

where each $G_i$ may represent gradients of lower or upper bounds, corrections for distribution drift, or components related to entropy or bias constraints.
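
As a concrete illustration of this template, the following minimal sketch (an assumption-laden toy combination, not any specific published algorithm) mixes several gradient components with fixed coefficients:

```python
import numpy as np

def sandwiched_update(theta, grad_components, weights, lr=1e-2):
    """Generic SPG-style step: a weighted combination of gradient components G_i(theta).

    grad_components: list of callables, each returning a gradient-like vector for theta.
    weights: the mixing coefficients alpha_i from the equation above.
    """
    delta = sum(a * g(theta) for a, g in zip(weights, grad_components))
    return theta + lr * delta
```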

Key motivations:

  • Correcting the bias introduced by optimizing one-sided approximations, such as maximizing only a lower bound on an intractable objective.
  • Reducing the variance of Monte Carlo policy gradient estimates.
  • Stabilizing updates under the distribution shift induced by a changing policy in off-policy settings.

2. SPG in Off-policy Control and Value Function Correction

The foundational work (Lehnert et al., 2015) presents the first policy gradient algorithm for off-policy control with function approximation, extending gradient TD methods (GTD/TDC/GQ) to settings with a changing policy $\pi_\theta$. The crux lies in correctly differentiating the Mean Squared Projected Bellman Error (MSPBE) when both the stationary state–action distribution $d_{s,a} = d_s \pi_\theta(a|s)$ and the Bellman operator $T_\theta$ depend on the policy parameters.

The update for $\theta$ involves standard TD terms and explicit corrections:

$$\theta \leftarrow \theta + \alpha \left[ \delta \phi - \gamma \phi' (\phi^\top w) - \frac{\nabla_\theta \pi_\theta}{\pi_\theta}\, \delta\, (\phi^\top w) - \gamma \frac{\nabla_\theta \pi'_\theta}{\pi'_\theta} (\theta^\top \phi')(\phi^\top w) + \tfrac{1}{2} (w^\top \phi)^2 \frac{\nabla_\theta \pi_\theta}{\pi_\theta} \right]$$

where $\delta$ is the TD error, $w$ is an auxiliary weight vector, and the policy-gradient corrections "sandwich" the TD error to ensure consistency with the evolving data distribution.
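
A minimal numpy sketch of one such update is given below. It assumes linear Q-features $\phi$, a softmax (Boltzmann) policy over $\theta^\top \phi(s,\cdot)$, and a standard GTD-style rule for the auxiliary weights $w$; the $w$ rule and policy parameterization are assumptions, with only the $\theta$ update taken from the equation above:

```python
import numpy as np

def softmax_policy(theta, phi_all):
    # Boltzmann policy over linear Q-values: pi(a|s) proportional to exp(theta . phi(s, a))
    q = phi_all @ theta
    p = np.exp(q - q.max())
    return p / p.sum()

def grad_log_pi(theta, phi_all, a):
    # For the softmax policy: grad_theta log pi(a|s) = phi(s, a) - E_pi[phi(s, .)]
    pi = softmax_policy(theta, phi_all)
    return phi_all[a] - pi @ phi_all

def pgq_update(theta, w, r, phi, phi_next, phi_all, phi_all_next, a, a_next,
               gamma=0.99, alpha=1e-2, beta=1e-1):
    """One PGQ-style 'sandwiched' update with linear features (sketch of the equation above)."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi       # TD error
    corr = grad_log_pi(theta, phi_all, a)                      # (grad pi / pi) at (s, a)
    corr_next = grad_log_pi(theta, phi_all_next, a_next)       # (grad pi' / pi') at (s', a')
    phi_w = phi @ w
    theta = theta + alpha * (
        delta * phi
        - gamma * phi_next * phi_w                             # gradient-correction (TDC/GQ) term
        - corr * delta * phi_w                                 # drift correction for the changing policy
        - gamma * corr_next * (theta @ phi_next) * phi_w
        + 0.5 * (w @ phi) ** 2 * corr
    )
    w = w + beta * (delta - phi @ w) * phi                     # auxiliary weights (assumed GTD-style rule)
    return theta, w
```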

Convergence theorem: With linear function approximation, suitable step-size schedules, and ergodicity assumptions, the method converges to a fixed point corresponding to a stably improved value function even as the policy is continuously adapted.

Empirical results: Tested on the Baird counterexample, the proposed PGQ algorithm achieves stable convergence (MSPBE $\rightarrow 0$) and outperforms classic Q-learning and prior TDC/GQ under both uniform and trajectory-based sampling regimes.

3. SPG in Expected Policy Gradients and Variance Reduction

SPG is closely related to the “expected policy gradients” (EPG) framework, which unifies stochastic and deterministic policy gradient methods by integrating analytically over the action space rather than relying solely on Monte Carlo samples (Ciosek et al., 2017, Ciosek et al., 2018). The key result is the general policy gradient theorem:

$$\nabla_\theta J = \int_s \rho(s) \left[ \nabla V(s) - \int_a \pi(a|s)\, \nabla Q(a,s)\, da \right] ds$$

where the derivative operator is "sandwiched" inside the expectation.

This result subsumes both the classical stochastic policy gradient (sampling-based) and DPG (Dirac-delta policies) as special cases. Analytical integration (when feasible) or numerical quadrature reduces estimator variance compared to single-sample estimates, enabling larger learning rates and improved sample efficiency.
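
For instance, with a one-dimensional Gaussian policy the inner expectation over actions can be approximated by Gauss–Hermite quadrature instead of a single sampled action. The sketch below assumes hypothetical callables `grad_log_pi` and `q_fn` for a fixed state and illustrates only the variance-reduction idea, not the exact EPG estimator:

```python
import numpy as np

def quadrature_policy_gradient_1d(grad_log_pi, q_fn, mu, sigma, n_points=8):
    """Approximate E_{a ~ N(mu, sigma^2)}[ grad_log_pi(a) * Q(a) ] by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)  # probabilists' Hermite rule
    estimate = sum(w * grad_log_pi(mu + sigma * x) * q_fn(mu + sigma * x)
                   for x, w in zip(nodes, weights))
    return estimate / np.sqrt(2.0 * np.pi)                         # normalize the Gaussian weight
```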

Exploration strategies: For Gaussian policies and quadratic critics, optimal exploration noise is derived from the matrix exponential of the critic’s Hessian, yielding

$$\Sigma_s \propto e^{H(s)}$$

where $H(s)$ is the Hessian of $Q$ with respect to the action. This curvature-adaptive approach yields superior performance on MuJoCo control benchmarks compared to heuristic Ornstein–Uhlenbeck noise.
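
A minimal sketch of this covariance rule, assuming the critic's action Hessian at a state is available as a square matrix (the scale factor is an assumption):

```python
import numpy as np
from scipy.linalg import expm

def epg_exploration_covariance(hessian_q, scale=1.0):
    """Curvature-adaptive exploration covariance, Sigma_s proportional to exp(H(s))."""
    H = 0.5 * (hessian_q + hessian_q.T)   # symmetrize the action Hessian of Q
    return scale * expm(H)                # matrix exponential is symmetric positive definite here
```

Directions in which the critic is sharply concave thus receive little noise, while flatter or convex directions are explored more broadly.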

4. SPG Extensions: Sinkhorn SPG and Sampling-based Actor Updates

Domain-specific instantiations of SPG include the Sinkhorn Policy Gradient for permutations (Emami et al., 2018), which relaxes the discrete space of permutation matrices to continuous doubly-stochastic matrices via the Sinkhorn operator. Gradients are backpropagated through the continuous representation, while rewards are computed over rounded hard permutations. An auxiliary critic penalty helps match values for discrete and relaxed actions, debiasing updates.
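
The core relaxation can be sketched as follows; this is a simplified illustration of Sinkhorn normalization plus a Hungarian-style rounding step, with the temperature and iteration count as assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sinkhorn(logits, tau=1.0, n_iters=20):
    """Relax raw scores to an (approximately) doubly-stochastic matrix via Sinkhorn normalization."""
    S = np.exp(logits / tau)
    for _ in range(n_iters):
        S = S / S.sum(axis=1, keepdims=True)   # normalize rows
        S = S / S.sum(axis=0, keepdims=True)   # normalize columns
    return S

def round_to_permutation(S):
    """Round the soft matrix to a hard permutation; the reward is computed on this hard action."""
    rows, cols = linear_sum_assignment(-S)     # maximize the total soft assignment
    P = np.zeros_like(S)
    P[rows, cols] = 1.0
    return P
```

In the actual method, gradients are backpropagated through the continuous matrix while the environment reward is evaluated on the rounded permutation.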

In continuous domains, sampled policy gradient (SPG) variants sample multiple candidate actions per state and update the actor toward the action with the maximal Q-value (Wiehe et al., 2018, Holubar et al., 2020); a minimal sketch follows the list below. This strategy:

  • Facilitates global search in Q-space, reducing the risk of local optima.
  • Can be extended with action prioritization and weighted updates for improved stability (e.g., SPG-p).
  • Experience replay (ER) enhances critic robustness and accelerates training, allowing SPG to outperform on-policy baselines such as PPO (Holubar et al., 2020).
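
A minimal sketch of the sampled actor target, assuming a deterministic actor, a learned Q-function, and actions bounded in [-1, 1]; the Gaussian perturbation scheme and sample count are assumptions:

```python
import numpy as np

def sampled_actor_target(actor, q_fn, state, n_samples=16, noise_std=0.3, rng=None):
    """Return the Q-maximizing action among the actor's proposal and sampled perturbations."""
    rng = rng or np.random.default_rng()
    base = actor(state)                                    # deterministic actor proposal
    perturbed = base + noise_std * rng.standard_normal((n_samples, base.size))
    candidates = np.clip(np.vstack([base, perturbed]), -1.0, 1.0)
    q_values = np.array([q_fn(state, a) for a in candidates])
    return candidates[np.argmax(q_values)]                 # the actor is then regressed toward this target
```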

5. SPG for Diffusion LLMs and Likelihood Bounds

Recent advances adapt SPG to reinforcement learning fine-tuning of diffusion LLMs (Wang et al., 10 Oct 2025). Here, the exact log-likelihood is intractable; the ELBO provides a tractable lower bound, but relying on it alone, whether maximized for good outputs or minimized for undesirable completions, can bias the gradient and limit RL effectiveness.

SPG improves upon this by “sandwiching” the true log-likelihood between a tractable ELBO and a Rényi-based evidence upper bound (EUBO):

  • For positively rewarded outputs, SPG maximizes ELBO.
  • For negatively rewarded outputs, SPG minimizes EUBO (or a mixture of ELBO/EUBO).
  • Block-wise masking reduces the variance of the bound estimates.

The policy optimization objective is:

$$J_{\text{SPG}}(\theta) = \mathbb{E}_{c, x} \left[ \mathbb{I}_{A \geq 0} \cdot A \cdot L_{\text{ELBO}}(x|c;\theta) + \mathbb{I}_{A < 0} \cdot A \cdot L_{\text{EUBO}}(x|c;\theta) \right]$$
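
This objective can be sketched schematically as follows, assuming per-sample Monte Carlo estimates of the two bounds are already available as differentiable tensors (the names are illustrative, not the paper's API):

```python
import torch

def spg_surrogate(advantages, elbo, eubo):
    """Sandwiched surrogate: weight the lower bound when A >= 0 and the upper bound when A < 0.

    advantages, elbo, eubo: tensors of shape (batch,); elbo and eubo are estimates of the
    evidence lower / upper bounds of log p_theta(x | c) for each sampled completion.
    """
    use_lower = (advantages >= 0).to(elbo.dtype)
    per_sample = use_lower * advantages * elbo + (1.0 - use_lower) * advantages * eubo
    return per_sample.mean()   # maximized by the optimizer (gradient ascent / negated loss)
```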

Experimental results show that SPG achieves superior accuracy on GSM8K (+3.6%), MATH500 (+2.6%), Countdown (+18.4%), and Sudoku (+27.0%) compared to RL baselines.

6. Entropy Regularization, Bias–Gain Optimization, and Momentum Acceleration

SPG concepts have informed developments in entropy-regularized RL (Liu et al., 2019), where the policy gradient takes the form:

$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s, a} \left[ \left( q_{\pi}(s,a) - \alpha \log \pi(a|s) \right) \nabla_\theta \log \pi(a|s) \right]$$

incorporating entropy directly into the gradient update for stabilized exploration, improved representation capacity (via local action variance), and enhanced scalability.
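
A single-sample sketch of this update, assuming $\nabla_\theta \log \pi(a|s)$ and the action value are supplied externally:

```python
import numpy as np

def soft_policy_gradient_step(theta, grad_log_pi, q_value, log_pi, alpha=0.2, lr=1e-2):
    """One entropy-regularized policy-gradient step for a single (s, a) sample."""
    return theta + lr * (q_value - alpha * log_pi) * grad_log_pi
```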

Other extensions include bias–gain optimization: methods seek to optimize not only long-run average reward (gain) but also the bias for superior transient performance, employing "sandwiched" objectives and logarithmic barrier functions to maintain gain-optimality while maximizing bias (Dewanto et al., 2021).

Momentum-based acceleration for SPG has also emerged: SPG-NM integrates a negative momentum term, updating parameters with whichever of the plain gradient step or the momentum-adjusted step performs better, leading to faster convergence and improved robustness across bandit and MDP tasks (Zhang et al., 8 May 2024).
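
The following sketch mirrors only the description above (compare two candidate steps and keep the better one); the specific negative-momentum coefficient and selection rule are assumptions, not the exact SPG-NM update:

```python
import numpy as np

def negative_momentum_step(theta, prev_theta, grad_fn, objective_fn, lr=0.1, nm_coef=0.5):
    """Illustrative step: plain gradient vs. negative-momentum-adjusted gradient, keep the better."""
    g = grad_fn(theta)
    plain = theta + lr * g
    adjusted = theta + lr * g - nm_coef * (theta - prev_theta)   # negative momentum term
    return plain if objective_fn(plain) >= objective_fn(adjusted) else adjusted
```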

7. SPG, Importance Correction, and Bias–Variance Tradeoff

SPG also connects to the bias–variance tradeoff in Monte Carlo policy gradient updates. For example, in bandit and online learning settings, SPG minimizes variance by nullifying importance corrections for sampled actions, trading unbiasedness for stability (Morrill et al., 2022, Tosatto et al., 2022). Extensions such as NeuRD-CIX interpolate between high-variance unbiased updates and low-variance biased SPG updates via capped implicit exploration.

Regret bounds: NeuRD-CIX achieves sublinear regret with high probability in sequential decision settings, demonstrating the utility of “sandwiched” importance correction for robust learning in non-stationary environments.
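
To illustrate the bias–variance dial (a simplified illustration only, not the exact NeuRD-CIX estimator), a capped importance weight for the sampled action can interpolate between the two regimes:

```python
def capped_importance_weighted_return(reward, pi_a, cap):
    """Capped importance weighting for the sampled action with probability pi_a.

    cap = infinity recovers the unbiased importance-corrected estimate (high variance);
    cap = 1 nullifies the correction, as in the low-variance biased SPG update;
    intermediate caps trade bias for variance.
    """
    return reward * min(1.0 / pi_a, cap)
```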


Table: SPG Method Variants and Primary Mechanism

| Variant | Domain/Application | Main Mechanism |
|---|---|---|
| PGQ SPG (Lehnert et al., 2015) | Off-policy control | TD update “sandwiched” with policy-gradient drift corrections |
| EPG (Ciosek et al., 2017, Ciosek et al., 2018) | Stochastic/deterministic PG | Analytical integration over actions, variance reduction |
| Sinkhorn SPG (Emami et al., 2018) | Permutation learning | Continuous relaxation via Sinkhorn operator, actor–critic “bypass” |
| Sampled SPG (Wiehe et al., 2018, Holubar et al., 2020) | Continuous RL | Actor update by sampling actions and Q-value maximization |
| Diffusion SPG (Wang et al., 10 Oct 2025) | dLLMs | ELBO/EUBO bounds “sandwiched” around the likelihood |
| Entropic SPG (Liu et al., 2019) | Maximum-entropy RL | Policy gradient with entropy regularization |
| Momentum SPG (Zhang et al., 8 May 2024) | Accelerated RL | Negative momentum sequence in gradient ascent |
| NeuRD-CIX (Morrill et al., 2022) | Bandit/sequential learning | Capped importance weighting for bias–variance tuning |

Summary

Sandwiched Policy Gradient encompasses a heterogeneous set of techniques unified by their multi-component gradient estimators, which combine and correct value and policy gradients using bounding functions, adaptive corrections, or structured relaxations to address intractability, bias, or instability. Empirical and theoretical evidence consistently shows that the “sandwiching” mechanism yields superior convergence and alignment performance in challenging RL and generative modeling contexts. The framework continues to evolve, integrating novel estimation, exploration, and acceleration schemes, thereby serving as a foundational paradigm in modern reinforcement learning.
