PSPO: Smoothing in Policy Optimization

Updated 3 July 2026

PSPO is a reinforcement learning framework that employs probability smoothing to interpolate between current and reference policies, ensuring stable, soft trust-region updates.
It replaces hard clipping with continuous smoothing operators, contracting divergence metrics like total variation and Kullback–Leibler, and maintaining gradient continuity.
Empirical results show PSPO enhances performance in LLM fine-tuning, continuous control, and offline RL by delivering improved stability and generalization.

Probability Smoothing Policy Optimisation (PSPO) is a family of reinforcement learning methods that replace hard, information-discarding restrictions on policy updates, such as sharp ratio clipping, with principled smoothing operators. These operators interpolate between the current policy and a reference (or behavior) policy, yielding a soft trust region that preserves gradient continuity and provides formal divergence contraction in total variation (TV) and Kullback–Leibler (KL) distance. Empirically, this is reported to improve stability and generalization across domains ranging from LLM fine-tuning to offline model-based RL. The PSPO framework includes instantiations for both online, on-policy actor–critic methods and offline, Bayesian model-based methods, unified by the core idea of controlled probability smoothing as the basis of trust-region enforcement.

1. Core Principles and Smoothing Operators

The fundamental concept in PSPO is the replacement of discrete update boundaries with soft, continuous smoothing on the policy probability simplex. For on-policy settings such as Group Relative Policy Optimization (GRPO), the linear probability-smoothing operator is defined as follows:

$S_\alpha[\pi_\theta](a|s) = (1-\alpha)\pi_\theta(a|s) + \alpha\pi_{\theta_\text{old}}(a|s),\quad \alpha \in [0,1]$

This operator interpolates smoothly between the updated policy $\pi_\theta$ (for $\alpha=0$) and the behavior policy $\pi_{\theta_\text{old}}$ (for $\alpha=1$). Smoothing directly induces a contraction in divergence metrics:

Total Variation Contraction: $\|S_\alpha[\pi_\theta](\cdot|s) - \pi_{\theta_\text{old}}(\cdot|s)\|_1 = (1-\alpha)\|\pi_\theta(\cdot|s) - \pi_{\theta_\text{old}}(\cdot|s)\|_1$.
KL-Contraction Corollary: $D_{\mathrm{KL}}[S_\alpha[\pi_\theta]\|\pi_{\theta_\text{old}}] \le (1-\alpha) D_{\mathrm{KL}}[\pi_\theta\|\pi_{\theta_\text{old}}]$, and similarly for the reverse KL.
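Both contraction properties can be checked numerically on a single state. The sketch below uses only the Python standard library; the two three-action distributions and the value of `alpha` are made-up illustrations, not values from the cited papers.

```python
import math

def smooth(pi_new, pi_old, alpha):
    """Linear probability smoothing: (1 - alpha) * pi_new + alpha * pi_old."""
    return [(1.0 - alpha) * p + alpha * q for p, q in zip(pi_new, pi_old)]

def l1_distance(p, q):
    """L1 distance between two discrete distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL(p || q) for discrete distributions with q strictly positive."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

# Illustrative (assumed) action distributions at one state.
pi_old = [0.5, 0.3, 0.2]
pi_new = [0.7, 0.2, 0.1]
alpha = 0.1

pi_smooth = smooth(pi_new, pi_old, alpha)

# The L1 distance shrinks by exactly the factor (1 - alpha).
tv_ratio = l1_distance(pi_smooth, pi_old) / l1_distance(pi_new, pi_old)

# Both KL directions are bounded by (1 - alpha) times the unsmoothed KL.
kl_forward_ok = kl(pi_smooth, pi_old) <= (1.0 - alpha) * kl(pi_new, pi_old)
kl_reverse_ok = kl(pi_old, pi_smooth) <= (1.0 - alpha) * kl(pi_old, pi_new)
```

The KL bounds follow from the convexity of the KL divergence in each argument, which is why the check holds in both directions.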

A related functional smoothing approach for PPO-style surrogates is based on smoothly contracting likelihood ratios via hyperbolic tangent or other bounded differentiable operators, as in the Proximal Policy Optimization Smoothed Algorithm (PPOS) (Zhu et al., 2020).

2. Objective Formulations and Algorithmic Realizations

PSPO modifies the canonical policy-gradient surrogate used in PPO/GRPO by substituting the hard-clipped importance ratio with a smoothed alternative. For the linear PSPO approach (Dwyer et al., 25 Sep 2025):

The smoothed probability for action $a$:

$\tilde{\pi}_\theta(a|s) = (1-\alpha)\pi_\theta(a|s) + \alpha\pi_{\theta_\text{old}}(a|s)$

The corresponding importance ratio:

$\tilde{r}(a|s) = \frac{\tilde{\pi}_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)} = (1-\alpha) r(a|s) + \alpha$

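The unconstrained PSPO objective is the standard surrogate with the smoothed ratio in place of the clipped ratio, where $\hat{A}(s,a)$ denotes the advantage estimate:

$L^{\text{PSPO}}(\theta) = \mathbb{E}_{s,\,a \sim \pi_{\theta_\text{old}}}\big[\tilde{r}(a|s)\,\hat{A}(s,a)\big] = \mathbb{E}_{s,\,a \sim \pi_{\theta_\text{old}}}\big[\big((1-\alpha)\, r(a|s) + \alpha\big)\,\hat{A}(s,a)\big]$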

Optionally, an explicit KL penalty to a reference policy can be added. The PSPO framework thus subsumes both unconstrained and KL-constrained policy optimization.
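As a minimal sketch, the smoothed ratio and an unconstrained surrogate can be computed from per-sample log-probabilities. It assumes the surrogate is the sample mean of the smoothed ratio times an advantage estimate. The function names `smoothed_ratio` and `pspo_surrogate`, and all numbers, are hypothetical and chosen only for illustration.

```python
import math

def smoothed_ratio(logp_new, logp_old, alpha):
    """Return (1 - alpha) * r + alpha, with r = exp(logp_new - logp_old)."""
    r = math.exp(logp_new - logp_old)
    return (1.0 - alpha) * r + alpha

def pspo_surrogate(logps_new, logps_old, advantages, alpha):
    """Sample mean of smoothed ratio times advantage (a quantity to maximize)."""
    terms = [
        smoothed_ratio(ln, lo, alpha) * adv
        for ln, lo, adv in zip(logps_new, logps_old, advantages)
    ]
    return sum(terms) / len(terms)

# Illustrative numbers only: three sampled actions with made-up log-probs.
logps_old = [math.log(0.5), math.log(0.3), math.log(0.2)]
logps_new = [math.log(0.7), math.log(0.2), math.log(0.1)]
advantages = [1.0, -0.5, 0.2]

objective = pspo_surrogate(logps_new, logps_old, advantages, alpha=0.1)
```

Setting `alpha=0.0` gives the plain importance-weighted surrogate, while `alpha=1.0` removes all dependence on the new policy and returns the mean advantage.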

In model-based offline RL (Lin et al., 8 May 2026), “probability smoothing” occurs via averaging or sampling an ensemble of Bayesian posterior transition models, ensuring policy updates reflect uncertainty in dynamics, especially in out-of-distribution (OOD) regions.

3. Theoretical Guarantees and Formal Properties

PSPO provides closed-form, non-asymptotic contraction guarantees for divergence and gradient signal:

Ratio Contraction & Non-Vanishing Slopes: For all $\alpha \in [0,1)$, $|\tilde{r} - 1| = (1-\alpha)\,|r - 1|$, and $\partial \tilde{r} / \partial r = 1-\alpha > 0$, precluding the “flat-plateau” phenomenon prevalent in hard clipping.
Monotonic Improvement in Model-Based Settings: When the KL-step size $\epsilon$ is sufficiently small and the Fisher-based condition on gradients holds, successive PSPO policy iterates guarantee $J(\pi_{k+1}) \ge J(\pi_k)$, where $J$ denotes expected return (Lin et al., 8 May 2026).
Convergence of Value Estimation: Posterior sampling with stochastic approximation retains bounded variance and contractive Bellman operators, ensuring convergence (Lin et al., 8 May 2026).
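The non-vanishing slope can be illustrated by comparing per-sample surrogate terms as functions of the ratio $r$. The sketch below assumes the standard PPO clipped term $\min(rA, \operatorname{clip}(r, 1-\epsilon, 1+\epsilon)A)$ for the baseline and the linear smoothed ratio $(1-\alpha)r + \alpha$ for PSPO. Slopes are estimated by central finite differences, and the chosen ratio, advantage, and hyperparameters are illustrative.

```python
def clipped_term(r, adv, eps=0.2):
    """Standard PPO clipped surrogate term for one sample."""
    clipped_r = max(1.0 - eps, min(1.0 + eps, r))
    return min(r * adv, clipped_r * adv)

def smoothed_term(r, adv, alpha=0.1):
    """Linear-smoothing surrogate term for one sample."""
    return ((1.0 - alpha) * r + alpha) * adv

def slope(term, r, adv, h=1e-6):
    """Central finite-difference derivative of a surrogate term w.r.t. r."""
    return (term(r + h, adv) - term(r - h, adv)) / (2.0 * h)

# A ratio beyond 1 + eps with a positive advantage: the clipped term is flat here.
flat = slope(clipped_term, 1.5, 1.0)
soft = slope(smoothed_term, 1.5, 1.0)

# The smoothed ratio stays closer to 1 by the factor (1 - alpha).
deviation = abs(((1.0 - 0.1) * 1.5 + 0.1) - 1.0)
```

Here `flat` is zero because the clipped term is constant in that region, whereas `soft` stays near $1-\alpha$, so the sample still contributes a gradient.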

4. Algorithmic Instantiations and Pseudocode

GR-PSPO (Dwyer et al., 25 Sep 2025):

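Sample actions from the behavior policy $\pi_{\theta_\text{old}}$.
Compute empirical advantages.
Evaluate smoothed log-probabilities and ratios $\tilde{r}(a|s) = (1-\alpha)\, r(a|s) + \alpha$.
Form the policy gradient estimator: $\hat{g} = \hat{\mathbb{E}}\big[\hat{A}\,\nabla_\theta \tilde{r}(a|s)\big]$, with $\hat{A}$ the empirical advantage from the previous step.
Update parameters via Adam.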
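The steps above can be sketched on a toy problem. The example below is an illustrative simplification, not the authors' implementation: it assumes a softmax policy over three actions in a single-state bandit, group-relative advantages computed as standardized rewards within each sampled group, and plain gradient ascent in place of Adam. All names and hyperparameters are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def group_advantages(rewards):
    """Standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((x - mean) ** 2 for x in rewards) / len(rewards)
    return [(x - mean) / (math.sqrt(var) + 1e-8) for x in rewards]

def pspo_gradient(logits, pi_old, actions, advantages, alpha):
    """Gradient of mean(smoothed_ratio * advantage) w.r.t. softmax logits."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        r = pi[a] / pi_old[a]
        # d(smoothed ratio)/d(logit_j) = (1 - alpha) * r * (1[j == a] - pi[j])
        scale = (1.0 - alpha) * r * adv / len(actions)
        for j in range(len(logits)):
            grad[j] += scale * ((1.0 if j == a else 0.0) - pi[j])
    return grad

def train(reward_fn, n_actions=3, iters=200, group=16, inner=2,
          lr=0.5, alpha=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * n_actions
    for _ in range(iters):
        pi_old = softmax(logits)  # snapshot of the behavior policy
        actions = rng.choices(range(n_actions), weights=pi_old, k=group)
        rewards = [reward_fn(a) for a in actions]
        advantages = group_advantages(rewards)
        for _ in range(inner):  # several updates per sampled group
            g = pspo_gradient(logits, pi_old, actions, advantages, alpha)
            logits = [t + lr * gj for t, gj in zip(logits, g)]
    return softmax(logits)

# Toy reward: only action 2 is rewarded.
final_policy = train(lambda a: 1.0 if a == 2 else 0.0)
```

The sketch works with ratios directly rather than smoothed log-probabilities. When every action in a group receives the same reward, the standardized advantages are zero and that group produces no update.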

PPOS (Zhu et al., 2020): Replaces PPO’s flat clipping with a smooth function based on $\tanh$ for out-of-bounds ratios, ensuring gradients decay smoothly and remain nonzero.
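One tanh-based replacement for the flat clip that matches this description is sketched below. The functional form is an assumption made for illustration and should be checked against Zhu et al. (2020) before use: inside $[1-\epsilon, 1+\epsilon]$ the ratio is unchanged, and outside it the function continues from the boundary value along a $-\alpha\tanh(r-1)$ curve. The name `ppos_like_ratio` is hypothetical.

```python
import math

def ppos_like_ratio(r, eps=0.2, alpha=0.3):
    """Assumed tanh-smoothed ratio: identity inside the clip range,
    a gently sloped tanh curve outside it (continuous at both bounds)."""
    if r > 1.0 + eps:
        return -alpha * math.tanh(r - 1.0) + 1.0 + eps + alpha * math.tanh(eps)
    if r < 1.0 - eps:
        return -alpha * math.tanh(r - 1.0) + 1.0 - eps - alpha * math.tanh(eps)
    return r

def slope(f, r, h=1e-6):
    """Central finite-difference derivative."""
    return (f(r + h) - f(r - h)) / (2.0 * h)

outside_slope = slope(ppos_like_ratio, 2.0)   # well beyond 1 + eps
far_slope = slope(ppos_like_ratio, 4.0)       # even further out
```

Under this assumed form the out-of-range slope is $-\alpha\,\mathrm{sech}^2(r-1)$. It is never exactly zero, it shrinks as $r$ moves away from 1, and its negative sign pulls the ratio back toward the trust region.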

Posterior Sampling-Based PSPO (Lin et al., 8 May 2026):

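Bayesian ensemble approximating the posterior over transition models given the offline dataset.
Bellman updates via posterior sampling.
KL-constrained policy update: maximize the estimated return subject to a bound $\epsilon$ on the KL divergence from the previous policy.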
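The posterior-sampling Bellman update can be illustrated on a toy tabular problem. This is a sketch under strong assumptions and not the algorithm of Lin et al.: a two-state Markov reward process under a fixed policy, a hand-written three-member ensemble standing in for posterior samples, and a decaying step size for the stochastic approximation. The ensemble entries, rewards, and discount are made up.

```python
import random

GAMMA = 0.9
REWARDS = [1.0, 0.0]

# Hypothetical posterior ensemble: each member is a 2x2 transition matrix
# (row = current state, column = next state) under a fixed policy.
ENSEMBLE = [
    [[0.9, 0.1], [0.3, 0.7]],
    [[0.8, 0.2], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]

def bellman_backup(values, model):
    """One policy-evaluation backup under a single transition model."""
    return [
        REWARDS[s] + GAMMA * sum(p * v for p, v in zip(model[s], values))
        for s in range(len(values))
    ]

def mean_model(ensemble):
    """Element-wise average of the ensemble's transition matrices."""
    n = len(ensemble)
    return [
        [sum(m[s][t] for m in ensemble) / n for t in range(len(ensemble[0][s]))]
        for s in range(len(ensemble[0]))
    ]

def evaluate_by_sampling(ensemble, steps=5000, seed=0):
    """Stochastic approximation: each step backs up under one sampled model."""
    rng = random.Random(seed)
    values = [0.0, 0.0]
    for t in range(steps):
        target = bellman_backup(values, rng.choice(ensemble))
        lr = 1.0 / (1.0 + 0.01 * t)  # decaying step size
        values = [v + lr * (g - v) for v, g in zip(values, target)]
    return values

def evaluate_by_averaging(ensemble, sweeps=500):
    """Deterministic alternative: iterate backups under the averaged model."""
    model = mean_model(ensemble)
    values = [0.0, 0.0]
    for _ in range(sweeps):
        values = bellman_backup(values, model)
    return values

sampled_values = evaluate_by_sampling(ENSEMBLE)
averaged_values = evaluate_by_averaging(ENSEMBLE)
```

Because the backup is linear in the transition matrix, the sampled iteration targets the same fixed point as the averaged model in this toy case. In control settings with a maximization over actions, sampling and averaging generally differ.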

5. Empirical Results and Comparative Assessment

Reported experiments favor PSPO over hard clipping in both update stability and downstream task performance.

LLM RL Fine-Tuning (GR-PSPO):
- Qwen2.5-0.5B, GSM8K: GRPO-clipped, 17.6%; GR-PSPO (α=0.1), 39.7%; unclipped GRPO, 40.7%.
- Qwen2.5-1.5B, GSM8K: GRPO-clipped, 37.8%; GR-PSPO, 59.4%; unclipped, 57.9%.
- Out-of-distribution benchmarks: GR-PSPO matches or outperforms both variants, with improvements of ≈7–20 pp.
- Quality metrics: LLM-as-judge rates GR-PSPO responses as more concise, logically coherent, and better formatted (Dwyer et al., 25 Sep 2025).
Continuous Control (PPOS):
- Highest mean return in 4/5 MuJoCo tasks, e.g., Humanoid-v2: PPOS, 535.6±59.7 versus PPO, 473.3±61.2.
- Lower run-to-run variance and policy entropy; more deterministic agents (Zhu et al., 2020).
Offline RL (Posterior Sampling):

| Dataset             | CQL   | MOReL | RAMBO | PMDB  | PSPO (ours) |
|---------------------|-------|-------|-------|-------|-------------|
| HalfCheetah-Medium  | 46.9  | 60.7  | 77.9  | 75.6  | 79.3        |
| Hopper-Medium       | 61.9  | 84.0  | 87.0  | 106.8 | 108.5       |
| Walker2d-Medium     | 79.5  | 72.8  | 84.9  | 94.2  | 103.9       |
| Optimal-Liquidation | 89.4  | 64.7  | 99.6  | 85.5  | 102.3       |

Posterior sampling-based PSPO outperforms or matches all state-of-the-art baselines on 14/18 D4RL tasks and provides stable returns under pronounced stochasticity (Lin et al., 8 May 2026).

6. Hyperparameterization and Practical Integration

Key parameters include:

Smoothing strength α — Governs trust-region radius.
- Typical effective range: α ≈ 0.05–0.2 for LLMs, with α=0 recovering unclipped objectives and larger α imparting more conservative updates (Dwyer et al., 25 Sep 2025, Zhu et al., 2020).
- In PPOS, α scales with observation dimension: α(|O|) ≈ 0.3333 exp(−0.0048 |O|) (Zhu et al., 2020).
Clipping threshold ε in PPOS — Standard PPO value ε=0.2 retained.
KL penalty β or step size ε — Typically zero in PSPO/GR-PSPO due to implicit divergence control, positive or tuned in model-based settings for explicit regularization.
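The observation-dimension schedule quoted above for the PPOS α can be evaluated directly. The helper name `ppos_alpha` is hypothetical; the two dimensions used are the observation sizes of the HalfCheetah-v2 (17) and Humanoid-v2 (376) MuJoCo tasks.

```python
import math

def ppos_alpha(obs_dim):
    """Schedule from the text: alpha(|O|) ~ 0.3333 * exp(-0.0048 * |O|)."""
    return 0.3333 * math.exp(-0.0048 * obs_dim)

alpha_small_obs = ppos_alpha(17)    # e.g. a 17-dimensional observation
alpha_large_obs = ppos_alpha(376)   # e.g. a 376-dimensional observation
```

The schedule decays monotonically, so tasks with larger observation spaces receive a smaller α.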

Integrating PSPO into an existing PPO/GRPO codebase requires few changes: replace the clipped ratio with the smoothed ratio and, optionally, add the KL term described above.

7. Broader Impact, Limitations, and Connections

PSPO provides a unifying framework for soft trust-region enforcement, mitigating sharp gradient discontinuities and instability associated with hard clipping and excessive pessimistic regularization in both online and offline RL. It has proven particularly effective in LLM fine-tuning under GRPO and challenging control tasks demanding stability and expressivity in exploration. A key limitation is the selection and tuning of smoothing strength hyperparameters—excessive smoothing can slow learning, while insufficient smoothing may allow instability. The PSPO methodology connects to label smoothing in supervised learning and Bayesian model averaging in the context of offline RL uncertainty management.

Further research directions include dynamic adaptation of smoothing strength, broader benchmarking on emerging large-policy domains, and theoretical analysis of convergence speed and robustness as a function of smoothing operator design.