Pb-PPO: Adaptive Clipping in Proximal Policy Optimization

Updated 29 April 2026
  • Pb-PPO is a bi-level reinforcement learning method that dynamically optimizes clipping bounds via multi-armed bandit feedback.
  • It integrates a multi-armed bandit mechanism with standard PPO to adjust clipping parameters based on actual return feedback for enhanced policy updates.
  • Empirical benchmarks demonstrate that Pb-PPO achieves higher final returns, improved sample efficiency, and smoother learning curves compared to fixed-bound PPO variants.

Preference-based Proximal Policy Optimization (Pb-PPO) is a bi-level reinforcement learning framework designed to address limitations inherent in conventional Proximal Policy Optimization (PPO), specifically the reliance on a fixed clipping bound for policy updates. Pb-PPO dynamically optimizes the choice of the clipping bound via a multi-armed bandit mechanism guided by feedback from actual returns, aligning the update process with the true objective of maximizing cumulative return. This approach results in improved sample efficiency, stability, and overall performance in continuous control and navigation tasks, as evidenced by empirical benchmarks (Zhang et al., 2023).

1. Motivation and Background

PPO is widely adopted for its stable training characteristics, largely attributed to the use of a clipped surrogate objective that constrains policy updates. The clipping bound $\epsilon$ in PPO controls the extent of the update per iteration, and PPO's performance has been shown to be sensitive to this hyperparameter. Traditional PPO uses a fixed $\epsilon$, yet there is no theoretical guarantee that a constant bound remains optimal throughout training. Fixed bounds can restrict exploration and adaptability, and prior work on dynamically adjusting $\epsilon$ has not directly aligned the clipping adjustment with maximizing true cumulative return.

Pb-PPO introduces a principled framework for dynamically and automatically selecting the clipping bound. Unlike previous methods, the mechanism directly uses task-return feedback to tune the bound, thus better matching reinforcement learning’s core objective.

2. Bi-Level Optimization Paradigm

Pb-PPO formalizes the policy update as a bi-level optimization problem:

  • Inner Level: For a given clipping bound $\epsilon$, optimize the policy using the standard PPO-style clipped surrogate objective:

$$J_{\text{inner}}(\pi_{\text{new}}; \epsilon) = \mathbb{E}_{\tau \sim \pi_{\text{old}}} \left[ \min\left( r(\tau)A_{\text{old}}(\tau), \mathrm{clip}(r(\tau), 1-\epsilon, 1+\epsilon)A_{\text{old}}(\tau) \right) \right],$$

where $r(\tau) = \pi_{\text{new}}(\tau)/\pi_{\text{old}}(\tau)$.

  • Outer Level: Select the optimal $\epsilon^*$ from a candidate set $\mathbb{Z} = \{\epsilon_0, \ldots, \epsilon_n\}$ using a multi-armed bandit upper confidence bound (UCB) objective:

$$\epsilon^* \leftarrow \arg\max_{\epsilon_i \in \mathbb{Z}} U^{\text{UCB}}(\epsilon_i) := U(\epsilon_i) + \lambda \hat{H}(\epsilon_i),$$

where $U(\epsilon_i)$ estimates the expected return under $\epsilon_i$ and $\hat{H}(\epsilon_i)$ quantifies uncertainty.

The two levels are tightly coupled: the outer bandit selects the clipping parameter to maximize true return, and the inner PPO update uses this choice to update the policy.
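
As a concrete illustration of the inner objective, the following minimal Python sketch evaluates the clipped surrogate on a small batch. It assumes the probability ratios and advantage estimates have already been computed per sample; the function name `clipped_surrogate` and the toy numbers are illustrative and not taken from the original reference.

```python
from typing import Sequence


def clipped_surrogate(ratios: Sequence[float],
                      advantages: Sequence[float],
                      epsilon: float) -> float:
    """Sample mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal length")
    total = 0.0
    for r, a in zip(ratios, advantages):
        # Clip the probability ratio into [1 - epsilon, 1 + epsilon].
        clipped_r = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


# Hypothetical toy batch: probability ratios and advantage estimates.
ratios = [0.5, 1.0, 1.5]
advantages = [1.0, -1.0, 2.0]
value = clipped_surrogate(ratios, advantages, epsilon=0.2)
```

With `epsilon=0.2`, the third sample's ratio of 1.5 is clipped to 1.2, so its contribution is capped at 2.4 instead of 3.0; a larger bound would let that sample move the objective further.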

3. Multi-Armed Bandit Integration

Each clipping bound candidate $\epsilon_i$ is treated as a bandit arm. The integration operates as follows:

  • Bandit Reward: After each PPO update under $\epsilon_i$, evaluate the updated policy $\pi_{\text{new}}$ on a batch of evaluation trajectories and compute the average return.
  • Statistics Maintenance: For each arm, track its number of visits, its current expected reward $U(\epsilon_i)$ (updated from the newly observed average return), and the total number of visits across all arms.
  • Uncertainty Quantification: The bandit UCB uncertainty is given by the term $\hat{H}(\epsilon_i)$ in the outer objective.
  • Selection and Update: At each outer iteration, select $\epsilon^*$ as the arm with maximal $U^{\text{UCB}}(\epsilon_i)$, update the statistics, and use $\epsilon^*$ in the next inner PPO epoch. Optionally, normalize the bandit utilities to obtain an advantage-style signal.

This approach ensures that exploration and exploitation of candidate bounds are balanced and that the policy update is directly steered by return-based preference.
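
The bookkeeping described above can be sketched as a small bandit class. This is a minimal sketch under stated assumptions: it uses the textbook UCB1 bonus $\sqrt{2 \ln N / n_i}$ for the uncertainty term and an incremental running mean for the utility, neither of which is confirmed by the text here. The class name `ClipBoundBandit`, the candidate values, and the hidden returns are hypothetical.

```python
import math
from typing import List


class ClipBoundBandit:
    """UCB bandit over a candidate set of clipping bounds (illustrative)."""

    def __init__(self, candidates: List[float], ucb_weight: float = 1.0) -> None:
        self.candidates = list(candidates)
        self.ucb_weight = ucb_weight                   # lambda in U + lambda * H
        self.visits = [0] * len(self.candidates)       # per-arm visit counts
        self.utility = [0.0] * len(self.candidates)    # running mean return U
        self.total_visits = 0

    def uncertainty(self, arm: int) -> float:
        # Assumed UCB1-style bonus; only valid once the arm has been visited.
        return math.sqrt(2.0 * math.log(self.total_visits) / self.visits[arm])

    def select(self) -> int:
        # Try every arm once before trusting any estimate.
        for arm, count in enumerate(self.visits):
            if count == 0:
                return arm
        scores = [self.utility[a] + self.ucb_weight * self.uncertainty(a)
                  for a in range(len(self.candidates))]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, avg_return: float) -> None:
        self.visits[arm] += 1
        self.total_visits += 1
        # Incremental mean of observed average returns (assumed update rule).
        self.utility[arm] += (avg_return - self.utility[arm]) / self.visits[arm]


# Hypothetical setup: three candidate bounds with fixed hidden returns.
bandit = ClipBoundBandit([0.1, 0.2, 0.3])
hidden_return = [1.0, 3.0, 2.0]
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, hidden_return[arm])
most_visited = bandit.candidates[bandit.visits.index(max(bandit.visits))]
```

In this toy run the arm with the highest hidden return receives nearly all visits, while the other arms are still sampled occasionally because their bonus grows with the total visit count. In real training the returns are noisy and drift as the policy improves, so the behavior is less clean than this stationary example suggests.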

4. Theoretical Properties

Pb-PPO preserves PPO's local monotonic improvement guarantee under the standard conditions of small policy updates and proper clipping. The multi-armed bandit outer loop provides probabilistic control (via the UCB rule) over the selection of suboptimal clipping bounds, visiting them only $O(\log T)$ times, with $T$ the total number of iterations. The Hoeffding-based derivation justifies the uncertainty term $\hat{H}(\epsilon_i)$. This two-layered structure ensures that $\epsilon^*$ converges to the best-performing bound as learning proceeds (Zhang et al., 2023).
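
For orientation, the textbook Hoeffding argument behind a UCB bonus runs as follows. This is a sketch under assumptions that are not stated in the text above: returns are rescaled to $[0, 1]$, arm $\epsilon_i$ has been evaluated $n_i$ times independently, $\mu(\epsilon_i)$ denotes its true mean return, $N$ is the total number of visits, and $\delta$ is a failure probability. The exact form and constants of $\hat{H}$ in the original reference may differ.

```latex
% One-sided Hoeffding bound for the empirical mean U(\epsilon_i) of n_i samples in [0, 1]:
\Pr\left( \mu(\epsilon_i) - U(\epsilon_i) \ge a \right) \le \exp\left(-2 n_i a^2\right)

% Setting the right-hand side equal to \delta and solving for a:
a = \sqrt{\frac{\ln(1/\delta)}{2 n_i}}

% Choosing \delta = N^{-4} recovers the classical UCB1 bonus:
a = \sqrt{\frac{2 \ln N}{n_i}}
```

The bonus shrinks as an arm is visited more often and grows slowly with the total visit count, which is what keeps rarely tried bounds from being discarded prematurely.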

5. Algorithmic Realization

The overall algorithm can be described as:

  1. Collect on-policy data with the current policy.
  2. For each candidate $\epsilon_i$, compute $U^{\text{UCB}}(\epsilon_i)$.
  3. Select $\epsilon^* = \arg\max_{\epsilon_i \in \mathbb{Z}} U^{\text{UCB}}(\epsilon_i)$.
  4. Perform PPO-style updates using $\epsilon^*$ for several epochs.
  5. Evaluate the new policy, compute reward statistics, and update bandit estimates.
  6. (Optionally) Normalize bandit utilities to improve signal for selection.

Pseudocode is documented in the original reference. The main hyperparameters are the candidate set $\mathbb{Z}$ (10 uniformly spaced values), the UCB weight $\lambda$, the discount factor, and the number of evaluation episodes used for each bandit update.
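
The numbered steps can be arranged into a loop skeleton such as the one below. It is a hedged sketch, not the reference pseudocode: the callables `collect`, `ppo_update`, and `evaluate` are hypothetical stand-ins for data collection, the inner PPO update, and policy evaluation, the UCB bonus is assumed to take the UCB1 form, and the candidate values and toy return function are invented for the demonstration.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple


def pb_ppo_loop(policy: Any,
                candidates: Sequence[float],
                collect: Callable[[Any], Any],
                ppo_update: Callable[[Any, Any, float], Any],
                evaluate: Callable[[Any], float],
                iterations: int,
                ucb_weight: float = 1.0) -> Tuple[Any, List[float], List[int]]:
    visits = [0] * len(candidates)
    utility = [0.0] * len(candidates)
    for t in range(1, iterations + 1):
        batch = collect(policy)                          # step 1: on-policy data
        scores = []
        for i in range(len(candidates)):                 # step 2: UCB index per arm
            if visits[i] == 0:
                scores.append(math.inf)                  # force one visit per arm
            else:
                bonus = math.sqrt(2.0 * math.log(t) / visits[i])  # assumed form
                scores.append(utility[i] + ucb_weight * bonus)
        arm = max(range(len(candidates)), key=scores.__getitem__)  # step 3
        policy = ppo_update(policy, batch, candidates[arm])        # step 4
        avg_return = evaluate(policy)                               # step 5
        visits[arm] += 1
        utility[arm] += (avg_return - utility[arm]) / visits[arm]
    return policy, utility, visits


# Toy stand-ins: the "policy" simply remembers the last bound it was trained with,
# and the return peaks at a bound of 0.2. None of this reflects a real environment.
final_policy, utility, visits = pb_ppo_loop(
    policy=None,
    candidates=[0.1, 0.2, 0.3],
    collect=lambda policy: None,
    ppo_update=lambda policy, batch, eps: eps,
    evaluate=lambda policy: 10.0 - 100.0 * abs(policy - 0.2),
    iterations=50,
)
```

Passing the three stages in as callables keeps the bandit logic separate from any particular PPO implementation, so the same skeleton could wrap an existing training loop.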

6. Empirical Evaluation and Benchmarking

Experiments with Pb-PPO were conducted on continuous control benchmarks in Gym-Mujoco (Ant-v3, HalfCheetah-v3, Hopper-v3, Walker2d-v3) and pybullet-gym (Dog-run), with comparisons to PPO (fixed $\epsilon$), TRPO, DDPG, TRGPPO, and PPO-$\lambda$.

Evaluation metrics include average episodic return, sample efficiency (AUC), final return after a fixed budget of environment steps, and policy-improvement success rate (the fraction of iterations in which the updated policy improves on the previous one). Pb-PPO achieves:

  • Highest average final return (e.g., an average of 3315 across tasks, surpassing PPO-$\lambda$/TRGPPO at roughly 3000).
  • Policy-improvement success of 5.0%, higher than that of fixed-$\epsilon$ PPO.
  • More stable and smooth learning curves (narrower variance) and higher monotonic-improvement ratios.
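
Two of the metrics named above can be computed from a logged learning curve roughly as follows. This is a sketch under assumptions: the success rate is read as the fraction of consecutive evaluations whose return strictly increases, and AUC is taken as the trapezoidal area under the return-versus-steps curve. The function names and the sample curve are hypothetical, not results from the paper.

```python
from typing import Sequence


def improvement_success_rate(returns: Sequence[float]) -> float:
    """Fraction of consecutive evaluations whose return strictly increased."""
    if len(returns) < 2:
        raise ValueError("need at least two evaluations")
    improved = sum(1 for prev, cur in zip(returns, returns[1:]) if cur > prev)
    return improved / (len(returns) - 1)


def learning_curve_auc(steps: Sequence[float], returns: Sequence[float]) -> float:
    """Trapezoidal area under the return-versus-steps curve."""
    if len(steps) != len(returns) or len(steps) < 2:
        raise ValueError("steps and returns must have equal length of at least 2")
    area = 0.0
    for i in range(1, len(steps)):
        area += 0.5 * (returns[i] + returns[i - 1]) * (steps[i] - steps[i - 1])
    return area


# Hypothetical learning curve, evaluated every 1000 environment steps.
steps = [0, 1000, 2000, 3000, 4000]
returns = [0.0, 10.0, 5.0, 20.0, 30.0]
rate = improvement_success_rate(returns)   # 3 of 4 transitions improve
auc = learning_curve_auc(steps, returns)
```

A higher AUC for the same step budget indicates that returns rose earlier in training, which is how sample efficiency is usually compared across methods.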

Summary of Comparative Metrics

| Method | Avg. Final Return | Policy-Improvement Success |
|---|---|---|
| Pb-PPO | 3315 | 5.0% |
| PPO-$\lambda$, TRGPPO | ≈3000 | 3–4% |
| Fixed-$\epsilon$ PPO | Lower | Lower |

7. Ablation and Sensitivity Analyses

Ablations analyze the effect of the number of clipping arms, the presence of normalization in the bandit feedback, and arm selection statistics.

  • Increasing the number of arms from 3 to 10 improves performance, which then plateaus.
  • Normalizing the bandit feedback provides a slight benefit ("wi-ad" outperforms "wo-ad").
  • Pb-PPO consistently demonstrates higher monotonic improvement success than fixed-bound baselines.
  • Statistical correlation analyses verify that arms with higher estimated utility reliably produce higher returns, validating the efficacy of preference-based feedback.

A plausible implication is that further tuning of candidate set cardinality and reward normalization may yield incremental gains, though significant improvements beyond 10 arms appear limited (Zhang et al., 2023).
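
One plausible form of the normalization examined in the ablation is an advantage-style standardization across arms, sketched below. The exact scheme used in the original reference is not given here, so the formula `(u - mean) / (std + eps)`, the function name `normalize_utilities`, and the sample values are all assumptions.

```python
import statistics
from typing import List, Sequence


def normalize_utilities(utilities: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Center and scale arm utilities: (u - mean) / (std + eps)."""
    if not utilities:
        raise ValueError("utilities must be non-empty")
    mean = statistics.fmean(utilities)
    std = statistics.pstdev(utilities)   # population standard deviation
    return [(u - mean) / (std + eps) for u in utilities]


# Hypothetical raw average returns for three arms.
raw = [1000.0, 2000.0, 3000.0]
scaled = normalize_utilities(raw)
```

Standardizing keeps the ranking of arms intact while putting the utilities on a scale comparable to the UCB bonus, which otherwise could be dwarfed by returns in the thousands.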


Pb-PPO addresses the challenge of aligning adaptive surrogate objective parameters with reinforcement learning goals through a bi-level, feedback-driven approach. The reported empirical and theoretical results indicate better adaptability, convergence, and robustness than traditional and previous adaptive PPO variants.
