
Behavior Proximal Policy Optimization (BPPO)

Updated 12 November 2025
  • BPPO is a reinforcement learning algorithm for offline settings that leverages PPO’s clipping mechanism to mitigate overestimation from out-of-distribution actions.
  • It utilizes behavior cloning and off-policy critic estimation to compute reliable advantage values and achieve stable, repeated policy improvements.
  • Empirical evaluations on D4RL benchmarks show BPPO outperforms comparable methods like CQL and TD3+BC, achieving state-of-the-art performance.

Behavior Proximal Policy Optimization (BPPO) is a reinforcement learning algorithm designed specifically for offline RL settings, where policy learning must proceed solely from a fixed dataset without interaction with the environment. BPPO leverages the inherent conservatism of Proximal Policy Optimization’s (PPO) clipped trust-region approach, applying it directly to offline datasets to mitigate overestimation from out-of-distribution (OOD) actions. Notably, BPPO achieves state-of-the-art performance among offline RL algorithms without introducing explicit behavior regularization or pessimistic constraints.

1. Motivation and Theoretical Underpinnings

Offline reinforcement learning presents two core difficulties:

  • Overestimation on OOD actions: Q-functions in off-policy actor-critic methods can assign erroneously high values to state–action pairs absent from the dataset, driving policy updates toward detrimental actions.
  • Data constraint: Without online sampling, there is no corrective mechanism for faulty Q-estimates, causing divergence.

Existing approaches typically constrain policy updates with KL penalties that keep the learned policy close to the behavior policy \pi_\beta, or penalize high Q-values for OOD actions. BPPO observes that standard on-policy update constraints (e.g., PPO’s clipping mechanism) already prevent large policy shifts, thereby limiting queries to OOD actions and the resulting overestimation.

The policy improvement is grounded in the Performance Difference Theorem (PDT):

J(\pi') - J(\pi) = \mathbb{E}_{s\sim\rho_{\pi'},\, a\sim\pi'}\left[A_\pi(s,a)\right]

where A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s) is the advantage function and J(\cdot) denotes the expected discounted return.

In the offline setting, the true state-visitation distributions \rho_\pi are unobservable; BPPO therefore substitutes the empirical state distribution of the dataset, \rho_\mathcal{D}. Two key surrogates emerge:

  • Replace \rho_\pi and \rho_{\pi_\mathrm{old}} by \rho_\mathcal{D}.
  • Estimate the return improvement as

\widehat{J}_\Delta(\pi, \pi_\mathrm{old}) = \mathbb{E}_{s \sim \rho_\mathcal{D},\, a \sim \pi_\mathrm{old}} \left[ \frac{\pi(a|s)}{\pi_\mathrm{old}(a|s)}\, A_{\pi_\mathrm{old}}(s,a) \right]

Offline monotonic improvement can be guaranteed whenever the per-state shift between the new and old policies is sufficiently small, as captured by a total-variation distance bound:

J(\pi) - J(\pi_\mathrm{old}) \geq \widehat{J}_\Delta(\pi, \pi_\mathrm{old}) - C \max_s D_{TV}\!\left(\pi \,\|\, \pi_\mathrm{old}\right)[s]

where C is a constant that depends on the discount factor \gamma and on the bound of the advantage magnitude |A|.
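For concreteness, the following is a minimal NumPy sketch of how the empirical surrogate \widehat{J}_\Delta could be estimated from an offline batch; the function name and array layout are illustrative assumptions, not code from the paper.

```python
import numpy as np

def empirical_surrogate(log_prob_new, log_prob_old, advantages):
    """Monte-Carlo estimate of the offline surrogate J_Delta(pi, pi_old).

    log_prob_new: log pi(a|s) under the candidate policy, shape (N,)
    log_prob_old: log pi_old(a|s) under the reference policy, shape (N,)
    advantages:   A_{pi_old}(s, a) estimates, shape (N,)

    States are drawn from the dataset distribution rho_D and actions from
    pi_old, so the importance ratio pi/pi_old reweights the advantages.
    """
    ratio = np.exp(log_prob_new - log_prob_old)  # pi(a|s) / pi_old(a|s)
    return float(np.mean(ratio * advantages))
```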

2. BPPO Algorithm

BPPO adapts PPO for offline RL with the following sequence:

  1. Behavior cloning: Fit an approximate behavior policy \hat{\pi}_\beta via supervised learning on the offline dataset \mathcal{D}.
  2. Off-policy critic and advantage estimation: Estimate Q_{\pi_\beta}(s,a) (e.g., via SARSA or similar methods) and V_{\pi_\beta}(s) by regression to empirical returns; compute A_\beta(s,a) = Q_{\pi_\beta}(s,a) - V_{\pi_\beta}(s).
  3. Policy update: Initialize \pi_\mathrm{old} \leftarrow \hat{\pi}_\beta. Perform multiple PPO-like updates on sampled transitions.
  4. Sampling and surrogate objective: For each minibatch, sample s_j \sim \mathcal{D} and a_j \sim \pi_\mathrm{old}(\cdot|s_j), compute the likelihood ratio r_j = \pi_\theta(a_j|s_j) / \pi_\mathrm{old}(a_j|s_j) and the advantage \tilde{A}_j = A_\beta(s_j, a_j). The parameter update maximizes

L(\theta) = \mathbb{E}\left[ \min\!\left(r_j \tilde{A}_j,\ \operatorname{clip}(r_j,\, 1-\epsilon,\, 1+\epsilon)\, \tilde{A}_j\right) \right]

where \epsilon is the PPO clipping parameter (a minimal implementation sketch of this objective follows the list).

  5. Iteration: Optionally, update the critic, re-estimate advantages, and reset \pi_\mathrm{old} \leftarrow \pi_k before subsequent updates.
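Below is a minimal PyTorch-style sketch of the clipped surrogate from step 4, written as a loss to be minimized (the negative of L(\theta)); the function name and the default \epsilon value are illustrative assumptions rather than the paper's reference implementation.

```python
import torch

def bppo_policy_loss(new_log_prob, old_log_prob, advantages, epsilon=0.25):
    """Clipped surrogate of step 4, returned as a loss for gradient descent.

    new_log_prob: log pi_theta(a_j | s_j), requires grad
    old_log_prob: log pi_old(a_j | s_j), treated as a constant
    advantages:   A_beta(s_j, a_j) estimates, assumed precomputed and detached
    epsilon:      clip range (0.25 is a placeholder, possibly decayed)
    """
    ratio = torch.exp(new_log_prob - old_log_prob.detach())
    clipped_ratio = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    return -surrogate.mean()  # ascend the surrogate by descending its negative
```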

BPPO Pseudocode

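The following Python-style sketch outlines the full procedure; every helper (behavior_cloning, fit_critic, sample_states, ppo_clip_update) is a hypothetical placeholder for the standard component described in the corresponding step above, not code released with the paper.

```python
def bppo(dataset, num_iterations, ppo_steps_per_iter, epsilon, epsilon_decay):
    # 1. Behavior cloning: supervised fit of the behavior policy on D.
    pi_beta = behavior_cloning(dataset)

    # 2. Off-policy critic: estimate Q_{pi_beta} and V_{pi_beta} from D,
    #    then form the advantage A_beta(s, a) = Q(s, a) - V(s).
    q_fn, v_fn = fit_critic(dataset)
    advantage = lambda s, a: q_fn(s, a) - v_fn(s)

    # 3. Initialize the reference and the learned policy from pi_beta.
    pi_old, pi = pi_beta.copy(), pi_beta.copy()

    for _ in range(num_iterations):
        for _ in range(ppo_steps_per_iter):
            # 4. Sample states from D and actions from pi_old, then take a
            #    clipped PPO-style gradient step on the surrogate objective.
            states = sample_states(dataset)
            actions = pi_old.sample(states)
            ppo_clip_update(pi, pi_old, states, actions,
                            advantage(states, actions), epsilon)

        # 5. Iteration: the improved policy becomes the new reference;
        #    the clip range is optionally decayed to limit drift.
        pi_old = pi.copy()
        epsilon *= epsilon_decay

    return pi
```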

Practical heuristics:

  • The advantage function is typically re-used across multiple PPO updates for stability.
  • A decay schedule for \epsilon may be implemented to restrict policy drift away from \pi_\beta.
  • Standard PPO components such as normalization and gradient clipping remain applicable.

BPPO employs the PPO algorithm in a purely offline context with these modifications:

  • Sampling: States are drawn from the fixed offline dataset and actions from the current reference policy; there is no interaction with the environment.
  • No additional regularization: No KL penalty, explicit behavior cloning regularizer, or pessimism penalty is used beyond PPO’s clipping.
  • Inherent conservatism: The PPO step-size control via clipping is sufficient to suppress the OOD overestimation, provided the dataset reflects the behavior policy’s support.

3. Comparison to Alternative Offline RL Approaches

| Algorithm | Explicit Regularizer | OOD Q Penalization | Policy Iteration Type |
|---|---|---|---|
| BPPO | None (beyond PPO clip) | None | Multiple, clipped |
| CQL | Conservative Q-penalty | Yes | Batch Q-learning |
| TD3+BC | BC penalty | No | Batch actor-critic |
| One-step RL | Single BC-constrained | Possibly (via Q) | One-step improvement |

Compared to one-step (“batch”) approaches that improve the behavior policy only once, BPPO permits repeated clipped improvements, often yielding superior returns.

4. Theoretical Properties and Guarantees

BPPO inherits a monotonic improvement guarantee in the offline setting:

J(\pi_\mathrm{new}) - J(\pi_\mathrm{old}) \geq \mathbb{E}_{s\sim\rho_{\mathcal{D}},\, a\sim\pi_\mathrm{old}}\!\left[\frac{\pi_\mathrm{new}(a|s)}{\pi_\mathrm{old}(a|s)}\, A_{\pi_\mathrm{old}}(s,a)\right] - O\!\left(\max_s D_{TV}(\pi_\mathrm{new} \,\|\, \pi_\mathrm{old})[s]\right)

Tight control over the policy update via clipping (|r_\theta(s,a) - 1| \leq 2\epsilon) ensures the error term remains bounded, which in turn guarantees that the return does not decrease after any update under bounded rewards/advantages. This surrogate bound is the offline analogue of the PPO/TRPO monotonic improvement theorem in the online RL regime.

5. Implementation and Practical Considerations

Several practical points are emphasized in the design and operation of BPPO:

  • Advantage estimation: Reuse of a single off-policy Q/V network for advantage computation across policy updates produces increased empirical stability compared to continuously updating these networks.
  • Clipping schedule: Exponential decay of \epsilon can be leveraged to ensure successive policies remain close to the empirical dataset and the initial behavior policy (see the sketch after this list).
  • Network and optimizer choices: Default PPO configurations—advantage normalization, orthogonal initialization, learning-rate decay, Adam optimizer, and gradient clipping—are applicable without further modification.
  • Data support: The offline dataset must possess adequate coverage of high-return behaviors; otherwise, restricted policy improvement may result from limited support.
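As an illustration of the clipping schedule and the advantage normalization mentioned above, here is a minimal sketch; the constants are placeholders, not the paper's exact hyperparameters.

```python
import numpy as np

def clip_range(step, eps_init=0.25, decay=0.96):
    """Exponentially decayed PPO clip range; constants are illustrative."""
    return eps_init * (decay ** step)

def normalize_advantages(adv, eps=1e-8):
    """Standard PPO-style advantage normalization over a minibatch."""
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)
```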

6. Empirical Evaluation

Experimental benchmarks were conducted on the D4RL suite, comprising Gym Locomotion, Adroit Manipulation, Kitchen, and AntMaze tasks. Average scores (over five seeds) are summarized below:

| Method | Gym Locomotion | Adroit Total | Kitchen Total |
|---|---|---|---|
| CQL | 698.5 | 93.6 | 144.6 |
| TD3+BC | 677.4 | 52.2 | 47.5 |
| Onestep RL | 684.6 | 155.2 | 65.5 |
| IQL | 692.4 | 118.1 | 159.8 |
| BC | 555.5 ± 57.2 | 131.6 ± 31.1 | 144.0 ± 18.0 |
| BPPO | 751.0 ± 21.8 | 291.4 ± 38.8 | 211.0 ± 18.0 |

On the AntMaze suite:

| Task | Best Prior (Method) | BPPO |
|---|---|---|
| Umaze-v2 | 87.5 (IQL) | 95.0 ± 5.5 |
| Umaze-diverse-v2 | 84.0 (CQL) | 91.7 ± 4.1 |
| Medium-play-v2 | 71.2 (IQL) | 51.7 ± 7.5 |
| Medium-diverse-v2 | 70.0 (IQL) | 70.0 ± 6.3 |
| Large-play-v2 | 39.6 (IQL) | 86.7 ± 8.2 |
| Large-diverse-v2 | 47.5 (IQL) | 88.3 ± 4.1 |
| Total | 378.0 | 483.3 ± 35.7 |

BPPO matches or exceeds prior algorithms across task families, achieving higher aggregate scores with a comparatively simple algorithmic recipe.

7. Relationship to Generalized and Off-Policy PPO Variants

The core principle underlying BPPO—that policy improvement can be safely and efficiently performed using clipped surrogates and conservative updates—has analogues in other off-policy PPO modifications, such as Generalized PPO with Sample Reuse (GePPO) (Queeney et al., 2021). These methods generalize the policy improvement guarantee to settings with multiple reference policies and datasets assembled from different behavior distributions. The key distinction is that BPPO utilizes a single behavior policy and dataset, while GePPO combines samples from multiple recent policies, adapting its clipping mechanism and surrogate accordingly.

Both BPPO and GePPO underscore the utility of on-policy controls (clipping) transplanted into offline or off-policy settings, preserving stability and trust-region guarantees. This convergence has prompted further interest in variants combining PPO stability with off-policy sample efficiency.


BPPO represents a succinct and theoretically justified strategy for offline RL, drawing strength from the characteristics of trust-region policy updates, and demonstrating that explicit policy regularization is not a strict prerequisite for effective offline learning when conservative updates are enforced by design (Zhuang et al., 2023).
