
Behavior Proximal Policy Optimization (BPPO)

Updated 12 November 2025
  • BPPO is a reinforcement learning algorithm for offline settings that leverages PPO’s clipping mechanism to mitigate overestimation from out-of-distribution actions.
  • It utilizes behavior cloning and off-policy critic estimation to compute reliable advantage values and achieve stable, repeated policy improvements.
  • Empirical evaluations on D4RL benchmarks show BPPO outperforms comparable methods like CQL and TD3+BC, achieving state-of-the-art performance.

Behavior Proximal Policy Optimization (BPPO) is a reinforcement learning algorithm designed specifically for offline RL settings, where policy learning must proceed solely from a fixed dataset without interaction with the environment. BPPO leverages the inherent conservatism of Proximal Policy Optimization’s (PPO) clipped trust-region approach, applying it directly to offline datasets to mitigate overestimation from out-of-distribution (OOD) actions. Notably, BPPO achieves state-of-the-art performance among offline RL algorithms without introducing explicit behavior regularization or pessimistic constraints.

1. Motivation and Theoretical Underpinnings

Offline reinforcement learning presents two core difficulties:

  • Overestimation on OOD actions: Q-functions in off-policy actor-critic methods can assign erroneously high values to state–action pairs absent from the dataset, driving policy updates toward detrimental actions.
  • Data constraint: Without online sampling, there is no corrective mechanism for faulty Q-estimates, causing divergence.

Existing approaches typically constrain policy updates with KL penalties that keep the learned policy close to the behavior policy \pi_\beta, or penalize high Q-values for OOD actions. BPPO observes that standard on-policy update constraints (e.g., PPO’s clipping mechanism) already prevent large policy shifts, thereby limiting queries to OOD actions and the resulting overestimation.

The policy improvement is grounded in the Performance Difference Theorem (PDT):

J(\pi') - J(\pi) = \mathbb{E}_{s\sim\rho_{\pi'},\, a\sim\pi'}\left[A_\pi(s,a)\right]

where A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s) is the advantage function and J(\cdot) denotes the expected discounted return.

In the offline setting, the true state-visitation distributions \rho_\pi are unobservable; BPPO therefore substitutes the empirical state distribution of the dataset, \rho_\mathcal{D}. Two key surrogates emerge:

  • Replace \rho_\pi and \rho_{\pi_\mathrm{old}} by \rho_\mathcal{D}.
  • Estimate the return improvement as

\widehat{J}_\Delta(\pi, \pi_\mathrm{old}) = \mathbb{E}_{s \sim \rho_\mathcal{D},\, a \sim \pi_\mathrm{old}} \left[ \frac{\pi(a|s)}{\pi_\mathrm{old}(a|s)}\, A_{\pi_\mathrm{old}}(s,a) \right]

Offline monotonic improvement can be guaranteed whenever the per-state shift between the new and old policies is sufficiently small, as captured by a total-variation distance bound:

J(\pi) - J(\pi_\mathrm{old}) \geq \widehat{J}_\Delta(\pi, \pi_\mathrm{old}) - C \max_s D_{TV}\!\left(\pi \,\|\, \pi_\mathrm{old}\right)[s]

where C is a constant that depends on the discount factor \gamma and on the bound of the advantage magnitude |A|.
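For concreteness, the following is a minimal NumPy sketch of how the empirical surrogate \widehat{J}_\Delta could be estimated from an offline batch; the function name and array layout are illustrative assumptions, not code from the paper.

```python
import numpy as np

def empirical_surrogate(log_prob_new, log_prob_old, advantages):
    """Monte-Carlo estimate of the offline surrogate J_Delta(pi, pi_old).

    log_prob_new: log pi(a|s) under the candidate policy, shape (N,)
    log_prob_old: log pi_old(a|s) under the reference policy, shape (N,)
    advantages:   A_{pi_old}(s, a) estimates, shape (N,)

    States are drawn from the dataset distribution rho_D and actions from
    pi_old, so the importance ratio pi/pi_old reweights the advantages.
    """
    ratio = np.exp(log_prob_new - log_prob_old)  # pi(a|s) / pi_old(a|s)
    return float(np.mean(ratio * advantages))
```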

2. BPPO Algorithm

BPPO adapts PPO for offline RL with the following sequence:

  1. Behavior cloning: Fit an approximate behavior policy \hat{\pi}_\beta via supervised learning on the offline dataset \mathcal{D}.
  2. Off-policy critic and advantage estimation: Estimate Q_{\pi_\beta}(s,a) (e.g., via SARSA or similar methods) and V_{\pi_\beta}(s) by regression to empirical returns; compute A_\beta(s,a) = Q_{\pi_\beta}(s,a) - V_{\pi_\beta}(s).
  3. Policy update: Initialize \pi_\mathrm{old} \leftarrow \hat{\pi}_\beta. Perform multiple PPO-like updates on sampled transitions.
  4. Sampling and surrogate objective: For each minibatch, sample s_j \sim \mathcal{D} and a_j \sim \pi_\mathrm{old}(\cdot|s_j), compute the likelihood ratio r_j = \pi_\theta(a_j|s_j) / \pi_\mathrm{old}(a_j|s_j) and the advantage \tilde{A}_j = A_\beta(s_j, a_j). The parameter update maximizes

L(\theta) = \mathbb{E}\left[ \min\!\left(r_j \tilde{A}_j,\ \operatorname{clip}(r_j,\, 1-\epsilon,\, 1+\epsilon)\, \tilde{A}_j\right) \right]

where \epsilon is the PPO clipping parameter (a minimal implementation sketch of this objective follows the list).

  5. Iteration: Optionally, update the critic, re-estimate advantages, and reset \pi_\mathrm{old} \leftarrow \pi_k before subsequent updates.
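Below is a minimal PyTorch-style sketch of the clipped surrogate from step 4, written as a loss to be minimized (the negative of L(\theta)); the function name and the default \epsilon value are illustrative assumptions rather than the paper's reference implementation.

```python
import torch

def bppo_policy_loss(new_log_prob, old_log_prob, advantages, epsilon=0.25):
    """Clipped surrogate of step 4, returned as a loss for gradient descent.

    new_log_prob: log pi_theta(a_j | s_j), requires grad
    old_log_prob: log pi_old(a_j | s_j), treated as a constant
    advantages:   A_beta(s_j, a_j) estimates, assumed precomputed and detached
    epsilon:      clip range (0.25 is a placeholder, possibly decayed)
    """
    ratio = torch.exp(new_log_prob - old_log_prob.detach())
    clipped_ratio = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    return -surrogate.mean()  # ascend the surrogate by descending its negative
```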

BPPO Pseudocode

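The following Python-style sketch outlines the full procedure; every helper (behavior_cloning, fit_critic, sample_states, ppo_clip_update) is a hypothetical placeholder for the standard component described in the corresponding step above, not code released with the paper.

```python
def bppo(dataset, num_iterations, ppo_steps_per_iter, epsilon, epsilon_decay):
    # 1. Behavior cloning: supervised fit of the behavior policy on D.
    pi_beta = behavior_cloning(dataset)

    # 2. Off-policy critic: estimate Q_{pi_beta} and V_{pi_beta} from D,
    #    then form the advantage A_beta(s, a) = Q(s, a) - V(s).
    q_fn, v_fn = fit_critic(dataset)
    advantage = lambda s, a: q_fn(s, a) - v_fn(s)

    # 3. Initialize the reference and the learned policy from pi_beta.
    pi_old, pi = pi_beta.copy(), pi_beta.copy()

    for _ in range(num_iterations):
        for _ in range(ppo_steps_per_iter):
            # 4. Sample states from D and actions from pi_old, then take a
            #    clipped PPO-style gradient step on the surrogate objective.
            states = sample_states(dataset)
            actions = pi_old.sample(states)
            ppo_clip_update(pi, pi_old, states, actions,
                            advantage(states, actions), epsilon)

        # 5. Iteration: the improved policy becomes the new reference;
        #    the clip range is optionally decayed to limit drift.
        pi_old = pi.copy()
        epsilon *= epsilon_decay

    return pi
```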

Practical heuristics:

  • The advantage function is typically re-used across multiple PPO updates for stability.
  • A decay schedule for \epsilon may be implemented to restrict policy drift away from \pi_\beta.
  • Standard PPO components such as normalization and gradient clipping remain applicable.

BPPO employs the PPO algorithm in a purely offline context with these modifications:

  • Sampling: States are drawn from the fixed offline dataset and actions from the current reference policy; there is no interaction with the environment.
  • No additional regularization: No KL penalty, explicit behavior cloning regularizer, or pessimism penalty is used beyond PPO’s clipping.
  • Inherent conservatism: The PPO step-size control via clipping is sufficient to suppress the OOD overestimation, provided the dataset reflects the behavior policy’s support.

3. Comparison to Alternative Offline RL Approaches

| Algorithm | Explicit Regularizer | OOD Q Penalization | Policy Iteration Type |
|---|---|---|---|
| BPPO | None (beyond PPO clip) | None | Multiple, clipped |
| CQL | Conservative Q-penalty | Yes | Batch Q-learning |
| TD3+BC | BC penalty | No | Batch actor-critic |
| One-step RL | Single BC-constrained | Possibly (via Q) | One-step improvement |

Compared to one-step (“batch”) approaches that improve the behavior policy only once, BPPO permits repeated clipped improvements, often yielding superior returns.

4. Theoretical Properties and Guarantees

BPPO inherits a monotonic improvement guarantee in the offline setting:

J(\pi_\mathrm{new}) - J(\pi_\mathrm{old}) \geq \mathbb{E}_{s\sim\rho_{\mathcal{D}},\, a\sim\pi_\mathrm{old}}\!\left[\frac{\pi_\mathrm{new}(a|s)}{\pi_\mathrm{old}(a|s)}\, A_{\pi_\mathrm{old}}(s,a)\right] - O\!\left(\max_s D_{TV}(\pi_\mathrm{new} \,\|\, \pi_\mathrm{old})[s]\right)

Tight control over the policy update via clipping (|r_\theta(s,a) - 1| \leq 2\epsilon) ensures the error term remains bounded, which in turn guarantees that the return does not decrease after any update under bounded rewards/advantages. This surrogate bound is the offline analogue of the PPO/TRPO monotonic improvement theorem in the online RL regime.

5. Implementation and Practical Considerations

Several practical points are emphasized in the design and operation of BPPO:

  • Advantage estimation: Reuse of a single off-policy Q/V network for advantage computation across policy updates produces increased empirical stability compared to continuously updating these networks.
  • Clipping schedule: Exponential decay of \epsilon can be leveraged to ensure successive policies remain close to the empirical dataset and the initial behavior policy (see the sketch after this list).
  • Network and optimizer choices: Default PPO configurations—advantage normalization, orthogonal initialization, learning-rate decay, Adam optimizer, and gradient clipping—are applicable without further modification.
  • Data support: The offline dataset must possess adequate coverage of high-return behaviors; otherwise, restricted policy improvement may result from limited support.
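As an illustration of the clipping schedule and the advantage normalization mentioned above, here is a minimal sketch; the constants are placeholders, not the paper's exact hyperparameters.

```python
import numpy as np

def clip_range(step, eps_init=0.25, decay=0.96):
    """Exponentially decayed PPO clip range; constants are illustrative."""
    return eps_init * (decay ** step)

def normalize_advantages(adv, eps=1e-8):
    """Standard PPO-style advantage normalization over a minibatch."""
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)
```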

6. Empirical Evaluation

Experimental benchmarks were conducted on the D4RL suite, comprising Gym Locomotion, Adroit Manipulation, Kitchen, and AntMaze tasks. Average scores (over five seeds) are summarized below:

| Method | Gym Locomotion | Adroit Total | Kitchen Total |
|---|---|---|---|
| CQL | 698.5 | 93.6 | 144.6 |
| TD3+BC | 677.4 | 52.2 | 47.5 |
| Onestep RL | 684.6 | 155.2 | 65.5 |
| IQL | 692.4 | 118.1 | 159.8 |
| BC | 555.5 ± 57.2 | 131.6 ± 31.1 | 144.0 ± 18.0 |
| BPPO | 751.0 ± 21.8 | 291.4 ± 38.8 | 211.0 ± 18.0 |

On the AntMaze suite:

| Task | Best Prior (Method) | BPPO |
|---|---|---|
| Umaze-v2 | 87.5 (IQL) | 95.0 ± 5.5 |
| Umaze-diverse-v2 | 84.0 (CQL) | 91.7 ± 4.1 |
| Medium-play-v2 | 71.2 (IQL) | 51.7 ± 7.5 |
| Medium-diverse-v2 | 70.0 (IQL) | 70.0 ± 6.3 |
| Large-play-v2 | 39.6 (IQL) | 86.7 ± 8.2 |
| Large-diverse-v2 | 47.5 (IQL) | 88.3 ± 4.1 |
| Total | 378.0 | 483.3 ± 35.7 |

BPPO matches or exceeds prior algorithms across task families, achieving higher aggregate scores with a comparatively simple algorithmic recipe.

7. Relationship to Generalized and Off-Policy PPO Variants

The core principle underlying BPPO—that policy improvement can be safely and efficiently performed using clipped surrogates and conservative updates—has analogues in other off-policy PPO modifications, such as Generalized PPO with Sample Reuse (GePPO) (Queeney et al., 2021). These methods generalize the policy improvement guarantee to settings with multiple reference policies and datasets assembled from different behavior distributions. The key distinction is that BPPO utilizes a single behavior policy and dataset, while GePPO combines samples from multiple recent policies, adapting its clipping mechanism and surrogate accordingly.

Both BPPO and GePPO underscore the utility of on-policy controls (clipping) transplanted into offline or off-policy settings, preserving stability and trust-region guarantees. This convergence has prompted further interest in variants combining PPO stability with off-policy sample efficiency.


BPPO represents a succinct and theoretically justified strategy for offline RL, drawing strength from the characteristics of trust-region policy updates, and demonstrating that explicit policy regularization is not a strict prerequisite for effective offline learning when conservative updates are enforced by design (Zhuang et al., 2023).
