
Constrained Policy Optimization (CPO)

Updated 3 July 2026
  • Constrained Policy Optimization (CPO) is a reinforcement learning framework that integrates safety, fairness, and risk constraints using Constrained Markov Decision Processes.
  • It employs local linearization and trust-region quadratic programming to optimize reward while ensuring near-feasible constraint satisfaction with theoretical monotonicity guarantees.
  • Extensions like ESB-CPO, VaR-CPO, and CPO-FOAM enhance exploration, control tail risk, and enforce fairness, demonstrating strong empirical results across diverse domains.

Constrained Policy Optimization (CPO) is a class of reinforcement learning (RL) algorithms that address the optimal control of agents subject to safety or fairness constraints. By formulating the RL problem as a Constrained Markov Decision Process (CMDP) and employing trust-region policy optimization, CPO and its extensions keep each policy update near-feasible with respect to the constraints while optimizing expected return. Since its introduction, CPO has established a rigorous foundation for safe reinforcement learning and spawned a variety of algorithmic extensions targeting more effective exploration, explicit risk control, and fairness in both continuous control and real-world resource allocation.

1. Constrained Policy Optimization: Core Framework

CPO is defined for the constrained RL problem in infinite-horizon discounted CMDPs. For a policy $\pi_\theta$, the expected discounted reward is

$$J_R(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$$

and the expected discounted cost is

$$J_C(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^\infty \gamma^t c(s_t, a_t)\right].$$

The optimization objective is

$$\max_\theta J_R(\theta) \quad \text{subject to} \quad J_C(\theta) \le d,$$

where $d$ is the allowed cost threshold (Achiam et al., 2017).
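As a small illustration of the two objectives above, the following sketch estimates $J_R$ and $J_C$ by Monte Carlo from sampled trajectories and checks the cost threshold. The function names, the toy trajectories, and the value of `d` are assumptions made for illustration, and the finite trajectories truncate the infinite-horizon sums.

```python
import numpy as np

def discounted_return(signal, gamma):
    """Discounted sum of a per-step signal (reward or cost) for one trajectory."""
    signal = np.asarray(signal, dtype=float)
    discounts = gamma ** np.arange(len(signal))
    return float(np.dot(discounts, signal))

def estimate_objectives(trajectories, gamma):
    """Monte Carlo estimates of J_R and J_C from (rewards, costs) pairs."""
    j_r = np.mean([discounted_return(r, gamma) for r, _ in trajectories])
    j_c = np.mean([discounted_return(c, gamma) for _, c in trajectories])
    return float(j_r), float(j_c)

# Two toy trajectories, each given as (rewards, costs); values are made up.
trajectories = [
    ([1.0, 1.0, 1.0], [0.0, 1.0, 0.0]),
    ([1.0, 0.0, 1.0], [0.0, 0.0, 0.0]),
]
j_r, j_c = estimate_objectives(trajectories, gamma=0.5)
d = 0.3  # hypothetical cost threshold
print(j_r, j_c, j_c <= d)
```

In this toy case the estimated cost stays below the threshold, so the sampled policy would count as feasible under the constraint $J_C(\theta) \le d$.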

Rather than employing penalty methods, CPO constructs local linearizations (first-order surrogates) of both reward and cost objectives around the current policy. At each update, the following trust-region quadratic program is solved:

$$\begin{aligned} \max_x \quad & g^\top x \\ \text{s.t.}\quad & b^\top x + c \le 0, \\ & \tfrac{1}{2} x^\top H x \le \delta, \end{aligned}$$

where $x = \theta' - \theta_\mathrm{old}$, $g$ and $b$ are gradients of the surrogate reward and cost, $c = J_C(\theta_\mathrm{old}) - d$ is the current constraint violation, $H$ is the KL-divergence Hessian, and $\delta$ is the trust-region radius (Xu et al., 2023, Achiam et al., 2017). An analytic dual solution exists for the single-constraint case, and the multi-constraint case is handled efficiently with Newton-CG. Each update is finalized with a backtracking line search to enforce the KL and cost surrogate constraints.
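For the single-constraint case, the program above can be solved in closed form from its KKT conditions. The sketch below is one way to do that with dense linear algebra, not a reference implementation: the function name `cpo_step` is hypothetical, `H` is assumed positive definite, `g` and `b` are assumed nonzero and not parallel, and practical implementations typically replace the direct solves with conjugate-gradient products.

```python
import numpy as np

def cpo_step(g, b, c, H, delta):
    """Solve max g.x  s.t.  b.x + c <= 0,  0.5 * x.H.x <= delta  (one constraint)."""
    u = np.linalg.solve(H, g)  # H^{-1} g
    v = np.linalg.solve(H, b)  # H^{-1} b
    q, r, s = g @ u, g @ v, b @ v

    # Case 1: the pure trust-region step already satisfies the linearized
    # cost constraint, so the constraint is inactive.
    x = np.sqrt(2.0 * delta / q) * u
    if b @ x + c <= 0.0:
        return x

    # Case 2: the constraint cannot be met strictly inside the trust region;
    # take the step that decreases the linearized cost the most.
    if c * c / s >= 2.0 * delta:
        return -np.sqrt(2.0 * delta / s) * v

    # Case 3: both constraints are active; the multipliers follow from
    # stationarity, b.x + c = 0, and 0.5 * x.H.x = delta.
    lam = np.sqrt((q - r * r / s) / (2.0 * delta - c * c / s))
    nu = (r + c * lam) / s
    return (u - nu * v) / lam

H = np.eye(2)
print(cpo_step(np.array([1.0, 1.0]), np.array([0.0, 1.0]), 0.0, H, 0.5))
```

In the example call the unconstrained step would increase the linearized cost, so the solver returns a step along the first coordinate only, which lies on both the trust-region boundary and the constraint boundary.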

2. Theoretical Guarantees and Monotonicity

CPO provides per-iteration worst-case guarantees for both reward improvement and constraint satisfaction. For a trust-region step size $\delta$, each policy update satisfies a lower bound on the change in expected reward and an upper bound on the expected cost of the new policy. Both bounds depend on the reward (cost) discount factor and on a term that bounds the magnitude of the corresponding advantage (Tangri et al., 30 Jan 2026, Achiam et al., 2017). These error terms arise from the trust-region linearization and ensure that, provided $\delta$ is sufficiently small, the algorithm exhibits monotonic reward improvement and controls constraint violations throughout training.
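For orientation, the single-discount form of these guarantees, as stated for the original algorithm by Achiam et al. (2017), can be written as follows. The symbols $\pi_k$, $\epsilon_R$, and $\epsilon_C$ follow that paper's conventions and are assumptions here; variants that use separate reward and cost discounts may differ in detail.

```latex
% Worst-case reward change after one CPO update with trust-region radius \delta
J_R(\pi_{k+1}) - J_R(\pi_k) \;\ge\; -\frac{\sqrt{2\delta}\,\gamma\,\epsilon_R^{\pi_{k+1}}}{(1-\gamma)^2},
\qquad
\epsilon_R^{\pi_{k+1}} = \max_s \left| \mathbb{E}_{a\sim\pi_{k+1}}\!\left[ A_R^{\pi_k}(s,a) \right] \right|

% Worst-case cost of the updated policy
J_C(\pi_{k+1}) \;\le\; d + \frac{\sqrt{2\delta}\,\gamma\,\epsilon_C^{\pi_{k+1}}}{(1-\gamma)^2},
\qquad
\epsilon_C^{\pi_{k+1}} = \max_s \left| \mathbb{E}_{a\sim\pi_{k+1}}\!\left[ A_C^{\pi_k}(s,a) \right] \right|
```

Both error terms scale with $\sqrt{\delta}$, which is why shrinking the trust region tightens the guarantees.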

3. Extensions for Safe Exploration: ESB-CPO

Standard CPO enforces the cost constraint strictly at every step, which may hinder exploration by disallowing informative but transiently unsafe transitions. The Extra Safety Budget extension (ESB-CPO) mitigates this by introducing a decaying slack variable ("extra safety budget") that relaxes the constraint in early training. The relaxed constraint is built from a Lyapunov-based modified cost advantage and an adaptively scheduled parameter that controls the exploration/safety tradeoff (Xu et al., 2023).

The safety budget is annealed dynamically: a scheduling term increases as constraints are systematically respected, thereby shrinking the effective budget to zero. Once the slack vanishes, ESB-CPO recovers standard CPO and its guarantees.

Empirical evaluation in Safety-Gym and Bullet-Safety-Gym benchmarks shows that ESB-CPO accelerates reward learning—outperforming CPO, Lyapunov-based SPPO, TRPO-Lagrangian, and unconstrained TRPO in sample efficiency—while constraint violations converge to the prescribed limit after ~100–200 iterations (Xu et al., 2023).

4. Risk-Aware Constraints: VaR-CPO

CPO enforces constraints in expectation; however, many domains (e.g., finance, autonomous safety) require explicit control over tail risk. The Value-at-Risk Constrained Policy Optimization (VaR-CPO) algorithm directly constrains a probabilistic cost threshold: the probability that the total cost exceeds a given threshold must not exceed a tolerated violation probability (Tangri et al., 30 Jan 2026).

Because the indicator-based constraint is nondifferentiable, VaR-CPO uses the one-sided Chebyshev inequality to derive a quadratic surrogate expressed in terms of the mean and variance of the total cost. This surrogate admits efficient estimation via state augmentation and allows the constraint to be embedded in the trust-region CPO framework.
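To make the mean-variance surrogate concrete, the sketch below applies the one-sided Chebyshev (Cantelli) inequality, $P(X - \mu \ge k) \le \sigma^2 / (\sigma^2 + k^2)$ for $k > 0$, to sampled total costs. It only illustrates the inequality itself: the function names, the sample data, and the use of plain sample moments are assumptions, and the actual VaR-CPO estimator based on state augmentation is not reproduced here.

```python
import numpy as np

def cantelli_tail_bound(total_costs, threshold):
    """Upper bound on P(total cost >= threshold) from the sample mean and variance."""
    mu = float(np.mean(total_costs))
    var = float(np.var(total_costs))  # population variance (ddof=0)
    gap = threshold - mu
    if gap <= 0.0:
        # The inequality is uninformative once the mean reaches the threshold.
        return 1.0
    return var / (var + gap * gap)

def surrogate_satisfied(total_costs, threshold, epsilon):
    """Check the conservative constraint: Cantelli bound <= tolerated probability."""
    return cantelli_tail_bound(total_costs, threshold) <= epsilon

costs = [0.0, 1.0, 2.0, 3.0, 4.0]  # made-up per-episode total costs
bound = cantelli_tail_bound(costs, threshold=6.0)
print(bound, surrogate_satisfied(costs, 6.0, epsilon=0.2))
```

Because the bound holds for any distribution with the given mean and variance, satisfying the surrogate is sufficient but not necessary for the original probabilistic constraint, so it can be conservative.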

VaR-CPO achieves, on benchmarks such as IcyLake and EcoAnt, zero constraint violations in feasible regimes, robust recovery in infeasible regions, and competitive or superior reward compared to PPO, expected-cost CPO, and CVaR-regularized PPO (Tangri et al., 30 Jan 2026).

5. Fairness-Constrained Policy Optimization

CPO has also been adapted to settings with multiple fairness constraints, such as order-matching in exchange engines. In these contexts, the objective combines standard reward maximization with simultaneous satisfaction of group or individual fairness measures (Cheng et al., 7 Apr 2026).

The CPO-FOAM algorithm introduces a PID-controlled adaptive margin in the trust-region constrained QP to manage constraint satisfaction under nonstationary and stochastic conditions. The margin is computed from the observed constraint violation using proportional, integral, and derivative gains.
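A generic PID controller gives a feel for how such a margin could respond to observed violations. The sketch below is a textbook discrete PID update, not the CPO-FOAM rule itself: the class name `PIDMargin`, the gain values, and the idea of adding the returned margin to the constant term of the linearized constraint are all assumptions for illustration.

```python
class PIDMargin:
    """Textbook discrete PID controller driven by the observed constraint violation."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0   # running sum of violations
        self.previous = 0.0   # violation seen at the previous update

    def update(self, violation):
        self.integral += violation
        derivative = violation - self.previous
        self.previous = violation
        return (self.kp * violation
                + self.ki * self.integral
                + self.kd * derivative)

controller = PIDMargin(kp=1.0, ki=0.5, kd=0.1)  # hypothetical gains
margins = [controller.update(v) for v in [1.0, 0.0]]
print(margins)
```

In this toy run the margin stays positive after the violation drops to zero because the integral term remembers the earlier violation, which is the buffering behaviour a feedback-controlled margin is meant to provide.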

Additional architectural constraints—such as spectral norm projection to enforce Lipschitz fairness—further ensure individual fairness properties without explicit cost terms. Experiments on LOBSTER NASDAQ limit order book data, crypto-asset markets, and Safety-Gymnasium continuous-control tasks demonstrate that CPO-FOAM achieves superior efficiency–fairness trade-offs compared to unconstrained PPO, FIFO/Pro-rata, and Lagrangian policy optimization baselines, all while maintaining bounded transient and steady-state constraint violations (Cheng et al., 7 Apr 2026).

6. Algorithmic Summary and Implementation Practices

The CPO family of algorithms follows a staged optimization loop:

  1. Trajectory rollouts and estimation of reward/cost (or fairness/risk) advantages via GAE.
  2. Linearization of objectives and constraints, computation of Fisher information or KL Hessian.
  3. Solution of the primal QP or dual, with analytic or Newton-CG techniques.
  4. Backtracking line search for constraint satisfaction in the original surrogate.
  5. Adaptive margin or slack adjustment (in ESB-CPO/FOAM).
  6. Auxiliary critic updates for higher-order cost moments (VaR-CPO).
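Step 1 of the loop above relies on generalized advantage estimation (GAE). A minimal sketch of the standard GAE recursion follows; the same function can be applied to rewards or to costs with their respective value estimates. The function name and the assumption that `values` carries one extra bootstrap entry are illustrative choices.

```python
import numpy as np

def gae_advantages(signal, values, gamma, lam):
    """GAE for a single trajectory.

    signal: per-step rewards (or costs), length T.
    values: value estimates for s_0..s_T, length T + 1 (last entry bootstraps).
    """
    signal = np.asarray(signal, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros(len(signal))
    running = 0.0
    for t in reversed(range(len(signal))):
        # One-step temporal-difference residual.
        td = signal[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = td + gamma * lam * running
        advantages[t] = running
    return advantages

print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0))
```

With `lam=1.0` and zero value estimates, the advantages reduce to discounted returns-to-go, which makes the toy output easy to verify by hand.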

The step-size and batch-size hyperparameters, network architectures, and per-update sample budgets are matched to those in standard TRPO/CPO implementations (Achiam et al., 2017, Xu et al., 2023, Tangri et al., 30 Jan 2026).
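The backtracking line search in step 4 of the loop can be sketched generically as below. The acceptance test is passed in as a callable, since the exact checks (KL divergence, cost surrogate, reward surrogate) depend on the implementation; the function name, the shrink factor, and the toy acceptance rule are assumptions.

```python
import numpy as np

def backtracking_line_search(step, accept, shrink=0.8, max_steps=10):
    """Return the first scaled step shrink**j * step that passes accept().

    Falls back to a zero step (no policy change) if no candidate is accepted.
    """
    for j in range(max_steps):
        candidate = (shrink ** j) * step
        if accept(candidate):
            return candidate
    return 0.0 * step

# Toy acceptance rule: a quadratic stand-in for the KL constraint with radius 0.5.
proposed = np.array([2.0, 0.0])
accepted = backtracking_line_search(
    proposed, accept=lambda s: 0.5 * (s @ s) <= 0.5, shrink=0.5
)
print(accepted)
```

Returning a zero step when nothing is accepted keeps the previous policy, which is a common conservative fallback.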

7. Empirical Performance and Domain Applications

Across benchmarks and domains—robotic locomotion, safety-constrained navigation, fair resource allocation, and explicit tail risk tasks—CPO and its variants demonstrate:

  • Empirical near-saturation of constraint limits (in expectation or with prescribed probabilistic or fairness guarantees).
  • Superior reward learning efficiency compared to primal-dual and unconstrained methods.
  • Fast recovery to feasibility, bounded constraint violation amplitude, and robustness in nonstationary regimes.
  • Domain-agnostic generalization, evidenced by performance gains in financial matching, reinforcement learning safety suites, and explicit risk-averse settings (Xu et al., 2023, Tangri et al., 30 Jan 2026, Cheng et al., 7 Apr 2026, Achiam et al., 2017).

A plausible implication is that the core trust-region constraint architecture of CPO—with recently introduced slack scheduling, mean–variance risk surrogates, and feedback-controlled constraint buffering—constitutes a highly flexible and theoretically grounded approach for safe, fair, and risk-sensitive deep policy optimization.
