Reward Constrained Policy Optimization (RCPO)

Updated 9 March 2026
  • Reward Constrained Policy Optimization (RCPO) is an advanced reinforcement learning method for constrained Markov Decision Processes that maximizes cumulative rewards while enforcing constraints through Lagrangian duality.
  • RCPO employs multi-timescale stochastic approximations to update the actor, critic, and dual variable, ensuring sample-efficient convergence with adaptive penalty adjustments.
  • RCPO provides theoretical convergence guarantees and demonstrates empirical success in high-dimensional continuous control tasks, outperforming baseline methods in both reward performance and constraint satisfaction.

Reward Constrained Policy Optimization (RCPO) is an advanced methodology for reinforcement learning (RL) in constrained Markov Decision Processes (CMDPs). RCPO addresses the need for learning policies that maximize expected cumulative reward while ensuring satisfaction of complex constraints, such as risk thresholds or action budgets. Unlike traditional unconstrained RL, RCPO modifies the policy optimization process by introducing penalty signals, Lagrangian dual variables, and in robust extensions, worst-case estimations under model uncertainty. RCPO methods feature multi-timescale stochastic approximation and have theoretical convergence guarantees, with practical applicability to continuous control tasks in high-dimensional, real-world environments (Tessler et al., 2018, Sun et al., 2024).

1. Formalization of the Constrained RL Problem

The CMDP formulation considered by RCPO defines an environment as a tuple $(S, A, P, r, c, \mu, \gamma)$, where $r(s, a)$ is the reward and $c(s, a)$ is a per-step penalty. The objective is to maximize the discounted reward $J_R^\pi = \mathbb{E}^{\pi}_{s_0 \sim \mu}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$ subject to satisfying a general constraint $J_C^\pi = \mathbb{E}^{\pi}_{s_0 \sim \mu}\left[C(s_0)\right] \le \alpha$, where $C(s_0)$ could be a discounted sum or a trajectory average of penalties along an episode. The standard Lagrangian for this problem is:

$$L(\lambda, \pi) = J_R^\pi - \lambda\,(J_C^\pi - \alpha), \quad \lambda \ge 0,$$

and RCPO seeks a local saddle point:

$$\min_{\lambda \ge 0} \; \max_{\pi} \; L(\lambda, \pi).$$
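
The following minimal Python sketch illustrates these quantities; the function names and the use of plain Monte-Carlo trajectory estimates are illustrative assumptions, not the estimators used in the papers.

```python
import numpy as np

def discounted_return(signals, gamma):
    """Discounted sum of one trajectory's per-step signals (rewards or penalties)."""
    return sum(gamma**t * x for t, x in enumerate(signals))

def lagrangian_estimate(reward_trajs, penalty_trajs, lam, alpha, gamma=0.99):
    """Monte-Carlo estimate of L(lambda, pi) = J_R - lambda * (J_C - alpha).

    Also returns the constraint violation (J_C - alpha), which is the quantity
    the dual variable ascends on: lam <- max(0, lam + eta * (J_C - alpha)).
    """
    j_r = np.mean([discounted_return(tr, gamma) for tr in reward_trajs])
    j_c = np.mean([discounted_return(tc, gamma) for tc in penalty_trajs])
    return j_r - lam * (j_c - alpha), j_c - alpha
```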

When deploying under model uncertainty, the robust extension replaces the nominal kernel $P$ by an uncertainty set $U = \bigotimes_{s,a} P_s^a$, so that the expected (discounted) values become worst-case values $V_r^\pi(s) = \min_{p \in U} V_{r,p}^\pi(s)$ (and analogously for the constraint), leading to a max-min optimization (Sun et al., 2024).

2. Multi-Timescale Optimization and the RCPO Algorithm

The RCPO approach employs multi-timescale stochastic approximation to efficiently optimize both the policy and the dual variable without requiring a closed-form solution for general constraints. The main update components are:

  • Critic (value function $v$): Updated on the fastest timescale using TD errors for a surrogate penalized reward $\hat{r}_\lambda(s,a) = r(s,a) - \lambda c(s,a)$, allowing standard Bellman-based estimation.
  • Actor (policy parameters $\theta$): Updated on an intermediate timescale using a policy-gradient step with an advantage function modified by the adaptive penalty.
  • Multiplier ($\lambda$): Updated on the slowest timescale, using Monte Carlo estimates for the (possibly non-Bellman) constraint, projected onto $\mathbb{R}_{+}$.

A discounted penalty value function serves as the guiding signal and facilitates the use of TD learning for general constraints:

$$V_{C_\gamma}^{\pi}(s) = \mathbb{E}^{\pi}\!\left[\left.\sum_{t=0}^{\infty} \gamma^t c(s_t, a_t)\,\right|\, s_0 = s\right].$$

This "Penalty Value" admits Bellman recursion and is critical for enabling sample-efficient actor-critic algorithms for constrained objectives.

Key pseudocode steps (a minimal code sketch follows the list):

  1. Collect transition batches under the current policy $\pi_{\theta_k}$.
  2. Update the critic $v$ from the TD error of the penalized reward.
  3. Update the actor $\theta$ via a policy-gradient step on the penalized advantage.
  4. Update $\lambda$ by projected stochastic ascent on the constraint violation.
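
Put together, one iteration might look like the sketch below. It assumes a tabular softmax policy, uses a Monte-Carlo return-to-go in place of a learned advantage, and picks illustrative step sizes obeying the timescale ordering; it is a simplified stand-in for the published actor-critic algorithm, not a reproduction of it.

```python
import numpy as np

def rcpo_iteration(theta, v, lam, batch, alpha,
                   gamma=0.99, lr_v=0.05, lr_theta=0.01, lr_lam=0.001):
    """One sketched RCPO step.

    theta : |S| x |A| array of policy logits (softmax policy)
    v     : |S| array of value estimates for the penalized reward
    batch : list of trajectories, each a list of (s, a, r, c) tuples
    Step sizes respect the timescale ordering lr_lam << lr_theta << lr_v.
    """
    for traj in batch:
        g = 0.0                                      # penalized return-to-go
        for s, a, r, c in reversed(traj):
            g = (r - lam * c) + gamma * g
            adv = g - v[s]                           # penalized advantage (baseline = v)
            v[s] += lr_v * adv                       # critic: fastest timescale
            probs = np.exp(theta[s] - theta[s].max())
            probs /= probs.sum()
            grad_log = -probs
            grad_log[a] += 1.0                       # grad of log pi(a|s) w.r.t. theta[s]
            theta[s] += lr_theta * adv * grad_log    # actor: intermediate timescale
    # Multiplier: projected ascent on the Monte-Carlo constraint estimate (slowest timescale).
    j_c = np.mean([sum(gamma**t * c for t, (_, _, _, c) in enumerate(traj)) for traj in batch])
    lam = max(0.0, lam + lr_lam * (j_c - alpha))
    return theta, v, lam
```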

For robust settings, worst-case value functions and robust advantage estimators are recomputed at each step through projected-gradient loops over the uncertainty set (Sun et al., 2024).
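
As an illustration of the inner worst-case computation, the sketch below performs robust policy evaluation over a finite candidate set of kernels standing in for the continuous uncertainty set (the papers instead run projected-gradient loops); shapes and names are illustrative assumptions.

```python
import numpy as np

def robust_policy_evaluation(pi, kernels, r, gamma=0.99, iters=500):
    """Worst-case value of a fixed policy over an (s,a)-rectangular uncertainty set.

    pi      : |S| x |A| stochastic policy
    kernels : list of |S| x |A| x |S| candidate transition tensors
    r       : |S| x |A| per-step signal (reward or penalty)
    """
    v = np.zeros(r.shape[0])
    for _ in range(iters):
        # For each (s, a) the adversary picks the candidate kernel minimizing the backup.
        q_candidates = np.stack([r + gamma * (p @ v) for p in kernels])  # |U| x |S| x |A|
        q_worst = q_candidates.min(axis=0)
        v = (pi * q_worst).sum(axis=1)               # expected worst-case value under pi
    return v
```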

3. Theoretical Guarantees and Convergence Results

The original RCPO guarantees almost sure convergence to a local saddle point under assumptions of boundedness, feasibility of constraint minima, and timescale separation of the step sizes. For the three-timescale version, if all discounted constraint minima also satisfy the true constraint (i.e., $\Theta_\gamma \subseteq \Theta$), convergence is achieved to a feasible solution. Classical ODE arguments ensure that under appropriate step-size choices ($\sum \eta_i = \infty$, $\sum \eta_i^2 < \infty$, $\eta_1 \ll \eta_2 \ll \eta_3$), the iterates $(v_k, \theta_k, \lambda_k)$ converge almost surely (Tessler et al., 2018).
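
For concreteness, one admissible choice of schedules is sketched below; the exponents and the mapping of $\eta_1$ to the multiplier and $\eta_3$ to the critic are illustrative assumptions, not the schedules used in the papers.

```python
def step_sizes(k):
    """Schedules satisfying sum eta = inf, sum eta^2 < inf, and eta_1 << eta_2 << eta_3."""
    eta_3 = 1.0 / (k + 1) ** 0.6   # critic: fastest timescale (slowest-decaying steps)
    eta_2 = 1.0 / (k + 1) ** 0.8   # actor: intermediate timescale
    eta_1 = 1.0 / (k + 1) ** 1.0   # multiplier: slowest timescale
    return eta_1, eta_2, eta_3
```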

In robust settings, under mild assumptions (notably, availability of an $\epsilon$-accurate computation of worst-case kernels), RCPO guarantees at each iteration that the worst-case reward is non-decreasing up to a small slack, and constraint violations are strictly limited:

  • For feasible $\pi_k$, the worst-case reward and constraint changes are bounded in terms of occupancy measure ratios, Lipschitz constants, worst-case kernel errors, and the KL step size.
  • If $\pi_k$ violates the constraint, the violation is also controlled via a quadratic KL penalty.

A plausible implication is that RCPO provides reliability in high-stakes settings with model uncertainty by decoupling reward maximization and constraint feasibility at every iteration (Sun et al., 2024).

4. Robust RCPO: Model Uncertainty and Minimax Formulation

Robust Constrained Policy Optimization extends RCPO to handle model mismatch by explicitly optimizing for worst-case performance over an uncertainty set of dynamics. For each policy iteration, the algorithm estimates the most pessimistic transition kernels for both reward and utility, defines robust value functions, and solves two convex subproblems:

  1. Robust Policy Improvement: Update the policy to increase worst-case expected reward in a trust-region.
  2. Constraint Projection: Project onto the constraint-satisfying set under the worst-case kernel by minimizing KL divergence from the reward-updated policy.

These subproblems are convex in the per-state policy distribution $\pi(\cdot \mid s)$ and leverage natural-gradient or second-order optimization. Worst-case kernels are approximated with inner projected-gradient loops, and the updates use robust advantage estimators and KL divergences for sample efficiency (Sun et al., 2024).
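
The constraint-projection idea can be made concrete at a single state: projecting a categorical distribution onto a linear cost constraint by minimizing KL divergence has an exponential-tilting solution found by a one-dimensional search. This is only a single-state illustration of the projection step, not the full parametric update used in the paper; all names are hypothetical.

```python
import numpy as np

def kl_project_to_cost_constraint(p, costs, budget, tol=1e-8):
    """Solve min_q KL(q || p) s.t. sum_a q(a) * costs[a] <= budget.

    If p is already feasible it is returned unchanged; otherwise the minimizer is
    q(a) proportional to p(a) * exp(-nu * costs[a]) with nu >= 0 found by bisection.
    """
    p, costs = np.asarray(p, float), np.asarray(costs, float)
    if p @ costs <= budget:
        return p.copy()

    def tilted(nu):
        q = p * np.exp(-nu * (costs - costs.min()))  # shift for numerical stability
        q /= q.sum()
        return q

    lo, hi = 0.0, 1.0
    while tilted(hi) @ costs > budget and hi < 1e8:  # expand until feasible (if possible)
        hi *= 2.0
    while hi - lo > tol:                             # bisection on the tilting parameter
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if tilted(mid) @ costs > budget else (lo, mid)
    return tilted(hi)
```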

5. Empirical Evaluation and Applications

RCPO has been empirically validated on both discrete and continuous control domains:

  • Mars-Rover Grid-World: The agent must reach its goal while keeping its crash probability below a threshold. RCPO outperforms direct Lagrangian Monte-Carlo policy-gradient baselines in convergence speed and variance, and it reliably meets the crash constraint.
  • MuJoCo Robotics (Swimmer-v2, Walker2d-v2, Hopper-v2, HalfCheetah-v2, Ant-v2, Humanoid-v2): Under mean torque constraints (e.g., $\frac{1}{T} \sum_t \lVert a_t \rVert_\infty \leq 25\%$), RCPO produces feasible or near-feasible policies with no manual penalty tuning. Across domains (see the table below), RCPO outperforms the best fixed-penalty baselines by simultaneously achieving high reward and constraint satisfaction; a sketch of the constraint signal appears at the end of this section.
Environment      RCPO Torque (%)   RCPO Reward   Baseline Torque (%)   Baseline Reward
Swimmer-v2       24.0              72.7          30.4                  94.4 (violated)
Walker2d-v2      25.2              591.6         26–30                 266–823
Hopper-v2        26.0              1138.5        n/a                   n/a
HalfCheetah-v2   26.7              1547.1        n/a                   n/a
Ant-v2           15.2              1031.5        n/a                   n/a
Humanoid-v2      24.3              606.1         n/a                   n/a

This demonstrates RCPO's ability to scale to high-dimensional settings while adaptively meeting constraint requirements (Tessler et al., 2018).
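
For reference, the torque constraint signal from the MuJoCo experiments can be computed per trajectory as below (a minimal sketch, assuming actions are normalized so that 100% torque corresponds to magnitude 1; names are illustrative):

```python
import numpy as np

def mean_max_torque(actions):
    """Average over time of the per-step maximum absolute torque.

    actions : array of shape (T, action_dim), assumed normalized to [-1, 1].
    The constraint requires this quantity to stay below the threshold, e.g. 0.25.
    """
    actions = np.asarray(actions)
    return np.abs(actions).max(axis=1).mean()

# Example usage against the 25% budget (trajectory_actions is hypothetical data):
# feasible = mean_max_torque(trajectory_actions) <= 0.25
```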

6. Relationship to Other Constrained and Robust RL Methods

Standard constrained RL algorithms, such as CPO or CRPO, usually assume knowledge of the nominal kernel and only guarantee constraint satisfaction with respect to that model. They do not account for policy-dependent model mismatch. Classical robust RL methods often combine robust value estimates with non-robust policy gradients, lacking worst-case monotonic improvement guarantees.

RCPO, in contrast, recomputes worst-case kernels at every iteration, maintains a robust performance-difference bound within a local KL neighborhood, and ensures dynamic feasibility even as policy and worst-case kernels co-evolve. A practical implication is that RCPO applies to continuous and large-scale spaces with parametric policy and kernel estimation, a capability not matched by most previous tabular- or static-robust RL algorithms (Sun et al., 2024).

7. Significance and Practical Impact

RCPO represents a critical advance for RL in settings where constraint violations are unacceptable or where environment dynamics are uncertain. Multi-timescale optimization, adaptive penalties, and robust kernel estimation together permit scalable, reliable deployment to physical systems. RCPO's theoretical guarantees and empirical effectiveness in continuous control and safety-critical domains highlight its relevance for both academic research and real-world RL applications (Tessler et al., 2018, Sun et al., 2024).
