Constrained Reinforcement Learning

Updated 2 July 2026

Constrained reinforcement learning is a class of sequential decision-making methods that maximize rewards while enforcing explicit safety, cost, or risk constraints in Markov decision processes.
It employs Lagrangian relaxation and primal–dual techniques to integrate constraint satisfaction directly into the learning process.
CRL has practical applications in robotics, autonomous systems, and resource allocation by ensuring safe exploration and robust performance under uncertainty.

Constrained reinforcement learning (CRL) is a class of sequential decision-making methods that optimize expected reward subject to explicit constraints on cost, risk, or behavior within Markov decision processes (MDPs) or their variants. The need for control under safety, resource, or regulatory requirements arises in domains such as robotics, autonomous systems, resource allocation, and operations research. CRL extends standard reinforcement learning (RL) methodologies by enforcing constraints over trajectory-level or state-action-level quantities, and has developed a suite of algorithmic and theoretical machinery distinct from unconstrained RL.

1. Foundational Formulations and Problem Classes

CRL is typically formalized as a constrained Markov decision process (CMDP) $(\mathcal{S}, \mathcal{A}, P, r, \{c_i\}_{i=1}^m, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $P$ is the transition kernel, $r$ is the reward function, each $c_i$ encodes a constraint cost, and $\gamma \in (0,1)$ is the discount factor. For an infinite-horizon discounted setup and a stationary policy $\pi$ (written $\pi_\theta$ when parameterized), the objectives are

$J_R(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right],\qquad J_{C_i}(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t c_i(s_t, a_t)\right].$

The classical risk-neutral CMDP seeks

$\max_\pi J_R(\pi) \quad \text{s.t.}\quad J_{C_i}(\pi) \le d_i,\; i=1, \dots, m,$

where $d_i$ is the budget for the $i$-th constraint.
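As a concrete reading of these definitions, the following minimal sketch estimates $J_R$ and a single $J_C$ by Monte Carlo from finite trajectories and checks the budget. The function names, the toy trajectories, and the budget value 0.8 are illustrative assumptions, and truncating to finite trajectories only approximates the infinite-horizon sums.

```python
from typing import List, Sequence, Tuple


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * values[t] for one finite trajectory."""
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for v in reversed(values):
        total = v + gamma * total
    return total


def estimate_objectives(
    trajectories: List[Tuple[Sequence[float], Sequence[float]]], gamma: float
) -> Tuple[float, float]:
    """Monte Carlo estimates of (J_R, J_C) from (rewards, costs) pairs."""
    n = len(trajectories)
    j_r = sum(discounted_sum(rewards, gamma) for rewards, _ in trajectories) / n
    j_c = sum(discounted_sum(costs, gamma) for _, costs in trajectories) / n
    return j_r, j_c


# Two hypothetical trajectories, each given as (per-step rewards, per-step costs).
trajs = [
    ([1.0, 1.0, 1.0], [0.0, 1.0, 0.0]),
    ([1.0, 0.0, 2.0], [1.0, 0.0, 0.0]),
]
j_r, j_c = estimate_objectives(trajs, gamma=0.5)
feasible = j_c <= 0.8  # 0.8 plays the role of the budget d
```

With these toy numbers the estimated cost stays below the assumed budget, so the policy that generated the trajectories would count as feasible under this estimate.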

This discounted cost formulation is the default, but recent works extend CRL to average-reward settings (Aggarwal et al., 2024), quantile/chance constraints (Jung et al., 2022), persistent safety (reachability) (Yu et al., 2022), and density-based constraints (Qin et al., 2021).

2. Lagrangian and Primal–Dual Approaches

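The dominant methodology solves CMDPs via Lagrangian relaxation. Introducing nonnegative multipliers $\lambda \in \mathbb{R}^m_+$, one forms the Lagrangian

$\mathcal{L}(\pi, \lambda) = J_R(\pi) - \sum_{i=1}^m \lambda_i \left(J_{C_i}(\pi) - d_i\right).$

The saddle-point problem $\max_\pi \min_{\lambda \ge 0} \mathcal{L}(\pi, \lambda)$ is approached by alternating stochastic (sub-)gradient updates to policy and multipliers, in either policy parameter space or occupancy measure space (Zheng et al., 2022). The dual update typically takes the form

$\lambda_i \leftarrow \left[\lambda_i + \eta_\lambda \left(J_{C_i}(\pi) - d_i\right)\right]_+,\quad i = 1, \dots, m,$

where $\eta_\lambda > 0$ is the dual step size and $[\cdot]_+$ denotes projection onto the nonnegative reals, while the policy update minimizes the Lagrangian-regularized RL loss $-\mathcal{L}(\pi_\theta, \lambda)$.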
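The alternating scheme can be illustrated on a deliberately tiny problem. The sketch below assumes a one-state CMDP with a risky action (reward 1, cost 1) and a safe action (reward 0, cost 0), so that both the return and the cost equal the probability `x` of the risky action. The budget 0.3, the step size, and all names are hypothetical choices. Because this toy Lagrangian is bilinear, the individual iterates tend to cycle around the saddle point, so the sketch reports time-averaged iterates.

```python
def clip(value: float, low: float, high: float) -> float:
    """Project a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def primal_dual(budget: float = 0.3, step: float = 0.1, iterations: int = 20000):
    """Alternating projected primal-dual updates on a one-state toy CMDP.

    x is the probability of the risky action, so J_R = x and J_C = x.
    Returns the time averages of x and of the multiplier lam.
    """
    x, lam = 0.5, 0.0
    x_sum = 0.0
    lam_sum = 0.0
    for _ in range(iterations):
        # Primal ascent on L(x, lam) = x - lam * (x - budget); dL/dx = 1 - lam.
        x = clip(x + step * (1.0 - lam), 0.0, 1.0)
        # Dual step: raise lam while the cost x exceeds the budget, keep lam >= 0.
        lam = max(0.0, lam + step * (x - budget))
        x_sum += x
        lam_sum += lam
    return x_sum / iterations, lam_sum / iterations


avg_x, avg_lam = primal_dual()
```

In this toy problem the saddle point is at `x = 0.3` and `lam = 1`, and the averaged iterates approach it. Deep CRL implementations replace the exact gradients used here with stochastic estimates from sampled trajectories.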

State augmentation with Lagrange multipliers (Calvo-Fullana et al., 2021, Jiang et al., 2023) addresses exploration and credit-assignment pathologies not solved by single-step reward-shaping. By making the dual variables part of the state, one guarantees constraint satisfaction across visits and supports rigorous convergence.

Specialized primal–dual algorithmic frameworks include:

Two-timescale (actor–critic, critic–dual) updates for stable convergence (Hu et al., 2023, Zheng et al., 2022).
Mirror descent and regularized saddle-flow dynamics (Zheng et al., 2022), which provide almost-sure convergence to a saddle point without requiring averaging or history mixing over iterates.
Occupancy-measure-based LP formulations for tabular and ergodic settings (Aggarwal et al., 2024, Zheng et al., 2022).
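For reference, the occupancy-measure view of the discounted CMDP can be written as a linear program. The display below is the standard textbook form rather than the formulation of any one cited work. The initial-state distribution $\rho$ and the unnormalized discounted occupancy measure $\mu(s,a) = \sum_{t=0}^\infty \gamma^t \Pr(s_t = s, a_t = a)$ are notation introduced here for illustration.

$\max_{\mu \ge 0} \sum_{s,a} \mu(s,a)\, r(s,a) \quad \text{s.t.}\quad \sum_{s,a} \mu(s,a)\, c_i(s,a) \le d_i,\; i = 1, \dots, m, \qquad \sum_{a} \mu(s',a) = \rho(s') + \gamma \sum_{s,a} P(s' \mid s,a)\, \mu(s,a) \;\; \forall s' \in \mathcal{S}.$

Under this parameterization both the objective and the constraints are linear in $\mu$, and a stationary policy is recovered as $\pi(a \mid s) = \mu(s,a) / \sum_{a'} \mu(s,a')$ wherever the denominator is positive.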

3. Advanced Constraint Types and Safety Specifications

Distributional and Risk-Based Constraints

Methods have generalized CRL beyond expectation constraints:

Quantile-constrained RL explicitly limits the probability that cumulative costs exceed a safety threshold, by directly constraining the $(1-\epsilon)$-quantile of the cost distribution (so that the threshold is exceeded with probability at most $\epsilon$) and utilizing Lagrange multipliers for quantile constraints (Jung et al., 2022). Distributional RL and large-deviation principles provide tractable quantile and tail estimation.
State-density constraints encode safety/resource via occupancy measures, using duality between densities and Q-functions (Qin et al., 2021). This enables direct, state-local constraint specification (e.g., no more than $k$ agents in a region).
Reachability/safety value functions characterize persistence (invariance of the safe set) via max-min reachability equations, so constraint satisfaction means safety along the entire trajectory (Yu et al., 2022). Multipliers are often state-indexed to enforce local satisfaction.
Barrier and log-barrier methods impose constraints by adding a (smoothed) barrier penalty to the actor loss (as in CSAC-LB), replacing Lagrange multipliers by a single tunable smoothing parameter and achieving state-dependent, adaptive constraint penalization (Zhang et al., 2024).
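To make the barrier idea concrete, the sketch below implements one common smoothed form: a log barrier on the constraint slack $z = J_C - d$ that is continued linearly once $z$ gets close to or beyond the boundary, so the penalty stays finite and differentiable for infeasible policies. Whether this matches the exact penalty used in CSAC-LB is an assumption. The function name and the smoothing parameter `t` are illustrative.

```python
import math


def smoothed_log_barrier(z: float, t: float) -> float:
    """Smoothed log-barrier penalty for a constraint written as z <= 0.

    For z <= -1/t^2 this is the usual barrier -log(-z)/t. Beyond that point
    it continues as a straight line with slope t, chosen so that both the
    value and the derivative match at z = -1/t^2.
    """
    if t <= 0.0:
        raise ValueError("smoothing parameter t must be positive")
    threshold = -1.0 / (t * t)
    if z <= threshold:
        return -math.log(-z) / t
    return t * z - math.log(1.0 / (t * t)) / t + 1.0 / t


t = 2.0
deep_inside = smoothed_log_barrier(-1.0, t)     # well within the budget
at_switch = smoothed_log_barrier(-0.25, t)      # switch point z = -1/t^2
on_boundary = smoothed_log_barrier(0.0, t)      # constraint exactly tight
violated = smoothed_log_barrier(0.5, t)         # budget exceeded
# Value the linear branch would give at the switch point, for comparison.
linear_at_switch = t * (-0.25) - math.log(1.0 / (t * t)) / t + 1.0 / t
```

The penalty grows monotonically as the slack moves from feasible to violated, and a larger `t` makes the penalty sharper near the boundary. This is the sense in which a single smoothing parameter takes over the role of the multiplier.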

Robustness to Model Uncertainty

Robust CRL augments CMDP models with adversarial uncertainty over transition dynamics (Wang et al., 2022, Sun et al., 2024). In robust constrained RL, the reward is optimized and the constraints must hold in the worst-case MDP within a specified uncertainty set. Primal–dual algorithms update both policy and adversarial models, yielding worst-case performance guarantees under modeled mismatch.
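The worst-case requirement can be illustrated with a finite uncertainty set. The sketch below assumes a fixed policy whose induced Markov chain has a safe state 0 (reward 1, cost 0) and an absorbing hazard state 1 (reward 0, cost 1). The two candidate transition matrices, the budget 0.5, and all names are hypothetical. The code evaluates the policy under each model and takes the lowest return and the highest cost.

```python
from typing import List


def evaluate(P: List[List[float]], signal: List[float], gamma: float,
             iters: int = 200) -> List[float]:
    """Iterative policy evaluation: v = signal + gamma * P v."""
    n = len(signal)
    v = [0.0] * n
    for _ in range(iters):
        v = [signal[s] + gamma * sum(P[s][s2] * v[s2] for s2 in range(n))
             for s in range(n)]
    return v


gamma = 0.5
reward = [1.0, 0.0]  # per-state reward under the fixed policy
cost = [0.0, 1.0]    # per-state constraint cost
budget = 0.5         # assumed budget d on the discounted cost from state 0

# Finite uncertainty set: probability of staying safe is 0.8 or 0.6.
nominal = [[0.8, 0.2], [0.0, 1.0]]
perturbed = [[0.6, 0.4], [0.0, 1.0]]
models = [nominal, perturbed]

# Worst case from the start state 0: lowest return, highest cost over the set.
worst_reward = min(evaluate(P, reward, gamma)[0] for P in models)
worst_cost = max(evaluate(P, cost, gamma)[0] for P in models)

nominal_feasible = evaluate(nominal, cost, gamma)[0] <= budget
robust_feasible = worst_cost <= budget
```

Here the policy meets the budget under the nominal model but not under the perturbed one, which is the kind of gap that robust formulations are meant to expose.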

Behavior Specification and Inverse Constraint Learning

CRL has also been adopted for modular behavioral specification via binary indicator costs and explicit frequency constraints, reducing the need for reward engineering (Roy et al., 2021). Inverse constraint RL further seeks to infer (and enforce) unknown constraints from expert demonstrations, including mechanisms for confidence calibration and reporting sufficiency of expert data for safety (Subramanian et al., 2024).

Constrained Exploration and Guaranteeing Feasibility

Recent approaches formalize correctness of exploration under behavioral/temporal-logic specifications, using automata-based supervisors to dynamically limit behavior and providing necessary and sufficient conditions for optimality preservation (Chen, 2023). Model-free recovery methods (CERES) learn action-space constraints by labeling safe/unsafe states from experience and project actions onto this feasible set (Pham et al., 2018).

4. Methodological Innovations and Algorithmic Frameworks

Key algorithmic components (across many works) include:

Adaptive Lagrangian penalty coefficients and dual timescale separation for stability and constraint satisfaction, with various normalization and buffer averaging schemes (Hu et al., 2023, Roy et al., 2021).
Evolutionary Constrained RL (ECRL) merges evolutionary exploration and RL learning, balancing constraint violation and reward via stochastic ranking, per-actor multipliers, and a constraint buffer—addressing instability and reward-constraint tradeoff not resolvable by naive reward shaping (Hu et al., 2023).
Invalid action masking and scenario-based programming inject expert knowledge or symbolic rules directly into the agent's policy by blocking transitions leading to constraint violations (Hu et al., 2023, Corsi et al., 2022).
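Invalid action masking is simple to sketch for a discrete policy. The example below assumes the agent outputs one logit per action and that a boolean validity mask is supplied externally, for instance by a rule set or a scenario-based program. The function name and the example values are illustrative.

```python
import math
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], valid: Sequence[bool]) -> List[float]:
    """Softmax over the valid actions only; invalid actions get probability 0."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    top = max(l for l, ok in zip(logits, valid) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, valid)]
    total = sum(exps)
    return [e / total for e in exps]


# Action 1 is assumed to lead to a constraint violation and is masked out.
probs = masked_softmax([2.0, 1.0, 0.5], [True, False, True])
```

Because the masked action receives exactly zero probability, it is never sampled, and the remaining probabilities are renormalized over the permitted actions.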

Representative pseudocode from ECRL and other frameworks highlights the integration of constraint computation, replay buffers, per-actor dual updates, and both RL and evolutionary improvement operators (Hu et al., 2023).

5. Theoretical Guarantees and Analysis

Contemporary CRL analyses provide:

Error bounds on near-optimality/feasibility for primal–dual methods with imperfect policy updates (Qin et al., 2021).
Convergence rate guarantees for both average-case and last-iterate constraint violations and value gaps, including explicit non-asymptotic bounds (Ding et al., 2023).
Sample complexity and regret for tabular and ergodic settings, both model-based (optimism/posterior-sampling) and model-free (policy gradient), with proofs matching the best-known rates in tabular settings, and tractable regret/violation bounds in large or linear problems (Aggarwal et al., 2024, Miryoosefi et al., 2021).
Conditions for optimality preservation in constrained exploration, via the covering property of the supervisor automaton (Chen, 2023).
Safety set and reachability analysis, including guarantees on the largest feasible set and persistent safety under multi-timescale stochastic approximation (Yu et al., 2022).

In robust and resilient CRL, new equilibrium definitions incorporate constraint relaxation costs, and algorithms balance reward maximization against the explicit “price” of relaxing infeasible requirements, yielding principled tradeoff solutions (Ding et al., 2023).

6. Practical Impact, Empirical Results, and Limitations

Across continuous-control benchmarks (e.g., MuJoCo, Safety-Gym), robotic navigation, and industrial scheduling, CRL methods reliably enforce constraints while attaining high reward (Hu et al., 2023, Zhang et al., 2024, Hu et al., 2023). The ECRL framework outperforms both naive evolutionary extensions and standard Lagrangian approaches, particularly in tasks with tight or conflicting constraints. Scenario-based programming, masking, and direct indicator-constraint specification accelerate constraint satisfaction in real-world robotics (Hu et al., 2023, Corsi et al., 2022). Density-based methods attain sample-efficient, strict constraint satisfaction and improved expressivity (Qin et al., 2021).

Common limitations and ongoing challenges include hyperparameter sensitivity (multipliers, population size, buffer sizes), constraint specification (overly aggressive or overly conservative per-step or density constraints), formal convergence proofs under imperfect optimization, and scaling to high-dimensional or hybrid discrete-continuous constraint domains (Hu et al., 2023, Qin et al., 2021). Model mismatch and robust generalization remain principal difficulties for real-world deployment (Sun et al., 2024).

7. Extensions and Future Directions

Emerging themes and open directions include:

Multi-objective and multi-constraint tradeoffs (resilient CRL) (Ding et al., 2023).
Model uncertainty and distribution shift—robust CRL in continuous domains (Sun et al., 2024, Wang et al., 2022).
Inverse-constrained and confidence-aware learning from expert data (Subramanian et al., 2024).
Generalization beyond discounting: average-reward formulations, weakly communicating MDPs (Aggarwal et al., 2024).
Algorithmic reductions of constraint specification and tuning, via direct density, occupancy, or symbolic automaton constraints (Qin et al., 2021, Chen, 2023).
Theoretical and computational development of safe, optimal RL algorithms that integrate symbolic, learned, and estimated constraints in large-scale, partially observed, or adversarially perturbed settings.

The CRL field continues to integrate advances from optimization, formal methods, stochastic control, and deep RL, addressing both theoretical and practical boundaries for safe, high-performance autonomous decision-making.