Feasible Actor-Critic: Statewise Safe RL

Updated 20 May 2026
  • FAC is a model-free constrained reinforcement learning algorithm that enforces safety by ensuring per-state cost limits are met.
  • It employs state-dependent Lagrange multipliers, learned by a dedicated multiplier network alongside reward and cost Q-networks, to balance reward optimization with strict safety constraints.
  • Empirical results in robotic locomotion and Safety-Gym benchmarks demonstrate that FAC reliably adheres to safety limits, outperforming expectation-based methods.

Feasible Actor-Critic (FAC) is a model-free constrained reinforcement learning (RL) algorithm designed to ensure statewise safety—guaranteeing that prescribed safety constraints are satisfied for each individual initial state, rather than merely on average across a distribution. FAC addresses the limitations of existing expectation-based safe RL methods, which can expose system states to risk even while satisfying aggregate constraints, and is applicable in domains where strict, per-state safety assurance is critical (Ma et al., 2021).

1. Statewise Safety: Feasibility Definitions and Constraints

Constrained RL considers the maximization of expected cumulative reward subject to constraints on cost signals. In FAC, a sharp distinction is drawn between feasible and infeasible states:

  • For a stochastic policy $\pi$, the cost value from a state $s$ is

$$v_C^\pi(s) = \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^\infty \gamma_c^t c(s_t, a_t) \,\Big|\, s_0 = s \right],$$

where $r$ denotes reward, $c$ denotes cost, and $\gamma$, $\gamma_c$ are discount factors.

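  • Given a limit $d$, a state $s$ is feasible under $\pi$ if $v_C^\pi(s) \le d$ and violated otherwise.
  • Some states, denoted $\mathcal{S}_I$, are inherently unsafe (or infeasible): no policy can keep their cost value within $d$,

$$\mathcal{S}_I = \{\, s \mid v_C^\pi(s) > d,\ \forall \pi \,\},$$

and the complement $\mathcal{S}_F = \mathcal{S} \setminus \mathcal{S}_I$ forms the feasible region.

  • Restricting to an initial-state support $\mathcal{I}$, the set of potentially feasible initial states is $\mathcal{I}_F = \mathcal{I} \cap \mathcal{S}_F$.
  • The core statewise safety constraint enforced by FAC is

$$v_C^\pi(s) \le d, \quad \forall s \in \mathcal{I}_F.$$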
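To make the distinction concrete, the following minimal Python sketch evaluates a truncated discounted cost sum for two hypothetical initial states under a fixed policy and contrasts the per-state check with an expectation-based one. The cost sequences, the limit `d = 1.0`, the state names, and the uniform initial-state weights are illustrative assumptions, not values from Ma et al. (2021).

```python
def discounted_cost(costs, gamma_c):
    """Finite-horizon stand-in for the cost value: sum_t gamma_c**t * c_t."""
    return sum((gamma_c ** t) * c for t, c in enumerate(costs))


def statewise_feasible(cost_values, d):
    """Per-state check: True for each state whose cost value is <= d."""
    return {s: v <= d for s, v in cost_values.items()}


def expectation_feasible(cost_values, weights, d):
    """Expectation-based check: weighted mean of cost values is <= d."""
    mean_cost = sum(weights[s] * v for s, v in cost_values.items())
    return mean_cost <= d


# Hypothetical per-step costs observed from two initial states (assumed data).
rollouts = {"s_a": [0.0, 0.0, 0.0], "s_b": [1.0, 1.0, 0.0]}
gamma_c, d = 0.9, 1.0

cost_values = {s: discounted_cost(c, gamma_c) for s, c in rollouts.items()}
weights = {"s_a": 0.5, "s_b": 0.5}  # assumed uniform initial-state distribution

per_state = statewise_feasible(cost_values, d)
on_average = expectation_feasible(cost_values, weights, d)
```

In this toy setting the average cost (0.95) passes the expectation-based check while `s_b` alone (1.9) exceeds the limit, which is the failure mode a statewise constraint is meant to rule out.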

2. Optimization Formulation: Statewise Constrained RL

FAC formalizes the learning problem as maximization of expected discounted reward,

$$\max_\pi \ \mathbb{E}_{s\sim d_0(s)}\left[ v^\pi(s) \right]$$

subject to an infinite family of statewise constraints

$$v_C^\pi(s) \le d, \quad \forall s \in \mathcal{I}_F.$$

To enable stochastic optimization, a state-dependent Lagrange multiplier $\lambda(s) \ge 0$ is introduced, yielding the statewise Lagrangian

$$\mathcal{L}_{\text{stw}}(\pi, \lambda) = \mathbb{E}_{s\sim d_0(s)}\left[ -v^\pi(s) + \lambda(s)\left( v_C^\pi(s) - d \right) \right],$$

where $d_0(s)$ is the initial-state distribution, and $v^\pi(s)$ is the expected discounted reward starting at $s$.

The solution is characterized as a saddle-point problem:

$$\max_{\lambda \ge 0} \ \min_\pi \ \mathcal{L}_{\text{stw}}(\pi, \lambda),$$

recovering the solution to the constrained problem (SP) under mild conditions (Theorem 3.1 (Ma et al., 2021)).
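The saddle-point structure can be illustrated on a deliberately tiny problem. The sketch below is an assumption-laden toy, not the FAC algorithm: each of two independent states has a scalar decision `theta` with reward `theta - 0.5 * theta**2` and cost `c * theta`, and plain gradient descent-ascent is run on a per-state Lagrangian of the form "negative reward plus multiplier times constraint violation". The problem, the function name `solve_state`, and the step sizes are all hypothetical.

```python
def solve_state(c, d, lr_theta=0.05, lr_lam=0.05, steps=5000):
    """Gradient descent in theta, projected gradient ascent in lam, for one state.

    Lagrangian: L(theta, lam) = -(theta - 0.5 * theta**2) + lam * (c * theta - d).
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        grad_theta = -(1.0 - theta) + lam * c    # dL/dtheta
        grad_lam = c * theta - d                 # dL/dlam = constraint violation
        theta -= lr_theta * grad_theta           # primal descent step
        lam = max(0.0, lam + lr_lam * grad_lam)  # dual ascent, kept nonnegative
    return theta, lam


d = 0.5
# "Tight" state: cost coefficient 1.0, so the unconstrained optimum theta = 1 violates d.
theta_tight, lam_tight = solve_state(c=1.0, d=d)
# "Loose" state: cost coefficient 0.2, so theta = 1 costs 0.2 and the constraint is slack.
theta_loose, lam_loose = solve_state(c=0.2, d=d)
```

Under these assumptions the multiplier should settle near 0.5 in the tight state and stay at 0 in the loose one, mirroring how a state-dependent multiplier applies pressure only where the constraint binds.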

3. Algorithmic Components and Training

FAC operationalizes statewise optimization using the following neural function approximators:

  • Reward Q-networks: $Q_{\omega_1}(s,a)$, $Q_{\omega_2}(s,a)$ (double-Q for bias mitigation)
  • Cost Q-network: $Q_\phi^C(s,a)$
  • Stochastic policy: $\pi_\theta(a \mid s)$ (commonly a squashed Gaussian)
  • Multiplier network: $\lambda_\xi(s)$

Parameter updates proceed as follows:

  • Q-Function Updates: Reward and cost Q-functions are regressed to n-step Bellman targets with respective discount factors and loss functions; cost-Q omits entropy regularization.
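  • Policy Update: Using states sampled from the replay buffer $\mathcal{B}$ and the surrogate objective from the Lagrangian,

$$J_\pi(\theta) = \mathbb{E}_{s\sim\mathcal{B},\, a\sim\pi_\theta}\left[ \alpha \log \pi_\theta(a \mid s) - Q_\omega(s,a) + \lambda_\xi(s)\left( Q_\phi^C(s,a) - d \right) \right],$$

with gradients estimated via the reparameterization trick.

  • Multiplier Update: Dual ascent via

$$J_\lambda(\xi) = \mathbb{E}_{s\sim\mathcal{B},\, a\sim\pi_\theta}\left[ \lambda_\xi(s)\left( Q_\phi^C(s,a) - d \right) \right],$$

which is maximized over $\xi$, aligning $\lambda_\xi(s)$ with the degree of constraint violation at each state.

  • Temperature Adaptation: Entropy temperature $\alpha$ is tuned to encourage exploration, as in Soft Actor-Critic (SAC).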
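As a rough illustration of how the policy and multiplier objectives share the same constraint term, the sketch below evaluates both on a small batch using plain Python lists. It assumes the per-sample quantities (log-probability, reward Q-value, cost Q-value, multiplier output) have already been produced by the respective networks; the function names, the argument layout, and the SAC-style entropy term weighted by `alpha` are assumptions for illustration rather than the authors' code.

```python
def constraint_terms(lam, q_cost, d):
    """Per-sample multiplier * (cost Q-value - limit)."""
    return [l * (qc - d) for l, qc in zip(lam, q_cost)]


def policy_loss(log_prob, q_reward, q_cost, lam, d, alpha):
    """Batch mean of alpha*log_pi - Q + lam*(Q_C - d); minimized w.r.t. the policy."""
    penalty = constraint_terms(lam, q_cost, d)
    per_sample = [alpha * lp - q + p
                  for lp, q, p in zip(log_prob, q_reward, penalty)]
    return sum(per_sample) / len(per_sample)


def multiplier_objective(q_cost, lam, d):
    """Batch mean of lam*(Q_C - d); maximized w.r.t. the multiplier (dual ascent)."""
    penalty = constraint_terms(lam, q_cost, d)
    return sum(penalty) / len(penalty)


# Hypothetical batch of two states: the first within the limit, the second above it.
log_prob = [-1.0, -2.0]
q_reward = [3.0, 5.0]
q_cost = [0.5, 2.0]
lam = [0.0, 1.5]
d, alpha = 1.0, 0.2

j_pi = policy_loss(log_prob, q_reward, q_cost, lam, d, alpha)
j_lam = multiplier_objective(q_cost, lam, d)
```

Only the second sample contributes a penalty here, since its cost Q-value exceeds the limit and its multiplier is positive; in a real implementation these quantities would be differentiable tensors rather than floats.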

A concise pseudocode summarizing the full procedure is provided in Table 1.

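| Component | Update Rule (abbreviated) | Remarks |
|---|---|---|
| Reward Q | $\min_{\omega_i} \mathbb{E}\left[ \left( Q_{\omega_i}(s,a) - y \right)^2 \right]$ | Double-Q, off-policy |
| Cost Q | $\min_{\phi} \mathbb{E}\left[ \left( Q_\phi^C(s,a) - y_C \right)^2 \right]$ | Bellman loss, cost only |
| Policy | $\theta \leftarrow \theta - \beta_\theta \nabla_\theta J_\pi(\theta)$ | Statewise constraint term active |
| Multiplier | $\xi \leftarrow \xi + \beta_\xi \nabla_\xi J_\lambda(\xi)$ | Dual ascent |
| Target Net | $\bar{\omega} \leftarrow \tau\,\omega + (1-\tau)\,\bar{\omega}$ | Soft update |

Here $y$ and $y_C$ are the reward and cost Bellman targets, $J_\pi$ and $J_\lambda$ are the policy and multiplier objectives, $\beta_\theta$ and $\beta_\xi$ are learning rates, and $\tau$ is the soft-update rate.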
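Two of the table's rows reduce to short arithmetic. The sketch below computes an n-step Bellman target and a soft (Polyak) target-network update on plain Python values. The rate `tau`, the list-of-floats parameter representation, and the function names are illustrative assumptions, and the entropy bonus that a SAC-style reward target would add is left out, as it is for the cost Q-function.

```python
def n_step_target(signals, bootstrap_value, gamma):
    """sum_{k<n} gamma**k * signal_k + gamma**n * bootstrap_value, with n = len(signals)."""
    n = len(signals)
    discounted = sum((gamma ** k) * x for k, x in enumerate(signals))
    return discounted + (gamma ** n) * bootstrap_value


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [tau * w + (1.0 - tau) * w_bar
            for w, w_bar in zip(online_params, target_params)]


# Hypothetical 3-step cost sequence with a bootstrapped target-network estimate.
y_cost = n_step_target(signals=[1.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)

# Hypothetical flattened parameter vectors for the online and target networks.
new_target = soft_update(target_params=[0.0, 1.0], online_params=[1.0, 1.0], tau=0.1)
```

The same `n_step_target` helper would serve both Q-functions in this sketch, with rewards and $\gamma$ for the reward critic and costs and $\gamma_c$ for the cost critic.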

4. Complementary Slackness and Feasibility Detection

At optimality, the complementary slackness condition holds per-state:

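$$\lambda^*(s)\left( v_C^{\pi^*}(s) - d \right) = 0.$$

  • $\lambda^*(s) = 0$: constraint strictly inactive, $v_C^{\pi^*}(s) < d$
  • $0 < \lambda^*(s) < \infty$: active constraint, $v_C^{\pi^*}(s) = d$
  • $\lambda^*(s) \to \infty$: constraint unsatisfiable, $v_C^{\pi^*}(s) > d$

In practice, large values of $\lambda_\xi(s)$ can be thresholded to signal state infeasibility, enabling the multiplier network to function as a feasibility oracle over the state space.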
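The three cases can be turned into a simple labeling rule. The following sketch is a heuristic illustration only: the threshold `lam_max` standing in for "large" and the tolerance `eps` standing in for "zero" are hypothetical tuning parameters, not values given by the source.

```python
def classify_state(lam, lam_max=100.0, eps=1e-3):
    """Label a state from its learned multiplier value lam (expected to be >= 0)."""
    if lam < 0.0:
        raise ValueError("multipliers are expected to be nonnegative")
    if lam <= eps:
        return "inactive"    # constraint slack: cost value strictly below the limit
    if lam >= lam_max:
        return "infeasible"  # multiplier very large: treated as unsatisfiable
    return "active"          # finite positive multiplier: constraint binds


# Hypothetical multiplier outputs for three states.
labels = [classify_state(lam) for lam in (0.0, 2.5, 350.0)]
```

Because a trained multiplier network never outputs exactly zero or infinity, any such rule depends on how `eps` and `lam_max` are chosen relative to the multiplier's observed range.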

5. Theoretical Properties and Performance Guarantees

Key theorems underpinning FAC include:

  • Equivalence of Lagrangian Formulations (Theorem 3.1): Maximizing over statewise multipliers and minimizing over policies is equivalent to solving the statewise constrained problem (SP).
  • Statewise Implies Expectation-Based Feasibility (Theorem 3.2): Any policy $\pi$ satisfying the strict per-state constraints $v_C^\pi(s) \le d$ for all $s \in \mathcal{I}_F$ automatically satisfies the expectation-based constraint $\mathbb{E}_{s\sim d_0(s)}\left[ v_C^\pi(s) \right] \le d$.
  • Performance Comparison (Theorem 3.3): Under standard convexity/concavity, the optimal statewise Lagrangian is always at least as tight as the expectation-based version, guaranteeing that FAC does not degrade reward while tightening constraint satisfaction.
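The second result follows from monotonicity of expectation. As a short worked step, assuming the initial-state distribution (written $d_0$ here) places all of its mass on states where the per-state constraint holds:

```latex
\mathbb{E}_{s \sim d_0}\!\left[ v_C^\pi(s) \right]
  \;\le\; \mathbb{E}_{s \sim d_0}\!\left[ d \right]
  \;=\; d
\qquad \text{whenever } v_C^\pi(s) \le d \text{ for every } s \text{ in the support of } d_0 .
```

The converse fails in general: an average at or below $d$ can hide individual states whose cost value exceeds it.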

6. Empirical Performance and Practical Implementation

FAC's statewise formulation leads to empirical improvements in constraint adherence and interpretability. Experiments on robotic locomotion environments (HalfCheetah, Walker2d, Ant) with speed limits show that FAC strictly respects the limits across all seeds, with returns higher than or comparable to those of expectation-based methods (CPO, TRPO-Lagrangian, PPO-Lagrangian), which frequently oscillate and can violate constraints.

In Safety-Gym benchmarks (Point-Button, Car-Goal), FAC reduces the episode-cost rate well below the 10% threshold, maintaining safety even in the worst case (upper end of a 95% confidence interval). In contrast, expectation-based baselines incur approximately 50% dangerous runs, highlighting the limitations of those methods for critical applications.

FAC's multiplier network responds adaptively during policy rollouts: $\lambda_\xi(s)$ rises sharply in inherently unsafe (e.g., obstacle-boxed) regions and falls as the agent re-enters feasible areas, serving as a dynamic feasibility indicator.

Implementation in off-policy actor-critic codebases is direct: one adds a multiplier network $\lambda_\xi(s)$, augments the policy loss with the $\lambda_\xi(s)\left( Q_\phi^C(s,a) - d \right)$ term, and ascends in $\xi$ on its dual objective, yielding statewise safe policies and an ancillary infeasibility classifier.

7. Broader Context and Implications

FAC generalizes existing constrained RL by operationalizing statewise safety, bridging the gap between practical safety requirements and tractable model-free learning. While expectation-based approaches can allow frequent or catastrophic violations at the state level, FAC's state-dependent Lagrangian guarantees constraint satisfaction on all feasible initial states and uniquely signals truly infeasible configurations. This suggests substantial applicability in robotics, autonomous systems, and any domain requiring per-state safety guarantees without full environment modeling.

A plausible implication is that as RL is increasingly deployed in settings where safety must be certified per-instance, methods such as FAC that provide learned feasibility oracles and rigorous theoretical properties will become essential (Ma et al., 2021).
