Feasible Actor-Critic: Statewise Safe RL
- FAC is a model-free constrained reinforcement learning algorithm that enforces safety by ensuring per-state cost limits are met.
- It employs state-dependent Lagrange multipliers, represented by a multiplier network, together with separate reward and cost Q-networks to balance reward optimization against strict safety constraints.
- Empirical results in robotic locomotion and Safety-Gym benchmarks demonstrate that FAC reliably adheres to safety limits, outperforming expectation-based methods.
Feasible Actor-Critic (FAC) is a model-free constrained reinforcement learning (RL) algorithm designed to ensure statewise safety—guaranteeing that prescribed safety constraints are satisfied for each individual initial state, rather than merely on average across a distribution. FAC addresses the limitations of existing expectation-based safe RL methods, which can expose system states to risk even while satisfying aggregate constraints, and is applicable in domains where strict, per-state safety assurance is critical (Ma et al., 2021).
1. Statewise Safety: Feasibility Definitions and Constraints
Standard RL considers the maximization of expected cumulative reward, subject to constraints on cost signals. In FAC, a sharp distinction is drawn between feasible and infeasible states:
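- For a stochastic policy $\pi$, the reward value and cost value from a state $s$ are
$$v^\pi(s) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t r_t \,\Big|\, s_0 = s\Big], \qquad v_C^\pi(s) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma_c^t c_t \,\Big|\, s_0 = s\Big],$$
where $r_t$ denotes reward, $c_t$ denotes cost, and $\gamma$, $\gamma_c$ are discount factors.
- Given a limit $d$, a state $s$ is feasible under $\pi$ if $v_C^\pi(s) \le d$; otherwise the constraint is violated at $s$.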
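- Some states, denoted $\mathcal{S}_I$, are inherently unsafe (or infeasible): no policy can keep their cost value at or below $d$,
$$\mathcal{S}_I = \{\, s \mid v_C^\pi(s) > d \ \ \forall \pi \,\},$$
the complement $\mathcal{S}_F = \mathcal{S} \setminus \mathcal{S}_I$ forms the feasible region.
- Restricting to an initial-state support $\mathcal{I}$, the set of potentially feasible initial states is $\mathcal{I}_F = \mathcal{I} \cap \mathcal{S}_F$.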
- The core statewise safety constraint enforced by FAC is
$$v_C^\pi(s) \le d, \quad \forall s \in \mathcal{I}_F.$$
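The feasibility definitions above can be made concrete on a small example. The following is a minimal sketch, assuming a hypothetical four-state deterministic MDP (the transition table, costs, limit, and initial support are all invented for illustration and do not come from the paper). It uses value iteration on the cost signal to find the smallest achievable cost value at each state, then reads off the infeasible region, the feasible region, and the feasible initial states.

```python
# Toy deterministic MDP (hypothetical): 4 states, actions 0 = left, 1 = right.
GAMMA_C = 0.9                            # cost discount factor
D = 1.5                                  # cost limit d (assumed value)
COST = [0.0, 0.0, 1.0, 1.0]              # cost incurred at each step spent in state s
NEXT = [[0, 1], [0, 2], [1, 3], [3, 3]]  # NEXT[s][a]; state 3 is a trap
INITIAL_SUPPORT = {0, 2, 3}              # initial-state support

def min_cost_values(n_iters=500):
    """Value iteration for the smallest achievable cost value at each state."""
    v = [0.0] * len(COST)
    for _ in range(n_iters):
        v = [COST[s] + GAMMA_C * min(v[nxt] for nxt in NEXT[s])
             for s in range(len(COST))]
    return v

v_min = min_cost_values()
# A state is inherently infeasible if even the cost-minimizing policy exceeds d.
infeasible = {s for s, v in enumerate(v_min) if v > D}
feasible_region = set(range(len(COST))) - infeasible
feasible_initial = INITIAL_SUPPORT & feasible_region
print(infeasible, feasible_initial)
```

In this toy, the trap state 3 accumulates a cost value near 10 under every policy, so it is the only infeasible state, and the feasible initial states are 0 and 2.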
2. Optimization Formulation: Statewise Constrained RL
FAC formalizes the learning problem as maximization of expected discounted reward,
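$$\max_\pi\ J(\pi) = \mathbb{E}_{s\sim d_0(s)}\big[v^\pi(s)\big],$$
subject to an infinite family of statewise constraints
$$v_C^\pi(s) \le d, \quad \forall s \in \mathcal{I}_F. \tag{SP}$$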
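To enable stochastic optimization, a state-dependent Lagrange multiplier $\lambda(s) \ge 0$ is introduced, yielding the statewise Lagrangian
$$\mathcal{L}_{\mathrm{stw}}(\pi, \lambda) = \mathbb{E}_{s\sim d_0(s)}\Big[-v^\pi(s) + \lambda(s)\big(v_C^\pi(s) - d\big)\Big],$$
where $d_0(s)$ is the initial-state distribution, and $v^\pi(s)$ is the expected discounted reward starting at $s$.
The solution is characterized as a saddle-point problem:
$$\max_{\lambda \ge 0}\ \min_\pi\ \mathcal{L}_{\mathrm{stw}}(\pi, \lambda),$$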
recovering the solution to the constrained problem (SP) under mild conditions (Theorem 3.1 (Ma et al., 2021)).
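To illustrate how a per-state multiplier changes which policy minimizes the Lagrangian, here is a minimal sketch over a hypothetical finite policy set with two initial states. It assumes the statewise Lagrangian has the form "expected value of minus reward value plus multiplier times (cost value minus limit)"; the policy names, values, and limit are invented for illustration.

```python
D = 2.0                        # cost limit d (assumed)
D0 = {"A": 0.5, "B": 0.5}      # initial-state distribution
# Hypothetical policies with per-state reward value "v" and cost value "v_c".
POLICIES = {
    "fast":    {"v": {"A": 10.0, "B": 10.0}, "v_c": {"A": 3.0, "B": 0.0}},
    "careful": {"v": {"A": 6.0,  "B": 9.0},  "v_c": {"A": 1.0, "B": 0.0}},
}

def lagrangian(policy, lam):
    """Statewise Lagrangian: E_{s~d0}[-v(s) + lam(s) * (v_c(s) - d)]."""
    p = POLICIES[policy]
    return sum(D0[s] * (-p["v"][s] + lam[s] * (p["v_c"][s] - D)) for s in D0)

def best_response(lam):
    """Policy minimizing the Lagrangian for fixed multipliers."""
    return min(POLICIES, key=lambda name: lagrangian(name, lam))

def expectation_feasible(policy):
    """Average cost value over initial states is within the limit."""
    return sum(D0[s] * POLICIES[policy]["v_c"][s] for s in D0) <= D

def statewise_feasible(policy):
    """Cost value is within the limit at every initial state."""
    return all(POLICIES[policy]["v_c"][s] <= D for s in D0)

print(best_response({"A": 0.0, "B": 0.0}), best_response({"A": 3.0, "B": 0.0}))
```

Here "fast" satisfies the averaged constraint but violates the limit at state A. With zero multipliers the Lagrangian prefers "fast"; once the multiplier at A is large enough (3.0 in this toy), the minimizer switches to "careful", the statewise-feasible policy.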
3. Algorithmic Components and Training
FAC operationalizes statewise optimization using the following neural function approximators:
- Reward Q-networks: $Q_{\omega_1}(s,a)$, $Q_{\omega_2}(s,a)$ (double-Q for bias mitigation)
- Cost Q-network: $Q^C_{\omega_C}(s,a)$
- Stochastic policy: $\pi_\theta(a \mid s)$ (commonly a squashed Gaussian)
- Multiplier network: $\lambda_\xi(s)$
Parameter updates proceed as follows:
- Q-Function Updates: Reward and cost Q-functions are regressed to n-step Bellman targets with respective discount factors and loss functions; cost-Q omits entropy regularization.
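- Policy Update: Using states sampled from the replay buffer $\mathcal{B}$, the policy minimizes the surrogate objective derived from the Lagrangian,
$$J_\pi(\theta) = \mathbb{E}_{s\sim\mathcal{B},\ a\sim\pi_\theta}\Big[\alpha \log \pi_\theta(a \mid s) - Q_\omega(s,a) + \lambda_\xi(s)\big(Q^C_{\omega_C}(s,a) - d\big)\Big],$$
where $Q_\omega$ denotes the reward Q-estimate, with gradients estimated via the reparameterization trick.
- Multiplier Update: Dual ascent via
$$J_\lambda(\xi) = \mathbb{E}_{s\sim\mathcal{B},\ a\sim\pi_\theta}\Big[\lambda_\xi(s)\big(Q^C_{\omega_C}(s,a) - d\big)\Big], \qquad \xi \leftarrow \xi + \beta_\lambda \nabla_\xi J_\lambda(\xi),$$
aligning $\lambda_\xi(s)$ with the degree of constraint violation at each state.
- Temperature Adaptation: Entropy temperature $\alpha$ is tuned to encourage exploration, as in Soft Actor-Critic (SAC).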
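The policy and multiplier updates can be sketched without a deep-learning framework. The following is a minimal illustration, not the authors' implementation: it assumes a tabular multiplier stored as `softplus(xi[s])` to keep it non-negative (a parameterization chosen here for illustration), hand-written gradients, and precomputed per-sample quantities `log_pi`, `q`, and `q_c` standing in for the policy log-probability and the reward and cost Q-estimates.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); keeps the multiplier non-negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Derivative of softplus."""
    return 1.0 / (1.0 + math.exp(-x))

def policy_surrogate(batch, lam, alpha, d):
    """Batch mean of alpha * log_pi - Q + lambda(s) * (Q_c - d); the policy minimizes this."""
    total = 0.0
    for sample in batch:
        total += (alpha * sample["log_pi"] - sample["q"]
                  + lam[sample["state"]] * (sample["q_c"] - d))
    return total / len(batch)

def multiplier_ascent_step(xi, batch, d, lr):
    """One gradient-ascent step on the batch mean of softplus(xi[s]) * (Q_c - d)."""
    grads = {s: 0.0 for s in xi}
    for sample in batch:
        s = sample["state"]
        grads[s] += sigmoid(xi[s]) * (sample["q_c"] - d) / len(batch)
    return {s: xi[s] + lr * grads[s] for s in xi}

# Hypothetical batch: state s1 violates the limit d = 1, state s2 satisfies it.
batch = [
    {"state": "s1", "log_pi": -1.0, "q": 2.0, "q_c": 3.0},
    {"state": "s2", "log_pi": -2.0, "q": 1.0, "q_c": 0.0},
]
xi = {"s1": 0.0, "s2": 0.0}
lam = {s: softplus(x) for s, x in xi.items()}
loss = policy_surrogate(batch, lam, alpha=0.5, d=1.0)
xi_new = multiplier_ascent_step(xi, batch, d=1.0, lr=0.1)
```

After one step the multiplier parameter rises at the violating state and falls at the satisfied one, which is the mechanism that ties the multiplier to the degree of violation at each state. In a real implementation the multiplier would be a network over continuous states and the gradients would come from automatic differentiation.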
Table 1 gives a concise summary of the update rule for each component.
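| Component | Update Rule (abbreviated) | Remarks |
|---|---|---|
| Reward Q | $\omega_i \leftarrow \omega_i - \beta_Q \nabla_{\omega_i} J_Q(\omega_i),\ i \in \{1, 2\}$ | Double-Q, off-policy |
| Cost Q | $\omega_C \leftarrow \omega_C - \beta_Q \nabla_{\omega_C} J_{Q^C}(\omega_C)$ | Bellman loss, cost only |
| Policy $\pi_\theta$ | $\theta \leftarrow \theta - \beta_\pi \nabla_\theta J_\pi(\theta)$ | Statewise constraint term active |
| Multiplier $\lambda_\xi$ | $\xi \leftarrow \xi + \beta_\lambda \nabla_\xi J_\lambda(\xi)$ | Dual ascent |
| Target Net | $\bar{\omega} \leftarrow \rho\,\omega + (1-\rho)\,\bar{\omega}$ | Soft update |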
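
Two of the ingredients above, the n-step Bellman target and the soft target-network update, are simple enough to sketch directly. This is a minimal illustration under stated assumptions: the function names and the mixing coefficient `rho` are chosen here, parameters are flat lists of floats, and the target shown has no entropy term, which matches the cost-Q case (the reward-Q target in a SAC-style method would add one).

```python
def n_step_target(signals, bootstrap_value, gamma):
    """Discounted sum of n observed signals plus gamma**n times a bootstrap value."""
    target = bootstrap_value
    for x in reversed(signals):
        target = x + gamma * target
    return target

def soft_update(target_params, online_params, rho):
    """Move target parameters a fraction rho of the way toward the online parameters."""
    return [rho * w + (1.0 - rho) * w_bar
            for w_bar, w in zip(target_params, online_params)]

# Three observed costs, then a bootstrap from a (hypothetical) target cost-Q value.
y = n_step_target([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5)
new_target = soft_update([0.0, 1.0], [1.0, 3.0], rho=0.25)
```

With these inputs the target works out to 1 + 0.5 * (0 + 0.5 * (2 + 0.5 * 4)) = 2.0, and the soft update gives [0.25, 1.5].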
4. Complementary Slackness and Feasibility Detection
At optimality, the complementary slackness condition holds per-state:
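$$\lambda^*(s)\,\big(v_C^{\pi^*}(s) - d\big) = 0.$$
- $\lambda^*(s) = 0$: constraint strictly inactive, $v_C^{\pi^*}(s) < d$
- $0 < \lambda^*(s) < \infty$: active constraint, $v_C^{\pi^*}(s) = d$
- $\lambda^*(s) \to \infty$: constraint unsatisfiable, $v_C^{\pi}(s) > d$ for every policy $\pi$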
In practice, large values of $\lambda_\xi(s)$ can be thresholded to signal state infeasibility, enabling the multiplier network to function as a feasibility oracle over the state space $\mathcal{S}$.
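The thresholding idea can be sketched with tabular multipliers. The following is a toy under stated assumptions, not the paper's procedure: cost values are held fixed per state (as if no policy change could lower them), multipliers follow projected dual ascent clipped at zero rather than a learned network, and the state names, step size, and threshold are hypothetical.

```python
def projected_dual_ascent(cost_values, d, lr, n_steps):
    """Repeatedly step each multiplier along (cost value - d), clipping at zero."""
    lam = {s: 0.0 for s in cost_values}
    for _ in range(n_steps):
        for s, v_c in cost_values.items():
            lam[s] = max(0.0, lam[s] + lr * (v_c - d))
    return lam

def flag_infeasible(lam, threshold):
    """States whose multiplier exceeds the threshold are flagged as infeasible."""
    return {s for s, value in lam.items() if value > threshold}

# "open" can meet the limit d = 2; "boxed_in" cannot under any policy (assumed).
cost_values = {"open": 1.0, "boxed_in": 5.0}
lam = projected_dual_ascent(cost_values, d=2.0, lr=0.1, n_steps=100)
flagged = flag_infeasible(lam, threshold=10.0)
print(flagged)
```

The multiplier at the satisfiable state stays at zero, while the one at the unsatisfiable state grows by a fixed amount every step and eventually crosses any finite threshold. The threshold itself is a tuning choice that the sketch does not attempt to justify.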
5. Theoretical Properties and Performance Guarantees
Key theorems underpinning FAC include:
- Equivalence of Lagrangian Formulations (Theorem 3.1): Maximizing over statewise multipliers and minimizing over policies is equivalent to solving the infinite family of constraints (SP).
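- Statewise Implies Expectation-Based Feasibility (Theorem 3.2): Any policy $\pi$ satisfying the strict per-state constraints $v_C^\pi(s) \le d$ for all initial states $s \in \mathcal{I}$ automatically satisfies the expectation-based constraint $\mathbb{E}_{s\sim d_0(s)}\big[v_C^\pi(s)\big] \le d$.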
- Performance Comparison (Theorem 3.3): Under standard convexity/concavity, the optimal statewise Lagrangian is always at least as tight as the expectation-based version, guaranteeing that FAC does not degrade reward while tightening constraint satisfaction.
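Two short identities help make these statements concrete. They are worked here as a sketch, writing $v_C^\pi(s)$ for the cost value, $v^\pi(s)$ for the reward value, $d_0$ for the initial-state distribution, $d$ for the cost limit, and $J(\pi) = \mathbb{E}_{s\sim d_0}[v^\pi(s)]$; this is not a reproduction of the paper's proofs. First, if $v_C^\pi(s) \le d$ holds at every state in the support of $d_0$, taking expectations on both sides gives the expectation-based constraint:

$$\mathbb{E}_{s\sim d_0}\big[v_C^\pi(s)\big] \;\le\; \mathbb{E}_{s\sim d_0}[d] \;=\; d.$$

Second, assuming the statewise Lagrangian has the form $\mathbb{E}_{s\sim d_0}\big[-v^\pi(s) + \lambda(s)(v_C^\pi(s) - d)\big]$, restricting the multiplier to a constant $\lambda(s) \equiv \lambda$ gives

$$\mathbb{E}_{s\sim d_0}\Big[-v^\pi(s) + \lambda\big(v_C^\pi(s) - d\big)\Big] \;=\; -J(\pi) + \lambda\Big(\mathbb{E}_{s\sim d_0}\big[v_C^\pi(s)\big] - d\Big),$$

so the expectation-based Lagrangian is the special case of the statewise one in which the multiplier does not depend on the state.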
6. Empirical Performance and Practical Implementation
FAC’s statewise formulation leads to empirical improvements in constraint adherence and interpretability. Experiments on robotic locomotion environments (HalfCheetah, Walker2d, Ant) with speed limits show that FAC respects the limits across all random seeds while yielding returns higher than or comparable to those of expectation-based methods (CPO, TRPO-Lagrangian, PPO-Lagrangian), which frequently oscillate and can violate constraints.
In Safety-Gym benchmarks (Point-Button, Car-Goal), FAC reduces the episode-cost rate well below the 10% threshold, maintaining safety even in the worst case (upper end of a 95% confidence interval). In contrast, expectation-based baselines incur approximately 50% dangerous runs, highlighting the limitations of those methods for critical applications.
FAC’s multiplier network responds adaptively during policy rollouts: $\lambda_\xi(s)$ rises sharply in inherently unsafe (e.g., obstacle-boxed) regions and falls as the agent re-enters feasible areas, serving as a dynamic feasibility indicator.
Implementation in off-policy actor-critic codebases is direct: one adds a multiplier network $\lambda_\xi(s)$, augments the policy loss with the $\lambda_\xi(s)\big(Q^C_{\omega_C}(s,a) - d\big)$ term, and ascends in $\xi$ on its dual objective, yielding statewise safe policies and an ancillary infeasibility classifier.
7. Broader Context and Implications
FAC generalizes existing constrained RL by operationalizing statewise safety, bridging the gap between practical safety requirements and tractable model-free learning. While expectation-based approaches can allow frequent or catastrophic violations at the state level, FAC’s state-dependent Lagrangian guarantees constraint satisfaction on all feasible initial states and uniquely signals truly infeasible configurations. This suggests substantial applicability in robotics, autonomous systems, and any domain requiring per-state safety guarantees without full environment modeling.
A plausible implication is that as RL is increasingly deployed in settings where safety must be certified per-instance, methods such as FAC that provide learned feasibility oracles and rigorous theoretical properties will become essential (Ma et al., 2021).