Zero-Sum Multichain Lagrangian Formulation
- The paper introduces FAC, a framework that enforces statewise safety by using a state-dependent Lagrange multiplier for each feasible state.
- FAC extends off-policy actor-critic methods like SAC by incorporating a multiplier network, which dynamically adjusts constraint enforcement to improve safety and reward performance.
- Empirical results on robot locomotion and safe exploration tasks demonstrate FAC’s effectiveness in maintaining safety and detecting inherently infeasible states.
The Feasible Actor-Critic (FAC) algorithm is a model-free framework for constrained reinforcement learning (RL) that enforces statewise safety. Unlike conventional constrained RL methods, which apply constraints only in expectation over initial states, FAC imposes safety for every feasible initial state individually. FAC achieves this by leveraging a statewise Lagrange multiplier, parameterized by a neural network, to dynamically adapt constraint enforcement at the state level and distinguish inherently infeasible states from those for which safe policies exist (Ma et al., 2021).
1. Statewise Feasibility and Constraint Formalism
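Let $\mathcal{S}$ denote the state space and $\pi$ a stochastic policy. The cost-value function (safety critic) from any state $s\in\mathcal{S}$ under policy $\pi$ is
$v_{C}^{\pi}(s)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^\infty\gamma_c^t\,c(s_t,a_t)\;\middle|\;s_0=s\right]$
for cost function $c(s,a)$, with discount $\gamma_c\in(0,1)$. Fixing a scalar threshold $d$, a state $s$ is called feasible under $\pi$ if $v_{C}^{\pi}(s)\le d$; otherwise, it is unsafe. The inherently infeasible states, denoted $\mathcal{S}_I$, are those with $v_{C}^{\pi}(s)>d$ for all policies $\pi$; the complement, $\mathcal{S}_F=\mathcal{S}\setminus\mathcal{S}_I$, constitutes the feasible region. For a set of initial states $\mathcal{I}\subseteq\mathcal{S}$ (the initial support), the potentially feasible initial states are $\mathcal{I}_F=\mathcal{I}\cap\mathcal{S}_F$. The statewise safety constraint requires $v_{C}^{\pi}(s)\le d$ for every $s\in\mathcal{I}_F$.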
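The definitions above can be made concrete with a small Monte Carlo sketch. It assumes per-step costs from a few rollouts are already available; the function names and the example rollouts are illustrative and do not come from the paper.

```python
from typing import Sequence


def discounted_cost(costs: Sequence[float], gamma_c: float) -> float:
    """Discounted sum of per-step costs along one trajectory.

    This is a single Monte Carlo sample of the cost value v_C(s_0).
    """
    total = 0.0
    # Accumulate backwards: G_t = c_t + gamma_c * G_{t+1}
    for c in reversed(costs):
        total = c + gamma_c * total
    return total


def estimate_cost_value(rollouts: Sequence[Sequence[float]], gamma_c: float) -> float:
    """Average the discounted cost over several rollouts from the same start state."""
    returns = [discounted_cost(r, gamma_c) for r in rollouts]
    return sum(returns) / len(returns)


def is_feasible(v_c: float, d: float) -> bool:
    """A state is feasible under the policy if its cost value does not exceed d."""
    return v_c <= d


# Two hypothetical rollouts from the same initial state (costs per step).
rollouts = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
v_c = estimate_cost_value(rollouts, gamma_c=0.5)
feasible_loose = is_feasible(v_c, d=1.0)
feasible_tight = is_feasible(v_c, d=0.5)
```

In FAC itself the cost value is approximated by a learned cost critic rather than by explicit rollouts; the sketch only illustrates the quantity being thresholded.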
2. Optimization Problem and Statewise Lagrangian
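The learning objective is to maximize the expected discounted return
$J(\pi)=\mathbb{E}_{s\sim d_0}\left[v^{\pi}(s)\right],\quad v^{\pi}(s)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^\infty\gamma^t\,r(s_t,a_t)\;\middle|\;s_0=s\right]$
subject to the infinite family of constraints
$v_{C}^{\pi}(s)\le d\quad\text{for every potentially feasible initial state } s\in\mathcal{I}_F \qquad \text{(SP)}$
FAC introduces a nonnegative, state-dependent Lagrange multiplier $\lambda(s)\ge 0$, and forms the statewise Lagrangian
$\mathcal{L}(\pi,\lambda)=\mathbb{E}_{s\sim d_0}\left[-v^{\pi}(s)+\lambda(s)\left(v_{C}^{\pi}(s)-d\right)\right]$
where $d_0$ is the initial-state distribution. The corresponding saddle-point problem,
$\max_{\lambda\ge 0}\,\min_{\pi}\,\mathcal{L}(\pi,\lambda)$
ensures, under mild conditions, that the optimal policy $\pi^*$ solves (SP).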
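To illustrate how a per-state multiplier differs from a single scalar one, the following sketch evaluates both Lagrangians on a two-state toy problem. It assumes the sign convention in which the Lagrangian is the negated reward value plus the multiplier-weighted constraint violation, averaged over the initial-state distribution; the state names and all numbers are made up for illustration.

```python
from typing import Dict


def statewise_lagrangian(d0: Dict[str, float], v: Dict[str, float],
                         v_c: Dict[str, float], lam: Dict[str, float],
                         d: float) -> float:
    """E_{s~d0}[ -v(s) + lam(s) * (v_c(s) - d) ] with one multiplier per state."""
    return sum(p * (-v[s] + lam[s] * (v_c[s] - d)) for s, p in d0.items())


def expectation_lagrangian(d0: Dict[str, float], v: Dict[str, float],
                           v_c: Dict[str, float], lam: float,
                           d: float) -> float:
    """-E[v] + lam * (E[v_c] - d) with a single scalar multiplier."""
    exp_v = sum(p * v[s] for s, p in d0.items())
    exp_vc = sum(p * v_c[s] for s, p in d0.items())
    return -exp_v + lam * (exp_vc - d)


# Hypothetical two-state problem: state "b" violates the threshold d = 1.0.
d0 = {"a": 0.5, "b": 0.5}      # initial-state distribution
v = {"a": 10.0, "b": 6.0}      # reward values under some fixed policy
v_c = {"a": 0.2, "b": 1.4}     # cost values under the same policy
d = 1.0

# The statewise multiplier can penalize only the violating state.
l_stw = statewise_lagrangian(d0, v, v_c, lam={"a": 0.0, "b": 2.0}, d=d)
# A scalar multiplier only sees the average cost, which is below d here.
l_exp = expectation_lagrangian(d0, v, v_c, lam=2.0, d=d)
```

With these numbers the constraint term of the statewise Lagrangian is positive (state "b" is penalized), while the scalar version sees an average cost of 0.8 and contributes a negative term, so it exerts no pressure to fix state "b".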
3. Algorithmic Architecture and Updates
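FAC extends standard off-policy actor-critic frameworks, particularly Soft Actor-Critic (SAC), by adding a multiplier network. The key components are:
- Two reward Q-networks: $Q_{\phi_1}, Q_{\phi_2}$ (double-Q)
- Cost Q-network: $Q^{C}_{\phi_C}$ (predicts cost-to-go)
- Stochastic policy network: $\pi_\theta$ (e.g., Gaussian with $\tanh$ squashing)
- Multiplier network: $\lambda_\xi(s)\ge 0$
Each is updated as follows:
Reward Q-update:
$J_Q(\phi_i)=\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\left[\tfrac{1}{2}\left(Q_{\phi_i}(s,a)-y\right)^2\right]$
where $y=r+\gamma\left(\min_{j=1,2}Q_{\bar\phi_j}(s',a')-\alpha\log\pi_\theta(a'\mid s')\right)$ with $a'\sim\pi_\theta(\cdot\mid s')$, $\mathcal{D}$ is the replay buffer, and $\bar\phi_j$ are target-network parameters.
Cost Q-update: Mirrors reward Q, using cost $c$, discount $\gamma_c$, and no entropy penalty.
Policy update:
$J_\pi(\theta)=\mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_\theta}\left[\alpha\log\pi_\theta(a\mid s)-\min_{i=1,2}Q_{\phi_i}(s,a)+\lambda_\xi(s)\,Q^{C}_{\phi_C}(s,a)\right]$
with gradient estimated via reparameterization ($a=f_\theta(\epsilon;s)$, $\epsilon\sim\mathcal{N}(0,I)$).
Multiplier update:
$J_\lambda(\xi)=\mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_\theta}\left[\lambda_\xi(s)\left(Q^{C}_{\phi_C}(s,a)-d\right)\right]$
with dual ascent on $\xi$.
Entropy temperature $\alpha$: Adapted as in SAC by gradient update.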
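In summary, each training iteration samples a minibatch from the replay buffer, updates the reward and cost critics toward their targets, takes a gradient-descent step on the policy loss, takes a gradient-ascent step on the multiplier objective, and adapts the entropy temperature.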
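The per-sample quantities behind these updates can be sketched in plain Python. This is a scalar illustration under stated assumptions, not the authors' implementation: a real agent would compute the same expressions on minibatches with an autodiff library. The softplus parameterization of the multiplier is an assumption made here to keep it nonnegative, and all function names and numbers are hypothetical.

```python
import math


def reward_td_target(r, gamma, q1_next, q2_next, alpha, logp_next, done=False):
    """SAC-style soft target: clipped double-Q minus the entropy penalty."""
    soft_next = min(q1_next, q2_next) - alpha * logp_next
    return r + (0.0 if done else gamma * soft_next)


def cost_td_target(c, gamma_c, qc_next, done=False):
    """Cost target mirrors the reward target but has no entropy term."""
    return c + (0.0 if done else gamma_c * qc_next)


def policy_loss(alpha, logp, q1, q2, lam, qc):
    """Per-sample policy loss: entropy term, minus reward Q, plus lam(s) * cost Q."""
    return alpha * logp - min(q1, q2) + lam * qc


def softplus(x):
    """Numerically stable softplus; maps a raw network output to lam(s) > 0."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dual_ascent_step(raw, qc, d, lr):
    """One gradient-ascent step on softplus(raw) * (qc - d) with respect to raw."""
    sigmoid = 1.0 / (1.0 + math.exp(-raw))  # derivative of softplus
    return raw + lr * sigmoid * (qc - d)


# Hypothetical numbers for a single transition.
y = reward_td_target(r=1.0, gamma=0.99, q1_next=5.0, q2_next=4.0,
                     alpha=0.2, logp_next=-1.0)
y_c = cost_td_target(c=1.0, gamma_c=0.9, qc_next=2.0)
loss = policy_loss(alpha=0.2, logp=-1.0, q1=5.0, q2=4.0, lam=2.0, qc=1.5)

raw = 0.0
raw_up = dual_ascent_step(raw, qc=1.5, d=1.0, lr=0.1)    # cost above threshold
raw_down = dual_ascent_step(raw, qc=0.5, d=1.0, lr=0.1)  # cost below threshold
```

The multiplier grows when the predicted cost exceeds the threshold and shrinks otherwise, which is what ties the strength of the cost penalty in the policy loss to the local constraint violation.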
4. Complementary Slackness and Feasibility Detection
At the optimum $(\pi^*,\lambda^*)$, the statewise complementary slackness conditions apply: $\lambda^*(s)\left(v_{C}^{\pi^*}(s)-d\right)=0$, $\lambda^*(s)\ge 0$, and $v_{C}^{\pi^*}(s)\le d$ for every feasible initial state $s$. Thus, for $v_{C}^{\pi^*}(s)<d$ (strict safety), $\lambda^*(s)=0$; for $v_{C}^{\pi^*}(s)=d$ (active constraint), $\lambda^*(s)\ge 0$. If $s$ is inherently infeasible, dual ascent causes $\lambda(s)\to\infty$, so large values of $\lambda(s)$ can be used as a practical indicator of infeasibility. The trained multiplier network $\lambda_\xi$ thus serves as a learned feasibility oracle over the state space $\mathcal{S}$.
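The divergence of the multiplier on infeasible states can be seen in a tabular toy example. The sketch below holds the cost values fixed, which is a simplification: in FAC the policy is updated alongside the multiplier, and the state labelled "boxed_in" is simply assumed to exceed the threshold under every policy. The state names, learning rate, and the detection threshold of 5.0 are hypothetical.

```python
from typing import Dict


def dual_ascent(v_c: Dict[str, float], d: float,
                lr: float = 0.1, steps: int = 200) -> Dict[str, float]:
    """Projected dual ascent on a tabular multiplier with fixed cost values.

    Each step moves lam(s) along the constraint violation v_c(s) - d
    and projects back onto lam(s) >= 0.
    """
    lam = {s: 0.0 for s in v_c}
    for _ in range(steps):
        for s in lam:
            lam[s] = max(0.0, lam[s] + lr * (v_c[s] - d))
    return lam


# Hypothetical cost values: strictly safe, exactly on the threshold,
# and a state whose cost stays above the threshold.
v_c = {"safe": 0.4, "active": 1.0, "boxed_in": 1.8}
lam = dual_ascent(v_c, d=1.0)

# Flag states whose multiplier has grown past an (arbitrary) threshold.
flagged = {s for s, value in lam.items() if value > 5.0}
```

The multiplier stays at zero for the strictly safe state and grows without bound, linearly in the number of steps here, for the state that cannot satisfy the constraint, so thresholding it isolates the infeasible state.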
5. Theoretical Properties
FAC's theoretical guarantees are as follows:
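- Equivalence of scaled/unscaled Lagrangians (Theorem 3.1): Optimizing the $d_0$-weighted Lagrangian $\mathcal{L}(\pi,\lambda)$ recovers the same optimal policy $\pi^*$ as the infinite-constraint problem (SP).
- Statewise implies expectation-based feasibility (Theorem 3.2): Any policy $\pi$ satisfying $v_{C}^{\pi}(s)\le d$ for all $s$ in the initial support also satisfies $\mathbb{E}_{s\sim d_0}\left[v_{C}^{\pi}(s)\right]\le d$.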
- Performance comparison (Theorem 3.3): For the optimal statewise Lagrangian and the optimal expectation-based Lagrangian, the theorem gives a condition on the two under which FAC can only improve, or at least match, reward compared to expectation-based methods, even though it enforces tighter constraints.
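The direction of Theorem 3.2 is easy to check numerically: a weighted average of values that are each at most the threshold cannot exceed it, while the converse fails. The following sketch uses two made-up policies over a two-state initial distribution; all names and numbers are illustrative.

```python
from typing import Dict


def statewise_safe(v_c: Dict[str, float], d: float) -> bool:
    """True if every listed initial state satisfies v_c(s) <= d."""
    return all(value <= d for value in v_c.values())


def expectation_safe(v_c: Dict[str, float], d0: Dict[str, float], d: float) -> bool:
    """True if the d0-weighted average cost value is at most d."""
    return sum(p * v_c[s] for s, p in d0.items()) <= d


d0 = {"a": 0.5, "b": 0.5}
d = 1.0

# Policy A is safe at every state, hence also safe in expectation.
v_c_policy_a = {"a": 0.9, "b": 1.0}
# Policy B is safe on average (0.9) but violates the threshold at state "b".
v_c_policy_b = {"a": 0.2, "b": 1.6}

a_stw, a_exp = statewise_safe(v_c_policy_a, d), expectation_safe(v_c_policy_a, d0, d)
b_stw, b_exp = statewise_safe(v_c_policy_b, d), expectation_safe(v_c_policy_b, d0, d)
```

Policy B is the kind of solution an expectation-based method may accept: a low-cost state subsidizes an unsafe one.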
6. Empirical Evaluation
FAC was assessed on both robot locomotion tasks with speed constraints (HalfCheetah, Walker2d, Ant) and safe exploration scenarios (Safety-Gym: Point-Button, Car-Goal). Key findings are:
- In robot locomotion tasks, FAC consistently respects the speed cap and achieves higher or comparable returns versus expectation-based baselines such as CPO, TRPO-Lagrangian, and PPO-Lagrangian, which exhibit greater constraint violations and instability.
- On Safety-Gym safe exploration, FAC yields an episode-cost rate well below the 10% threshold even at the tails of 95% confidence intervals and fewer dangerous episodes at test time, while scalar-multiplier baselines suffer approximately 50% dangerous runs.
- Feasibility indication emerges: the learned multiplier $\lambda_\xi(s)$ sharply increases when the agent enters inherently infeasible regions (e.g., boxed-in by obstacles) and moderates when the agent returns to feasible zones.
7. Implementation Considerations and Practical Impact
FAC can be implemented in any off-policy actor-critic codebase by adding the multiplier network $\lambda_\xi$, modifying the policy loss with the $\lambda_\xi(s)\,Q^{C}_{\phi_C}(s,a)$ term (the multiplier times the cost critic), and updating the multiplier parameters $\xi$ by dual ascent. This yields a policy that is statewise safe on all feasible initial states and a by-product classifier of infeasible ones. The learned multiplier network gives practitioners a direct means to identify, in real time, whether specific initializations or encountered states admit any safe policy or are intrinsically unsafe (Ma et al., 2021).