Zero-Sum Multichain Lagrangian Formulation

Updated 20 May 2026
  • The paper introduces FAC, a framework that enforces statewise safety by using a state-dependent Lagrange multiplier for each feasible state.
  • FAC extends off-policy actor-critic methods like SAC by incorporating a multiplier network, which dynamically adjusts constraint enforcement to improve safety and reward performance.
  • Empirical results on robot locomotion and safe exploration tasks demonstrate FAC’s effectiveness in maintaining safety and detecting inherently infeasible states.

The Feasible Actor-Critic (FAC) algorithm is a model-free framework for constrained reinforcement learning (RL) that enforces statewise safety. Unlike conventional constrained RL methods, which apply constraints only in expectation over initial states, FAC imposes safety for every feasible initial state individually. FAC achieves this by leveraging a statewise Lagrange multiplier, parameterized by a neural network, to dynamically adapt constraint enforcement at the state level and distinguish inherently infeasible states from those for which safe policies exist (Ma et al., 2021).

1. Statewise Feasibility and Constraint Formalism

Let $\mathcal S$ denote the state space and $\pi$ a stochastic policy. The cost-value function (safety critic) from any state $s$ under policy $\pi$ is

$v_{C}^{\pi}(s)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^\infty\gamma_c^t\,c(s_t,a_t)\;\middle|\;s_0=s\right]$

for cost function $c$, with discount $\gamma_c$. Fixing a scalar threshold $d$, a state $s$ is called feasible under $\pi$ if $v_{C}^{\pi}(s)\le d$; otherwise, it is unsafe. Certain states, denoted $\mathcal S_I$, are inherently infeasible if $v_{C}^{\pi}(s)>d$ for all policies $\pi$; the complement, $\mathcal S_F=\mathcal S\setminus\mathcal S_I$, constitutes the feasible region. For a subset of initial support $\mathcal I\subseteq\mathcal S$, the potentially feasible initial states are $\mathcal I_F=\mathcal I\cap\mathcal S_F$. The statewise safety constraint requires $v_{C}^{\pi}(s)\le d$ for every $s\in\mathcal I_F$.
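As a small illustration of these definitions, the sketch below evaluates the discounted cost value by rollout on a hypothetical deterministic three-state chain and checks it against the threshold. The environment, the two policies, and the numeric values of the discount and threshold are assumptions made up for this example; they are not taken from the paper.

```python
# Toy illustration (hypothetical environment): discounted cost value and
# the feasibility test v_C^pi(s) <= d.

GAMMA_C = 0.9   # cost discount gamma_c (assumed value)
D = 1.0         # safety threshold d (assumed value)

def step(state, action):
    """Deterministic chain 0 -> 1 -> 2; state 2 is absorbing and costly."""
    if state == 2:
        return 2, 1.0            # trapped: cost 1 at every step, whatever the action
    if action == "stay":
        return state, 0.0
    return state + 1, 0.0        # "right" moves toward the costly state

def cost_value(policy, state, horizon=500):
    """Truncated discounted cost-to-go from `state` under a deterministic policy."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        state, cost = step(state, policy(state))
        total += discount * cost
        discount *= GAMMA_C
    return total

def is_feasible(policy, state):
    """A state is feasible under `policy` if its cost value does not exceed d."""
    return cost_value(policy, state) <= D

def cautious(state):
    return "stay"

def reckless(state):
    return "right"

print(is_feasible(cautious, 0))   # state 0 under the cautious policy
print(is_feasible(reckless, 0))   # state 0 under the reckless policy
print(is_feasible(cautious, 2))   # state 2 under any policy
```

In this toy chain, state 0 is feasible under the cautious policy but not under the reckless one, while state 2 incurs cost under every action and so plays the role of an inherently infeasible state.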

2. Optimization Problem and Statewise Lagrangian

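The learning objective is to maximize the expected discounted return

$\max_{\pi}\;\mathbb{E}_{s\sim d_0}\left[v^{\pi}(s)\right],\qquad v^{\pi}(s)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^\infty\gamma^t\,r(s_t,a_t)\;\middle|\;s_0=s\right]$

subject to the infinite family of constraints

$v_{C}^{\pi}(s)\le d,\quad\forall s\in\mathcal I_F.\qquad\text{(SP)}$

FAC introduces a nonnegative, state-dependent Lagrange multiplier $\lambda(s)\ge 0$, and forms the statewise Lagrangian

$\mathcal L_{stw}(\pi,\lambda)=\mathbb{E}_{s\sim d_0}\left[-v^{\pi}(s)+\lambda(s)\left(v_{C}^{\pi}(s)-d\right)\right]$

where $d_0$ is the initial-state distribution. The corresponding saddle-point problem,

$\max_{\lambda\ge 0}\;\min_{\pi}\;\mathcal L_{stw}(\pi,\lambda)$

ensures, under mild conditions, that the optimal policy $\pi^*$ solves (SP).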
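To make the difference between statewise and expectation-based constraints concrete, the following sketch evaluates a statewise Lagrangian of the form $\mathbb{E}_{s\sim d_0}[-v^{\pi}(s)+\lambda(s)(v_{C}^{\pi}(s)-d)]$ on two hypothetical initial states. All values (threshold, distribution, reward and cost values) are made up for illustration, and the function name is an assumption of this example.

```python
# Hypothetical two-state example: statewise vs. expectation-based constraints.
d = 1.0                                  # safety threshold (assumed)
d0 = {"s1": 0.5, "s2": 0.5}              # initial-state distribution (assumed)
v_reward = {"s1": 4.0, "s2": 6.0}        # v^pi(s), made-up values
v_cost = {"s1": 0.2, "s2": 1.6}          # v_C^pi(s), made-up values

def statewise_lagrangian(lam):
    """E_{s~d0}[ -v(s) + lam(s) * (v_C(s) - d) ] with one multiplier per state."""
    return sum(p * (-v_reward[s] + lam[s] * (v_cost[s] - d)) for s, p in d0.items())

# Expectation-based view: only the average cost is compared with d.
expected_cost = sum(p * v_cost[s] for s, p in d0.items())

# Statewise view: every initial state is checked on its own.
violated_states = [s for s in d0 if v_cost[s] > d]

print(expected_cost <= d)     # the averaged constraint holds
print(violated_states)        # yet one state violates its own constraint
print(statewise_lagrangian({"s1": 0.0, "s2": 0.0}))
print(statewise_lagrangian({"s1": 0.0, "s2": 2.0}))
```

Raising the multiplier only on the violating state increases the Lagrangian there, which is what lets a per-state multiplier penalize that state without tightening the others.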

3. Algorithmic Architecture and Updates

FAC extends standard off-policy actor-critic frameworks, particularly Soft Actor-Critic (SAC), by adding a multiplier network. The key components are:

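  • Two reward Q-networks: $Q_{w_1}, Q_{w_2}$ (double-Q)
  • Cost Q-network: $Q^{C}_{w_C}$ (predicts cost-to-go)
  • Stochastic policy network: $\pi_\theta$ (e.g., Gaussian with $\tanh$ squashing)
  • Multiplier network: $\lambda_\xi$

Each is updated as follows:

Reward Q-update:

$J_Q(w_i)=\mathbb{E}_{(s,a,r,s')\sim\mathcal B}\left[\tfrac12\left(Q_{w_i}(s,a)-y\right)^2\right],\quad i\in\{1,2\}$

where $y=r+\gamma\left(\min_{j\in\{1,2\}}Q_{\bar w_j}(s',a')-\alpha\log\pi_\theta(a'\mid s')\right)$ with $a'\sim\pi_\theta(\cdot\mid s')$, target parameters $\bar w_j$, and replay buffer $\mathcal B$.

Cost Q-update: Mirrors reward Q, using cost $c$, discount $\gamma_c$, and no entropy penalty.

Policy update:

$J_\pi(\theta)=\mathbb{E}_{s\sim\mathcal B,\,a\sim\pi_\theta}\left[\alpha\log\pi_\theta(a\mid s)-\min_{i}Q_{w_i}(s,a)+\lambda_\xi(s)\left(Q^{C}_{w_C}(s,a)-d\right)\right]$

with gradient estimated via reparameterization ($a=f_\theta(s;\epsilon)$).

Multiplier update:

$J_\lambda(\xi)=\mathbb{E}_{s\sim\mathcal B,\,a\sim\pi_\theta}\left[\lambda_\xi(s)\left(Q^{C}_{w_C}(s,a)-d\right)\right]$

with dual ascent on $\xi$.

Entropy temperature $\alpha$: Adapted as in SAC by gradient update.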
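The per-sample quantities behind these updates can be sketched with plain scalar arithmetic. The block below is a minimal illustration, not the authors' implementation: the function names are hypothetical, the targets follow the standard SAC form, and the softplus parameterization is just one common way to keep a multiplier nonnegative (the source does not say which is used).

```python
import math

def soft_q_target(r, gamma, q1_next, q2_next, alpha, logp_next):
    """SAC-style reward target: r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    return r + gamma * (min(q1_next, q2_next) - alpha * logp_next)

def cost_q_target(c, gamma_c, qc_next):
    """Cost critic target: same recursion with cost c and discount gamma_c, no entropy term."""
    return c + gamma_c * qc_next

def policy_loss(alpha, logp, q, lam, qc, d):
    """Per-sample actor loss: alpha * log pi(a|s) - Q(s,a) + lam(s) * (Q_C(s,a) - d)."""
    return alpha * logp - q + lam * (qc - d)

def multiplier_objective(lam, qc, d):
    """Per-sample dual objective lam(s) * (Q_C(s,a) - d); ascended in the multiplier parameters."""
    return lam * (qc - d)

def softplus(x):
    """Numerically stable softplus, assumed here as the nonnegativity map for lam."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Made-up numbers for a single transition.
y = soft_q_target(r=1.0, gamma=0.99, q1_next=2.0, q2_next=3.0, alpha=0.2, logp_next=-1.0)
y_c = cost_q_target(c=0.5, gamma_c=0.9, qc_next=2.0)
loss = policy_loss(alpha=0.2, logp=-1.0, q=3.0, lam=2.0, qc=1.5, d=1.0)
dual = multiplier_objective(lam=2.0, qc=1.5, d=1.0)
print(y, y_c, loss, dual)
```

In a real implementation these expressions would be averaged over a minibatch and differentiated by an autodiff library; the scalar form only shows how the multiplier term enters the actor loss with a plus sign and the dual objective with the same sign, so that the policy descends and the multiplier ascends on the same penalty.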

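In summary, each training iteration applies these updates to a minibatch drawn from the replay buffer: the reward and cost critics are regressed toward their targets, the policy is updated on the multiplier-weighted loss, the multiplier network is updated by dual ascent, and the entropy temperature is adapted.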

4. Complementary Slackness and Feasibility Detection

At the optimum $(\pi^*,\lambda^*)$, the statewise complementary slackness conditions apply:

$\lambda^*(s)\left(v_{C}^{\pi^*}(s)-d\right)=0,\qquad\lambda^*(s)\ge 0,\qquad v_{C}^{\pi^*}(s)\le d.$

Thus, for $v_{C}^{\pi^*}(s)<d$ (strict safety), $\lambda^*(s)=0$; for $v_{C}^{\pi^*}(s)=d$, $\lambda^*(s)\ge 0$. If $s$ is inherently infeasible, dual ascent causes $\lambda(s)\to\infty$, so large values of $\lambda_\xi(s)$ can be used as a practical indicator of infeasibility. The trained multiplier network $\lambda_\xi$ thus serves as a learned feasibility oracle over $\mathcal S$.
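The qualitative behavior described here can be reproduced with a one-dimensional toy. The sketch below runs gradient ascent on a single multiplier $\lambda=\mathrm{softplus}(\xi)$ for a state whose cost value is held fixed, which is a strong simplifying assumption (in FAC the policy changes at the same time). The softplus parameterization, step size, step count, and flagging threshold are all hypothetical choices for this example.

```python
import math

def softplus(x):
    """Numerically stable softplus, used to keep the multiplier nonnegative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Derivative of softplus."""
    return 1.0 / (1.0 + math.exp(-x))

def dual_ascent(cost_value, d, steps=2000, lr=0.05, xi=0.0):
    """Ascend lam(xi) * (cost_value - d) in xi, with lam = softplus(xi)."""
    for _ in range(steps):
        xi += lr * sigmoid(xi) * (cost_value - d)
    return softplus(xi)

d = 1.0
LAMBDA_FLAG = 50.0   # hypothetical threshold for flagging a state as likely infeasible

lam_safe = dual_ascent(cost_value=0.2, d=d)          # strictly safe: cost value below d
lam_infeasible = dual_ascent(cost_value=10.0, d=d)   # cost value stays above d

print(lam_safe, lam_safe > LAMBDA_FLAG)
print(lam_infeasible, lam_infeasible > LAMBDA_FLAG)
```

For the strictly safe state the multiplier decays toward zero, matching complementary slackness, while for the state whose cost value cannot be brought below the threshold it keeps growing with the number of ascent steps, which is why a large multiplier can be read as a sign of infeasibility.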

5. Theoretical Properties

FAC's theoretical guarantees are as follows:

  • Equivalence of scaled/unscaled Lagrangians (Theorem 3.1): Optimizing $\mathcal L_{stw}(\pi,\lambda)$ recovers the same $\pi^*$ as the infinite-constraint problem (SP).
  • Statewise implies expectation-based feasibility (Theorem 3.2): Any policy $\pi$ satisfying $v_{C}^{\pi}(s)\le d$ for all $s$ in $\mathcal I_F$ also satisfies $\mathbb{E}_{s\sim d_0}\left[v_{C}^{\pi}(s)\right]\le d$.
  • Performance comparison (Theorem 3.3): Compares the optimal statewise Lagrangian $\mathcal L_{stw}^*$ with the optimal expectation-based Lagrangian $\mathcal L_{exp}^*$; the result is summarized as FAC matching or improving on the reward of expectation-based methods even though its constraints are tighter.
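The second property follows in one line from monotonicity of expectation. The derivation below is a sketch under the assumption that the initial-state distribution, written $d_0$ here, is supported on the potentially feasible initial states, written $\mathcal I_F$; it is not a reproduction of the paper's proof.

```latex
\begin{aligned}
v_{C}^{\pi}(s) \le d \quad \forall s \in \mathcal{I}_F
\;\Longrightarrow\;
\mathbb{E}_{s \sim d_0}\!\left[ v_{C}^{\pi}(s) \right]
\le \mathbb{E}_{s \sim d_0}\!\left[ d \right] = d .
\end{aligned}
```

The converse does not hold in general. For instance, two equally likely initial states with cost values $0.2$ and $1.6$ and threshold $d=1$ have mean cost $0.9\le d$, yet the second state violates its own constraint.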

6. Empirical Evaluation

FAC was assessed on both robot locomotion tasks with speed constraints (HalfCheetah, Walker2d, Ant) and safe exploration scenarios (Safety-Gym: Point-Button, Car-Goal). Key findings are:

  • In robot locomotion tasks, FAC consistently respects the speed cap and achieves higher or comparable returns versus expectation-based baselines such as CPO, TRPO-Lagrangian, and PPO-Lagrangian, which exhibit greater constraint violations and instability.
  • On Safety-Gym safe exploration, FAC yields an episode-cost rate well below the 10% threshold even at the tails of 95% confidence intervals and fewer dangerous episodes at test time, while scalar-multiplier baselines suffer approximately 50% dangerous runs.
  • Feasibility indication emerges: $\lambda_\xi(s)$ sharply increases when the agent enters inherently infeasible regions (e.g., boxed-in by obstacles) and moderates when the agent returns to feasible zones.

7. Implementation Considerations and Practical Impact

FAC can be implemented in any off-policy actor-critic codebase by adding the multiplier network $\lambda_\xi$, modifying the policy loss with the $\lambda_\xi(s)\left(Q^{C}_{w_C}(s,a)-d\right)$ term, where $Q^{C}_{w_C}$ is the cost critic, and updating $\xi$ by dual ascent. This yields a policy that is statewise safe on all feasible initial states and a by-product classifier of infeasible ones. The learned multiplier network gives practitioners a direct means to identify, in real time, whether specific initializations or encountered states admit any safe policy or are intrinsically unsafe (Ma et al., 2021).
