Zero-Sum Multichain Lagrangian Formulation

Updated 20 May 2026

The paper introduces FAC, a framework that enforces statewise safety by using a state-dependent Lagrange multiplier for each feasible state.
FAC extends off-policy actor-critic methods like SAC by incorporating a multiplier network, which dynamically adjusts constraint enforcement to improve safety and reward performance.
Empirical results on robot locomotion and safe exploration tasks demonstrate FAC’s effectiveness in maintaining safety and detecting inherently infeasible states.

The Feasible Actor-Critic (FAC) algorithm is a model-free framework for constrained reinforcement learning (RL) that enforces statewise safety. Unlike conventional constrained RL methods, which apply constraints only in expectation over initial states, FAC imposes safety for every feasible initial state individually. FAC achieves this by leveraging a statewise Lagrange multiplier, parameterized by a neural network, to dynamically adapt constraint enforcement at the state level and distinguish inherently infeasible states from those for which safe policies exist (Ma et al., 2021).

1. Statewise Feasibility and Constraint Formalism

Let $\mathcal S$ denote the state space and $\pi$ a stochastic policy. The cost-value function (safety critic) from any state $s$ under policy $\pi$ is

$v_{C}^{\pi}(s)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^\infty\gamma_c^t\,c(s_t,a_t)\;\middle|\;s_0=s\right]$

for cost function $c$, with discount $\gamma_c$. Fixing a scalar threshold $d$, a state $s$ is called feasible under $\pi$ if $v_{C}^{\pi}(s)\le d$; otherwise, it is unsafe. Certain states, denoted $\mathcal S_I$, are inherently infeasible: $v_{C}^{\pi}(s)>d$ for all policies $\pi$. The complement, $\mathcal S_F=\mathcal S\setminus\mathcal S_I$, constitutes the feasible region. For a subset of initial support $\mathcal I\subseteq\mathcal S$, the potentially feasible initial states are $\mathcal I_F=\mathcal I\cap\mathcal S_F$. The statewise safety constraint requires $v_{C}^{\pi}(s)\le d$ for every $s\in\mathcal I_F$.
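The definitions above can be illustrated with a small Monte Carlo sketch. It assumes finite, truncated rollouts stand in for the infinite-horizon sum, and the helper names (`discounted_cost`, `estimate_cost_value`, `is_feasible`) are hypothetical, not taken from the paper.

```python
def discounted_cost(costs, gamma_c):
    """Discounted sum of per-step costs for one (truncated) trajectory."""
    total = 0.0
    for c in reversed(costs):
        total = c + gamma_c * total
    return total


def estimate_cost_value(trajectories, gamma_c):
    """Monte Carlo estimate of v_C^pi(s): average over rollouts started at s."""
    returns = [discounted_cost(t, gamma_c) for t in trajectories]
    return sum(returns) / len(returns)


def is_feasible(trajectories, gamma_c, d):
    """A state is feasible under the policy if its estimated cost value is <= d."""
    return estimate_cost_value(trajectories, gamma_c) <= d


# Two rollouts of per-step costs from the same start state (illustrative data).
rollouts = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
v_c = estimate_cost_value(rollouts, gamma_c=0.5)
print(v_c, is_feasible(rollouts, gamma_c=0.5, d=0.4))
```

Here the estimate is the mean of 0.5 and 0.25, so the state counts as feasible for a threshold of 0.4 but not for 0.3.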

2. Optimization Problem and Statewise Lagrangian

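The learning objective is to maximize the expected discounted return

$\max_{\pi}\;\mathbb{E}_{s\sim d_0}\left[v^{\pi}(s)\right],\qquad v^{\pi}(s)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^\infty\gamma^t\,r(s_t,a_t)\;\middle|\;s_0=s\right]$

subject to the infinite family of constraints

$v_{C}^{\pi}(s)\le d,\quad\forall s\in\mathcal I_F\qquad\text{(SP)}$

where $\mathcal I_F$ is the set of potentially feasible initial states. FAC introduces a nonnegative, state-dependent Lagrange multiplier $\lambda(s)\ge 0$, and forms the statewise Lagrangian

$\mathcal L(\pi,\lambda)=\mathbb{E}_{s\sim d_0}\left[-v^{\pi}(s)+\lambda(s)\left(v_{C}^{\pi}(s)-d\right)\right]$

where $d_0$ is the initial-state distribution. The corresponding saddle-point problem,

$\max_{\lambda\ge 0}\;\min_{\pi}\;\mathcal L(\pi,\lambda)$

ensures, under mild conditions, that the optimal policy $\pi^*$ solves (SP).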
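To make the role of the state-dependent multiplier concrete, the following sketch evaluates a statewise Lagrangian on a two-state toy problem and compares it with a scalar-multiplier, expectation-based Lagrangian. It assumes the minimization-form Lagrangian $\mathbb{E}_{s\sim d_0}[-v^{\pi}(s)+\lambda(s)(v_C^{\pi}(s)-d)]$; the tabular values and function names are made up for illustration.

```python
def statewise_lagrangian(d0, v_reward, v_cost, lam, d):
    """E_{s~d0}[ -v(s) + lam(s) * (v_C(s) - d) ] with one multiplier per state."""
    return sum(p * (-vr + l * (vc - d))
               for p, vr, vc, l in zip(d0, v_reward, v_cost, lam))


def expectation_lagrangian(d0, v_reward, v_cost, lam, d):
    """-E[v] + lam * (E[v_C] - d) with a single scalar multiplier."""
    mean_reward = sum(p * vr for p, vr in zip(d0, v_reward))
    mean_cost = sum(p * vc for p, vc in zip(d0, v_cost))
    return -mean_reward + lam * (mean_cost - d)


# Illustrative two-state problem: state 0 is safe, state 1 violates d = 2.
d0 = [0.5, 0.5]
v_reward = [10.0, 4.0]
v_cost = [1.0, 3.0]
d = 2.0

stw = statewise_lagrangian(d0, v_reward, v_cost, lam=[0.0, 2.0], d=d)
exp = expectation_lagrangian(d0, v_reward, v_cost, lam=2.0, d=d)
print(stw, exp)
```

The mean cost equals the threshold, so the scalar penalty vanishes even though state 1 is unsafe, whereas the statewise form penalizes state 1 directly.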

3. Algorithmic Architecture and Updates

FAC extends standard off-policy actor-critic frameworks, particularly soft-actor-critic (SAC), by adding a multiplier network. The key components are:

Two reward Q-networks: $Q_{\omega_1}, Q_{\omega_2}$ (double-Q)
Cost Q-network: $Q_{C}^{\phi}$ (predicts cost-to-go)
Stochastic policy network: $\pi_\theta$ (e.g., Gaussian with $\tanh$ squashing)
Multiplier network: $\lambda_\xi$
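Because the multiplier must be nonnegative, the multiplier network needs an output that cannot go below zero. One common way to get this, assumed here rather than taken from the paper, is a softplus output layer. The sketch below uses a single linear layer as a stand-in for the network; `multiplier`, `weights`, and `bias` are hypothetical names.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); always nonnegative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def multiplier(features, weights, bias):
    """Toy lambda(s): a linear layer followed by softplus (stand-in for a network)."""
    pre_activation = sum(w * f for w, f in zip(weights, features)) + bias
    return softplus(pre_activation)


lam_value = multiplier([1.0, -2.0], weights=[0.5, 0.25], bias=0.0)
print(lam_value)
```

With these inputs the pre-activation is exactly zero, so the output is $\log 2$. Large positive pre-activations pass through almost unchanged, which leaves room for the multiplier to grow on infeasible states.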

Each is updated as follows:

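Reward Q-update:

$J_Q(\omega_i)=\mathbb{E}_{(s,a,r,s')\sim\mathcal B}\left[\tfrac12\left(Q_{\omega_i}(s,a)-y\right)^2\right],\quad i\in\{1,2\}$

where $y=r+\gamma\left(\min_{j\in\{1,2\}}Q_{\bar\omega_j}(s',a')-\alpha\log\pi_\theta(a'\mid s')\right)$ with $a'\sim\pi_\theta(\cdot\mid s')$, replay buffer $\mathcal B$, and target-network parameters $\bar\omega_j$.

Cost Q-update: Mirrors reward Q, using cost $c$, discount $\gamma_c$, and no entropy penalty.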
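The two regression targets can be written out for a single transition. This is a sketch assuming the standard SAC-style soft target with clipped double-Q for the reward critic and a plain Bellman backup for the cost critic; terminal-state handling is omitted and all names are illustrative.

```python
def reward_target(r, gamma, q1_next, q2_next, alpha, log_prob_next):
    """Soft TD target: r + gamma * (min of the two target Qs - alpha * log pi)."""
    return r + gamma * (min(q1_next, q2_next) - alpha * log_prob_next)


def cost_target(c, gamma_c, qc_next):
    """Cost TD target: same backup with cost c and discount gamma_c, no entropy term."""
    return c + gamma_c * qc_next


def squared_td_error(q_pred, target):
    """Per-sample critic loss, 0.5 * (Q(s, a) - y)^2."""
    return 0.5 * (q_pred - target) ** 2


y_r = reward_target(r=1.0, gamma=0.5, q1_next=4.0, q2_next=6.0,
                    alpha=0.5, log_prob_next=-2.0)
y_c = cost_target(c=1.0, gamma_c=0.5, qc_next=2.0)
print(y_r, y_c, squared_td_error(3.0, y_r))
```

Taking the minimum over the two reward critics counteracts overestimation. The cost critic has no such term in this sketch because the text describes a single cost Q-network.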

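Policy update:

$J_\pi(\theta)=\mathbb{E}_{s\sim\mathcal B,\,a\sim\pi_\theta}\left[\alpha\log\pi_\theta(a\mid s)-\min_{i}Q_{\omega_i}(s,a)+\lambda_\xi(s)\left(Q_{C}^{\phi}(s,a)-d\right)\right]$

with gradient estimated via reparameterization ( $a=f_\theta(s;\epsilon)$, with noise $\epsilon$ drawn from a fixed distribution).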
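A per-sample version of the policy objective, including the reparameterized action, might look as follows. The sketch assumes a one-dimensional Gaussian policy with tanh squashing and the usual change-of-variables correction to the log-probability; the small constant inside the logarithm and the function names are illustrative choices, not details from the paper.

```python
import math


def sample_squashed_gaussian(mu, log_std, eps):
    """Reparameterized action a = tanh(mu + std * eps) and its log-probability."""
    std = math.exp(log_std)
    u = mu + std * eps
    a = math.tanh(u)
    # Gaussian log-density of u, written in terms of the noise eps.
    log_prob = -0.5 * eps ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change-of-variables correction for the tanh squashing.
    log_prob -= math.log(1.0 - a * a + 1e-6)
    return a, log_prob


def policy_loss_sample(alpha, log_prob, q_reward, q_cost, lam, d):
    """alpha * log pi - Q + lambda(s) * (Q_C - d) for one state-action sample."""
    return alpha * log_prob - q_reward + lam * (q_cost - d)


action, logp = sample_squashed_gaussian(mu=0.0, log_std=0.0, eps=0.0)
loss = policy_loss_sample(alpha=0.2, log_prob=-1.0, q_reward=5.0,
                          q_cost=3.0, lam=2.0, d=1.0)
print(action, logp, loss)
```

Because the action is a deterministic function of the parameters and the noise, an automatic-differentiation framework can propagate gradients of this loss through the critics into the policy parameters.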

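Multiplier update:

$J_\lambda(\xi)=\mathbb{E}_{s\sim\mathcal B,\,a\sim\pi_\theta}\left[\lambda_\xi(s)\left(Q_{C}^{\phi}(s,a)-d\right)\right]$

with dual ascent on $\xi$.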
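The effect of dual ascent is easiest to see in a tabular toy, sketched here under simplifying assumptions: one multiplier per state instead of a network, cost values held fixed instead of co-evolving with the policy, and projection onto the nonnegative reals after each step. None of these choices come from the paper.

```python
def dual_ascent_step(lam, q_cost, d, lr):
    """Gradient ascent on lam(s) * (Q_C(s) - d), projected onto lam >= 0."""
    return [max(0.0, l + lr * (qc - d)) for l, qc in zip(lam, q_cost)]


lam = [1.0, 1.0]
q_cost = [0.5, 2.0]   # state 0 satisfies d = 1.0, state 1 violates it
for _ in range(100):
    lam = dual_ascent_step(lam, q_cost, d=1.0, lr=0.1)
print(lam)
```

The multiplier of the satisfied state decays to zero and stays there, while the multiplier of the violating state keeps growing for as long as the violation persists. In the full algorithm the policy reacts to the growing penalty, so growth stops once the state becomes safe.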

Entropy temperature $\alpha$: Adapted as in SAC by gradient update.

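In summary, each training iteration performs the following steps in order:

Collect a transition $(s,a,r,c,s')$ with the current policy and store it in the replay buffer.
Sample a minibatch from the replay buffer.
Update the reward Q-networks and the cost Q-network by regression onto their targets.
Update the policy by gradient descent on the policy objective.
Update the multiplier network by dual ascent.
Update the entropy temperature as in SAC.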

4. Complementary Slackness and Feasibility Detection

At the optimum $(\pi^*,\lambda^*)$, the statewise complementary slackness conditions apply: $\lambda^*(s)\left(v_{C}^{\pi^*}(s)-d\right)=0$ with $\lambda^*(s)\ge 0$ and $v_{C}^{\pi^*}(s)\le d$. Thus, for $v_{C}^{\pi^*}(s)<d$ (strict safety), $\lambda^*(s)=0$; for $v_{C}^{\pi^*}(s)=d$ (active constraint), $\lambda^*(s)\ge 0$. If $s$ is inherently infeasible, dual ascent causes $\lambda(s)\to\infty$, so large values of $\lambda_\xi(s)$ can be used as a practical indicator of infeasibility. The trained multiplier network $\lambda_\xi$ thus serves as a learned feasibility oracle over the state space $\mathcal S$.
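In practice, reading the multiplier as a feasibility signal requires choosing a cutoff. The sketch below is one hypothetical way to do this; the threshold of 100, the tolerance, and the label strings are arbitrary illustrative values, not settings reported in the paper.

```python
def classify_state(lam, v_cost, d, lam_threshold=100.0, tol=1e-6):
    """Heuristic reading of a learned multiplier value for one state."""
    if lam >= lam_threshold:
        return "likely infeasible"
    if lam <= tol and v_cost < d:
        return "constraint inactive"
    return "constraint active or not converged"


labels = [
    classify_state(lam=0.0, v_cost=0.5, d=1.0),
    classify_state(lam=2.0, v_cost=1.0, d=1.0),
    classify_state(lam=250.0, v_cost=3.0, d=1.0),
]
print(labels)
```

A suitable cutoff depends on training length and learning rate, since the multiplier of an infeasible state grows without bound rather than converging to a fixed value.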

5. Theoretical Properties

FAC's theoretical guarantees are as follows:

Equivalence of scaled/unscaled Lagrangians (Theorem 3.1): Optimizing the scaled Lagrangian $\mathcal L(\pi,\lambda)$ recovers the same optimal policy $\pi^*$ as the infinite-constraint problem (SP).
Statewise implies expectation-based feasibility (Theorem 3.2): Any policy $\pi$ satisfying $v_{C}^{\pi}(s)\le d$ for all $s$ in $\mathcal I_F$ also satisfies $\mathbb{E}_{s\sim d_0}\left[v_{C}^{\pi}(s)\right]\le d$.
Performance comparison (Theorem 3.3): The theorem compares the optimal statewise Lagrangian with the optimal expectation-based Lagrangian and concludes that FAC can only improve, or at least match, reward under tighter constraints compared to expectation-based methods.
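The reasoning behind Theorem 3.2 fits in one line. The derivation below is a sketch that assumes a discrete state space and an initial-state distribution $d_0$ supported on $\mathcal I_F$; these symbols are notation chosen for this summary.

```latex
\mathbb{E}_{s\sim d_0}\left[v_C^{\pi}(s)\right]
  = \sum_{s\in\mathcal I_F} d_0(s)\,v_C^{\pi}(s)
  \le \sum_{s\in\mathcal I_F} d_0(s)\,d
  = d .
```

The converse does not hold. With two equally likely initial states whose cost values are 1 and 3 and a threshold $d=2$, the expected cost equals 2 and the expectation-based constraint is met, yet the second state violates the statewise constraint.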

6. Empirical Evaluation

FAC was assessed on both robot locomotion tasks with speed constraints (HalfCheetah, Walker2d, Ant) and safe exploration scenarios (Safety-Gym: Point-Button, Car-Goal). Key findings are:

In robot locomotion tasks, FAC consistently respects the speed cap and achieves higher or comparable returns versus expectation-based baselines such as CPO, TRPO-Lagrangian, and PPO-Lagrangian, which exhibit greater constraint violations and instability.
On Safety-Gym safe exploration, FAC yields an episode-cost rate well below the 10% threshold even at the tails of 95% confidence intervals and fewer dangerous episodes at test time, while scalar-multiplier baselines suffer approximately 50% dangerous runs.
Feasibility indication emerges: $\lambda_\xi(s)$ sharply increases when the agent enters inherently infeasible regions (e.g., boxed in by obstacles) and moderates when the agent returns to feasible zones.

7. Implementation Considerations and Practical Impact

FAC can be implemented in any off-policy actor-critic codebase by adding the multiplier network $\lambda_\xi$, modifying the policy loss with the $\lambda_\xi(s)\left(Q_{C}^{\phi}(s,a)-d\right)$ term, and updating $\xi$ by dual ascent. This yields a policy that is statewise safe on all feasible initial states and a by-product classifier of infeasible ones. The learned multiplier network gives practitioners a direct means to identify, in real time, whether specific initializations or encountered states admit any safe policy or are intrinsically unsafe (Ma et al., 2021).


References (1)

Ma et al. (2021). Feasible Actor-Critic: Constrained Reinforcement Learning for Ensuring Statewise Safety.
