Multi-Agent Safe Optimal Control

Updated 4 March 2026

MASOCP is a framework that jointly regulates multiple agents by minimizing a global cost while enforcing strict, zero-tolerance safety constraints.
The methodology uses an epigraph reformulation so that hard safety constraints can be enforced with bounded training gradients, which keeps learning stable.
A distributed solution architecture enables centralized training with decentralized execution, which is scalable for complex robotic and autonomous applications.

The Multi-Agent Safe Optimal Control Problem (MASOCP) is the problem of jointly regulating a team of agents to optimize a collective objective subject to hard safety constraints, most commonly the complete avoidance of constraint violations by every agent at every time step. Unlike conventional unconstrained or penalty-based multi-agent control frameworks, the MASOCP treats safety as a zero-tolerance constraint: every admissible control policy must keep the trajectories of all agents within the specified safe set at all times. The problem is typically modeled as a constrained Markov decision process (CMDP) with per-agent or global avoidance sets. It is particularly relevant in robotics, autonomous vehicles, and distributed cyber-physical systems, where any safety violation is unacceptable and must be provably excluded during both training and execution (Zhang et al., 21 Apr 2025).

1. Mathematical Formulation of the MASOCP

The canonical MASOCP considers $N$ homogeneous agents, each with local control $u_i^k \in \mathcal U_i$, partial observation $o_i^k = O_i(x^k)$, and global state evolution:

$x^{k+1}=f(x^k, u^k), \quad x^k \in \mathcal X \subseteq \mathbb R^n, \quad u^k = [u_1^k; \ldots; u_N^k]$

The control objective is to minimize a global infinite-horizon cost:

$J(\pi) = \sum_{k=0}^\infty l(x^k, \pi(x^k))$

where $l(\cdot,\cdot)$ is the global running cost and $\pi(x^k)$ is the joint policy, potentially decentralized. Safety is encoded with per-agent constraint functions $h_i(o_i)$ and associated avoid sets $\mathcal A_i = \{o_i \mid h_i(o_i) > 0\}$, yielding the zero-violation constraint:

$h_i(o_i^k) \le 0, \quad \forall i = 1,\ldots,N, \;\forall k \ge 0$

The MASOCP is thus

$\begin{aligned} & \min_{\{\pi_i\}} \qquad && \sum_{k=0}^{\infty} l\bigl(x^k, \pi(x^k)\bigr) \\ & \text{s.t.} \qquad && x^{k+1} = f\bigl(x^k, \pi(x^k)\bigr), \quad h_i(o_i^k) \le 0, \;\;\forall i, k \end{aligned}$

The cost and safety constraints are typically heterogeneous and may incorporate coupling among agents. Unlike CMDPs with soft (average) constraints, the MASOCP, as defined above, admits no constraint violations at any time (Zhang et al., 21 Apr 2025).
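
To make the two quantities in this formulation concrete, the following minimal Python sketch rolls out a joint policy and reports the cumulative cost (truncated to a finite horizon) together with the worst-case constraint value $\max_{k,i} h_i(o_i^k)$. A rollout satisfies the zero-violation constraint exactly when that value is non-positive. The two-agent line world, the separation threshold, the proportional policy, and all function names are assumptions made for illustration; none of them come from the source.

```python
import numpy as np

def rollout_cost_and_violation(x0, policy, f, l, h, observe, horizon):
    """Return (truncated cumulative cost, max over k and i of h_i(o_i^k))."""
    x = np.asarray(x0, dtype=float)
    total_cost = 0.0
    worst_h = -np.inf
    for _ in range(horizon):
        obs = observe(x)                               # per-agent observations o_i^k
        worst_h = max(worst_h, max(h(o) for o in obs))
        u = policy(obs)                                # joint control u^k
        total_cost += l(x, u)                          # running cost l(x^k, u^k)
        x = f(x, u)                                    # x^{k+1} = f(x^k, u^k)
    return total_cost, worst_h

# --- Toy instance: everything below is an illustrative assumption ---
D_MIN = 0.5  # required separation between the two agents

def f(x, u):
    return x + 0.1 * u                   # single-integrator dynamics

def observe(x):
    return [(x[0], x[1]), (x[1], x[0])]  # o_i = (own position, other position)

def h(o):
    return D_MIN - abs(o[0] - o[1])      # h_i(o_i) > 0 means "inside the avoid set"

def make_policy(goals):
    def policy(obs):
        # Each agent steers proportionally toward its own goal.
        return np.array([g - o[0] for g, o in zip(goals, obs)])
    return policy

def make_cost(goals):
    goals = np.asarray(goals, dtype=float)
    return lambda x, u: float(np.sum((x - goals) ** 2))

x0 = [0.0, 1.0]
goals_apart = [-1.0, 2.0]   # agents move away from each other
goals_swap = [1.0, 0.0]     # agents try to swap places and pass through each other

cost_apart, worst_apart = rollout_cost_and_violation(
    x0, make_policy(goals_apart), f, make_cost(goals_apart), h, observe, 50)
cost_swap, worst_swap = rollout_cost_and_violation(
    x0, make_policy(goals_swap), f, make_cost(goals_swap), h, observe, 50)

print("moving apart: worst h =", worst_apart)   # non-positive: feasible for the MASOCP
print("swapping:     worst h =", worst_swap)    # positive: violates the hard constraint
```

In this toy, the swapping policy reaches its goals but is infeasible for the MASOCP regardless of its cost, because a single time step with $h_i > 0$ already breaks the constraint.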

2. Epigraph Reformulation and Training Stability

Traditional Lagrangian relaxation of constraints in RL-based CMDP formulations leads to unbounded gradients as Lagrange multipliers diverge under hard (zero-tolerance) constraints, resulting in highly unstable training. To address this, the MASOCP can be reformulated in epigraph form (following convex optimization theory):

$\min_{z, \pi} \; z \qquad \text{s.t.} \quad V^h(x^0; \pi) \le 0, \; V^l(x^0; \pi) \le z$

Here, $V^l(x^0; \pi)$ is the cumulative cost, $V^h(x^0; \pi)=\max_{k \geq 0, i} h_i(o_i^k)$ is the worst-case constraint value along the trajectory (positive exactly when a violation occurs), and $z$ is an auxiliary scalar that upper-bounds the total cost. Both constraints hold exactly when $\max \{ V^h(x^0; \pi), V^l(x^0; \pi) - z \} \le 0$, so the problem separates into an inner problem over $\pi$ for fixed $z$ and an outer problem over $z$:

$\pi_z = \arg\min_\pi \max \{ V^h(x^0; \pi), V^l(x^0; \pi) - z \}, \qquad z^* = \min \{ z \mid \max \{ V^h(x^0; \pi_z), V^l(x^0; \pi_z) - z \} \le 0 \}$

The epigraph method ensures training gradients remain bounded, as $z$ only shifts constraint thresholds, yielding stable and robust convergence even for zero-tolerance constraints in deep RL architectures (Zhang et al., 21 Apr 2025).
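
The inner and outer structure of the epigraph form can be sketched on a toy problem in which the policy class is a finite list of candidates, each summarized by its cumulative cost $V^l$ and worst-case constraint value $V^h$. The candidate values, the function names, and the use of bisection for the outer problem are assumptions chosen for illustration. The sketch relies on the fact that $\max\{V^h, V^l - z\}$ is non-increasing in $z$.

```python
def total_value(v_h, v_l, z):
    """Epigraph objective max{V^h, V^l - z} for one policy and one budget z."""
    return max(v_h, v_l - z)

def inner_min(candidates, z):
    """Inner problem: best achievable total value over the candidate policies."""
    return min(total_value(v_h, v_l, z) for v_l, v_h in candidates)

def solve_outer(candidates, z_lo, z_hi, tol=1e-6):
    """Outer problem: smallest z in [z_lo, z_hi] with inner_min(z) <= 0.

    Bisection is valid because inner_min is non-increasing in z.
    """
    if inner_min(candidates, z_hi) > 0:
        raise ValueError("no candidate is safe within the given cost budget")
    while z_hi - z_lo > tol:
        mid = 0.5 * (z_lo + z_hi)
        if inner_min(candidates, mid) <= 0:
            z_hi = mid      # budget mid is achievable safely, try a smaller one
        else:
            z_lo = mid      # budget too tight (or only unsafe policies meet it)
    return z_hi

# Hypothetical candidates as (V^l, V^h) pairs.
candidates = [
    (3.0, 0.2),    # cheapest, but violates the constraint (V^h > 0)
    (5.0, -0.1),   # safe
    (7.0, -0.5),   # safe but more expensive
]

z_star = solve_outer(candidates, z_lo=0.0, z_hi=10.0)
print(round(z_star, 3))   # close to 5.0, the lowest cost among the safe candidates
```

Note that `total_value` changes with slope 0 or -1 in `z`, so in this toy the dependence on the auxiliary variable stays bounded no matter how tight the safety constraint is, in contrast to a Lagrange multiplier that can grow without limit.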

3. Distributed Solution Architecture

The MASOCP in epigraph form naturally admits a distributed solution architecture via centralized training and distributed execution (CTDE):

Centralized Training (Inner Problem):

Jointly train a $z$ -conditioned policy $\pi_\theta(o_i, z)$ , global value network $V^l_\phi(x, z)$ , and per-agent constraint networks $V^h_{\psi, i}(o_i, z)$ .
Employ a Bellman recursion for the total value based on the maximum of current constraint violation and cost-depletion, with Generalized Advantage Estimation (GAE) for stable PPO updates.
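
A minimal sketch of how regression targets for such a recursion could be computed along one recorded trajectory is given below. It assumes an undiscounted setting in which the cost budget is depleted as $z^{k+1} = z^k - l(x^k, u^k)$ and the total value satisfies $V(x^k, z^k) = \max\{h(x^k), V(x^{k+1}, z^{k+1})\}$; this is one reading of "cost-depletion" and is an assumption here, as are the function name and the sample numbers. GAE and the PPO update are omitted.

```python
import numpy as np

def epigraph_value_targets(costs, h_values, z0, bootstrap=None):
    """Backward recursion V_k = max(h_k, V_{k+1}) with budget z_{k+1} = z_k - l_k.

    costs[k]    : running cost l(x^k, u^k)
    h_values[k] : worst constraint value over agents at step k
    bootstrap   : value estimate at the final state; if None, assume no further
                  cost or violation, so the tail value is -z_T.
    """
    costs = np.asarray(costs, dtype=float)
    h_values = np.asarray(h_values, dtype=float)
    # Remaining budget at every step, including the final one (length T + 1).
    z = z0 - np.concatenate(([0.0], np.cumsum(costs)))
    v_next = -z[-1] if bootstrap is None else float(bootstrap)
    targets = np.empty(len(costs))
    for k in reversed(range(len(costs))):
        v_next = max(h_values[k], v_next)
        targets[k] = v_next
    return z, targets

costs = [1.0, 2.0, 1.0]
h_values = [-0.5, -0.2, -0.4]

# Generous budget: the constraint term dominates and the targets stay negative.
z_loose, targets_loose = epigraph_value_targets(costs, h_values, z0=5.0)
print(z_loose)        # [5. 4. 2. 1.]
print(targets_loose)  # [-0.2 -0.2 -0.4]

# Tight budget: total cost 4 exceeds z0 = 3, so every target becomes +1.
z_tight, targets_tight = epigraph_value_targets(costs, h_values, z0=3.0)
print(targets_tight)  # [1. 1. 1.]
```

The first target equals $\max\{\max_k h_k, \sum_k l_k - z^0\}$, which matches the total value $\max\{V^h, V^l - z\}$ evaluated on this trajectory.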

Distributed Execution (Outer Problem):

Each agent $i$ performs local root-finding to determine the minimal $z_i^*$ such that $V^h_i(o_i; \pi(\cdot, z_i^*)) \le 0$.
The global $z^* = \max_i z_i^*$ aggregates local minima; each agent then applies $\pi_i(o_i, z^*)$ independently.
This pipeline admits true decentralized execution with guarantees for zero-violation safety and global cost-optimality under the learned policy (Zhang et al., 21 Apr 2025).
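
The execution-time step can be sketched as follows. Each agent runs a bisection on its own constraint-value function to find the smallest budget it considers safe, and the team then takes the maximum. The stand-in function `toy_v_h`, the search interval, the optional buffer `xi`, and the assumption that $V^h_i$ is non-increasing in $z$ are all illustrative choices, not details confirmed by the source; in practice `v_h` would be the learned network.

```python
def smallest_safe_z(v_h, obs, z_min, z_max, xi=0.0, tol=1e-4):
    """Smallest z in [z_min, z_max] with v_h(obs, z) <= -xi, found by bisection.

    Assumes v_h(obs, z) is non-increasing in z. If even z_max is not safe,
    z_max is returned as a fallback.
    """
    if v_h(obs, z_max) > -xi:
        return z_max
    if v_h(obs, z_min) <= -xi:
        return z_min
    lo, hi = z_min, z_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if v_h(obs, mid) <= -xi:
            hi = mid        # mid is safe, look for a smaller budget
        else:
            lo = mid        # mid is not safe, increase the budget
    return hi

def toy_v_h(obs, z):
    """Hypothetical stand-in for a learned V^h_i: 'risk level' minus budget."""
    return obs - z

# One scalar observation per agent (illustrative values).
observations = [1.0, 2.5, 0.3]

# Local step: every agent solves its own one-dimensional problem.
z_local = [smallest_safe_z(toy_v_h, o, z_min=0.0, z_max=10.0) for o in observations]

# Aggregation step: the team uses the largest local value.
z_star = max(z_local)
print([round(z, 2) for z in z_local], round(z_star, 2))
```

Taking the maximum means the shared $z^*$ is one that every agent's local check accepts under the monotonicity assumption, at the price of a looser cost bound than some agents would need on their own.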

4. Theoretical Guarantees

The epigraph-based MASOCP solution framework provides rigorous correctness and convergence guarantees:

Optimality: The inner-outer minimax decomposition with distributed root-finding yields a globally optimal policy for the original MASOCP with zero constraint violation (Lemma 3).
Convergence: The algorithm recasts the problem as single-agent avoid control on an augmented $(x,z)$ state, so standard convergence theory for actor-critic RL (e.g., PPO) applies under compatible function approximation.
Assumptions: The guarantees rely on uniqueness of the mapping $z \mapsto V^l(x; \pi_z)$, sufficient exploration for root-finding, and bounded error in the $V^h_i$ estimates (handled in practice by requiring $V^h_i \le -\xi$ for a buffer $\xi$).

These results guarantee that, as long as the function approximators are well-behaved and exploration is sufficient, the architecture will converge to safe and cost-optimal control policies (Zhang et al., 21 Apr 2025).

5. Empirical Evaluation and Scalability

Simulation evidence spans multi-agent particle environments (MPE) and Safe Multi-Agent MuJoCo tasks:

Setting	Safety rate	Cost matches unconstrained baseline	Scalability
MPE (Target, Spread, etc.)	~100%	Yes	Trained with $N=16$, generalizes to $N=512$
Safe MuJoCo (HalfCheetah)	~100%	Yes	As above

Def-MARL, the algorithm synthesizing the epigraph method and distributed execution, achieves $\sim$ 100% safety and maintains cost performance equal to unconstrained baselines with a fixed hyperparameter set.
Penalty and Lagrangian methods display intrinsic cost-safety trade-offs (over-conservative or unsafe) and suffer instability on zero-violation tasks.
The method generalizes to large-scale multi-agent settings, losing less than $0.5\%$ in safety rate up to $N=512$ with controlled agent density (Zhang et al., 21 Apr 2025).

Hardware experiments (Crazyflie quadcopters) show:

100% safety and success for tasks such as corridor crossing and multi-quadrotor inspection.
Baseline MPC controllers either fail (centralized MPC gets stuck in local minima) or behave unsafely (decentralized MPC produces collisions).

6. Context, Extensions, and Relations to Broader Literature

Epigraph-based policies for MASOCP contrast with Lagrangian methods, penalty-based CMDPs, and other safe multi-agent RL approaches:

Lagrangian approaches become numerically unstable as constraint multipliers diverge for hard constraints.
Penalty-based methods require tedious hyperparameter tuning and fail to achieve robust zero-violation performance (typically oscillating between high cost and unsafe policies).
The distributed root-finding structure in the epigraph approach exploits problem decomposability, supporting scalability and independence from per-task hyperparameter selection.

Comparison with safe Bayesian optimization (Tokmak et al., 19 Aug 2025), barrier function and CBF-based frameworks (Mestres et al., 2024, Song et al., 2022), and safe RL via distributed optimization (Tan et al., 2024) shows the distinctive strength of epigraph-based, RL-driven approaches for zero-violation, scalable, decentralized safe control in multi-agent systems.

7. Algorithmic Summary

Pseudocode representation highlights the two-level structure:

# Centralized training (inner problem)
initialize θ, φ, ψ
for each epoch:
    sample x0, z0
    rollout {xk, zk, ok, ak} under πθ(·, zk)
    compute GAE using V^h_ψ, V^l_φ
    update ψ, φ by TD loss; update θ by PPO loss

# Distributed execution (outer problem)
for each timestep k:
    for agent i:
        solve z_i = min{z : V^h_ψ,i(o_i, z) ≤ 0}
    aggregate z = max_i z_i
    each agent applies u_i = πθ(o_i, z)

(Zhang et al., 21 Apr 2025)

The critical insight is that the MASOCP admits a centralized-training, distributed-execution solution with provable safety and optimality, robust to the agent count and scalable to high-dimensional settings. Empirical and theoretical analyses demonstrate that, in the context of multi-agent RL, epigraph reformulation is essential for stable, practical, and rigorous safe optimal control.