Def-MARL: Distributed Epigraph Multi-Agent RL
- The paper presents a novel epigraph reformulation for multi-agent safe optimal control that decomposes constraints into distributed per-agent problems.
- The methodology leverages CTDE with a z-conditioned policy to enable provably zero-violation safety while minimizing cost.
- Empirical evaluations and hardware demonstrations validate Def-MARL’s scalable, stable, and superior performance compared to traditional penalty and Lagrangian methods.
Distributed Epigraph Form Multi-Agent Reinforcement Learning (Def-MARL) is a centralized-training, distributed-execution (CTDE) algorithm that addresses multi-agent safe optimal control by recasting the underlying constrained Markov decision process (CMDP) into epigraph form. Through a novel decomposition of the epigraph-constrained optimization problem, Def-MARL enables distributed constraint satisfaction and cost minimization, achieving stable, high-performance, and provably safe policy learning for collaborative robotic tasks with strict zero-violation safety requirements (Zhang et al., 21 Apr 2025).
1. Mathematical Formulation of Multi-Agent Safe Optimal Control
The multi-agent safe optimal control problem (MASOCP) considers a system of $N$ agents interacting over a joint state $x \in \mathcal{X}$. Each agent $i$ observes $o_i$ and acts locally with $u_i = \pi_i(o_i)$, leading to a joint action $u = (u_1, \dots, u_N)$. The system dynamics evolve as $x^{k+1} = f(x^k, u^k)$. A global cost $\ell(x, u)$ and the associated infinite-horizon cost-value, $V^\ell(x^0; \pi) = \sum_{k \ge 0} \ell(x^k, u^k)$,
are minimized subject to safety constraints. For safety, each agent maintains a local avoid-set function $h_i(o_i)$, with positive values indicating violation; the joint system is safe iff $h(x) := \max_i h_i(o_i) \le 0$. The safety constraint is imposed in its strictest form: zero violation ($h(x^k) \le 0$ at all times $k$).
The MASOCP is thus: \begin{align*} \min_{\pi_1,\dots,\pi_N}\; &V^\ell(x^0; \pi) \\ \text{s.t.}\quad &V^h(x^0; \pi)\le 0, \end{align*} where $V^h(x^0; \pi) = \max_{k \ge 0} h(x^k)$ is the worst-case constraint value along the trajectory.
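As a concrete illustration, the two value functions can be computed from a finite rollout. This is a minimal sketch; the trajectory data below is invented:

```python
def cost_value(costs):
    # V^ell: cumulative (undiscounted) cost along a rollout
    return sum(costs)

def constraint_value(h_per_step):
    # V^h: worst violation over time of h(x) = max_i h_i(o_i)
    return max(max(h_agents) for h_agents in h_per_step)

# toy rollout: 2 agents, 3 steps
costs = [0.4, 0.3, 0.1]
h_per_step = [[-0.2, -0.5], [-0.1, -0.3], [-0.4, -0.6]]
assert abs(cost_value(costs) - 0.8) < 1e-9
assert constraint_value(h_per_step) == -0.1   # <= 0, so the rollout is safe
```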
2. Epigraph Reformulation and Decomposition
The epigraph form introduces an auxiliary scalar $z$ as an upper bound for the cost, recasting the problem as: \begin{align} \min_{\pi, z}\; &z \\ \text{s.t.}\quad &V^h(x^0;\pi) \leq 0, \quad V^\ell(x^0;\pi) \leq z. \end{align} Following the approach of Boyd and Vandenberghe, this splits the optimization into an outer and an inner problem: \begin{subequations}\label{eq:epigraph-split} \begin{align} &\min_z\; z \tag{outer}\\ &\text{s.t.}\quad \min_{\pi} \left[ \max\{V^h(x^0;\pi),\, V^\ell(x^0;\pi) - z\} \right] \leq 0\,. \tag{inner} \end{align} \end{subequations}
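To see why the split is equivalent to the original constrained problem, a small numerical check on a scalar problem helps. This is a hedged sketch; the functions $f(x)=x^2$ and $g(x)=1-x$ are invented for illustration:

```python
import numpy as np

# constrained problem: minimize f(x) = x^2  s.t.  g(x) = 1 - x <= 0
xs = np.linspace(-3.0, 3.0, 6001)
f, g = xs**2, 1.0 - xs
direct = f[g <= 0].min()                      # optimum at x = 1, value 1

# epigraph split: outer line search over z, inner min over x of max{g, f - z}
zs = np.linspace(0.0, 5.0, 501)
inner = np.array([np.maximum(g, f - z).min() for z in zs])
z_star = zs[np.argmax(inner <= 1e-6)]         # smallest z with inner <= 0

assert abs(direct - 1.0) < 1e-3
assert abs(z_star - direct) < 2e-2            # outer optimum matches direct one
```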
Within this structure, the inner problem optimizes over policies $\pi$ with $z$ held fixed; the outer problem performs a line search over $z$. In the MASOCP context, safety is further decomposed per agent via the local constraint values $V^h_i(o_i^0; \pi) = \max_{k \ge 0} h_i(o_i^k)$, allowing the per-agent "total value" $V_i(x^0; \pi, z) = \max\{V^h_i(o_i^0; \pi),\; V^\ell(x^0; \pi) - z\}$. The collective constraint then becomes $\max_i V_i(x^0; \pi, z) \le 0$.
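The per-agent decomposition can be sketched directly; the values below are invented:

```python
def per_agent_total_value(Vh_i, Vl, z):
    # V_i = max{ V^h_i (local safety), V^ell - z (shared cost slack) }
    return max(Vh_i, Vl - z)

Vh = [-0.3, -0.05, -0.2]        # per-agent constraint values V^h_i
Vl, z = 4.0, 4.5                # global cost value and threshold
totals = [per_agent_total_value(v, Vl, z) for v in Vh]
assert max(totals) <= 0.0       # collective constraint max_i V_i <= 0 holds
```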
3. Distributed Epigraph Decomposition for MARL
A central result is that, under mild uniqueness assumptions, the outer epigraph problem can be distributed across agents. Each agent $i$ independently solves: \begin{align} z_i^* = \min_{z'}\;z' \quad \text{s.t.}\; V^h_i(o_i^0; \pi(\cdot, z')) \leq 0. \tag{2.1} \end{align} These local solutions are combined via $z^* = \max_i z_i^*$ to recover the overall solution. When communication is limited, the final max operation can be omitted, incurring negligible suboptimality while maintaining safety.
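A toy sketch of the distributed outer solve; the nonincreasing curves $V^h_i(z') = c_i - z'$ are invented stand-ins for the learned constraint-value networks:

```python
import numpy as np

c = np.array([1.0, 2.0, 0.5])                # hypothetical agent curves c_i
zs = np.linspace(-1.0, 5.0, 601)             # candidate thresholds z'
feasible = c[:, None] - zs[None, :] <= 0.0   # V^h_i(z') <= 0 ?
z_i = zs[feasible.argmax(axis=1)]            # smallest feasible z' per agent
z_star = z_i.max()                           # combine: z* = max_i z_i*
assert abs(z_star - 2.0) < 0.02
```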
This decomposition makes the global zero-violation constraint tractable in distributed, scalable settings, enabling each agent to autonomously compute the cost threshold compatible with its local safety requirement while maintaining global guarantees.
4. Algorithmic Structure: Centralized Training and Distributed Execution
4.1 Centralized Training (CT)
Def-MARL employs a $z$-conditioned decentralized policy $\pi_i(o_i, z)$ alongside two value networks: a centralized cost-value $V^\ell$ and a per-agent constraint-value $V^h_i$ (which estimates the maximum future $h_i$). Representations use a graph neural network or transformer backbone.
During training, rollouts proceed by augmenting the state with $z$, which is decremented by the incurred cost at each step ($z^{k+1} = z^k - \ell(x^k, u^k)$). Advantages are computed per agent using GAE, and network parameters are updated with PPO and temporal-difference regression as detailed in the provided pseudocode (Zhang et al., 21 Apr 2025).
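A minimal sketch of the $z$-augmented rollout loop. The per-step update $z \leftarrow z - \ell$ follows the epigraph-form RL convention; `policy`, `dynamics`, and `cost` are hypothetical stand-ins:

```python
def rollout(x0, z0, policy, dynamics, cost, horizon):
    """Collect a z-augmented trajectory: the auxiliary variable z tracks
    the remaining cost budget and is part of the policy input."""
    x, z, traj = x0, z0, []
    for _ in range(horizon):
        u = policy(x, z)            # z-conditioned policy
        c = cost(x, u)
        traj.append((x, z, u, c))
        x = dynamics(x, u)
        z = z - c                   # budget shrinks by the incurred cost
    return traj

# toy 1-D system: drive toward the origin under quadratic cost
traj = rollout(
    x0=1.0, z0=2.0,
    policy=lambda x, z: -0.5 * x,
    dynamics=lambda x, u: x + u,
    cost=lambda x, u: x * x,
    horizon=3,
)
assert len(traj) == 3
```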
4.2 Distributed Execution (DE)
At execution, each agent observes $o_i$ and solves a one-dimensional root-finding problem to compute $z_i^* = \min\{z' : V^h_i(o_i; z') \le -\varepsilon\}$,
where $\varepsilon > 0$ is a safety buffer against neural-network estimation error. If communication is available, agents exchange their $z_i^*$ values and set $z = \max_i z_i^*$; otherwise, $z_i^*$ is used directly. Agents then execute $u_i = \pi_i(o_i, z)$. This realizes a scalable distributed implementation that preserves zero-violation safety.
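A minimal sketch of the execution-time procedure, using plain bisection as a stand-in for a faster 1-D root-finder; the curves $V^h_i(z') = c_i - z'$ and the buffer value are invented:

```python
def solve_z(Vh, eps=0.05, lo=-1.0, hi=10.0, tol=1e-8):
    # smallest z' with Vh(z') <= -eps; assumes Vh is nonincreasing in z',
    # so 1-D bisection applies (eps buffers NN estimation error)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Vh(mid) <= -eps:
            hi = mid
        else:
            lo = mid
    return hi

z_locals = [solve_z(lambda z, c=c: c - z) for c in (1.0, 2.0, 0.5)]
z_shared = max(z_locals)          # with communication: agree on max_i z_i*
assert abs(z_shared - 2.05) < 1e-6
```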
5. Theoretical Properties
Def-MARL preserves the standard dynamic programming structure by adopting the recursion $V(x^k, z^k) = \max\{h(x^k),\; V(x^{k+1}, z^{k+1})\}$ with $z^{k+1} = z^k - \ell(x^k, u^k)$, per Proposition 1, enabling valid policy gradients and value iteration.
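The recursion can be verified numerically on a toy deterministic trajectory; the costs and $h$ values below are invented:

```python
costs = [1.0, 2.0, 0.5]        # per-step costs ell(x^k, u^k)
hvals = [-1.0, -0.5, -2.0]     # per-step constraint values h(x^k)

def total_value(k, z):
    # direct definition: max{ worst future h, remaining cost minus budget }
    return max(max(hvals[k:]), sum(costs[k:]) - z)

def recursive_value(k, z):
    # one-step recursion with the budget update z^{k+1} = z^k - ell^k
    if k == len(costs) - 1:
        return max(hvals[k], costs[k] - z)
    return max(hvals[k], recursive_value(k + 1, z - costs[k]))

for z in (0.0, 1.5, 5.0):
    assert abs(total_value(0, z) - recursive_value(0, z)) < 1e-12
```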
The decomposition theorem ensures the distributed formulation exactly recovers the centralized solution (under mild conditions). Once the epigraph form is established, the inner optimization problem becomes a classical single-agent "avoid"-style RL problem augmented with $z$, and policy optimization by PPO converges almost surely to a locally optimal, safe policy per multi-timescale stochastic approximation results (Zhang et al., 21 Apr 2025).
6. Empirical Performance and Evaluation
6.1 Simulation Results
Def-MARL was evaluated across 8 multi-agent tasks: six from the MPE suite (Target, Spread, Formation, Line, Corridor, ConnectSpread) and two from Safe Multi-Agent MuJoCo (Safe HalfCheetah and Safe Coupled HalfCheetah). Metrics include cumulative cost $V^\ell$, safety rate (fraction of runs/agents with zero violations), and training stability. Baselines are penalty methods with fixed penalty coefficients and MAPPO-Lagrangian.
Key findings:
- Def-MARL achieves near-zero-violation and low cost, consistently dominating the cost-safety tradeoff front.
- Penalty and Lagrangian baselines either fail in safety (low penalty) or are overly conservative (high penalty), and exhibit instability under zero-violation constraints.
- Def-MARL demonstrates stable training, scalability to large team sizes in simulation with GPU training, and generalization to larger numbers of agents at test time (at constant spatial density), maintaining both constraint satisfaction and near-optimal cost.
6.2 Hardware Demonstrations
Def-MARL's distributed execution was validated in real-world Crazyflie quadcopter swarms on scenarios including:
- Corridor crossing (N=3,7): Swarm traversal without collision.
- Inspect (N=2): Agents visually track a moving target with collaborative turn-taking and obstacle avoidance.
Comparisons with decentralized MPC (DMPC) and centralized MPC (CMPC) show 100% safety and success for Def-MARL, while baselines either fail to maintain safety or become trapped in local minima.
7. Implementation Considerations
Empirical insights for practitioners implementing Def-MARL include:
- Choose the outer search interval $[z_{\min}, z_{\max}]$ with $z_{\min}$ a small negative value and $z_{\max}$ a conservative upper bound on the worst-case total cost; sensitivity to these bounds is mild (variations on the order of 50% have little effect).
- The buffer $\varepsilon$ in the outer root-finding process enhances safety in the presence of neural-network estimation error.
- Recommended architectures are GNNs or transformers with $2$–$3$ layers and $32$–$64$ hidden units, with a GRU added to handle variable-sized inputs.
- Standard PPO hyperparameters apply: observation dropout, standard discount and GAE-$\lambda$ settings, clip ratio $0.25$, and a small entropy bonus.
- Chandrupatla’s method is advocated for 1D monotonic outer root-finding.
- The CTDE approach necessitates access to the full system state during training for the epigraph-constrained inner optimization, but execution relies only on each agent's own local constraint-value network and $z$-conditioned policy (Zhang et al., 21 Apr 2025).
This approach encompasses all steps from defining the MASOCP to implementing distributed root-finding and $z$-conditioned policy evaluation, establishing Def-MARL as a state-of-the-art algorithm for safe, scalable multi-agent optimal control under strict zero-violation safety constraints.