
Def-MARL: Distributed Epigraph Multi-Agent RL

Updated 4 March 2026
  • The paper presents a novel epigraph reformulation for multi-agent safe optimal control that decomposes constraints into distributed per-agent problems.
  • The methodology leverages CTDE with a z-conditioned policy to enable provably zero-violation safety while minimizing cost.
  • Empirical evaluations and hardware demonstrations validate Def-MARL’s scalable, stable, and superior performance compared to traditional penalty and Lagrangian methods.

Distributed Epigraph Form Multi-Agent Reinforcement Learning (Def-MARL) is a centralized training, distributed execution (CTDE) algorithm that addresses the problem of multi-agent safe optimal control by recasting the underlying constrained Markov decision process (CMDP) into an epigraph form. Def-MARL achieves stable, high-performance, and provably safe policy learning for collaborative robotic tasks with strict zero-violation safety requirements by enabling distributed constraint satisfaction and cost minimization through a novel decomposition of the epigraph-constrained optimization problem (Zhang et al., 21 Apr 2025).

1. Mathematical Formulation of Multi-Agent Safe Optimal Control

The multi-agent safe optimal control problem (MASOCP) considers a system of $N$ agents interacting over a state space $\mathcal{X} \subset \mathbb{R}^n$. Each agent observes $o_i = O_i(x) \in \mathcal{O}$ and acts locally with $u_i \in \mathcal{U}_i$, yielding a joint action $u \in \mathcal{U}_1 \times \cdots \times \mathcal{U}_N$. The system dynamics evolve as $x^{k+1} = f(x^k, u^k)$. A global cost $\ell: \mathcal{X} \times \mathcal{U} \rightarrow \mathbb{R}$ and the associated infinite-horizon cost value,

$$V^\ell(x^0; \pi) = \sum_{k=0}^{\infty} \ell(x^k, \pi(x^k)),$$

are minimized subject to safety constraints. For safety, each agent $i$ maintains a local avoid-set function $h_i(o_i)$; the joint system is safe iff $h(x) = \max_i h_i(O_i(x)) \le 0$. The safety constraint is imposed in its strictest form, zero violation: $V^h(x^0; \pi) = \max_{k \geq 0} h(x^k) \leq 0$ at all times.
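As a concrete illustration of these definitions, a minimal sketch (with made-up avoid values, not outputs from the paper's tasks) computes the joint constraint value from the per-agent values:

```python
# Toy sketch of the safety quantities above; the h values are made-up
# numbers, not outputs of the paper's tasks.

def joint_h(per_agent_h):
    """Joint avoid value h(x) = max_i h_i(O_i(x)); the state is safe iff <= 0."""
    return max(per_agent_h)

def constraint_value(trajectory_h):
    """V^h(x^0; pi) = max over time of the joint avoid value along a rollout."""
    return max(joint_h(h_k) for h_k in trajectory_h)

# Two agents over three steps; every joint value is <= 0, so V^h <= 0
# and the rollout satisfies the zero-violation constraint.
traj = [[-0.5, -0.2], [-0.1, -0.4], [-0.3, -0.6]]
print(constraint_value(traj))  # -0.1
```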

The MASOCP is thus:
$$\begin{aligned} \min_{\pi_1,\dots,\pi_N}\; & V^\ell(x^0; \pi) \\ \text{s.t.}\quad & V^h(x^0; \pi) \le 0, \end{aligned}$$
where $\pi(x) = [\pi_1(o_1); \ldots; \pi_N(o_N)]$.

2. Epigraph Reformulation and Decomposition

The epigraph form introduces an auxiliary scalar $z$ as an upper bound on the cost, recasting the problem as:
$$\begin{aligned} \min_{\pi, z}\; & z \\ \text{s.t.}\quad & V^h(x^0; \pi) \leq 0, \quad V^\ell(x^0; \pi) \leq z. \end{aligned}$$
Following the approach of Boyd and Vandenberghe, this splits the optimization into an outer and an inner problem:
$$\begin{aligned} & \min_z\; z \quad \text{(outer)} \\ & \text{s.t.}\quad \min_{\pi} \left[ \max\{ V^h(x^0; \pi),\, V^\ell(x^0; \pi) - z \} \right] \leq 0. \quad \text{(inner)} \end{aligned}$$

Within this structure, the inner problem optimizes over policies holding $z$ fixed; the outer problem line-searches over $z$. In the MASOCP context, the safety constraint decomposes per agent:
$$V^h(x^0; \pi) = \max_i V^h_i(o_i^0; \pi), \qquad V^h_i(o_i^0; \pi) = \max_{k \geq 0} h_i(o_i^k),$$
and each agent's safety is combined with the shared cost into a per-agent "total value":
$$V_i(x^0, z; \pi) = \max\{ V^h_i(o_i^0; \pi),\, V^\ell(x^0; \pi) - z \}.$$
The collective constraint then becomes $\min_\pi \max_i V_i(x^0, z; \pi) \leq 0$.
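The per-agent total value and the resulting collective constraint can be sketched as follows; the numeric values of $V^h_i$ and $V^\ell$ are illustrative stand-ins, not learned quantities:

```python
# Sketch of the per-agent total value and the inner-problem constraint.
# The V^h_i and V^ell numbers below are illustrative, not learned values.

def total_value(v_h_i, v_ell, z):
    """V_i(x, z) = max{V^h_i, V^ell - z}: agent i's combined safety/cost value."""
    return max(v_h_i, v_ell - z)

def inner_constraint(v_h_list, v_ell, z):
    """max_i V_i(x, z); the pair (pi, z) is feasible iff this is <= 0."""
    return max(total_value(v_h, v_ell, z) for v_h in v_h_list)

# With V^h_i = (-0.3, -0.1) and V^ell = 5.0, the budget z = 5.5 is feasible:
print(inner_constraint([-0.3, -0.1], 5.0, 5.5))  # -0.1, i.e. <= 0
# ...while z = 4.0 is too tight a cost budget:
print(inner_constraint([-0.3, -0.1], 5.0, 4.0))  # 1.0 > 0
```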

3. Distributed Epigraph Decomposition for MARL

A central result is that, under mild uniqueness assumptions, the outer epigraph problem can be distributed across agents. Each agent $i$ independently solves
$$z_i^* = \min_{z'}\; z' \quad \text{s.t.}\quad V^h_i(o_i^0; \pi(\cdot, z')) \leq 0,$$
and the per-agent values are combined via $z^* = \max_i z_i^*$ to recover the overall solution. When communication is limited, omitting the final max operation incurs negligible suboptimality while maintaining safety.

This decomposition makes the global zero-violation constraint tractable in distributed, scalable settings, enabling each agent to autonomously compute the cost threshold compatible with its local safety requirement while maintaining global guarantees.

4. Algorithmic Structure: Centralized Training and Distributed Execution

4.1 Centralized Training (CT)

Def-MARL employs a $z$-conditioned decentralized policy $\pi_\theta(o_i, z)$ alongside two value networks: a centralized cost value $V^\ell_\phi(x, z)$ and a per-agent constraint value $V^h_\psi(o_i, z)$ (which estimates the maximum future $h_i$). Representations use a graph neural network or transformer backbone.

During training, rollouts proceed by augmenting the state with $z^k$:
$$x^{k+1} = f(x^k, \pi(o^k, z^k)), \qquad z^{k+1} = z^k - \ell(x^k, \pi(o^k, z^k)).$$
Advantages are computed per agent using GAE, and the network parameters $(\theta, \phi, \psi)$ are updated with PPO and temporal-difference regression, as detailed in the paper's pseudocode (Zhang et al., 21 Apr 2025).
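The $z$-augmented rollout above can be sketched as follows; the dynamics, cost, and policy here are toy stand-ins for the learned and true components:

```python
# Minimal sketch of a z-augmented rollout; the dynamics f, cost ell, and
# policy pi below are toy stand-ins for the paper's learned components.

def rollout(x0, z0, f, ell, pi, horizon):
    """Iterate x^{k+1} = f(x^k, u^k), z^{k+1} = z^k - ell(x^k, u^k)."""
    x, z = x0, z0
    for _ in range(horizon):
        u = pi(x, z)          # z-conditioned action
        cost = ell(x, u)      # cost at the *current* state
        x = f(x, u)           # state update
        z = z - cost          # remaining cost budget shrinks by the step cost
    return x, z

# 1-D toy system: move halfway to the origin each step, with cost = |u|.
x_T, z_T = rollout(4.0, 10.0,
                   f=lambda x, u: x + u,
                   ell=lambda x, u: abs(u),
                   pi=lambda x, z: -0.5 * x,
                   horizon=3)
print(x_T, z_T)  # 0.5 6.5
```

Note the ordering: the step cost must be evaluated at the current state before the state is advanced, matching the recursion's indices.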

4.2 Distributed Execution (DE)

At execution, each agent observes $o_i$ and solves a one-dimensional root-finding problem to compute

$$z_i = \min\{ z' \mid V^h_\psi(o_i, z') \leq -\xi \},$$

where $\xi \geq 0$ is a safety buffer against neural network estimation error. If communication is available, agents exchange $z_i$ values and set $z = \max_i z_i$; otherwise, $z_i$ is used directly. Each agent then executes $u_i = \pi_\theta(o_i, z_i)$. This realizes a scalable distributed implementation that preserves zero-violation safety.
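The execution procedure can be sketched as follows; plain bisection stands in for Chandrupatla's method, and hand-made monotone stubs stand in for the learned constraint-value networks:

```python
# Execution-time sketch: each agent finds the smallest z with
# V^h_psi(o_i, z) <= -xi by bisection (the paper advocates Chandrupatla's
# method; plain bisection is used here for brevity). The value functions
# below are hand-made monotone stubs, not trained networks.

def min_safe_z(vh, xi, z_lo, z_hi, tol=1e-8):
    """Smallest z in [z_lo, z_hi] with vh(z) <= -xi, for nonincreasing vh."""
    while z_hi - z_lo > tol:
        mid = 0.5 * (z_lo + z_hi)
        if vh(mid) <= -xi:
            z_hi = mid        # feasible: tighten from above
        else:
            z_lo = mid        # infeasible: raise the lower bound
    return z_hi

xi = 0.1                            # safety buffer against estimation error
stubs = [lambda z: 0.8 - 0.4 * z,   # agent 1's stub constraint value
         lambda z: 1.6 - 0.4 * z]   # agent 2's stub constraint value
z_each = [min_safe_z(v, xi, 0.0, 10.0) for v in stubs]
z_shared = max(z_each)              # with communication, all agents adopt the max
print([round(z, 3) for z in z_each], round(z_shared, 3))  # [2.25, 4.25] 4.25
```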

5. Theoretical Properties

Def-MARL preserves the standard dynamic programming structure via the recursion
$$V(x^k, z^k; \pi) = \max\{ h(x^k),\, V(x^{k+1}, z^{k+1}; \pi) \}, \qquad z^{k+1} = z^k - \ell(x^k, \pi(x^k)),$$
per Proposition 1, enabling valid policy gradients and value iteration.
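Proposition 1's recursion can be checked numerically on a toy trajectory; note the horizon is truncated here and the terminal value $\max\{h(x^T), -z^T\}$ is an illustrative choice, not part of the paper's infinite-horizon statement:

```python
# Numerical check of the recursion on a toy finite-horizon trajectory
# (the paper's setting is infinite-horizon; here the rollout is truncated
# and the terminal value is taken as max{h(x^T), -z^T} for illustration).

h_vals   = [-0.4, -0.1, -0.3, -0.2]   # h(x^0), ..., h(x^3): toy numbers
ell_vals = [1.0, 2.0, 0.5]            # step costs ell^0, ell^1, ell^2
z0 = 5.0

# Direct definition of the total value: max{ max_k h(x^k), sum_k ell^k - z^0 }.
direct = max(max(h_vals), sum(ell_vals) - z0)

# Backward recursion: V(x^k, z^k) = max{h(x^k), V(x^{k+1}, z^{k+1})},
# where the cost enters only through z^T = z^0 - sum_k ell^k.
z_T = z0 - sum(ell_vals)
v = max(h_vals[-1], -z_T)             # terminal value
for k in reversed(range(len(ell_vals))):
    v = max(h_vals[k], v)
print(direct, v)  # both -0.1: the recursion matches the direct definition
```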

The decomposition theorem ensures the distributed formulation exactly recovers the centralized solution (under mild conditions). Once the epigraph form is established, the inner optimization becomes a classical single-agent "avoid" RL problem on the augmented state $(x, z)$, and policy optimization by PPO converges almost surely to a locally optimal, safe policy per multi-timescale stochastic approximation results (Zhang et al., 21 Apr 2025).

6. Empirical Performance and Evaluation

6.1 Simulation Results

Def-MARL was evaluated across 8 multi-agent tasks: six from the MPE suite (Target, Spread, Formation, Line, Corridor, ConnectSpread) and two from Safe Multi-Agent MuJoCo (Safe HalfCheetah and Safe Coupled HalfCheetah). Metrics include cumulative cost $\sum \ell$, safety rate (fraction of runs/agents with zero violations), and training stability. Baselines are penalty methods ($\beta \cdot \text{ReLU}(h)$) and MAPPO-Lagrangian.

Key findings:

  • Def-MARL achieves near-zero-violation and low cost, consistently dominating the cost-safety tradeoff front.
  • Penalty and Lagrangian baselines either fail in safety (low penalty) or are overly conservative (high penalty), and exhibit instability under zero-violation constraints.
  • Def-MARL demonstrates stable training, scalability to $N = 16$ in simulation with GPU training, and generalization to $N \approx 512$ agents at test time (at constant spatial density), maintaining both constraint satisfaction and cost optimality.

6.2 Hardware Demonstrations

Def-MARL's distributed execution was validated in real-world Crazyflie quadcopter swarms on scenarios including:

  • Corridor crossing ($N = 3, 7$): swarm traversal without collision.
  • Inspect ($N = 2$): agents visually track a moving target with collaborative turn-taking and obstacle avoidance.

Comparisons with decentralized MPC (DMPC) and centralized MPC (CMPC) show 100% safety and success for Def-MARL, while baselines either fail to maintain safety or become trapped in local minima.

7. Implementation Considerations

Empirical insights for practitioners implementing Def-MARL include:

  • Set $[z_{\min}, z_{\max}]$ with $z_{\min}$ slightly negative and $z_{\max}$ a conservative upper bound on the worst-case total cost; parameter sensitivity is mild ($\pm 50\%$).
  • The buffer $\xi$ in the outer root-finding process enhances safety in the presence of neural network estimation error.
  • Recommended architectures are GNNs or transformers with $2$–$3$ layers and $32$–$64$ hidden units, with a GRU added to handle variable $z$ inputs.
  • Standard PPO hyperparameters apply: observation dropout, $\gamma = 0.99$, $\lambda_{\text{GAE}} = 0.95$, clip $= 0.25$, entropy coefficient $\approx 0.01$.
  • Chandrupatla's method is advocated for the one-dimensional monotone outer root-finding.
  • The CTDE approach requires access to the full system state during training for the epigraph-constrained inner optimization, but execution relies only on each agent's own local constraint-value network and $z$-conditioned policy (Zhang et al., 21 Apr 2025).

This approach encompasses all steps from defining MASOCP to implementing distributed root-finding and zz-conditioned policy evaluation, establishing Def-MARL as a state-of-the-art algorithm for safe scalable multi-agent optimal control under strict zero-violation safety constraints.

