
DA-MAPPO: Double Actions Multi-Agent PPO

Updated 30 November 2025
  • The paper introduces a two-stage decoupled action sampling method that efficiently handles high-dimensional, coupled action spaces.
  • It leverages separate Gaussian and Bernoulli policies to model continuous trajectory control and discrete intent-response actions with explicit causal conditioning.
  • Empirical results indicate faster convergence, higher final rewards, and more stable agent coordination compared to standard MAPPO in complex AAV and IoT applications.

Double Actions based Multi-Agent Proximal Policy Optimization (DA-MAPPO) is a multi-agent policy optimization algorithm designed to address high-dimensional, coupled action domains, with particular relevance to autonomous aerial vehicle (AAV) decision-making and user-intent responsive networking in IoT applications over 6G wireless links. DA-MAPPO is formulated as an extension of MAPPO, introducing a two-stage, decoupled action sampling methodology that preserves high-order interdependencies while enabling tractable exploration and efficient credit assignment in complex joint action spaces (Hu et al., 23 Nov 2025).

1. Motivations and Problem Context

Large-scale AAV-assisted IoT deployments demand coordinated decisions in environments with high-dimensional, hybrid action requirements—such as trajectory control and user-intent-based resource allocation. Vanilla approaches, including standard MAPPO, require the joint modeling and sampling of mixed continuous and discrete action spaces per agent. This leads to exponential growth in action sampling complexity and impedes learning of dependencies between mission-critical control actions (e.g., flight paths) and network intent responses. DA-MAPPO addresses these obstacles by factorizing each agent’s decision at a given step into trajectory selection (continuous) and intent-response (discrete), with explicit causal conditioning between them, thereby efficiently scaling to realistic robotic and network optimization problems (Hu et al., 23 Nov 2025).

2. Policy Structure: Double Action Decoupling and Network Cascade

DA-MAPPO splits the per-agent joint action $A_t^{V_i}$ into two distinct factors:

  • Trajectory Action: $A_t^{1,V_i} = [v_x^f, v_y^f]$ (continuous, representing horizontal flight velocities)
  • Intent-Response Action: $A_t^{2,V_i} = [m_t^{V_i}, m_t^{V_i,\mathrm{BS}}]$ (binary vectors, denoting per-user intent retention/discard decisions and forwarding status to the base station)

Each agent maintains two independently parameterized stochastic policies:

  • $\pi_{\theta_i}^1(A_t^{1,V_i} \mid s_t^{V_i})$: Outputs a Gaussian distribution over continuous flight controls given the local state.
  • $\pi_{\theta_i}^2(A_t^{2,V_i} \mid A_t^{1,V_i}, s_t^{V_i})$: Outputs independent Bernoulli logits for intent flags, explicitly conditioned on both the observation and the trajectory action previously sampled.

By cascading $\pi^2$ on the sampled output of $\pi^1$, DA-MAPPO preserves high-order dependencies between locomotion and network behavior while eliminating the need to model their full joint distribution. This approach reduces the effective action space to the sum, rather than the product, of the subspace sizes, sharply improving tractability (Hu et al., 23 Nov 2025).
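
A minimal sketch of the cascaded sampling step is shown below, assuming PyTorch distribution utilities; the module names `traj_head` and `intent_head` and the exact conditioning of the intent head on the sampled trajectory action are illustrative placeholders rather than the paper's implementation.

```python
import torch
from torch.distributions import Normal, Bernoulli

def sample_double_action(state_enc, traj_head, intent_head):
    """Two-stage decoupled sampling: continuous trajectory first, then
    discrete intent flags conditioned on the sampled trajectory action.
    `traj_head` and `intent_head` are hypothetical modules returning
    Gaussian parameters and Bernoulli logits, respectively."""
    # Stage 1: Gaussian policy pi^1 over [v_x, v_y]
    mean, log_std = traj_head(state_enc)            # each of shape (batch, 2)
    traj_dist = Normal(mean, log_std.exp())
    a1 = traj_dist.sample()                         # continuous trajectory action
    logp1 = traj_dist.log_prob(a1).sum(-1)          # independent dims: sum log-probs

    # Stage 2: Bernoulli policy pi^2, conditioned on the state and sampled a1
    logits = intent_head(torch.cat([state_enc, a1], dim=-1))
    intent_dist = Bernoulli(logits=logits)
    a2 = intent_dist.sample()                       # binary intent / BS-forward flags
    logp2 = intent_dist.log_prob(a2).sum(-1)

    # Factorized log-probability: log F = log pi^1 + log pi^2
    return (a1, a2), logp1 + logp2
```

Because the two log-probabilities are simply summed, the downstream PPO machinery only ever sees a single scalar log-probability per agent, exactly as in the factorized objective of Section 3.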

3. Mathematical Formulation and Optimization Objective

Let $\mathcal{F}_{\pi_{\theta_i}}(A_t^{V_i} \mid s_t^{V_i})$ denote the factorized policy probability:

$$\mathcal{F}_{\pi_{\theta_i}}(A_t^{V_i}\mid s_t^{V_i}) = \mathcal{F}_{\pi_{\theta_i}^1}(A_t^{1,V_i}\mid s_t^{V_i}) \;\times\; \mathcal{F}_{\pi_{\theta_i}^2}(A_t^{2,V_i}\mid A_t^{1,V_i}, s_t^{V_i})$$

This factorized structure allows the application of a clipped PPO-style surrogate objective, computed as:

$$L^{\mathrm{CLIP}}_i(\theta_i) = \mathbb{E}_t\!\left[ \min\!\left( r_{t}^{V_i}(\theta_i)\,O_t^{V_i},\; \mathrm{clip}\!\left(r_{t}^{V_i}(\theta_i),\,1-\varepsilon,\,1+\varepsilon\right) O_t^{V_i} \right) \right]$$

with:

$$r_{t}^{V_i}(\theta_i) = \frac{\mathcal{F}_{\pi_{\theta_i}}(A_t^{V_i}\mid s_t^{V_i})}{\mathcal{F}_{\pi_{\theta_i}^{\mathrm{old}}}(A_t^{V_i}\mid s_t^{V_i})}$$

The advantage function $O_t^{V_i}$ is computed using GAE or an $n$-step return, and the total per-agent loss includes entropy and value regularization:

$$L_i(\theta_i, \phi_i) = -L^{\mathrm{CLIP}}_i(\theta_i) + c_1\,H_i(\theta_i) + c_2\,L_i^{\mathrm{VF}}(\phi_i)$$

where $H_i$ is the entropy bonus encouraging exploration, and $L_i^{\mathrm{VF}}$ is the value loss (Hu et al., 23 Nov 2025).
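
A minimal sketch of this loss under the factorized policy probability follows. The tensor names, the coefficient defaults, and the sign convention for the entropy term are assumptions: here the entropy bonus is subtracted so that minimizing the loss encourages exploration, which is the common PPO convention and may differ in sign from the paper's bookkeeping.

```python
import torch

def da_mappo_loss(logp1, logp2, old_logp1, old_logp2,
                  advantages, values, returns, entropy,
                  clip_eps=0.2, c1=0.01, c2=0.5):
    """PPO-clipped loss with the factorized policy probability
    F = pi^1 * pi^2, i.e. log F = log pi^1 + log pi^2."""
    # Ratio r_t = F_new / F_old, computed in log-space for numerical stability
    log_ratio = (logp1 + logp2) - (old_logp1 + old_logp2)
    ratio = log_ratio.exp()

    # Clipped surrogate objective L^CLIP
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_obj = torch.min(unclipped, clipped).mean()

    # Value regression loss L^VF
    value_loss = (returns - values).pow(2).mean()

    # Total loss: maximize the surrogate and entropy, minimize the value error
    return -policy_obj - c1 * entropy.mean() + c2 * value_loss
```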

4. Algorithmic Workflow and Pseudocode

The algorithm iterates over episodes as follows:

  1. Trajectory Collection: For each agent at each environment step, observe $s_t^{V_i}$, sample $A^1 \sim \pi_{\theta_i}^1(\cdot \mid s)$, then $A^2 \sim \pi_{\theta_i}^2(\cdot \mid A^1, s)$, apply the joint action $[A^1, A^2]$, and store state-action-reward transitions along with the log-probabilities.
  2. Advantage Computation: Compute the return $G_t^0$ and the corresponding advantages $O_t^{V_i}$ via $n$-step returns or GAE (see the sketch after this list).
  3. Policy Update: For $K$ epochs per batch, minibatch transitions are used to compute the surrogate loss, entropy bonus, and value loss; the networks are updated with gradient steps on their respective parameters.
  4. Hyperparameterization: Typical PPO settings are adopted for the learning rates, PPO clipping parameter, discount factor $\gamma_p$, entropy and value loss coefficients, horizon $n$, batch size $M$, and rollout length $T$, with explicit values tuned to the specific task (Hu et al., 23 Nov 2025).
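
Step 2 can be realized with the standard GAE($\lambda$) recursion; the sketch below is a generic implementation, not code from the paper, and the discount/trace values are placeholders.

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Standard GAE-lambda recursion over a rollout of length T.
    rewards, values, dones: float tensors of shape (T,); last_value: bootstrap V(s_T)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values    # regression targets for the value loss
    return advantages, returns
```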

5. Network Architecture and Feature Design

DA-MAPPO leverages domain-appropriate neural structures:

  • Shared Encoder: Each raw state $s_t^{V_i}$ is processed by one or two fully connected layers (ReLU activations).
  • Trajectory Subnetwork: MLP with 2–3 FC layers (e.g., 128→64 units, ReLU), outputting Gaussian means and log standard deviations for $[v_x^f, v_y^f]$.
  • Intent-Response Subnetwork: MLP whose input is the concatenation of the encoded state and $\pi^1$'s Gaussian parameters (dimension $D_s + 4$), with 2–3 FC layers (128→64), outputting Bernoulli logits for all intent and BS-forward flags.
  • Value Network: Shares the encoder structure with $\pi^1$ and outputs scalar value estimates.

This modular design enables efficient feature reuse and backpropagation of both trajectory and intent signals for each agent (Hu et al., 23 Nov 2025).
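
The architecture above can be sketched as follows; layer widths follow the ranges quoted in this section, but the class and module names, the encoder width, and the exact input wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DAPolicy(nn.Module):
    """Sketch of one agent's networks: shared encoder, Gaussian trajectory head,
    Bernoulli intent head fed with the encoder output plus the trajectory head's
    Gaussian parameters (D_s + 4 inputs), and a scalar value head."""
    def __init__(self, state_dim, num_intent_flags, enc_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, enc_dim), nn.ReLU())
        # pi^1: outputs mean and log-std for [v_x, v_y] -> 4 parameters
        self.traj_head = nn.Sequential(
            nn.Linear(enc_dim, 64), nn.ReLU(), nn.Linear(64, 4))
        # pi^2: takes the encoded state concatenated with pi^1's Gaussian parameters
        self.intent_head = nn.Sequential(
            nn.Linear(enc_dim + 4, 64), nn.ReLU(),
            nn.Linear(64, num_intent_flags))
        # Value head: shares the encoder, outputs a scalar estimate
        self.value_head = nn.Sequential(
            nn.Linear(enc_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state):
        h = self.encoder(state)
        mean, log_std = self.traj_head(h).chunk(2, dim=-1)
        intent_logits = self.intent_head(torch.cat([h, mean, log_std], dim=-1))
        value = self.value_head(h).squeeze(-1)
        return (mean, log_std), intent_logits, value
```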

6. Dependency Handling and Empirical Advantages over Vanilla MAPPO

Vanilla MAPPO jointly parameterizes the full action space and therefore faces exponential sampling complexity in high-dimensional continuous-discrete regimes; it also tends to capture dependencies between sub-actions, such as flight versus network intent decisions, only weakly. DA-MAPPO's staged cascade samples the continuous trajectory first, restricting the conditional support of the subsequent discrete decisions, and explicitly passes $\pi^1$'s output distribution into $\pi^2$. This design both preserves inter-action correlations and facilitates disentangled exploration across the subspaces. Empirical results indicate that the method converges faster, achieves higher final rewards, and supports more stable agent coordination than standard MAPPO in coupled, high-dimensional scenarios (Hu et al., 23 Nov 2025).

7. Implementation Considerations and Practical Relevance

DA-MAPPO can be instantiated using two small MLPs (policy networks) and one value network per agent. The core requirements are (i) two-stage action sampling, (ii) standard PPO-clipped loss with factored policy probability, and (iii) advantage-based updates. The approach is broadly applicable to multi-agent control problems in networked robotics, particularly those combining continuous and discrete high-order dependencies where raw MAPPO is inefficient (Hu et al., 23 Nov 2025).

| Component | Role | Notes |
| --- | --- | --- |
| $\pi^1$ (Trajectory) | Samples the continuous action | Gaussian head; conditioned on $s$ |
| $\pi^2$ (Intent) | Samples the discrete action | Bernoulli head; conditioned on $s$, $A^1$ |
| $\pi^{VF}$ (Value) | Estimates the value function | Shares encoder with $\pi^1$ |

A plausible implication is that DA-MAPPO can generalize to any MARL context where compound, conditional actions arise and where joint-action combinatorics are prohibitive for monolithic policy design.

References (1)
