
DA-MAPPO: Double Actions Multi-Agent PPO

Updated 30 November 2025
  • The paper introduces a two-stage decoupled action sampling method that efficiently handles high-dimensional, coupled action spaces.
  • It leverages separate Gaussian and Bernoulli policies to model continuous trajectory control and discrete intent-response actions with explicit causal conditioning.
  • Empirical results indicate faster convergence, higher final rewards, and more stable agent coordination compared to standard MAPPO in complex AAV and IoT applications.

Double Actions based Multi-Agent Proximal Policy Optimization (DA-MAPPO) is a multi-agent policy optimization algorithm designed to address high-dimensional, coupled action domains, with particular relevance to autonomous aerial vehicle (AAV) decision-making and user-intent responsive networking in IoT applications over 6G wireless links. DA-MAPPO is formulated as an extension of MAPPO, introducing a two-stage, decoupled action sampling methodology that preserves high-order interdependencies while enabling tractable exploration and efficient credit assignment in complex joint action spaces (Hu et al., 23 Nov 2025).

1. Motivations and Problem Context

Large-scale AAV-assisted IoT deployments demand coordinated decisions in environments with high-dimensional, hybrid action requirements—such as trajectory control and user-intent-based resource allocation. Vanilla approaches, including standard MAPPO, require the joint modeling and sampling of mixed continuous and discrete action spaces per agent. This leads to exponential growth in action sampling complexity and impedes learning of dependencies between mission-critical control actions (e.g., flight paths) and network intent responses. DA-MAPPO addresses these obstacles by factorizing each agent’s decision at a given step into trajectory selection (continuous) and intent-response (discrete), with explicit causal conditioning between them, thereby efficiently scaling to realistic robotic and network optimization problems (Hu et al., 23 Nov 2025).

2. Policy Structure: Double Action Decoupling and Network Cascade

DA-MAPPO splits the per-agent joint action $A_t^{V_i}$ into two distinct factors:

  • Trajectory Action: $A_t^{1,V_i} = [v_x^f, v_y^f]$ (continuous, representing horizontal flight velocities)
  • Intent-Response Action: $A_t^{2,V_i} = [m_t^{V_i}, m_t^{V_i,\mathrm{BS}}]$ (binary vectors, denoting per-user intent retention/discard decisions and forwarding status to the base station)

Each agent maintains two independently parameterized stochastic policies:

  • $\pi_{\theta_i}^1(A_t^{1,V_i} \mid s_t^{V_i})$: Outputs a Gaussian distribution over continuous flight controls given the local state.
  • $\pi_{\theta_i}^2(A_t^{2,V_i} \mid A_t^{1,V_i}, s_t^{V_i})$: Outputs independent Bernoulli logits for intent flags, explicitly conditioned on both the observation and the trajectory action previously sampled.

By cascading $\pi^2$ on the sampled output of $\pi^1$, DA-MAPPO preserves high-order dependencies between locomotion and network behavior while eliminating the need to model their full joint distribution. This approach reduces the effective action space to the sum, rather than the product, of the subspace sizes, sharply improving tractability (Hu et al., 23 Nov 2025).
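
A minimal sketch of the cascaded sampling step is shown below, assuming PyTorch distribution utilities; the module names `traj_head` and `intent_head` and the exact conditioning of the intent head on the sampled trajectory action are illustrative placeholders rather than the paper's implementation.

```python
import torch
from torch.distributions import Normal, Bernoulli

def sample_double_action(state_enc, traj_head, intent_head):
    """Two-stage decoupled sampling: continuous trajectory first, then
    discrete intent flags conditioned on the sampled trajectory action.
    `traj_head` and `intent_head` are hypothetical modules returning
    Gaussian parameters and Bernoulli logits, respectively."""
    # Stage 1: Gaussian policy pi^1 over [v_x, v_y]
    mean, log_std = traj_head(state_enc)            # each of shape (batch, 2)
    traj_dist = Normal(mean, log_std.exp())
    a1 = traj_dist.sample()                         # continuous trajectory action
    logp1 = traj_dist.log_prob(a1).sum(-1)          # independent dims: sum log-probs

    # Stage 2: Bernoulli policy pi^2, conditioned on the state and sampled a1
    logits = intent_head(torch.cat([state_enc, a1], dim=-1))
    intent_dist = Bernoulli(logits=logits)
    a2 = intent_dist.sample()                       # binary intent / BS-forward flags
    logp2 = intent_dist.log_prob(a2).sum(-1)

    # Factorized log-probability: log F = log pi^1 + log pi^2
    return (a1, a2), logp1 + logp2
```

Because the two log-probabilities are simply summed, the downstream PPO machinery only ever sees a single scalar log-probability per agent, exactly as in the factorized objective of Section 3.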

3. Mathematical Formulation and Optimization Objective

Let $\mathcal{F}_{\pi_{\theta_i}}(A_t^{V_i} \mid s_t^{V_i})$ denote the factorized policy probability:

$$\mathcal{F}_{\pi_{\theta_i}}(A_t^{V_i}\mid s_t^{V_i}) = \mathcal{F}_{\pi_{\theta_i}^1}(A_t^{1,V_i}\mid s_t^{V_i}) \;\times\; \mathcal{F}_{\pi_{\theta_i}^2}(A_t^{2,V_i}\mid A_t^{1,V_i}, s_t^{V_i})$$

This factorized structure allows the application of a clipped PPO-style surrogate objective, computed as:

$$L^{\mathrm{CLIP}}_i(\theta_i) = \mathbb{E}_t\!\left[ \min\!\left( r_{t}^{V_i}(\theta_i)\,O_t^{V_i},\; \mathrm{clip}\!\left(r_{t}^{V_i}(\theta_i),\,1-\varepsilon,\,1+\varepsilon\right) O_t^{V_i} \right) \right]$$

with:

$$r_{t}^{V_i}(\theta_i) = \frac{\mathcal{F}_{\pi_{\theta_i}}(A_t^{V_i}\mid s_t^{V_i})}{\mathcal{F}_{\pi_{\theta_i}^{\mathrm{old}}}(A_t^{V_i}\mid s_t^{V_i})}$$

The advantage function $O_t^{V_i}$ is computed using GAE or an $n$-step return, and the total per-agent loss includes entropy and value regularization:

$$L_i(\theta_i, \phi_i) = -L^{\mathrm{CLIP}}_i(\theta_i) + c_1\,H_i(\theta_i) + c_2\,L_i^{\mathrm{VF}}(\phi_i)$$

where $H_i$ is the entropy bonus encouraging exploration, and $L_i^{\mathrm{VF}}$ is the value loss (Hu et al., 23 Nov 2025).
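
A minimal sketch of this loss under the factorized policy probability follows. The tensor names, the coefficient defaults, and the sign convention for the entropy term are assumptions: here the entropy bonus is subtracted so that minimizing the loss encourages exploration, which is the common PPO convention and may differ in sign from the paper's bookkeeping.

```python
import torch

def da_mappo_loss(logp1, logp2, old_logp1, old_logp2,
                  advantages, values, returns, entropy,
                  clip_eps=0.2, c1=0.01, c2=0.5):
    """PPO-clipped loss with the factorized policy probability
    F = pi^1 * pi^2, i.e. log F = log pi^1 + log pi^2."""
    # Ratio r_t = F_new / F_old, computed in log-space for numerical stability
    log_ratio = (logp1 + logp2) - (old_logp1 + old_logp2)
    ratio = log_ratio.exp()

    # Clipped surrogate objective L^CLIP
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_obj = torch.min(unclipped, clipped).mean()

    # Value regression loss L^VF
    value_loss = (returns - values).pow(2).mean()

    # Total loss: maximize the surrogate and entropy, minimize the value error
    return -policy_obj - c1 * entropy.mean() + c2 * value_loss
```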

4. Algorithmic Workflow and Pseudocode

The algorithm iterates over episodes as follows:

  1. Trajectory Collection: For each agent at each environment step, observe $s_t^{V_i}$, sample $A^1 \sim \pi_{\theta_i}^1(\cdot \mid s)$, then $A^2 \sim \pi_{\theta_i}^2(\cdot \mid A^1, s)$, apply the joint action $[A^1, A^2]$, and store state-action-reward transitions along with the log-probabilities.
  2. Advantage Computation: Compute the return $G_t^0$ and the corresponding advantages $O_t^{V_i}$ via $n$-step returns or GAE (see the sketch after this list).
  3. Policy Update: For $K$ epochs per batch, minibatch transitions are used to compute the surrogate loss, entropy bonus, and value loss; the networks are updated with gradient steps on their respective parameters.
  4. Hyperparameterization: Typical PPO settings are adopted for the learning rates, PPO clipping parameter, discount factor $\gamma_p$, entropy and value loss coefficients, horizon $n$, batch size $M$, and rollout length $T$, with explicit values tuned to the specific task (Hu et al., 23 Nov 2025).
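
Step 2 can be realized with the standard GAE($\lambda$) recursion; the sketch below is a generic implementation, not code from the paper, and the discount/trace values are placeholders.

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Standard GAE-lambda recursion over a rollout of length T.
    rewards, values, dones: float tensors of shape (T,); last_value: bootstrap V(s_T)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values    # regression targets for the value loss
    return advantages, returns
```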

5. Network Architecture and Feature Design

DA-MAPPO leverages domain-appropriate neural structures:

  • Shared Encoder: Each raw state $s_t^{V_i}$ is processed by one or two fully connected layers (ReLU activations).
  • Trajectory Subnetwork: MLP with 2–3 FC layers (e.g., 128→64 units, ReLU), outputting Gaussian means and log standard deviations for $[v_x^f, v_y^f]$.
  • Intent-Response Subnetwork: MLP whose input is the concatenation of the encoded state and $\pi^1$'s Gaussian parameters (dimension $D_s + 4$), with 2–3 FC layers (128→64), outputting Bernoulli logits for all intent and BS-forward flags.
  • Value Network: Shares the encoder structure with $\pi^1$ and outputs scalar value estimates.

This modular design enables efficient feature reuse and backpropagation of both trajectory and intent signals for each agent (Hu et al., 23 Nov 2025).
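
The architecture above can be sketched as follows; layer widths follow the ranges quoted in this section, but the class and module names, the encoder width, and the exact input wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DAPolicy(nn.Module):
    """Sketch of one agent's networks: shared encoder, Gaussian trajectory head,
    Bernoulli intent head fed with the encoder output plus the trajectory head's
    Gaussian parameters (D_s + 4 inputs), and a scalar value head."""
    def __init__(self, state_dim, num_intent_flags, enc_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, enc_dim), nn.ReLU())
        # pi^1: outputs mean and log-std for [v_x, v_y] -> 4 parameters
        self.traj_head = nn.Sequential(
            nn.Linear(enc_dim, 64), nn.ReLU(), nn.Linear(64, 4))
        # pi^2: takes the encoded state concatenated with pi^1's Gaussian parameters
        self.intent_head = nn.Sequential(
            nn.Linear(enc_dim + 4, 64), nn.ReLU(),
            nn.Linear(64, num_intent_flags))
        # Value head: shares the encoder, outputs a scalar estimate
        self.value_head = nn.Sequential(
            nn.Linear(enc_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state):
        h = self.encoder(state)
        mean, log_std = self.traj_head(h).chunk(2, dim=-1)
        intent_logits = self.intent_head(torch.cat([h, mean, log_std], dim=-1))
        value = self.value_head(h).squeeze(-1)
        return (mean, log_std), intent_logits, value
```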

6. Dependency Handling and Empirical Advantages over Vanilla MAPPO

Vanilla MAPPO jointly parameterizes the full action space and therefore faces exponential sampling complexity in high-dimensional continuous-discrete regimes; it also tends to capture dependencies between sub-actions, such as flight versus network intent decisions, only weakly. DA-MAPPO's staged cascade samples the continuous trajectory first, restricting the conditional support of the subsequent discrete decisions, and explicitly passes $\pi^1$'s output distribution into $\pi^2$. This design both preserves inter-action correlations and facilitates disentangled exploration across the subspaces. Empirical results indicate that the method converges faster, achieves higher final rewards, and supports more stable agent coordination than standard MAPPO in coupled, high-dimensional scenarios (Hu et al., 23 Nov 2025).

7. Implementation Considerations and Practical Relevance

DA-MAPPO can be instantiated using two small MLPs (policy networks) and one value network per agent. The core requirements are (i) two-stage action sampling, (ii) standard PPO-clipped loss with factored policy probability, and (iii) advantage-based updates. The approach is broadly applicable to multi-agent control problems in networked robotics, particularly those combining continuous and discrete high-order dependencies where raw MAPPO is inefficient (Hu et al., 23 Nov 2025).

| Component | Role | Notes |
| --- | --- | --- |
| $\pi^1$ (Trajectory) | Samples the continuous action | Gaussian head; conditioned on $s$ |
| $\pi^2$ (Intent) | Samples the discrete action | Bernoulli head; conditioned on $s$, $A^1$ |
| $\pi^{VF}$ (Value) | Estimates the value function | Shares encoder with $\pi^1$ |

A plausible implication is that DA-MAPPO can generalize to any MARL context where compound, conditional actions arise and where joint-action combinatorics are prohibitive for monolithic policy design.

References (1)
