IPPO-DM: Independent PPO with Dirichlet Modeling
- IPPO-DM is a novel distributed deep reinforcement learning algorithm that integrates independent PPO agents with Dirichlet modeling for adaptive traffic splitting in dynamic UAV networks.
- It formulates the routing challenge as a decentralized partially observable Markov decision process, enabling on-the-fly traffic allocation based on local observations and global delivery requirements.
- Empirical results demonstrate that IPPO-DM significantly enhances on-time packet delivery and reduces packet loss compared to heuristic and greedy baseline methods under high network loads.
Independent Proximal Policy Optimization with Dirichlet Modeling (IPPO-DM) is a distributed deep reinforcement learning (DRL) algorithm designed for traffic-adaptive multipath routing in dynamic multi-hop uncrewed aerial vehicle (UAV) networks. It addresses the challenge of meeting latency and reliability requirements in highly time-varying, partially observable multi-agent networking environments by combining an independent PPO (IPPO) framework with continuous stochastic Dirichlet policy modeling for on-the-fly traffic splitting (Zhao et al., 15 Jan 2026).
1. Problem Context and Decentralized Traffic-Splitting Model
IPPO-DM is designed for the online, distributed solution of a traffic-adaptive multipath routing problem in large-scale, mobile multi-hop UAV networks, where individual UAVs act as intelligent forwarding agents. Each agent must optimally split its outgoing traffic over multiple dynamically changing neighbor links, trading off local buffer status, wireless channel state, and global delivery latency requirements.
This routing problem is formulated as a decentralized partially observable Markov decision process (Dec-POMDP), where:
- Global state: Includes all UAV 3D positions, per-node multi-priority split-queue states, and the ground base station (GBS) location.
- Observation: Each UAV observes only its own queue and distance to GBS, augmented with a limited view of local neighbor states (e.g., link buffers, one-hop neighbor positions).
- Action: Each agent selects a local traffic-splitting vector $\mathbf{a} = (a_0, a_1, \dots, a_K)$ satisfying $a_j \ge 0$ and $\sum_j a_j = 1$, encoding the instantaneous fractions of traffic kept locally and forwarded to each neighbor.
- Transition/Reward: Buffer evolution obeys queue dynamics and mobility-induced topology shifts. The local reward is a weighted mixture of progress toward the GBS (urgency- and deadline-aware) and queue/buffer congestion penalties.
The global objective is to maximize the long-term fraction of packets delivered to their destination within their end-to-end latency deadlines, subject to unpredictable queue arrivals and network topology changes.
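The reward structure described above can be sketched as follows; this is a minimal illustration, and the weight values, the urgency scaling, and the function signature are assumptions rather than the paper's exact formulation:

```python
def local_reward(dist_prev, dist_now, deadline_slack, queue_len, queue_cap,
                 w_progress=1.0, w_congestion=0.5):
    """Hypothetical per-agent reward: a weighted mix of deadline-aware
    progress toward the GBS and a buffer-congestion penalty.
    All weights and the urgency scaling are illustrative assumptions."""
    # Progress term: positive when the packet moved closer to the GBS,
    # scaled up as the remaining deadline slack shrinks (urgency-aware).
    urgency = 1.0 / max(deadline_slack, 1e-3)
    progress = (dist_prev - dist_now) * urgency
    # Congestion penalty: grows with local buffer occupancy.
    congestion = queue_len / queue_cap
    return w_progress * progress - w_congestion * congestion
```

Moving a packet closer to the GBS yields positive reward, with the gain amplified as its deadline approaches; a filling buffer drags the reward down, steering agents away from congested neighbors.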
2. IPPO-DM Algorithmic Framework
2.1 Actor–Critic and Continuous Policy Parameterization
Each UAV executes an independent Proximal Policy Optimization (PPO) agent sharing network weights but using only local observations and local histories. The core actor network processes per-agent observations through:
- A feedforward MLP encoding the node's own state.
- An MLP followed by a GRU encoding neighbor states, providing short-term memory for temporal information.
- A fusion MLP aggregating these features into an unnormalized "tendency" vector.
Rather than a standard softmax (which yields a single deterministic split vector), IPPO-DM parameterizes the traffic split as a Dirichlet distribution. The tendency vector is mapped via sparsemax and linear scaling to a Dirichlet concentration parameter vector $\boldsymbol{\alpha}$, ensuring that every sampled action is a valid probability vector on the simplex.
Action sampling proceeds as $\mathbf{a} \sim \mathrm{Dir}(\boldsymbol{\alpha})$. This Dirichlet modeling allows both stochastic exploration in split ratios and direct control over the expected allocation and its entropy. The expected split for each neighbor is $\mathbb{E}[a_j] = \alpha_j / \alpha_0$, where $\alpha_0 = \sum_j \alpha_j$.
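The sparsemax-to-Dirichlet pipeline can be sketched in a few lines of numpy; the scaling constant `c` and the `0.5` offset below are illustrative assumptions standing in for the paper's hyperparameters:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex
    (Martins & Astudillo, 2016); unlike softmax it can output exact zeros."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv          # coordinates kept in the support
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1) / k_max        # threshold shifting z onto the simplex
    return np.maximum(z - tau, 0.0)

rng = np.random.default_rng(0)
t = rng.normal(size=4)              # unnormalized "tendency" vector from the actor
c = 10.0                            # assumed Dirichlet scaling constant
alpha = c * sparsemax(t) + 0.5      # assumed offset keeping every alpha_j > 0
a = rng.dirichlet(alpha)            # sampled traffic split on the simplex
expected = alpha / alpha.sum()      # E[a_j] = alpha_j / alpha_0
```

Because sparsemax zeroes out weak candidates before scaling, concentration mass (and hence expected traffic) lands only on a sparse subset of neighbors, while the Dirichlet sampling keeps exploration stochastic.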
2.2 PPO Losses and Training Regime
The policy and value networks are trained end-to-end using clipped surrogate PPO losses combined with an entropy regularization term that encourages diverse exploration over possible traffic splits. The actor-critic updates employ Generalized Advantage Estimation (GAE) for variance reduction.
Clipped loss and value updates are given (for agent $i$, time $t$), with probability ratio $r_{i,t}(\theta) = \pi_\theta(\mathbf{a}_{i,t} \mid o_{i,t}) / \pi_{\theta_{\mathrm{old}}}(\mathbf{a}_{i,t} \mid o_{i,t})$:

$L_i^{\mathrm{clip}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \mathrm{clip}\!\left(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_{i,t} \right) \right] + \beta\, \mathcal{H}\!\left[\mathrm{Dir}(\boldsymbol{\alpha}_{i,t})\right]$

where $\hat{A}_{i,t}$ is the GAE advantage estimate, $\mathcal{H}[\mathrm{Dir}(\boldsymbol{\alpha}_{i,t})]$ is the entropy of the Dirichlet distribution, and $\beta$ is the entropy weight.
The critic loss incorporates value clipping to stabilize training.
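The two core training computations, GAE and the clipped surrogate, can be sketched as plain numpy functions; the discount and GAE parameters below are conventional defaults, not values from the paper:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.
    `values` carries one extra bootstrap entry for the final state."""
    adv = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual at step t, then the exponentially weighted backward pass.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv).mean()
```

In IPPO-DM the entropy bonus over the Dirichlet policy would be added to this surrogate, and the same advantage estimates feed the clipped value loss.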
2.3 Execution-Time Resampling and Flow Sparsification
To avoid splitting minute traffic across many potential next hops, IPPO-DM further sparsifies the Dirichlet output at execution time by selecting only the top-$k$ splits, with $k$ determined by the size of the current queue, masking the others to zero, and re-normalizing on the simplex. This regularizes the split distribution, adaptively minimizing packet scatter when traffic is light.
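This execution-time sparsification step is straightforward to sketch; the queue-to-$k$ rule here is a hypothetical stand-in for the paper's step rule:

```python
import numpy as np

def sparsify_split(a, k):
    """Keep only the k largest split fractions, zero the rest,
    and renormalize so the result stays on the probability simplex."""
    out = np.zeros_like(a)
    keep = np.argsort(a)[-k:]          # indices of the k largest entries
    out[keep] = a[keep]
    return out / out.sum()

def k_from_queue(queue_len, step=100, k_max=4):
    """Hypothetical rule: allow one extra split per `step` queued packets."""
    return min(k_max, 1 + queue_len // step)
```

A light queue maps to $k = 1$ (effectively single-path forwarding), while a backlogged queue spreads traffic across more neighbors.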
The overall solution follows centralized training with decentralized execution: policies are learned in a centralized multi-agent reinforcement learning (MARL) setting, while at execution time agents coordinate only via shared model weights, with no online exchange required beyond standard local state and neighbor queries.
3. Simulation Methodology and Protocol Hyperparameters
IPPO-DM is evaluated via detailed simulation in dynamic 3D UAV networks of size drones over a m volume, with Gauss–Markov mobility and realistic wireless physical layer constraints (e.g., $25$ subchannels of $20$ MHz each, SINR constraints, dBm). Each UAV can forward via up to neighbors, with the practical distribution saturating at for performance.
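The Gauss–Markov mobility model used in the simulation can be sketched per velocity axis; the memory parameter, mean velocity, and noise scale below are illustrative values, not the paper's settings:

```python
import numpy as np

def gauss_markov_step(v, alpha=0.75, v_mean=5.0, sigma=1.0, rng=None):
    """One Gauss-Markov velocity update per axis:
    v' = alpha*v + (1-alpha)*v_mean + sigma*sqrt(1-alpha^2)*w,  w ~ N(0, 1).
    alpha=1 gives straight-line motion; alpha=0 gives memoryless random motion.
    All parameter values here are illustrative assumptions."""
    if rng is None:
        rng = np.random.default_rng()
    w = rng.standard_normal(np.shape(v))
    return alpha * np.asarray(v) + (1 - alpha) * v_mean \
        + sigma * np.sqrt(1 - alpha**2) * w
```

The memory parameter interpolates between fully correlated and fully random motion, which is what makes the resulting topology changes smooth but nonstationary.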
Key protocol/training parameters:
| Parameter | Value |
|---|---|
| AdamW learning rate | |
| PPO entropy weight | |
| Dirichlet scaling | |
| Dirichlet | 0.5 |
| PPO clipping | , |
| Queue step for split sparsification | |
The simulation employs stochastic task arrivals, randomized queue priorities, and variable packet deadlines, producing highly nonstationary queueing patterns typical of mobile UAV backhaul.
4. Empirical Performance and Comparative Analysis
Across a range of scenarios and baselines, IPPO-DM yields substantial qualitative and quantitative gains in delivery timeliness and robustness:
- On-time packet delivery ratio:
- Low load ($0.5$–$1.5$ MB/task): (IPPO-DM) vs. (heuristic), (greedy)
- High load ($1.5$–$2.0$ MB/task): (IPPO-DM) vs. (heuristic), (greedy)
- Packet loss ratio:
- Low load: (IPPO-DM) vs. (heuristic), (greedy)
- High load: (IPPO-DM) vs. (heuristic), (greedy)
- Scalability: Performance saturates for neighbors. For to UAVs, delivery ratio rises from to .
- Convergence: Average reward plateaus stably after ~2000 episodes.
These findings demonstrate that IPPO-DM's traffic-aware Dirichlet-modeled stochastic splitting outperforms both static uniform splitting and greedy single-path strategies in terms of both latency and loss—especially under high network load and scale (Zhao et al., 15 Jan 2026).
5. Comparisons, Generalization, and Relations to Other Methods
- Unlike deterministic heuristic splitting, IPPO-DM's continuous-action Dirichlet modeling flexibly allocates splitting probability across any number of candidates, inherently respecting the flow-conservation constraint and supporting both sparse and diffuse splits as network and queue state dictate.
- Compared to overlay multipath methods based on online bandit learning (Zhang, 2020), IPPO-DM incorporates global delivery deadlines, hybrid reward components, and decentralized policy execution.
- In contrast to S-MATE and MATE (Aly et al., 2010), which use adaptive per-path splitting but operate with static rules and scalar update loops, IPPO-DM supports fully nonlinear, context-dependent split policies over mobile, dynamic topologies.
6. Implications, Extensions, and Limitations
IPPO-DM demonstrates the efficacy of deep MARL with continuous stochastic policies for fine-grained, on-the-fly traffic adaptation in highly dynamic, decentralized environments:
- The Dirichlet output parameterization aligns with the probability simplex, naturally modeling feasible traffic splits.
- Flow sparsification at runtime enables practical deployment by limiting scatter without the need for hard-coded thresholds.
- The MARL framework generalizes to other Dec-POMDP-based routing contexts, and could be adapted for SDN or hybrid infrastructure/airborne mesh networks.
Limitations include the necessity of shared model synchronization pre-deployment, the inherent sample demand of deep RL training, and the lack of experimental field trials beyond simulation to date.
7. References
- Multipath Routing for Multi-Hop UAV Networks (Zhao et al., 15 Jan 2026)
- An Online Learning Based Path Selection for Multipath Video Telephony Service in Overlay (Zhang, 2020)
- S-MATE: Secure Coding-based Multipath Adaptive Traffic Engineering (Aly et al., 2010)
- Protection Over Asymmetric Channels, S-MATE: Secure Multipath Adaptive Traffic Engineering (Aly et al., 2010)