Dirichlet D3PG for MEC Offloading
- Dirichlet D3PG is a deep reinforcement learning algorithm that integrates a Dirichlet policy head for simplex-constrained task partitioning with continuous control for MEC offloading.
- It formulates MEC offloading as a Markov decision process with hybrid actions, enabling efficient allocation of computational resources and balancing multi-objective optimization tasks.
- Experimental evaluations demonstrate that D3PG outperforms standard methods with 10–20% more timely task completions, 15% lower energy use, and reduced service latency.
Dirichlet Deep Deterministic Policy Gradient (D3PG) is a deep reinforcement learning algorithm designed for constrained hybrid action spaces, particularly those that arise in dynamic Mobile Edge Computing (MEC) environments. D3PG addresses the challenge of simultaneous task partitioning (a distribution over edge servers) and computational power allocation (continuous values), formulated as a Markov decision process (MDP) with a hybrid, tightly constrained action space. The algorithm extends the conventional Deep Deterministic Policy Gradient (DDPG) framework by introducing a Dirichlet policy head to parameterize simplex-constrained action components, while handling standard real-valued actions separately. The approach directly supports the multi-objective optimization tasks critical to MEC (maximizing task completions before deadlines, minimizing energy expenditure, and reducing service latency) while satisfying sum-to-one constraints and continuous-control requirements (Ale et al., 2021).
1. Markov Decision Process Formulation for MEC Offloading
D3PG operates within a Markov decision process framework adapted to represent the MEC setting with multiple IoT devices and heterogeneous edge servers. The state at each decision epoch $t$, denoted $s_t$, consists of:
- Server status: for each edge server, a vector containing the current queue length, the remaining running time of the head-of-queue task, and the available CPU frequency.
- Uplink rates: a matrix specifying the wireless transmission rate from each user to each server.
- Task arrivals: the set of newly arrived computation tasks, where each task's entry encodes its data size, required CPU cycles, and deadline.
The hybrid action at each epoch, denoted $a_t$, has two components:
- Task partition vector: the fraction of the current task to offload to each server, subject to every entry being non-negative and all entries summing to one.
- CPU frequency allocations: one value per server, each expressed as a normalized proportion of that server’s CPU frequency capacity.
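The constraints on the two action components can be made concrete with a small validity check. This is a minimal sketch, not code from the original work: the function name `is_valid_hybrid_action`, the tolerance, and the assumption that normalized frequency proportions lie in [0, 1] are all illustrative.

```python
import math


def is_valid_hybrid_action(partition, frequency, tol=1e-9):
    """Check a hybrid action: partition on the simplex, frequencies in [0, 1].

    Assumption (not stated in the source): a "normalized proportion" is a
    value in the closed interval [0, 1].
    """
    if len(partition) != len(frequency):
        return False
    # Partition entries must be non-negative and sum to one.
    on_simplex = all(p >= 0.0 for p in partition) and math.isclose(
        sum(partition), 1.0, abs_tol=tol
    )
    # Each frequency entry is a proportion of one server's CPU capacity.
    in_range = all(0.0 <= f <= 1.0 for f in frequency)
    return on_simplex and in_range


# Three servers: half of the task goes to server 0, the rest is split.
ok = is_valid_hybrid_action([0.5, 0.3, 0.2], [1.0, 0.4, 0.0])
```

A check like this is mainly useful as an environment-side assertion; with a Dirichlet policy head the partition constraint should hold by construction.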
The transition dynamics $P(s_{t+1} \mid s_t, a_t)$ arise from the stochastic evolution of server queues, wireless channels, and task arrivals.
The scalar reward at time $t$ is a composite of multiple objectives. It combines an indicator of task completion within the deadline, the aggregate energy cost (transmission plus computation), and the incurred latency, together with normalization scalars, a small regularization constant, and a weight that sets the success-versus-cost tradeoff; the exact expression is given in Ale et al. (2021).
2. Actor–Critic Network Architecture
The D3PG agent employs two neural network modules: an actor and a critic. Both utilize deep multilayer perceptron architectures.
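- Actor Network: The input dimension equals the state dimension. Three hidden layers of 256, 512, and 256 units with ReLU activations culminate in two output heads:
  - Dirichlet head: Outputs logits that are converted into concentration parameters by exponentiation plus a positive offset, so every concentration parameter is strictly positive.
  - Continuous head: Outputs real values that are mapped through a bounded activation (e.g., a scaled tanh), with Ornstein-Uhlenbeck noise applied for exploration.
  - At run time, the task partition vector is sampled from the Dirichlet distribution defined by these concentration parameters, and the CPU frequency allocation is the bounded continuous-head output plus exploration noise.
- Critic Network: The input dimension equals the state dimension plus the action dimension. Three hidden layers mirror the actor (256–512–256, with ReLU). The output is a scalar Q-value. A target critic of identical architecture is maintained for stability.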
This division allows independent learning of the simplex-constrained distributional action and the regular continuous action.
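The post-processing of the two actor heads can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the offset value `1e-3`, the particular scaled tanh that maps into [0, 1], the Ornstein-Uhlenbeck parameters (`theta`, `sigma`, `dt`), and the final clipping step are all hypothetical choices, and the network that produces the raw head outputs is omitted.

```python
import math
import random


def dirichlet_concentrations(logits, offset=1e-3):
    """Exponentiate logits and add a positive offset (offset value assumed)."""
    return [math.exp(z) + offset for z in logits]


def bounded_frequencies(raw_values):
    """Scaled tanh mapping real values into [0, 1] (one possible scaling)."""
    return [0.5 * (math.tanh(u) + 1.0) for u in raw_values]


class OrnsteinUhlenbeckNoise:
    """Discretized zero-mean OU process; parameter defaults are assumptions."""

    def __init__(self, size, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.state = [0.0] * size
        self.rng = random.Random(seed)

    def sample(self):
        # x <- x - theta * x * dt + sigma * sqrt(dt) * N(0, 1)
        self.state = [
            x - self.theta * x * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def noisy_frequencies(raw_values, noise):
    """Add exploration noise, then clip back into [0, 1] (clipping assumed)."""
    freqs = bounded_frequencies(raw_values)
    return [min(1.0, max(0.0, f + n)) for f, n in zip(freqs, noise.sample())]


alpha = dirichlet_concentrations([-2.0, 0.0, 3.0])
freq = noisy_frequencies([0.3, -1.0, 2.0], OrnsteinUhlenbeckNoise(3, seed=0))
```

Keeping the two transformations separate mirrors the division of the heads: the concentration parameters feed a Dirichlet sampler, while the frequency values stay in the deterministic-plus-noise style of DDPG.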
3. Dirichlet Policy Head and Simplex-Constrained Actions
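The Dirichlet policy head is central to D3PG’s ability to model simplex-constrained decisions. For each action, a vector $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K)$ of positive concentration parameters is generated, with one entry per edge server; then the partition vector is drawn as $\mathbf{x} \sim \mathrm{Dir}(\boldsymbol{\alpha})$. The standard probability density is
$$p(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \qquad B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)},$$
with $B(\boldsymbol{\alpha})$ the multivariate Beta function. The exponentiation and the positive offset applied to the logits ensure each $\alpha_i > 0$.
Sampling follows the Gamma-reparameterization: for $i = 1, \dots, K$, sample $g_i \sim \mathrm{Gamma}(\alpha_i, 1)$ and set $x_i = g_i / \sum_{j=1}^{K} g_j$. This allows differentiability for policy gradient updates via the score-function estimator or explicit reparameterization.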
This approach ensures all partitioning actions satisfy both non-negativity and sum-to-one constraints at every timestep, a property not guaranteed by unconstrained parameterizations or direct softmax transformation followed by Gaussian noise.
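The Gamma-based sampling step can be illustrated with the Python standard library. This is a minimal sketch of the sampling only: the function name `sample_dirichlet` is illustrative, and no gradients are computed here (a differentiable version would need an automatic differentiation framework).

```python
import random


def sample_dirichlet(alpha, rng=None):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    rng = rng or random
    if any(a <= 0.0 for a in alpha):
        raise ValueError("all concentration parameters must be positive")
    # g_i ~ Gamma(shape=alpha_i, scale=1)
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    # Normalizing makes the entries non-negative and sum to one.
    return [g / total for g in gammas]


rng = random.Random(0)
partition = sample_dirichlet([2.0, 1.5, 3.0], rng)
```

Because every Gamma draw is positive and the result is divided by the total, the sample lies on the simplex up to floating-point rounding, with no projection or renormalization step afterwards.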
4. Training Procedure and Loss Functions
D3PG training follows an off-policy actor-critic routine utilizing deep experience replay. For each time step:
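- The actor produces Dirichlet logits and frequency values from the current state.
- Concentration parameters are computed from the logits (exponentiation plus a positive offset), and the task partition vector is then sampled from the resulting Dirichlet distribution.
- The complete action (partition vector and frequency allocation) is executed in the environment, and the resulting reward and next state are observed.
- Transitions are stored in a replay buffer.
- For each update cycle, minibatches are drawn from the replay buffer.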
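The storage and minibatch steps above can be sketched with a fixed-capacity buffer. This is a generic experience-replay sketch, not the authors' code: the class name `ReplayBuffer`, the `done` flag, and the uniform sampling scheme are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity transition store with uniform minibatch sampling."""

    def __init__(self, capacity, seed=None):
        # A bounded deque discards the oldest transition once it is full.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done=False):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    # Each action is a (partition vector, frequency allocation) pair.
    buffer.add(step, ([0.5, 0.5], [1.0, 0.2]), 0.0, step + 1)
batch = buffer.sample(2)
```

Storing the action as a pair keeps the simplex-constrained and continuous components together, so the critic can later be evaluated on exactly the action that was executed.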
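Loss terms are:
- Critic (Bellman error):
$$L_{\text{critic}} = \mathbb{E}\left[\left(Q(s_t, a_t) - y_t\right)^2\right]$$
where $y_t = r_t + \gamma\, Q'(s_{t+1}, a_{t+1})$, with $Q'$ the target critic, $\gamma$ the discount factor, and $a_{t+1}$ the action the target policy selects in $s_{t+1}$.
- Actor (deterministic policy gradient):
$$L_{\text{actor}} = -\mathbb{E}\left[Q(s_t, a_t)\right]$$
with actions sampled as $a_t = (\mathbf{x}_t, \mathbf{f}_t)$, where $\mathbf{x}_t \sim \mathrm{Dir}(\boldsymbol{\alpha}(s_t))$ is the partition vector and $\mathbf{f}_t$ is the continuous-head frequency output.
Target networks are softly updated at each step, e.g., $\theta' \leftarrow \tau \theta + (1 - \tau)\,\theta'$ for small $\tau$.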
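The arithmetic behind these updates can be sketched on plain Python lists. This assumes the standard DDPG forms of the Bellman target, the mean-squared critic loss, the negated-Q actor loss, and Polyak averaging; the function names and the defaults `gamma=0.99` and `tau=0.005` are illustrative and not taken from the source.

```python
def td_target(reward, next_q, gamma=0.99, done=False):
    """Bellman target: r + gamma * Q'(s', a'), with no bootstrap at episode end."""
    return reward + (0.0 if done else gamma * next_q)


def critic_loss(q_values, targets):
    """Mean squared Bellman error over a minibatch."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


def actor_loss(q_values):
    """Negated mean Q-value, so minimizing the loss maximizes Q."""
    return -sum(q_values) / len(q_values)


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    return [
        tau * w + (1.0 - tau) * w_t
        for w, w_t in zip(online_params, target_params)
    ]


targets = [td_target(1.0, 2.0, gamma=0.5), td_target(0.0, 4.0, gamma=0.5)]
loss = critic_loss([1.0, 3.0], targets)
new_target = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1)
```

In a full implementation these quantities would be computed on tensors so that gradients flow to the critic and, through the sampled action, to the actor; the list-based version only shows the scalar bookkeeping.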
The main algorithm components are summarized below:
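| Algorithm Component | Operation/Role | Details |
|---|---|---|
| Actor Forward | State → Dirichlet logits + frequency | Logits from the Dirichlet head; frequency values from the bounded continuous head |
| Action Sampling | Partition $\sim$ Dirichlet($\boldsymbol{\alpha}$); frequency output plus Ornstein-Uhlenbeck noise | Gamma reparameterization + noise |
| Critic Forward | (State, action) to Q-value | Used for Bellman error/actor gradient |
| Target Networks | Gradual update | $\theta' \leftarrow \tau \theta + (1 - \tau)\,\theta'$ |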
5. Comparison with Existing Methods and Key Algorithmic Innovations
Standard DDPG algorithms are designed for unconstrained, real-valued action spaces. In the MEC task offloading problem, a sub-action (task-slice allocation) must satisfy strict simplex constraints, which standard DDPG cannot natively enforce. D3PG’s Dirichlet parameterization:
- Guarantees simplex-constrained actions without requiring post-processing.
- Provides intrinsic stochasticity for exploration, supplanting ad hoc $\epsilon$-greedy schemes.
- Removes reliance on softmax transformations followed by Gaussian noise, which can yield suboptimal or locally-trapped solutions.
Ablation studies underline that Dirichlet-based policy heads outperform variants using naïve softmax-plus-noise or those neglecting action constraints (e.g., treating all actions as unconstrained continuous or purely discrete).
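The constraint issue with the softmax-plus-noise baseline can be seen in a few lines. This sketch assumes the noise is added after the softmax, which is one reading of that baseline; the helper names, the noise scale `sigma=0.1`, and the concentration values are illustrative.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def on_simplex(vec, tol=1e-9):
    """True if all entries are non-negative and they sum to one."""
    return all(v >= 0.0 for v in vec) and abs(sum(vec) - 1.0) <= tol


def softmax_then_noise(logits, sigma, rng):
    """Baseline: perturb softmax outputs with independent Gaussian noise."""
    return [p + rng.gauss(0.0, sigma) for p in softmax(logits)]


def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)
noisy = softmax_then_noise([1.0, 2.0, 3.0], sigma=0.1, rng=rng)
dirichlet = sample_dirichlet([2.0, 2.0, 2.0], rng)
print(on_simplex(noisy), on_simplex(dirichlet))
```

The perturbed softmax output generally no longer sums to one and can go negative, so it needs clipping and renormalization before use, whereas the Dirichlet sample is already a valid partition.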
6. Experimental Evaluation and Performance
Simulation experiments were conducted on MEC settings with up to 1,000 IoT users and 50 edge servers, encompassing diverse hardware profiles and a range of task sizes. The D3PG agent was compared against DDPG, DDPG with softmax-partitioning, Twin Delayed DDPG (TD3), and a greedy offloading heuristic. Key configurations included five-layer neural architectures (input–256–512–256–output) and a batch size of 256; the learning rate and remaining hyperparameters are reported in Ale et al. (2021).
Principal performance results:
- D3PG converged to the highest cumulative reward within approximately 1,500 episodes.
- Relative to baselines:
- 10–20% more tasks completed before deadlines.
- Approximately 15% lower energy use per completed task.
- Lower average task latency.
- Improved episode-length stability (servers are less likely to become overloaded).
Ablation analyses indicate that Dirichlet-constrained partitioning outperforms approaches that either do not enforce simplex constraints or apply Gaussian noise to softmax outputs.
7. Applicability and Generalization
Although devised for joint task partitioning and computation offloading in MEC with hybrid action types, D3PG provides a general framework for reinforcement learning tasks requiring distribution-valued (simplex-constrained) and real-valued actions. Its architectural division—Dirichlet head for distributional actions and conventional output noise for continuous control—supports principled multi-objective optimization, robust exploration, and compliance with domain-specific constraints in reinforcement learning, both within and beyond MEC environments (Ale et al., 2021).