
Dirichlet D3PG for MEC Offloading

Updated 26 May 2026
  • Dirichlet D3PG is a deep reinforcement learning algorithm that integrates a Dirichlet policy head for simplex-constrained task partitioning with continuous control for MEC offloading.
  • It formulates MEC offloading as a Markov decision process with hybrid actions, enabling efficient allocation of computational resources and balancing multi-objective optimization tasks.
  • Experimental evaluations demonstrate that D3PG outperforms standard methods with 10–20% more timely task completions, 15% lower energy use, and reduced service latency.

Dirichlet Deep Deterministic Policy Gradient (D3PG) is a deep reinforcement learning algorithm designed for constrained hybrid action spaces, such as those that arise in dynamic Mobile Edge Computing (MEC) environments. D3PG addresses the challenge of simultaneous task partitioning (a distribution over edge servers) and computational power allocation (continuous values), formulated as a Markov decision process (MDP) with a hybrid, tightly constrained action space. The algorithm extends the conventional Deep Deterministic Policy Gradient (DDPG) framework by introducing a Dirichlet policy head to parameterize simplex-constrained action components, while separately handling standard real-valued actions. The approach directly supports the multi-objective optimization tasks critical to MEC (maximizing task completions before deadlines, minimizing energy expenditure, and reducing service latency) while handling requirements such as sum-to-one constraints and continuous control (Ale et al., 2021).

1. Markov Decision Process Formulation for MEC Offloading

D3PG operates within a Markov decision process framework adapted to represent the MEC setting with multiple IoT devices and heterogeneous edge servers. The state at each decision epoch, $s_t = (M, \zeta, \Omega)$, consists of:

  • $M = (m_1, \ldots, m_K)$: For each of $K$ edge servers, this vector contains the current queue length, remaining running time of the head-of-queue task, and available CPU frequency.
  • $\zeta = \{\zeta_{i,j}\}$: The uplink rate matrix, specifying the wireless transmission rate from each user $i$ to each server $j$.
  • $\Omega = \{\Omega_i\}$: The set of newly arrived computation tasks, where $\Omega_i = (D_i, C_i, \Delta_i^{max})$ encodes the task data size, required CPU cycles, and deadline.
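The three state components can be pictured as a container that is flattened into one input vector for the networks. The following is a minimal sketch only: the class name `MecState`, the field names, the concatenation order, and the assumption of exactly one new task per user are illustrative choices, not details given in the source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MecState:
    """Hypothetical container for s_t = (M, zeta, Omega)."""

    # M: one (queue_length, head_task_remaining_time, available_cpu_freq) triple per server.
    servers: List[Tuple[float, float, float]]
    # zeta: uplink[i][j] is the transmission rate from user i to server j.
    uplink: List[List[float]]
    # Omega: one (data_size, cpu_cycles, deadline) triple per newly arrived task.
    tasks: List[Tuple[float, float, float]]

    def flatten(self) -> List[float]:
        """Concatenate M, zeta, and Omega into a single feature vector."""
        vec: List[float] = []
        for triple in self.servers:
            vec.extend(triple)
        for row in self.uplink:
            vec.extend(row)
        for triple in self.tasks:
            vec.extend(triple)
        return vec


# Toy example: K = 2 servers, 3 users, one new task per user (an assumption).
state = MecState(
    servers=[(4.0, 0.2, 2.5e9), (1.0, 0.0, 3.0e9)],
    uplink=[[1e6, 2e6], [3e6, 1e6], [2e6, 2e6]],
    tasks=[(5e5, 1e9, 0.5), (2e5, 4e8, 0.3), (8e5, 2e9, 1.0)],
)
features = state.flatten()
print(len(features))  # 3*K + users*K + 3*tasks = 6 + 6 + 9 = 21
```

In practice the raw quantities span many orders of magnitude, so some form of feature scaling would normally precede the network input; that step is omitted here.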

The hybrid action at each epoch is $a_t = (\Phi_t, F_t)$, where:

  • $\Phi_t = (\phi_1, \ldots, \phi_K) \sim \mathrm{Dirichlet}(\psi)$: The task partition vector, specifying the fraction of the current task to offload to each server; subject to $\phi_j \ge 0$ and $\sum_{j=1}^{K} \phi_j = 1$.
  • $F_t = (f_1, \ldots, f_K)$: The CPU frequency allocations per server, where $f_j \in [0, 1]$ is the normalized proportion of each server’s maximum CPU frequency.
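The constraints on the two action components can be made concrete with a small validity check. This is a sketch under stated assumptions: the function name `is_valid_hybrid_action`, the numerical tolerance, and the reading of "normalized proportion" as a value in [0, 1] are illustrative.

```python
import math
from typing import Sequence


def is_valid_hybrid_action(
    phi: Sequence[float], freq: Sequence[float], tol: float = 1e-6
) -> bool:
    """Check that phi lies on the probability simplex and freq lies in [0, 1]^K."""
    if len(phi) != len(freq):
        return False
    # Simplex constraint: non-negative entries that sum to one.
    on_simplex = all(p >= 0.0 for p in phi) and math.isclose(
        sum(phi), 1.0, abs_tol=tol
    )
    # Normalized frequency allocations (the [0, 1] range is an assumption).
    in_range = all(0.0 <= f <= 1.0 for f in freq)
    return on_simplex and in_range


print(is_valid_hybrid_action([0.5, 0.3, 0.2], [0.1, 1.0, 0.0]))
print(is_valid_hybrid_action([0.6, 0.6, -0.2], [0.1, 1.0, 0.0]))
```

The first call describes a feasible split of one task across three servers; the second sums to one but contains a negative share, so it is rejected.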

The transition dynamics $P(s_{t+1} \mid s_t, a_t)$ arise from the stochastic evolution of server queues, wireless channels, and task arrivals.

The scalar reward at time $t$ is a composite of multiple objectives. It combines an indicator of task completion within the deadline, the aggregate energy cost (transmission plus computation), and the incurred latency, with normalization scalars applied to the cost terms; a small regularization constant and a weight that sets the success-versus-cost tradeoff complete the expression.

2. Actor–Critic Network Architecture

The D3PG agent employs two neural network modules: an actor and a critic. Both utilize deep multilayer perceptron architectures.

  • Actor Network, $\mu(s \mid \theta^\mu)$: Input dimension is the dimension of the state vector $s_t$. Three hidden layers of 256, 512, and 256 units with ReLU activations culminate in two output heads:
    • Dirichlet head: Outputs concentration parameter logits $z = (z_1, \ldots, z_K)$, with $\psi_j = \exp(z_j) + \epsilon$ ($\epsilon > 0$).
    • Continuous head: Outputs real values $f_1, \ldots, f_K$, mapped via a bounded activation (e.g., to $[0, 1]$ using scaled tanh), with Ornstein-Uhlenbeck noise $\mathcal{N}_t$ applied for exploration.
    • At run time, $\Phi_t$ is sampled from $\mathrm{Dirichlet}(\psi)$, and $F_t$ is the continuous-head output plus $\mathcal{N}_t$.
  • Critic Network, $Q(s, a \mid \theta^Q)$: Input dimension is the state dimension plus the action dimension. Three hidden layers mirror the actor (256–512–256, with ReLU). The output is a scalar Q-value. A target critic of identical architecture, $Q'(s, a \mid \theta^{Q'})$, is maintained for stability.

This division allows independent learning of the simplex-constrained distributional action and the regular continuous action.
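The post-processing applied to the two actor heads can be sketched without a deep learning framework. Everything below is illustrative rather than the authors' implementation: the constants `EPS` and `LOGIT_CLIP`, the noise parameters `theta`, `sigma`, and `dt`, and the final clipping of the noisy frequencies are assumptions.

```python
import math
import random
from typing import List, Optional, Sequence

EPS = 1e-3         # hypothetical offset keeping concentrations strictly positive
LOGIT_CLIP = 20.0  # hypothetical clip that avoids overflow in exp


def concentrations(logits: Sequence[float], eps: float = EPS) -> List[float]:
    """Dirichlet head: exponentiate each logit and add a positive offset."""
    return [math.exp(max(-LOGIT_CLIP, min(LOGIT_CLIP, z))) + eps for z in logits]


def bounded_frequency(raw: Sequence[float]) -> List[float]:
    """Continuous head: scaled tanh mapping real values into [0, 1]."""
    return [0.5 * (math.tanh(x) + 1.0) for x in raw]


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process that reverts toward zero."""

    def __init__(self, size: int, theta: float = 0.15, sigma: float = 0.2,
                 dt: float = 1.0, seed: Optional[int] = None) -> None:
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = [0.0] * size

    def sample(self) -> List[float]:
        # x <- x - theta * x * dt + sigma * sqrt(dt) * N(0, 1)
        self.x = [
            xi - self.theta * xi * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for xi in self.x
        ]
        return list(self.x)


def noisy_frequency(raw: Sequence[float], noise: OUNoise) -> List[float]:
    """Add exploration noise, then clip back into [0, 1] (clipping is an assumption)."""
    base = bounded_frequency(raw)
    return [min(1.0, max(0.0, f + n)) for f, n in zip(base, noise.sample())]


noise = OUNoise(size=3, seed=0)
print(concentrations([-1.0, 0.0, 1.0]))
print(noisy_frequency([0.0, 5.0, -5.0], noise))
```

Because the OU process carries state between calls, successive noise samples are correlated in time, which is the usual reason it is preferred over independent Gaussian noise for continuous control.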

3. Dirichlet Policy Head and Simplex-Constrained Actions

The Dirichlet policy head is central to D3PG’s ability to model simplex-constrained decisions. For each action, a vector $\psi = (\psi_1, \ldots, \psi_K)$ of positive concentration parameters is generated; then, $\Phi_t \sim \mathrm{Dirichlet}(\psi)$. The standard probability density is

$$p(\Phi \mid \psi) = \frac{1}{B(\psi)} \prod_{j=1}^{K} \phi_j^{\psi_j - 1}$$

with $B(\psi) = \prod_{j=1}^{K} \Gamma(\psi_j) \big/ \Gamma\big(\sum_{j=1}^{K} \psi_j\big)$ the multivariate Beta function. The exponentiation and $\epsilon$-offset ensure each $\psi_j > 0$.

Sampling follows the Gamma-reparameterization: for $j = 1, \ldots, K$, sample $g_j \sim \mathrm{Gamma}(\psi_j, 1)$ and set $\phi_j = g_j / \sum_{k=1}^{K} g_k$. This allows differentiability for policy gradient updates via the score-function estimator or explicit reparameterization.

This approach ensures all partitioning actions satisfy both non-negativity and sum-to-one constraints at every timestep, a property not guaranteed by unconstrained parameterizations or direct softmax transformation followed by Gaussian noise.
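The Gamma construction for drawing a Dirichlet sample can be sketched with the Python standard library. The function name `sample_dirichlet` and the uniform fallback for numerical underflow are assumptions made for illustration; the sketch covers only the sampling step, since gradient propagation through the sample would need an automatic-differentiation framework.

```python
import random
from typing import List, Sequence


def sample_dirichlet(psi: Sequence[float], rng: random.Random) -> List[float]:
    """Draw g_j ~ Gamma(psi_j, 1) and normalize so the result lies on the simplex."""
    if any(p <= 0.0 for p in psi):
        raise ValueError("concentration parameters must be positive")
    gammas = [rng.gammavariate(p, 1.0) for p in psi]
    total = sum(gammas)
    if total == 0.0:
        # All draws underflowed (possible for tiny concentrations);
        # falling back to the uniform vector is an assumption of this sketch.
        return [1.0 / len(psi)] * len(psi)
    return [g / total for g in gammas]


rng = random.Random(0)
phi = sample_dirichlet([0.5, 2.0, 1.0], rng)
print(phi, sum(phi))
```

Every draw is non-negative and sums to one up to floating-point error, so no projection or renormalization of the action is needed afterwards.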

4. Training Procedure and Loss Functions

D3PG training follows an off-policy actor-critic routine utilizing deep experience replay. For each time step:

  1. The actor produces logits $z_t$ and frequency values $F_t$ from $s_t$.
  2. Concentration parameters are computed as $\psi = \exp(z_t) + \epsilon$, then $\Phi_t \sim \mathrm{Dirichlet}(\psi)$.
  3. The complete action $a_t = (\Phi_t, F_t)$ is executed in the environment, and the outcome $(r_t, s_{t+1})$ sampled.
  4. Transitions are stored in replay buffer $\mathcal{D}$.
  5. For each update cycle, minibatches are drawn from $\mathcal{D}$.
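The storage and minibatch steps above can be sketched with a fixed-capacity buffer. The class name `ReplayBuffer`, the capacity, and uniform sampling without replacement are assumptions; the source does not specify the buffer's size or sampling scheme.

```python
import random
from collections import deque
from typing import Deque, List, Optional, Tuple

# (s_t, a_t, r_t, s_{t+1}) with states and actions stored as flat lists.
Transition = Tuple[list, list, float, list]


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform minibatch sampling."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        # deque(maxlen=...) silently discards the oldest item once full.
        self.buffer: Deque[Transition] = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state: list, action: list, reward: float, next_state: list) -> None:
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement within one minibatch.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.push([float(t)], [0.5, 0.5], float(t), [float(t + 1)])
batch = buf.sample(2)
print(len(buf), batch)
```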

Loss terms are:

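  • Critic (Bellman error):

$$L(\theta^Q) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \left[ \left( y_t - Q(s_t, a_t \mid \theta^Q) \right)^2 \right]$$

where $y_t = r_t + \gamma \, Q'(s_{t+1}, a_{t+1} \mid \theta^{Q'})$, with discount factor $\gamma$ and $a_{t+1}$ the action produced by the target actor in $s_{t+1}$.

  • Actor (deterministic policy gradient):

$$L(\theta^\mu) = -\mathbb{E}_{s_t \sim \mathcal{D}} \left[ Q(s_t, a_t \mid \theta^Q) \right]$$

with actions sampled as $a_t = (\Phi_t, F_t)$, $\Phi_t \sim \mathrm{Dirichlet}(\psi(s_t))$.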
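The two loss terms reduce to simple arithmetic once the Q-values are available. The sketch below assumes the standard DDPG forms (a squared Bellman error for the critic and the negated mean Q-value for the actor) and operates on plain lists of already-computed values; the function names are hypothetical, and gradient computation is left to whatever framework hosts the networks.

```python
from typing import List, Sequence


def bellman_targets(
    rewards: Sequence[float], next_q: Sequence[float], gamma: float
) -> List[float]:
    """y = r + gamma * Q'(s', a'), with next_q supplied by the target critic."""
    return [r + gamma * q for r, q in zip(rewards, next_q)]


def critic_loss(q_pred: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared Bellman error over a minibatch."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_pred)) / len(targets)


def actor_loss(q_of_policy_actions: Sequence[float]) -> float:
    """Negated mean Q-value of the actor's own actions (minimized by gradient descent)."""
    return -sum(q_of_policy_actions) / len(q_of_policy_actions)


targets = bellman_targets(rewards=[1.0, 0.0], next_q=[2.0, 3.0], gamma=0.9)
print(targets)                           # approximately [2.8, 2.7]
print(critic_loss([2.0, 2.0], targets))  # approximately 0.565
print(actor_loss([1.0, 3.0]))            # -2.0
```

For the worked numbers: the targets are 1.0 + 0.9·2.0 = 2.8 and 0.0 + 0.9·3.0 = 2.7, giving squared errors 0.64 and 0.49 against predictions of 2.0, and hence a mean of 0.565.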

Target networks are softly updated at each step, e.g., $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$ for small $\tau$.
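A soft (Polyak) update of this kind can be sketched on parameters held as named lists of floats. The dictionary layout and the function name `soft_update` are assumptions for illustration; a real implementation would iterate over framework tensors instead.

```python
from typing import Dict, List


def soft_update(
    target: Dict[str, List[float]], online: Dict[str, List[float]], tau: float
) -> None:
    """In place: target <- tau * online + (1 - tau) * target, parameter by parameter."""
    for name, weights in online.items():
        target[name] = [
            tau * w + (1.0 - tau) * t for w, t in zip(weights, target[name])
        ]


target_params = {"w": [0.0, 1.0]}
online_params = {"w": [1.0, 1.0]}
soft_update(target_params, online_params, tau=0.1)
print(target_params)  # the first weight moves 10% of the way from 0.0 toward 1.0
```

With a small `tau` the target network trails the online network slowly, which keeps the Bellman targets from shifting abruptly between updates.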

The main algorithm components are summarized below:

Algorithm Component | Operation/Role | Details
--- | --- | ---
Actor Forward | State $s_t$ → Dirichlet logits + frequency | $z_t$, $F_t$
Action Sampling | $\Phi_t \sim$ Dirichlet($\psi$), $F_t$ plus noise | Gamma reparameterization + noise
Critic Forward | $(s_t, a_t)$ to Q-value | Used for Bellman error/actor gradient
Target Networks | Gradual update | $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$

5. Comparison with Existing Methods and Key Algorithmic Innovations

Standard DDPG algorithms are designed for unconstrained, real-valued action spaces. In the MEC task offloading problem, a sub-action (task-slice allocation) must satisfy strict simplex constraints, which standard DDPG cannot natively enforce. D3PG’s Dirichlet parameterization:

  • Guarantees simplex-constrained actions without requiring post-processing.
  • Provides intrinsic stochasticity for exploration, supplanting ad hoc $\epsilon$-greedy schemes.
  • Removes reliance on softmax transformations followed by Gaussian noise, which can yield suboptimal or locally-trapped solutions.

Ablation studies underline that Dirichlet-based policy heads outperform variants using naïve softmax-plus-noise or those neglecting action constraints (e.g., treating all actions as unconstrained continuous or purely discrete).
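The constraint issue with the softmax-plus-noise baseline can be illustrated directly: adding Gaussian noise after a softmax generally moves the vector off the simplex, whereas a Dirichlet draw stays on it. The helper names, the noise scale `sigma`, and the logit values below are illustrative assumptions and are not taken from the experiments.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_plus_noise(
    logits: Sequence[float], sigma: float, rng: random.Random
) -> List[float]:
    """Baseline parameterization: Gaussian noise added after the softmax."""
    return [p + rng.gauss(0.0, sigma) for p in softmax(logits)]


def dirichlet_sample(psi: Sequence[float], rng: random.Random) -> List[float]:
    """Normalized Gamma draws; assumes all psi_j > 0."""
    gammas = [rng.gammavariate(p, 1.0) for p in psi]
    total = sum(gammas)
    return [g / total for g in gammas]


def on_simplex(v: Sequence[float], tol: float = 1e-9) -> bool:
    return all(x >= 0.0 for x in v) and abs(sum(v) - 1.0) <= tol


rng = random.Random(0)
logits = [2.0, 0.0, -2.0]
noisy = softmax_plus_noise(logits, sigma=0.1, rng=rng)
psi = [math.exp(z) + 1e-3 for z in logits]
phi = dirichlet_sample(psi, rng)
print(noisy, on_simplex(noisy))
print(phi, on_simplex(phi))
```

Repairing the noisy vector requires clipping and renormalizing, which distorts the exploration distribution; the Dirichlet head avoids that extra step.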

6. Experimental Evaluation and Performance

Simulation experiments were conducted on MEC settings with up to 1,000 IoT users and 50 edge servers, encompassing diverse hardware profiles and a range of task sizes. The D3PG agent was compared against DDPG, DDPG with softmax-partitioning, Twin Delayed DDPG (TD3), and a greedy offloading heuristic. Key configurations included five-layer neural architectures (input–256–512–256–output) and a batch size of 256, alongside the learning rate and remaining training hyperparameters.

Principal performance results:

  • D3PG converged to the highest cumulative reward within approximately 1,500 episodes.
  • Relative to baselines:
    • 10–20% more tasks completed before deadlines.
    • Approximately 15% lower energy use per completed task.
    • Lower average task latency.
    • Improved episode-length stability (servers are less likely to become overloaded).

Ablation analyses demonstrate the efficacy of Dirichlet-constrained partitioning, which outperformed approaches that either do not enforce simplex constraints or rely on a softmax followed by Gaussian noise.

7. Applicability and Generalization

Although devised for joint task partitioning and computation offloading in MEC with hybrid action types, D3PG provides a general framework for reinforcement learning tasks requiring distribution-valued (simplex-constrained) and real-valued actions. Its architectural division—Dirichlet head for distributional actions and conventional output noise for continuous control—supports principled multi-objective optimization, robust exploration, and compliance with domain-specific constraints in reinforcement learning, both within and beyond MEC environments (Ale et al., 2021).
