
Deep Diffusion Policy Gradient (DDiffPG)

  • Deep Diffusion Policy Gradient (DDiffPG) is a reinforcement learning framework that represents complex, multimodal action distributions using denoising diffusion models.
  • It integrates diffusion-model parameterizations, which generate actions by reversing a fixed noising process, with actor–critic and policy gradient techniques for effective continuous control.
  • DDiffPG enables skill discovery and mode conditioning through unsupervised clustering and intrinsic rewards, yielding superior performance in high-dimensional tasks.

Deep Diffusion Policy Gradient (DDiffPG) refers to a family of reinforcement learning (RL) algorithms that employ denoising diffusion probabilistic models (DDPMs) as expressive policy parameterizations. These approaches leverage the ability of diffusion models to encode highly multimodal action distributions and combine them with actor–critic or policy gradient optimization frameworks, resulting in algorithms capable of discovering, representing, and finely optimizing diverse behaviors in continuous control settings. The defining feature is the direct optimization or training of policies defined implicitly via diffusion generative processes rather than classic parametric distributions, enabling versatility in online and offline RL, skill discovery, and mode conditioning (Li et al., 2024, Ma et al., 1 Feb 2025).

1. Diffusion Policy Parameterization in RL

DDiffPG approaches represent the conditional policy $\pi_\theta(a|s)$ as a DDPM, in which actions $a \in \mathbb{R}^d$ are generated by reversing a fixed noising process. The standard forward (diffusion) process in $T$ steps is

$$q(a_t \mid a_{t-1}) = \mathcal{N}\!\left(a_t;\, \sqrt{1-\beta_t}\,a_{t-1},\, \beta_t I\right), \quad t = 1,\ldots,T,$$

where $\beta_t$ is the diffusion schedule. The reverse process, parameterized by neural networks $\mu_\theta$ and typically a (diagonal) $\Sigma_\theta$, is

$$p_\theta(a_{t-1} \mid a_t, s) = \mathcal{N}\!\left(a_{t-1};\, \mu_\theta(a_t, s, t),\, \Sigma_\theta(a_t, s, t)\right), \quad t = T,\ldots,1.$$

Sampling proceeds by drawing $a_T \sim \mathcal{N}(0, I)$ and iteratively denoising toward $a_0 \sim \pi_\theta(\cdot \mid s)$. The score-based formulation employs a learned score network $s_\theta(a, t \mid s) \approx \nabla_a \log p_t(a \mid s)$, optimized via a denoising score-matching loss

$$L(\theta) = \mathbb{E}_{t,\, a_0,\, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\alpha_t}\, a_0 + \sqrt{1-\alpha_t}\,\epsilon,\, s,\, t\right) \right\|^2 \right],$$

with $\alpha_t = \prod_{i=1}^{t} (1-\beta_i)$. This construction allows the policy to capture highly multimodal, structured action distributions beyond what can be achieved with simple Gaussian policies (Li et al., 2024, Ma et al., 1 Feb 2025).
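To make the sampling loop concrete, here is a minimal sketch of the reverse process in PyTorch, assuming a noise-prediction network `eps_net(a_t, s, t)` corresponding to $\epsilon_\theta$ above; the schedule values, network signature, and dimensions are illustrative, not those of any specific paper.

```python
import torch

def sample_action(eps_net, state, action_dim, T=20):
    """Minimal DDPM-style reverse process for a diffusion policy.

    eps_net(a_t, s, t) is assumed to predict the noise epsilon added
    at step t (the epsilon_theta network in the loss above).
    """
    betas = torch.linspace(1e-4, 2e-2, T)          # noising schedule beta_t
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)      # alpha_t = prod_{i<=t} (1 - beta_i)

    a = torch.randn(state.shape[0], action_dim)    # a_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((state.shape[0],), t, dtype=torch.long)
        eps = eps_net(a, state, t_batch)
        # Posterior mean mu_theta(a_t, s, t), written in terms of the predicted noise.
        mean = (a - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(a) if t > 0 else torch.zeros_like(a)
        a = mean + betas[t].sqrt() * noise         # fixed diagonal Sigma = beta_t I
    return a                                       # a_0 ~ pi_theta(. | s)
```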

2. Policy Gradient and Actor–Critic Training with Diffusion Policies

Classic policy-gradient techniques are nontrivial to apply due to the implicit, multistep nature of diffusion models. DDiffPG algorithms address this challenge via several principled strategies:

  • Target-Action Diffusion Imitation: The "target action" $a^\mathrm{target}$ is constructed by perturbing existing actions toward higher Q-values (e.g., $a^\mathrm{target} \leftarrow a + \eta\, \nabla_a Q_\phi(s, a)$ for small $\eta$). The policy is then trained via a behavioral cloning loss that pushes the diffusion model toward these improved actions (sketched below, after this list):

$$L_\mathrm{Diff}(\theta) = \mathbb{E}_{t,\, (s,\, a^\mathrm{target}),\, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta(a^\mathrm{target}_t, s, t) \right\|^2 \right],$$

with $a^\mathrm{target}_t = \sqrt{\alpha_t}\, a^\mathrm{target} + \sqrt{1-\alpha_t}\, \epsilon$ (Li et al., 2024).

  • Reweighted Score Matching (RSM): RSM generalizes classical denoising score matching by introducing a reweighting function $g(a_t; s)$ into the loss:

$$\mathcal{L}^g(\theta; s, t) = \int g(a_t; s)\, \left\| s_\theta(a_t; s, t) - \nabla_{a_t} \log p_t(a_t \mid s) \right\|^2 da_t,$$

where choices of $g$ enable tractable sampling and optimization in online RL (see Section 4 below) (Ma et al., 1 Feb 2025).

  • DPPO and Policy-Gradient Surrogates: Diffusion Policy Policy Optimization (DPPO) frames the entire diffusion chain as a two-layer MDP, enabling tractable policy gradients via the log-derivative trick at every denoising step and employing a clipped PPO loss and value-function baselines for stable optimization (Ren et al., 2024).
  • Deterministic Variants: D3PG (Deep Diffusion Deterministic Policy Gradient) embeds a diffusion model within the DDPG framework, enabling deterministic or stochastic action selection and compositional actor–critic updates (Liu et al., 2024).

These mechanisms circumvent the vanishing gradient and computational explosion associated with naively backpropagating through the full reverse diffusion process.
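As an illustration of the target-action mechanism from the first bullet, the following minimal sketch perturbs buffer actions along $\nabla_a Q_\phi$ and fits the diffusion policy to the result with the denoising loss $L_\mathrm{Diff}$; the `q_net`/`eps_net` signatures, schedule, and step size are assumptions for the example, not the published hyperparameters.

```python
import torch

def ddiffpg_actor_loss(eps_net, q_net, states, actions, eta=0.05, T=20):
    """Sketch of target-action diffusion imitation.

    Buffer actions are nudged up the Q-landscape, then the diffusion
    policy is fit to the improved actions via the denoising loss.
    """
    betas = torch.linspace(1e-4, 2e-2, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    # a_target <- a + eta * grad_a Q_phi(s, a)
    actions = actions.detach().requires_grad_(True)
    q = q_net(states, actions).sum()
    (grad_a,) = torch.autograd.grad(q, actions)
    a_target = (actions + eta * grad_a).detach()

    # Denoising loss L_Diff(theta) on the perturbed targets.
    t = torch.randint(0, T, (states.shape[0],))
    eps = torch.randn_like(a_target)
    ab = alpha_bars[t].unsqueeze(-1)
    a_t = ab.sqrt() * a_target + (1 - ab).sqrt() * eps   # noised a_t^target
    return ((eps - eps_net(a_t, states, t)) ** 2).mean()
```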

3. Multimodal Behavior Discovery and Skill Conditioning

A key contribution of DDiffPG is the ability to autonomously discover, maintain, and control diverse, temporally extended behaviors (“modes”):

  • Unsupervised Clustering: The set of successful agent trajectories is periodically clustered using hierarchical dynamic time warping (DTW) on their state-sequence distances. Each cluster represents a distinct behavioral mode; unsuccessful trajectories are assigned to the nearest cluster centroid (Li et al., 2024).
  • Intrinsic Motivation: Novelty-based intrinsic rewards, using the prediction error of a fixed random-network-distillation target, drive exploration toward novel states (a minimal sketch follows at the end of this section):

$$r^\mathrm{intr}(s, a, s') = \max\left[\mathrm{novelty}(s') - \alpha\,\mathrm{novelty}(s),\; 0\right].$$

  • Mode-Conditioned Policy: Each mode $c_i$ is assigned an embedding $e_i$; the diffusion network accepts $(a_t, s, e_i)$ as input, enabling both explicit conditioning and controllable mode execution. Masked embeddings at training time ensure the agent samples from the full skill set, while explicit $e_i$ selection at test time enables "replanning," i.e., switching among skills for robust obstacle avoidance (Li et al., 2024).
  • Mode-Specific Q-Learning: Separate Q-networks $Q_{\phi_i}$ for each mode $c_i$ prevent RL's greedy objective from collapsing behaviors into a single dominant solution. This structured critic update, coupled with exploration critics for intrinsic rewards, preserves and improves multimodal capacity.

This framework has demonstrated the capacity to master complex, high-dimensional continuous control tasks with remarkably sparse external rewards.
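For concreteness, here is a minimal sketch of the novelty signal and the intrinsic reward defined above, using a fixed random target network and a trained predictor in the usual random-network-distillation style; the layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def _mlp(state_dim, feat_dim):
    return nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                         nn.Linear(128, feat_dim))

class RNDNovelty(nn.Module):
    """Random-network-distillation novelty for the intrinsic reward above."""

    def __init__(self, state_dim, feat_dim=64):
        super().__init__()
        self.target = _mlp(state_dim, feat_dim)      # fixed random target
        self.predictor = _mlp(state_dim, feat_dim)   # trained to match it
        for p in self.target.parameters():
            p.requires_grad_(False)

    def novelty(self, s):
        # Per-state prediction error; high where the predictor has not trained.
        return ((self.predictor(s) - self.target(s)) ** 2).mean(dim=-1)

def intrinsic_reward(rnd, s, s_next, alpha=0.5):
    # r_intr(s, a, s') = max[novelty(s') - alpha * novelty(s), 0]
    with torch.no_grad():
        return (rnd.novelty(s_next) - alpha * rnd.novelty(s)).clamp(min=0.0)
```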

4. Efficient Online Optimization with Reweighted Score Matching

DDiffPG frameworks address the challenge of efficient online diffusion policy optimization via RSM-based objectives that decouple the need for direct sampling from the (unknown) optimal or target action distribution:

  • Policy Mirror Descent (DPMD): The policy is updated toward an energy-based model of the form $\pi_\mathrm{MD}(a \mid s) \propto \pi_\mathrm{old}(a \mid s)\, \exp(Q(s,a)/\lambda)$, with the RSM loss enabling tractable training using only samples from the previous policy, without requiring on-policy targets.
  • Soft Diffusion Actor-Critic (SDAC): Generalizes to maximum-entropy objectives, yielding Boltzmann policies and corresponding score-matching losses reweighted by $\exp(Q(s, a_0)/\lambda)$ with respect to a proposal distribution (Ma et al., 1 Feb 2025); a sketch of such a reweighted loss appears below.

Computational efficiency is achieved because the per-iteration cost is $\mathcal{O}(T)$, in contrast with $\mathcal{O}(T^2)$ for naïve backpropagation through the full diffusion chain. Empirically, DPMD outperforms off-policy and other diffusion RL baselines on MuJoCo locomotion tasks, achieving over 120% improvement over soft actor-critic on tasks like Humanoid.
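The following is a minimal sketch of an $\exp(Q/\lambda)$-reweighted denoising objective in the spirit of DPMD/SDAC: samples come from the previous policy (the proposal), and the weights are self-normalized per batch for numerical stability. The signatures, normalization, and schedule are illustrative assumptions, not the exact published loss.

```python
import torch

def reweighted_sm_loss(eps_net, q_net, states, actions, lam=1.0, T=20):
    """Exp(Q/lambda)-reweighted denoising loss over proposal samples."""
    betas = torch.linspace(1e-4, 2e-2, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    with torch.no_grad():
        # g ~ exp(Q(s, a_0) / lambda), self-normalized so weights average to 1.
        logits = q_net(states, actions).reshape(-1) / lam
        w = torch.softmax(logits, dim=0) * states.shape[0]

    t = torch.randint(0, T, (states.shape[0],))
    eps = torch.randn_like(actions)
    ab = alpha_bars[t].unsqueeze(-1)
    a_t = ab.sqrt() * actions + (1 - ab).sqrt() * eps
    per_sample = ((eps - eps_net(a_t, states, t)) ** 2).mean(dim=-1)
    return (w * per_sample).mean()
```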

5. Implementation, Scalability, and Experimental Results

Representative DDiffPG instantiations and findings:

| Approach | Policy Param. | Key Features | Empirical Results/Findings |
|---|---|---|---|
| (Li et al., 2024) | DDPM w/ actor–critic | Mode discovery, skill clustering, mode embeddings | Multimodal behaviors, maze replanning |
| (Ma et al., 1 Feb 2025) | DDPM w/ RSM | Online RL, DPMD, SDAC variants | +143% (Humanoid), +127% (Ant) vs. SAC |
| (Ren et al., 2024) (DPPO) | DDPM + PPO | Tractable policy gradient, sim-to-real | 80% real-world assembly, state/pixel control |
| (Davey et al., 23 May 2025) | Continuous-time SDE | Convergence of PPG for control-dependent diffusion | Linear convergence, ODE/neural-net solvers |
| (Liu et al., 2024) (D3PG) | DDPM + DDPG | Wi-Fi optimization | 74.6% throughput gain (dense Wi-Fi) |

Architectures typically use lightweight MLPs or U-Nets (for diffusion), mode embeddings in $\mathbb{R}^5$, and 5–20 denoising steps. Reported hyperparameters include actor learning rates around $3\times10^{-4}$, batch sizes up to 4096, and frequent trajectory reclustering (e.g., every 100 steps) (Li et al., 2024, Liu et al., 2024). A minimal mode-conditioned denoiser in this spirit is sketched below.
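This sketch shows an MLP denoiser taking $(a_t, s, e_i, t)$ as described in Section 3; the layer sizes and the learned timestep embedding are illustrative choices, not the published architecture.

```python
import torch
import torch.nn as nn

class ModeConditionedEpsNet(nn.Module):
    """Lightweight MLP denoiser conditioned on state, mode embedding, and step."""

    def __init__(self, state_dim, action_dim, mode_dim=5, hidden=256, T=20):
        super().__init__()
        self.t_embed = nn.Embedding(T, 32)         # learned timestep embedding
        self.net = nn.Sequential(
            nn.Linear(action_dim + state_dim + mode_dim + 32, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),         # predicted noise epsilon
        )

    def forward(self, a_t, s, e_i, t):
        # Concatenate noisy action, state, mode embedding, and step embedding.
        x = torch.cat([a_t, s, e_i, self.t_embed(t)], dim=-1)
        return self.net(x)
```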

Real-world transfer is demonstrated (e.g., 16/20 successful zero-shot executions on Franka Panda; significant throughput improvements on simulated dense Wi-Fi scenarios). Replanning by switching mode embeddings during policy rollout yields robust adaptation to nonstationary environments (Li et al., 2024, Liu et al., 2024).

6. Theoretical Properties and Convergence Analysis

For settings with control-dependent diffusion (i.e., where the policy influences both drift and diffusion coefficients in the state process), DDiffPG methods grounded in the proximal policy gradient framework provide theoretical guarantees:

  • The Hamiltonian-gradient update requires adjoint backward SDEs (BSDEs), with the Fréchet derivative yielding (a standard form of $H_t$ is recalled after this list):

$$(\nabla J(u))_t = \partial_u H_t\!\left(X^{(u)}_t, u_t, Y_t, Z_t\right),$$

where $(X, Y, Z)$ solve a forward–backward SDE (Davey et al., 23 May 2025).

  • Algorithmic updates utilize proximal maps, yielding linear convergence rates under strong convexity assumptions:

$$\|u^{k+1} - u^*\|_{\mathcal{H}^2} \leq c^k\, \|u^0 - u^*\|_{\mathcal{H}^2},$$

with $c \in (0, 1)$ for appropriate step sizes and hyperparameters.

  • Neural-network parameterizations are shown to approximate the optimal (feedback) control to high accuracy, with empirical runtime scaling sub-linearly in problem dimension for certain ODE-based solvers (Davey et al., 23 May 2025).
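For reference, one standard convention for the Hamiltonian and the adjoint BSDE pair in the stochastic maximum principle is recalled below; the sign conventions and notation may differ from those used by (Davey et al., 23 May 2025).

```latex
% State dynamics dX_t = b_t(X_t, u_t) dt + \sigma_t(X_t, u_t) dW_t,
% running cost f_t and terminal cost g (one common sign convention):
H_t(x, u, y, z) = b_t(x, u)^\top y
                + \operatorname{tr}\!\big(\sigma_t(x, u)^\top z\big)
                + f_t(x, u),
\qquad
\begin{cases}
  \mathrm{d}Y_t = -\,\partial_x H_t(X_t, u_t, Y_t, Z_t)\,\mathrm{d}t + Z_t\,\mathrm{d}W_t, \\
  Y_T = \partial_x g(X_T).
\end{cases}
```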

7. Limitations and Research Directions

Empirical studies recognize several limitations and avenues for future research:

  • Sample efficiency remains a challenge compared to off-policy RL when pre-training is unavailable, and performance can degrade when behavioral modes are missing from the expert data (Ren et al., 2024).
  • Critical tuning of noise- and exploration-related hyperparameters is necessary to balance exploitation and broad skill coverage.
  • Theoretical guarantees weaken in highly non-convex or partially observable settings, and the compositionality of modes remains an open question.
  • Potential directions include ODE-based fast samplers (e.g., DPM-Solver), better exploration beyond batched Q-sampling, and extensions to model-based or partially observable RL (Ma et al., 1 Feb 2025).

Collectively, Deep Diffusion Policy Gradient algorithms combine the expressiveness of diffusion models with the rigor of policy-gradient optimization, providing a unified foundation for learning and controlling diverse, multimodal behaviors in challenging RL environments across simulated and real-world domains.
