Diffusion Policy (DDPMs)
A diffusion policy is a policy representation for reinforcement learning and robot imitation learning based on conditional denoising diffusion probabilistic models (DDPMs). Instead of using standard unimodal policy classes such as Gaussian distributions, diffusion policies leverage the expressive power of deep generative models to accurately model complex, high-dimensional, and often multimodal action distributions. This approach has been shown to be particularly effective in offline reinforcement learning, continuous control, and visuomotor robot manipulation, where traditional policy parameterizations exhibit limited expressiveness and poor generalization.
1. Mathematical Formulation and Policy Structure
The defining characteristic of a diffusion policy is that it models the policy as the reverse process of a conditional diffusion model. The policy, parameterized by $\theta$, is expressed as a chain of conditional transitions:

$$\pi_\theta(a \mid s) = p_\theta(a^{0:N} \mid s) = \mathcal{N}(a^N; 0, I) \prod_{i=1}^{N} p_\theta(a^{i-1} \mid a^i, s),$$

where:
- $a^N$ is sampled from a standard normal distribution $\mathcal{N}(0, I)$,
- $a^{N-1}, \dots, a^0$ are generated by iteratively denoising $a^N$ through the learned reverse transitions,
- the final sample $a^0$ is the action to execute, conditioned on the state $s$.

Each reverse transition $p_\theta(a^{i-1} \mid a^i, s)$ is generally parameterized as a conditional Gaussian, with a mean $\mu_\theta(a^i, s, i)$ and (optionally) covariance $\Sigma_\theta(a^i, s, i)$ predicted by a neural network as a function of the current noisy action $a^i$, the state $s$, and the diffusion timestep $i$. This formulation enables multimodal and highly flexible policy distributions, unlike the unimodal Gaussians commonly used in reinforcement learning.
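To make reverse-process sampling concrete, the following is a minimal PyTorch-style sketch of drawing an action from a diffusion policy. It assumes a hypothetical noise-prediction network `eps_model(a_i, s, i)` and a precomputed DDPM noise schedule (`betas`, `alphas`, `alpha_bars`), and fixes the reverse variance to $\beta_i$; these names and choices are illustrative assumptions, not a specific published implementation.

```python
import torch

@torch.no_grad()
def sample_action(eps_model, s, action_dim, betas, alphas, alpha_bars):
    """Draw an action a^0 ~ pi_theta(a | s) by running the learned reverse diffusion chain.

    eps_model(a_i, s, i) is assumed to predict the noise added at step i (hypothetical interface).
    betas, alphas, alpha_bars are 1-D tensors of length N from a standard DDPM schedule.
    """
    N = len(betas)
    a = torch.randn(s.shape[0], action_dim)           # a^N ~ N(0, I)
    for i in reversed(range(N)):                      # steps i = N-1, ..., 0 (zero-indexed)
        t = torch.full((s.shape[0],), i, dtype=torch.long)
        eps = eps_model(a, s, t)                      # predicted noise eps_theta(a^i, s, i)
        # Mean of the reverse transition p_theta(a^{i-1} | a^i, s) in the epsilon-parameterization
        mean = (a - betas[i] / torch.sqrt(1.0 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            a = mean + torch.sqrt(betas[i]) * torch.randn_like(a)  # add Gaussian noise at intermediate steps
        else:
            a = mean                                   # a^0: the action to execute
    return a
```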
2. Training Objectives: Coupling Behavior Cloning and Policy Improvement
Diffusion policies are typically trained via two complementary objectives:
- Behavior Cloning (Denoising) Loss: The denoising (behavior cloning) loss encourages the model to match the empirical distribution of observed actions in the offline dataset $\mathcal{D}$:

$$\mathcal{L}_d(\theta) = \mathbb{E}_{i \sim \mathcal{U}(1, N),\ \epsilon \sim \mathcal{N}(0, I),\ (s, a) \sim \mathcal{D}}\left[ \left\lVert \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_i}\, a + \sqrt{1 - \bar{\alpha}_i}\, \epsilon,\ s,\ i \right) \right\rVert^2 \right]$$

Here, the network $\epsilon_\theta$ is trained to predict the added noise $\epsilon$, given the noisy action $a^i = \sqrt{\bar{\alpha}_i}\, a + \sqrt{1 - \bar{\alpha}_i}\, \epsilon$ (with $\bar{\alpha}_i$ the cumulative product of the forward-process noise schedule) and the state $s$ as input. This training method allows the model to capture all modes of the demonstrated action distribution, including multiple distinct expert behaviors.
- Q-learning (Policy Improvement) Loss: To achieve return maximization beyond cloning, Q-learning guidance is introduced by jointly maximizing action-values under the policy, giving the combined objective (sketched in code after this list):

$$\mathcal{L}(\theta) = \mathcal{L}_d(\theta) - \alpha \cdot \mathbb{E}_{s \sim \mathcal{D},\ a^0 \sim \pi_\theta(\cdot \mid s)}\left[ Q_\phi(s, a^0) \right]$$

Here $\alpha$ is a normalization factor ensuring scale compatibility across datasets and environments. The Q-function $Q_\phi$ is trained independently, typically via Bellman residual minimization, and its gradients are backpropagated through the entire diffusion chain, enabling end-to-end policy improvement in the space of expressive, generative policies.
This coupling allows the policy to be both regularized towards the data manifold (avoiding out-of-distribution actions) and optimized for high expected return, without being limited by the support or expressivity of traditional policy classes.
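To illustrate how the two objectives are coupled in practice, the following is a minimal PyTorch-style sketch of the joint policy loss: a denoising (behavior cloning) term plus Q-value guidance. It reuses the hypothetical `eps_model` interface and noise schedule from the sampling sketch above and assumes a critic `q_net(s, a)`; the weighting `alpha_coef` and all names are illustrative assumptions, not a definitive implementation.

```python
import torch
import torch.nn.functional as F

def policy_loss(eps_model, q_net, s, a, betas, alphas, alpha_bars, alpha_coef=1.0):
    """Joint diffusion-policy loss: denoising (behavior cloning) term plus Q-value guidance."""
    B, action_dim = a.shape
    N = len(betas)

    # --- Denoising / behavior-cloning term: predict the noise added by the forward process ---
    i = torch.randint(0, N, (B,))                               # random diffusion step per sample
    eps = torch.randn_like(a)                                   # target noise
    ab = alpha_bars[i].unsqueeze(-1)                            # cumulative schedule, shape (B, 1)
    a_noisy = torch.sqrt(ab) * a + torch.sqrt(1.0 - ab) * eps   # forward-process sample a^i
    bc_loss = F.mse_loss(eps_model(a_noisy, s, i), eps)

    # --- Q-guided policy-improvement term ---
    # Re-run the reverse chain *with* gradients so Q-value gradients reach eps_model.
    a0 = torch.randn(B, action_dim)
    for t in reversed(range(N)):
        ti = torch.full((B,), t, dtype=torch.long)
        pred = eps_model(a0, s, ti)
        mean = (a0 - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * pred) / torch.sqrt(alphas[t])
        noise = torch.randn_like(a0) if t > 0 else torch.zeros_like(a0)
        a0 = mean + torch.sqrt(betas[t]) * noise
    q_loss = -q_net(s, a0).mean()                               # maximize Q  =>  minimize -Q

    return bc_loss + alpha_coef * q_loss
```

Minimizing this combined loss with a standard optimizer keeps the policy close to the dataset's action distribution while steering the generated actions toward higher estimated value.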
3. Practical Implementation and Impact
Empirical Findings
- Expressiveness: Diffusion Q-learning (Diffusion-QL) achieves state-of-the-art results on diverse, challenging offline RL tasks, particularly those with multimodal or narrow-support behavior policies such as the D4RL AntMaze and Adroit domains. In contrast, prior methods that rely on less expressive policy classes (e.g., Gaussian or CVAE) perform suboptimally when the regularization or policy-improvement step fails to capture the full distribution of the demonstrated behavior.
- Stability: Joint optimization of the denoising loss and Q-value guidance yields robust training, mitigating catastrophic extrapolation to out-of-distribution actions and avoiding mode collapse.
- Ablation studies: Diffusion policies outperform conditional VAEs and sequence-modeling approaches at recovering all modes of the demonstrated behavior.
Comparison to Prior Approaches
- Prior methods (BCQ, BEAR, CQL) are based on Gaussian, CVAE, or constrained policies, which are limited in their ability to represent multi-modal distributions and require explicit regularization to avoid policy collapse.
- Diffusion policies achieve support coverage, multimodality, and generalization through the inherent expressive power of the diffusion modeling process, which simplifies regularization and support constraints by design.
Key Empirical Benchmarks
Diffusion-QL consistently matches or surpasses the performance of strong model-free, value-constrained, model-based, and sequence-modeling baselines across the Gym, AntMaze, Adroit, and Kitchen domains. Notably, it demonstrates significant gains in tasks dominated by complex, multimodal demonstration distributions.
4. Integration with Value and Policy Networks
The action-value critic $Q_\phi$ is trained using conventional Bellman residual minimization:

$$\mathcal{L}(\phi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D},\ a' \sim \pi_\theta(\cdot \mid s')}\left[ \left( r + \gamma\, Q_{\phi'}(s', a') - Q_\phi(s, a) \right)^2 \right]$$

where the next action $a'$ is sampled from the current (diffusion) policy, which already reflects the effects of Q-guidance in policy improvement, and $Q_{\phi'}$ denotes a target critic. This workflow allows for end-to-end differentiability, supporting policy updates by propagating Q-value gradients through the diffusion process.
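For completeness, here is a minimal PyTorch-style sketch of the corresponding critic step. It assumes a hypothetical critic `q_net`, a target critic `q_target`, and a `policy_sample_fn` that draws next actions from the diffusion policy (e.g., the sampling routine sketched earlier); the soft-update rate and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def critic_update(q_net, q_target, policy_sample_fn, batch, q_optim, gamma=0.99, tau=0.005):
    """One Bellman-residual step for the critic, with next actions drawn from the diffusion policy."""
    s, a, r, s_next, done = batch                        # tensors from the offline dataset D

    with torch.no_grad():
        a_next = policy_sample_fn(s_next)                # a' ~ pi_theta(. | s') via reverse diffusion
        target = r + gamma * (1.0 - done) * q_target(s_next, a_next).squeeze(-1)

    td_loss = F.mse_loss(q_net(s, a).squeeze(-1), target)

    q_optim.zero_grad()
    td_loss.backward()
    q_optim.step()

    # Polyak averaging of the target critic (a common stabilization choice)
    with torch.no_grad():
        for p, p_t in zip(q_net.parameters(), q_target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

    return td_loss.item()
```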
5. Theoretical Justification and Significance
The choice of using diffusion models is justified by their unique ability to:
- Approximate arbitrarily complex, high-dimensional distributions supported by the data.
- Provide a mathematically principled pathway to jointly model and improve policies (via denoising score matching and Q-guided policy improvement).
- Offer convergence guarantees (subject to accurate score matching and discretization).
This paradigm sets a new standard for how expressive, generative policies can be represented, trained, and improved in offline RL, providing robust support coverage and superior policy improvement compared to sequence models and unimodal policy classes.
6. Future Directions and Broader Impact
The introduction of diffusion policies has inspired further research in online and multi-agent RL, reinforcement learning with transformer-based backbones, and generalization to high-dimensional and real-world robot control. The joint optimization of expressive generative modeling and value-guided policy improvement suggests future extensions such as:
- Online RL and continual learning with diffusion-based exploration.
- Multimodal trajectory synthesis and planning.
- Transfer and sim-to-real robot policy learning.
Diffusion policies offer a flexible and scalable foundation for learning complex behaviors in diverse domains where support coverage, safety, and generalization are essential. Their successful integration with Q-learning and end-to-end differentiable pipelines represents a major advance in the design of practical offline RL algorithms.