
Efficient Diffusion Policy (EDP)

Updated 24 June 2026
  • EDP is a framework that uses diffusion model score functions and reverse-KL regularization to extract deterministic, support-aligned policies for offline RL.
  • It implements one-shot actor extraction via a feed-forward MLP, reducing inference FLOPs by over 25× while matching state-of-the-art performance on control benchmarks.
  • EDP’s multi-stage training pipeline combines critic pretraining and diffusion behavior modeling to guarantee multimodal support and fast, real-time control.

Efficient Diffusion Policy (EDP) refers to a family of methods designed to overcome the intrinsic computational inefficiency of traditional diffusion policy inference and training, while preserving the expressive modeling power and support-matching properties inherent to diffusion-based policies. These methods are characterized by algorithmic innovations—such as score-based regularization, one-shot or adaptive inference, and lightweight policy extraction—that enable dramatically reduced training and inference costs without compromising policy quality, multimodal action modeling, or empirical performance on challenging reinforcement learning and imitation learning benchmarks.

1. Foundations: Diffusion Modeling for Policy Parameterization

Diffusion modeling parameterizes policies via a learned denoising process that reverses a forward stochastic differential equation (SDE) which gradually corrupts actions with noise. Formally, for behavior policy $\mu(a|s)$, a parameterized Markov chain $q(a_t|a_{t-1},s)$ is defined, typically in Gaussian form:

$$q(a_t | a_{t-1}) = \mathcal{N}(a_t; \sqrt{1-\beta_t}\, a_{t-1}, \beta_t I)$$

This chain culminates in a highly expressive estimator $\hat\mu(a|s)$, whose reverse process samples an action by iterative denoising. Crucially, the diffusion model provides direct access to the score function $\nabla_a \log \mu(a|s)$ via the denoising network:

$$\nabla_{a_t} \log \mu_t(a_t|s) \approx - \frac{\epsilon_\psi(a_t, s, t)}{\sigma_t}$$

where $\epsilon_\psi$ is trained to predict noise and $\sigma_t$ is the effective noise schedule parameter. While such approaches achieve support-matching with high fidelity, the multi-step sampling process incurs significant inference latency, often requiring $T = 5$ to $100$ network evaluations per action, and resulting in sub-kHz control rates unsuitable for real-time deployment (Chen et al., 2023).
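As a rough illustration of the score identity above, the sketch below works through a one-dimensional case where everything is available in closed form. It assumes a Gaussian behavior policy for a single fixed state (the `mean` and `std` defaults are made up), a hypothetical linear schedule for the per-step variances, and the cumulative scales `alphas`, `sigmas` obtained by chaining the Gaussian kernel. In that setting the ideal noise prediction is known analytically, so no network is trained here; `ideal_eps` only stands in for a trained denoiser.

```python
import math

def noise_scales(betas):
    """Cumulative signal/noise scales implied by chaining the per-step kernel."""
    alphas, sigmas, alpha_bar = [], [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta          # product of (1 - beta_i) up to step t
        alphas.append(math.sqrt(alpha_bar))
        sigmas.append(math.sqrt(1.0 - alpha_bar))
    return alphas, sigmas

def log_marginal(a_t, alpha, sigma, mean=0.5, std=0.2):
    """log mu_t(a_t) when the clean behavior policy is N(mean, std^2) (toy assumption)."""
    var = (alpha * std) ** 2 + sigma ** 2
    return -0.5 * (a_t - alpha * mean) ** 2 / var - 0.5 * math.log(2 * math.pi * var)

def ideal_eps(a_t, alpha, sigma, mean=0.5, std=0.2):
    """Noise prediction an ideally trained denoiser would output in this toy case."""
    var = (alpha * std) ** 2 + sigma ** 2
    return sigma * (a_t - alpha * mean) / var

def score_from_eps(eps, sigma):
    """Score estimate read off the denoiser: -eps / sigma_t."""
    return -eps / sigma

# Hypothetical linear schedule with T = 10 steps.
T = 10
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas, sigmas = noise_scales(betas)

t, a_t, h = T - 1, 0.3, 1e-5
score = score_from_eps(ideal_eps(a_t, alphas[t], sigmas[t]), sigmas[t])
# Central finite difference of log mu_t as an independent check.
fd = (log_marginal(a_t + h, alphas[t], sigmas[t])
      - log_marginal(a_t - h, alphas[t], sigmas[t])) / (2 * h)
print(f"score from denoiser: {score:.6f}, finite difference: {fd:.6f}")
```

The two printed numbers should agree closely, which is the sense in which a noise-prediction network gives access to the score without running the reverse chain.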

2. Score-Regularized Policy Optimization (SRPO) and Deterministic Actor Extraction

EDP reframes the policy optimization problem by leveraging the score function of the pretrained diffusion model as a regularizer in policy gradient updates. Starting from a reverse-KL objective:

J(π)=EsD,aπ(s)[Qφ(s,a)]1βKL[π(s)μ(s)]J(\pi) = \mathbb{E}_{s\sim D, a\sim\pi(\cdot|s)}[Q_\varphi(s, a)] - \frac{1}{\beta} \mathrm{KL}[\pi(\cdot|s) \| \mu(\cdot|s)]

Differentiating for a deterministic actor $a = \pi_\theta(s)$ yields:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s\sim D}\left[\left(\nabla_a Q_\varphi(s, a) + \frac{1}{\beta} \nabla_a \log \mu(a|s)\right)\Big|_{a=\pi_\theta(s)} \nabla_\theta \pi_\theta(s)\right]$$

Here, the only nontrivial term is the behavior score $\nabla_a \log \mu(a|s)$, efficiently approximated by the pretrained diffusion model.

Training proceeds by directly updating the parameters of a simple feed-forward MLP actor to maximize $Q_\varphi(s, \pi_\theta(s))$, while regularizing with the diffusion score to ensure the action remains in the high-density region of the behavior distribution. A surrogate objective ensembles over different noise levels $t$, further smoothing the optimization. The result is a deterministic “one-shot” policy $\pi_\theta(s)$ whose computational overhead is similar to conventional Gaussian actors, and that retains the support-matching and multimodal capabilities of the original diffusion parameterization (Chen et al., 2023).
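To make the actor update concrete, here is a minimal sketch of gradient ascent on the critic value plus a score-based regularizer for a deterministic actor. Everything in it is a toy assumption rather than the published setup: the critic `Q(s, a) = -(a - 2s)^2`, the Gaussian behavior policy `N(s, 0.5^2)` whose score is written in closed form (in EDP this term would come from the pretrained diffusion model), and a one-parameter linear actor `a = theta * s` in place of an MLP.

```python
def critic_grad(s, a):
    """dQ/da for a toy critic Q(s, a) = -(a - 2 s)^2 (hypothetical)."""
    return -2.0 * (a - 2.0 * s)

def behavior_score(s, a, std=0.5):
    """d/da log mu(a|s) for a toy behavior policy N(s, std^2).
    Stand-in for the score supplied by a pretrained diffusion model."""
    return -(a - s) / std ** 2

def extract_actor(states, beta=1.0, lr=0.1, steps=200):
    """Gradient ascent on Q + (1/beta) log mu for a linear actor a = theta * s."""
    theta = 0.0
    for _ in range(steps):
        grad = 0.0
        for s in states:
            a = theta * s                       # deterministic action
            da = critic_grad(s, a) + behavior_score(s, a) / beta
            grad += da * s                      # chain rule: d(theta * s)/d(theta) = s
        theta += lr * grad / len(states)
    return theta

states = [0.5, 1.0, 1.5]
theta = extract_actor(states)
print(f"theta = {theta:.4f}")
```

In this toy problem the critic alone would prefer `theta = 2` and the behavior policy alone `theta = 1`; with `beta = 1` the update settles at `theta = 4/3`, and a larger `beta` (weaker regularization) moves the solution toward the critic's optimum.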

3. Algorithmic Structure and Implementation Workflow

The canonical EDP pipeline is as follows:

  1. Critic Pretraining: Train a $Q$-function (e.g., with IQL, TD3, or CQL) using the offline dataset.
  2. Diffusion Behavior Model Pretraining: Fit a high-capacity diffusion model to the dataset, learning to predict the noise $\epsilon$.
  3. Deterministic Actor Extraction: Optimize the deterministic actor $\pi_\theta$ via the SRPO objective, using

$$\nabla_\theta L^{\mathrm{surr}}(\theta) = \mathbb{E}_{s\sim D}\left[\left(\nabla_a Q_\varphi(s, a)\Big|_{a=\pi_\theta(s)} - \frac{1}{\beta}\, \mathbb{E}_{t,\epsilon}\left[\omega(t)\left(\epsilon_\psi(a_t, s, t) - \epsilon\right)\right]\right) \nabla_\theta \pi_\theta(s)\right]$$

evaluated at $a_t = \alpha_t \pi_\theta(s) + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\alpha_t$ and $\sigma_t$ are the signal and noise scales of the forward process at noise level $t$. The weight $\omega(t)$ and the subtraction of the $\epsilon$ baseline aid in variance reduction and convergence.

  4. Inference: At test time, action selection is a single deterministic forward pass; no diffusion chain unrolling is necessary.

This procedure bypasses the need for costly multi-step sampling both in training and deployment, achieving a reduction in FLOPs by more than $25\times$ relative to diffusion baselines, with empirical policy performance at or above state-of-the-art on D4RL locomotion tasks (Chen et al., 2023).
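The noise-level ensembling and baseline subtraction in the actor-extraction step can be sketched numerically. The example below is an illustration under stated assumptions, not the reference implementation: the behavior policy is a toy one-dimensional Gaussian so that `ideal_eps` can stand in for the pretrained denoiser, the three `(alpha_t, sigma_t, omega(t))` triples are hypothetical, and the weighting `omega(t) = sigma_t^2` is chosen only for the demonstration. It estimates the regularization direction for one action with and without subtracting the sampled noise.

```python
import math
import random

def ideal_eps(a_t, alpha, sigma, mean=0.5, std=0.2):
    """Closed-form stand-in for the pretrained denoiser (toy Gaussian behavior)."""
    var = (alpha * std) ** 2 + sigma ** 2
    return sigma * (a_t - alpha * mean) / var

def regularizer_samples(action, levels, n, rng, use_baseline=True):
    """Per-sample estimates of the score-regularization direction for one action."""
    out = []
    for _ in range(n):
        alpha, sigma, weight = rng.choice(levels)    # ensemble over noise levels
        eps = rng.gauss(0.0, 1.0)
        a_t = alpha * action + sigma * eps           # noised copy of the actor's action
        pred = ideal_eps(a_t, alpha, sigma)
        out.append(-weight * (pred - eps if use_baseline else pred))
    return out

def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical (alpha_t, sigma_t, omega(t)) triples; omega(t) = sigma_t^2 is an assumption.
levels = [(math.sqrt(1 - v), math.sqrt(v), v) for v in (0.2, 0.5, 0.8)]
rng = random.Random(0)
with_base = mean_var(regularizer_samples(1.0, levels, 4000, rng, use_baseline=True))
without = mean_var(regularizer_samples(1.0, levels, 4000, rng, use_baseline=False))
print("with baseline    mean=%.3f var=%.4f" % with_base)
print("without baseline mean=%.3f var=%.4f" % without)
```

Because the sampled noise has zero mean, both estimators target the same direction (negative here, pulling the action at 1.0 back toward the behavior mean at 0.5), while the baseline-subtracted version has much lower variance in this toy setting.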

4. Empirical Performance and Efficiency Gains

Extensive evaluation on D4RL continuous-control benchmarks demonstrates that EDP matches or outperforms leading diffusion-based offline RL algorithms (such as Diffusion-QL, IDQL, and QGPO) and achieves normalized scores averaging around 87 on key environments. More significantly, EDP’s computational gains are substantial:

| Metric | Diffusion-QL | EDP (SRPO) |
|---|---|---|
| Inference speed (Hz, GPU) | 50–200 | up to 3,000 |
| Relative forward-pass FLOPs | 100% | <1% |
| Wall-clock speedup (training, D4RL) | | 25–1000× |
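The inference-speed gap follows from simple arithmetic: a policy that needs several sequential network evaluations per action can act at most once per total evaluation time. The sketch below uses a hypothetical latency of 0.3 ms per forward pass and assumes, for simplicity, the same latency for the denoiser and the one-shot actor; real numbers depend on hardware and network size.

```python
def control_rate_hz(latency_per_eval_s, num_evals):
    """Upper bound on control frequency when each action needs
    `num_evals` sequential network evaluations."""
    return 1.0 / (latency_per_eval_s * num_evals)

latency = 3e-4  # hypothetical: 0.3 ms per forward pass
for steps in (100, 20, 5, 1):
    print(f"T={steps:>3}: {control_rate_hz(latency, steps):8.1f} Hz")
```

Under these assumed numbers only the single-evaluation case exceeds 1 kHz, which is the regime a one-shot actor targets.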

Ablation studies show that both the $t$-ensembling and the use of the $\epsilon$ variance baseline improve final scores. The approach is robust to the choice of noise schedule and hyperparameters such as $\beta$ (Chen et al., 2023).

5. Theoretical Insights and Support-Matching Guarantees

EDP leverages the reverse-KL regularization property to favor “mode-seeking” solutions, ensuring the extracted actor does not cover low-probability modes, in contrast to mode-covering forward-KL baselines. The policy extraction remains stable even in highly multimodal or heterogeneous behavior settings.
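The mode-seeking versus mode-covering distinction can be checked numerically on a toy problem. The sketch below assumes a bimodal behavior policy (an equal mixture of two Gaussians at -1 and +1, made up for illustration) and a unimodal Gaussian actor with fixed width, then grid-searches the actor mean that minimizes each divergence using a simple Riemann sum.

```python
import math

def log_gauss(a, mean, std):
    """Log-density of N(mean, std^2) at a."""
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def log_behavior(a, std=0.3):
    """Toy bimodal behavior policy: equal mixture of N(-1, std^2) and N(+1, std^2)."""
    return math.log(0.5 * math.exp(log_gauss(a, -1.0, std))
                    + 0.5 * math.exp(log_gauss(a, 1.0, std)))

GRID = [-4.0 + 0.02 * i for i in range(401)]   # integration grid on [-4, 4]
DA = 0.02

def reverse_kl(mean, std=0.3):
    """KL[pi || mu] for a unimodal actor pi = N(mean, std^2)."""
    return sum(math.exp(log_gauss(a, mean, std))
               * (log_gauss(a, mean, std) - log_behavior(a)) for a in GRID) * DA

def forward_kl(mean, std=0.3):
    """KL[mu || pi] for the same actor."""
    return sum(math.exp(log_behavior(a))
               * (log_behavior(a) - log_gauss(a, mean, std)) for a in GRID) * DA

candidates = [-2.0 + 0.05 * i for i in range(81)]
best_reverse = min(candidates, key=reverse_kl)
best_forward = min(candidates, key=forward_kl)
print(f"reverse-KL mean: {best_reverse:+.2f}, forward-KL mean: {best_forward:+.2f}")
```

The reverse-KL minimizer lands on one of the two modes, whereas the forward-KL minimizer sits between them in a region the behavior policy almost never visits, which is why reverse-KL regularization suits a unimodal or deterministic actor.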

Analytically, for each noise level $t$, the KL-divergence minimizer between the actor’s action distribution and the $t$-corrupted behavior marginal is the behavior policy itself. Thus, the one-shot actor extracted via SRPO, potentially ensembling across $t$, remains support-aligned with the demonstration data. This supports the strategy of using pre-trained, expressive diffusion models purely for score evaluation, not iterative sampling (Chen et al., 2023).

6. Practical Considerations and Limitations

EDP’s advantages are most pronounced in regimes where (i) data support alignment is critical (offline RL), (ii) policy evaluation cost must be minimized (real-time control, large-scale experimentation), and (iii) diffusion models capture heterogeneous, multi-modal behavior distributions. It is agnostic to the choice of the underlying critic learning method.

Limitations include:

  • The extracted deterministic policy may not fully represent the diversity of the original diffusion action distribution (potentially less stochasticity).
  • Transfer to high-dimensional, highly multimodal tasks may require careful tuning of ensembling and regularization parameters.

No formal convergence rates are established; however, empirical stability across ablations is strong and variance-reduction techniques are well-motivated and effective.

7. Impact on Offline RL and Broader Directions

The emergence of Efficient Diffusion Policy methods, exemplified by SRPO, has removed much of the computational overhead that was the primary barrier in diffusion-based reinforcement learning. Subsequent frameworks (e.g., D3P (Yu et al., 9 Aug 2025), Mamba Policy (Cao et al., 2024), and downstream “efficient” adaptation or equivariant extensions) follow this paradigm through score-based policy extraction, adaptive or reduced-step inference, or structured lightweight modeling.

EDP enables the practical deployment of diffusion policies in domains that require both complex action modeling and stringent latency, such as high-frequency robotic control and large-scale offline RL research. The conceptual isolation of score regularization as the bridge between multimodal generative modeling and efficient real-time policy optimization continues to underpin advances in the design of expressive, practical sequential decision-making systems (Chen et al., 2023).
