
One-Step Diffusion Policy (OneDP)

Updated 9 March 2026
  • OneDP is a one-step framework that approximates diffusion model behavior with a single network evaluation, significantly reducing inference latency in robotic tasks.
  • It leverages methods like diffusion-to-one-step distillation and MeanFlow to enable efficient policy learning in both imitation and reinforcement learning scenarios.
  • Empirical results show 10–700x inference speedups over multi-step models while maintaining high success rates in simulated and real-world robotic benchmarks.

The One-Step Diffusion Policy (OneDP) framework encompasses a set of generative policy learning techniques designed to achieve the expressive capacity of diffusion and flow-matching models while enabling inference in a single network evaluation. OneDP addresses the latency bottleneck inherent in traditional multi-step denoising policies and has been demonstrated in both supervised imitation and on-policy/off-policy reinforcement learning for visuomotor control, robotic manipulation, and continuous control benchmarks (Wang et al., 2024, Chen et al., 31 Jul 2025, Zou et al., 28 Jan 2026, Liu et al., 5 Mar 2026).

1. Motivation and Theoretical Foundations

Diffusion models have exhibited strong generative performance in behavior cloning for robotics but suffer from latency due to the iterative nature of denoising-based action generation. Canonical DDPM-based policies require sequential reverse steps, resulting in tens to hundreds of function evaluations per action and action rates on the order of 1.5 Hz even on GPUs, severely limiting their applicability to high-frequency or resource-constrained robotic tasks (Wang et al., 2024). Low-frequency action outputs impede deployment in dynamic environments, edge systems, and any scenario necessitating closed-loop control or rapid feedback.
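As a concrete illustration using figures reported later in this article: at roughly 6–7 ms per network evaluation, a 100-step DDPM sampler spends about 660 ms per predicted action chunk, i.e. roughly 1.5 Hz, whereas a single evaluation of the same network supports control rates above 100 Hz.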

The rationale for OneDP rests on theoretical observations about the single-step approximation of continuous transport and denoising models. In time-interpolated flow frameworks, if the target policy is low-variance, the discretization error incurred by a one-step Euler sampler is provably upper-bounded by the variance of the action distribution. As policy optimization—for example, via policy mirror descent—naturally drives the learned policy towards determinism, single-step samplers become increasingly accurate for practical control (Chen et al., 31 Jul 2025, Zou et al., 28 Jan 2026).
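As a hedged illustration of this bound (the notation here is ours and simplified relative to the cited papers), consider a flow-matching policy with the linear path $z_\tau = (1-\tau)\,a + \tau\,z_1$, where $a \sim \pi(\cdot \mid x)$ is the target action and $z_1 \sim \mathcal{N}(0, I)$ is independent noise, so that the marginal velocity field is $v(z, \tau) = \mathbb{E}[z_1 - a \mid z_\tau = z]$. A single reverse Euler step from $\tau = 1$ gives

$$\hat{a} = z_1 - v(z_1, 1) = z_1 - \big(z_1 - \mathbb{E}[a \mid x]\big) = \mathbb{E}[a \mid x],$$

and hence

$$\mathbb{E}\,\lVert a - \hat{a} \rVert^2 = \mathbb{E}\,\lVert a - \mathbb{E}[a \mid x] \rVert^2 = \operatorname{tr}\,\mathrm{Var}(a \mid x),$$

which vanishes as the policy becomes deterministic; the cited works state formal, more general versions of this variance-controlled bound.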

2. Core Methodologies for One-Step Generation

Multiple algorithmic instantiations of OneDP have been proposed:

  • Diffusion-to-One-Step Distillation (Behavior Cloning): A distilled generator $G_\theta(z, x)$ mimics the distribution of a trained diffusion policy $\pi_\phi(a \mid x)$ by minimizing the reverse KL divergence across the full diffusion chain. Training employs the following sequence: (i) sample $z \sim \mathcal{N}(0, I)$, (ii) generate the action $a = G_\theta(z, x)$, (iii) align $p_{G_\theta}(a \mid x)$ with $p_{\pi_\phi}(a \mid x)$ by matching their score functions across noise levels using an auxiliary score network $\pi_\psi$ (Wang et al., 2024); a minimal code sketch of this update appears after this list. Deterministic (OneDP-D) and stochastic (OneDP-S) versions are supported, the latter introducing latent noise for better exploration.
  • MeanFlow/Flow-Matching (RL and Imitation): The action space is equipped with a parametric average velocity field $u_\theta(z_\tau, r, \tau, o)$, enabling mathematically exact single-step inference. For instance, the MeanFlow identity provides

$$u(z_\tau, r, \tau) = v(z_\tau, \tau) - (\tau - r)\,\frac{d}{d\tau}\, u(z_\tau, r, \tau)$$

where $v$ is the instantaneous vector field, $z_\tau$ is an interpolated state between data and noise, and $u$ is optimized via MSE regression to this analytic target (Zou et al., 28 Jan 2026). Inference is performed as $a_{\mathrm{pred}} = z_1 - u_\theta(z_1, 0, 1, o)$ in a single forward pass.

  • Policy Mirror Descent and On-Policy RL Integration: One-step generative models are embedded in frameworks like Policy Mirror Descent (off-policy) and Conditional PPO (on-policy), leveraging their respective analytical or practical benefits. OneDP can thus be trained either via KL-regularized policy improvement (mirror descent) or via PPO-style objectives, yielding closed-form Gaussian parameterizations and tractable entropy regularization (Chen et al., 31 Jul 2025, Liu et al., 5 Mar 2026).
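To make the distillation variant concrete, the following is a minimal PyTorch-style sketch of one alternating update. It is written under assumed interfaces: `generator`, `teacher_score`, `gen_score`, and `schedule` are hypothetical stand-ins for $G_\theta$, the frozen teacher policy's score network, the auxiliary network $\pi_\psi$, and a DDPM-style noise schedule; the weighting and parameterization are simplified relative to (Wang et al., 2024).

```python
import torch
import torch.nn.functional as F

# Hypothetical interfaces (not the authors' code):
#   generator(z, x)            -> action, stands in for G_theta
#   teacher_score(a_t, t, x)   -> score of the frozen diffusion policy pi_phi
#   gen_score(a_t, t, x)       -> auxiliary score network pi_psi for generator samples
#   schedule.alpha(t), schedule.sigma(t) -> (B, 1) DDPM coefficients (broadcast over actions)

def distillation_step(generator, teacher_score, gen_score, g_opt, s_opt, x, schedule):
    """One alternating update of score-difference (reverse-KL) distillation (sketch)."""
    B = x.shape[0]
    z = torch.randn(B, generator.action_dim, device=x.device)
    a = generator(z, x)                                # one-step action sample

    # Diffuse the generated action to a random noise level t.
    t = torch.randint(0, schedule.num_steps, (B,), device=x.device)
    alpha_t, sigma_t = schedule.alpha(t), schedule.sigma(t)
    eps = torch.randn_like(a)
    a_t = alpha_t * a + sigma_t * eps

    # Generator update: move generator samples toward the teacher's score field.
    with torch.no_grad():
        grad = gen_score(a_t, t, x) - teacher_score(a_t, t, x)   # reverse-KL direction
    g_loss = 0.5 * F.mse_loss(a_t, (a_t - grad).detach())        # gradient w.r.t. a_t is grad (up to scale)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # Auxiliary score update: denoising score matching on fresh generator samples.
    with torch.no_grad():
        a_fake = generator(torch.randn_like(z), x)
    a_fake_t = alpha_t * a_fake + sigma_t * eps
    s_loss = F.mse_loss(gen_score(a_fake_t, t, x), -eps / sigma_t)  # conditional Gaussian score target
    s_opt.zero_grad()
    s_loss.backward()
    s_opt.step()
    return g_loss.item(), s_loss.item()
```

The surrogate MSE term is a standard trick for injecting the score-difference direction as the gradient with respect to the diffused sample; its overall scale can be absorbed into the learning rate.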

3. Algorithms, Architectures, and Training Regimes

Implementation reflects the underlying learning paradigm:

  • Diffusion Distillation: The generator $G_\theta$ is parameterized as a temporal U-Net with a ResNet-18 vision encoder. Training alternates between minimizing the score-difference KL and updating the score predictor $\pi_\psi$. Warm-starting from the diffusion teacher improves sample efficiency, with convergence reached after only 2–10% additional epochs (Wang et al., 2024).
  • MeanFlow/DMPO: A lightweight Vision Transformer encodes observations, followed by an MLP head that predicts the mean flow. Dispersive regularization maximizes the marginal entropy of conditional representations, preventing encoder collapse under the extreme compression demanded by one-step inference. Pre-training combines MeanFlow velocity regression (a minimal sketch of this regression target appears at the end of this section) with an InfoNCE-L2 dispersive loss; a subsequent fine-tuning stage uses PPO with behavior-cloning regularization (Zou et al., 28 Jan 2026).
  • On-Policy One-Step RL: The one-step policy is structured as a two-stage Gaussian:

$$p_\theta(a \mid a_0, s) = \mathcal{N}\big(a;\, a_0 + \mu_\theta(a_0, s),\, \Sigma_\theta(a_0, s)\big)$$

with $a_0$ sampled from a reference policy (often the previous policy or a parametric flow model). Policy gradients reduce to evaluating Gaussian log-probabilities and their derivatives, avoiding backpropagation through the full denoising chain and making entropy regularization tractable, as sketched below (Liu et al., 5 Mar 2026).
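As a minimal sketch of why this structure keeps on-policy updates tractable, the snippet below evaluates the conditional Gaussian kernel's log-density and entropy and plugs them into a standard clipped PPO surrogate. The module names (`mu_net`, `log_std_net`, `ref_policy`) are hypothetical, and treating $a_0$ as fixed conditioning rather than marginalizing over it is a simplification of the formulation in (Liu et al., 5 Mar 2026).

```python
import torch
from torch.distributions import Normal

# Hypothetical networks (not the authors' code):
#   ref_policy(s)        -> reference action a0 (e.g., previous policy or flow model)
#   mu_net(a0, s)        -> residual mean of the one-step Gaussian kernel
#   log_std_net(a0, s)   -> log standard deviation of the kernel

def sample_and_logprob(mu_net, log_std_net, ref_policy, s):
    """Draw a ~ N(a0 + mu(a0, s), diag(sigma^2)) and return log-prob and entropy."""
    with torch.no_grad():
        a0 = ref_policy(s)                      # a0 treated as fixed conditioning here
    dist = Normal(a0 + mu_net(a0, s), log_std_net(a0, s).exp())
    a = dist.rsample()
    logp = dist.log_prob(a).sum(-1)             # diagonal Gaussian: sum per-dimension log-probs
    entropy = dist.entropy().sum(-1)            # closed-form entropy for regularization
    return a, logp, entropy

def ppo_loss(logp_new, logp_old, adv, entropy, clip=0.2, ent_coef=0.01):
    """Standard clipped PPO surrogate with an entropy bonus."""
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    return -(torch.min(ratio * adv, clipped * adv)).mean() - ent_coef * entropy.mean()
```

Because every quantity is a closed-form Gaussian density, no backpropagation through a multi-step denoising chain is required.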

Typical hyperparameters are task- and architecture-dependent (learning rates, regularization weights, exploration noise levels), but strong performance is demonstrated with compact networks (fewer than 2M weights), batch sizes above 256, and annealed exploration schedules.
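Finally, the MeanFlow regression target and one-step inference referenced above (and in Section 2) can be sketched as follows. `u_net` is a hypothetical average-velocity network, the interpolation convention is assumed to place pure noise at $\tau = 1$, and the dispersive loss, loss weighting, and RL fine-tuning of DMPO are omitted.

```python
import torch
from torch.func import jvp

# Hypothetical network (not the authors' code):
#   u_net(z, r, tau, obs) -> average velocity u_theta; z: (B, d), r/tau: (B, 1)
# Convention assumed here: z_tau = (1 - tau) * a + tau * noise, so tau = 1 is pure noise,
# matching the inference rule a_pred = z_1 - u_theta(z_1, 0, 1, o).

def meanflow_loss(u_net, a, obs):
    """MSE regression of u_theta onto the MeanFlow-identity target (sketch)."""
    B, d = a.shape
    noise = torch.randn_like(a)
    tau = torch.rand(B, 1, device=a.device)
    r = tau * torch.rand(B, 1, device=a.device)          # 0 <= r <= tau
    z = (1.0 - tau) * a + tau * noise                    # interpolated state
    v = noise - a                                        # instantaneous (conditional) velocity

    # Total derivative d/dtau of u along the path, via one JVP with tangent (v, 0, 1).
    u, dudt = jvp(lambda z_, r_, t_: u_net(z_, r_, t_, obs),
                  (z, r, tau),
                  (v, torch.zeros_like(r), torch.ones_like(tau)))
    target = v - (tau - r) * dudt                        # MeanFlow identity
    return ((u - target.detach()) ** 2).mean()

def one_step_action(u_net, obs, d):
    """Single-forward-pass inference: a_pred = z_1 - u_theta(z_1, 0, 1, o)."""
    z1 = torch.randn(obs.shape[0], d, device=obs.device)
    r = torch.zeros(obs.shape[0], 1, device=obs.device)
    tau = torch.ones_like(r)
    return z1 - u_net(z1, r, tau, obs)
```

The single jvp call evaluates both $u_\theta$ and its total derivative along the path, so constructing the regression target costs roughly one extra forward pass.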

4. Empirical Evaluation and Performance Analysis

The OneDP family has been evaluated extensively in simulated and real-world domains:

  • Simulation Benchmarks: In Robomimic tasks, OneDP (distilled, MeanFlow, and RL variants) matches or surpasses the success rates of full diffusion policies (e.g., OneDP-S: 84.3% success vs. DP@100 steps: 82.9%; CP@3 steps: 71.2%). DMPO shows competitive or superior performance in Robomimic and D4RL tasks (e.g., Hopper: 1364.1 vs. ReFlow: 1192.6), maintaining performance even with K=1 step (Wang et al., 2024, Zou et al., 28 Jan 2026, Chen et al., 31 Jul 2025).
  • Real-World Robotic Deployment: On a Franka Panda robot, OneDP delivers high-frequency closed-loop control, with wall-clock inference times of 0.6–9.6 ms (batch size 1, including perception overhead), enabling >100 Hz operation and accommodating dynamic disturbances (Wang et al., 2024, Zou et al., 28 Jan 2026).
  • Ablation Findings: Dispersive regularization is critical—removal results in 5–10% success drops on complex tasks. Inclusion of BC loss during RL fine-tuning substantially stabilizes performance. One-step policies maintain multimodal behavior, retaining the expressiveness advantage of diffusion models over unimodal Gaussian policies (Zou et al., 28 Jan 2026, Liu et al., 5 Mar 2026).
  • Speedup and Efficiency: OneDP and DMPO deliver 10–700x inference speedups relative to multi-step baseline diffusion models (DP@100: 2.6 Hz; DMPO: 1,770 Hz). Distillation overhead is minor (2–10%), and PMD/MeanFlow variants obviate the need for explicit distillation altogether (Wang et al., 2024, Zou et al., 28 Jan 2026, Chen et al., 31 Jul 2025).

Summary of empirical metrics (selected from the data):

Reference                   Method          #Steps   Success (%)   Inference Time (ms)   Freq (Hz)
(Wang et al., 2024)         OneDP-S         1        84.3          7                     142
(Wang et al., 2024)         DP@100 (DDPM)   100      82.9          660                   1.5
(Zou et al., 28 Jan 2026)   DMPO (1-step)   1        ~100          0.6                   1770

5. Scientific and Practical Implications

OneDP's adoption enables high-frequency, low-latency action generation compatible with the physical and computational constraints of robotic platforms. The mathematical justification for single-step sampling (variance-controlled error bounds) obviates much of the empirical risk associated with truncating reverse processes in generative models. In practical deployment, OneDP can track rapidly moving objects and counteract perturbations (e.g., human interference) that are infeasible for traditional multi-step diffusion models (Wang et al., 2024, Zou et al., 28 Jan 2026, Chen et al., 31 Jul 2025).

A key insight is that, in RL settings where policy entropy is intentionally minimized, single-step (Euler or MeanFlow) action generators are both theoretically sound and empirically robust. The ease of integrating entropy bonuses, BC regularization, and stabilizing score-based penalties means sample-efficient RL pipelines do not sacrifice the desirable properties of diffusion approaches, such as multimodality.

6. Limitations, Open Questions, and Future Directions

Despite performance gains, open challenges remain. OneDP approaches have not been exhaustively tested on long-horizon, memory-dependent real-world tasks (Wang et al., 2024). Moreover, attainable inference throughput exceeds the control rates actually deployed on-robot, which are capped (e.g., at 20 Hz) for actuation stability. The choice of distillation objective (KL divergence along the diffusion chain) is sufficient but not necessarily optimal; adversarial objectives or alternative divergences could yield even closer distributional alignment (Wang et al., 2024). Additionally, multimodality, representation-collapse prevention, and reward fine-tuning for exploration remain active areas for algorithmic and architectural refinement (Zou et al., 28 Jan 2026, Liu et al., 5 Mar 2026).

A plausible implication is that, as environments and robots demand ever higher control rates and greater policy expressivity, OneDP offers a scalable, architecture-agnostic foundation. The variance-error-bound perspective and the rapid convergence properties also suggest suitability for highly dynamic and resource-limited operational contexts.

The OneDP principle is concurrently reflected in independently developed frameworks:

  • Dispersive MeanFlow Policy Optimization (DMPO): Introduces mathematically exact one-step inference, dispersive regularization, and RL fine-tuning as a unified approach for surpassing expert-level control at hundreds of Hz sampling rates (Zou et al., 28 Jan 2026).
  • Flow Policy Mirror Descent (FPMD): Utilizes policy mirror descent, variance-bound justifications, and MeanFlow for off-policy RL without distillation, delivering competitive sample efficiency and GPU utilization (Chen et al., 31 Jul 2025).
  • Conditional PPO with One-Step Diffusion Kernel: Aligns on-policy RL updates with a single Gaussian "denoise" kernel, bridging the tractability of policy gradients with expressivity of diffusion models while maintaining efficient entropy regularization (Liu et al., 5 Mar 2026).

These works corroborate the practical and theoretical viability of the One-Step Diffusion Policy paradigm across domains and RL methodologies.


References:

(Wang et al., 2024): One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation
(Chen et al., 31 Jul 2025): One-Step Flow Policy Mirror Descent
(Zou et al., 28 Jan 2026): One Step Is Enough: Dispersive MeanFlow Policy Optimization
(Liu et al., 5 Mar 2026): Diffusion Policy through Conditional Proximal Policy Optimization
