
Streaming Diffusion Policy (SDP)

Updated 16 February 2026
  • Streaming Diffusion Policy (SDP) is a class of algorithms that re-engineers diffusion models to enable rapid, real-time decision-making in robotic control and reinforcement learning.
  • It leverages strategies like partial denoising, flow-based integration, and score-regularized policy extraction to streamline sequential action synthesis and reduce computational latency.
  • SDP demonstrates significant performance gains by combining expressive multi-modal action modeling with efficient online policy optimization, leading to lower latency and competitive success rates.

A Streaming Diffusion Policy (SDP) is a class of algorithms that accelerates decision-making in robotic control and reinforcement learning by re-engineering how diffusion models are utilized for sequential action synthesis. SDP frameworks enable substantially faster online policy execution compared to conventional trajectory-sampling diffusion policies. These methods are motivated by the need to bridge the gap between the highly expressive, multi-modal action modeling of diffusion models and the real-time reactivity required in robotic and control tasks. SDP achieves this by streaming actions or action trajectories—either by incremental partial denoising, learned flow integration, or analytic score extraction—such that valid actions can be computed and executed at each control cycle with a single network evaluation (or only a few), rather than completing a lengthy multi-step denoising/sampling loop.

1. Background and Motivation

Diffusion models have been adopted in robotic imitation learning and offline RL for their capability to describe complex, multi-modal action or trajectory distributions. However, classical sampling requires tens to hundreds of iterative denoising steps to synthesize a single trajectory, imposing significant computational latency. This bottleneck limits the application of diffusion-based policies in interactive environments where real-time feedback and reaction are essential. Even distillation-based acceleration approaches suffer from accuracy-diversity tradeoffs and high pre-computation costs (Høeg et al., 2024).

Streaming Diffusion Policies (SDPs) reformulate the inference process to enable on-the-fly action generation, either by streaming partially denoised trajectories, by integrating generative flow fields incrementally, or by extracting policies via score regularization, thereby closing the latency gap that previously separated expressive generative policies from classical parametric controllers (Jiang et al., 28 May 2025, Chen et al., 2023, Ma et al., 1 Feb 2025).

2. Core Methodological Innovations

SDP frameworks encompass several algorithmic strategies. The following taxonomy highlights primary approaches synthesized from recent developments:

A. Partial Denoising for Immediate Reactivity

Instead of requiring full denoising of an entire action trajectory, SDP outputs, at each observation, a partially denoised sequence: the immediate action is noise-free, while future actions have higher uncertainty. This partially denoised trajectory is rapidly generated by a small number of denoising steps applied to a prior noisy prediction, shifted by one timestep (Høeg et al., 2024).
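As a rough sketch of the idea, the control cycle can be written as a shift-and-refine loop over an action buffer whose head is (nearly) noise-free and whose tail is near pure noise. Here `denoise_fn` is a hypothetical stand-in for the conditional denoising network; all names are illustrative, not taken from the paper's code:

```python
import random

def shift_and_denoise(buffer, denoise_fn, obs, steps=1):
    """One control cycle of a partial-denoising streaming policy (sketch).

    `buffer[0]` is the (nearly) noise-free immediate action; later entries
    carry progressively more noise. `denoise_fn(buffer, obs)` is a
    hypothetical network call that denoises every entry by one level,
    conditioned on the current observation.
    """
    action = buffer[0]                      # execute the refined head action
    # shift the horizon by one timestep and seed the tail with fresh noise
    buffer = buffer[1:] + [[random.gauss(0.0, 1.0) for _ in buffer[0]]]
    for _ in range(steps):                  # a few cheap refinement steps
        buffer = denoise_fn(buffer, obs)
    return action, buffer
```

Because the buffer is warm-started from the previous cycle's prediction, only `steps` (typically small) network evaluations are needed per executed action.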

B. Flow/ODE-Based Streaming Policies

In Streaming Flow Policy (SFP), action trajectories are treated as flow trajectories in the action space. Rather than initializing from pure noise, the integrator starts from a narrow Gaussian centered on the last executed action. The velocity field is learned by flow-matching to demonstration data, and policy execution is performed by numerically integrating $da/dt = v_\theta(a, t \mid h)$ in real time, streaming each action to the environment after a single forward computation (Jiang et al., 28 May 2025). This approach enables continuous action streaming throughout the receding horizon.
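A minimal sketch of this integration loop in two dimensions, using forward Euler; `v_theta` is a stand-in for the learned velocity field and its signature is an assumption, not the paper's API:

```python
import random

def stream_actions(v_theta, history, a_last, horizon_steps=10, sigma0=0.05):
    """Stream an action chunk by integrating a learned velocity field (sketch).

    `v_theta(a, t, history)` is a hypothetical network evaluating
    da/dt = v_theta(a, t | h). Integration starts near the last executed
    action rather than from pure noise, so every intermediate state is
    itself a valid action that can be streamed to the robot immediately.
    """
    dt = 1.0 / horizon_steps
    a = [x + random.gauss(0.0, sigma0) for x in a_last]   # warm start
    t = 0.0
    actions = []
    for _ in range(horizon_steps):
        v = v_theta(a, t, history)
        a = [ai + dt * vi for ai, vi in zip(a, v)]        # forward Euler step
        t += dt
        actions.append(list(a))   # in practice, stream this action right away
    return actions
```

Each control cycle therefore costs exactly one network evaluation per streamed action.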

C. Score-Regularized Policy Extraction

Score Regularized Policy Optimization (SRPO) leverages a pre-trained diffusion behavior model to calculate the score $\nabla_a \log \mu(a \mid s)$ directly and regularizes deterministic or one-step policies toward the multi-modal behavior distribution, bypassing the slow diffusion sampling procedure (Chen et al., 2023). This enables extraction of a policy network that inherits the diversity of the diffusion model with the efficiency of a single forward pass.

D. Efficient Online RL via Reweighted Score Matching

SDP in online RL can generalize denoising score matching by formulating Reweighted Score Matching (RSM) objectives. By carefully choosing weighting functions that do not require direct sampling from the optimal policy, algorithms such as Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC) allow for tractable and sample-efficient online policy learning without backpropagating through the full diffusion process (Ma et al., 1 Feb 2025).

3. Mathematical Formulation and Training Objectives

Each SDP variant is unified in its attempt to preserve the expressive capacity of the underlying diffusion or flow distribution, while accelerating inference and/or learning. The following summarizes key mathematical components:

Streaming Flow Matching Loss

For a set of demonstrations $\{\xi\}$, learn a velocity field $v_\theta(a, t \mid h)$ that matches the analytic flow for single demonstrations,

$$v_\xi(a, t) = \dot\xi(t) - k\,(a - \xi(t))$$

The training loss is

$$L(\theta) = \mathbb{E}_{(h, \xi)} \Bigl[ \mathbb{E}_{t \sim U[0,1]}\, \mathbb{E}_{a \sim \mathcal{N}(\xi(t),\, \sigma_0^2 e^{-2kt})} \bigl\| v_\theta(a, t \mid h) - v_\xi(a, t) \bigr\|^2 \Bigr]$$

(Jiang et al., 28 May 2025)
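This loss can be estimated by Monte-Carlo sampling over $t$ and the shrinking Gaussian around the demonstration. The following one-dimensional sketch assumes illustrative `xi`, `xi_dot`, and `v_theta` callables (history conditioning omitted for brevity):

```python
import math
import random

def flow_matching_loss(v_theta, xi, xi_dot, k=5.0, sigma0=0.1, n_samples=64):
    """Monte-Carlo estimate of the streaming flow-matching loss (1-D sketch).

    `xi(t)` / `xi_dot(t)` give a demonstration trajectory and its time
    derivative; `v_theta(a, t)` is the hypothetical learned velocity field.
    """
    total = 0.0
    for _ in range(n_samples):
        t = random.random()                      # t ~ U[0, 1]
        sigma_t = sigma0 * math.exp(-k * t)      # std of N(xi(t), sigma0^2 e^{-2kt})
        a = xi(t) + random.gauss(0.0, sigma_t)   # sample around the demonstration
        target = xi_dot(t) - k * (a - xi(t))     # analytic contracting flow
        total += (v_theta(a, t) - target) ** 2
    return total / n_samples
```

Note the contraction term $-k(a - \xi(t))$: samples off the demonstration are pulled back toward it, which is what later provides stabilization at execution time.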

Score-Regularized Policy Optimization Loss

Given a behavior distribution $\mu(a \mid s)$ modeled by a diffusion model:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s}\Bigl[ \nabla_a Q_\phi(s, a) + \tfrac{1}{\beta}\,\nabla_a \log \mu(a \mid s) \Bigr]\Big|_{a=\pi_\theta(s)} \nabla_\theta \pi_\theta(s)$$

The scores $\nabla_a \log \mu(a \mid s)$ are estimated using the diffusion model's instantaneous score output (Chen et al., 2023).
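In one dimension, the ascent direction on the action (before the chain-rule factor $\nabla_\theta \pi_\theta$) can be sketched as below; `grad_q` and `score_mu` are hypothetical stand-ins for the critic gradient and the diffusion-estimated behavior score:

```python
def srpo_ascent_direction(obs, pi, grad_q, score_mu, beta=1.0):
    """SRPO-style ascent direction for a deterministic actor (1-D sketch).

    `pi(s)` is a hypothetical actor, `grad_q(s, a)` the critic gradient
    dQ/da, and `score_mu(s, a)` the behavior score d log mu(a|s)/da read
    off a pre-trained diffusion model. The actor is pushed up the Q
    landscape while being regularized toward the behavior distribution.
    """
    a = pi(obs)
    return grad_q(obs, a) + (1.0 / beta) * score_mu(obs, a)
```

Smaller `beta` regularizes the actor more strongly toward the behavior data; larger `beta` favors pure value maximization.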

Reweighted Score Matching in Online RL

For both DPMD and SDAC, the loss employs a weighting function $g(a_t; s)$ to optimize

$$\mathcal{L}^{g}(\theta; s, t) = \int g(a_t; s)\, \bigl\| s_\theta(a_t; s, t) - \nabla_{a_t} \log p_t(a_t \mid s) \bigr\|^2 \, da_t$$

with $g$ chosen so that $p_t(a_t \mid s)$ can be matched using only accessible samples (from the old policy or another tractable sampler) (Ma et al., 1 Feb 2025).
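A denoising-style Monte-Carlo sketch of this idea in one dimension: when the weighting is chosen as the density of a tractable sampler (here, the old policy), the integral becomes an expectation over samples we can actually draw, with the usual conditional-score target $-\epsilon/\sigma$ standing in for the intractable marginal score. All names are illustrative:

```python
import random

def reweighted_score_matching_loss(s_theta, sample_old_policy, obs,
                                   sigma=0.5, n=64):
    """Monte-Carlo sketch of a reweighted score-matching objective (1-D).

    `s_theta(a_t, obs, sigma)` is a hypothetical score network and
    `sample_old_policy(obs)` a tractable action sampler playing the role
    of the weighting g, so no samples from the optimal policy are needed.
    """
    total = 0.0
    for _ in range(n):
        a0 = sample_old_policy(obs)      # sample from an accessible sampler
        eps = random.gauss(0.0, 1.0)
        a_t = a0 + sigma * eps           # noised action at level sigma
        target = -eps / sigma            # conditional score of the noising kernel
        total += (s_theta(a_t, obs, sigma) - target) ** 2
    return total / n
```

Crucially, this objective never backpropagates through the full diffusion sampling chain, which is what makes the online setting tractable.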

4. Inference and Control Loop Integration

Streaming Diffusion Policies are tightly integrated with real-time receding-horizon control frameworks:

  • Warm-started ODE/Flow Integration: Each action chunk is initialized from the last executed control signal (deterministic in inference, stochastic in training), and ODE integration proceeds stepwise. At each Δt\Delta t, the current action is streamed to the robot; the integrator or denoiser computes the next, using updated observations and history (Jiang et al., 28 May 2025).
  • Partial Denoising Rollout: For policies formulated in trajectory space, each observation triggers partial denoising steps yielding a new action sequence where only the immediate action is fully refined, significantly reducing per-action latency (Høeg et al., 2024).
  • Score-based Actor Extraction: In methods like SRPO, the final extracted actor is a shallow MLP that maps observation directly to action, enabling sub-millisecond inference per action. The diversity and regularization of the diffusion behavior model is retained via score-based gradient terms (Chen et al., 2023).
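All three integration patterns share the same outer receding-horizon loop; a generic sketch, with a hypothetical `policy_step` carrying whatever state the chosen variant needs (a warm-started flow integrator, a partially denoised buffer, or nothing for a distilled actor):

```python
def control_loop(env, policy_step, n_cycles=100):
    """Generic receding-horizon streaming loop (sketch).

    `policy_step(obs, state)` returns the next action plus carried state,
    so each control cycle costs only one or a few network evaluations.
    `env` is assumed to expose `reset()` and `step(action) -> (obs, done)`.
    """
    obs = env.reset()
    state = None
    for _ in range(n_cycles):
        action, state = policy_step(obs, state)   # stream one action
        obs, done = env.step(action)              # act, observe, repeat
        if done:
            break
    return obs
```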

5. Empirical Results and Comparative Performance

Streaming Diffusion Policies have demonstrated competitive or superior performance on standard benchmarks relative to both classical and prior generative-policy approaches. Representative empirical findings include:

| Task/Policy | Success Rate (%) | Latency per Action (ms) | Key Notes |
|---|---|---|---|
| Diffusion (100 DDPM steps, Push-T) | 92.9 / 94.4 | 40.2 | Baseline, slowest |
| Fast Diffusion (10 DDIM) | 87.0 / 89.0 | 4.4 | |
| Flow-matching Policy | 80.6 / 82.6 | 5.8 | |
| SDP (Streaming) | 87.5 / 91.4 | 26.7 | Uses partial denoising/streaming |
| SFP w/o stabilization | 84.0 / 86.4 | 3.5 | |
| SFP (with stabilization) | 95.1 / 96.0 | 3.5 | Best in class, lowest latency |

  • On MuJoCo RL benchmarks, DPMD achieves improvements of +32.8% (HalfCheetah-v4), +127.3% (Ant-v4), and +143.5% (Humanoid-v4) over SAC (Ma et al., 1 Feb 2025).
  • SRPO achieves 25–1000× speedups in policy inference compared to standard diffusion policies, with near-identical returns on D4RL locomotion tasks (Chen et al., 2023).

6. Handling Multi-Modality and Distribution Shift

Streaming Diffusion Policies preserve the multi-modal nature of demonstration or behavior data by either:

  • Flow-matching to mixtures of demonstration tubes, such that distinct action sequence modes persist in the integrated velocity field; early in the horizon, the system can bifurcate onto any of the valid modes, preserving behavioral diversity (Jiang et al., 28 May 2025).
  • Regularizing deterministic or fast-sampling policies toward the full behavior policy using score functions derived from the underlying diffusion model, rather than by direct replica sampling (Chen et al., 2023).

Stabilizing terms (analytic velocity fields that contract toward demonstration trajectories) further reduce distribution shift, improving generalization and imitation fidelity in both simulation and real-world execution.

7. Practical Considerations and Limitations

SDP and related frameworks introduce new efficiencies and design constraints:

  • Neural architectures for streaming methods are typically compact MLPs with up to three layers and 256 units, leveraging time-embedding or explicit velocity field conditioning for ODE-based methods (Ma et al., 1 Feb 2025, Jiang et al., 28 May 2025).
  • The value of the initial spread (warm start variance) and stabilization constant are critical hyperparameters for ODE-based samplers.
  • Number of diffusion/integration steps (10–20 for most tasks) balances accuracy and real-time execution constraints.
  • Empirical ablations show that proper weighting, Q-value normalization, and exploration procedures are essential for stability and optimality (Ma et al., 1 Feb 2025).
  • A plausible implication is that the reliance on good pre-trained behavior models or demonstration coverage remains a limiting factor in scaling these methods to high-dimensional perception-action tasks or those with severe compounding error (Chen et al., 2023).
  • Training time for diffusion models is non-trivial, though streaming policy extraction itself is efficient post-training.

References

  • "Streaming Diffusion Policy: Fast Policy Synthesis with Variable Noise Diffusion Models" (Høeg et al., 2024)
  • "Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories" (Jiang et al., 28 May 2025)
  • "Score Regularized Policy Optimization through Diffusion Behavior" (Chen et al., 2023)
  • "Efficient Online Reinforcement Learning for Diffusion Policy" (Ma et al., 1 Feb 2025)
