Time-Unified Diffusion Policy for Robotics

Updated 30 June 2025
  • TUDP is an advanced framework for robotic manipulation that unifies the action denoising process across time, simplifying policy inference and improving real-time performance.
  • It introduces a unique action discrimination mechanism to resolve ambiguity in multi-target settings, ensuring consistent convergence towards successful actions.
  • Empirical evaluations on RLBench and real-robot tasks demonstrate TUDP's high success rates, computational efficiency, and robustness in low-data environments.

Time-Unified Diffusion Policy (TUDP) is an advanced framework for robotic manipulation that improves the efficiency and accuracy of diffusion-based policy inference by unifying the action denoising process across all timesteps, augmented by an integrated action discrimination mechanism. TUDP departs from classic time-varying denoising methods by constructing a time-invariant velocity field in action space, simplifying the learning problem and enabling real-time robotic action generation with state-of-the-art performance on standard manipulation benchmarks (2506.09422).

1. Unified Denoising in Diffusion Policy Inference

Conventional diffusion-based policies for robotic manipulation model the policy as a conditional denoising process: starting from Gaussian noise, a neural network predicts incremental corrections (the velocity field) iteratively, often conditioning explicitly on the denoising timestep. This timestep dependency induces temporal complexity and requires many inference steps, leading to elevated computational cost and suboptimal accuracy in scenarios with strict latency constraints.

TUDP circumvents these limitations by learning a single, unified velocity field for action denoising. The model's neural predictor $\epsilon_\vartheta$ takes as input only the scene observation $x$ and the current noisy action $y_t$, and outputs the velocity $\epsilon$ for denoising:

$$y_{t+1} = y_t - \epsilon, \quad \epsilon = \epsilon_\vartheta(x, y_t)$$

The velocity field is constructed to be invariant to the timestep, making the entire denoising trajectory consistent and reducing both the sample and computational complexity of policy learning and inference.
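
This update rule amounts to a short iterative loop. The sketch below is a minimal illustration, not the paper's implementation: `velocity_net` is a hypothetical stand-in for $\epsilon_\vartheta$, the step count is arbitrary, and plain NumPy arrays play the role of observations and actions.

```python
import numpy as np

def tudp_inference(velocity_net, x, action_dim, num_steps=10, rng=None):
    """Time-unified denoising: the same velocity predictor is applied at
    every iteration, with no explicit timestep conditioning."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.standard_normal(action_dim)   # start from Gaussian noise
    for _ in range(num_steps):
        eps = velocity_net(x, y)          # epsilon = eps_theta(x, y_t)
        y = y - eps                       # y_{t+1} = y_t - epsilon
    return y

# Toy usage with a dummy "network" that pulls actions toward the origin.
if __name__ == "__main__":
    dummy_net = lambda x, y: 0.5 * y
    print(tudp_inference(dummy_net, x=None, action_dim=7))
```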

2. Action Discrimination Mechanism

TUDP introduces an explicit action discrimination network to resolve the ambiguity present in action spaces with multiple, sparsely distributed successful actions. This network estimates an action score $s$ for each noisy action $y$ with respect to a reference successful action $\hat{y}$:

$$\hat{s}(y, \hat{y}) = e^{m \cdot \text{ReLU}(\|y - \hat{y}\| - l)}$$

where $m < -10$ is a steep negative slope and $l$ is a neighborhood threshold.

The score distinguishes whether a noisy action lies within a success neighborhood of a reference action, providing critical disambiguation for the diffusion network during training. This mechanism reduces denoising confusion in multi-target settings and leads to more accurate convergence towards a correct successful action.
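
For concreteness, the target score $\hat{s}(y, \hat{y})$ can be evaluated directly as below. This is a sketch only; the slope $m$ and neighborhood threshold $l$ are assumed values chosen for illustration, not the paper's hyperparameters.

```python
import numpy as np

def action_score(y, y_hat, m=-20.0, l=0.1):
    """Target score s_hat(y, y_hat) = exp(m * ReLU(||y - y_hat|| - l)).
    Equal to 1 inside the success neighborhood (distance <= l) and decaying
    sharply outside it, since m is strongly negative."""
    dist = np.linalg.norm(y - y_hat)
    return np.exp(m * max(dist - l, 0.0))

# Inside the neighborhood the score is exactly 1; far away it vanishes.
y_hat = np.zeros(7)
offset = np.zeros(7)
offset[0] = 0.05
print(action_score(y_hat + offset, y_hat))  # -> 1.0 (within distance l)
offset[0] = 1.0
print(action_score(y_hat + offset, y_hat))  # -> ~1.5e-8 (far outside)
```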

3. Construction and Role of the Time-Unified Velocity Field

The core of TUDP is its construction of a timestep-invariant velocity field that consistently points noisy actions toward the nearest successful target.

For a reference successful action $\hat{y}$, the velocity towards it is:

$$\varepsilon(y|\hat{y}) = v \cdot \frac{y - \hat{y}}{\max\{v, \|y - \hat{y}\|\}}$$

where $v$ caps the maximal denoising step per iteration.
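
A direct transcription of this capped velocity is given below as a sketch; the step cap $v$ is an arbitrary illustrative value.

```python
import numpy as np

def velocity_to_target(y, y_hat, v=0.2):
    """epsilon(y | y_hat) = v * (y - y_hat) / max(v, ||y - y_hat||).
    Far from the target the step has magnitude v; within distance v,
    a single update y - epsilon lands exactly on y_hat."""
    diff = y - y_hat
    return v * diff / max(v, np.linalg.norm(diff))

y_hat = np.zeros(2)
print(np.linalg.norm(velocity_to_target(np.array([1.0, 0.0]), y_hat)))        # 0.2, capped step
print(np.array([0.1, 0.0]) - velocity_to_target(np.array([0.1, 0.0]), y_hat)) # lands on y_hat
```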

To resolve conflicts when multiple potential targets are present, TUDP defines a correlation weight $\lambda$:

$$\lambda(y, \hat{y}^i) = \begin{cases} 0 & \text{if}~\min_{j \ne i} \|y - \hat{y}^j\| \leq l \\ 1 & \text{otherwise} \end{cases}$$

The unified velocity field aggregates contributions from multiple targets:

$$\epsilon(y) = \sum_{i=1}^k \frac{1}{k} \, p(y|\hat{y}^i) \, \lambda(y,\hat{y}^i) \, \varepsilon(y|\hat{y}^i)$$

where $p(y|\hat{y}^i)$ is the noise distribution around each target.

This construction ensures that the denoising trajectory is globally well-defined, independent of the timestep, and that each noisy action is directed towards a unique, context-appropriate successful action.
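
Putting the pieces together, the aggregation can be sketched as follows. It reuses `velocity_to_target` from the previous sketch and assumes an unnormalized isotropic Gaussian for $p(y|\hat{y}^i)$ with an illustrative width; the paper's actual noise distribution and weighting may differ.

```python
import numpy as np

def velocity_to_target(y, y_hat, v=0.2):  # as in the previous sketch
    diff = y - y_hat
    return v * diff / max(v, np.linalg.norm(diff))

def correlation_weight(y, targets, i, l=0.1):
    """lambda(y, y_hat^i): 0 if y lies within distance l of any *other*
    target, 1 otherwise, so each noisy action is claimed by a single target."""
    others = [t for j, t in enumerate(targets) if j != i]
    if not others:
        return 1.0
    return 0.0 if min(np.linalg.norm(y - t) for t in others) <= l else 1.0

def unified_velocity(y, targets, v=0.2, l=0.1, sigma=1.0):
    """epsilon(y) = (1/k) * sum_i p(y|y_hat^i) * lambda(y, y_hat^i)
    * epsilon(y|y_hat^i), with p approximated by an unnormalized Gaussian."""
    k = len(targets)
    eps = np.zeros_like(y)
    for i, y_hat in enumerate(targets):
        p = np.exp(-np.linalg.norm(y - y_hat) ** 2 / (2 * sigma ** 2))
        eps += p * correlation_weight(y, targets, i, l) * velocity_to_target(y, y_hat, v)
    return eps / k

# Two successful targets: subtracting the velocity moves y toward the nearer one.
targets = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
y = np.array([0.8, 0.1])
print(unified_velocity(y, targets))
```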

4. Action-Wise Training Procedure

TUDP’s training pipeline proceeds in two phases:

  1. Action Discrimination Training: The action discrimination network $s_\theta(x, y)$ is trained to regress the action score $\hat{s}(y, \hat{y})$ using:

$$\mathcal{L}_{score} = \mathbb{E}_{p(x, \hat{y}) p(y|\hat{y})}\|s_\theta(x, y) - \hat{s}(y, \hat{y})\|$$

  2. Diffusion Network Training: The denoising network $\epsilon_\vartheta(x, y)$, informed by the correlation weight (computed via $s_\theta$), learns to predict the velocity:

$$\mathcal{L}_{action} = \mathbb{E}_{p(x, \hat{y})p(y|\hat{y})} \|\epsilon_\vartheta(x, y) - \lambda(y, \hat{y}) \varepsilon(y|\hat{y})\|$$

An auxiliary loss on the gripper opening may be included:

$$\mathcal{L}_{noise} = \mathcal{L}_{action} + w_{open}\mathcal{L}_{open}$$

with

$$\mathcal{L}_{open} = y_{\mathrm{open}} \log(\hat{y}_{\mathrm{open}}) + (1 - y_{\mathrm{open}})\log(1 - \hat{y}_{\mathrm{open}})$$

This approach ensures that the network learns consistent, unambiguous denoising, focusing on the correct neighborhood in action space and efficiently mapping noisy actions to successful behaviors.
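
As a rough sketch of how the two phases assemble their regression targets, the per-sample losses could look like the following. Here `score_net` and `velocity_net` are hypothetical callables, `lam` is the correlation weight obtained from the trained score network, the weighting $w_{open}$ is assumed, and the gripper term is written with the standard negative-log binary cross-entropy convention; none of this is the paper's actual training code.

```python
import numpy as np

def score_loss(score_net, x, y, y_hat, m=-20.0, l=0.1):
    """Phase 1: regress s_theta(x, y) onto the target score s_hat(y, y_hat)."""
    target = np.exp(m * max(np.linalg.norm(y - y_hat) - l, 0.0))
    return np.abs(score_net(x, y) - target)

def action_loss(velocity_net, x, y, y_hat, lam, v=0.2):
    """Phase 2: regress eps_theta(x, y) onto lambda * epsilon(y | y_hat)."""
    diff = y - y_hat
    target = lam * v * diff / max(v, np.linalg.norm(diff))
    return np.linalg.norm(velocity_net(x, y) - target)

def noise_loss(velocity_net, x, y, y_hat, lam, open_pred, open_label, w_open=0.1):
    """L_noise = L_action + w_open * L_open, with L_open a binary
    cross-entropy term on the predicted gripper opening."""
    l_open = -(open_label * np.log(open_pred)
               + (1 - open_label) * np.log(1 - open_pred))
    return action_loss(velocity_net, x, y, y_hat, lam) + w_open * l_open
```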

5. Empirical Results and Benchmarking

TUDP was evaluated on the RLBench suite in both multi-view (18 tasks, 4 camera views) and single-view (18 tasks, 1 view) setups. Key findings:

  • Multi-view: TUDP achieved a mean success rate of 82.6%, the highest reported value, and the best average rank (1.9) across all tested policies. Its margin over baselines widened as the number of denoising steps was reduced, indicating superior robustness and efficiency.
  • Single-view: TUDP reached a success rate of 83.8% and outperformed all baselines.
  • Few-step efficiency: When the number of denoising steps was reduced, TUDP showed greater performance retention compared to competitors such as 3D Diffuser Actor and database-initialized policies.
  • Real-robot transfer: On six language-conditioned UR5 tasks, TUDP achieved 85% success (51/60 trials) with limited demonstrations.

These results indicate that the time-unified framework and action discrimination mechanisms enable both high accuracy and a substantial reduction in computational and memory costs at inference.

6. Practical Applications and Design Considerations

TUDP’s performance profile and architectural features make it suitable for several practical robotic contexts:

  • Real-time robotic manipulation: TUDP’s ability to generate accurate policies with fewer inference steps makes it compatible with industrial or service robots requiring low-latency responses in dynamic environments.
  • Multi-task robotics: The model’s unification and discrimination capabilities allow straightforward extension to diverse tasks sharing a common action space.
  • Low-data environments: The improved efficiency and reduced model complexity support better generalization when demonstration data is scarce or costly to obtain.
  • Sim-to-real transfer: TUDP maintains high data efficiency and robustness in transferring learned behaviors from simulation to real-world robotic platforms.

Limitations include dependence on proper hyperparameter selection (e.g., the neighborhood radius $l$) and potential challenges in domains where successful actions are densely clustered or highly ambiguous.

7. Comparison with Alternative Diffusion-Based Policies

| Method | Denoising | Efficiency | Accuracy | Generalization | Notes |
|---|---|---|---|---|---|
| TUDP | Time-unified | High | High | High | SOTA on RLBench, robust to few steps |
| 3D Diffuser Actor | Time-varying | Moderate | High | High | Slower, less robust to step reduction |
| READ | Database init | High (init) | Moderate | Low | Fast but domain-limited |
| ManiCM, FlowPolicy | Distill/Flow | High | Moderate | Varies | May trade accuracy for speed |
| Standard baselines | Varies | Low/Med | Moderate | Varies | Lag on complex, multi-modal tasks |

TUDP’s principal strengths stem from its invariant velocity field, explicit action discrimination, and ability to avoid timestep and teacher-policy dependencies. Comparative performance remains state-of-the-art in both benchmark and real-world scenarios, particularly when computational efficiency is a primary concern.


TUDP establishes an efficient and accurate paradigm for diffusion-based visuomotor policy inference, unifying action denoising in time and enhancing discrimination among successful behaviors. Its approach directly addresses long-standing efficiency and accuracy bottlenecks in diffusion policy inference, and current empirical evidence places it at the forefront of diffusion-based methods for real-world robotic manipulation (2506.09422).
