Time-Unified Diffusion Policy for Robotics

Updated 30 June 2025
  • TUDP is an advanced framework for robotic manipulation that unifies the action denoising process across time, simplifying policy inference and improving real-time performance.
  • It introduces a unique action discrimination mechanism to resolve ambiguity in multi-target settings, ensuring consistent convergence towards successful actions.
  • Empirical evaluations on RLBench and real-robot tasks demonstrate TUDP's high success rates, computational efficiency, and robustness in low-data environments.

Time-Unified Diffusion Policy (TUDP) is an advanced framework for robotic manipulation that improves the efficiency and accuracy of diffusion-based policy inference by unifying the action denoising process across all timesteps, augmented by an integrated action discrimination mechanism. TUDP departs from classic time-varying denoising methods by constructing a time-invariant velocity field in action space, simplifying the learning problem and enabling real-time robotic action generation with state-of-the-art performance on standard manipulation benchmarks (2506.09422).

1. Unified Denoising in Diffusion Policy Inference

Conventional diffusion-based policies for robotic manipulation model the policy as a conditional denoising process: starting from Gaussian noise, a neural network predicts incremental corrections (the velocity field) iteratively, often conditioning explicitly on the denoising timestep. This timestep dependency induces temporal complexity and requires many inference steps, leading to elevated computational cost and suboptimal accuracy in scenarios with strict latency constraints.

TUDP circumvents these limitations by learning a single, unified velocity field for action denoising. The model's neural predictor $\epsilon_\vartheta$ takes as input only the scene observation $x$ and the current noisy action $y_t$, and outputs the velocity $\epsilon$ for denoising:

$$y_{t+1} = y_t - \epsilon, \quad \epsilon = \epsilon_\vartheta(x, y_t)$$

The velocity field is constructed to be invariant to the timestep, making the entire denoising trajectory consistent and reducing both the sample and computational complexity of policy learning and inference.
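
This update rule amounts to a short iterative loop. The sketch below is a minimal illustration, not the paper's implementation: `velocity_net` is a hypothetical stand-in for $\epsilon_\vartheta$, the step count is arbitrary, and plain NumPy arrays play the role of observations and actions.

```python
import numpy as np

def tudp_inference(velocity_net, x, action_dim, num_steps=10, rng=None):
    """Time-unified denoising: the same velocity predictor is applied at
    every iteration, with no explicit timestep conditioning."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.standard_normal(action_dim)   # start from Gaussian noise
    for _ in range(num_steps):
        eps = velocity_net(x, y)          # epsilon = eps_theta(x, y_t)
        y = y - eps                       # y_{t+1} = y_t - epsilon
    return y

# Toy usage with a dummy "network" that pulls actions toward the origin.
if __name__ == "__main__":
    dummy_net = lambda x, y: 0.5 * y
    print(tudp_inference(dummy_net, x=None, action_dim=7))
```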

2. Action Discrimination Mechanism

TUDP introduces an explicit action discrimination network to resolve the ambiguity present in action spaces with multiple, sparsely distributed successful actions. This network estimates an action score $s$ for each noisy action $y$ with respect to a reference successful action $\hat{y}$:

$$\hat{s}(y, \hat{y}) = e^{m \cdot \text{ReLU}(\|y - \hat{y}\| - l)}$$

where $m < -10$ is a steep negative slope and $l$ is a neighborhood threshold.

The score distinguishes whether a noisy action lies within a success neighborhood of a reference action, providing critical disambiguation for the diffusion network during training. This mechanism reduces denoising confusion in multi-target settings and leads to more accurate convergence towards a correct successful action.
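
For concreteness, the target score $\hat{s}(y, \hat{y})$ can be evaluated directly as below. This is a sketch only; the slope $m$ and neighborhood threshold $l$ are assumed values chosen for illustration, not the paper's hyperparameters.

```python
import numpy as np

def action_score(y, y_hat, m=-20.0, l=0.1):
    """Target score s_hat(y, y_hat) = exp(m * ReLU(||y - y_hat|| - l)).
    Equal to 1 inside the success neighborhood (distance <= l) and decaying
    sharply outside it, since m is strongly negative."""
    dist = np.linalg.norm(y - y_hat)
    return np.exp(m * max(dist - l, 0.0))

# Inside the neighborhood the score is exactly 1; far away it vanishes.
y_hat = np.zeros(7)
offset = np.zeros(7)
offset[0] = 0.05
print(action_score(y_hat + offset, y_hat))  # -> 1.0 (within distance l)
offset[0] = 1.0
print(action_score(y_hat + offset, y_hat))  # -> ~1.5e-8 (far outside)
```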

3. Construction and Role of the Time-Unified Velocity Field

The core of TUDP is its construction of a timestep-invariant velocity field that consistently points noisy actions toward the nearest successful target.

For a reference successful action $\hat{y}$, the velocity towards it is:

$$\varepsilon(y|\hat{y}) = v \cdot \frac{y - \hat{y}}{\max\{v, \|y - \hat{y}\|\}}$$

where $v$ caps the maximal denoising step per iteration.
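
A direct transcription of this capped velocity is given below as a sketch; the step cap $v$ is an arbitrary illustrative value.

```python
import numpy as np

def velocity_to_target(y, y_hat, v=0.2):
    """epsilon(y | y_hat) = v * (y - y_hat) / max(v, ||y - y_hat||).
    Far from the target the step has magnitude v; within distance v,
    a single update y - epsilon lands exactly on y_hat."""
    diff = y - y_hat
    return v * diff / max(v, np.linalg.norm(diff))

y_hat = np.zeros(2)
print(np.linalg.norm(velocity_to_target(np.array([1.0, 0.0]), y_hat)))        # 0.2, capped step
print(np.array([0.1, 0.0]) - velocity_to_target(np.array([0.1, 0.0]), y_hat)) # lands on y_hat
```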

To resolve conflicts when multiple potential targets are present, TUDP defines a correlation weight $\lambda$:

$$\lambda(y, \hat{y}^i) = \begin{cases} 0 & \text{if}~\min_{j \ne i} \|y - \hat{y}^j\| \leq l \\ 1 & \text{otherwise} \end{cases}$$

The unified velocity field aggregates contributions from multiple targets:

$$\epsilon(y) = \sum_{i=1}^k \frac{1}{k} \, p(y|\hat{y}^i) \, \lambda(y,\hat{y}^i) \, \varepsilon(y|\hat{y}^i)$$

where $p(y|\hat{y}^i)$ is the noise distribution around each target.

This construction ensures that the denoising trajectory is globally well-defined, independent of the timestep, and that each noisy action is directed towards a unique, context-appropriate successful action.
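
Putting the pieces together, the aggregation can be sketched as follows. It reuses `velocity_to_target` from the previous sketch and assumes an unnormalized isotropic Gaussian for $p(y|\hat{y}^i)$ with an illustrative width; the paper's actual noise distribution and weighting may differ.

```python
import numpy as np

def velocity_to_target(y, y_hat, v=0.2):  # as in the previous sketch
    diff = y - y_hat
    return v * diff / max(v, np.linalg.norm(diff))

def correlation_weight(y, targets, i, l=0.1):
    """lambda(y, y_hat^i): 0 if y lies within distance l of any *other*
    target, 1 otherwise, so each noisy action is claimed by a single target."""
    others = [t for j, t in enumerate(targets) if j != i]
    if not others:
        return 1.0
    return 0.0 if min(np.linalg.norm(y - t) for t in others) <= l else 1.0

def unified_velocity(y, targets, v=0.2, l=0.1, sigma=1.0):
    """epsilon(y) = (1/k) * sum_i p(y|y_hat^i) * lambda(y, y_hat^i)
    * epsilon(y|y_hat^i), with p approximated by an unnormalized Gaussian."""
    k = len(targets)
    eps = np.zeros_like(y)
    for i, y_hat in enumerate(targets):
        p = np.exp(-np.linalg.norm(y - y_hat) ** 2 / (2 * sigma ** 2))
        eps += p * correlation_weight(y, targets, i, l) * velocity_to_target(y, y_hat, v)
    return eps / k

# Two successful targets: subtracting the velocity moves y toward the nearer one.
targets = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
y = np.array([0.8, 0.1])
print(unified_velocity(y, targets))
```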

4. Action-Wise Training Procedure

TUDP’s training pipeline proceeds in two phases:

  1. Action Discrimination Training: The action discrimination network $s_\theta(x, y)$ is trained to regress the action score $\hat{s}(y, \hat{y})$ using:

$$\mathcal{L}_{score} = \mathbb{E}_{p(x, \hat{y}) p(y|\hat{y})}\|s_\theta(x, y) - \hat{s}(y, \hat{y})\|$$

  2. Diffusion Network Training: The denoising network $\epsilon_\vartheta(x, y)$, informed by the correlation weight (computed via $s_\theta$), learns to predict the velocity:

$$\mathcal{L}_{action} = \mathbb{E}_{p(x, \hat{y})p(y|\hat{y})} \|\epsilon_\vartheta(x, y) - \lambda(y, \hat{y}) \varepsilon(y|\hat{y})\|$$

An auxiliary loss on the gripper opening may be included:

$$\mathcal{L}_{noise} = \mathcal{L}_{action} + w_{open}\mathcal{L}_{open}$$

with

$$\mathcal{L}_{open} = y_{\mathrm{open}} \log(\hat{y}_{\mathrm{open}}) + (1 - y_{\mathrm{open}})\log(1 - \hat{y}_{\mathrm{open}})$$

This approach ensures that the network learns consistent, unambiguous denoising, focusing on the correct neighborhood in action space and efficiently mapping noisy actions to successful behaviors.
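
As a rough sketch of how the two phases assemble their regression targets, the per-sample losses could look like the following. Here `score_net` and `velocity_net` are hypothetical callables, `lam` is the correlation weight obtained from the trained score network, the weighting $w_{open}$ is assumed, and the gripper term is written with the standard negative-log binary cross-entropy convention; none of this is the paper's actual training code.

```python
import numpy as np

def score_loss(score_net, x, y, y_hat, m=-20.0, l=0.1):
    """Phase 1: regress s_theta(x, y) onto the target score s_hat(y, y_hat)."""
    target = np.exp(m * max(np.linalg.norm(y - y_hat) - l, 0.0))
    return np.abs(score_net(x, y) - target)

def action_loss(velocity_net, x, y, y_hat, lam, v=0.2):
    """Phase 2: regress eps_theta(x, y) onto lambda * epsilon(y | y_hat)."""
    diff = y - y_hat
    target = lam * v * diff / max(v, np.linalg.norm(diff))
    return np.linalg.norm(velocity_net(x, y) - target)

def noise_loss(velocity_net, x, y, y_hat, lam, open_pred, open_label, w_open=0.1):
    """L_noise = L_action + w_open * L_open, with L_open a binary
    cross-entropy term on the predicted gripper opening."""
    l_open = -(open_label * np.log(open_pred)
               + (1 - open_label) * np.log(1 - open_pred))
    return action_loss(velocity_net, x, y, y_hat, lam) + w_open * l_open
```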

5. Empirical Results and Benchmarking

TUDP was evaluated on the RLBench suite in both multi-view (18 tasks, 4 camera views) and single-view (18 tasks, 1 view) setups. Key findings:

  • Multi-view: TUDP achieved a mean success rate of 82.6%, the highest reported value, and the best average rank (1.9) across all tested policies. Its margin over baselines widened as the number of denoising steps was reduced, indicating superior robustness and efficiency.
  • Single-view: TUDP reached a success rate of 83.8% and outperformed all baselines.
  • Few-step efficiency: When the number of denoising steps was reduced, TUDP showed greater performance retention compared to competitors such as 3D Diffuser Actor and database-initialized policies.
  • Real-robot transfer: On six language-conditioned UR5 tasks, TUDP achieved 85% success (51/60 trials) with limited demonstrations.

These results indicate that the time-unified framework and action discrimination mechanisms enable both high accuracy and a substantial reduction in computational and memory costs at inference.

6. Practical Applications and Design Considerations

TUDP’s performance profile and architectural features make it suitable for several practical robotic contexts:

  • Real-time robotic manipulation: TUDP’s ability to generate accurate policies with fewer inference steps makes it compatible with industrial or service robots requiring low-latency responses in dynamic environments.
  • Multi-task robotics: The model’s unification and discrimination capabilities allow straightforward extension to diverse tasks sharing a common action space.
  • Low-data environments: The improved efficiency and reduced model complexity support better generalization when demonstration data is scarce or costly to obtain.
  • Sim-to-real transfer: TUDP maintains high data efficiency and robustness in transferring learned behaviors from simulation to real-world robotic platforms.

Limitations include dependence on proper hyperparameter selection (e.g., the neighborhood radius $l$) and potential challenges in domains where successful actions are densely clustered or highly ambiguous.

7. Comparison with Alternative Diffusion-Based Policies

| Method | Denoising | Efficiency | Accuracy | Generalization | Notes |
|---|---|---|---|---|---|
| TUDP | Time-unified | High | High | High | SOTA on RLBench, robust to few steps |
| 3D Diffuser Actor | Time-varying | Moderate | High | High | Slower, less robust to step reduction |
| READ | Database init | High (init) | Moderate | Low | Fast but domain-limited |
| ManiCM, FlowPolicy | Distill/Flow | High | Moderate | Varies | May trade accuracy for speed |
| Standard baselines | Varies | Low/Med | Moderate | Varies | Lag on complex, multi-modal tasks |

TUDP’s principal strengths stem from its invariant velocity field, explicit action discrimination, and ability to avoid timestep and teacher-policy dependencies. Comparative performance remains state-of-the-art in both benchmark and real-world scenarios, particularly when computational efficiency is a primary concern.


TUDP establishes an efficient and accurate paradigm for diffusion-based visuomotor policy inference, unifying action denoising in time and enhancing discrimination among successful behaviors. Its approach directly addresses long-standing efficiency and accuracy bottlenecks in diffusion policy inference, and current empirical evidence places it at the forefront of diffusion-based methods for real-world robotic manipulation (2506.09422).
