Balancing fast inference and multi-modality in diffusion-based robotic policies

Determine algorithmic strategies that let diffusion-based visuomotor policies run fast inference while preserving and promoting multi-modal action distributions in complex robotic manipulation settings. Such strategies must resolve the trade-off between stochastic reverse-SDE sampling, which is diverse but slow, and deterministic probability-flow ODE distillation, which is fast but prone to mode collapse.

Background

Diffusion-based visuomotor policies built on stochastic reverse SDEs can capture diverse behaviors but require many denoising steps, leading to high latency at inference. In contrast, ODE-based distillation and consistency models enable fast, few-step or one-step sampling but risk collapsing to dominant modes, reducing behavioral diversity and robustness.
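For reference, the two samplers contrasted above can be written in the standard score-based notation. This notation is an assumption for illustration and is not taken from the paper: x_t is the noised action, f and g are the drift and diffusion coefficients of the forward process, p_t is the marginal density at time t, and \bar{w}_t is a reverse-time Wiener process.

```latex
% Reverse-time SDE (stochastic sampler): fresh noise is injected at every step.
\mathrm{d}x_t = \left[ f(x_t, t) - g(t)^2 \, \nabla_{x} \log p_t(x_t) \right] \mathrm{d}t + g(t) \, \mathrm{d}\bar{w}_t

% Probability-flow ODE (deterministic sampler): same marginals p_t, no noise term.
\mathrm{d}x_t = \left[ f(x_t, t) - \tfrac{1}{2} \, g(t)^2 \, \nabla_{x} \log p_t(x_t) \right] \mathrm{d}t
```

Both processes share the same marginals p_t when the score is exact. The ODE defines a deterministic map from a noised sample to a clean one, and this map is what consistency-style distillation compresses into one or a few network evaluations.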

The paper frames this tension as a central deployment challenge for real-world robot manipulation tasks, where solution spaces are inherently multi-modal. The authors explicitly identify it as an open problem and propose the Hybrid Consistency Policy (HCP) as a partial resolution that combines a short stochastic SDE prefix with a one-step ODE consistency jump. Nonetheless, the broader methodological question of how to achieve both speed and multi-modality in general across complex robotic settings remains open.
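The idea of a short stochastic prefix followed by a deterministic jump can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions and is not the authors' implementation: the "action" is a scalar with two modes at -1 and +1, the noise schedule is variance-exploding with noise level t, the score is known in closed form, and the hypothetical `oracle_jump` stands in for a trained consistency model by numerically integrating the probability-flow ODE. In a real policy that jump would be a single network evaluation. All names and parameter values are illustrative.

```python
import math
import random

S_DATA = 0.05  # std of each data mode at -1 and +1 (toy assumption)


def score(x, t):
    """Closed-form score of the noised two-mode mixture at noise level t."""
    v = S_DATA ** 2 + t ** 2
    # For equal-weight modes at -1 and +1, E[mode | x] = tanh(x / v).
    return (math.tanh(x / v) - x) / v


def geomspace(a, b, n):
    """Return n + 1 geometrically spaced noise levels from a down to b."""
    return [a * (b / a) ** (i / n) for i in range(n + 1)]


def sde_prefix(x, ts, rng):
    """A few Euler-Maruyama steps of the reverse SDE (noise added each step)."""
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dv = t_cur ** 2 - t_next ** 2  # variance removed in this step
        x += dv * score(x, t_cur) + math.sqrt(dv) * rng.gauss(0.0, 1.0)
    return x


def oracle_jump(x, t_start, t_end=0.01, n_fine=200):
    """Stand-in for a one-step consistency model: follow the probability-flow
    ODE from t_start to t_end with fine Euler steps (no noise injected)."""
    ts = geomspace(t_start, t_end, n_fine)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dv = t_cur ** 2 - t_next ** 2
        x += 0.5 * dv * score(x, t_cur)
    return x


def hybrid_sample(n, t_max=5.0, t_mid=1.0, n_sde=4, seed=0):
    """Draw n samples: stochastic prefix t_max -> t_mid, then one jump."""
    rng = random.Random(seed)
    sde_ts = geomspace(t_max, t_mid, n_sde)
    out = []
    for _ in range(n):
        x = t_max * rng.gauss(0.0, 1.0)  # approximate prior at t_max
        x = sde_prefix(x, sde_ts, rng)
        out.append(oracle_jump(x, t_mid))
    return out


samples = hybrid_sample(2000)
frac_pos = sum(1 for s in samples if s > 0) / len(samples)
mode_err = sum(abs(abs(s) - 1.0) for s in samples) / len(samples)
print(f"fraction in +1 mode: {frac_pos:.2f}, mean distance to a mode: {mode_err:.3f}")
```

In this sketch the switch point `t_mid` and the prefix length `n_sde` control the trade-off: a longer prefix injects more noise, while the jump accounts for most of the saved step count once it is replaced by a learned one-step model. The toy only shows the sampling structure and says nothing about how well a learned jump preserves modes in practice.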

References

Consequently, an open problem is how diffusion-based methods can retain fast inference while preserving and promoting multi-modality in complex robotic settings.

Hybrid Consistency Policy: Decoupling Multi-Modal Diversity and Real-Time Efficiency in Robotic Manipulation (2510.26670 - Zhao et al., 30 Oct 2025) in Section 1 (Introduction)