Conditioning Matters: Training Diffusion Policies is Faster Than You Think
The paper investigates the challenges of conditional diffusion policy training for vision-language-action (VLA) models in robotics. The authors identify a central failure mode they term "loss collapse," in which training degrades because the generative conditions become effectively indistinguishable to the model. To address this, they introduce Cocos, a method that redefines the source distribution of the generative process to be condition-dependent. Their analysis and experiments show that this condition-dependent source yields significant improvements in convergence speed and task success rates.
Core Contributions
- Loss Collapse Identification: The paper precisely diagnoses loss collapse, which occurs when the training objective inadvertently reduces to modeling the marginal action distribution rather than the condition-specific one. This degradation not only slows training but also weakens the model's ability to use its condition inputs (a short formalization of this intuition follows the list below).
- Cocos Proposal: The authors propose Cocos, which modifies the source distribution used in conditional flow matching. By anchoring the source distribution to the semantic content of the condition, Cocos forces the policy network to rely on its condition inputs, preventing loss collapse and preserving condition sensitivity throughout training (see the code sketch after this list).
- Theoretical Justification: The authors give a formal argument that a condition-dependent source distribution rules out the degenerate solution behind loss collapse, so the training objective cannot silently reduce to the marginal action distribution. This proof underpins their empirical claims and design choices.
- Empirical Validation: Experiments across simulation and real-world benchmarks confirm Cocos' effectiveness: faster convergence, high success rates with fewer gradient steps, and strong performance from models with fewer parameters than standard diffusion policy training. On the LIBERO benchmarks, the diffusion policy trained with Cocos reaches its performance milestones more than twice as fast as the standard recipe.
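To make the loss-collapse argument concrete, here is a brief sketch in standard flow-matching notation; the symbols (v_theta, x_0, x_1, g, sigma) are generic placeholders, and the paper's exact formulation may differ.

```latex
% Standard conditional flow matching: condition-independent Gaussian source.
%   x_0 ~ N(0, I),   x_1 ~ p(x_1 | c),   x_t = (1 - t) x_0 + t x_1
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{c,\, x_1,\, x_0,\, t}
    \big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|^2 .
% The minimizer is the conditional expectation
%   v^*(x_t, t, c) = E[ x_1 - x_0 | x_t, t, c ].
% If the conditions carry little distinguishable signal, a network that ignores c
% and fits the marginal field E[ x_1 - x_0 | x_t, t ] attains nearly the same loss:
% this is the degenerate solution behind loss collapse.
% Cocos-style fix (sketch): make the source distribution condition-dependent,
%   x_0 ~ N(g(c), \sigma^2 I),
% so the velocity target x_1 - x_0 itself depends on c and the marginal shortcut
% no longer minimizes the objective.
```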
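Below is a minimal PyTorch-style sketch of how a condition-dependent source changes the flow-matching training step. Names such as `policy`, `cond_encoder`, and `source_scale` are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch contrasting standard conditional flow matching with a
# Cocos-style condition-dependent source distribution.
import torch
import torch.nn.functional as F

def cfm_loss(policy, actions, cond_emb, cocos=False, cond_encoder=None, source_scale=1.0):
    """One flow-matching training step (illustrative, not the paper's code).

    actions:  (B, action_dim) ground-truth actions (flattened for simplicity)
    cond_emb: (B, cond_dim)   vision-language condition embedding
    """
    B = actions.shape[0]
    t = torch.rand(B, 1, device=actions.device)        # interpolation time in [0, 1]

    if cocos and cond_encoder is not None:
        # Cocos-style: center the source distribution on a projection of the
        # condition, so the velocity target itself depends on the condition.
        mu = cond_encoder(cond_emb)                     # (B, action_dim)
        x0 = mu + source_scale * torch.randn_like(actions)
    else:
        # Standard CFM: condition-independent Gaussian source. If conditions are
        # hard to tell apart, the model can fit the marginal velocity field and
        # effectively ignore the conditioning ("loss collapse").
        x0 = torch.randn_like(actions)

    x1 = actions
    xt = (1.0 - t) * x0 + t * x1                        # linear interpolation path
    target_velocity = x1 - x0                           # constant velocity along the path
    pred_velocity = policy(xt, t, cond_emb)             # condition-aware velocity prediction
    return F.mse_loss(pred_velocity, target_velocity)
```

The only difference between the two branches is where x0 comes from; the interpolation path, velocity target, and loss are unchanged, which is why a condition-dependent source can act as a drop-in modification to an existing flow-matching pipeline.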
Numerical Results and Implications
The paper reports compelling numbers: on the LIBERO benchmarks, the Cocos-trained policy matches π0-level performance in only 30,000 gradient steps, a 2.14x speedup in convergence over the standard training recipe. Efficiency gains of this size could substantially reduce the cost of scaling VLA models to a broader range of robotics tasks.
Implications for AI Development
Practically, Cocos offers a significant gain in training efficiency and accuracy, promising lower computational cost and faster deployment of policies in robotic systems. Theoretically, it points to a design principle, keeping the generative process sensitive to its conditions, that may apply beyond VLA models to other domains requiring conditional generation.
Future Directions
This research opens avenues for further exploration in multiple areas:
- Applying Cocos within larger-scale, pre-trained models to assess adaptability across diverse conditional frameworks.
- Evaluating the integration of dynamic, learnable distributions as condition anchors.
- Investigating its potential within decentralized training environments, where condition inputs might shift dynamically.
In conclusion, the paper aptly diagnoses key inefficiencies within current diffusion policy methods and offers Cocos as a viable, theoretically sound solution. Its findings suggest promising enhancements in VLA model training that could catalyze faster, more efficient developments in adaptive robot control tasks.