Modality-Composable Diffusion Policy
- MCDP is a diffusion-based policy that composes multiple unimodal diffusion policies via convex score combination during inference.
- It integrates specialized sensory modalities, such as RGB images and point clouds, for robust robotic trajectory generation, avoiding the cost of retraining a unified multi-modal policy.
- Empirical evaluations on benchmarks like RoboTwin demonstrate that MCDP improves success rates, offering a modular, plug-and-play approach to multi-modal policy integration.
Modality-Composable Diffusion Policy (MCDP) extends diffusion-based policy models by enabling inference-time composition of multiple pre-trained unimodal diffusion policies, each specialized for a distinct sensor modality. Instead of retraining a single, unified multi-modal policy—a process that incurs significant data and computational cost—MCDP constructs a composite policy by convexly combining the distributional scores (denoising functions) from its constituent unimodal policies during sampling, yielding enhanced adaptability, robustness, and generalization without additional training. Empirical studies in automated robotics tasks, notably on the RoboTwin benchmark, demonstrate that MCDP often outperforms its underlying unimodal policies and establishes a modular, plug-and-play paradigm for integration of arbitrary sensing modalities (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).
1. Theoretical Foundation: Score-Based Diffusion Policies
A diffusion policy (DP) parameterizes a trajectory distribution using a forward noising process and a learned, score-based reverse denoising process. Let $x_0$ denote a trajectory (e.g., a robot end-effector pose sequence), and $x_t$ its noisy counterpart at diffusion step $t$. The forward Markov process is defined as

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big),$$

where $\{\beta_t\}$ is a predefined noise schedule. The reverse-time process is parameterized by a neural network $\epsilon_\theta(x_t, t)$ estimating the noise added at each step, which provides the score function via

$$\nabla_{x_t} \log p_t(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}.$$

The DDPM update for discrete time steps takes the form

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z,\qquad z \sim \mathcal{N}(0, I),$$

where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. In the SDE perspective, the forward SDE is

$$dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw,$$

and the reverse SDE uses the neural score:

$$dx = \left[-\tfrac{1}{2}\beta(t)\, x - \beta(t)\, \nabla_x \log p_t(x)\right] dt + \sqrt{\beta(t)}\, d\bar{w}.$$

Unimodal DPs are pretrained by behavior cloning on trajectory data conditioned on modality-specific inputs $o$:
- RGB-based DP: $\epsilon_\theta^{\mathrm{img}}(x_t, t, o_{\mathrm{img}})$, trained on RGB images (CNN + transformer encoding).
- Point-cloud DP: $\epsilon_\theta^{\mathrm{pcd}}(x_t, t, o_{\mathrm{pcd}})$, trained on voxelized point clouds.
The objective is the standard diffusion loss $\mathcal{L} = \mathbb{E}_{x_0, t, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t, o)\|^2\big]$ (Cao et al., 16 Mar 2025).
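The forward corruption and the $\epsilon$-prediction loss above can be sketched numerically. The linear $\beta$ schedule, array shapes, and the helper names `noisy_sample` and `diffusion_loss` below are illustrative assumptions, not the papers' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear beta schedule (the papers do not fix one here).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def noisy_sample(x0, t, eps):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def diffusion_loss(eps_pred, eps):
    """Standard epsilon-prediction MSE objective."""
    return np.mean((eps_pred - eps) ** 2)

x0 = rng.normal(size=(8,))        # stand-in for a short trajectory chunk
eps = rng.normal(size=x0.shape)
x_t = noisy_sample(x0, t=50, eps=eps)
# A perfect noise predictor recovers eps exactly, driving the loss to zero.
print(diffusion_loss(eps, eps))   # 0.0
```

In practice the predictor is a conditional network $\epsilon_\theta(x_t, t, o)$; the sketch only exercises the schedule and the loss.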
2. Inference-Time Composition: Policy Combination Mechanism
At inference, MCDP convexly combines the scores from $N$ pre-trained policies, each operating on a distinct modality $o_i$:

$$\hat{\epsilon}(x_t, t) = \sum_{i=1}^{N} w_i\, \epsilon_\theta^{i}(x_t, t, o_i),\qquad \sum_{i=1}^{N} w_i = 1,\ \ w_i \ge 0,$$

where $\epsilon_\theta^{i}$ is the noise estimate from the $i$-th DP and $w_i$ is its weight. The resulting composite score function is

$$\hat{s}(x_t, t) = -\frac{\hat{\epsilon}(x_t, t)}{\sqrt{1-\bar{\alpha}_t}} = \sum_{i=1}^{N} w_i\, s^{i}(x_t, t).$$

In practice, per-modality normalization may be applied to prevent any modality's score from dominating.
The composite estimate $\hat{\epsilon}$ replaces the unimodal noise estimate in the reverse denoising step:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}(x_t, t)\right) + \sigma_t z.$$

MCDP thereby generates trajectories under a policy that integrates the strengths of all included modalities (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).
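A minimal sketch of the composition rule and the modified reverse step, assuming the standard DDPM update; `composed_eps` and `ddpm_step` are hypothetical helper names:

```python
import numpy as np

def composed_eps(eps_estimates, weights):
    """Convex combination of per-modality noise estimates; the weights
    must be non-negative and sum to 1 (the convexity constraint)."""
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return sum(wi * e for wi, e in zip(w, eps_estimates))

def ddpm_step(x_t, eps_hat, alpha_t, alpha_bar_t, sigma_t, z):
    """One reverse DDPM update using the composite noise estimate eps_hat."""
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps_hat) \
           / np.sqrt(alpha_t)
    return mean + sigma_t * z

# Degenerate check: with full weight on one modality, the composed
# estimate reduces to that modality's estimate.
e1, e2 = np.ones(4), np.zeros(4)
print(composed_eps([e1, e2], [1.0, 0.0]))  # [1. 1. 1. 1.]
```

Because the combination happens purely at inference, the parent policies stay frozen and the only new hyperparameters are the weights.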
3. Functional Guarantees and System-Level Analysis
The MCDP construction obtains theoretical support in the form of one-step improvement and trajectory-level error bounds (Cao et al., 1 Oct 2025). Given policies with time-$t$ score functions $s^{i}(x_t, t)$, the convexly composed score remains inside the convex hull of the parent policies:

$$\hat{s}(x_t, t) = \sum_i w_i\, s^{i}(x_t, t) \in \mathrm{conv}\{s^{1}, \ldots, s^{N}\}.$$

Define each policy's mean-squared error to the oracle score $s^\ast$ as $e_i = \mathbb{E}\,\|s^{i} - s^\ast\|^2$. For $N = 2$, the error of the mixture score

$$E(w) = \mathbb{E}\,\big\|w\, s^{1} + (1-w)\, s^{2} - s^\ast\big\|^2$$

is convex in $w$ with minimizer $w^\ast \in [0, 1]$ satisfying

$$E(w^\ast) \le \min(e_1, e_2),$$

with strict inequality if the estimators' errors are non-aligned. This supports the empirical observation that MCDP samplers can outperform all constituent unimodal policies.
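The two-policy convexity claim can be illustrated numerically: with two synthetic score estimators whose errors are independent (hence non-aligned), the equal-weight mixture attains lower mean-squared error than either parent:

```python
import numpy as np

rng = np.random.default_rng(1)
s_star = rng.normal(size=1000)                 # stand-in "oracle" score
# Two estimators with independent errors of equal scale.
s1 = s_star + rng.normal(scale=0.5, size=1000)
s2 = s_star + rng.normal(scale=0.5, size=1000)

def mse(s):
    """Mean-squared error to the oracle score."""
    return np.mean((s - s_star) ** 2)

mix = 0.5 * s1 + 0.5 * s2                      # convex mixture, w = 0.5
print(mse(s1), mse(s2), mse(mix))
# The mixture error is roughly half of either parent error here, since
# averaging independent noise of equal variance halves the variance.
```

When the parent errors are strongly correlated (aligned), this averaging benefit shrinks, which matches the "non-aligned" condition for strict improvement.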
For the continuous-time sampling trajectory $\hat{x}(t)$ under the composed score, a Grönwall-type bound of the following shape holds:

$$\|\hat{x}(T) - x^\ast(T)\| \le e^{LT}\,\|\hat{x}(0) - x^\ast(0)\| + \frac{\varepsilon_w}{L}\big(e^{LT} - 1\big),$$

where $x^\ast(t)$ denotes the oracle trajectory, $L$ is a Lipschitz constant of the reverse drift, and $\varepsilon_w$ bounds the composed score error; the precise Lipschitz and score-error constants are defined in the functional analysis of system-level accuracy (Cao et al., 1 Oct 2025). This suggests that the benefits of composition in single denoising steps propagate consistently throughout the entire trajectory-generation process.
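A bound of this shape can be checked on a toy problem (a 1-D linear drift with Lipschitz constant $L = 1$ and a constant drift perturbation $\varepsilon$; these are illustrative stand-ins, not the paper's constants):

```python
import numpy as np

# Integrate dx/dt = -x and a perturbed copy whose drift differs by at
# most eps; by a Gronwall argument, the terminal gap must stay below
# (eps / L) * (exp(L * T) - 1) for Lipschitz constant L = 1.
L, eps, T, n = 1.0, 0.05, 1.0, 10_000
dt = T / n
x, x_pert = 1.0, 1.0                 # identical initial conditions
for _ in range(n):
    x += dt * (-x)                   # oracle drift
    x_pert += dt * (-x_pert + eps)   # perturbed drift
gap = abs(x_pert - x)
bound = (eps / L) * (np.exp(L * T) - 1.0)
print(gap, bound)                    # the gap sits well under the bound
```

The gap here grows like $\varepsilon(1 - e^{-T})$, comfortably inside the worst-case exponential envelope; the paper's analysis plays the same game with the reverse diffusion drift and the composed score error.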
4. Algorithmic Procedures and Weight Search
The standard two-modality MCDP algorithm proceeds as follows:
- Initialize with pre-trained unimodal DPs ($\epsilon_\theta^{\mathrm{img}}$, $\epsilon_\theta^{\mathrm{pcd}}$), their input encodings, and composition weights ($w$, $1-w$).
- Sample the initial noisy trajectory $x_T \sim \mathcal{N}(0, I)$.
- For $t = T, \ldots, 1$:
  - Compute unimodal noise estimates $\epsilon_\theta^{\mathrm{img}}(x_t, t, o_{\mathrm{img}})$ and $\epsilon_\theta^{\mathrm{pcd}}(x_t, t, o_{\mathrm{pcd}})$.
  - Linearly blend them: $\hat{\epsilon} = w\,\epsilon_\theta^{\mathrm{img}} + (1-w)\,\epsilon_\theta^{\mathrm{pcd}}$.
  - Update $x_{t-1}$ using the composite $\hat{\epsilon}$ in the DDPM step.
- Return $x_0$ as the action trajectory.
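The loop above can be sketched as follows; `mcdp_sample` is a hypothetical name, and the two callables stand in for the pretrained unimodal policies:

```python
import numpy as np

def mcdp_sample(eps_img, eps_pcd, w, betas, shape, rng):
    """Two-modality MCDP reverse sampler (sketch). eps_img / eps_pcd are
    callables (x_t, t) -> noise estimate, standing in for the pretrained
    unimodal diffusion policies; w blends their predictions."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    T = len(betas)
    x = rng.normal(size=shape)                 # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        # Convex blend of the two noise estimates.
        eps_hat = w * eps_img(x, t) + (1.0 - w) * eps_pcd(x, t)
        # Standard DDPM posterior mean with the composite estimate.
        mean = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
               / np.sqrt(alphas[t])
        z = rng.normal(size=shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * z
    return x

# Smoke test with trivial (zero) noise predictors.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)
zero = lambda x, t: np.zeros_like(x)
traj = mcdp_sample(zero, zero, w=0.5, betas=betas, shape=(16,), rng=rng)
print(traj.shape)  # (16,)
```

Real policies would additionally consume their modality inputs ($o_{\mathrm{img}}$, $o_{\mathrm{pcd}}$) inside the callables; the sampler itself is unchanged.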
Grid search over the weight simplex is performed on candidate weights $w$, using empirical rollout success rates to select the optimal $w^\ast$. For $N = 2$, a coarse-to-fine grid over $w \in [0, 1]$ suffices. Each candidate $w$ is evaluated via multiple rollouts, tracking task success, to select the best-performing mixture for deployment (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).
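A minimal sketch of the weight search for $N = 2$, assuming a user-supplied rollout routine; the 0.1 grid step, the rollout count, and `grid_search_weight` are illustrative choices, not values fixed by the papers:

```python
import numpy as np

def grid_search_weight(rollout_success, grid=None, n_rollouts=20):
    """Coarse grid search over the blend weight w in [0, 1].
    rollout_success(w) -> success indicator for one episode run under
    weight w; here it stands in for an environment rollout."""
    if grid is None:
        grid = np.arange(0.0, 1.0 + 1e-9, 0.1)   # illustrative 0.1 step
    rates = {w: np.mean([rollout_success(w) for _ in range(n_rollouts)])
             for w in grid}
    return max(rates, key=rates.get), rates

# Synthetic rollout whose success peaks sharply at w = 0.7.
best_w, rates = grid_search_weight(lambda w: float(abs(w - 0.7) < 0.05))
print(best_w)  # close to 0.7, the synthetic optimum
```

A finer pass around the coarse winner (e.g., step 0.02 in a neighborhood of `best_w`) implements the coarse-to-fine refinement described above.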
5. Empirical Evaluation
Quantitative Results: RoboTwin and Robomimic
Empirical tests on the RoboTwin bimanual manipulation suite and Robomimic/PushT show that MCDP generally outperforms both parent unimodal DPs whenever both achieve moderate success rates. For example:
| Task | DP_img | DP_pcd | MCDP (best $w$) |
|---|---|---|---|
| Empty Cup Place | 0.42 | 0.62 | 0.86 |
| Dual Bottles Pick (H) | 0.49 | 0.64 | 0.71 |
| Shoe Place | 0.37 | 0.36 | 0.60 |
- If either parent policy performs poorly (very low success rate), MCDP cannot improve over the stronger policy (e.g., "Pick Apple Messy").
- Optimal weights place heavier emphasis on the better-performing DP per task (Cao et al., 16 Mar 2025).
On Robomimic, PushT, and RoboTwin, convex MCDP composition yields improvements of roughly $2$ to $5.5$ percentage points over the best parent policy on standard benchmarks, with similar gains in real-world robotic setups:
| Method | Avg SR | Δ vs best parent |
|---|---|---|
| DP+MP | 41.41 | +2.22% |
| Florence-D+DP | 66.76 | +5.51% |
| π₀+FP | 88.94 | +2.52% |
Alternative operators such as logical AND/OR can achieve even greater gains at the cost of per-step recomputation and limited compatibility (e.g., not with flow models) (Cao et al., 1 Oct 2025).
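For contrast, product-style ("AND") composition sums the parent scores (since log-densities add under a product of distributions), whereas MCDP's convex rule averages them; a sketch under those assumed forms, with hypothetical helper names:

```python
import numpy as np

def compose_convex(scores, weights):
    """MCDP-style convex mixture of parent scores."""
    return sum(w * s for w, s in zip(weights, scores))

def compose_and(scores):
    """Product-of-distributions ("AND") composition: log p = sum_i log p_i
    (up to normalization), so the composed score is the plain sum."""
    return sum(scores)

s1, s2 = np.array([1.0, -2.0]), np.array([3.0, 0.0])
print(compose_convex([s1, s2], [0.5, 0.5]))  # [ 2. -1.]
print(compose_and([s1, s2]))                 # [ 4. -2.]
```

The AND rule concentrates mass where all parents agree, which explains both its potential for larger gains and its stricter compatibility requirements noted above.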
Qualitative Insights
- Action distributions transition smoothly between the behaviors of each unimodal policy as composition weights are varied, yielding trajectory interpolation.
- Case studies highlight blending of complementary strengths; e.g., combining approach direction from vision with force estimation from point-cloud data for improved grasping (Cao et al., 16 Mar 2025).
6. Modality and Model Generality
MCDP, via the General Policy Composition (GPC) framework, is agnostic to the sensory modalities and model architectures of its DPs, allowing composition of:
- Vision-only (RGB), point-cloud, vision–language–action (VLA) policies (e.g., Florence-DiT), and others.
- Both diffusion- and flow-matching–based policies.
Significant empirical performance gains are observed when parent policies offer complementary strengths. For heterogeneous combinations, MCDP boosts average SR (success rate) by roughly $5$ percentage points or more in RoboTwin (vision+point-cloud and VLA+VA pairs), and by similar margins on real-robot tasks (Cao et al., 1 Oct 2025).
Composition requires shared trajectory/action space and diffusion schedule alignment among combined DPs. The approach does not employ classifier-free guidance, thus avoiding doubled computational cost (Cao et al., 16 Mar 2025).
7. Limitations and Extensions
Manual weight tuning is currently required—suboptimal choices can degrade performance, especially if large weights are assigned to poor parent models. MCDP has so far been demonstrated primarily for two visual modalities; extension to additional modalities (e.g., tactile, language-conditioned, proprioceptive DPs) is plausible.
Plausible implications include:
- Adaptive weight tuning (online or via validation rollouts) could further improve results.
- Extension to composition across domains and embodiments may be achieved by aligning latent action representations.
- Investigation of asynchronous modality-specific schedulers and advanced diffusion solvers (such as DPM-Solver or Analytic-DPM) constitutes an open direction (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).
References
- "Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition" (Cao et al., 16 Mar 2025)
- "Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition" (Cao et al., 1 Oct 2025)