
Modality-Composable Diffusion Policy

Updated 8 March 2026
  • MCDP is a diffusion-based policy that composes multiple unimodal diffusion policies via convex score combination during inference.
  • It avoids costly retraining by integrating specialized sensory modalities, such as RGB images and point clouds, for robust robotic trajectory generation.
  • Empirical evaluations on benchmarks like RoboTwin demonstrate that MCDP improves success rates, offering a modular, plug-and-play approach to multi-modal policy integration.

Modality-Composable Diffusion Policy (MCDP) extends diffusion-based policy models by enabling inference-time composition of multiple pre-trained unimodal diffusion policies, each specialized for a distinct sensor modality. Instead of retraining a single, unified multi-modal policy, a process that incurs significant data and computational cost, MCDP constructs a composite policy by convexly combining the distributional scores (denoising functions) of its constituent unimodal policies during sampling, yielding enhanced adaptability, robustness, and generalization without additional training. Empirical studies on robotic manipulation tasks, notably the RoboTwin benchmark, show that MCDP often outperforms its underlying unimodal policies and establish a modular, plug-and-play paradigm for integrating arbitrary sensing modalities (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).

1. Theoretical Foundation: Score-Based Diffusion Policies

A diffusion policy (DP) parameterizes a trajectory distribution using a forward noising process and a learned, score-based reverse denoising process. Let $\tau \in \mathbb{R}^D$ denote a trajectory (e.g., a sequence of robot end-effector poses), and $\tau_t$ its noisy counterpart at diffusion step $t$. The forward Markov process is defined as

$$q(\tau_t \mid \tau_{t-1}) = \mathcal{N}\!\left(\tau_t;\ \sqrt{\alpha_t}\,\tau_{t-1},\ (1-\alpha_t)I\right)$$

where $\alpha_t$ is a predefined noise schedule. The reverse-time process is parameterized by a neural network $\epsilon_\theta$ estimating the noise added at each step, which provides the score function:

$$s_\theta(\tau_t, t) = -\frac{1}{\sigma_t}\,\epsilon_\theta(\tau_t, t) \approx \nabla_{\tau_t}\log p_\theta(\tau_t)$$

The DDPM update for discrete time steps takes the form

$$\tau_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( \tau_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\tau_t, t) \right) + \sigma_t \xi, \quad \xi \sim \mathcal{N}(0, I)$$

where $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$. In the SDE perspective, the forward SDE is

$$d\tau = -\tfrac{1}{2}\beta(t)\,\tau\,dt + \sqrt{\beta(t)}\,dw$$

and the reverse SDE uses the neural score:

$$d\tau = \left[ -\tfrac{1}{2}\beta(t)\,\tau - \beta(t)\,\nabla_{\tau}\log p_\theta(\tau, t) \right]dt + \sqrt{\beta(t)}\,d\overline{w}$$

Unimodal DPs are pretrained by behavior cloning on trajectory data conditioned on modality-specific inputs:

  • RGB-based DP: $\epsilon^{\text{img}}_\theta(\tau_t, t \mid I_{\text{rgb}})$, trained on RGB images (CNN + transformer encoding).
  • Point-cloud DP: $\epsilon^{\text{pcd}}_\theta(\tau_t, t \mid P)$, trained on voxelized point clouds.

The objective is the standard diffusion loss

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau, t, \xi}\left\| \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\tau + \sqrt{1-\bar{\alpha}_t}\,\xi,\ t\right) - \xi \right\|^2$$

(Cao et al., 16 Mar 2025).
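For concreteness, the following is a minimal NumPy sketch of this training objective; `eps_theta` stands in for the modality-conditioned denoiser (the conditioning input, e.g., $I_\text{rgb}$ or $P$, is assumed to be closed over), and the function name is illustrative rather than taken from the papers' code.

```python
import numpy as np

def diffusion_bc_loss(eps_theta, tau, t, alpha_bar, rng):
    """One-sample estimate of the behavior-cloning diffusion loss:
    noise a clean trajectory tau to step t, then score the denoiser's
    noise prediction against the true noise xi."""
    xi = rng.standard_normal(tau.shape)                      # xi ~ N(0, I)
    tau_t = np.sqrt(alpha_bar[t]) * tau + np.sqrt(1.0 - alpha_bar[t]) * xi
    return float(np.mean((eps_theta(tau_t, t) - xi) ** 2))   # ||eps - xi||^2
```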

2. Inference-Time Composition: Policy Combination Mechanism

At inference, MCDP convexly combines the scores from $n$ pre-trained policies, each operating on a distinct modality $\mathcal{M}_i$:

$$\hat{\epsilon}_{\text{comp}}(\tau_t, t) = \sum_{i=1}^n w_i\,\epsilon^{(i)}_t, \quad \sum_{i=1}^n w_i = 1, \; w_i \geq 0$$

where $\epsilon^{(i)}_t$ is the noise estimate from the $i$-th DP and $w_i$ is its weight. The resulting composite score function is

$$s_{\text{comp}}(\tau_t, t) = \sum_{i=1}^n w_i\,s^{(i)}_\theta(\tau_t, t)$$

In practice, $L_2$ normalization may be applied to prevent any one modality's score from dominating:

$$\bar{\epsilon}^{(i)}_t = \epsilon^{(i)}_t / \big\|\epsilon^{(i)}_t\big\|_2, \quad \hat{\epsilon}_{\text{comp}} = \sum_i w_i\,\bar{\epsilon}^{(i)}_t$$
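The combination rule is only a few lines of code. The sketch below assumes the per-modality noise estimates have already been computed; `compose_eps` is an illustrative name, not from the papers' implementation.

```python
import numpy as np

def compose_eps(eps_list, weights, normalize=False):
    """Convex combination of per-modality noise estimates.
    eps_list: per-policy predictions eps^(i)_t; weights: convex coefficients."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-8
    if normalize:  # optional L2 normalization so no single score dominates
        eps_list = [e / np.linalg.norm(e) for e in eps_list]
    return sum(w * e for w, e in zip(weights, eps_list))
```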

The composite estimate replaces the unimodal one in the reverse denoising step:

$$\tau_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( \tau_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}_{\text{comp}}(\tau_t, t) \right) + \sigma_t \xi$$

MCDP thereby generates trajectories under a policy that integrates the strengths of all included modalities (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).

3. Functional Guarantees and System-Level Analysis

The MCDP construction has theoretical support in the form of one-step improvement and trajectory-level error bounds (Cao et al., 1 Oct 2025). Given $N$ policies $\pi_1, \ldots, \pi_N$ with time-$t$ score functions $s_i(\tau_t, t)$, the convexly composed score remains inside the convex hull of the parent policies:

$$\hat{s}_{\text{comp}}(\tau_t, t) = \sum_{i=1}^N w_i\,s_i(\tau_t, t), \quad w_i \geq 0, \ \sum_i w_i = 1$$

Define each policy's mean-squared error relative to the oracle score $s^*$ as $Q_i = \mathbb{E}\|s_i - s^*\|^2$. For $N = 2$, the error $Q(w)$ of the mixture score $w s_1 + (1-w) s_2$ is convex in $w$ with minimizer $w^*$:

$$Q(w^*) \leq \min\{Q(0),\, Q(1)\}$$

with strict inequality when the estimators' errors are non-aligned. This supports the empirical observation that MCDP samplers can outperform all constituent unimodal policies.
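A short derivation, consistent with the stated result, makes the convexity explicit; the error-correlation symbol $\rho$ below is introduced here for illustration and does not appear in the source.

```latex
% Expand Q(w) with \rho = \mathbb{E}\langle s_1 - s^*,\ s_2 - s^* \rangle:
\begin{align}
Q(w) &= \mathbb{E}\bigl\| w\,(s_1 - s^*) + (1-w)\,(s_2 - s^*) \bigr\|^2 \\
     &= w^2 Q_1 + (1-w)^2 Q_2 + 2\,w(1-w)\,\rho .
\end{align}
% Cauchy--Schwarz gives \rho \le \sqrt{Q_1 Q_2} \le \tfrac{1}{2}(Q_1 + Q_2),
% hence Q''(w) = 2(Q_1 + Q_2 - 2\rho) \ge 0 and Q is convex on [0,1].
% Its minimizer w^* can therefore only improve on the endpoints
% Q(0) = Q_2 and Q(1) = Q_1, strictly so whenever \rho < \min\{Q_1, Q_2\},
% i.e., whenever the two estimators' errors are not perfectly aligned.
```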

For the continuous-time sampling trajectory $x_{\hat{s}}(t)$ under the composed score, a Grönwall-type bound holds:

$$\mathbb{E}\,\|x_{\hat{s}}(T) - x^*(T)\| \leq \left(\int_0^T e^{2\int_t^T \tilde{L}(\tau)\,d\tau}\, L_s(t)^2\, dt\right)^{1/2} \left(\int_0^T \kappa(t)^2\, dt\right)^{1/2}$$

where $\tilde{L}(t)$, $L_x(t)$, $L_s(t)$, $\hat{\Lambda}(t)$, and $\kappa(t)$ are Lipschitz and score-error constants defined in the papers' functional analysis of system-level accuracy (Cao et al., 1 Oct 2025). This suggests that the benefits of composition in single denoising steps propagate consistently through the entire trajectory-generation process.

4. Algorithm and Weight Selection

The standard two-modality MCDP algorithm proceeds as follows (a code sketch follows the list):

  1. Initialize with the pre-trained unimodal DPs ($\pi_{\text{img}}$, $\pi_{\text{pcd}}$), their input encodings, and composition weights ($w_{\text{img}}$, $w_{\text{pcd}}$).
  2. Sample the initial noisy trajectory $\tau_T \sim \mathcal{N}(0, I)$.
  3. For $t = T, T-1, \ldots, 1$:
    • Compute the unimodal noise estimates $\epsilon_{\text{img}}$ and $\epsilon_{\text{pcd}}$.
    • Linearly blend them: $\epsilon_{\text{comp}} = w_{\text{img}}\,\epsilon_{\text{img}} + w_{\text{pcd}}\,\epsilon_{\text{pcd}}$.
    • Update $\tau_{t-1}$ using the composite $\epsilon_{\text{comp}}$.
  4. Return $\tau_0$ as the action trajectory.
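Written out, the loop amounts to the following minimal NumPy sketch. The callables `eps_img` and `eps_pcd` stand in for the frozen unimodal policies (their conditioning on $I_\text{rgb}$ and $P$ is assumed to be baked into the closures), and `mcdp_sample` is an illustrative name, not from the papers' code.

```python
import numpy as np

def mcdp_sample(eps_img, eps_pcd, w_img, alphas, sigmas, dim, rng=None):
    """Two-modality MCDP sampling: DDPM reverse updates driven by a
    convexly blended noise estimate from two frozen unimodal DPs."""
    rng = rng or np.random.default_rng()
    w_pcd = 1.0 - w_img                        # convex weights sum to 1
    alpha_bars = np.cumprod(alphas)            # \bar{alpha}_t = prod_i alpha_i
    tau = rng.standard_normal(dim)             # tau_T ~ N(0, I)
    for t in range(len(alphas) - 1, -1, -1):
        # Per-modality noise estimates, then the convex blend.
        eps_comp = w_img * eps_img(tau, t) + w_pcd * eps_pcd(tau, t)
        # Standard DDPM reverse step with the composite estimate.
        coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
        tau = (tau - coef * eps_comp) / np.sqrt(alphas[t])
        if t > 0:                              # no noise injection at the final step
            tau += sigmas[t] * rng.standard_normal(dim)
    return tau                                 # tau_0: the action trajectory
```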

A grid search over the weight simplex is performed, using empirical rollout success rates to select the optimal $w^*$. For $N = 2$, a coarse grid over $w_1 \in \{0, 0.1, 0.2, \ldots, 1\}$ with $w_2 = 1 - w_1$ suffices. Each candidate $w$ is evaluated via multiple rollouts, tracking task success, and the best-performing mixture is selected for deployment (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).
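The selection step might look like the following sketch, where `rollout_success_rate` is a hypothetical hook that deploys the composed policy with weights $(w_1, 1 - w_1)$ for several episodes and returns the empirical success rate.

```python
def select_weight(rollout_success_rate, grid=None, n_rollouts=20):
    """Grid search over the two-policy weight simplex: score each
    candidate w_1 by empirical rollout success and keep the best."""
    if grid is None:
        grid = [round(0.1 * i, 1) for i in range(11)]   # {0, 0.1, ..., 1}
    scores = {w: rollout_success_rate(w, n_rollouts) for w in grid}
    return max(scores, key=scores.get)                  # best-performing w_1
```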

5. Empirical Evaluation

Quantitative Results: RoboTwin and Robomimic

Empirical tests on the RoboTwin bimanual manipulation suite and on Robomimic/PushT show that MCDP generally outperforms both parent unimodal DPs whenever both achieve moderate performance ($\gtrsim 30\%$ success). For example:

| Task | DP_img | DP_pcd | MCDP (best $w$) |
|---|---|---|---|
| Empty Cup Place | 0.42 | 0.62 | 0.86 ($w_{\text{img}} = 0.4$) |
| Dual Bottles Pick (H) | 0.49 | 0.64 | 0.71 ($w_{\text{img}} = 0.3$) |
| Shoe Place | 0.37 | 0.36 | 0.60 ($w_{\text{img}} = 0.5$) |

  • If either parent policy is poor ($< 10\%$ success), MCDP cannot improve over the stronger policy (e.g., "Pick Apple Messy").
  • Optimal weights place heavier emphasis on the better-performing DP for each task (Cao et al., 16 Mar 2025).

On Robomimic, PushT, and RoboTwin, convex MCDP composition yields a $2$–$10\%$ improvement on standard benchmarks and approximately $10\%$ in real-world robotic setups:

| Method | Avg SR (%) | Δ vs. best parent |
|---|---|---|
| DP+MP | 41.41 | +2.22% |
| Florence-D+DP | 66.76 | +5.51% |
| π₀+FP | 88.94 | +2.52% |

Alternative composition operators such as logical AND/OR can achieve even greater gains, at the cost of per-step recomputation and more limited compatibility (e.g., they do not apply to flow models) (Cao et al., 1 Oct 2025).

Qualitative Insights

  • Action distributions transition smoothly between the behaviors of each unimodal policy as composition weights are varied, yielding trajectory interpolation.
  • Case studies highlight blending of complementary strengths; e.g., combining approach direction from vision with force estimation from point-cloud data for improved grasping (Cao et al., 16 Mar 2025).

6. Modality and Model Generality

MCDP, via the General Policy Composition (GPC) framework, is agnostic to the sensory modalities and model architectures of its DPs, allowing composition of:

  • Vision-only (RGB), point-cloud, vision–language–action (VLA) policies (e.g., Florence-DiT), and others.
  • Both diffusion- and flow-matching–based policies.

Significant empirical performance gains are observed when the parent policies offer complementary strengths. For heterogeneous combinations, MCDP boosts the average success rate (SR) by $5$–$7\%$ on RoboTwin (vision + point-cloud and VLA + VA pairs), and by similar margins on real-robot tasks (Cao et al., 1 Oct 2025).
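For flow-based parents, the same convex blending can plausibly be applied to the predicted velocity fields and integrated with a simple Euler scheme. The sketch below is an assumption-laden illustration, not the GPC implementation; the function names and the unit-time integration interval are hypothetical.

```python
import numpy as np

def composed_flow_sample(v_fields, weights, dim, n_steps=50, rng=None):
    """Euler integration of a convexly blended flow-matching policy.
    v_fields: callables (tau, t) -> velocity; weights: convex coefficients."""
    rng = rng or np.random.default_rng()
    tau = rng.standard_normal(dim)     # start from the noise distribution
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        # Blend the per-policy velocity predictions at (tau, t).
        vel = sum(w * f(tau, t) for w, f in zip(weights, v_fields))
        tau = tau + dt * vel           # Euler step along the blended field
    return tau
```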

Composition requires that the combined DPs share a trajectory/action space and use aligned diffusion schedules. The approach does not employ classifier-free guidance, and thus avoids the associated doubling of computational cost (Cao et al., 16 Mar 2025).

7. Limitations and Extensions

Manual weight tuning is currently required; suboptimal choices can degrade performance, especially if large weights are assigned to poor parent models. MCDP has so far been demonstrated primarily for two visual modalities; extension to additional modalities (e.g., tactile, language-conditioned, or proprioceptive DPs) is plausible.

Plausible implications include:

  • Adaptive weight tuning (online or via validation rollouts) could further improve results.
  • Extension to composition across domains and embodiments may be achieved by aligning latent action representations.
  • Investigation of asynchronous modality-specific schedulers and advanced diffusion solvers (such as DPM-Solver or Analytic-DPM) constitutes an open direction (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).

References

  • "Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition" (Cao et al., 16 Mar 2025)
  • "Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition" (Cao et al., 1 Oct 2025)
