
Modality-Composable Diffusion Policy

Updated 8 March 2026
  • MCDP is a diffusion-based policy that composes multiple unimodal diffusion policies via convex score combination during inference.
  • It avoids costly retraining by integrating specialized sensory modalities, such as RGB images and point clouds, for robust robotic trajectory generation.
  • Empirical evaluations on benchmarks like RoboTwin demonstrate that MCDP improves success rates, offering a modular, plug-and-play approach to multi-modal policy integration.

Modality-Composable Diffusion Policy (MCDP) extends diffusion-based policy models by enabling inference-time composition of multiple pre-trained unimodal diffusion policies, each specialized for a distinct sensor modality. Instead of retraining a single, unified multi-modal policy, a process that incurs significant data and computational cost, MCDP constructs a composite policy by convexly combining the distributional scores (denoising functions) of its constituent unimodal policies during sampling, yielding enhanced adaptability, robustness, and generalization without additional training. Empirical studies on robotic manipulation tasks, notably the RoboTwin benchmark, show that MCDP often outperforms its underlying unimodal policies and establish a modular, plug-and-play paradigm for integrating arbitrary sensing modalities (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).

1. Theoretical Foundation: Score-Based Diffusion Policies

A diffusion policy (DP) parameterizes a trajectory distribution using a forward noising process and a learned, score-based reverse denoising process. Let $\tau \in \mathbb{R}^D$ denote a trajectory (e.g., a sequence of robot end-effector poses), and $\tau_t$ its noisy counterpart at diffusion step $t$. The forward Markov process is defined as

$$q(\tau_t \mid \tau_{t-1}) = \mathcal{N}\!\left(\tau_t;\ \sqrt{\alpha_t}\,\tau_{t-1},\ (1-\alpha_t)I\right)$$

where $\alpha_t$ is a predefined noise schedule. The reverse-time process is parameterized by a neural network $\epsilon_\theta$ estimating the noise added at each step, which provides the score function:

$$s_\theta(\tau_t, t) = -\frac{1}{\sigma_t}\,\epsilon_\theta(\tau_t, t) \approx \nabla_{\tau_t}\log p_\theta(\tau_t)$$

The DDPM update for discrete time steps takes the form

$$\tau_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( \tau_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\tau_t, t) \right) + \sigma_t \xi, \quad \xi \sim \mathcal{N}(0, I)$$

where $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$. In the SDE perspective, the forward SDE is

$$d\tau = -\tfrac{1}{2}\beta(t)\,\tau\,dt + \sqrt{\beta(t)}\,dw$$

and the reverse SDE uses the neural score:

$$d\tau = \left[ -\tfrac{1}{2}\beta(t)\,\tau - \beta(t)\,\nabla_{\tau}\log p_\theta(\tau, t) \right]dt + \sqrt{\beta(t)}\,d\overline{w}$$

Unimodal DPs are pretrained by behavior cloning on trajectory data conditioned on modality-specific inputs:

  • RGB-based DP: $\epsilon^{\text{img}}_\theta(\tau_t, t \mid I_{\text{rgb}})$, trained on RGB images (CNN + transformer encoding).
  • Point-cloud DP: $\epsilon^{\text{pcd}}_\theta(\tau_t, t \mid P)$, trained on voxelized point clouds.

The objective is the standard diffusion loss

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau, t, \xi}\left\| \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\tau + \sqrt{1-\bar{\alpha}_t}\,\xi,\ t\right) - \xi \right\|^2$$

(Cao et al., 16 Mar 2025).
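For concreteness, the following is a minimal NumPy sketch of this training objective; `eps_theta` stands in for the modality-conditioned denoiser (the conditioning input, e.g., $I_\text{rgb}$ or $P$, is assumed to be closed over), and the function name is illustrative rather than taken from the papers' code.

```python
import numpy as np

def diffusion_bc_loss(eps_theta, tau, t, alpha_bar, rng):
    """One-sample estimate of the behavior-cloning diffusion loss:
    noise a clean trajectory tau to step t, then score the denoiser's
    noise prediction against the true noise xi."""
    xi = rng.standard_normal(tau.shape)                      # xi ~ N(0, I)
    tau_t = np.sqrt(alpha_bar[t]) * tau + np.sqrt(1.0 - alpha_bar[t]) * xi
    return float(np.mean((eps_theta(tau_t, t) - xi) ** 2))   # ||eps - xi||^2
```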

2. Inference-Time Composition: Policy Combination Mechanism

At inference, MCDP convexly combines the scores from $n$ pre-trained policies, each operating on a distinct modality $\mathcal{M}_i$:

$$\hat{\epsilon}_{\text{comp}}(\tau_t, t) = \sum_{i=1}^n w_i\,\epsilon^{(i)}_t, \quad \sum_{i=1}^n w_i = 1, \; w_i \geq 0$$

where $\epsilon^{(i)}_t$ is the noise estimate from the $i$-th DP and $w_i$ is its weight. The resulting composite score function is

$$s_{\text{comp}}(\tau_t, t) = \sum_{i=1}^n w_i\,s^{(i)}_\theta(\tau_t, t)$$

In practice, $L_2$ normalization may be applied to prevent any one modality's score from dominating:

$$\bar{\epsilon}^{(i)}_t = \epsilon^{(i)}_t / \big\|\epsilon^{(i)}_t\big\|_2, \quad \hat{\epsilon}_{\text{comp}} = \sum_i w_i\,\bar{\epsilon}^{(i)}_t$$
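The combination rule is only a few lines of code. The sketch below assumes the per-modality noise estimates have already been computed; `compose_eps` is an illustrative name, not from the papers' implementation.

```python
import numpy as np

def compose_eps(eps_list, weights, normalize=False):
    """Convex combination of per-modality noise estimates.
    eps_list: per-policy predictions eps^(i)_t; weights: convex coefficients."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-8
    if normalize:  # optional L2 normalization so no single score dominates
        eps_list = [e / np.linalg.norm(e) for e in eps_list]
    return sum(w * e for w, e in zip(weights, eps_list))
```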

The composite estimate replaces the unimodal one in the reverse denoising step:

$$\tau_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( \tau_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}_{\text{comp}}(\tau_t, t) \right) + \sigma_t \xi$$

MCDP thereby generates trajectories under a policy that integrates the strengths of all included modalities (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).

3. Functional Guarantees and System-Level Analysis

The MCDP construction has theoretical support in the form of one-step improvement and trajectory-level error bounds (Cao et al., 1 Oct 2025). Given $N$ policies $\pi_1, \ldots, \pi_N$ with time-$t$ score functions $s_i(\tau_t, t)$, the convexly composed score remains inside the convex hull of the parent policies:

$$\hat{s}_{\text{comp}}(\tau_t, t) = \sum_{i=1}^N w_i\,s_i(\tau_t, t), \quad w_i \geq 0, \ \sum_i w_i = 1$$

Define each policy's mean-squared error relative to the oracle score $s^*$ as $Q_i = \mathbb{E}\|s_i - s^*\|^2$. For $N = 2$, the error $Q(w)$ of the mixture score $w s_1 + (1-w) s_2$ is convex in $w$ with minimizer $w^*$:

$$Q(w^*) \leq \min\{Q(0),\, Q(1)\}$$

with strict inequality when the estimators' errors are non-aligned. This supports the empirical observation that MCDP samplers can outperform all constituent unimodal policies.
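A short derivation, consistent with the stated result, makes the convexity explicit; the error-correlation symbol $\rho$ below is introduced here for illustration and does not appear in the source.

```latex
% Expand Q(w) with \rho = \mathbb{E}\langle s_1 - s^*,\ s_2 - s^* \rangle:
\begin{align}
Q(w) &= \mathbb{E}\bigl\| w\,(s_1 - s^*) + (1-w)\,(s_2 - s^*) \bigr\|^2 \\
     &= w^2 Q_1 + (1-w)^2 Q_2 + 2\,w(1-w)\,\rho .
\end{align}
% Cauchy--Schwarz gives \rho \le \sqrt{Q_1 Q_2} \le \tfrac{1}{2}(Q_1 + Q_2),
% hence Q''(w) = 2(Q_1 + Q_2 - 2\rho) \ge 0 and Q is convex on [0,1].
% Its minimizer w^* can therefore only improve on the endpoints
% Q(0) = Q_2 and Q(1) = Q_1, strictly so whenever \rho < \min\{Q_1, Q_2\},
% i.e., whenever the two estimators' errors are not perfectly aligned.
```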

For the continuous-time sampling trajectory $x_{\hat{s}}(t)$ under the composed score, a Grönwall-type bound holds:

$$\mathbb{E}\,\|x_{\hat{s}}(T) - x^*(T)\| \leq \left(\int_0^T e^{2\int_t^T \tilde{L}(\tau)\,d\tau}\, L_s(t)^2\, dt\right)^{1/2} \left(\int_0^T \kappa(t)^2\, dt\right)^{1/2}$$

where $\tilde{L}(t)$, $L_x(t)$, $L_s(t)$, $\hat{\Lambda}(t)$, and $\kappa(t)$ are Lipschitz and score-error constants defined in the papers' functional analysis of system-level accuracy (Cao et al., 1 Oct 2025). This suggests that the benefits of composition in single denoising steps propagate consistently through the entire trajectory-generation process.

4. Algorithm and Weight Selection

The standard two-modality MCDP algorithm proceeds as follows (a code sketch follows the list):

  1. Initialize with the pre-trained unimodal DPs ($\pi_{\text{img}}$, $\pi_{\text{pcd}}$), their input encodings, and composition weights ($w_{\text{img}}$, $w_{\text{pcd}}$).
  2. Sample the initial noisy trajectory $\tau_T \sim \mathcal{N}(0, I)$.
  3. For $t = T, T-1, \ldots, 1$:
    • Compute the unimodal noise estimates $\epsilon_{\text{img}}$ and $\epsilon_{\text{pcd}}$.
    • Linearly blend them: $\epsilon_{\text{comp}} = w_{\text{img}}\,\epsilon_{\text{img}} + w_{\text{pcd}}\,\epsilon_{\text{pcd}}$.
    • Update $\tau_{t-1}$ using the composite $\epsilon_{\text{comp}}$.
  4. Return $\tau_0$ as the action trajectory.
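Written out, the loop amounts to the following minimal NumPy sketch. The callables `eps_img` and `eps_pcd` stand in for the frozen unimodal policies (their conditioning on $I_\text{rgb}$ and $P$ is assumed to be baked into the closures), and `mcdp_sample` is an illustrative name, not from the papers' code.

```python
import numpy as np

def mcdp_sample(eps_img, eps_pcd, w_img, alphas, sigmas, dim, rng=None):
    """Two-modality MCDP sampling: DDPM reverse updates driven by a
    convexly blended noise estimate from two frozen unimodal DPs."""
    rng = rng or np.random.default_rng()
    w_pcd = 1.0 - w_img                        # convex weights sum to 1
    alpha_bars = np.cumprod(alphas)            # \bar{alpha}_t = prod_i alpha_i
    tau = rng.standard_normal(dim)             # tau_T ~ N(0, I)
    for t in range(len(alphas) - 1, -1, -1):
        # Per-modality noise estimates, then the convex blend.
        eps_comp = w_img * eps_img(tau, t) + w_pcd * eps_pcd(tau, t)
        # Standard DDPM reverse step with the composite estimate.
        coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
        tau = (tau - coef * eps_comp) / np.sqrt(alphas[t])
        if t > 0:                              # no noise injection at the final step
            tau += sigmas[t] * rng.standard_normal(dim)
    return tau                                 # tau_0: the action trajectory
```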

A grid search over the weight simplex is performed, using empirical rollout success rates to select the optimal $w^*$. For $N = 2$, a coarse grid over $w_1 \in \{0, 0.1, 0.2, \ldots, 1\}$ with $w_2 = 1 - w_1$ suffices. Each candidate $w$ is evaluated via multiple rollouts, tracking task success, and the best-performing mixture is selected for deployment (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).
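The selection step might look like the following sketch, where `rollout_success_rate` is a hypothetical hook that deploys the composed policy with weights $(w_1, 1 - w_1)$ for several episodes and returns the empirical success rate.

```python
def select_weight(rollout_success_rate, grid=None, n_rollouts=20):
    """Grid search over the two-policy weight simplex: score each
    candidate w_1 by empirical rollout success and keep the best."""
    if grid is None:
        grid = [round(0.1 * i, 1) for i in range(11)]   # {0, 0.1, ..., 1}
    scores = {w: rollout_success_rate(w, n_rollouts) for w in grid}
    return max(scores, key=scores.get)                  # best-performing w_1
```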

5. Empirical Evaluation

Quantitative Results: RoboTwin and Robomimic

Empirical tests on the RoboTwin bimanual manipulation suite and on Robomimic/PushT show that MCDP generally outperforms both parent unimodal DPs whenever both achieve moderate performance ($\gtrsim 30\%$ success). For example:

| Task | DP_img | DP_pcd | MCDP (best $w$) |
|---|---|---|---|
| Empty Cup Place | 0.42 | 0.62 | 0.86 ($w_{\text{img}} = 0.4$) |
| Dual Bottles Pick (H) | 0.49 | 0.64 | 0.71 ($w_{\text{img}} = 0.3$) |
| Shoe Place | 0.37 | 0.36 | 0.60 ($w_{\text{img}} = 0.5$) |

  • If either parent policy is poor ($< 10\%$ success), MCDP cannot improve over the stronger policy (e.g., "Pick Apple Messy").
  • Optimal weights place heavier emphasis on the better-performing DP for each task (Cao et al., 16 Mar 2025).

On Robomimic, PushT, and RoboTwin, convex MCDP composition yields a $2$–$10\%$ improvement on standard benchmarks and approximately $10\%$ in real-world robotic setups:

| Method | Avg SR (%) | Δ vs. best parent |
|---|---|---|
| DP+MP | 41.41 | +2.22% |
| Florence-D+DP | 66.76 | +5.51% |
| π₀+FP | 88.94 | +2.52% |

Alternative composition operators such as logical AND/OR can achieve even greater gains, at the cost of per-step recomputation and more limited compatibility (e.g., they do not apply to flow models) (Cao et al., 1 Oct 2025).

Qualitative Insights

  • Action distributions transition smoothly between the behaviors of each unimodal policy as composition weights are varied, yielding trajectory interpolation.
  • Case studies highlight blending of complementary strengths; e.g., combining approach direction from vision with force estimation from point-cloud data for improved grasping (Cao et al., 16 Mar 2025).

6. Modality and Model Generality

MCDP, via the General Policy Composition (GPC) framework, is agnostic to the sensory modalities and model architectures of its DPs, allowing composition of:

  • Vision-only (RGB), point-cloud, vision–language–action (VLA) policies (e.g., Florence-DiT), and others.
  • Both diffusion- and flow-matching–based policies.

Significant empirical performance gains are observed when the parent policies offer complementary strengths. For heterogeneous combinations, MCDP boosts the average success rate (SR) by $5$–$7\%$ on RoboTwin (vision + point-cloud and VLA + VA pairs), and by similar margins on real-robot tasks (Cao et al., 1 Oct 2025).
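For flow-based parents, the same convex blending can plausibly be applied to the predicted velocity fields and integrated with a simple Euler scheme. The sketch below is an assumption-laden illustration, not the GPC implementation; the function names and the unit-time integration interval are hypothetical.

```python
import numpy as np

def composed_flow_sample(v_fields, weights, dim, n_steps=50, rng=None):
    """Euler integration of a convexly blended flow-matching policy.
    v_fields: callables (tau, t) -> velocity; weights: convex coefficients."""
    rng = rng or np.random.default_rng()
    tau = rng.standard_normal(dim)     # start from the noise distribution
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        # Blend the per-policy velocity predictions at (tau, t).
        vel = sum(w * f(tau, t) for w, f in zip(weights, v_fields))
        tau = tau + dt * vel           # Euler step along the blended field
    return tau
```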

Composition requires that the combined DPs share a trajectory/action space and use aligned diffusion schedules. The approach does not employ classifier-free guidance, and thus avoids the associated doubling of computational cost (Cao et al., 16 Mar 2025).

7. Limitations and Extensions

Manual weight tuning is currently required; suboptimal choices can degrade performance, especially if large weights are assigned to poor parent models. MCDP has so far been demonstrated primarily for two visual modalities; extension to additional modalities (e.g., tactile, language-conditioned, or proprioceptive DPs) is plausible.

Plausible implications include:

  • Adaptive weight tuning (online or via validation rollouts) could further improve results.
  • Extension to composition across domains and embodiments may be achieved by aligning latent action representations.
  • Investigation of asynchronous modality-specific schedulers and advanced diffusion solvers (such as DPM-Solver or Analytic-DPM) constitutes an open direction (Cao et al., 16 Mar 2025, Cao et al., 1 Oct 2025).

References

  • "Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition" (Cao et al., 16 Mar 2025)
  • "Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition" (Cao et al., 1 Oct 2025)
