Modality-Composable Diffusion Policy
- The paper introduces a framework that convexly combines independently trained diffusion policies via weighted score summation, yielding composite policy distributions at inference time without additional joint training.
- It leverages cross-modal cues from sensors like RGB vision and point clouds to enhance generalization and robustness in complex control tasks.
- Empirical results demonstrate that optimal expert weighting significantly boosts success rates, validating the approach in multi-modal robotic control.
A Modality-Composable Diffusion Policy is a framework in which multiple pre-trained diffusion-based policy models, each conditioned on a distinct sensory or contextual modality (e.g., RGB vision, point clouds, tactile, language), are flexibly combined during inference to yield superior or more robust policy distributions without the cost of additional joint training. This approach leverages principles from compositional score-based generative modeling and energy-based models, enabling cross-modal adaptability, improved generalization, and enhanced robustness in settings such as robotic control, multi-modal generative AI, and complex sensorimotor tasks.
1. Foundations of Diffusion Policy and the Modality Composition Problem
Diffusion Policies (DPs) are based on denoising diffusion probabilistic models (DDPMs) fit over entire action trajectories for visuomotor control. The general DP models the reverse process of a forward noising SDE, reconstructing expert demonstration trajectories from Gaussian noise via a learned score network, typically parameterized as a noise predictor $\epsilon_\theta$. For a given input modality $M$, the reverse step is

$$\tau^{t-1} = \alpha^t \left( \tau^t - \gamma^t\, \epsilon_\theta(\tau^t, t, M) \right) + \xi, \qquad \xi \sim \mathcal{N}(0, \sigma_t^2 I),$$

with $\alpha^t$, $\gamma^t$, and $\sigma_t$ the noise-schedule coefficients at step $t$.
However, training a single DP jointly on multiple heterogeneous modalities (e.g., RGB + point cloud) is statistically prohibitive, requiring exponentially more samples due to the curse of dimensionality. Most existing policies are trained per-modality, leading to sub-optimal exploitation of available multisensory cues.
The central question is: Can one combine multiple pre-trained, unimodal diffusion policies into a single, expressive multi-modal policy at inference time without additional joint training or retrofitting?
2. Inference-Time Modality Composition: Formulation and Algorithm
The Modality-Composable Diffusion Policy (MCDP) addresses the above problem via distribution-level composition at test time. Suppose there are $K$ pre-trained policies $\pi_1, \dots, \pi_K$, each conditioned on a modality $M_i$. The composite trajectory distribution is defined as

$$p_{\mathrm{comp}}(\tau) \;\propto\; \prod_{i=1}^{K} p_i(\tau \mid M_i)^{\,w_i},$$

where $w_i \geq 0$ are non-negative weights, typically normalized so that $\sum_i w_i = 1$. By properties of score-based models and energy-based models, the composite score is the weighted sum of the component scores,

$$\nabla_\tau \log p_{\mathrm{comp}}(\tau) = \sum_{i=1}^{K} w_i\, \nabla_\tau \log p_i(\tau \mid M_i),$$

and in noise-prediction parameterization,

$$\epsilon^{\star}(\tau, t) = \sum_{i=1}^{K} w_i\, \epsilon_i(\tau, t, M_i).$$
Sampling Procedure: For an example with two modalities, the denoising loop proceeds as follows:
```python
# Pseudocode: α, γ, σ are the noise-schedule coefficients; π_img, π_pcd are the
# pre-trained unimodal experts; w_img, w_pcd are the composition weights.
τ = sample_normal(shape)                        # start from pure Gaussian noise
for t in reversed(range(N)):
    ε_img = π_img(τ, t, M_img)                  # noise prediction from the RGB expert
    ε_pcd = π_pcd(τ, t, M_pcd)                  # noise prediction from the point-cloud expert
    ε_star = w_img * ε_img + w_pcd * ε_pcd      # weighted composition of the two predictions
    ξ = normal(0, σ[t]**2 * I) if t > 0 else 0  # no noise injected at the final step
    τ = α[t] * (τ - γ[t] * ε_star) + ξ          # one reverse (denoising) step
τ_0 = τ                                         # denoised action trajectory
```
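The loop above is schematic; a self-contained, runnable version can look like the following minimal NumPy sketch. The stub experts, toy DDPM schedule, and dimensions here are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

# Illustrative assumptions: toy dimensions, toy DDPM schedule, stub experts.
N, horizon, act_dim = 50, 16, 7                      # denoising steps, action horizon, action dim
betas = np.linspace(1e-4, 2e-2, N)                   # toy variance schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def pi_img(tau, t, M_img):                           # stand-in for a pre-trained RGB diffusion policy
    return np.zeros_like(tau)                        # a real expert returns its noise prediction

def pi_pcd(tau, t, M_pcd):                           # stand-in for a pre-trained point-cloud policy
    return np.zeros_like(tau)

def mcdp_sample(M_img, M_pcd, w_img=0.4, w_pcd=0.6, seed=0):
    rng = np.random.default_rng(seed)
    tau = rng.standard_normal((horizon, act_dim))    # start from Gaussian noise
    for t in reversed(range(N)):
        eps_star = w_img * pi_img(tau, t, M_img) + w_pcd * pi_pcd(tau, t, M_pcd)
        # Standard DDPM reverse step; in the notation above, α^t = 1/sqrt(alphas[t])
        # and γ^t = (1 - alphas[t]) / sqrt(1 - alpha_bars[t]).
        gamma_t = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
        tau = (tau - gamma_t * eps_star) / np.sqrt(alphas[t])
        if t > 0:                                    # no noise injected at the final step
            tau += np.sqrt(betas[t]) * rng.standard_normal(tau.shape)
    return tau                                       # composite denoised action trajectory

trajectory = mcdp_sample(M_img=None, M_pcd=None)
print(trajectory.shape)                              # (16, 7)
```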
3. Theoretical Justification and Guarantees
The distribution-level composition draws from theory on compositional energy-based models. The product of marginal distributions (as weighted geometric means) yields a valid joint posterior, provided the component scores are well-calibrated. Summing independent scores achieves a "mixture-of-experts" effect in the score field.
["Compose Your Policies!" (Cao et al., 1 Oct 2025)] establishes that a convex combination of pre-trained scores strictly reduces pointwise mean squared score error (unless the estimators are perfectly aligned) and, crucially, this improvement propagates along the entire generation trajectory according to a Grönwall-type stability bound. Formally, for two estimators at a given with errors , the expected error of the convex composition :
is convex in and . This one-step improvement extends over all time steps.
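The identity above is elementary and can be checked numerically; the following sketch uses random vectors as stand-in estimators and is purely illustrative, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, w = 32, 100_000, 0.3                                # dimension, Monte Carlo samples, weight

eps_true = rng.standard_normal((n, d))                    # ground-truth noise targets
eps_hat1 = eps_true + 0.8 * rng.standard_normal((n, d))   # noisier estimator
eps_hat2 = eps_true + 0.5 * rng.standard_normal((n, d))   # better estimator

def mse(a, b):
    return np.mean(np.sum((a - b) ** 2, axis=1))

lhs = mse(w * eps_hat1 + (1 - w) * eps_hat2, eps_true)    # error of the convex composition
rhs = (w * mse(eps_hat1, eps_true) + (1 - w) * mse(eps_hat2, eps_true)
       - w * (1 - w) * mse(eps_hat1, eps_hat2))           # right-hand side of the identity
print(lhs, rhs)                                           # agree up to Monte Carlo error
```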
In practice, when both base experts are reasonable (roughly ≥ 30% SR), the composite policy often outperforms either constituent. When one expert is poor, the composite is at best on par with the stronger constituent.
4. Empirical Evaluation and Practical Implementation
MCDP has been empirically validated primarily on the RoboTwin dual-arm manipulation benchmark, combining RGB- and point-cloud-based diffusion experts. Performance is measured via success rate over multiple rollouts. In the principal experiments:
| Task | DP_img | DP_pcd | MCDP (best $w$) |
|---|---|---|---|
| Empty Cup Place | 0.42 | 0.62 | 0.86 |
| Dual Bottles Pick (Hard) | 0.49 | 0.64 | 0.71 |
| Shoe Place | 0.37 | 0.36 | 0.60 |
| Dual Shoes Place | 0.08 | 0.23 | 0.20 |
| Pick Apple Messy | 0.05 | 0.26 | 0.25 |
| Dual Bottles Pick (Easy) | 0.77 | 0.36 | 0.85 |
| Block Hammer Beat | 0.00 | 0.76 | 0.61 |
- When both unimodal experts are above 30% SR, composition yields substantial improvements (up to roughly 24 points absolute over the stronger constituent in the table above).
- If one expert fails (<10%), composite success remains capped by the best constituent.
- The optimal weight $w$ consistently favors the stronger expert (a minimal grid-search sketch for selecting it follows below).
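The sketch below illustrates one plausible way to select $w$ at test time: sweep a grid of weights and keep the value with the highest empirical success rate over a handful of evaluation rollouts. The `rollout_success_rate` helper, the `policy.run_episode` interface, and the grid granularity are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def rollout_success_rate(env, policy, w_img, n_rollouts=20):
    """Hypothetical helper: run the composite policy with weights (w_img, 1 - w_img)
    for n_rollouts episodes and return the fraction of successful episodes."""
    successes = 0
    for _ in range(n_rollouts):
        successes += int(policy.run_episode(env, w_img=w_img, w_pcd=1.0 - w_img))
    return successes / n_rollouts

def select_weight(env, policy, grid=np.linspace(0.0, 1.0, 11)):
    # Evaluate each candidate weight and keep the best-performing one.
    scores = [rollout_success_rate(env, policy, w) for w in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]
```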
["Compose Your Policies!" (Cao et al., 1 Oct 2025)] generalizes this to heterogeneous modalities (e.g., vision-action, vision-language-action, or flow-based policies) and demonstrates SR improvements across Robomimic, PushT, RoboTwin, and in real robots.
5. Extensions: Modality Prioritization and Structured Fusion
Several advanced frameworks extend modality-composable diffusion policies beyond flat score addition.
Factorized Diffusion Policies (FDP): ["Factorizing Diffusion Policies for Observation Modality Prioritization" (Patil et al., 20 Sep 2025)] shows that factorizing the full log-likelihood into a prioritized branch (e.g., proprioception) and a residual branch (e.g., vision, tactile) makes modality prioritization explicit. This is achieved by first training a base DP on the prioritized subset of modalities and then training a residual network to match the difference between the full-modality score and the base score. Residual corrections are applied to the base policy's output either at each transformer block or via simple output addition (a minimal sketch of the output-addition variant follows the list below). Practical findings:
- FDP gains 15–40 pp absolute SR in low-data and out-of-distribution (e.g., visual distractor) settings compared to standard joint-modality DP.
- Best results occur when modality prioritization aligns with task structure (e.g., proprioception priority for repetitive motions, vision priority for spatial variability).
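A minimal sketch of the output-addition variant, assuming both branches expose a noise-prediction interface; the class and method names are illustrative, not FDP's actual API.

```python
class FactorizedNoisePredictor:
    """Illustrative composition: the base expert sees only the prioritized modalities,
    while the residual network sees all modalities and corrects the base prediction."""

    def __init__(self, base_policy, residual_net):
        self.base = base_policy        # trained on prioritized modalities only (e.g., proprioception)
        self.residual = residual_net   # trained afterwards to predict the remaining score correction

    def predict_noise(self, tau, t, obs_prioritized, obs_all):
        eps_base = self.base(tau, t, obs_prioritized)
        eps_residual = self.residual(tau, t, obs_all)
        return eps_base + eps_residual  # output addition: ε = ε_base + ε_residual
```

The resulting predictor can be dropped into the same denoising loop shown in Section 2, in place of a single expert's noise prediction.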
Composable Diffusion for Generative AI: ["Any-to-Any Generation via Composable Diffusion" (Tang et al., 2023)] applies similar composable policy logic to multi-modal generative models, aligning prompt/token encoders and latent spaces, then inferring with cross-attention between modality-specific denoisers.
6. Generalization Across Modalities, Domains, and Embodiments
MCDP and its theoretically related frameworks support several axes of generalization:
- Cross-modality: No barrier exists to composing DPs for vision, language, tactile, or any modality, provided score networks are available. For instance, the formalism extends to composition of language-conditioned, tactile-conditioned, and visual policies.
- Cross-domain: Provided the same action/trajectory space is defined, experts trained for distinct domains (e.g., table-top vs. door-opening) may be composed to elicit hybrid behaviors.
- Cross-embodiment: With shared action parameterization, DPs trained on different robots can be composed, supporting behavioral transfer.
A central prerequisite for success is that score fields induced by the expert policies are not fundamentally at odds; highly conflicting vector fields can result in oscillatory or collapsed composite policies. This suggests that careful weight tuning and expert alignment are essential for safe composability.
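One simple way to probe for such conflicts, offered here as an illustrative diagnostic rather than a method from the cited papers, is to compare the experts' noise predictions on a shared batch of noised trajectories: strongly negative cosine similarity suggests the score fields pull in opposing directions.

```python
import numpy as np

def score_agreement(eps_a, eps_b):
    """Mean cosine similarity between two experts' noise predictions.
    eps_a, eps_b: arrays of shape (batch, horizon * act_dim)."""
    num = np.sum(eps_a * eps_b, axis=1)
    den = np.linalg.norm(eps_a, axis=1) * np.linalg.norm(eps_b, axis=1) + 1e-8
    return float(np.mean(num / den))   # near +1: aligned; near -1: conflicting score fields
```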
7. Limitations and Future Directions
Key constraints include:
- Manual Weight Selection: The weights $w_i$ are typically hand-tuned or selected via a small test-time search, although automation (e.g., maximizing held-out SR or incorporating expert uncertainty) is an identified future direction.
- Score Field Conflict: Strongly contradictory expert policies can yield degenerate behavior.
- Scalability: Each additional expert adds one forward pass of per-step inference cost, so cost grows linearly in the number of experts $K$, though this is manageable for small $K$.
Open research questions include theoretical analyses of the stability of mixed score networks, data-efficient weight estimation strategies, automated prioritization in dynamically changing sensor environments, and extension to time-varying or task-adaptive weights.
In summary, modality-composable diffusion policy frameworks provide a simple yet theoretically principled route to test-time, multi-distribution generalization by convexly combining the score fields of pre-trained diffusion models. With demonstrated empirical gains in multi-modal robotic control and generative modeling, these methods promise to substantially reduce cross-modal sample complexity, allow plug-and-play expert fusion, and improve both adaptability and robustness in complex, multi-sensor domains.