Modular Diffusion Policy Framework
- Modular Diffusion Policy Framework is an architectural paradigm that decouples diffusion-based decision policies into explicit modules for improved multi-task performance.
- It integrates distinct components such as perception, code generation, diffusion, and guidance to handle multimodal inputs and ambiguous conditions.
- Its design enables efficient adaptation, transfer learning, and continual improvement across robotics, reinforcement learning, and multi-agent systems.
A modular diffusion policy framework is an architectural paradigm for decision-making policies based on denoising diffusion probabilistic models (DDPMs), where key architectural, functional, and training components are decoupled into explicit modules, each with well-defined interfaces and roles. Such frameworks are designed to address challenges in multitask learning, multimodal action distributions, variable environmental or sensory conditions, ambiguous instructions, transfer and continual adaptation, and requirements for interpretability and robustness. Modular diffusion policy frameworks have been developed for domains including robotic manipulation, offline and online RL, observation-modality prioritization, imitation learning, multi-agent coverage control, and networked system diffusion, among others (Yuan, 2024, Liu et al., 26 Dec 2025, Patil et al., 20 Sep 2025, Chen et al., 19 May 2025, Yin et al., 19 Jun 2025, Defilippo et al., 3 Jun 2025, Wang et al., 13 Feb 2025, Li et al., 2024, Dong et al., 2024, Xu et al., 5 Sep 2025, Vatnsdal et al., 21 Sep 2025).
1. Principles and Motivation
Modularity in diffusion policy design addresses both epistemological and practical limitations of monolithic or end-to-end architectures. When a model is forced to jointly encode high-level semantic understanding, perception, and low-level action synthesis—as in language-conditioned robot policies—generalization and interpretability are often poor under distribution shift or task ambiguity. By decomposing the system into explicit modules (e.g., perception, attention, code generation, modality adapters, expert routing, guidance), each can be independently engineered, trained, replaced, or extended.
In multitask or highly multimodal settings, a single diffusion model may underfit diverse or competing skill modes and be brittle when exposed to new behaviors. Modularization—via expert factorization, residual adapters, or routed mixture-of-experts—yields better mode coverage and stable adaptation to new task components without catastrophic forgetting (Liu et al., 26 Dec 2025, Patil et al., 20 Sep 2025, Xu et al., 5 Sep 2025). The modular abstraction is also crucial for system-level integration in practical robot pipelines, policy-guided planning, or interpretability-driven deployments (Yin et al., 19 Jun 2025, Dong et al., 2024, Defilippo et al., 3 Jun 2025).
2. Modular Architectural Components
Modular frameworks are typically organized as directed dataflow graphs where each node is a module with rigorously specified I/O:
- Perception/Embedding: Extracts scene representations from sensory streams via CNNs, point-cloud networks, or spatial transformers; can include multi-view fusion, object detection, and instance segmentation (Yin et al., 19 Jun 2025, Vatnsdal et al., 21 Sep 2025).
- Code/Instruction Generation: For language-driven tasks, a vision-language model (VLM) module emits interpretable code, which is executed to produce attention masks or reference zones (Yin et al., 19 Jun 2025).
- Attention Modules: Generate attention maps (e.g., 2D or 3D binary masks) to localize task-relevant spatial regions; typically fused with point clouds or feature tensors before diffusion (Yin et al., 19 Jun 2025).
- Diffusion Modules: One or more score-based denoising networks; may be U-Net, Transformer, MLP, or specialized backbones. Can be factorized as per-skill "experts" (Liu et al., 26 Dec 2025), or further modularized via prioritized residual adapters (Patil et al., 20 Sep 2025), or modulated by context with attention or FiLM (Wang et al., 13 Feb 2025, Yuan, 2024).
- Guidance Modules: Provide external value- or reward-based signal for classifier-free or reward-guided denoising; can be pre-trained and swapped at will (Chen et al., 19 May 2025, Dong et al., 2024).
- Router/Adapter: Routes observations to expert modules or adjusts modular outputs for composition (e.g., soft MoE, residual, product-of-experts).
- Planner/Execution Control: Receding-horizon, block execution, and chunked rollout modules integrate sample-efficient planning and action release (Yuan, 2024).
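As a concrete (non-normative) illustration, the dataflow-graph organization above can be sketched as minimal Python interfaces. All class names, shapes, and the toy featurizer below are illustrative stand-ins, not APIs from any cited framework:

```python
from dataclasses import dataclass
from typing import Protocol
import numpy as np

class Module(Protocol):
    """Illustrative interface: every node in the dataflow graph maps typed inputs to outputs."""
    def __call__(self, **inputs): ...

@dataclass
class Perception:
    """Maps a raw observation to a fixed-size scene embedding (stand-in for a CNN/point-cloud net)."""
    dim: int = 8
    def __call__(self, obs: np.ndarray) -> np.ndarray:
        return np.tanh(obs[: self.dim])          # toy featurizer

@dataclass
class Attention:
    """Produces a binary mask localizing task-relevant entries of the embedding."""
    def __call__(self, feat: np.ndarray) -> np.ndarray:
        return (feat > 0).astype(float)

@dataclass
class DiffusionHead:
    """Toy 'denoiser': one linear map from masked features to an action."""
    weight: np.ndarray
    def __call__(self, feat: np.ndarray, mask: np.ndarray) -> np.ndarray:
        return self.weight @ (feat * mask)

# Composing the graph: perception -> attention -> diffusion head.
rng = np.random.default_rng(0)
perceive, attend = Perception(), Attention()
head = DiffusionHead(weight=rng.normal(size=(2, 8)))

obs = rng.normal(size=16)
feat = perceive(obs)
action = head(feat, attend(feat))
print(action.shape)  # (2,)
```

Because each node exposes only typed inputs and outputs, any node (e.g., the attention module) can be replaced or retrained without touching the others — the property the modular designs above rely on.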
Example architecture from CodeDiffuser (Yin et al., 19 Jun 2025):
- VLM-based code generation (language + multi-view RGB-D → Python code)
- Executable perception APIs (code→3D point clouds, instance masks, attention)
- Diffusion-based action generator (PointNet++ backbone, conditioning on point clouds and attention)
3. Mathematical Formulation of Modular Diffusion Policy
The central mathematical element is the DDPM, parameterizing the policy distribution as a learned reverse process. In modular settings, both policy and conditioning are factorized:
- Single-module DDPM (Yuan, 2024, Dong et al., 2024):
Forward process: $q(a^k \mid a^{k-1}) = \mathcal{N}\big(a^k;\ \sqrt{1-\beta_k}\,a^{k-1},\ \beta_k I\big)$
Reverse update: $a^{k-1} = \frac{1}{\sqrt{\alpha_k}}\Big(a^k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta(a^k, k, o)\Big) + \sigma_k z$, with $z \sim \mathcal{N}(0, I)$, $\alpha_k = 1-\beta_k$, $\bar{\alpha}_k = \prod_{j \le k} \alpha_j$
Loss (denoising score-matching): $\mathcal{L} = \mathbb{E}_{k,\,a^0,\,\epsilon}\big[\,\|\epsilon - \epsilon_\theta(a^k, k, o)\|^2\,\big]$
- Modular/Factorized Policy (Liu et al., 26 Dec 2025, Patil et al., 20 Sep 2025):
Product-of-experts for modules: $\pi_\theta(a \mid o) \propto \prod_{i=1}^{M} \pi_i(a \mid o)^{\,w_i(o)}$
Each $\pi_i$ is a module (e.g., skill expert, prioritized modality) with router weight $w_i(o)$. The reverse diffusion step aggregates the weighted sum of module scores: $\epsilon_\theta(a^k, k, o) = \sum_{i=1}^{M} w_i(o)\,\epsilon_i(a^k, k, o)$
Residual adapters (modality prioritization): $\epsilon_\theta(a^k, k, o) = \epsilon_{\text{base}}(a^k, k, o_{\text{prio}}) + \epsilon_{\text{res}}(a^k, k, o_{\text{all}})$,
with training objectives ensuring only the residual learns the correction for weaker modalities (Patil et al., 20 Sep 2025).
- Guidance decoupling (Chen et al., 19 May 2025):
A guidance model $Q_\phi(o, a)$ is pretrained and then frozen; the diffusion policy $\epsilon_\theta$ is trained via MSE. At inference, $Q_\phi$ is applied via classifier-free or reward-guided corrections to the predicted noise (e.g., gradient guidance of the form $\hat{\epsilon} = \epsilon_\theta - \lambda\sqrt{1-\bar{\alpha}_k}\,\nabla_{a^k} Q_\phi$).
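The factorized reverse process can be illustrated numerically. The toy linear "experts", softmax router, and linear variance schedule below are illustrative assumptions, not the cited methods:

```python
import numpy as np

rng = np.random.default_rng(1)
K, act_dim, n_experts = 10, 2, 3

# Variance schedule and derived quantities (standard DDPM definitions).
betas = np.linspace(1e-4, 0.2, K)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Toy expert "denoisers": each predicts noise with a fixed linear map of the action.
experts = [rng.normal(scale=0.1, size=(act_dim, act_dim)) for _ in range(n_experts)]

def eps_expert(i, a, k, o):
    return experts[i] @ a                # stand-in for a learned score network

def router_weights(o):
    logits = np.array([o.sum(), -o.sum(), 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()                   # softmax routing over experts

def reverse_step(a, k, o):
    """One reverse update with router-weighted score aggregation."""
    w = router_weights(o)
    eps = sum(w[i] * eps_expert(i, a, k, o) for i in range(n_experts))
    mean = (a - betas[k] / np.sqrt(1 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
    z = rng.normal(size=act_dim) if k > 0 else 0.0
    return mean + np.sqrt(betas[k]) * z

# Full reverse chain: start from Gaussian noise, denoise to an action sample.
o = rng.normal(size=4)
a = rng.normal(size=act_dim)
for k in reversed(range(K)):
    a = reverse_step(a, k, o)
print(a.shape)  # (2,)
```

The only modular ingredient is inside `reverse_step`: the aggregated score is a router-weighted sum of per-expert scores, so swapping or adding an expert changes one term without altering the diffusion machinery.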
4. Training, Adaptation, and Inference Mechanisms
Training regimes, update schedules, and inference procedures all reflect the modularity:
- End-to-end joint training: All modules (e.g., expert scores and routers) updated by soft-aggregated loss, ensuring every module gets a signal (Liu et al., 26 Dec 2025).
- Guidance-first or guidance-decoupled: Guidance module (e.g., Q-function) is pretrained on off-policy data, then fixed; the diffusion policy is subsequently trained, reducing memory footprint and accelerating sample efficiency by avoiding early-stage guidance "noise" (Chen et al., 19 May 2025).
- Residual-base training: π_base is first learned on prioritized modalities, frozen, then π_res is trained on all modalities as a correction (Patil et al., 20 Sep 2025).
- Plug-and-play adaptation: For lifelong/continual learning, new modules are appended (via upcycling or copying) and only the new component and router re-trained. Frozen components resist catastrophic forgetting (Liu et al., 26 Dec 2025).
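The residual-base recipe (fit a base on prioritized modalities, freeze it, then fit only a residual correction on all modalities) can be sketched with closed-form least squares standing in for diffusion training; the shapes and synthetic targets are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_prio, d_all = 200, 3, 5

# Synthetic denoising targets: the "true" noise depends on all modalities.
X_all = rng.normal(size=(n, d_all))
X_prio = X_all[:, :d_prio]                    # prioritized subset of modalities
true_w = rng.normal(size=d_all)
eps_target = X_all @ true_w

# Stage 1: fit the base denoiser on prioritized modalities only, then freeze it.
w_base, *_ = np.linalg.lstsq(X_prio, eps_target, rcond=None)
base_pred = X_prio @ w_base                   # frozen from here on

# Stage 2: train only the residual, on all modalities, to correct the base.
residual_target = eps_target - base_pred
w_res, *_ = np.linalg.lstsq(X_all, residual_target, rcond=None)

final_pred = base_pred + X_all @ w_res
base_err = np.mean((eps_target - base_pred) ** 2)
final_err = np.mean((eps_target - final_pred) ** 2)
print(final_err <= base_err)  # True: the residual can only reduce the base error here
```

The ordering matters: because the base is frozen before stage 2, the residual is forced to absorb exactly the error attributable to the weaker, non-prioritized modalities.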
Inference involves composing module outputs, typically by:
- Evaluating router weights or priorities,
- Aggregating expert/module scores (soft or sparse weighted sum),
- Running reverse diffusion to obtain action(s),
- Executing via receding horizon or block-wise planning (Yuan, 2024).
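The receding-horizon pattern in the last step can be sketched as follows; the chunk sampler and toy dynamics are illustrative stand-ins for the reverse-diffusion sampler and the environment:

```python
import numpy as np

rng = np.random.default_rng(3)
H, k_exec, T = 8, 2, 10   # planning horizon, executed chunk length, episode length

def sample_action_chunk(obs, horizon):
    """Stand-in for the reverse-diffusion sampler: returns a horizon of actions."""
    return np.tanh(obs) + 0.1 * rng.normal(size=horizon)

obs = 0.5
executed = []
while len(executed) < T:
    chunk = sample_action_chunk(obs, H)   # plan H steps ahead
    for a in chunk[:k_exec]:              # execute only the first k_exec steps
        executed.append(float(a))
        obs = 0.9 * obs + 0.1 * a         # toy dynamics update
    # then replan from the new observation (receding horizon)
executed = executed[:T]
print(len(executed))  # 10
```

Executing only a short prefix of each sampled chunk trades extra sampling cost for reactivity: the policy re-conditions on fresh observations every `k_exec` steps rather than committing to a full open-loop plan.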
For attention-modulated systems (e.g., point cloud + 3D attention), no explicit cross-attention layers are required: conditioning vectors are concatenated or fused via residual skips (Yin et al., 19 Jun 2025).
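A minimal sketch of such concatenation-based fusion, assuming a single linear layer as a stand-in for the denoiser backbone:

```python
import numpy as np

rng = np.random.default_rng(4)
feat_dim, cond_dim, out_dim = 6, 4, 2

# Fusion by concatenation: no cross-attention, just [features ; condition].
W = rng.normal(scale=0.1, size=(out_dim, feat_dim + cond_dim))

def fused_denoiser(noisy_action_feat, cond):
    x = np.concatenate([noisy_action_feat, cond])  # simple concat fusion
    return W @ x                                   # one linear layer as a stand-in

out = fused_denoiser(rng.normal(size=feat_dim), rng.normal(size=cond_dim))
print(out.shape)  # (2,)
```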
5. Modularity for Interpretability, Robustness, and Scalability
Key advantages empirically demonstrated include:
- Interpretability: With explicit attention maps, code, or expert module routing, system decisions can be externally inspected, analyzed, or debugged (e.g., visualizing which expert handled a scenario, or which spatial region was attended) (Yin et al., 19 Jun 2025, Xu et al., 5 Sep 2025).
- Robustness to Distribution Shift: Modular residual adaptation enables robust recovery under input noise, distractors, or OOD conditions, as weaker/ambiguous modalities are automatically downweighted at inference (Patil et al., 20 Sep 2025).
- Sample and Compute Efficiency: By decoupling, only the necessary modules are updated when a new task is introduced, so adaptation touches only a fraction of the parameters (roughly 27% of the full model in (Liu et al., 26 Dec 2025)).
- Scalability to Multi-Agent or Multi-Task Settings: For decentralized or agent-based systems, modularity facilitates set-based processing and variable agent counts without dimension mismatch (Vatnsdal et al., 21 Sep 2025).
- Reusability and Transfer: Pretrained guidance, perception, or skill modules can be interchanged or reused across pipelines; cross-module transfer can smooth variance and accelerate convergence (Chen et al., 19 May 2025).
- Empirical Gains: Modular frameworks achieve higher average and worst-case success rates compared to monolithic, joint, or naïve MoE baselines, with consistent gains as the number or diversity of tasks scales (Liu et al., 26 Dec 2025, Patil et al., 20 Sep 2025, Dong et al., 2024).
6. Exemplar Implementations and Library Support
Highly modular and extensible libraries such as CleanDiffuser provide explicit submodules for diffusion modeling, network backbones, guidance, and policy wrappers, together with APIs for swapping (1) noise schedules, (2) backbones, (3) solvers, (4) guidance mechanisms, and (5) planning or RL wrappers (Dong et al., 2024). CodeDiffuser instantiates VLM-based code generation, perception-attention APIs, and a diffusion-based action generator connected via explicit interfaces (Yin et al., 19 Jun 2025). Table-top planners and RL agents (DQL, DPPO, DDiffPG) encapsulate module composition for advanced action synthesis, adaptation, and exploration (Li et al., 2024, Ren et al., 2024, Yang et al., 2023). ExDiff modularizes simulation, policy intervention, and XAI explainability for network diffusion, admitting new policy types as plug-ins (Defilippo et al., 3 Jun 2025).
Empirical recommendations generally favor: (a) modularization when task domains, input modalities, or interaction structure are heterogeneous; (b) factorized architectures for robust multitask and continual learning environments; and (c) explicit module-reuse to optimize adaptation and system-level scaling (Liu et al., 26 Dec 2025, Patil et al., 20 Sep 2025, Dong et al., 2024, Yin et al., 19 Jun 2025).
7. Applications, Limitations, and Outlook
Modular diffusion policy frameworks have been validated across robotic manipulation with ambiguous language (Yin et al., 19 Jun 2025), multitask and continual learning (Liu et al., 26 Dec 2025), observation-modality prioritization for robustness (Patil et al., 20 Sep 2025), end-to-end autonomous driving with knowledge-driven expert routing (Xu et al., 5 Sep 2025), decentralized multi-agent coverage (Vatnsdal et al., 21 Sep 2025), and explainable simulation of complex network diffusion (Defilippo et al., 3 Jun 2025). They underpin recent advances in sim-to-real RL transfer, lifelong skill composition, adaptive perception, and safety-critical planning.
Limitations include additional design and tuning overhead, possible increased inference latency (when many modules are engaged), and current reliance on manually specified module boundaries or hyperparameters for routing and prioritization. Open challenges involve fully dynamic module composition, lifelong expansion without manual intervention, and seamless integration with emerging foundation models for vision-language-action.
Key sources:
- "Unpacking the Individual Components of Diffusion Policy" (Yuan, 2024)
- "Flexible Multitask Learning with Factorized Diffusion Policy" (Liu et al., 26 Dec 2025)
- "Factorizing Diffusion Policies for Observation Modality Prioritization" (Patil et al., 20 Sep 2025)
- "Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL" (Chen et al., 19 May 2025)
- "CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity" (Yin et al., 19 Jun 2025)
- "CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making" (Dong et al., 2024)
- "A Knowledge-Driven Diffusion Policy for End-to-End Autonomous Driving Based on Expert Routing" (Xu et al., 5 Sep 2025)
- "Scalable Multi Agent Diffusion Policies for Coverage Control" (Vatnsdal et al., 21 Sep 2025)
- "Policy Representation via Diffusion Probability Model for Reinforcement Learning" (Yang et al., 2023)