
Composable Diffusion (CoDi) Framework

Updated 4 February 2026
  • Composable Diffusion (CoDi) is a modular framework that composes independent diffusion sub-models to generate complex, multimodal outputs.
  • It leverages score fusion and product-of-experts principles to integrate diverse concepts, data sources, and modalities without re-training.
  • The framework supports continual learning, selective forgetting, and logical constraints, making it practical for structured generative tasks.

Composable Diffusion (CoDi) is a methodological framework for constructing generative models whose outputs are governed by the composition of independently trained diffusion sub-models. This paradigm enables principled, zero-shot combination of concepts, modalities, data sources, or policies at inference, leveraging the linearity of the score function in diffusion models and the product-of-experts formalism of energy-based models. CoDi and its variants address the exponential combinatorics of compositional generative tasks—such as assembling complex scene specifications, synthesizing multimodal content, or enforcing logical constraints—while enabling modular training, continual learning, selective forgetting, and training data compartmentalization.

1. Mathematical Foundation of Composable Diffusion

At the core of CoDi is the observation that denoising diffusion models, trained to represent conditional distributions $p_i(x)$, can be interpreted as implicit energy-based models (EBMs) with energy $E_i(x) \approx -\log p_i(x)$ (Liu et al., 2022). Product-of-experts composition then yields a joint distribution

$$p_{\text{joint}}(x) \propto \prod_{i=1}^k p_i(x),$$

which, in energy terms, is $E_{\text{joint}}(x) = \sum_i E_i(x)$. The corresponding score (the gradient of the log-density) is additive:

$$\nabla_x \log p_{\text{joint}}(x) = \sum_{i=1}^k \nabla_x \log p_i(x).$$

In conditional contexts (e.g., classifier-free guidance or attribute composition), this generalizes to arbitrary logical operations (AND, NOT) at the cost of independence assumptions:

$$\nabla_x \log p(x \mid C_1 = c_1 \land C_2 = c_2) \approx \nabla_x \log p(x \mid C_1) + \nabla_x \log p(x \mid C_2) - \nabla_x \log p(x)$$

(Gaudi et al., 3 Mar 2025, Liu et al., 2022).

This foundation underpins a variety of practical architectures and composition schemes, including layer-wise compositional synthesis, multimodal joint generation, and score fusion across data partitions or modalities.
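As a concrete illustration of the additive-score identity, the sketch below verifies it numerically for one-dimensional Gaussian experts, a case where the product of densities is again Gaussian and the identity holds exactly:

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score (gradient of the log-density) of a 1-D Gaussian N(mu, sigma^2)."""
    return -(x - mu) / sigma**2

# Two "expert" densities p1, p2. Their normalized product is again Gaussian,
# with precision tau* = tau1 + tau2 and mean mu* = (tau1*mu1 + tau2*mu2)/tau*.
mu1, s1 = 0.0, 1.0
mu2, s2 = 3.0, 2.0
tau1, tau2 = 1.0 / s1**2, 1.0 / s2**2
tau_star = tau1 + tau2
mu_star = (tau1 * mu1 + tau2 * mu2) / tau_star

x = np.linspace(-2.0, 5.0, 50)
sum_of_scores = gaussian_score(x, mu1, s1) + gaussian_score(x, mu2, s2)
product_score = gaussian_score(x, mu_star, tau_star ** -0.5)

# The additive-score identity is exact for Gaussian experts.
assert np.allclose(sum_of_scores, product_score)
```

For general diffusion models the sub-model scores are network outputs rather than closed-form gradients, but the same summation is applied at every denoising step.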

2. Theoretical Guarantees and Failure Modes

Projective composition is the formal guarantee sought in compositional diffusion: for projections $\Pi_i: \mathbb{R}^n \to \mathbb{R}^{d_i}$, the composed model $\hat{p}$ is a projective composition if $\Pi_i \sharp \hat{p} = \Pi_i \sharp p_i$ for all $i$; that is, the composed distribution matches the marginals of each constituent in the relevant feature space (Bradley et al., 6 Feb 2025).

Linear score composition achieves projective composition when certain conditional-independence (factorized-conditional) or orthogonality assumptions hold in pixel or transformed feature spaces. Specifically, if the variable partitioning or an invertible feature map $\mathcal{A}$ induces mutual independence in the relevant coordinates, then reverse diffusion with the composed score recovers the target product distribution.

Failing these assumptions results in pathological outcomes:

  • “Bayes composition” with an unconditional background $p_u$ generally fails to yield projective compositions and can exhibit fractional or ambiguous outputs.
  • Non-orthogonal feature transforms invalidate the commutation between noising and reparameterization, leading to non-Lipschitz evolutions in Wasserstein space and generative instability.

The orthogonality heuristic—empirically validated using CLIP embeddings—offers a necessary, easily computable condition: mean difference vectors between constituent models and the background should be nearly orthogonal to predict successful composition in CoDi (Bradley et al., 6 Feb 2025).
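The heuristic itself is inexpensive to compute. Below is a minimal sketch using synthetic stand-in embeddings; in practice `emb_a`, `emb_b`, and `emb_bg` would be CLIP features of samples from the two constituent models and the background model, and the function name is illustrative:

```python
import numpy as np

def orthogonality_heuristic(emb_a, emb_b, emb_bg):
    """Cosine similarity between the mean-difference vectors of two concept
    embedding sets relative to a shared background set. Values near zero
    predict that linear score composition of the two concepts can succeed."""
    d_a = emb_a.mean(axis=0) - emb_bg.mean(axis=0)
    d_b = emb_b.mean(axis=0) - emb_bg.mean(axis=0)
    return float(d_a @ d_b / (np.linalg.norm(d_a) * np.linalg.norm(d_b)))

rng = np.random.default_rng(0)
bg = rng.normal(size=(64, 512))          # stand-in background embeddings
# Concept A shifts the background along axis 0, concept B along axis 1,
# so their mean-difference vectors are orthogonal by construction.
a = bg + np.eye(512)[0] * 5.0
b = bg + np.eye(512)[1] * 5.0
cos = orthogonality_heuristic(a, b, bg)  # near 0: composition predicted to work
```

Note that near-orthogonality is only a necessary condition; a small cosine similarity does not by itself guarantee a successful composition.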

3. Organizational and Algorithmic Instantiations

Modular Training and Inference-time Composition

Composable Diffusion is operationalized in various ways, all leveraging inference-time composition:

  • Component-wise concept models: Each concept (attribute, object, relation) is modeled by a conditional denoising network; compositions are performed at sampling with no additional fine-tuning (Liu et al., 2022).
  • Sharded data compartmentalization: Sub-models are trained on disjoint data or attribute subsets; score fusion at inference reconstructs or approximates the joint generative process, enabling selective forgetting, continual learning, and strict access control (Golatkar et al., 2023).
  • Cross-modality policies: Policies are independently trained per observation modality (e.g., RGB, point cloud), then fused by weighted score addition to construct a multimodal controller without retraining (Cao et al., 16 Mar 2025).
  • Layered and spatial composition: For tasks such as multi-layered image synthesis, per-layer denoisers with inter- and intra-layer attention modules are composed at each timestep (Huang et al., 2024).

Algorithmically, the generic composition scheme at each denoising step is

$$\epsilon_{\text{comp}} = \sum_i w_i\, \epsilon_i(x_t, t),$$

where $\epsilon_i$ is the predicted noise for expert $i$ and $w_i$ are scalar or pixel-wise weights (e.g., softmax masks in view fusion (Spiegl et al., 2024)). Variants include explicit background score subtraction, classifier-free guidance, and logical (AND/NOT) operations.
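The per-step fusion rule can be sketched as follows. `compose_eps` is an illustrative name, and the noise predictions here are random stand-ins for the outputs of real expert networks:

```python
import numpy as np

def compose_eps(eps_experts, weights, eps_uncond=None):
    """Fuse per-expert noise predictions at one denoising step.

    Plain fusion:           eps = sum_i w_i * eps_i
    Background subtraction: eps = eps_u + sum_i w_i * (eps_i - eps_u),
    the classifier-free-guidance / AND-composition variant.
    """
    eps_experts = np.stack(eps_experts)                       # (k, *x_shape)
    w = np.asarray(weights, dtype=float)
    w = w.reshape(-1, *([1] * (eps_experts.ndim - 1)))        # broadcast over x
    if eps_uncond is None:
        return (w * eps_experts).sum(axis=0)
    return eps_uncond + (w * (eps_experts - eps_uncond)).sum(axis=0)

# Toy demo with stand-in predictions for a (3, 8, 8) latent.
rng = np.random.default_rng(1)
eps1, eps2, eps_u = (rng.normal(size=(3, 8, 8)) for _ in range(3))

plain = compose_eps([eps1, eps2], [0.5, 0.5])
anded = compose_eps([eps1, eps2], [1.0, 1.0], eps_uncond=eps_u)
assert plain.shape == (3, 8, 8) and anded.shape == (3, 8, 8)
```

Pixel-wise weights (e.g., per-pixel softmax masks) drop in by passing arrays of shape `(k, *x_shape)` in place of the scalar weights, since the multiplication broadcasts either way.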

Training Objectives and Conditional Independence

Standard conditional diffusion models often fail to enforce the independence structure required by compositional inference, especially in non-uniform or partial training regimes (Gaudi et al., 3 Mar 2025). This leads to demonstrable failures in generating correct logical attribute compositions (e.g., AND/NOT), as measured by Conformity Score and Jensen-Shannon divergence.

The CoInD method augments the score-matching objective with a Fisher divergence penalty:

$$L_{\mathrm{CI}} = \mathbb{E}_{X,C}\left\|\nabla_X \log p_\theta(X \mid C) - \sum_i \nabla_X \log p_\theta(X \mid C_i) + (n-1)\,\nabla_X \log p_\theta(X)\right\|_2^2,$$

which is minimized alongside the denoising loss, ensuring the joint model factorizes as in the causal hypothesis. In practice, this corrects “composition by score addition” even under severe sparsity of the training support (Gaudi et al., 3 Mar 2025).
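Since the score is proportional to the predicted noise (up to the per-timestep scale $\sigma_t$), the penalty can be expressed directly on noise predictions. The sketch below is an illustrative simplification, not the authors' implementation; `coind_penalty` and the stand-in predictions are hypothetical:

```python
import numpy as np

def coind_penalty(eps_joint, eps_marginals, eps_uncond):
    """Conditional-independence residual in epsilon space (CoInD-style).

    Penalizes || eps(x|C) - sum_i eps(x|C_i) + (n-1) eps(x) ||^2, the
    epsilon-space analogue of the Fisher-divergence term above (a sigma_t^2
    scale factor is omitted for clarity)."""
    n = len(eps_marginals)
    residual = eps_joint - sum(eps_marginals) + (n - 1) * eps_uncond
    return float((residual ** 2).sum())

rng = np.random.default_rng(2)
e_u = rng.normal(size=(4, 4))    # unconditional prediction
e_c1 = rng.normal(size=(4, 4))   # prediction given C1
e_c2 = rng.normal(size=(4, 4))   # prediction given C2
# If the model factorizes exactly, eps(x|C1,C2) = eps(x|C1) + eps(x|C2) - eps(x)
# and the penalty vanishes.
e_joint = e_c1 + e_c2 - e_u
assert coind_penalty(e_joint, [e_c1, e_c2], e_u) < 1e-12
```

In training, this residual would be computed on the network's own predictions and added to the denoising loss with a weighting coefficient.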

4. Application Domains and Empirical Behavior

Structured Visual and Multimodal Generation

CoDi variants robustly generalize to combinatorial compositions far outside the training support:

  • Compositional visual generation: For scenes and faces with multi-attribute or relational queries, CoDi outperforms unfactored diffusion and GAN-based methods in both adherence and FID, achieving correct binding and relational structure even for zero-shot attribute combinations (Liu et al., 2022).
  • Multi-layered synthesis: LayerDiff introduces inter- and intra-layer attention together with self-mask guidance, attaining competitive FID and CLIP-Score to single-image baselines while enabling fine-grained layer control (inpainting, style transfer) (Huang et al., 2024).
  • Novel view synthesis: ViewFusion leverages composable denoising across multiple pose-free views with per-pixel softmax score fusion, outperforming NeRF-based and transformer baselines on small-scale datasets (Spiegl et al., 2024).
  • Arbitrary input-output mapping: The CoDi framework for any-to-any modality generation introduces bridging and latent alignments, supporting synchronized generation across text, image, audio, and video with competitive or state-of-the-art class fidelity (COCO-FID, MSR-VTT R@1, AudioCaps FAD) and semantic consistency metrics (Tang et al., 2023).

Policy and Task Composition

  • Modality Composable Policies: Modality-Composable Diffusion Policy (MCDP) fuses independently trained single-modality policies, yielding consistently higher task success rates than any constituent on the RoboTwin benchmark, especially when both unimodal experts are competent (Cao et al., 16 Mar 2025).
  • Instruction-following navigation: ComposableNav decomposes natural language instructions into atomic motion primitives, each implemented as an independent diffusion model fine-tuned by RL. This compositional approach achieves superior generalization to unseen instruction conjunctions in both simulation and hardware (SR ≈ 76% for $k=2$ instructions vs. <40% for baselines) (Hu et al., 22 Sep 2025).

Privacy, Forgetting, and Data Attribution

Compartmentalized Diffusion Models (CDMs) facilitate perfect selective forgetting, attribution, and à-la-carte access control: removing or updating a sub-model (trained on a data shard $D_i$) requires no retraining of the others, and class-conditional FID increases by at most 10% relative to a monolithic model. Per-sample data attribution is quantifiable via the proportion of diffusion-path log-likelihood sourced from each shard (Golatkar et al., 2023).
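A sketch of shard fusion with weight-based attribution follows. The paper's attribution is computed from diffusion-path log-likelihoods; the softmax weighting over per-shard scores shown here is a hypothetical simplification, and all names are illustrative:

```python
import numpy as np

def mixture_eps_and_attribution(eps_shards, log_w):
    """Mix per-shard noise predictions with softmax weights derived from
    per-shard log-scores; the normalized weights double as a per-step
    attribution signal over shards (hypothetical simplification)."""
    w = np.exp(log_w - np.max(log_w))          # numerically stable softmax
    w = w / w.sum()
    eps = sum(wi * ei for wi, ei in zip(w, eps_shards))
    return eps, w

rng = np.random.default_rng(3)
eps_shards = [rng.normal(size=(2, 2)) for _ in range(3)]
log_w = np.array([0.0, 1.0, -1.0])             # stand-in per-shard log-scores
eps, attribution = mixture_eps_and_attribution(eps_shards, log_w)

assert np.isclose(attribution.sum(), 1.0)      # attribution is a distribution
```

Forgetting a shard then amounts to dropping its entry from `eps_shards` and renormalizing, with no retraining of the remaining sub-models.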

5. Limitations, Practical Heuristics, and Extensions

Limitations of current CoDi methodology include:

  • Independence assumptions: Empirical and theoretical results show that compositional score addition is only valid under strict conditional independence or sufficient orthogonality between component domains/features (Bradley et al., 6 Feb 2025, Gaudi et al., 3 Mar 2025).
  • Scalability: As the number of composed primitives increases, generation quality can degrade due to accumulated conflicts in the score fields (Hu et al., 22 Sep 2025). Approaches such as energy-based sampling, MCMC, or HMC may improve robustness at large $k$.
  • Attribute supervision: Some approaches (e.g., CoInD) require labeled discrete attributes or reward heuristics, and do not yet generalize to unsupervised or open-set composition (Gaudi et al., 3 Mar 2025).
  • Heuristic validation: The orthogonality-of-means heuristic provides a necessary (not sufficient) criterion for composition, evaluated via cosine similarity in CLIP or feature space (Bradley et al., 6 Feb 2025).

Future directions include learned or meta-optimized weight selection for expert fusion, fully end-to-end parsing of compositional instructions, unsupervised factorization of component domains, and energy-based or sampling-improved joint generation methods (Hu et al., 22 Sep 2025, Bradley et al., 6 Feb 2025).

6. Representative Experimental Results

Empirical findings across domains consistently report that CoDi models, when composed under the outlined assumptions, achieve:

  • Structured generalization to unseen combination queries: FFHQ attribute adherence ≈69% vs. GAN baselines; multi-object CLEVR placement 31.4% accuracy vs. 7.3% for EBM baselines; compositional navigation SR 76% for $k=2$ vs. <40% for prior methods (Liu et al., 2022, Hu et al., 22 Sep 2025).
  • Data compartmentalization with minimal generative penalty: 8-split FID 6.54 (+10.3% vs. paragon FID 5.93), with TIFA text-image alignment improved by 14.33% over monolithic models (Golatkar et al., 2023).
  • Cross-modality or layer fusion with competitive or improved metrics: ViewFusion LPIPS 0.033, LayerDiff (+SMG) FID 21.3 outperforming vanilla SD, CoDi any-to-any generation matching or surpassing unimodal baselines (Spiegl et al., 2024, Huang et al., 2024, Tang et al., 2023).
  • Enhanced logical composition and controlled sample generation: CoInD delivers ≈7× gains in Conformity Score and halves JSD compared to naively composed diffusion models under partial training support (Gaudi et al., 3 Mar 2025).

7. Summary Table: Core Methodological Elements

| Approach/Domain | Training Structure | Inference-time Composition | Notable Capabilities |
|---|---|---|---|
| Compositional Vision (Liu et al., 2022) | Per-attribute model | Score/energy summation | Relational/logical composition, attribute binding |
| Any-to-any Modality (Tang et al., 2023) | Unimodal LDMs + paired cross-attention | Latent cross-attention + score sum | Synchronized, arbitrary joint outputs |
| Compartmentalization (Golatkar et al., 2023) | Data-sharded models | Weighted score mixture | Continual learning, selective forgetting |
| Multimodal Policy (Cao et al., 16 Mar 2025) | Per-modality diffusion policies | Weighted score fusion | Cross-modality control, no retraining |
| Multi-layered Composition (Huang et al., 2024) | Multi-layer, prompt-conditioned | Inter-/intra-layer attention, mask guidance | Object-wise editing and control |
| ComposableNav (Hu et al., 22 Sep 2025) | RL-finetuned motion primitives | Parallel noise addition | Instruction composition in navigation |
| Logical Composition (Gaudi et al., 3 Mar 2025) | CI-penalized joint model | Score composition (AND/NOT) | Arbitrary Boolean attribute composition |

CoDi provides a theoretically grounded, empirically validated, and methodologically modular framework for generative model composition across vision, audio, policy, and control, establishing a toolkit for structured, scalable, and privacy-respecting generative inference.
