Compositional Diffusion Models
- Compositional diffusion models are structured generative frameworks that modularly combine independent diffusion processes to enable controllable synthesis and scalable scene complexity.
- They leverage linear and energy-based score combinations, using projective composition and constrained optimization to integrate separately trained generative components.
- Practically, these models enhance compositional fidelity in applications like photorealistic imagery, trajectory planning, and scientific simulation, although challenges in smooth interpolation and relational composition remain.
Compositional diffusion models are a structured class of generative frameworks that enable the synthesis and manipulation of data by modularly combining independent generative components, typically instantiated as diffusion processes. By explicitly factoring the generative process into separable, composable model components—each responsible for individual concepts, objects, or attributes—compositional diffusion models depart from the traditional paradigm of monolithic, entangled latent representations. This approach enables structured generalization, scalable scene complexity, controllable attribute and relation binding, and robust handling of multimodal or multi-constraint generation, spanning vision, language, trajectory planning, and scientific simulation domains.
1. Theoretical Foundations
Compositional diffusion models leverage the connection between denoising diffusion probabilistic models (DDPMs) and energy-based models (EBMs). In this formulation, the sampling process of a DDPM is interpreted as an instance of Langevin dynamics, with the learned denoising network estimating the score function $\nabla_x \log p_t(x)$ of the noised data distribution. This identification enables composition in terms of energy or score combinations.
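The DDPM–Langevin correspondence can be sketched numerically: given a score function, unadjusted Langevin dynamics draws approximate samples from the corresponding distribution. The following minimal NumPy sketch uses a 2-D Gaussian whose score is known in closed form; the target distribution, step size, and iteration counts are illustrative assumptions, not taken from any cited work.

```python
import numpy as np

def score_gaussian(x, mu, cov_inv):
    # Score of N(mu, cov): grad_x log p(x) = -cov^{-1} (x - mu)
    return -cov_inv @ (x - mu)

def langevin_sample(score_fn, x0, step=0.01, n_steps=500, rng=None):
    # Unadjusted Langevin dynamics:
    # x <- x + (step / 2) * score(x) + sqrt(step) * noise
    rng = np.random.default_rng(rng)
    x = x0.astype(float)
    for _ in range(n_steps):
        x = x + 0.5 * step * score_fn(x) + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

mu = np.array([1.0, -2.0])
cov_inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 0.5]]))
samples = np.stack([
    langevin_sample(lambda x: score_gaussian(x, mu, cov_inv), np.zeros(2), rng=i)
    for i in range(200)
])
print(samples.mean(axis=0))  # close to mu for small steps and enough iterations
```

Because composition happens at the level of `score_fn`, any weighted sum of per-concept scores can be dropped into the same sampling loop unchanged.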
A key theoretical construct is the “projective composition” formalism (Bradley et al., 6 Feb 2025), which seeks a composed distribution $p^{\mathrm{comp}}$ that, when projected via a set of feature maps $\{\phi_i\}$, matches the marginals of the constituent models, i.e., $p^{\mathrm{comp}}(\phi_i(x)) = p_i(\phi_i(x))$ for each $i$. To achieve this, individual conditional models with score functions $s_i(x) = \nabla_x \log p_i(x)$ are composed linearly:
$$s^{\mathrm{comp}}(x) = s_0(x) + \sum_i w_i \big(s_i(x) - s_0(x)\big),$$
where $s_0$ is a base or unconditional score (typically the “background”), each $s_i$ is conditionally trained on a particular concept/attribute, and the $w_i$ are weighting coefficients. The composition is valid (provably achieves projective composition) when the underlying distributions display “factorized conditionals” (i.e., each $p_i$ perturbs only an independent subset of features) (Bradley et al., 6 Feb 2025).
Extensions explore constrained optimization paradigms, formulating composition and reward alignment as KL-divergence minimization subject to explicit constraints (Khalafi et al., 26 Aug 2025). The primal-dual algorithms derived under strong duality yield reward-tilted or product-of-experts (PoE) solutions of the form
$$p^{\star}(x) \;\propto\; p_0(x)\,\exp\!\big(\lambda^{\top} r(x)\big)\,\prod_i p_i(x)^{\mu_i},$$
where $p_0$ (and the $p_i$) are pretrained models, $r$ is a vector of reward functions, and the $\lambda$, $\mu_i$ are dual variables optimized to satisfy the constraints.
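The dual-variable mechanics behind such reward-tilted solutions can be illustrated with a one-dimensional toy problem in which everything is available in closed form. Here the base model is N(0, 1), the reward is r(x) = x, and the constraint is E[r] >= b; this setup and all names are illustrative, not the algorithm of Khalafi et al.

```python
def dual_ascent_reward_constraint(b, lr=0.2, n_iters=200):
    """Sketch of projected dual ascent for a reward constraint.

    Toy setup: base model p0 = N(0, 1), reward r(x) = x, constraint E[r] >= b.
    The tilted solution p*(x) ∝ p0(x) exp(lam * x) is N(lam, 1), so
    E_{p*}[r] = lam in closed form and the optimal dual variable is lam = b.
    """
    lam = 0.0
    for _ in range(n_iters):
        expected_reward = lam  # closed form for this Gaussian example
        lam = max(0.0, lam + lr * (b - expected_reward))  # projected ascent step
    return lam

print(dual_ascent_reward_constraint(b=1.5))  # converges to ~1.5
```

In realistic settings the expected reward under the tilted model is not available in closed form and must be estimated by sampling, but the ascent-on-constraint-violation structure is the same.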
2. Model Architectures and Compositional Mechanisms
The canonical architecture involves training separate diffusion models or adapters on disjoint or specialized data shards, each modeling a particular concept, object, modality, or constraint (Liu et al., 2022, Golatkar et al., 2023). At inference, the models are composed via weighted score summation, product-of-experts, or energy function addition. For example, composing $n$ conditional models yields a joint noise prediction
$$\hat{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t) + \sum_{i=1}^{n} w_i \big(\epsilon_\theta(x_t, t \mid c_i) - \epsilon_\theta(x_t, t)\big),$$
where $i$ indexes concepts or conditions $c_i$ (Liu et al., 2022). Negation and mixture operators are readily defined by score subtraction or convex blending.
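The weighted score-summation rule and its negation operator can be written directly at the level of the noise predictions. In this sketch the per-concept denoisers are toy stand-in callables; the concept names and weights are illustrative, not from any specific codebase.

```python
import numpy as np

def composed_eps(eps_uncond, cond_eps_fns, weights, x_t, t):
    """Compose conditional denoisers by weighted score summation:
    eps_hat = eps_0 + sum_i w_i * (eps_i - eps_0).
    Negation (NOT c_i) corresponds to a negative weight w_i."""
    eps0 = eps_uncond(x_t, t)
    eps = eps0.copy()
    for w, eps_fn in zip(weights, cond_eps_fns):
        eps += w * (eps_fn(x_t, t) - eps0)
    return eps

# Toy denoisers standing in for trained networks (illustrative only).
eps_uncond = lambda x, t: 0.0 * x
eps_red    = lambda x, t: x - 1.0   # pulls samples toward +1
eps_cube   = lambda x, t: x + 1.0   # pulls samples toward -1

x_t = np.zeros(4)
# Conjunction "red AND cube": positive weights on both concepts.
print(composed_eps(eps_uncond, [eps_red, eps_cube], [1.0, 1.0], x_t, 0))
# Negation "red AND NOT cube": negative weight on the second concept.
print(composed_eps(eps_uncond, [eps_red, eps_cube], [1.0, -1.0], x_t, 0))
```

Convex blending of the weights gives the mixture operator in the same interface, so a single composition function covers conjunction, negation, and mixing.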
Recent developments extend compositional diffusion to:
- Modular scene factorization via unsupervised latent encoders, with each factor controlling a specific denoising channel (Su et al., 27 Jun 2024),
- Parallel factor graph-based generation for large/multi-modal content (Zhang et al., 2023),
- Compositional prompt-guided medical data synthesis (e.g., hierarchical prompt spectrum: coarse and fine-grained prompts) (Yu et al., 25 Feb 2025),
- Multi-agent and hierarchical region-aware diffusion for complex scene assembly with MLLM-based scene parsing (Li et al., 5 May 2025).
Formulations explicitly support the composition of models trained independently on heterogeneous, asynchronous, or domain-divergent data (“compartmentalization” (Golatkar et al., 2023)), as well as constrained or reward-aligned composition with Lagrangian dual variable-based weighting (Khalafi et al., 26 Aug 2025).
3. Compositionality in Practice: Applications and Empirical Findings
Practical applications exploit the capacity to combine modular concepts at inference to generate:
- Photorealistic scenes with extensive attribute, relational, or object composition (Liu et al., 2022),
- Long-horizon, high-resolution, or “infinite” images and sequences via parallel generation in factor graphs (Zhang et al., 2023),
- Multi-object and multi-attribute binding improvements in text-to-image synthesis (Dat et al., 2 May 2025, Li et al., 5 May 2025),
- Flexible, constraint-satisfying trajectory planning for spacecraft or robotics (Briden et al., 5 Oct 2024),
- Preservation and attribution of data provenance for privacy-preserving model training and selective forgetting (Golatkar et al., 2023),
- Synthetic data generation in clinical and scientific domains (polyp detection, coupled PDE systems) where joint simulation or annotation is costly (Yu et al., 25 Feb 2025, Dhulipala et al., 23 Oct 2025).
Empirical evaluations consistently show improved compositional fidelity (i.e., correct entity, attribute, and relation binding) compared to monolithic or non-compositional baselines, as measured by metrics such as FID, CLIP score, T2I alignment, TIFA, mDice, mIoU, and application-specific reward functions. Examples include up to a 14.3% TIFA improvement (Golatkar et al., 2023), a significant boost in rare-concept generation from LLM-guided composition (Park et al., 29 Oct 2024), and a 2–3% increase in clinical detection F1-scores (Yu et al., 25 Feb 2025).
4. Technical Challenges and Limitations
Despite strong empirical successes, several technical obstacles and limitations are identified:
- Linear score composition is only theoretically justified in the presence of (approximate) factorization or orthogonality of the conditional distributions’ supports. Failures arise when there is significant overlap or correlation between component scores, or when background distributions are not chosen appropriately (Bradley et al., 6 Feb 2025).
- Mixture, product, and tempered compositions require careful handling due to nonlinearity of the joint log-probability; naive additive score rules may yield biased samples. MCMC-inspired samplers and density ratio corrections have been proposed, but robust estimation remains a challenge (Du et al., 2023).
- Training cost grows as the number of shards/components increases; efficient sharding and adapter-based schemes (e.g., LoRA, prompt tuning) partially mitigate this. Synergistic cross-component information may be lost with naive averaging, so classifier-based weighting and principled mixing weights are actively investigated (Golatkar et al., 2023, Khalafi et al., 26 Aug 2025).
- For hierarchical and rare/low-frequency compositional generalization, multiplicative emergence and frequency bottlenecks are observed—rare attributes require much larger sample sizes or more optimization steps for compositional mastery (Okawa et al., 2023).
- In relational composition tasks (object relationships, spatial reasoning), all contemporary models—diffusion, CLIP, and ViLT—struggle, suggesting a fundamental limitation in current feature disentanglement and representation learning (Pearson et al., 28 Aug 2025).
- Interpolative smoothness in factorized representations is limited: diffusion models often learn near-orthogonal (categorical-like) factors, facilitating composition but hindering smooth interpolation (Liang et al., 23 Aug 2024).
5. Data-Centric and Training-Efficient Mechanisms
Several studies highlight the crucial data-centric perspective:
- Sample complexity for compositional generalization scales polynomially with the depth of hierarchical context, mirroring correlation-based clustering akin to the renormalization group in physics (Favero et al., 17 Feb 2025).
- Data with isolated factor coverage plus few compositional examples enables linear (rather than quadratic/exponential) scaling of training samples for compositionality (Liang et al., 23 Aug 2024).
- Controlled studies diagnose “emergent” phase transitions in representation: compositional capabilities appear suddenly (“multiplicative emergence”) once all constituent sub-tasks are mastered, with the onset sharply governed by data frequency and combinatorial structure (Okawa et al., 2023).
- Data sharding, compartmentalization, and compositional sampling enhance continual learning, forensic attribution, differential privacy, and unlearning, offering major functional advantages in large-scale distributed and privacy-sensitive systems (Golatkar et al., 2023).
6. Advances in Inference, Evaluation, and Training-Free Methods
Beyond explicit model training, new methods for compositional inference and sample selection improve compositional alignment and reliability:
- Lift score-based rejection sampling quantifies the “fit” of generated samples with individual conditions (attributes/concepts) using only the original diffusion model and approximations to the lift $\log p(x \mid c_i) - \log p(x)$. This enables training-free improvement of compositional consistency, especially under complex or rare-prompt scenarios (Yu et al., 19 May 2025).
- LLM-guided inference, notably the “Rare-to-Frequent” (R2F) framework, utilizes external LLMs to replace rare concept prompts with frequent ones during early diffusion, then phases toward the true rare prompt, yielding state-of-the-art rare compositional alignment (Park et al., 29 Oct 2024).
- Region- and agent-based collaborative scene parsing (using object, action, spatial, and layout agents), combined with hierarchical cross-attention, bounding-box masking, and weighted latent fusion, has been introduced to handle complex prompts with many objects and spatial relations in a training-free compositional pipeline (Li et al., 5 May 2025).
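The rejection-sampling idea behind lift-score filtering can be schematized as follows: draw candidates, score each against every condition, and keep the first candidate that satisfies all of them. This is a generic sketch, not the lift-score estimator of Yu et al.; the generator, the per-condition fit functions, and the threshold are all illustrative stand-ins.

```python
import numpy as np

def rejection_sample(generate, fit_scores, threshold=0.0, max_tries=100, rng=None):
    """Draw candidates from `generate` and accept the first whose fit score
    exceeds `threshold` for every condition; otherwise return the best seen.
    `fit_scores(x)` returns one scalar per condition (higher = better fit)."""
    rng = np.random.default_rng(rng)
    best_x, best_min = None, -np.inf
    for _ in range(max_tries):
        x = generate(rng)
        worst = np.asarray(fit_scores(x)).min()
        if worst > best_min:
            best_x, best_min = x, worst
        if worst >= threshold:  # all conditions satisfied
            return x
    return best_x  # best-effort sample if no candidate passed

# Toy stand-ins: "generation" is Gaussian noise; the two "conditions" reward
# positive first and second coordinates (purely illustrative).
gen = lambda rng: rng.standard_normal(2)
fits = lambda x: [x[0], x[1]]
x = rejection_sample(gen, fits, threshold=0.5, max_tries=500, rng=0)
print(x)
```

Because the filter only consumes scores, swapping in model-derived lift estimates leaves the sampling loop untouched, which is what makes the approach training-free.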
7. Outlook and Research Directions
Key open topics and active research directions include:
- Developing robust, theoretically principled algorithms for compositional sampling, including adaptive weighting, MCMC corrections, and learned transformation to disentangled feature space (Du et al., 2023, Bradley et al., 6 Feb 2025).
- Designing scalable, interpretable, and efficient architectures for large-scale compositionality, including factor graphs, unsupervised decomposition, and modular transfer across tasks and modalities (Zhang et al., 2023, Su et al., 27 Jun 2024).
- Addressing persistent limitations in relational compositionality, smooth factor interpolation, and global structure emergence—potentially through novel data curation, pretraining, and objective function engineering (Okawa et al., 2023, Pearson et al., 28 Aug 2025).
- Integrating constrained optimization directly into model training and adaptation, for instance reward alignment under sample or privacy constraints, to enable more tractable and interpretable control of generation tradeoffs (Khalafi et al., 26 Aug 2025).
- Extending compositional diffusion to coupled scientific simulation, long-horizon modeling, and physical surrogate systems, exploiting the efficiency of decoupled training plus symmetric composition for computational scalability (Dhulipala et al., 23 Oct 2025).
In summary, compositional diffusion models provide a mathematically principled, practically validated, and theoretically rich framework for modular, scalable, and controllable generative modeling. They are distinguished by their ability to realize novel concept binding, structured generalization, and compositional scene synthesis—in settings ranging from photorealistic imagery and text-to-image translation to scientific simulation and privacy-sensitive model deployment.