Compositional Diffusion with Guided Search (CDGS)
- The paper introduces CDGS, a framework that intertwines guided search with diffusion denoising to overcome mode-averaging in structured generative tasks.
- It employs batch-based sampling, population pruning, and iterative local-to-global message passing to cohesively compose outputs from overlapping local models.
- Empirical results demonstrate CDGS's superior performance in robotic planning, layout generation, and video synthesis compared to traditional diffusion methods.
Compositional Diffusion with Guided Search (CDGS) is a framework for structured generative modeling that synthesizes globally coherent outputs—such as long-horizon robotic plans, complex multi-object layouts, panoramas, or videos—by composing the outputs of locally trained diffusion models via a search procedure embedded within the denoising process. CDGS addresses the breakdowns of naïve compositional diffusion when faced with multimodal local distributions, achieving robust, globally consistent synthesis by coupling denoising with batch-based selection, population-based pruning, and iterative local-to-global message passing.
1. Foundational Principles of Compositional Diffusion
Classical diffusion models, such as those based on DDPM, provide a flexible generative backbone but are typically monolithic, sampling entire data instances directly. In compositional settings, the generative process aims to assemble a global configuration from overlapping, locally valid factors—e.g., local state transitions in planning or object relations in layout synthesis. Given a set of local generative models $\{p_k\}$, the goal is to sample global structures with strong local and global consistency. The Bethe-approximate factor graph posterior takes the form

$$p(x) \;\propto\; \prod_{k} p_k\big(x^{(k)}\big)\, \prod_{i} p(x_i)^{1 - c_i},$$

where the $x^{(k)}$ are local (possibly overlapping) subsequences and $c_i$ is the number of factors involving variable $x_i$ (Mishra et al., 31 Dec 2025).
CDGS was introduced to overcome the failure of naïve composition of diffusion models, which, when faced with multimodal local distributions $p_k$, averages over incompatible local modes, producing incoherent or infeasible global samples. The key innovation is intertwining guided search operations with each denoising step, enabling the selective exploration and reinforcement of compatible configurations throughout the sampling trajectory (Mishra et al., 31 Dec 2025, Fan et al., 24 Sep 2025).
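The mode-averaging failure is easy to see in a toy numeric example (illustrative only; the bimodal density and its parameters are not from the paper):

```python
import numpy as np

# Toy illustration: a bimodal local target with modes at -1 and +1.
# Naively averaging two individually valid modes lands at 0, where the
# density is vanishingly small -- the failure mode CDGS is built to avoid.
def bimodal_density(x, modes=(-1.0, 1.0), sigma=0.1):
    return sum(np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in modes)

mode_a, mode_b = -1.0, 1.0
averaged = 0.5 * (mode_a + mode_b)  # naive average of two incompatible modes
```

Here `averaged` sits at 0.0, where `bimodal_density` is essentially zero: a globally infeasible sample produced from two locally valid ones.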
2. Mathematical Formulation and Denoising Algorithms
For a given compositional task—plan synthesis, layout generation, or panoramic construction—the CDGS framework defines specific local models and their integration:
- Layout Planning Example: Let $N$ be the number of objects, $\{s_i\}$ their sizes, $\{r\}$ the pairwise relationships, and $\{x_i\}$ the object positions. Each relationship $r$ between objects $i$ and $j$ is modeled with an energy function $E_r(x_i, x_j)$, leading to the joint:

$$p(x_1, \dots, x_N) \;\propto\; \exp\Big(-\sum_{r} E_r(x_i, x_j)\Big).$$

Denoising proceeds using an annealed Unadjusted Langevin Algorithm (ULA) or a DDPM-style update, where at each timestep $t$ the predicted noise vectors for each relation are summed:

$$\hat{\epsilon}(x_t, t) = \sum_{r} \epsilon_\theta^{(r)}(x_t, t).$$

Reverse update:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\,\hat{\epsilon}(x_t, t)\Big) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I).$$

The training loss for these denoising networks is the MSE between the true noise and the summed predicted noise (Fan et al., 24 Sep 2025).
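The summed-noise reverse update can be sketched in a few lines (an illustrative sketch, not the paper's code; `eps_nets`, the schedule scalars, and the function name are assumptions):

```python
import numpy as np

def composed_reverse_step(x_t, t, eps_nets, alpha_t, alpha_bar_t, sigma_t, rng):
    """One DDPM-style reverse step driven by the sum of per-relation noise nets."""
    eps_hat = sum(net(x_t, t) for net in eps_nets)  # summed predicted noise
    mean = (x_t - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)
    return mean + sigma_t * rng.standard_normal(x_t.shape)
```

Each element of `eps_nets` stands in for one trained relation network; summing their outputs composes the relation-wise energies into a single denoising direction.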
- Long-Horizon Planning Example: For structured planning, the process divides the trajectory into overlapping segments, each denoised individually using local models and then reconciled. The forward diffusion and reverse denoising take the standard forms:
- Forward: $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$.
- Reverse (DDIM): $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \hat{x}_0 + \sqrt{1 - \bar\alpha_{t-1}}\, \epsilon_\theta(x_t, t)$, where $\hat{x}_0 = \big(x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)\big) / \sqrt{\bar\alpha_t}$.
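The per-segment DDIM update can be written as a small helper (a standard-form sketch; argument names are illustrative):

```python
import numpy as np

def ddim_step(x_t, eps, a_t, a_prev):
    """Deterministic DDIM reverse update; a_t, a_prev are alpha-bar at t and t-1."""
    x0_hat = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)  # predicted clean sample
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1 - a_prev) * eps
```

In the compositional setting, each overlapping trajectory segment would be pushed through this step with its own local model's `eps` before the segments are reconciled.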
3. Guided Search: Batch-Based Sampling and Pruning
CDGS interleaves population-based search and pruning within the denoising procedure at each diffusion timestep. At time $t$, a batch of $B$ candidate global samples $\{Z_b(t)\}_{b=1}^{B}$ is evolved:
- Each $Z_b(t-1)$ is generated from $Z_b(t)$ using compositional denoising updates.
- A global cost $J_b = J(Z_b)$ is defined, often in terms of DDIM-inversion curvature or a surrogate for log-likelihood across segments.
- Guided proposal density:

$$q(Z_{t-1} \mid Z_t) \;\propto\; p_\theta(Z_{t-1} \mid Z_t)\, \exp\big(-J(Z_{t-1}) / \lambda\big),$$

where $p_\theta(Z_{t-1} \mid Z_t)$ is the default reverse diffusion transition and $\lambda$ controls exploration (Mishra et al., 31 Dec 2025).
- Candidates with the lowest global cost are retained (“elite” selection, often top-$K$), and the batch is repopulated by duplicating these elite samples.
- Iterative forward and backward resampling within the batch propagates information via overlapping segments, akin to belief-propagation, thereby enforcing global coherence.
This approach mitigates the “mode-averaging” phenomenon found in naïve compositional diffusion by explicitly favoring candidates that respect both local multimodality and global feasibility.
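The elite-selection-and-repopulation step can be sketched as follows (names and the cyclic duplication rule are illustrative assumptions):

```python
import numpy as np

def prune_and_repopulate(candidates, cost_fn, k):
    """Keep the k lowest-cost 'elite' candidates; refill the batch by duplication."""
    costs = np.array([cost_fn(c) for c in candidates])
    elite_idx = np.argsort(costs)[:k]          # indices of the k cheapest candidates
    elites = [candidates[i] for i in elite_idx]
    # duplicate elites cyclically until the batch is back to full size
    return [elites[i % k] for i in range(len(candidates))]
```

Running this once per denoising step concentrates the batch on configurations that are both locally plausible and globally cheap, rather than averaging across incompatible modes.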
4. Integration with Symbolic Reasoning and Vision-Language Agents
In tasks such as spatial layout synthesis, CDGS is tightly coupled with symbolic representations and VLMs:
- A vision-language agent preprocesses input instances, extracting object instances, estimating physical sizes, and constructing scene graphs $G = (V, E)$, where $V$ is the set of objects and $E$ encodes relations as logical predicates.
- Each predicate $\phi_r(x) \in \{0, 1\}$ indicates satisfaction of relationship $r$ for a given layout $x$.
- During denoising, hard constraints based on $\phi_r$ can be enforced by pruning, or by injecting penalty gradients of a differentiable relaxation $E_r$ of each constraint into the reverse update:

$$x_{t-1} \leftarrow x_{t-1} - \eta\, \nabla_x \sum_{r} E_r(x).$$

Often, hard satisfaction is preferred, i.e., candidates are rejected if $\phi_r(x) = 0$ for any $r$ (Fan et al., 24 Sep 2025).
- Ultimately, the output is a set of valid bounding boxes or trajectory states, which serve as input (e.g., via inpainting) to downstream conditional generative models.
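The hard-satisfaction filter can be sketched as follows (the predicate and the layout encoding are hypothetical, chosen only to make the example concrete):

```python
def filter_feasible(layouts, predicates):
    """Hard constraint check: keep only layouts satisfying every predicate."""
    return [x for x in layouts if all(phi(x) for phi in predicates)]

# Hypothetical predicate: object 0's box must lie fully left of object 1's box,
# with a layout encoded as a list of (x_min, x_max) intervals per object.
left_of = lambda layout: layout[0][1] <= layout[1][0]
```

Candidates failing any predicate are simply dropped from the batch before repopulation, which is the pruning path described above.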
5. Implementation Details and Pseudocode
CDGS is implemented with domain-adaptive score networks and clearly prescribed training regimes:
- Network Architecture: Mixture-of-Experts (MoE) transformer, with each layer employing a gating network and injecting the diffusion timestep via positional encoding and adaptive layer normalization (AdaLN) (Mishra et al., 31 Dec 2025).
- Training Objective: Denoising score matching:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0, \epsilon, t}\Big[\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2\Big], \qquad x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon.$$
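The denoising score-matching objective can be sketched as a single-timestep Monte Carlo estimate (standard DDPM form, assumed here rather than quoted from the paper; names are illustrative):

```python
import numpy as np

def dsm_loss(eps_net, x0, a_bar_t, t, rng):
    """MSE between injected noise and the network's prediction at timestep t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a_bar_t) * x0 + np.sqrt(1 - a_bar_t) * eps  # forward noising
    return np.mean((eps - eps_net(x_t, t)) ** 2)
```

In practice the expectation is taken over minibatches of clean samples, random timesteps, and fresh noise draws.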
A high-level pseudocode for the CDGS main loop, abridged from (Mishra et al., 31 Dec 2025):
```
Initialize B candidates Z_b(T) ~ N(0, I)
for t = T ... 1:
    for each candidate b:
        Z_b(t-1) <- compositional_DDIM(Z_b(t), net, alpha_t)
    for u in 1 ... U-1:
        Z <- forward_noising(Z, alpha_t)
        Z <- compositional_DDIM(Z, net, alpha_t)
    for each candidate b:
        J_b <- plan_cost(Z_b(0))
    Select top-K candidates by J_b
    Repopulate to B by duplicating elites
return Z_b(0)
```
Compositional segment updates and aggregation are managed as per the compositional_DDIM routine, segmenting and denoising each local window, then merging results to form the next global sample (Mishra et al., 31 Dec 2025).
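One plausible merge rule, averaging overlapping denoised windows position-wise, can be sketched as follows (the paper's exact aggregation inside `compositional_DDIM` may differ):

```python
import numpy as np

def merge_segments(segments, starts, total_len):
    """Average overlapping denoised windows into one global trajectory."""
    acc = np.zeros(total_len)
    cnt = np.zeros(total_len)
    for seg, s in zip(segments, starts):
        acc[s:s + len(seg)] += seg   # accumulate each window's contribution
        cnt[s:s + len(seg)] += 1     # count how many windows cover each position
    return acc / np.maximum(cnt, 1)  # per-position average over covering windows
```

Positions covered by several windows receive the mean of the local denoised values, which is how overlapping segments exchange information between search iterations.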
6. Applications and Empirical Results
CDGS demonstrates versatility and high performance across several domains (Mishra et al., 31 Dec 2025, Fan et al., 24 Sep 2025):
- Robot Manipulation Planning: On OGbench Maze and Scene tasks, CDGS matches or exceeds diffusion-based and RL baselines without long-horizon training data. For example, on PointMaze stitch, CDGS achieves 82% versus Diffuser's 29%, evidencing successful composition of short-horizon skills.
- Task-and-Motion Planning (TAMP): CDGS outperforms no-PDDL baselines and rivals privileged PDDL+CEM methods, e.g., 0.64 for Hook Reach Task 1 vs 0.66 for the best baseline.
- Panoramic Image Generation: By composing Stable Diffusion 2.0 patches into large panoramas, CDGS outperforms Multi-Diffusion and Sync-Diffusion on global coherence and prompt alignment metrics.
- Long Video Synthesis: CDGS produces coherent 350-frame CogVideoX-2B samples, surpassing baselines on subject consistency, although with minor trade-offs in aesthetic quality.
In spatial layout generation, as realized in LayoutAgent, CDGS produces object layouts that respect geometric and semantic constraints, outperforming prior models on criteria such as layout coherence and aesthetic alignment (Fan et al., 24 Sep 2025).
7. Limitations and Future Development
Reported limitations include:
- Requires explicit specification of start/goal states (for planning) or scene constraints (for layouts); generalization to variable-goal or unconstrained synthesis is not yet fully established.
- The compositional horizon must be selected in advance, with no automated mechanism for optimizing sequence length or resizing domains.
- While overlapping segments and iterative resampling communicate local information, global consistency is ultimately limited by the factorization structure; more expressive message passing or attention across non-local, long-range dependencies could further enhance performance (Mishra et al., 31 Dec 2025).
A plausible implication is that future advances may focus on adaptive horizon selection, automated factor graph construction, and richer integration of symbolic reasoning with end-to-end generative architectures.