Compositional World Models (COMBO)

Updated 29 May 2026

Compositional World Models are generative and planning frameworks that factorize environment dynamics into modular, interpretable components.
They enable robust generalization and flexible recombination of object-centric, causal, and skill-conditioned modules for effective simulation and decision-making.
Key applications include embodied agents, robotics, model-based reinforcement learning, and multi-agent systems, demonstrating significant empirical performance gains.

Compositional World Models (COMBO) are a class of generative models and planning frameworks that represent, learn, and reason about environment dynamics by decomposing the world into modular, interpretable components. These models are characterized by explicit factorizations over objects, features, causal mechanisms, or skills and are designed to facilitate data-efficient learning, robust generalization, and flexible recombination of knowledge to support planning and imagination in complex domains. COMBO finds extensive application in embodied agents, robotics, model-based reinforcement learning, multi-agent systems, and symbolic reasoning environments.

1. Formal Definitions and Core Principles

Compositional World Models structurally factor the environment into a set of entities, modules, or experts, each responsible for a distinct aspect or factor of the world’s dynamics. The state is typically partitioned as $s_t = [s^{(1)}_t, \dotsc, s^{(K)}_t]$, with transitions factorized as

$p(s_{t+1} \mid s_t, a_t) = \prod_{k=1}^K p\bigl(s_{t+1}^{(k)} \mid \mathrm{pa}^{(k)}(s_t, a_t)\bigr)$

where $\mathrm{pa}^{(k)}$ denotes the subset of relevant parent factors for module $k$ (Costa, 2024). The notion of compositionality can be instantiated via:

Object-centric models: Each module governs one object or attribute (Baek et al., 24 Jan 2025, Sehgal et al., 2023).
Causal programs: World models composed as products of programmatic experts, each encoding a local, symbolic causal rule (2505.10819).
Skill-conditioned models: Video or latent models conditioned on discrete, LLM-generated low-level skill prompts (Vuong et al., 11 Mar 2026).
Policy composition: “Jumpy” world models learn multi-step transition dynamics induced by sequenced pre-trained policies (Farebrother et al., 23 Feb 2026).

The key principle is that the knowledge learned by each module generalizes and recombines across new settings, supporting zero-shot transfer and robust planning.
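The factorized transition above can be made concrete with a small tabular sketch. Everything in it is a hypothetical illustration rather than any cited system: the two factors (`agent`, `door`), the `PARENTS` table standing in for $\mathrm{pa}^{(k)}$, and the hand-written module functions are all assumptions chosen to keep the example tiny.

```python
import random

# Hypothetical two-factor world: an agent position (cells 0..3) and a door.
# PARENTS[k] plays the role of pa^(k): the factors module k is allowed to read.
PARENTS = {
    "agent": {"factors": ["agent"], "uses_action": True},
    "door": {"factors": ["agent", "door"], "uses_action": False},
}

def agent_module(parents, action):
    """Distribution {next_value: prob} over the agent's next position."""
    x = parents["agent"]
    step = {"LEFT": -1, "RIGHT": 1}.get(action, 0)
    target = max(0, min(3, x + step))
    if target == x:
        return {x: 1.0}
    return {target: 0.9, x: 0.1}  # assumed: a move fails 10% of the time

def door_module(parents, action):
    """Assumed rule: the door opens once the agent is on cell 3 and stays open."""
    if parents["door"] == "open" or parents["agent"] == 3:
        return {"open": 1.0}
    return {"closed": 1.0}

MODULES = {"agent": agent_module, "door": door_module}

def module_dist(k, state, action):
    """Evaluate module k on its parent factors only."""
    spec = PARENTS[k]
    parents = {name: state[name] for name in spec["factors"]}
    return MODULES[k](parents, action if spec["uses_action"] else None)

def transition_prob(state, action, next_state):
    """p(s' | s, a) as the product of the per-module conditionals."""
    prob = 1.0
    for k in MODULES:
        prob *= module_dist(k, state, action).get(next_state[k], 0.0)
    return prob

def sample_next(state, action, rng):
    """Sample each factor independently from its own module."""
    nxt = {}
    for k in MODULES:
        dist = module_dist(k, state, action)
        values, weights = zip(*dist.items())
        nxt[k] = rng.choices(values, weights=weights)[0]
    return nxt

state = {"agent": 2, "door": "closed"}
print(transition_prob(state, "RIGHT", {"agent": 3, "door": "closed"}))
print(sample_next(state, "RIGHT", random.Random(0)))
```

Because each module reads only its parents, replacing `door_module` with a different rule leaves `agent_module` untouched, which is the recombination property the section describes.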

2. Model Architectures and Learning Algorithms

Architectures and learning strategies for COMBO span probabilistic graphical models, neural networks, program synthesis, and hybrid neural-symbolic systems.

2.1 Programmatic and Bayesian COMBO

PoE-World: Constructs the world model as an exponentially-weighted product of small, deterministic Python programs ("experts"), each capturing a narrow causal mechanism (e.g., “if touching ladder and action=RIGHT, set y-velocity to –4”). LLMs (such as GPT-4) synthesize experts from observed trajectories. The composition is performed as

$p_\theta(o_{t+1} \mid o_{1:t}, a_{1:t}) \propto \prod_{i} [p_i^{\mathrm{expert}}(o_{t+1} \mid o_{1:t}, a_{1:t})]^{\theta_i}$

with the weights $\theta$ learned via likelihood maximization under an $\ell_1$ sparsity penalty, and experts whose weights fall below a threshold pruned (2505.10819).
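As a minimal sketch of the weighted product, assuming each expert returns a distribution over a small discrete set of next-observation values, the composition can be computed in log space and renormalized. The function names `product_of_experts` and `prune`, the `eps` floor, and the toy experts are hypothetical and are not taken from the PoE-World code.

```python
import math

def product_of_experts(expert_probs, weights, eps=1e-12):
    """Return the distribution proportional to prod_i p_i(o) ** theta_i.

    expert_probs: list of dicts {outcome: prob}, one per expert.
    weights: list of non-negative theta_i, one per expert.
    eps: assumed floor so that a zero probability does not give log(0).
    """
    outcomes = set().union(*expert_probs)
    log_scores = {
        o: sum(
            w * math.log(max(p.get(o, 0.0), eps))
            for p, w in zip(expert_probs, weights)
        )
        for o in outcomes
    }
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {o: math.exp(v - top) for o, v in log_scores.items()}
    z = sum(unnorm.values())
    return {o: u / z for o, u in unnorm.items()}

def prune(experts, weights, threshold):
    """Drop experts whose learned weight is below the threshold."""
    kept = [(e, w) for e, w in zip(experts, weights) if w >= threshold]
    return [e for e, _ in kept], [w for _, w in kept]

# Two toy experts over the next y-velocity.
ladder_rule = {-4: 0.98, 0: 0.02}  # confident about the ladder mechanism
weak_prior = {-4: 0.2, 0: 0.8}     # a vaguer competing expert
print(product_of_experts([ladder_rule, weak_prior], [1.0, 1.0]))
print(prune([ladder_rule, weak_prior], [1.0, 0.001], threshold=0.01))
```

An expert with weight zero contributes a factor of one for every outcome, so pruning low-weight experts changes the composed distribution very little while shrinking the model.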

Generic Modular Bayesian Networks: States and transitions are represented as sparse, interpretable Bayesian networks over modules, and the graph structure is learned using Bayesian model selection (e.g., BIC score), with prior favoring sparsity. Structure search iterates over graph edits that maximize posterior or BIC (Costa, 2024).
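To illustrate how a BIC-style score trades fit against sparsity, the sketch below scores candidate parent sets for a single discrete node. It is an assumption-laden toy, not the procedure from the cited work: `bic_score` is a hypothetical helper, it uses the "higher is better" convention (log-likelihood minus half the parameter count times $\ln n$), and it counts only parent configurations that actually occur in the data.

```python
import math
from collections import Counter

def bic_score(data, child, parents):
    """BIC-style score (higher is better) of one discrete node given a parent set.

    data: list of dicts mapping variable name -> discrete value.
    Score = max log-likelihood - 0.5 * (number of free parameters) * ln(n).
    """
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximum-likelihood conditional probabilities are the empirical frequencies.
    log_lik = sum(
        count * math.log(count / parent_counts[pa])
        for (pa, _), count in joint.items()
    )
    child_arity = len({row[child] for row in data})
    n_params = (child_arity - 1) * len(parent_counts)
    return log_lik - 0.5 * n_params * math.log(n)

# Toy data: `nxt` copies `x`; `y` is irrelevant to it.
data = [
    {"x": x, "y": y, "nxt": x}
    for x in (0, 1)
    for y in (0, 1)
    for _ in range(10)
]
for parents in ([], ["x"], ["x", "y"]):
    print(parents, round(bic_score(data, "nxt", parents), 3))
```

On this toy data the parent set `["x"]` scores highest: the empty set fits poorly, and adding `y` buys no extra likelihood while paying for more parameters. A structure search would repeat this comparison over candidate graph edits.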

2.2 Neural & Diffusion-based COMBO

Slot-Block Hierarchies: Dreamweaver instantiates a neural hierarchy wherein slots represent entities and “blocks” within slots encode object attributes (e.g., shape, color, direction). A Recurrent Block-Slot Unit (RBSU) extracts and updates these representations, which support attribute-wise recombination for compositional imagination (Baek et al., 24 Jan 2025).
Skill-Compositional Video Models: World2Act employs a latent diffusion model conditioned on atomic skill prompts, which are derived automatically via LLM-driven gripper-trajectory segmentation and skill schema alignment. This enables temporal stitching of skill-conditioned rollouts to simulate arbitrary-length action sequences (Vuong et al., 11 Mar 2026). Contrastive objectives align world model latents to policy/action latents for robust bridging.
Compositional Diffusion for Multi-Agent Dynamics: COMBO for embodied cooperation factorizes joint actions of multiple agents, training single-agent and multi-agent diffusion models whose scores are composed as

$\hat \epsilon(x, t | A_{t}) = \epsilon_\theta(x, t) + \sum_{i=1}^n [\epsilon_\theta(x, t | a_{i}) - \epsilon_\theta(x, t)]$

where $A_t = \{a_1, \dotsc, a_n\}$ collects the actions of the $n$ agents. This composition supports explicit action-conditioned rollouts for decentralized planning (Zhang et al., 2024).
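The composition rule is simple arithmetic on the model outputs, which the following sketch spells out elementwise. It assumes the unconditional and per-agent conditional noise predictions have already been computed and flattened to lists of floats; `compose_noise` is a hypothetical name, and no diffusion model is invoked here.

```python
def compose_noise(eps_uncond, eps_cond_per_agent):
    """Compose per-agent conditional noise predictions.

    eps_uncond: list of floats, the unconditional prediction eps(x, t).
    eps_cond_per_agent: list of lists, eps(x, t | a_i) for each agent i.
    Returns eps(x, t) + sum_i [eps(x, t | a_i) - eps(x, t)], elementwise.
    """
    composed = list(eps_uncond)
    for eps_cond in eps_cond_per_agent:
        for j, (cond, uncond) in enumerate(zip(eps_cond, eps_uncond)):
            composed[j] += cond - uncond
    return composed

# Two agents, a two-dimensional toy "image".
print(compose_noise([0.0, 1.0], [[0.5, 1.0], [0.0, 2.0]]))  # [0.5, 2.0]
```

Each agent contributes only its deviation from the unconditional prediction, so an agent whose action has no effect on a given element adds nothing there.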

2.3 Physics-Integrated and Digital Twin Approaches

DreMa constructs explicit 3D digital twins of environments using Gaussian Splatting for per-object photorealistic geometry and PyBullet for physics. Compositionality is realized via object-wise segmentation, enabling modular augmentation (e.g., roto-translation of objects or the whole scene) and robust imagination from minimal demonstrations (Barcellona et al., 2024).
Jumpy World Models: COMBO as formulated for temporal abstraction learns policy- and horizon-conditioned dynamics over multi-step transitions via flow-matching objectives with temporal-difference horizon consistency, explicitly supporting the sequential composition of heterogeneous policy fragments (Farebrother et al., 23 Feb 2026).
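The object-wise roto-translation augmentation mentioned for DreMa can be pictured with a planar toy: one rigid transform is applied to an object's position and to every gripper waypoint of the demonstration, so the relative geometry is preserved. This is a deliberately simplified 2D sketch with hypothetical names (`rigid_transform_2d`, `augment_demo`); it does not use Gaussian Splatting or PyBullet and is not the cited implementation.

```python
import math

def rigid_transform_2d(point, angle_rad, translation):
    """Rotate a 2D point about the origin, then translate it."""
    x, y = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y + translation[0], s * x + c * y + translation[1])

def augment_demo(object_xy, waypoints, angle_rad, translation):
    """Apply the same rigid transform to the object and to each waypoint."""
    new_object = rigid_transform_2d(object_xy, angle_rad, translation)
    new_waypoints = [
        rigid_transform_2d(w, angle_rad, translation) for w in waypoints
    ]
    return new_object, new_waypoints

# One demonstration: the gripper approaches an object at (1, 0).
obj, path = augment_demo(
    object_xy=(1.0, 0.0),
    waypoints=[(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)],
    angle_rad=math.pi / 2,
    translation=(0.2, -0.1),
)
print(obj, path[-1])
```

Since the final waypoint and the object are moved together, the transformed demonstration still ends at the object, which is what makes the augmented sample a valid new demonstration.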

3. Training Objectives and Optimization

Learning COMBO models involves optimizing objectives aligned with the decomposition:

Maximum Likelihood: For programmatic and probabilistic modules, maximizing the likelihood of observed transitions—often regularized for sparsity or simplicity (e.g., $\ell_1$ norm or graph sparsity penalty) (2505.10819, Costa, 2024).
Predictive/Generative Losses: Slot/block and diffusion-based models optimize reconstruction or cross-entropy losses on future frames or tokenizations, with disentanglement and dynamic-informativeness ablations demonstrating modular utility (Baek et al., 24 Jan 2025).
Latent Contrastive Losses: Skill-compositional models use bidirectional InfoNCE or similar contrastive terms between world model and policy latents to ensure robust action-world alignment (Vuong et al., 11 Mar 2026).
Flow-Matching and Consistency: Jumpy COMBO utilizes a temporal-difference flow loss, with additional terms enforcing consistency across horizons for long-term planning fidelity (Farebrother et al., 23 Feb 2026).
Auxiliary Progress/Value Losses: COMBO in RISE trains a progress value model using regression and temporal-difference losses for effective policy improvement via imagined rollouts (Yang et al., 11 Feb 2026).
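Of the objectives listed above, the latent contrastive term is the easiest to show in a few lines. The sketch below is a generic bidirectional InfoNCE loss over paired latent vectors, written in plain Python for readability; the function names, the cosine-similarity choice, and the `temperature` default are assumptions and may differ from what World2Act uses.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    norm = math.sqrt(_dot(v, v)) or 1.0  # guard against the zero vector
    return [a / norm for a in v]

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)
        log_z = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def bidirectional_info_nce(world_latents, policy_latents, temperature=0.1):
    """Symmetric InfoNCE: world-to-policy and policy-to-world terms averaged.

    Item i of world_latents is the positive pair of item i of policy_latents;
    every other item in the batch acts as a negative.
    """
    w = [_normalize(v) for v in world_latents]
    p = [_normalize(v) for v in policy_latents]
    logits = [[_dot(wi, pj) / temperature for pj in p] for wi in w]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

world = [[1.0, 0.0], [0.0, 1.0]]
print(bidirectional_info_nce(world, [[1.0, 0.0], [0.0, 1.0]]))  # aligned pairs
print(bidirectional_info_nce(world, [[0.0, 1.0], [1.0, 0.0]]))  # mismatched pairs
```

The aligned batch gives a loss near zero and the mismatched batch a large one, which is the signal that pulls matching world-model and policy latents together during training.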

4. Planning, Inference, and Application Workflows

COMBO models enable a spectrum of planning and inference protocols tailored to their compositional structure:

Simulator-based RL and Forward Simulation: Product-of-experts and skill-compositional models can be used as fast, closed-loop simulators for offline RL policy pre-training and model-based online execution, often with hierarchical or beam/planner integration (2505.10819, Vuong et al., 11 Mar 2026).
Look-ahead Tree Search: Multi-agent COMBO and symbolic/Bayesian models can embed their dynamics within online tree search or MCTS expansions, leveraging VLM submodules for action/inference and dynamic scores for node selection (Zhang et al., 2024).
Compositional Policy Planning: Jumpy COMBO supports random-shooting or goal-conditioned proposal-based policy sequence planning, with Chapman–Kolmogorov composition over successor measures, accommodating various abstractions (one-step, GGPI, etc.) (Farebrother et al., 23 Feb 2026).
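Compositional policy planning can be sketched in tabular form: each pre-trained policy is summarized by a multi-step ("jumpy") transition matrix, sequences of policies are composed with the Chapman-Kolmogorov equation, and random shooting keeps the sequence most likely to reach a goal. The matrices, policy names, and the functions `compose` and `random_shooting` are hypothetical stand-ins; the cited work learns these dynamics with flow models over successor measures rather than storing tables.

```python
import random

def compose(first, second):
    """Chapman-Kolmogorov: P(s -> s'') = sum over s' of first[s][s'] * second[s'][s'']."""
    n = len(first)
    return [
        [sum(first[i][k] * second[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def random_shooting(jump_models, start, goal, plan_len, n_samples, rng):
    """Sample policy sequences and keep the one most likely to end at `goal`."""
    names = sorted(jump_models)
    best_plan, best_prob = None, -1.0
    for _ in range(n_samples):
        plan = [rng.choice(names) for _ in range(plan_len)]
        total = jump_models[plan[0]]
        for name in plan[1:]:
            total = compose(total, jump_models[name])
        if total[start][goal] > best_prob:
            best_plan, best_prob = plan, total[start][goal]
    return best_plan, best_prob

# Three states on a line; each matrix is the effect of running one policy
# for its full horizon (values are made up for illustration).
jump_models = {
    "advance": [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],
    "stay": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
plan, prob = random_shooting(
    jump_models, start=0, goal=2, plan_len=2, n_samples=200, rng=random.Random(0)
)
print(plan, prob)
```

Planning over two policy-level jumps here replaces a search over many primitive actions, which is the practical appeal of temporal abstraction in long-horizon tasks.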

Example Table: Inference and Planning Modes Across COMBO Classes

Approach	World Model Structure	Planning/Inference Method
PoE-World (2505.10819)	Product of LLM-synthesized program experts	Hierarchical planner, MCTS
COMBO Multi-agent (Zhang et al., 2024)	Compositional video-diffusion	Tree search with VLM modules
Dreamweaver (Baek et al., 24 Jan 2025)	Slot-block neural hierarchy	Autoregressive imagination/rollout
RISE (Yang et al., 11 Feb 2026)	Separate controllable dynamics and value models	On-policy RL in imagination
Jumpy COMBO (Farebrother et al., 23 Feb 2026)	Policy/horizon-conditioned flow model	Random shooting/Monte Carlo planning

5. Empirical Performance and Ablation

Multiple studies demonstrate COMBO's advantages in sample efficiency, generalization, and robustness:

Multi-agent cooperation: Achieves 100% success on TDW-Game tasks, outperforming LLaVA, MAPPO, and recurrent world model baselines, with steep degradation on ablation of agent-inference modules (Zhang et al., 2024).
Disentanglement and Imagination: Dreamweaver's modular structure yields $\geq2\times$ higher Informativeness-Dynamic (I-D) metric than RSSM or STEVE, and demonstrates robust OOD imagination by recombining latent blocks not seen during training (Baek et al., 24 Jan 2025).
Policy Improvement in Robotics: RISE’s dual-module architecture yields +35% absolute improvement over monolithic world models in manipulation tasks, with significant gains in both convergence speed and robustness (Yang et al., 11 Feb 2026).
Skill-compositional rollouts: World2Act’s staged curriculum results in +2.5% to +6.7% improvement in real-robot and simulated multi-skill settings, with ablations confirming the necessity of skill decomposition and latent-space alignment (Vuong et al., 11 Mar 2026).
Digital twin-driven data augmentation: DreMa (Dream to Manipulate) improves real-robot one-shot imitation by +31 pts in success rate versus voxel CNN baselines, with no additional behavioral loss terms required—gains attributed purely to compositional, equivariant data augmentation (Barcellona et al., 2024).
Long-horizon planning: Jumpy COMBO yields a relative improvement over primitive-action planners on offline maze and manipulation benchmarks (Farebrother et al., 23 Feb 2026).

6. Theoretical Properties and Generalization Guarantees

COMBO frameworks often provide concrete theoretical guarantees and claims:

Bisimulation equivalence: If the induced module libraries and dynamics models are expressive enough, the compositional simulator can be provably bisimulation-equivalent to the real process—guaranteeing transfer of optimality (Segler, 2019).
Sample complexity: For graph-program-based COMBO, sample complexity for transition learning depends only on the size of the local rewrites and the VC-dimension of the module, not on the cardinality of the global state/action space (Segler, 2019).
Compositional generalization: Both empirical results (OOD tests) and formal corollaries confirm that models generalize robustly to novel compositions of known “atomic” modules (e.g., unseen object-attribute combinations, new molecular graphs, or skill sequences), as model parameters localize to structural motifs observed in training (Sehgal et al., 2023, Baek et al., 24 Jan 2025).

7. Future Directions and Open Problems

COMBO research points to a range of future developments:

Neuromodulatory and Continual Module Expansion: Extending attribute vocabularies and module libraries via continual or LLM-guided expansion to address open-world and unbounded compositionality (Sehgal et al., 2023).
Symbol-grounded and Physics-integrated Models: Deeper integration of symbolic reasoning, foundation models for perception, and physics-consistent simulation for complex real-world manipulation (Barcellona et al., 2024).
Scalable, Unsupervised Structure Discovery: Improved Bayesian structure search and neural-symbolic learning algorithms for sparse, interpretable representations (Costa, 2024).
Planning Under Partial Observability and Uncertainty: More robust multi-agent and decentralized planning in partially observable worlds by combining differentiable inference with explicit uncertainty (Zhang et al., 2024).
Long-horizon, Multi-timescale Composition: Jumpy models and hybrid abstraction planners to facilitate robust, sample-efficient composition of policies at varied temporal and semantic levels (Farebrother et al., 23 Feb 2026).

COMBO thus provides a unified, extensible paradigm for modular world modeling, systematically bridging combinatorial generalization, program synthesis, latent variable inference, and planning in interactive AI systems.