Compositional World Model Architecture
- Compositional World Model Architecture is a structured approach that decomposes environment dynamics into modular components to improve generalization and interpretability.
- It leverages factorized latent state representations and diverse neural architectures, such as mixture-of-experts and block-slot models, to enable flexible recombination of learned components.
- Applications include robotics, model-based planning, multi-agent systems, and digital twin construction, demonstrating enhanced zero-shot and few-shot learning performance.
A compositional world model architecture is a structured approach to learning environment dynamics that decomposes the latent state, dynamics, or observation models into modular, interacting components rather than a single monolithic neural network. This design supports generalization, interpretability, and adaptability in decision-making, planning, and reinforcement learning agents. Implementations span a range of mathematical formalisms and neural architectures, but all share modularity as a core principle. Compositional world models are applied across domains such as robotics, model-based planning, multi-agent systems, and digital-twin construction.
1. Foundational Concepts and Formal Principles
Compositional world models generalize the latent state of the environment, $s_t$, from a single vector to a structured collection of modules or factors. For example, (Costa, 2024) mathematically defines the compositional state as $s_t = (g_t, o_t^{1:N}, m_t^{1:K})$, where $g_t$ is a global context, $o_t^{i}$ are object (or sub-entity) modules, and $m_t^{j}$ are additional modules (e.g., mediators of interactions). The temporal generative process is a dynamic Bayesian network (DBN),
$$p(s_{1:T}, x_{1:T} \mid a_{1:T}) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}, a_{t-1})\, p(x_t \mid s_t),$$
but the transition and observation functions are further factorized according to the modular decomposition.
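The modular decomposition above can be sketched in code. The following is a minimal illustration under our own naming conventions (not any cited paper's implementation): a compositional state holds a global context, per-object modules, and mediator modules, and a factorized transition updates each module via its own function, conditioned only on its parents in the DBN.

```python
from dataclasses import dataclass

# Hypothetical sketch: names (CompositionalState, f_ctx, f_obj, f_med)
# are illustrative, not from (Costa, 2024).
@dataclass
class CompositionalState:
    context: float   # global context g_t (scalar here for brevity)
    objects: dict    # object modules o_t^i
    mediators: dict  # mediator modules m_t^j

def transition(state, action, f_ctx, f_obj, f_med):
    """Factorized transition: each module is updated by its own function,
    conditioned only on its DBN parents, never on the full flat state."""
    g = f_ctx(state.context, action)
    objs = {k: f_obj(v, g, action) for k, v in state.objects.items()}
    meds = {k: f_med(v, objs) for k, v in state.mediators.items()}
    return CompositionalState(g, objs, meds)
```

Because each update function sees only its parents, modules can be swapped or recombined without retraining the rest of the model.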
Compositional architectures are not limited to probabilistic graphical models. They manifest in hybrid system automata (Capiluppi et al., 2013), learned programmatic expert ensembles (2505.10819), slot/block neural architectures (Baek et al., 24 Jan 2025), and multi-modal cross-attentional policies (Yoo et al., 4 Sep 2025). The recurring theme is that independent components—objects, primitives, sub-models, symbolic attributes—are learned or instantiated so that the agent can recompose them at test time for combinatorial generalization.
2. Architectural Variants and Key Mechanisms
A typology of compositional world model architectures is outlined in the following table:
| Architecture (Paper) | Type of Composition | Main Mechanism |
|---|---|---|
| Prototype-Implanting (WorMI) (Yoo et al., 4 Sep 2025) | Model-level (module selection) | k-center prototype retrieval, cross-attention fusion between world models and LLM planner |
| Mixture-of-Experts (PRISM-WM) (Li et al., 9 Dec 2025) | Mode-level (Mixture of Experts) | Context-aware gating and latent orthogonalization to specialize experts for dynamic regimes |
| Programmatic Experts (PoE-World) (2505.10819) | Rule-level (Product of Experts) | LLM-synthesized symbolic programs, multiplicative fusion via likelihood product |
| Object-centric Graph Programs (Segler, 2019) | Subgraph/action composition | GNN-parameterized transitions, inductive subgraph-rewrite rules, symbolic action induction |
| Block-Slot Neural Decomposition (Dreamweaver) (Baek et al., 24 Jan 2025) | Concept/slot recombination | RBSU for self-discovering object/attribute slots, multi-step predictive coding |
| Causal Block Factoring (WM3C) (Wang et al., 13 May 2025) | Latent space block factorization | Language tokens gate blocks, mutual information regularization, sparse decoding masks |
| Hybrid Automata/WAs (Capiluppi et al., 2013) | System/environment hierarchy | Parallel/inplacement operators over variable hierarchies, trajectory algebra for interaction |
| Multi-agent Compositional Score-based (Zhang et al., 2024) | Agent/action factorization | Diffusion model with per-agent prompt modulation, compositional joint-action rollouts |
These models often utilize modular slot-encodings, product-of-experts probabilistic structure, cross-task and cross-domain module re-use, and explicit interface layers (e.g., cross-attention, gating networks).
3. Retrieval, Fusion, and Cross-domain Adaptation
Many architectures are designed for adaptation to previously unseen domains or tasks via modular recombination. For example, WorMI (Yoo et al., 4 Sep 2025) retrieves the most relevant domain-specific world models at test time using prototype-based k-center clustering in trajectory embedding space, selecting the domains whose prototypes minimize $\| e(\tau_t) - p_k \|$, where $p_k$ are prototypes of training domains and $e(\tau_t)$ is the embedding of the current state trajectory. Retrieved modules are injected into a frozen LLM policy head via a compound attention mechanism that hierarchically aligns hidden representations.
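The retrieval step can be sketched as follows. This is our own simplified construction (function names and the greedy k-center variant are assumptions, not WorMI's actual API): each training domain is summarized by k-center prototypes of its trajectory embeddings, and at test time the domains whose prototypes lie closest to the current trajectory embedding are retrieved.

```python
import numpy as np

def kcenter_prototypes(embeddings, k):
    """Greedy k-center selection over an (n, d) array of embeddings:
    repeatedly pick the point farthest from the current prototype set."""
    protos = [embeddings[0]]
    for _ in range(k - 1):
        dists = np.min(
            [np.linalg.norm(embeddings - p, axis=1) for p in protos], axis=0
        )
        protos.append(embeddings[int(np.argmax(dists))])
    return np.stack(protos)

def retrieve_domains(domain_protos, query, top_m=2):
    """Rank domains by distance from `query` to their nearest prototype
    and return the top-m domain keys (i.e., the world models to fuse)."""
    scores = {
        d: float(min(np.linalg.norm(p - query) for p in P))
        for d, P in domain_protos.items()
    }
    return sorted(scores, key=scores.get)[:top_m]
```

A nearest-prototype lookup like this keeps test-time retrieval cheap: only the prototypes, not all training trajectories, need to be stored and compared.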
Block-factorized models (e.g., (Wang et al., 13 May 2025)) partition the latent state so that each component is governed directly by a compositional token (often from language), and only a small set of parameters are adapted for novel combinations.
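Token-controlled block gating can be illustrated with a minimal sketch (our own construction, not WM3C's code): each language token owns a sparse mask over latent blocks, and composing tokens takes the union of their masks, so only the gated blocks participate in decoding.

```python
import numpy as np

def gated_latent(z, tokens, token_masks):
    """Apply the union of the block masks selected by `tokens` to latent z.
    Blocks not gated by any token are zeroed out (excluded from decoding)."""
    mask = np.zeros_like(z)
    for t in tokens:
        mask = np.maximum(mask, token_masks[t])
    return z * mask
```

A novel token combination then only requires learning (or reusing) the per-token masks, leaving the shared latent blocks untouched.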
Mixture-of-experts models (e.g., (Li et al., 9 Dec 2025)) switch between dynamics regimes using context-aware gating networks and latent Gram–Schmidt orthogonalization, building compositionality into the very structure of the dynamical flow.
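The two mechanisms combine as in the following simplified sketch (our own, not PRISM-WM's implementation): Gram–Schmidt orthogonalization keeps expert latent directions from collapsing onto one another, and a context-aware gate softmax-mixes the experts' transition outputs.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of vectors; near-dependent ones are dropped.
    Stands in for the latent orthogonalization that specializes experts."""
    basis = []
    for v in vectors:
        w = v - sum((v @ b) * b for b in basis)
        n = np.linalg.norm(w)
        if n > 1e-10:
            basis.append(w / n)
    return np.stack(basis)

def moe_transition(z, context, experts, gate):
    """Context-aware gating: weight each expert's prediction for the next
    latent by a softmax over the gate's logits for this context."""
    logits = gate(context)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * e(z) for wi, e in zip(w, experts))
```

When the gate is confident, the mixture collapses onto one expert, which is how distinct dynamics regimes stay cleanly separated.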
4. Learning, Inference, and Training Algorithms
Compositional world models require learning not just in the parameters of neural modules but in the structure of the modular decomposition itself. Structural learning is carried out via:
- Bayesian structure search with sparsity priors and ARD (Costa, 2024)
- Symbolic induction by subgraph-diff (Segler, 2019)
- Programmatic expert synthesis via LLM prompts and MLE weight fitting (2505.10819)
- Language-guided block identification and mutual-information-based regularization (Wang et al., 13 May 2025)
- Prototype clustering and representation matching (Yoo et al., 4 Sep 2025)
- Combinatorial training objectives that enforce single-module and product-of-modules denoising (Zhou et al., 2024, Zhang et al., 2024)
Per-module parameterizations are trained with variational Bayes, black-box inference, or soft/structured masking in decoders. Inference propagates distributions or states between modules along DAGs (in graphical models), attention patterns, or explicit message passing in symbolic frameworks.
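The multiplicative fusion used by programmatic-expert approaches reduces to a weighted sum in log space. The sketch below is generic product-of-experts arithmetic (PoE-World's experts are LLM-synthesized programs, which we do not reproduce here); the helper names are ours.

```python
def poe_log_score(expert_log_probs, weights):
    """Log of a weighted product of expert densities: the composite
    unnormalized log-likelihood is a weighted sum of expert log-likelihoods."""
    return sum(w * lp for w, lp in zip(weights, expert_log_probs))

def poe_predict(candidates, experts, weights):
    """Pick the candidate next state maximizing the fused log score;
    weights (e.g., MLE-fitted) decide how much each expert's rule counts."""
    return max(
        candidates,
        key=lambda c: poe_log_score([e(c) for e in experts], weights),
    )
```

Because the product assigns low likelihood whenever any heavily weighted expert does, a single reliable rule can veto predictions the other experts would accept.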
5. Compositionality in Planning, Imitation, and Reinforcement Learning
Compositional models are especially beneficial for planning and generalization tasks. In (Zhou et al., 2024), zero-shot and few-shot generalization is demonstrated by recombining previously learned video-planning primitives with a compositional product-of-score formulation,
$$\nabla_x \log p(x \mid c_1, \ldots, c_n) = \sum_{i=1}^{n} \nabla_x \log p(x \mid c_i) - (n-1)\, \nabla_x \log p(x),$$
allowing for novel instruction compositions.
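Sampling from such a composition amounts to standard compositional score arithmetic, sketched below (this is the generic classifier-free-guidance-style form, not the paper's exact training code): each instruction's conditional score offset is added on top of a shared unconditional score.

```python
import numpy as np

def composed_score(x, cond_scores, uncond_score, weights=None):
    """Combine per-instruction conditional scores over one unconditional
    score; with unit weights this is the product-of-scores composition."""
    if weights is None:
        weights = [1.0] * len(cond_scores)
    base = uncond_score(x)
    return base + sum(w * (s(x) - base) for w, s in zip(weights, cond_scores))
```

The composed score then drives an ordinary diffusion sampler, so instructions never seen together at training time can still be jointly satisfied at sampling time.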
Graph-based models (Segler, 2019) and programmatic expert architectures (2505.10819) integrate easily with tree-search planners, as each compositional module defines a branch or independent factor in the search.
Value-predictive and dynamics modules can be independently composed for improved imagination and sample efficiency, as in RISE (Yang et al., 11 Feb 2026), where world prediction and value estimation are performed by distinct but composable neural backbones.
Multi-agent cooperation is enabled by compositional diffusion models that factorize the contributions of agent-specific actions at each step, as seen in COMBO (Zhang et al., 2024).
6. Empirical Results, Generalization, and Evaluation
Compositional world models achieve superior empirical results across domains:
- Improved zero-shot and few-shot generalization on RT-X and RLBench with compositional video generation primitives (Zhou et al., 2024)
- Human success rates increase from 46–69% (monolithic) to over 81% (compositional) on previously unseen tasks (Zhou et al., 2024)
- Multitask planning scores on DMControl MT30 increase by +23.5% over monolithic world models when using compositional mixture-of-experts dynamics (Li et al., 9 Dec 2025)
- Retrosynthetic planning in chemical domains solves 95% of held-out targets using compositional GNN graph-rewrite programs (Segler, 2019)
- Atari generalization in PoE-World matches or surpasses baseline RL performance with far less data or task-specific tuning (2505.10819)
- One-shot imitation learning in robotics becomes feasible via explicit scene-object composition and physically grounded digital twin reconstructions (Barcellona et al., 2024)
Critically, compositional architectures are empirically observed to reduce the extrapolation errors that monolithic models incur by smoothing over regime boundaries, to enable combinatorial recombination of states and actions, and to facilitate symbolic/equivariant reasoning.
7. Interpretability, Modularity, and Theoretical Guarantees
Compositional models often provide interpretability: modules correspond to semantic entities (objects, verbs, rules, primitives), and their causal roles can be traced via attention weights, slot activations, or symbolic traces. Explicit structure learning (e.g., (Costa, 2024)) and symbolic neurosymbolic models (Sehgal et al., 2023) enable causal graph analysis and modular diagnostics.
In settings like WM3C (Wang et al., 13 May 2025), identifiability theorems guarantee that language-controlled compositional latent blocks will be uniquely recovered, provided mild regularity and independence assumptions.
This suggests that compositional world models offer a tractable theoretical foundation for structure discovery, generalization, and scalable inference in complex open-ended environments.
References
All claims, metrics, algorithmic steps, and architectural details above are strictly derived from (Yoo et al., 4 Sep 2025, Segler, 2019, Zhou et al., 2024, Hayashi et al., 13 Mar 2025, Capiluppi et al., 2013, Zhang et al., 2024, Wang et al., 13 May 2025, 2505.10819, Li et al., 9 Dec 2025, Barcellona et al., 2024, Baek et al., 24 Jan 2025, Sehgal et al., 2023), and (Costa, 2024).