Compositional World Model Architecture
- Compositional World Model Architecture is a structured approach that decomposes environment dynamics into modular components to improve generalization and interpretability.
- It leverages factorized latent state representations and diverse neural architectures, such as mixture-of-experts and block-slot models, to enable flexible recombination of learned components.
- Applications include robotics, model-based planning, multi-agent systems, and digital twin construction, demonstrating enhanced zero-shot and few-shot learning performance.
A compositional world model architecture is a structured approach to learning environment dynamics that decomposes the latent state, dynamics, or observation models into modular, interacting components rather than a single monolithic neural network. This design supports generalization, interpretability, and adaptability in decision-making, planning, and reinforcement learning agents. Implementations span a range of mathematical formalisms and neural architectures, but all share modularity as a core principle. Compositional world models are applied across domains such as robotics, model-based planning, multi-agent systems, and digital-twin construction.
1. Foundational Concepts and Formal Principles
Compositional world models generalize the latent state of the environment, $s_t$, from a single vector to a structured collection of modules or factors. For example, (Costa, 2024) mathematically defines the compositional state as $s_t = (g_t, o_t^{1:N}, m_t^{1:K})$, where $g_t$ is a global context, $o_t^{i}$ are object (or sub-entity) modules, and $m_t^{j}$ are additional modules (e.g., mediators of interactions). The temporal generative process is a dynamic Bayesian network (DBN),
$$p(s_{1:T}, x_{1:T} \mid a_{1:T}) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}, a_{t-1})\, p(x_t \mid s_t),$$
but the transition and observation functions are further factorized according to the modular decomposition.
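The modular decomposition above can be sketched in code. The following is a minimal illustration under our own naming conventions (not any cited paper's implementation): a compositional state holds a global context, per-object modules, and mediator modules, and a factorized transition updates each module via its own function, conditioned only on its parents in the DBN.

```python
from dataclasses import dataclass

# Hypothetical sketch: names (CompositionalState, f_ctx, f_obj, f_med)
# are illustrative, not from (Costa, 2024).
@dataclass
class CompositionalState:
    context: float   # global context g_t (scalar here for brevity)
    objects: dict    # object modules o_t^i
    mediators: dict  # mediator modules m_t^j

def transition(state, action, f_ctx, f_obj, f_med):
    """Factorized transition: each module is updated by its own function,
    conditioned only on its DBN parents, never on the full flat state."""
    g = f_ctx(state.context, action)
    objs = {k: f_obj(v, g, action) for k, v in state.objects.items()}
    meds = {k: f_med(v, objs) for k, v in state.mediators.items()}
    return CompositionalState(g, objs, meds)
```

Because each update function sees only its parents, modules can be swapped or recombined without retraining the rest of the model.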
Compositional architectures are not limited to probabilistic graphical models. They manifest in hybrid system automata (Capiluppi et al., 2013), learned programmatic expert ensembles (2505.10819), slot/block neural architectures (Baek et al., 24 Jan 2025), and multi-modal cross-attentional policies (Yoo et al., 4 Sep 2025). The recurring theme is that independent components—objects, primitives, sub-models, symbolic attributes—are learned or instantiated so that the agent can recompose them at test time for combinatorial generalization.
2. Architectural Variants and Key Mechanisms
A typology of compositional world model architectures is outlined in the following table:
| Architecture (Paper) | Type of Composition | Main Mechanism |
|---|---|---|
| Prototype-Implanting (WorMI) (Yoo et al., 4 Sep 2025) | Model-level (module selection) | k-center prototype retrieval, cross-attention fusion between world models and LLM planner |
| Mixture-of-Experts (PRISM-WM) (Li et al., 9 Dec 2025) | Mode-level (Mixture of Experts) | Context-aware gating and latent orthogonalization to specialize experts for dynamic regimes |
| Programmatic Experts (PoE-World) (2505.10819) | Rule-level (Product of Experts) | LLM-synthesized symbolic programs, multiplicative fusion via likelihood product |
| Object-centric Graph Programs (Segler, 2019) | Subgraph/action composition | GNN-parameterized transitions, inductive subgraph-rewrite rules, symbolic action induction |
| Block-Slot Neural Decomposition (Dreamweaver) (Baek et al., 24 Jan 2025) | Concept/slot recombination | RBSU for self-discovering object/attribute slots, multi-step predictive coding |
| Causal Block Factoring (WM3C) (Wang et al., 13 May 2025) | Latent space block factorization | Language tokens gate blocks, mutual information regularization, sparse decoding masks |
| Hybrid Automata/WAs (Capiluppi et al., 2013) | System/environment hierarchy | Parallel/inplacement operators over variable hierarchies, trajectory algebra for interaction |
| Multi-agent Compositional Score-based (Zhang et al., 2024) | Agent/action factorization | Diffusion model with per-agent prompt modulation, compositional joint-action rollouts |
These models often utilize modular slot-encodings, product-of-experts probabilistic structure, cross-task and cross-domain module re-use, and explicit interface layers (e.g., cross-attention, gating networks).
3. Retrieval, Fusion, and Cross-domain Adaptation
Many architectures are designed for adaptation to previously unseen domains or tasks via modular recombination. For example, WorMI (Yoo et al., 4 Sep 2025) retrieves the most relevant domain-specific world models at test time using prototype-based k-center clustering in trajectory embedding space, selecting the domains whose prototypes minimize $\| e(\tau_t) - p_k \|$, where $p_k$ are prototypes of training domains and $e(\tau_t)$ is the embedding of the current state trajectory. Retrieved modules are injected into a frozen LLM policy head via a compound attention mechanism that hierarchically aligns hidden representations.
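The retrieval step can be sketched as follows. This is our own simplified construction (function names and the greedy k-center variant are assumptions, not WorMI's actual API): each training domain is summarized by k-center prototypes of its trajectory embeddings, and at test time the domains whose prototypes lie closest to the current trajectory embedding are retrieved.

```python
import numpy as np

def kcenter_prototypes(embeddings, k):
    """Greedy k-center selection over an (n, d) array of embeddings:
    repeatedly pick the point farthest from the current prototype set."""
    protos = [embeddings[0]]
    for _ in range(k - 1):
        dists = np.min(
            [np.linalg.norm(embeddings - p, axis=1) for p in protos], axis=0
        )
        protos.append(embeddings[int(np.argmax(dists))])
    return np.stack(protos)

def retrieve_domains(domain_protos, query, top_m=2):
    """Rank domains by distance from `query` to their nearest prototype
    and return the top-m domain keys (i.e., the world models to fuse)."""
    scores = {
        d: float(min(np.linalg.norm(p - query) for p in P))
        for d, P in domain_protos.items()
    }
    return sorted(scores, key=scores.get)[:top_m]
```

A nearest-prototype lookup like this keeps test-time retrieval cheap: only the prototypes, not all training trajectories, need to be stored and compared.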
Block-factorized models (e.g., (Wang et al., 13 May 2025)) partition the latent state so that each component is governed directly by a compositional token (often from language), and only a small set of parameters are adapted for novel combinations.
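Token-controlled block gating can be illustrated with a minimal sketch (our own construction, not WM3C's code): each language token owns a sparse mask over latent blocks, and composing tokens takes the union of their masks, so only the gated blocks participate in decoding.

```python
import numpy as np

def gated_latent(z, tokens, token_masks):
    """Apply the union of the block masks selected by `tokens` to latent z.
    Blocks not gated by any token are zeroed out (excluded from decoding)."""
    mask = np.zeros_like(z)
    for t in tokens:
        mask = np.maximum(mask, token_masks[t])
    return z * mask
```

A novel token combination then only requires learning (or reusing) the per-token masks, leaving the shared latent blocks untouched.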
Mixture-of-experts models (e.g., (Li et al., 9 Dec 2025)) switch between dynamics regimes using context-aware gating networks and latent Gram–Schmidt orthogonalization, building compositionality into the very structure of the dynamical flow.
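The two mechanisms combine as in the following simplified sketch (our own, not PRISM-WM's implementation): Gram–Schmidt orthogonalization keeps expert latent directions from collapsing onto one another, and a context-aware gate softmax-mixes the experts' transition outputs.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of vectors; near-dependent ones are dropped.
    Stands in for the latent orthogonalization that specializes experts."""
    basis = []
    for v in vectors:
        w = v - sum((v @ b) * b for b in basis)
        n = np.linalg.norm(w)
        if n > 1e-10:
            basis.append(w / n)
    return np.stack(basis)

def moe_transition(z, context, experts, gate):
    """Context-aware gating: weight each expert's prediction for the next
    latent by a softmax over the gate's logits for this context."""
    logits = gate(context)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * e(z) for wi, e in zip(w, experts))
```

When the gate is confident, the mixture collapses onto one expert, which is how distinct dynamics regimes stay cleanly separated.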
4. Learning, Inference, and Training Algorithms
Compositional world models require learning not just in the parameters of neural modules but in the structure of the modular decomposition itself. Structural learning is carried out via:
- Bayesian structure search with sparsity priors and ARD (Costa, 2024)
- Symbolic induction by subgraph-diff (Segler, 2019)
- Programmatic expert synthesis via LLM prompts and MLE weight fitting (2505.10819)
- Language-guided block identification and mutual-information-based regularization (Wang et al., 13 May 2025)
- Prototype clustering and representation matching (Yoo et al., 4 Sep 2025)
- Combinatorial training objectives that enforce single-module and product-of-modules denoising (Zhou et al., 2024, Zhang et al., 2024)
Per-module parameterizations are trained with variational Bayes, black-box inference, or soft/structured masking in decoders. Inference propagates distributions or states between modules along DAGs (in graphical models), attention patterns, or explicit message passing in symbolic frameworks.
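The multiplicative fusion used by programmatic-expert approaches reduces to a weighted sum in log space. The sketch below is generic product-of-experts arithmetic (PoE-World's experts are LLM-synthesized programs, which we do not reproduce here); the helper names are ours.

```python
def poe_log_score(expert_log_probs, weights):
    """Log of a weighted product of expert densities: the composite
    unnormalized log-likelihood is a weighted sum of expert log-likelihoods."""
    return sum(w * lp for w, lp in zip(weights, expert_log_probs))

def poe_predict(candidates, experts, weights):
    """Pick the candidate next state maximizing the fused log score;
    weights (e.g., MLE-fitted) decide how much each expert's rule counts."""
    return max(
        candidates,
        key=lambda c: poe_log_score([e(c) for e in experts], weights),
    )
```

Because the product assigns low likelihood whenever any heavily weighted expert does, a single reliable rule can veto predictions the other experts would accept.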
5. Compositionality in Planning, Imitation, and Reinforcement Learning
Compositional models are especially beneficial for planning and generalization tasks. In (Zhou et al., 2024), zero-shot and few-shot generalization is demonstrated by recombining previously learned video-planning primitives with a compositional product-of-score formulation,
$$\nabla_x \log p(x \mid c_1, \ldots, c_n) = \sum_{i=1}^{n} \nabla_x \log p(x \mid c_i) - (n-1)\, \nabla_x \log p(x),$$
allowing for novel instruction compositions.
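Sampling from such a composition amounts to standard compositional score arithmetic, sketched below (this is the generic classifier-free-guidance-style form, not the paper's exact training code): each instruction's conditional score offset is added on top of a shared unconditional score.

```python
import numpy as np

def composed_score(x, cond_scores, uncond_score, weights=None):
    """Combine per-instruction conditional scores over one unconditional
    score; with unit weights this is the product-of-scores composition."""
    if weights is None:
        weights = [1.0] * len(cond_scores)
    base = uncond_score(x)
    return base + sum(w * (s(x) - base) for w, s in zip(weights, cond_scores))
```

The composed score then drives an ordinary diffusion sampler, so instructions never seen together at training time can still be jointly satisfied at sampling time.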
Graph-based models (Segler, 2019) and programmatic expert architectures (2505.10819) integrate easily with tree-search planners, as each compositional module defines a branch or independent factor in the search.
Value-predictive and dynamics modules can be independently composed for improved imagination and sample efficiency, as in RISE (Yang et al., 11 Feb 2026), where world prediction and value estimation are performed by distinct but composable neural backbones.
Multi-agent cooperation is enabled by compositional diffusion models that factorize the contributions of agent-specific actions at each step, as seen in COMBO (Zhang et al., 2024).
6. Empirical Results, Generalization, and Evaluation
Compositional world models achieve superior empirical results across domains:
- Improved zero-shot and few-shot generalization on RT-X and RLBench with compositional video generation primitives (Zhou et al., 2024)
- Human success rates increase from 46–69% (monolithic) to over 81% (compositional) on previously unseen tasks (Zhou et al., 2024)
- Multitask planning scores on DMControl MT30 increase by +23.5% over monolithic world models when using compositional mixture-of-experts dynamics (Li et al., 9 Dec 2025)
- Retrosynthetic planning in chemical domains solves 95% of held-out targets using compositional GNN graph-rewrite programs (Segler, 2019)
- Atari generalization in PoE-World matches or surpasses baseline RL performance with far less data or task-specific tuning (2505.10819)
- One-shot imitation learning in robotics becomes feasible via explicit scene-object composition and physically grounded digital twin reconstructions (Barcellona et al., 2024)
Critically, compositional architectures are empirically observed to reduce the extrapolation errors that monolithic models incur by smoothing over regime boundaries, to enable combinatorial recombination of states and actions, and to facilitate symbolic/equivariant reasoning.
7. Interpretability, Modularity, and Theoretical Guarantees
Compositional models often provide interpretability: modules correspond to semantic entities (objects, verbs, rules, primitives), and their causal roles can be traced via attention weights, slot activations, or symbolic traces. Explicit structure learning (e.g., (Costa, 2024)) and symbolic neurosymbolic models (Sehgal et al., 2023) enable causal graph analysis and modular diagnostics.
In settings like WM3C (Wang et al., 13 May 2025), identifiability theorems guarantee that language-controlled compositional latent blocks will be uniquely recovered, provided mild regularity and independence assumptions.
This suggests that compositional world models offer a tractable theoretical foundation for structure discovery, generalization, and scalable inference in complex open-ended environments.
References
All claims, metrics, algorithmic steps, and architectural details above are strictly derived from (Yoo et al., 4 Sep 2025, Segler, 2019, Zhou et al., 2024, Hayashi et al., 13 Mar 2025, Capiluppi et al., 2013, Zhang et al., 2024, Wang et al., 13 May 2025, 2505.10819, Li et al., 9 Dec 2025, Barcellona et al., 2024, Baek et al., 24 Jan 2025, Sehgal et al., 2023), and (Costa, 2024).