Unified World Model Specification
- A Unified Design Specification for World Models is a comprehensive framework that formalizes the mathematical principles and modular architecture required for embodied AI systems to predict, simulate, plan, and control their environment.
- The framework defines a systematic integration of multimodal perception, dynamics, symbolic reasoning, and spatial grounding to enable coherent, cross-modal world modeling.
- It employs hierarchical architectures, joint training objectives, and explicit interfaces to achieve scalability, temporal consistency, and robust simulation across diverse tasks.
A unified design specification for world models formalizes the principles, mathematical structures, and modular components required to construct architectures that simultaneously serve as predictors, simulators, planners, and controllers for embodied AI systems. Unified world models depart from fragmented, task-specific “world knowledge injection” by assembling perception, dynamics, symbolic reasoning, and spatial grounding into a single, closed-loop architecture, enabling robust, coherent, and extensible modeling across modalities, time horizons, and interaction regimes (Chi et al., 23 Jun 2025, Li et al., 21 Oct 2025, Team et al., 25 Nov 2025, Li et al., 19 Oct 2025, Zeng et al., 2 Feb 2026).
1. Mathematical Foundations and Problem Setting
A unified world model is parameterized as a simulator and inference engine over:
- A possibly continuous state space $\mathcal{S}$
- A multimodal observation space $\mathcal{O}$
- An action/control space $\mathcal{A}$
- An optional symbolic space $\Sigma$
At each discrete time $t$, the world model maintains an internal state $s_t \in \mathcal{S}$ and processes an action $a_t \in \mathcal{A}$ to yield a successor state $s_{t+1} = f_\theta(s_t, a_t)$ and a rendered observation $o_{t+1} = g_\theta(s_{t+1})$, where $h_\theta : \mathcal{S} \to \Sigma$ yields symbolic or logical forms. This generalizes the latent-variable POMDP formulation $p_\theta(o_{1:T} \mid a_{1:T}) = \int \prod_{t=1}^{T} p_\theta(o_t \mid s_t)\, p_\theta(s_t \mid s_{t-1}, a_{t-1})\, \mathrm{d}s_{1:T}$, with parameters $\theta$ optimized via an evidence lower bound (ELBO) (Li et al., 19 Oct 2025, Zeng et al., 2 Feb 2026).
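The per-step interface described above (transition, rendering, symbolic readout) can be sketched as a minimal Python structure. The names `WorldModel`, `transition`, `render`, and `symbolize` are illustrative assumptions, not the API of any cited system.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class WorldModel:
    transition: Callable[[Any, Any], Any]  # advances the internal state given an action
    render: Callable[[Any], Any]           # decodes a state into an observation
    symbolize: Callable[[Any], Any]        # maps a state to a symbolic/logical form

    def step(self, state, action):
        """One closed-loop tick: next state, rendered observation, symbolic readout."""
        nxt = self.transition(state, action)
        return nxt, self.render(nxt), self.symbolize(nxt)

# Toy instantiation: scalar state with additive dynamics.
wm = WorldModel(
    transition=lambda s, a: s + a,
    render=lambda s: {"pixel_mean": s},
    symbolize=lambda s: "positive" if s > 0 else "non-positive",
)
state, obs, sym = wm.step(0.0, 1.0)
```

In a real system each callable would be a learned network; the point of the sketch is the shared interface that lets prediction, simulation, and symbolic readout run in one loop.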
2. Core Modules and Hierarchical Architectures
A normative unified world model comprises the following modules with standard interfaces (Zeng et al., 2 Feb 2026, Chi et al., 23 Jun 2025):
- Perception: Encoders map multimodal raw observations (images, text, 3D, audio) to a unified embedding $z_t$, and update posteriors over the latent state $s_t$ via a filter.
- Interaction/Dynamics: A learned transition model (typically GNN, Transformer, or hierarchical diffusion) advances the internal state subject to action.
- Symbolic Reasoning: Mapping from the latent state $s_t$ to a symbolic or logical representation $\sigma_t$, implemented either as direct latent-to-text mapping (LLM) or as an attention-based latent reasoning module.
- Spatial Representation: Explicit construction of geometric structures (occupancy grids, graphs, signed distance fields) from the latent state $s_t$.
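The perception module's filtered posterior update can be illustrated with a one-dimensional Gaussian (Kalman-style) filter. The function and variable names are illustrative assumptions; real unified world models use learned, amortized filters rather than closed-form Gaussians.

```python
def filter_update(prior_mean, prior_var, obs, obs_var):
    """Fuse a Gaussian prior over the latent state with one noisy observation."""
    k = prior_var / (prior_var + obs_var)           # gain: how much to trust the observation
    post_mean = prior_mean + k * (obs - prior_mean) # shift the mean toward the observation
    post_var = (1.0 - k) * prior_var                # uncertainty shrinks after the update
    return post_mean, post_var

# With equal prior and observation variance, the posterior mean lands halfway.
mean, var = filter_update(prior_mean=0.0, prior_var=1.0, obs=2.0, obs_var=1.0)
```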
Architectural examples include:
| System | Perception | Dynamics Module | Symbolic Head | Spatial Module | Control Interface |
|---|---|---|---|---|---|
| MinD (Chi et al., 23 Jun 2025) | Visual/language enc. | Hierarchical Diffusion | Optional classifier | Video-latent aggregation | HiDiff-Policy diffusion |
| OmniNWM (Li et al., 21 Oct 2025) | Panoramic Latents | Diffusion Transf.+VAE | N/A | Occupancy grid | Plücker ray-map |
| GigaWorld-0 (Team et al., 25 Nov 2025) | 3D-VAE, vision/lang | Flow matching DiT | IDM | 3D Gaussian Splatting | Text/action conditioners |
| Motus (Bi et al., 15 Dec 2025) | VLM, VGM | MoT + UniDiffuser | Multi-expert FFNs | Latent action (opt.flow) | Data-pyramid/pretraining |
| PointWorld (Huang et al., 7 Jan 2026) | RGB-D, 2D-3D fusion | PointTransformerV3 | N/A | 3D point cloud | MPC over point flows |
Dual/hierarchical systems (e.g., MinD) disentangle slow visual imagination from fast action, coupling them via adapters (DiffMatcher), enabling coherent closed-loop control with minimal latency (Chi et al., 23 Jun 2025).
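The slow-imagination/fast-action coupling can be sketched as two loops running at different rates, with a stub adapter standing in for a learned module such as DiffMatcher. The update rates, scaling factor, and all function names are illustrative assumptions.

```python
def run_dual_loop(steps, imagine_every=4):
    """Slow module refreshes an imagined latent; the fast policy acts every tick."""
    imagined = 0.0
    actions = []
    for t in range(steps):
        if t % imagine_every == 0:
            imagined = float(t)      # slow visual imagination (stub: encodes the timestep)
        adapted = imagined * 0.1     # adapter projects the latent into policy space
        action = adapted + 1.0       # fast policy conditioned on the adapted latent
        actions.append(action)
    return actions

actions = run_dual_loop(8)
```

The fast loop never blocks on the slow one; it always consumes the most recent imagined latent, which is how such systems keep control latency low.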
3. Unified Training Objectives and Multi-Modal Loss Functions
Unified world models employ joint optimization over all modules, combining per-modality generative losses (diffusion, reconstruction), dynamics-prediction terms, cross-modal alignment regularizers, and task-specific objectives.
For diffusion-based components, each modality may have independent noise schedules and loss weights (as in MinD’s dual-scheduler, or the independent timesteps in UWM (Chi et al., 23 Jun 2025, Zhu et al., 3 Apr 2025)). Auxiliary terms enforce semantic alignment or noise-invariant conditioning (e.g., DiffMatcher’s sim-loss (Chi et al., 23 Jun 2025)) and geometric/physical consistency (e.g., 3DGS regularization (Team et al., 25 Nov 2025), occupancy-based losses (Li et al., 21 Oct 2025)).
Specific task heads may optimize trajectory regression (e.g., $\ell_2$ error), cross-entropy, or domain-specific metrics (segmentation, detection, memory consistency).
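A minimal sketch of such a joint objective, assuming independent per-modality diffusion timesteps and hand-set loss weights (both illustrative; real systems learn or tune these):

```python
import random

def joint_loss(modality_losses, weights):
    """Weighted sum of per-modality loss terms (diffusion, alignment, task heads)."""
    return sum(weights[m] * modality_losses[m] for m in modality_losses)

def sample_timesteps(modalities, T=1000, seed=0):
    """Each modality draws its own diffusion timestep independently."""
    rng = random.Random(seed)
    return {m: rng.randrange(T) for m in modalities}

# Independent schedules: video and action need not share a noise level.
ts = sample_timesteps(["video", "action"])
loss = joint_loss({"video": 0.5, "action": 0.2, "align": 0.1},
                  {"video": 1.0, "action": 2.0, "align": 0.5})
```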
4. Temporal, Modal, and Spatial Unification Strategies
Unified specifications address three major axes of design (Li et al., 19 Oct 2025):
- Functionality: Decision-coupled (e.g., RL/MBRL) vs. general-purpose (offline predictive, simulation, multi-task).
- Temporal Modeling: Sequential simulation (autoregressive, as in RSSM, GNN, Transformer) vs. global difference prediction (diffusion, Masked JEPA, flow-matching ODEs).
- Spatial Representation: Global latent vector (GLV), token feature sequence (TFS), spatial latent grid (SLG), decomposed rendering representations (DRR: NeRF, Gaussian splats).
Selection along these axes is governed by target task demands, compute constraints, and required level of geometric or semantic fidelity.
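As a toy illustration of selecting along these axes, the heuristic rules below are assumptions made for exposition only; the acronyms GLV and DRR follow the taxonomy above.

```python
def choose_design(needs_control, long_horizon, needs_geometry):
    """Map coarse task demands to one choice per design axis (heuristic sketch)."""
    functionality = "decision-coupled" if needs_control else "general-purpose"
    temporal = "autoregressive" if long_horizon else "global-difference"
    spatial = "DRR" if needs_geometry else "GLV"
    return {"functionality": functionality, "temporal": temporal, "spatial": spatial}

cfg = choose_design(needs_control=True, long_horizon=True, needs_geometry=False)
```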
5. Integrated Control, Simulation, and Evaluation Loops
Unified world models typically operate in tightly coupled simulation–control loops. Visual imagination (video rollout, future state generation) guides policy modules conditioned on imagined latents (MinD, OmniNWM); the MPC or action inference loops are closed with predicted future observations and latent state estimates. For robust deployment, these models often:
- Enable online evaluation of task feasibility via latent-space classifiers (success/failure) (Chi et al., 23 Jun 2025)
- Predict action-conditioned scene flows in 3D (PointWorld) (Huang et al., 7 Jan 2026)
- Define reward directly from rendered or occupancy-derived features (OmniNWM) (Li et al., 21 Oct 2025)
- Enable cross-modal generalization and adaptation via shared embeddings and feature alignment
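The action-inference side of such a loop can be sketched as random-shooting MPC over imagined rollouts; the toy dynamics, reward, and function names are illustrative assumptions, not any cited system's planner.

```python
import random

def rollout_return(state, actions, dynamics, reward):
    """Score an action sequence by rolling it out through the model."""
    total = 0.0
    for a in actions:
        state = dynamics(state, a)
        total += reward(state)
    return total

def mpc_plan(state, dynamics, reward, horizon=3, samples=64, seed=0):
    """Sample action sequences, score each imagined rollout, return the best first action."""
    rng = random.Random(seed)
    best_seq, best_ret = None, float("-inf")
    for _ in range(samples):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        ret = rollout_return(state, seq, dynamics, reward)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq[0]

# Toy task: drive a scalar state toward zero under additive dynamics.
a0 = mpc_plan(5.0, dynamics=lambda s, a: s + a, reward=lambda s: -abs(s))
```

Only the first action is executed; the loop then re-plans from the newly filtered state estimate, which is what closes the simulation–control loop.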
Evaluation leverages task, state, and generative metrics: FID/FVD, mIoU, SSIM, LPIPS for pixel/visual tasks; Chamfer distance, point-set metrics for 3D; success rates, sample efficiency, return for RL/planning (Li et al., 19 Oct 2025).
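Of the 3D metrics listed, Chamfer distance is simple enough to sketch directly. This pure-Python version is quadratic and unaccelerated, and uses averaged squared nearest-neighbor distances, one common convention among several.

```python
def chamfer(a, b):
    """Symmetric Chamfer distance between two point sets (tuples of coordinates)."""
    def sq(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    d_ab = sum(min(sq(p, q) for q in b) for p in a) / len(a)  # a -> nearest in b
    d_ba = sum(min(sq(q, p) for p in a) for q in b) / len(b)  # b -> nearest in a
    return d_ab + d_ba

d = chamfer([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 1.0)])
```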
6. Modularity, Scalability, and Systemic Design Principles
Unified specifications enforce modularity (decoupling video and 3D pipelines, clearly defined interfaces), scalability (parameter-efficient backbones, sparse attention, MoE branching (Team et al., 25 Nov 2025)), and data-centric quality management (automated rejection, consistency, and alignment pipelines).
Characteristic design principles across leading unified architectures include:
- Embodiment-agnostic state/action (particle/point flows, normalized ray-maps) for cross-platform generalization (He et al., 3 Nov 2025, Huang et al., 7 Jan 2026)
- Separate but coupled optimization schedules for heterogeneous subsystems (dual schedulers) (Chi et al., 23 Jun 2025)
- Controllability via explicit policy/APIs (as in Web World Models, which use code-defined physics + generative LLM layers) (Feng et al., 29 Dec 2025)
- Curriculum learning and hierarchical memory for long-horizon temporal coherence (Dong et al., 9 Oct 2025)
- Explicit spatial grounding for geometric consistency, planning, and physical interaction
These design patterns enable extension to new modalities, tasks, or paradigms, and facilitate benchmarking, continual learning, and cross-task transfer (Zeng et al., 2 Feb 2026).
In sum, a unified design specification for world models prescribes a modular, mathematically rigorous architecture that integrates multimodal perception, structured dynamics, symbolic reasoning, and spatial grounding into a single closed-loop, extensible framework optimized via multi-term loss. Recent advances demonstrate that such systems routinely support imagination-driven planning, cross-modality transfer, latent-space risk prediction, and scalable data generation, substantiating the case for systematic unification over fragmented, task-specific world knowledge injection (Chi et al., 23 Jun 2025, Li et al., 21 Oct 2025, Team et al., 25 Nov 2025, Bi et al., 15 Dec 2025, Zeng et al., 2 Feb 2026).