World-Modeling Generation

Updated 4 July 2026

World-modeling generation is a technique that synthesizes persistent, simulated worlds maintaining causal integrity and physical plausibility.
It employs methods like diffusion, flow matching, and explicit state representations (geometric, symbolic, or latent) to ensure instruction following and physics adherence.
Benchmark evaluations focus on dynamics fidelity, controllability, and real-world physics compliance, driving improvements across diverse application domains.

World-modeling generation denotes a class of generative systems that aim to synthesize not merely plausible images or videos, but persistent worlds whose spatial structure, temporal evolution, physical behavior, and response to interventions remain internally consistent. In the capability taxonomy of visual generation, it corresponds to Level 5, where generation is anchored by an internalized world model and the system functions as “a world simulator, not just an appearance generator” (Wu et al., 30 Apr 2026). Contemporary benchmark design formalizes the same shift: world generation is treated as next-scene continuation under explicit layout and camera specifications, or as video synthesis that must satisfy instruction following and physics adherence rather than only perceptual quality (Duan et al., 1 Apr 2025, Li et al., 28 Feb 2025).

1. Definition and conceptual boundaries

World-modeling generation is distinguished from earlier forms of visual generation by the requirement that outputs remain causally faithful under actions and physical context. In the five-level taxonomy from atomic generation through agentic generation, the defining property of Level 5 is “Causal Simulation (Physics + Intervention),” with the central challenge identified as causal faithfulness and physical plausibility (Wu et al., 30 Apr 2026). The conceptual boundary is explicit: where Level 4 systems can plan, verify, and iterate over outputs, Level 5 systems must predict what would actually happen under a specific action in a specific physical setting.

This criterion changes the meaning of “correctness.” In WorldModelBench, a video generator is treated as a candidate world model only if it both follows the intended action and produces future frames consistent with real-world dynamics; accordingly, the benchmark evaluates instruction following, physics adherence, and commonsense video quality instead of relying on generic visual realism alone (Li et al., 28 Feb 2025). Physics adherence is decomposed into Newton’s First Law, Conservation of Mass / Solid Mechanics, Fluid Mechanics, Impenetrability, and Gravitation, making physical failure a first-class evaluation target rather than an incidental artifact.

WorldScore makes the same point from a scene-generation perspective by redefining world generation as a sequential next-scene problem. Each task is specified by a triplet $(\mathcal{C}, \mathcal{N}, \mathcal{L})$, where the current scene $\mathcal{C}$ is paired with a next-scene prompt $\mathcal{N}$ and an explicit layout specification $\mathcal{L}$ containing a camera trajectory and camera-motion instruction (Duan et al., 1 Apr 2025). This formulation treats world generation as controlled continuation of a persistent scene graph, camera path, and semantic progression, rather than as isolated clip synthesis.
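The triplet structure can be made concrete with a small data-structure sketch. Only the three components $(\mathcal{C}, \mathcal{N}, \mathcal{L})$ come from the benchmark description; the field names, the string-based scene representation, and the `chain_tasks` helper are hypothetical and are not WorldScore's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Layout:
    """L: layout specification (field names are illustrative assumptions)."""
    camera_trajectory: List[Tuple[float, float, float]]  # camera positions along the path
    camera_motion: str                                   # textual camera-motion instruction


@dataclass(frozen=True)
class NextSceneTask:
    """One (C, N, L) task: current scene, next-scene prompt, layout."""
    current_scene: str
    next_scene_prompt: str
    layout: Layout


def chain_tasks(initial_scene, steps):
    """Build a sequence of tasks in which each step continues from the previous one.

    Assumption: the scene requested at step k is treated as the current
    scene of step k+1, which is one simple reading of "sequential next-scene".
    """
    tasks = []
    current = initial_scene
    for prompt, layout in steps:
        tasks.append(NextSceneTask(current, prompt, layout))
        current = prompt
    return tasks


forward = Layout([(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)], "move forward")
tasks = chain_tasks("a kitchen", [("a hallway", forward), ("a garden", forward)])
```

Representing the layout as an explicit field, rather than folding it into the prompt, mirrors the benchmark's point that camera control is part of the task specification and can therefore be scored.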

2. Representational substrates

A central question in world-modeling generation is what constitutes the “world state.” One line of work keeps the problem in learned visual space but changes the forecast target. VFMF formulates world modeling as stochastic prediction of vision foundation model features, $p(\mathbf{f}_{T+1}\mid \mathbf{f}_{1:T})$, instead of RGB pixels, and decodes generated latents into semantic segmentation, depth, surface normals, and RGB (Boduljak et al., 12 Dec 2025). Its key claim is that deterministic regression in feature space collapses multimodal futures into conditional means, whereas autoregressive flow matching in a learned VAE latent space produces sharper and more accurate future predictions. WoG makes a related move for action generation by compressing future observations into a compact condition space using a Q-Former with $N=16$ learnable query tokens and default dimensionality $D=32$, then training the VLA to predict those control-relevant conditions from current observations alone (Su et al., 25 Feb 2026).
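The collapse-to-the-mean argument can be illustrated with a toy example. The sketch below is not VFMF's model: it assumes a one-dimensional "feature" whose future is either -1 or +1 with equal probability, and compares the MSE-optimal deterministic prediction with draws from a stand-in stochastic predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bimodal future: the scene evolves toward -1 or +1 with equal probability.
futures = rng.choice([-1.0, 1.0], size=10_000)

# Deterministic regression: the constant that minimizes mean squared error
# is the sample mean, which lies between the two modes.
mse_optimal = futures.mean()

# Stand-in for a generative sampler: draw from the empirical distribution.
samples = rng.choice(futures, size=5)


def distance_to_nearest_mode(x):
    """Distance from x to the closer of the two plausible futures (-1 or +1)."""
    return np.minimum(np.abs(x - 1.0), np.abs(x + 1.0))
```

The regression output sits near 0, roughly one unit away from either plausible future, while every sampled prediction coincides with a mode. In image or feature space the same effect shows up as blur.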

Another line makes geometric memory explicit. The Xiaomi EV World Model separates reconstruction and generation into WorldRec and WorldGen, then integrates them into the Joint World Model (JWM), where WorldRec supplies a compact 4D Gaussian scene representation and WorldGen conditions on rendered RGB priors from ego-projected scene tokens (Zhou et al., 18 May 2026). TeleWorld similarly converts generated content into an explicit 4D spatio-temporal representation and feeds rendered views of that representation back into subsequent generation, turning world memory into a persistent, closed-loop component rather than a hidden transient latent (Chen et al., 31 Dec 2025).

A third line adopts symbolic or hybrid state representations. Web World Models decompose state as

$S_t = (S_t^{\phi}, S_t^{\psi}),$

where deterministic code updates the physics layer and an LLM populates the imagination layer conditioned on that code-defined state (Feng et al., 29 Dec 2025). Agent2World defines a symbolic world model as $WM=(P_{\text{env}},A_{\text{env}},T_{\text{env}})$ and synthesizes either PDDL domains or executable Python environments from natural-language descriptions, with correctness grounded in executable behavior rather than surface form (Hu et al., 26 Dec 2025). DEVS-Gen extends this executable view to discrete-event systems by modeling a world as $\mathcal{W}=(\mathcal{E},\mathcal{S},\Omega,\mathcal{P},\delta)$ and validating generated simulators through structured event traces (Chen et al., 4 Mar 2026).
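The two-layer decomposition $S_t = (S_t^{\phi}, S_t^{\psi})$ can be sketched as follows. This is a minimal illustration under stated assumptions, not the Web World Models implementation: the grid-position state, the action names, and `stub_imagine` (a placeholder where an LLM call would go) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class WorldState:
    physics: Dict[str, int]                                    # S^phi: code-defined, deterministic
    imagination: Dict[str, str] = field(default_factory=dict)  # S^psi: generated content


def step_physics(physics: Dict[str, int], action: str) -> Dict[str, int]:
    """Deterministic update: the same state and action always give the same result."""
    new = dict(physics)  # copy so the previous state is left untouched
    if action == "move_east":
        new["x"] += 1
    elif action == "move_west":
        new["x"] -= 1
    return new


def step(state: WorldState, action: str,
         imagine: Callable[[Dict[str, int]], str]) -> WorldState:
    physics = step_physics(state.physics, action)
    # The imagination layer is filled in only after the physics state is fixed,
    # so generated text cannot alter the code-defined state.
    return WorldState(physics, {"description": imagine(physics)})


def stub_imagine(physics: Dict[str, int]) -> str:
    """Hypothetical stand-in for an LLM call conditioned on the physics state."""
    return f"You stand at grid cell x={physics['x']}."


s0 = WorldState({"x": 0})
s1 = step(s0, "move_east", stub_imagine)
```

Because the generator only reads the physics layer, replacing `stub_imagine` with a real model changes descriptions but not transitions, which is the sense in which deterministic control is retained.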

Taken together, these systems suggest that world-modeling generation is not tied to a single substrate. The “world” may be a latent visual field, a compact future-condition space, a 4D Gaussian memory, typed web interfaces, a symbolic transition system, or an executable discrete-event simulator. What unifies them is persistence, intervention sensitivity, and evaluability.

3. Generative mechanisms and training regimes

Most visual world-modeling systems are built on diffusion, rectified flow, or flow matching, but their distinctive feature is not the sampler itself; it is the way world constraints are embedded into training and inference. SimWorld exemplifies simulator-conditioned generation. It uses PMWorld, a high-fidelity mining simulator, to generate semantic segmentation, depth, detection boxes, and 3D point clouds, then conditions a diffusion model implemented with ControlNet on top of Stable Diffusion and Stable Diffusion XL to render photorealistic real-world-style images (Li et al., 18 Mar 2025). The simulator specifies “what” and “where,” while the world model specifies appearance. To bias learning toward driving-critical objects, SimWorld introduces DynamicForegroundWeightLoss, which applies a cosine-scheduled spatial weight over bounding-box regions so foreground vehicles are emphasized early in training and later balanced against background fidelity.
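A cosine-scheduled foreground weighting of this kind could look like the sketch below. It is an illustration of the idea, not SimWorld's DynamicForegroundWeightLoss: the weight range (`w_max`, `w_min`), the use of a plain MSE, and the normalization by the sum of weights are assumptions.

```python
import math
import numpy as np


def foreground_weight(step, total_steps, w_max=3.0, w_min=1.0):
    """Cosine decay from w_max at step 0 to w_min at total_steps (values assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_min + 0.5 * (w_max - w_min) * (1.0 + math.cos(math.pi * progress))


def box_mask(height, width, boxes):
    """Boolean mask that is True inside any (x0, y0, x1, y1) bounding box."""
    mask = np.zeros((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    return mask


def weighted_mse(pred, target, boxes, step, total_steps):
    """Per-pixel squared error, up-weighted inside bounding boxes early in training."""
    height, width = pred.shape
    mask = box_mask(height, width, boxes)
    weights = np.where(mask, foreground_weight(step, total_steps), 1.0)
    return float(np.sum(weights * (pred - target) ** 2) / np.sum(weights))


# Toy case: all of the error lies inside one 2x2 foreground box of a 4x4 image.
pred = np.zeros((4, 4))
target = np.zeros((4, 4))
target[0:2, 0:2] = 1.0
early = weighted_mse(pred, target, [(0, 0, 2, 2)], step=0, total_steps=100)
late = weighted_mse(pred, target, [(0, 0, 2, 2)], step=100, total_steps=100)
```

With identical predictions, the foreground error contributes more to the loss at step 0 than at the end of the schedule, when foreground and background pixels are weighted equally.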

DreamWorld modifies the generator more deeply by jointly predicting video and world knowledge. It constructs a composite world-knowledge tensor with temporal, semantic, and spatial priors derived respectively from optical flow, DINOv2, and VGGT, and concatenates this tensor with the video latent so that the model learns a joint velocity field over appearance and world features (Tan et al., 28 Feb 2026). Because naive optimization over heterogeneous world objectives introduces gradient conflict, DreamWorld uses Consistent Constraint Annealing to decay constraint strength during training and Multi-Source Inner-Guidance to enforce learned world priors at inference.

WorldGen, the generation component of the Xiaomi EV World Model, uses a more explicit curriculum. It first performs bidirectional pretraining under a rectified-flow objective, then converts the model to causal generation through three stages: Teacher Forcing, ODE distillation, and Distribution Matching Distillation (DMD) (Zhou et al., 18 May 2026). The stated result is online causal video generation in as few as 4 denoising steps, down from a 50-step causal generator, for roughly 12× inference acceleration. TeleWorld addresses long-horizon failure through Macro-from-Micro Planning (MMPL), which jointly predicts sparse anchor frames at the segment level, uses those anchors to populate intermediate frames, and combines this planning hierarchy with DMD for real-time synthesis (Chen et al., 31 Dec 2025).
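For orientation, the rectified-flow objective in its standard form is reproduced below. This is the generic formulation, with $x_0$ a noise sample, $x_1$ a data sample, and $v_\theta$ the learned velocity field; the conditioning and notation used by WorldGen itself may differ.

$x_t = (1 - t)\,x_0 + t\,x_1, \qquad \mathcal{L}_{\mathrm{RF}}(\theta) = \mathbb{E}_{x_0,\,x_1,\,t \sim \mathcal{U}[0,1]} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|_2^2$

The reported speedup is consistent with simple step counting: if the per-step cost is unchanged, going from 50 to 4 denoising steps gives $50 / 4 = 12.5$, in line with the stated figure of roughly 12×.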

These methods indicate a common design principle: world-modeling generation is rarely achieved by scaling a generic video generator alone. It typically requires structured conditioning, causal conversion, retrieval-aware or scaffolded memory, or explicit joint prediction of auxiliary world variables.

4. Shared-world, interactive, and action-conditioned generation

A major expansion of the field is the move from single-view synthesis to shared-world generation, in which multiple agents or cameras must observe the same underlying environment. ShareVerse defines this problem as multi-agent shared world modeling and requires generated videos from different agents to remain mutually compatible in overlapping and non-overlapping regions alike (Zhu et al., 3 Mar 2026). It constructs a CARLA-based dataset of 55,000 video pairs, equips each agent with front, rear, left, and right cameras for 360° coverage, spatially concatenates the four views of each agent, injects raymap embeddings, and inserts cross-agent attention blocks immediately after each raymap encoder. The framework supports 49-frame generation at 480×720 and reports PSNR 20.76, SSIM 0.6656, and LPIPS 0.2791.

IC-World treats shared world modeling as parallel in-context generation. Given a set of input images representing the same world at the same time from different poses, it pixel-wise concatenates them into a single grid image, generates one grid video, and then decouples that video into one output video per input view (Wu et al., 1 Dec 2025). To enforce consistency beyond zero-shot in-context behavior, it uses Group Relative Policy Optimization with two reward models: a geometry consistency reward based on Pi3, Lepard, and symmetric Chamfer distance, and a motion consistency reward based on SpatialTrackerV2. On its shared-world benchmark, IC-World reports the best geometry and motion consistency in both evaluated settings and the lowest generation time per video, 17.08s.
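The grid coupling and one ingredient of the geometry reward can be sketched in a few lines of NumPy. This is an illustration under assumptions, not IC-World's code: the row-major grid layout, the `(T, H, W, C)` array convention, and the particular Chamfer variant (unsquared distances, mean over points, summed over both directions) are choices made here for concreteness.

```python
import numpy as np


def tile_views(views, cols):
    """Combine N videos of shape (T, H, W, C) into one grid video (T, rows*H, cols*W, C)."""
    if len(views) % cols != 0:
        raise ValueError("number of views must be a multiple of cols")
    rows = len(views) // cols
    strips = [np.concatenate(views[r * cols:(r + 1) * cols], axis=2) for r in range(rows)]
    return np.concatenate(strips, axis=1)


def untile_views(grid, rows, cols):
    """Split a grid video back into per-view videos, in row-major order."""
    _, grid_h, grid_w, _ = grid.shape
    h, w = grid_h // rows, grid_w // cols
    return [grid[:, r * h:(r + 1) * h, c * w:(c + 1) * w, :]
            for r in range(rows) for c in range(cols)]


def symmetric_chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (n, 3) and b (m, 3)."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean() + dists.min(axis=0).mean())


views = [np.full((2, 3, 4, 3), float(i)) for i in range(4)]  # four toy 2-frame videos
grid = tile_views(views, cols=2)
recovered = untile_views(grid, rows=2, cols=2)

points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = points + np.array([0.0, 0.0, 1.0])
```

Tiling is lossless, but at a fixed generator resolution each view occupies only a fraction of the grid, so per-view resolution shrinks as more views are packed in.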

Interactive control exposes the memory problem particularly clearly. “Learning World Models for Interactive Video Generation” shows that naive autoregressive video generation suffers from effectively irreducible compounding error, and that longer context windows or simple inference-time retrieval are insufficient because current video models have limited in-context learning capability (Chen et al., 28 May 2025). Its Video Retrieval Augmented Generation (VRAG) instead retrieves history by explicit global state similarity, trains the model to use retrieved frames during denoising, and masks loss on retrieved frames so history functions as support rather than prediction target. On a 300-frame Minecraft world-coherence benchmark, VRAG reaches SSIM 0.506, PSNR 17.097, and LPIPS 0.506; on a 1200-frame compounding-error benchmark it reaches SSIM 0.349.
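The two mechanisms described for VRAG, retrieval by global state similarity and loss masking on retrieved frames, can be sketched as follows. This is a simplified illustration, not the paper's implementation: representing the global state as a small vector (for example position), using Euclidean distance, and using a plain per-frame MSE in place of the denoising loss are all assumptions.

```python
import numpy as np


def retrieve_indices(history_states, query_state, k):
    """Indices of the k history frames whose global state is closest to the query."""
    dists = np.linalg.norm(history_states - query_state, axis=1)
    return np.argsort(dists, kind="stable")[:k]


def masked_frame_loss(pred, target, is_retrieved):
    """Mean squared error over frames, skipping retrieved (context-only) frames."""
    per_frame = ((pred - target) ** 2).reshape(len(pred), -1).mean(axis=1)
    return float(per_frame[~is_retrieved].mean())


# Four past frames tagged with a 2-D state; the query is near frames 0 and 2.
history_states = np.array([[0.0, 0.0], [10.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
nearest = retrieve_indices(history_states, np.array([0.5, 0.0]), k=2)

# Frame 0 is a retrieved frame with a large error that must not affect the loss.
pred = np.zeros((3, 2, 2))
target = np.ones((3, 2, 2))
target[0] = 100.0
loss = masked_frame_loss(pred, target, np.array([True, False, False]))
```

Masking keeps retrieved frames in the input sequence, where the model can attend to them, while excluding them from the training signal, so they serve as memory rather than as prediction targets.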

Action generation provides a more compressed but still world-model-based formulation. WoG argues that future prediction should occur in condition space rather than raw visual space: future observations are compressed into compact condition tokens, injected into the action head during Stage I, and then predicted by the VLM itself in Stage II so action inference can proceed from current observation only (Su et al., 25 Feb 2026). PanoWorld pushes world-modeling generation to spherical observation space. It treats panoramic video as geometry- and dynamics-consistent latent state modeling, introduces latitude-aware positional encoding and spherical-area-weighted losses, and adds auxiliary depth and trajectory consistency losses; in Stage 1, it reports FVD 56.1, FAED 83.4, FID 136.4, 3D-Smooth 0.025, Depth- $\mathcal{C}$ 5 0.013, and Tr-Life 0.994 (Jiang et al., 14 May 2026).
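A spherical-area-weighted loss can be sketched for equirectangular panoramas, where rows near the poles cover less solid angle than rows near the equator. The cosine-of-latitude weighting below is the common choice for this projection; it is an assumption here, and PanoWorld's exact weighting and loss terms may differ.

```python
import numpy as np


def latitude_weights(height):
    """Per-row weights proportional to cos(latitude), normalized to mean 1."""
    row_centers = np.arange(height) + 0.5
    latitude = np.pi / 2 - row_centers * np.pi / height  # +pi/2 at top, -pi/2 at bottom
    weights = np.cos(latitude)
    return weights / weights.mean()


def area_weighted_mse(pred, target):
    """MSE over an equirectangular array (H, W) or (H, W, C), weighted by row area."""
    height = pred.shape[0]
    per_row = ((pred - target) ** 2).reshape(height, -1).mean(axis=1)
    return float((latitude_weights(height) * per_row).mean())


weights = latitude_weights(8)

# The same unit error placed near a pole versus near the equator.
pred = np.zeros((8, 16))
pole_target = np.zeros((8, 16))
pole_target[0] = 1.0
equator_target = np.zeros((8, 16))
equator_target[4] = 1.0
pole_loss = area_weighted_mse(pred, pole_target)
equator_loss = area_weighted_mse(pred, equator_target)
```

An unweighted loss would treat the two errors identically, even though the polar row is heavily oversampled in this projection and corresponds to a much smaller region of the sphere.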

5. Evaluation methodology and benchmark design

Evaluation has become a defining concern because appearance-centric metrics systematically overestimate progress. The field’s own roadmap argues that current evaluations emphasize perceptual quality while missing structural, temporal, and causal failures, and calls for benchmarks that behave more like compilers or theorem provers than simple image-similarity scoreboards (Wu et al., 30 Apr 2026). Three benchmark families are especially central.

Benchmark	Scope	Main dimensions
WorldModelBench	350 prompt instances across 7 application-driven domains; supports T2V and I2V	Instruction following, physics adherence, commonsense / general video quality
WorldScore	3,000 test examples, with 2,000 static and 1,000 dynamic world-generation examples	Controllability, quality, dynamics
4DWorldBench	Image-to-3D/4D, Video-to-4D, Text-to-3D/4D	Perceptual Quality, Condition-4D Alignment, Physical Realism, 4D Consistency

WorldModelBench is centered on application-driven video generation as world modeling. It crowdsources 67K human labels from 65 voters, evaluates 14 frontier models, and fine-tunes a 2B-parameter VLM judger that achieves 8.6% higher average accuracy than GPT-4o in predicting world-modeling violations (Li et al., 28 Feb 2025). Its results are explicitly skeptical: even top models remain weak world models, and the paper notes 12% mass conservation violations and 11% penetration violations for Kling. It also reports that VBench correlates reasonably with frame-wise quality at 0.69, but poorly with WorldModelBench physics adherence at 0.28, showing that generic video-quality metrics do not capture world realism.

WorldScore standardizes diverse generation paradigms into a common video-output setting and evaluates 19 representative models using camera controllability, object controllability, content alignment, 3D consistency, photometric consistency, style consistency, subjective quality, motion accuracy, motion magnitude, and motion smoothness (Duan et al., 1 Apr 2025). Its headline finding is structural: 3D scene generation models dominate static world generation, while video models remain stronger on dynamics but weaker on controllability and long-range consistency. The reported WorldScore-Static for WonderWorld is 72.69, compared with 62.15 for CogVideoX-I2V.

4DWorldBench generalizes evaluation further by mapping all modality conditions into a unified textual space during evaluation and combining LLM-as-judge, MLLM-as-judge, and traditional network-based methods (Lu et al., 25 Nov 2025). It evaluates Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency across Image-to-3D/4D, Video-to-4D, and Text-to-3D/4D tasks. Preliminary human studies reported in the benchmark indicate that adaptive tool selection yields closer agreement with subjective human judgments than simpler fixed alternatives.

6. Application domains, constraints, and broader implications

Autonomous driving is a major application domain for world-modeling generation. SimWorld uses PMWorld to collect PMScenes, conditions a diffusion model on simulator-generated labels, and reports AutoMine-style FID 33.96 and $\mathcal{C}$ 6 29.08, versus PMScenes at FID 73.45 and $\mathcal{C}$ 7 25.79; the generated data also improves downstream perception across object detection and semantic segmentation models (Li et al., 18 Mar 2025). The Xiaomi EV World Model positions JWM as a foundation for closed-loop simulation, synthetic rare-event generation, and end-to-end training, and reports generation speeds of 0.19 s/frame for single-view and 0.46 s/frame for three views on H20 GPUs (Zhou et al., 18 May 2026). U4D extends the same ambition to LiDAR by modeling uncertainty first and then completing the full scene; on nuScenes it reports FRD 223.96, FPD 12.90, JSD 0.03, and MMD 0.53, while also improving downstream segmentation calibration (Xu et al., 2 Dec 2025).

At larger spatial scales, EarthGen shows that world-modeling generation can also mean multi-scale geographic synthesis. It factorizes top-down imagery into a coarse-to-fine diffusion cascade and tiled composition scheme, reports General / Urban FID 66.40 / 91.04 and KID 0.0210 / 0.0533 on an extreme $\mathcal{C}$ 8 super-resolution benchmark, and demonstrates a 12 gigapixel interactive map spanning 30 km × 10 km at 15 cm/pixel (Sharma et al., 2024). LatticeWorld translates textual instructions and optional visual terrain cues into interactive UE5 worlds through symbolic layout generation and configuration generation, and reports an increase in industrial production efficiency of over $90\times$, from 55 days of manual production to less than 0.6 days (Duan et al., 5 Sep 2025).

The field also includes non-visual or hybrid notions of world generation. Web World Models use ordinary web code as “physics” and LLMs as imagination, enabling persistent travel, encyclopedic, fictional, and game-like worlds without surrendering deterministic control (Feng et al., 29 Dec 2025). Agent2World frames symbolic world-model generation as interactive, test-driven synthesis and reports an average relative gain of 30.95% after supervised fine-tuning on verified multi-turn trajectories (Hu et al., 26 Dec 2025). DEVS-Gen synthesizes executable discrete-event simulators directly from natural-language specifications and evaluates them through operational success and behavioral conformance of emitted event traces (Chen et al., 4 Mar 2026).
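Validation through event traces can be illustrated with a generic discrete-event loop. The sketch below is not the DEVS formalism and not code produced by DEVS-Gen: the priority-queue scheduler, the handler interface, and the toy arrive/serve/depart scenario are assumptions chosen to show what an emitted trace and a conformance check might look like.

```python
import heapq


def simulate(initial_events, handlers, until):
    """Run a discrete-event loop and return the ordered trace of (time, event) pairs.

    initial_events: list of (time, sequence_number, name) tuples.
    handlers: maps an event name to a function(state) -> list of (delay, next_name).
    """
    queue = list(initial_events)
    heapq.heapify(queue)
    next_seq = len(queue)  # tie-breaker so simultaneous events keep insertion order
    state, trace = {}, []
    while queue:
        time, _, name = heapq.heappop(queue)
        if time > until:
            break
        trace.append((time, name))
        handler = handlers.get(name)
        for delay, next_name in (handler(state) if handler else []):
            heapq.heappush(queue, (time + delay, next_seq, next_name))
            next_seq += 1
    return trace


def on_arrive(state):
    state["waiting"] = state.get("waiting", 0) + 1
    return [(2.0, "serve")]   # service starts 2 time units after arrival


def on_serve(state):
    state["waiting"] -= 1
    return [(0.5, "depart")]  # departure follows 0.5 time units later


handlers = {"arrive": on_arrive, "serve": on_serve}
trace = simulate([(0.0, 0, "arrive"), (1.0, 1, "arrive")], handlers, until=10.0)
```

A behavioral conformance check then reduces to predicates over the trace, for example that timestamps never decrease and that every arrival is eventually matched by a departure.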

Several recurrent limitations delimit the current state of the field. SimWorld notes that building a high-fidelity simulator aligned with real sites is expensive and domain-specific, and that larger models such as SimWorld XL do not automatically dominate smaller ones in every metric (Li et al., 18 Mar 2025). IC-World observes that pixel-wise coupling reduces effective per-view resolution and that performance degrades as the number of input views grows (Wu et al., 1 Dec 2025). TeleWorld depends on substantial infrastructure, including large GPU clusters, context parallelism, and pipelined DMD scheduling, and still reports around one second of latency due to buffered latent chunks (Chen et al., 31 Dec 2025). U4D states that performance degrades beyond about 10 frames because iterative errors accumulate (Xu et al., 2 Dec 2025). These constraints suggest that progress in world-modeling generation depends jointly on richer state representations, more efficient causal rollout, and evaluation procedures that can verify long-horizon spatial, temporal, and physical consistency rather than visual plausibility alone.