Visual World Models Overview

Updated 16 June 2026

Visual World Models are internal, learnable representations that simulate and reconstruct environments using visual modalities for physically grounded reasoning.
They encompass implicit, verbal, and visual modeling approaches, each trading off abstraction, detail fidelity, and sample efficiency in tasks like spatial reasoning and forecasting.
VWMs support diverse applications such as robotics, embodied AI, and GUI simulation while guiding research in robust planning, multi-modal integration, and compositional generalization.

Visual World Models (VWMs) are internal, learnable representations capable of simulating, forecasting, and reconstructing environments using visual modalities. These models support physically grounded reasoning, robust planning, and policy learning in both machine agents and cognitive systems. VWMs operate across a spectrum from pixel-level prediction to advanced compositional abstraction, and they underpin advances in robotics, embodied AI, vision–language modeling, and high-level multimodal reasoning.

1. Formal Frameworks and Core Capabilities

A general VWM is formalized as a Markov decision process (MDP) extended with view-dependent visual observations:

$\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, \Phi, \mathcal{O}_\phi, e_\phi)$

$\mathcal{S}$ : (latent) state space.
$\mathcal{A}$ : action space (e.g., agent's control commands, physical manipulations).
$p(s'|s,a)$ : transition kernel for the underlying process.
$\Phi$ : set of view or observation parameters (camera pose, modality).
$\mathcal{O}_\phi$ : output space under view $\phi$.
$e_\phi(s)$ : observation mapping from state to view.

Two atomic VWM capabilities are formalized:

World Reconstruction (novel-view inference):

$p_\theta(o_{\phi_{n+1}} \mid o_{\phi_1}, \ldots, o_{\phi_n}) \approx \delta(o_{\phi_{n+1}} - e_{\phi_{n+1}}(s))$

World Simulation (action-conditioned prediction):

$p_\theta(o_{t+1} \mid o_{\leq t}, a_{\leq t}) \approx \delta(o_{t+1} - e_\phi(s_{t+1})), \quad s_{t+1} \sim p(s_{t+1}|s_t, a_t)$

These define VWMs as models able to infer unobserved visual states and forecast the consequences of hypothetical actions in diverse embodied or cognitive tasks (Wu et al., 27 Jan 2026).
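The two capabilities can be illustrated on a toy world. The sketch below is a minimal illustration under stated assumptions, not the formulation of any cited paper: the 3x3 grid, the deterministic `transition`, the two views `"top"` and `"side"`, and all function names are hypothetical stand-ins for $p$, $\Phi$, and $e_\phi$.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]  # toy latent state: agent position (row, col)
SIZE = 3                 # hypothetical grid size

MOVES: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1),
}


def transition(s: State, a: str) -> State:
    """Deterministic stand-in for p(s' | s, a): move one cell, clipped to the grid."""
    dr, dc = MOVES[a]
    return (min(max(s[0] + dr, 0), SIZE - 1), min(max(s[1] + dc, 0), SIZE - 1))


def observe(s: State, view: str):
    """Stand-in for e_phi(s) under two hypothetical views."""
    if view == "top":   # full top-down binary image
        return [[1 if (r, c) == s else 0 for c in range(SIZE)] for r in range(SIZE)]
    if view == "side":  # 1-D projection onto the column axis
        return [1 if c == s[1] else 0 for c in range(SIZE)]
    raise ValueError(f"unknown view: {view}")


def simulate(s0: State, actions: List[str], view: str):
    """World simulation: roll the state forward and render each step."""
    s, frames = s0, []
    for a in actions:
        s = transition(s, a)
        frames.append(observe(s, view))
    return s, frames


def reconstruct_side_from_top(top: List[List[int]]) -> List[int]:
    """World reconstruction: infer the state from one view, then render another."""
    for r, row in enumerate(top):
        for c, pixel in enumerate(row):
            if pixel == 1:
                return observe((r, c), "side")
    raise ValueError("no agent found in the top view")


final_state, frames = simulate((0, 0), ["right", "down"], "top")
side_view = reconstruct_side_from_top(frames[-1])
```

A learned VWM replaces the hand-written `transition` and `observe` with a generative model $p_\theta$; the toy version only makes the roles of the two capabilities explicit.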

2. Taxonomy: Model Types and Representational Trade-offs

VWM methodologies span a spectrum reflecting how internal state and external representations are handled:

Implicit World Modeling: No explicit observable output; world state exists in latent neural activations. Reasoning is performed directly in hidden space.
Verbal World Modeling: Internal state is rendered as symbolic or textual structures (coordinates, grids, captions), supporting certain forms of reasoning but bottlenecked in spatial detail.
Visual World Modeling: Internal state is made explicit by generated images reflecting the latent state, maximizing spatial fidelity and supporting tasks requiring perceptually grounded manipulation.

Visual world modeling is empirically found to confer advantages in physical/spatial reasoning, notably in tasks requiring complex geometry, structure-from-view synthesis, or physics-based forecasting, whereas implicit or verbal models suffice for low-dimensional combinatorial or purely verbal logic tasks (Wu et al., 27 Jan 2026, Kim et al., 2023).

3. Algorithms and Implementation Paradigms

Chain-of-Thought (CoT) with Multimodal World Modeling

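Multimodal CoT is formalized as a sequence: $R = \big((r_1, o_1), \ldots, (r_H, o_H)\big)$ with $r_i$ (reasoning step, typically textual) and $o_i$ (observation: image or text). The generative model factorizes as: $p_\theta(R \mid Q) = \prod_{i=1}^{H} p_\theta(r_i \mid c_i)\, p_\theta(o_i \mid c_i')$ where $c_i = (Q, r_{<i}, o_{<i})$ and $c_i' = (Q, r_{\leq i}, o_{<i})$ collect contextual history, with $Q$ denoting the task input.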
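Read as a computation, this kind of factorization turns the likelihood of an interleaved chain into a sum of per-step log-probabilities, each conditioned on the history so far. The sketch below assumes observations are represented as strings and that the two conditionals are supplied as plain callables; `chain_log_prob`, `p_reason`, and `p_observe` are hypothetical names, and the constant toy probabilities are placeholders for a real model.

```python
import math
from typing import Callable, Sequence, Tuple

Step = Tuple[str, str]  # (reasoning step, observation); both are strings in this toy
Cond = Callable[[str, Tuple[str, ...]], float]  # probability of a token given history


def chain_log_prob(question: str, chain: Sequence[Step],
                   p_reason: Cond, p_observe: Cond) -> float:
    """Sum log-probabilities of alternating reasoning and observation steps."""
    history = [question]
    total = 0.0
    for reasoning, observation in chain:
        # Reasoning step is conditioned on the question and all earlier steps.
        total += math.log(p_reason(reasoning, tuple(history)))
        history.append(reasoning)
        # Observation is additionally conditioned on the current reasoning step.
        total += math.log(p_observe(observation, tuple(history)))
        history.append(observation)
    return total


def toy_p_reason(step: str, history: Tuple[str, ...]) -> float:
    return 0.5   # placeholder probability, not a trained model


def toy_p_observe(obs: str, history: Tuple[str, ...]) -> float:
    return 0.25  # placeholder probability, not a trained model


toy_chain = [("fold left edge", "<image 1>"), ("fold top edge", "<image 2>")]
toy_log_prob = chain_log_prob("How many layers?", toy_chain, toy_p_reason, toy_p_observe)
```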

In practical unified multimodal models (UMMs), the visual-generation head (VGen) is typically trained with diffusion or flow-matching objectives, and visual reasoning steps are interleaved with text (Wu et al., 27 Jan 2026).

Memory and Dynamics in VWMs

Both explicit (frame buffer, 3D reconstructions) and implicit (compressed token, recurrent state) schemes are actively studied. For long-horizon consistency and geometric generalization, geometry-aware implicit memories combine a transformer compressing past history with a geometry distillation head, ensuring queryable, view-consistent representations for robust video simulation (Wei et al., 1 Jun 2026).
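The difference between explicit and implicit memory can be sketched with two toy classes. This is only an interface-level illustration under simplifying assumptions: frames are short lists of floats, the explicit memory is a fixed-capacity buffer, and the implicit memory is an exponential moving average standing in for a learned compressor. It does not model the transformer or geometry distillation head described above, and both class names are hypothetical.

```python
from collections import deque
from typing import List


class FrameBufferMemory:
    """Explicit memory: keeps the most recent frames verbatim."""

    def __init__(self, capacity: int) -> None:
        self.frames = deque(maxlen=capacity)  # oldest frames are evicted first

    def write(self, frame: List[float]) -> None:
        self.frames.append(list(frame))

    def read(self) -> List[List[float]]:
        return [list(f) for f in self.frames]


class RecurrentMemory:
    """Implicit memory: folds every frame into one fixed-size state vector."""

    def __init__(self, dim: int, decay: float = 0.5) -> None:
        self.state = [0.0] * dim
        self.decay = decay

    def write(self, frame: List[float]) -> None:
        # Exponential moving average as a stand-in for a learned update rule.
        self.state = [self.decay * s + (1.0 - self.decay) * x
                      for s, x in zip(self.state, frame)]

    def read(self) -> List[float]:
        return list(self.state)


explicit = FrameBufferMemory(capacity=2)
implicit = RecurrentMemory(dim=2)
for frame in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    explicit.write(frame)
    implicit.write(frame)
```

The buffer grows with its capacity and forgets abruptly; the recurrent state has constant size and forgets gradually, which is the trade-off the two families of schemes negotiate.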

Feature-Based and Code-Based VWMs

Models such as RLA-WM forecast compact visual features (e.g., DINO token differences) instead of pixels, which is reported to improve sample efficiency and generalization (Zhang et al., 8 May 2026). Code-based VWMs for domain-specific scenarios such as mobile GUIs render the next state as executable, structured code, which keeps rendered text and layout exact (Koh et al., 2 Feb 2026).
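The idea of forecasting feature differences rather than pixels can be sketched as follows. This is a deliberately simple stand-in, not the RLA-WM architecture: features are short float lists in place of DINO tokens, the "model" is just the mean observed delta per action, and `fit_delta_model` and `predict_next` are hypothetical names.

```python
from typing import Dict, List, Sequence, Tuple

Transition = Tuple[List[float], str, List[float]]  # (features_t, action, features_t+1)


def fit_delta_model(transitions: Sequence[Transition]) -> Dict[str, List[float]]:
    """Estimate the mean feature difference caused by each action."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for f_t, action, f_next in transitions:
        delta = [b - a for a, b in zip(f_t, f_next)]
        if action not in sums:
            sums[action] = [0.0] * len(delta)
            counts[action] = 0
        sums[action] = [s + d for s, d in zip(sums[action], delta)]
        counts[action] += 1
    return {a: [s / counts[a] for s in sums[a]] for a in sums}


def predict_next(model: Dict[str, List[float]], f_t: List[float], action: str) -> List[float]:
    """Forecast next features as current features plus the predicted delta."""
    return [x + d for x, d in zip(f_t, model[action])]


toy_transitions = [
    ([0.0, 0.0], "push", [1.0, 0.0]),
    ([1.0, 1.0], "push", [2.0, 1.0]),
    ([0.0, 0.0], "lift", [0.0, 2.0]),
]
delta_model = fit_delta_model(toy_transitions)
forecast = predict_next(delta_model, [3.0, 3.0], "push")
```

Predicting the delta means the model only has to explain what an action changes, which is one intuition for why feature-space forecasting can need less data than pixel prediction.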

4. Empirical Evaluation: Benchmarks and Findings

Structured Evaluations

Categories:

World Simulation: spatial/physics prediction (folding, manipulation, tracking, grid tasks).
World Reconstruction: novel view or occluded state inference.

Metrics:

Answer accuracy.
Sample efficiency (data required for a given accuracy).
World-model fidelity (e.g., IoU of generated vs. ground-truth images).
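Two of these metrics are simple enough to state in code. The sketch below assumes binary images given as nested lists and a learning curve given as (sample count, accuracy) pairs; the helper names and the toy numbers are illustrative and are not taken from any benchmark.

```python
from typing import List, Optional, Sequence, Tuple


def iou(pred: List[List[int]], target: List[List[int]]) -> float:
    """Intersection over union of two binary images of equal shape."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match by convention here.
    return inter / union if union else 1.0


def samples_to_reach(curve: Sequence[Tuple[int, float]], threshold: float) -> Optional[int]:
    """Smallest training-set size whose accuracy meets the threshold, if any."""
    for num_samples, accuracy in sorted(curve):
        if accuracy >= threshold:
            return num_samples
    return None


generated = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
fidelity = iou(generated, ground_truth)  # 1 shared pixel out of 3 active pixels

toy_curve = [(100, 0.20), (400, 0.45), (800, 0.55)]  # made-up learning curve
needed = samples_to_reach(toy_curve, 0.5)
```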

Task domains (navigation, recognition, robotic manipulation) are assessed by task success under online planning with VWM rollouts, moving beyond pure visual quality.
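Evaluating by task success under online planning can be sketched as a receding-horizon loop: at each step the agent imagines every short action sequence inside the world model, executes only the first action of the best one, and replans. The example assumes a small discrete action set and a known reward function; `plan`, `run_episode`, and the 1-D toy task are hypothetical, and the toy uses a perfect model so the loop is easy to trace.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def plan(model_step: Callable, reward: Callable, state, actions: Sequence, horizon: int) -> Tuple:
    """Exhaustively roll out every action sequence in the model and keep the best."""
    best_seq, best_return = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = model_step(s, a)   # imagined transition, no environment access
            total += reward(s)
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq


def run_episode(env_step: Callable, model_step: Callable, reward: Callable,
                is_goal: Callable, state, actions: Sequence,
                horizon: int = 2, max_steps: int = 10) -> bool:
    """Return True if the goal is reached within max_steps environment steps."""
    for _ in range(max_steps):
        if is_goal(state):
            return True
        first_action = plan(model_step, reward, state, actions, horizon)[0]
        state = env_step(state, first_action)  # only the first action is executed
    return is_goal(state)


# Toy task: walk along the integer line from 0 to the goal position 3.
def toy_step(s: int, a: int) -> int:
    return s + a


def toy_reward(s: int) -> float:
    return -abs(s - 3)


def toy_is_goal(s: int) -> bool:
    return s == 3


success = run_episode(toy_step, toy_step, toy_reward, toy_is_goal, 0, (-1, 1))
```

Passing a learned, imperfect `model_step` while keeping the true `env_step` is what makes task success a test of the world model rather than of image quality.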

Compositional generalization in VWMs is assessed by holding out combinations of object factors during training; models are then scored by LPIPS, MSE, and the ID–OOD generalization gap.
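A held-out-combination protocol and the resulting generalization gap can be sketched as below. The factor names (shape, color), the helper functions, and the error values are hypothetical; LPIPS is omitted because it requires a pretrained perceptual network, so only MSE is shown.

```python
from itertools import product
from typing import List, Sequence, Set, Tuple

Combo = Tuple[str, str]  # (shape, color) in this toy example


def split_by_combination(shapes: Sequence[str], colors: Sequence[str],
                         held_out: Set[Combo]) -> Tuple[List[Combo], List[Combo]]:
    """Train on all factor combinations except the held-out ones."""
    combos = list(product(shapes, colors))
    train = [c for c in combos if c not in held_out]
    test_ood = [c for c in combos if c in held_out]
    return train, test_ood


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two flattened images."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def generalization_gap(id_errors: Sequence[float], ood_errors: Sequence[float]) -> float:
    """Mean OOD error minus mean ID error; larger means worse generalization."""
    return sum(ood_errors) / len(ood_errors) - sum(id_errors) / len(id_errors)


train_combos, ood_combos = split_by_combination(
    ["cube", "ball"], ["red", "blue"], held_out={("cube", "blue")}
)
gap = generalization_gap(id_errors=[0.25, 0.25], ood_errors=[0.5, 1.0])  # made-up errors
```

Every individual factor value (cube, blue) still appears in training; only their combination is unseen, which is what separates compositional generalization from ordinary out-of-distribution testing.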

Test suites for vision–language models (VLMs) provide atomic evaluation along perception (color, spatial, temporal, motion, quantity) and prediction (simulation, transitive, compositional) dimensions, including counterfactual and disentanglement tasks.

Notable Results

Task	Implicit WM	Verbal WM	Visual WM
Paper folding	18.4	23.1	52.7
Multi-hop manipulation	45.2	–	75.4
Ball tracking	29.6	–	55.3
Cube 3-view projection	24.8	26.8	52.7
Real-world spatial questions	31.2	31.8	41.3
Maze	73.9	71.5	72.1
Sokoban	99.3	98.7	99.0

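Visual WMs far outperform the other approaches on physically grounded tasks that require spatial or physical detail (e.g., 52.7 vs. 23.1 for verbal WMs on paper folding), whereas implicit and verbal models suffice on grid navigation and logic tasks such as Maze and Sokoban, where all three approaches score within about 2.5 points of one another (Wu et al., 27 Jan 2026).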

Visual WMs are also more sample-efficient: they reach 50% accuracy on paper folding with only 500 samples, whereas verbal WMs need more than 2,000, a more than fourfold difference (Wu et al., 27 Jan 2026). Visual world-model fidelity exceeds 50% (image IoU), versus only 5% for string-based verbal models.

5. Limitations, Open Challenges, and Theoretical Insights

While VWMs show marked gains in physically grounded reasoning, key limitations persist:

Visual-generation heads can exhibit imperfect fidelity (blurring, shape errors) (Wu et al., 27 Jan 2026).
Existing approaches often lack full 3D or geometric awareness, leading to drift or incoherent long-horizon rollouts (Wei et al., 1 Jun 2026).
World modeling for more abstract STEM domains, diagrams, and rich symbolic reasoning remains in its infancy.
Compositional generalization, especially under realistic augmentation and out-of-distribution (OOD) factors, is unsolved; object-centric architectures narrow the gap but do not close it (Kim et al., 2023).
VLMs, despite strong static perception, are deficient in causal prediction, dynamic simulation, and factor disentanglement, a weakness attributed to their failure to learn physically grounded transition priors or robust geometric representations (Gao et al., 27 Jun 2025).

Theoretical results, such as task-centric identifiability, show that careful projection and alignment yield compact latents recovering true world states up to smooth/affine maps (Fu et al., 25 May 2026). Efficient bisimulation losses prune out task-irrelevant visual factors, improving robustness under distractors (Toso et al., 20 Feb 2026).
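The intuition behind bisimulation-style objectives can be shown with a generic loss: two latent states should be close exactly when they yield similar rewards and lead to similar next latents, so factors that affect neither are squeezed out. The sketch below is a textbook-style formulation assuming deterministic predicted next latents and an L1 latent distance; it is not the specific efficient loss of Toso et al., and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def l1(u: Sequence[float], v: Sequence[float]) -> float:
    """L1 distance between two latent vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def bisimulation_loss(z: List[List[float]], rewards: List[float],
                      z_next: List[List[float]], gamma: float = 0.9) -> float:
    """Mean squared gap between latent distances and bisimulation targets.

    Requires at least two samples. In a trained model the target would
    normally be treated as a constant (no gradient flows through it).
    """
    pairs = list(combinations(range(len(z)), 2))
    total = 0.0
    for i, j in pairs:
        target = abs(rewards[i] - rewards[j]) + gamma * l1(z_next[i], z_next[j])
        total += (l1(z[i], z[j]) - target) ** 2
    return total / len(pairs)


# Latents that differ exactly as much as their rewards differ: no penalty.
consistent = bisimulation_loss([[0.0], [1.0]], [0.0, 1.0], [[0.0], [0.0]])

# Latents that differ although reward and future are identical (a pure
# distractor dimension): the loss pushes them together.
distracted = bisimulation_loss([[0.0], [1.0]], [0.0, 0.0], [[0.0], [0.0]])
```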

6. Applications, Future Directions, and Implications

VWMs drive advances in:

Data-efficient robot policy learning via synthetic rollouts ("video dreaming") (Jang et al., 19 May 2025, Kim et al., 2023).
Robust multi-task and fleet learning, leveraging VWMs for anomaly prediction and reduced human supervision (Liu et al., 2024).
GUI and software agent simulation, via renderable-code VWMs (Koh et al., 2 Feb 2026).
Hybrid high-level decision making (VLWM, WorldVLM) fusing LLM-based reasoning with visual forecasting (Englmeier et al., 15 Mar 2026, Chen et al., 2 Sep 2025).

Emerging Directions:

Incorporation of geometry-aware or physics-grounded latent states (Wei et al., 1 Jun 2026).
Modular combination of visual, language, and symbolic reasoning (Chen et al., 2 Sep 2025, Englmeier et al., 15 Mar 2026).
Robustness to distractors via implicit-action or bisimulation mechanisms (Wang et al., 2024, Toso et al., 20 Feb 2026).
Multi-step, open-ended imagination and systematic compositionality (Kim et al., 2023).
Adaptive curriculum and counterfactual learning grounded in cognitive science (Gao et al., 27 Jun 2025).

Advances in VWMs are rapidly broadening their applicability, with direct implications for embodied AI, autonomous agents, simulation-based planning, and interpretable high-level reasoning.