
Visual World Models Overview

Updated 16 June 2026
  • Visual World Models are internal, learnable representations that simulate and reconstruct environments using visual modalities for physically grounded reasoning.
  • They encompass implicit, verbal, and visual modeling approaches, each trading off abstraction, detail fidelity, and sample efficiency in tasks like spatial reasoning and forecasting.
  • VWMs support diverse applications such as robotics, embodied AI, and GUI simulation while guiding research in robust planning, multi-modal integration, and compositional generalization.

Visual World Models (VWMs) are internal, learnable representations capable of simulating, forecasting, and reconstructing environments using visual modalities. These models support physically grounded reasoning, robust planning, and policy learning in both machine agents and cognitive systems. VWMs operate across a spectrum from pixel-level prediction to advanced compositional abstraction, and they underpin advances in robotics, embodied AI, vision–language modeling, and high-level multimodal reasoning.

1. Formal Frameworks and Core Capabilities

A general VWM builds on the Markov decision process (MDP) formalism, extended with view-dependent visual observations:

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, \Phi, \mathcal{O}_\phi, e_\phi)$$

  • $\mathcal{S}$: (latent) state space.
  • $\mathcal{A}$: action space (e.g., agent's control commands, physical manipulations).
  • $p(s' \mid s, a)$: transition kernel for the underlying process.
  • $\Phi$: set of view or observation parameters (camera pose, modality).
  • $\mathcal{O}_\phi$: output space under view $\phi$.
  • $e_\phi(s)$: observation mapping from state to view.
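
To make the tuple concrete, the following is a minimal sketch under stated assumptions rather than an implementation from any cited paper. The class name `VisualMDP`, the 3×3 grid world, and the "top" and "side" views are all hypothetical, chosen only so that each component of $\mathcal{M}$ has a visible counterpart.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]          # toy latent state: (x, y) cell of a single object
Action = str                     # toy action label
View = str                       # toy view parameter phi
Observation = Tuple[int, ...]    # flattened binary "image"


@dataclass(frozen=True)
class VisualMDP:
    """Hypothetical container mirroring M = (S, A, p, Phi, O_phi, e_phi)."""
    states: List[State]                                        # S
    actions: List[Action]                                      # A
    transition: Callable[[State, Action], Dict[State, float]]  # p(s' | s, a)
    views: List[View]                                          # Phi
    observe: Callable[[State, View], Observation]              # e_phi(s) in O_phi


SIZE = 3
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}


def transition(state: State, action: Action) -> Dict[State, float]:
    """Deterministic move, clipped at the grid border (an assumption of this toy)."""
    dx, dy = MOVES[action]
    x = min(max(state[0] + dx, 0), SIZE - 1)
    y = min(max(state[1] + dy, 0), SIZE - 1)
    return {(x, y): 1.0}


def observe(state: State, view: View) -> Observation:
    """Render the state under a view: full top-down grid or a 1-D side projection."""
    x, y = state
    if view == "top":
        return tuple(int((cx, cy) == (x, y)) for cy in range(SIZE) for cx in range(SIZE))
    if view == "side":
        return tuple(int(cx == x) for cx in range(SIZE))
    raise ValueError(f"unknown view: {view}")


toy = VisualMDP(
    states=[(x, y) for x in range(SIZE) for y in range(SIZE)],
    actions=list(MOVES),
    transition=transition,
    views=["top", "side"],
    observe=observe,
)
```

Here the "side" view discards the y coordinate, which illustrates why a single observation can leave the latent state ambiguous.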

Two atomic VWM capabilities are formalized:

  1. World Reconstruction (novel-view inference):

$$p_\theta(o_{\phi_{n+1}} \mid o_{\phi_1}, \ldots, o_{\phi_n}) \approx \delta(o_{\phi_{n+1}} - e_{\phi_{n+1}}(s))$$

  2. World Simulation (action-conditioned prediction):

$$p_\theta(o_{t+1} \mid o_{\leq t}, a_{\leq t}) \approx \delta(o_{t+1} - e_\phi(s_{t+1})), \quad s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$$

These define VWMs as models able to infer unobserved visual states and forecast the consequences of hypothetical actions in diverse embodied or cognitive tasks (Wu et al., 27 Jan 2026).
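
The two capabilities can be illustrated with a toy example. This is a sketch under strong assumptions: a hypothetical 3×3 grid with one object, deterministic dynamics, and exact enumeration in place of a learned $p_\theta$. The functions `render`, `step`, `reconstruct`, and `simulate` are invented for illustration.

```python
SIZE = 3
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}
STATES = [(x, y) for x in range(SIZE) for y in range(SIZE)]


def render(state, view):
    """Toy e_phi(s): project the object's cell under a named view."""
    x, y = state
    if view == "front":   # sees only the x coordinate
        return tuple(int(c == x) for c in range(SIZE))
    if view == "side":    # sees only the y coordinate
        return tuple(int(c == y) for c in range(SIZE))
    if view == "top":     # sees the full grid
        return tuple(int((cx, cy) == (x, y)) for cy in range(SIZE) for cx in range(SIZE))
    raise ValueError(f"unknown view: {view}")


def step(state, action):
    """Toy deterministic transition, clipped at the border."""
    dx, dy = MOVES[action]
    return (min(max(state[0] + dx, 0), SIZE - 1),
            min(max(state[1] + dy, 0), SIZE - 1))


def reconstruct(observed, target_view):
    """World reconstruction: given {view: observation}, return every target-view
    observation consistent with all of them (a singleton when s is identified)."""
    consistent = [s for s in STATES
                  if all(render(s, v) == o for v, o in observed.items())]
    return {render(s, target_view) for s in consistent}


def simulate(first_obs, actions, view="top"):
    """World simulation: roll the state forward and render each predicted frame.
    Assumes the first observation identifies the state uniquely (true for 'top')."""
    state = next(s for s in STATES if render(s, view) == first_obs)
    frames = []
    for action in actions:
        state = step(state, action)
        frames.append(render(state, view))
    return frames
```

With both the "front" and "side" views observed, `reconstruct` returns a single "top" image, which is the delta-like behaviour in the reconstruction formula. With only one of them, it returns three candidates.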

2. Taxonomy: Model Types and Representational Trade-offs

VWM methodologies span a spectrum reflecting how internal state and external representations are handled:

  • Implicit World Modeling: No explicit observable output; world state exists in latent neural activations. Reasoning is performed directly in hidden space.
  • Verbal World Modeling: Internal state is rendered as symbolic or textual structures (coordinates, grids, captions), supporting certain forms of reasoning but bottlenecked in spatial detail.
  • Visual World Modeling: Internal state is made explicit by generated images reflecting the latent state, maximizing spatial fidelity and supporting tasks requiring perceptually grounded manipulation.
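
The difference between the verbal and visual forms can be sketched on a toy scene. The scene, the function names, and the character-grid "image" are all hypothetical. A real visual world model would generate pixels, not characters.

```python
def verbal_render(objects):
    """Verbal world model output: the state as a coordinate string."""
    return "; ".join(f"{name} at ({x}, {y})"
                     for name, (x, y) in sorted(objects.items()))


def visual_render(objects, size):
    """Visual world model output (toy stand-in): the state as a 2-D grid,
    one character per cell, marking each object by its initial letter."""
    grid = [["." for _ in range(size)] for _ in range(size)]
    for name, (x, y) in objects.items():
        grid[y][x] = name[0].upper()
    return ["".join(row) for row in grid]


scene = {"box": (2, 0), "agent": (0, 1)}
verbal = verbal_render(scene)       # "agent at (0, 1); box at (2, 0)"
visual = visual_render(scene, 3)    # ["..B", "A..", "..."]
```

In the grid form, spatial relations such as adjacency can be read off directly, whereas the string form requires parsing and arithmetic on coordinates.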

Visual world modeling is empirically found to confer advantages in physical/spatial reasoning, notably in tasks requiring complex geometry, structure-from-view synthesis, or physics-based forecasting, whereas implicit or verbal models suffice for low-dimensional combinatorial or purely verbal logic tasks (Wu et al., 27 Jan 2026, Kim et al., 2023).

3. Algorithms and Implementation Paradigms

Chain-of-Thought (CoT) with Multimodal World Modeling

Multimodal CoT is formalized as a sequence of steps, each pairing a reasoning step (typically textual) with an observation (image or text). The generative model factorizes autoregressively over this sequence: every reasoning step and every observation is conditioned on the contextual history of the steps and observations that precede it.

In practical UMMs (Unified Multimodal Models), the visual-generation head (VGen) is typically trained with diffusion or flow-matching objectives, and visual reasoning steps are interleaved with text (Wu et al., 27 Jan 2026).
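
As a rough illustration of an autoregressive factorization over interleaved steps, the sketch below scores a trace by summing per-step conditional log-probabilities. It is an assumption-laden toy. The ordering (reasoning first, then observation), the function `trace_log_prob`, and the constant-probability stubs are hypothetical and stand in for a real UMM's text and visual-generation heads.

```python
import math


def trace_log_prob(question, steps, p_reason, p_obs):
    """Joint log-probability of an interleaved trace.

    steps    : list of (reasoning, observation) pairs
    p_reason : callable(reasoning, question, history) -> probability
    p_obs    : callable(observation, reasoning, question, history) -> probability
    """
    history = []
    total = 0.0
    for reasoning, observation in steps:
        # The reasoning step is conditioned on everything generated so far.
        total += math.log(p_reason(reasoning, question, history))
        # The observation is additionally conditioned on the current reasoning step.
        total += math.log(p_obs(observation, reasoning, question, history))
        history.append((reasoning, observation))
    return total


# Hypothetical stubs: a real model would return learned probabilities.
def stub_p_reason(reasoning, question, history):
    return 0.5


def stub_p_obs(observation, reasoning, question, history):
    return 0.25


trace = [("fold the sheet in half", "<image 1>"),
         ("punch a hole near the crease", "<image 2>")]
score = trace_log_prob("How many holes after unfolding?", trace,
                       stub_p_reason, stub_p_obs)
```

Each step contributes two terms, so the two-step trace above accumulates four log-probabilities.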

Memory and Dynamics in VWMs

Both explicit (frame buffer, 3D reconstructions) and implicit (compressed token, recurrent state) schemes are actively studied. For long-horizon consistency and geometric generalization, geometry-aware implicit memories combine a transformer compressing past history with a geometry distillation head, ensuring queryable, view-consistent representations for robust video simulation (Wei et al., 1 Jun 2026).
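
The explicit/implicit contrast can be sketched with two toy memory classes. Both are hypothetical. The exponential moving average is only a stand-in for a learned compressor, and nothing here models the geometry distillation head described above.

```python
from collections import deque


class FrameBufferMemory:
    """Explicit memory: keeps the most recent frames verbatim."""

    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)  # oldest frames are dropped first

    def write(self, frame):
        self.frames.append(list(frame))

    def read(self):
        return list(self.frames)


class RecurrentMemory:
    """Implicit memory: folds every frame into one fixed-size state vector."""

    def __init__(self, dim, decay=0.9):
        self.state = [0.0] * dim
        self.decay = decay

    def write(self, frame):
        # Exponential moving average as a toy recurrent update.
        self.state = [self.decay * s + (1.0 - self.decay) * f
                      for s, f in zip(self.state, frame)]

    def read(self):
        return list(self.state)


explicit = FrameBufferMemory(capacity=3)
implicit = RecurrentMemory(dim=2)
for t in range(5):
    frame = [float(t), float(t) * 2.0]
    explicit.write(frame)
    implicit.write(frame)
```

The buffer's storage grows with its capacity and forgets abruptly, while the recurrent state has constant size and forgets gradually. This is the trade-off that motivates hybrid or learned schemes.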

Feature-Based and Code-Based VWMs

Models such as RLA-WM forecast compact visual features (e.g., DINO token differences) instead of pixels, dramatically improving sample efficiency and generalizability (Zhang et al., 8 May 2026). Code-based VWMs for domain-specific scenarios (mobile GUIs) render the next state as executable, structured code for perfect text and layout fidelity (Koh et al., 2 Feb 2026).
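
The idea of forecasting feature differences instead of pixels can be sketched with a deliberately simple tabular model. This is not RLA-WM. The two-dimensional feature vectors stand in for encoder tokens, and `fit_delta_model` and `predict_next` are hypothetical names.

```python
from collections import defaultdict


def fit_delta_model(transitions):
    """Estimate the mean feature difference caused by each action.

    transitions: iterable of (features_t, action, features_t_plus_1)
    """
    sums = {}
    counts = defaultdict(int)
    for f_t, action, f_next in transitions:
        delta = [n - c for c, n in zip(f_t, f_next)]
        if action not in sums:
            sums[action] = [0.0] * len(delta)
        sums[action] = [s + d for s, d in zip(sums[action], delta)]
        counts[action] += 1
    return {a: [s / counts[a] for s in sums[a]] for a in sums}


def predict_next(features, action, model):
    """Forecast next-step features as current features plus the learned delta."""
    return [f + d for f, d in zip(features, model[action])]


# Made-up transitions in a 2-D feature space.
data = [
    ([0.0, 0.0], "push", [1.0, 0.0]),
    ([1.0, 2.0], "push", [2.0, 2.0]),
    ([0.0, 0.0], "lift", [0.0, 0.5]),
]
model = fit_delta_model(data)
forecast = predict_next([3.0, 3.0], "push", model)   # [4.0, 3.0]
```

Because the model only has to explain what changes, a handful of transitions is enough in this toy case. The efficiency reported for feature-based VWMs rests on a similar intuition, though with learned networks in place of the lookup table.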

4. Empirical Evaluation: Benchmarks and Findings

Structured Evaluations

Categories:

  • World Simulation: spatial/physics prediction (folding, manipulation, tracking, grid tasks).
  • World Reconstruction: novel view or occluded state inference.

Metrics:

  • Answer accuracy.
  • Sample efficiency (data required for a given accuracy).
  • World-model fidelity (e.g., IoU of generated vs. ground-truth images).
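
The three metrics can be computed as follows. This is a minimal sketch. The function names are invented, images are reduced to flat binary masks, and the learning-curve numbers in the usage example are made up.

```python
def accuracy(predictions, answers):
    """Fraction of predicted answers that match the reference answers."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def mask_iou(predicted, truth):
    """Intersection-over-union of two flat binary masks of equal length."""
    inter = sum(1 for p, t in zip(predicted, truth) if p and t)
    union = sum(1 for p, t in zip(predicted, truth) if p or t)
    return inter / union if union else 1.0   # two empty masks count as identical


def samples_to_reach(curve, target):
    """Smallest training-set size whose accuracy meets the target, or None.

    curve: iterable of (num_samples, accuracy) points
    """
    for num_samples, acc in sorted(curve):
        if acc >= target:
            return num_samples
    return None


toy_curve = [(100, 0.30), (400, 0.55), (1600, 0.70)]   # invented numbers
needed = samples_to_reach(toy_curve, 0.5)               # 400
```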

Task domains (navigation, recognition, robotic manipulation) are assessed by task success under online planning with VWM rollouts, moving beyond pure visual quality.

Compositional generalization in VWMs is assessed by holding out combinations of object factors during training; models are then scored by LPIPS, MSE, and the ID–OOD generalization gap.
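
A held-out-combination protocol of this kind can be sketched as below. The factor values, function names, and error numbers are hypothetical, and the gap is taken here to be mean OOD error minus mean ID error, which is one common convention and not necessarily the cited paper's definition.

```python
from itertools import product


def compositional_split(shapes, colors, held_out):
    """Split all (shape, color) combinations into training and OOD test sets."""
    combos = list(product(shapes, colors))
    train = [c for c in combos if c not in held_out]
    test_ood = [c for c in combos if c in held_out]
    return train, test_ood


def mse(predicted, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def generalization_gap(id_errors, ood_errors):
    """Mean OOD error minus mean ID error (larger means worse generalization)."""
    mean_id = sum(id_errors) / len(id_errors)
    mean_ood = sum(ood_errors) / len(ood_errors)
    return mean_ood - mean_id


train, test_ood = compositional_split(
    shapes=["cube", "ball"],
    colors=["red", "blue"],
    held_out={("cube", "blue")},
)
gap = generalization_gap(id_errors=[0.1, 0.3], ood_errors=[0.4, 0.6])
```

In this split every individual factor value ("cube", "blue") still appears in training. Only the combination is novel, which is what distinguishes compositional generalization from ordinary extrapolation.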

A test suite for vision-language models (VLMs) provides atomic evaluation along perception (color, spatial, temporal, motion, quantity) and prediction (simulation, transitive, compositional) dimensions, including counterfactual and disentanglement tasks.

Notable Results

Task                           Implicit WM   Verbal WM   Visual WM
Paper folding                  18.4          23.1        52.7
Multi-hop manipulation         45.2          –           75.4
Ball tracking                  29.6          –           55.3
Cube 3-view projection         24.8          26.8        52.7
Real-world spatial questions   31.2          31.8        41.3
Maze                           73.9          71.5        72.1
Sokoban                        99.3          98.7        99.0
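
The size of the gap between visual and implicit world modeling can be read off the table with simple arithmetic. The short sketch below copies the numbers from the table. The helper name `visual_gain` is invented, and missing verbal entries are stored as `None`.

```python
# (implicit, verbal, visual) scores copied from the table above.
RESULTS = {
    "Paper folding": (18.4, 23.1, 52.7),
    "Multi-hop manipulation": (45.2, None, 75.4),
    "Ball tracking": (29.6, None, 55.3),
    "Cube 3-view projection": (24.8, 26.8, 52.7),
    "Real-world spatial questions": (31.2, 31.8, 41.3),
    "Maze": (73.9, 71.5, 72.1),
    "Sokoban": (99.3, 98.7, 99.0),
}


def visual_gain(task):
    """Visual WM score minus implicit WM score, rounded to one decimal."""
    implicit, _verbal, visual = RESULTS[task]
    return round(visual - implicit, 1)


gains = {task: visual_gain(task) for task in RESULTS}
# Tasks where visual world modeling helps by more than 10 points.
large_gain_tasks = [task for task, g in gains.items() if g > 10.0]
```

The gains range from 10.1 to 34.3 points on the five spatial and physical tasks, and are slightly negative on Maze (−1.8) and Sokoban (−0.3).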

Visual WMs far outperform other approaches in physically grounded tasks requiring spatial or physical detail; implicit or verbal models suffice on grid navigation and logic (Wu et al., 27 Jan 2026).

Visual WMs are also more sample-efficient: they need only 500 samples to reach 50% accuracy on paper folding, versus more than 2,000 for verbal WMs (Wu et al., 27 Jan 2026). Visual world-model fidelity exceeds 50% image IoU, compared with only 5% for string-based verbal models.

5. Limitations, Open Challenges, and Theoretical Insights

While VWMs show marked gains in physically grounded reasoning, key limitations persist:

  • Visual-generation heads can exhibit imperfect fidelity (blurring, shape errors) (Wu et al., 27 Jan 2026).
  • Existing approaches often lack full 3D or geometric awareness, leading to drift or incoherent long-horizon rollouts (Wei et al., 1 Jun 2026).
  • World modeling for more abstract STEM domains, diagrams, and rich symbolic reasoning remains in its infancy.
  • Compositional generalization, especially under realistic augmentation and out-of-distribution (OOD) factors, is unsolved: object-centric architectures narrow the gap but do not close it (Kim et al., 2023).
  • VLMs, despite strong static perception, are deficient in causal prediction, dynamic simulation, and factor disentanglement, a weakness compounded by their failure to learn physically grounded transition priors or robust geometric representations (Gao et al., 27 Jun 2025).

Theoretical results, such as task-centric identifiability, show that careful projection and alignment yield compact latents recovering true world states up to smooth/affine maps (Fu et al., 25 May 2026). Efficient bisimulation losses prune out task-irrelevant visual factors, improving robustness under distractors (Toso et al., 20 Feb 2026).
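
To show how a bisimulation-style objective can discourage task-irrelevant latent dimensions, here is a generic pairwise loss. It is a sketch under assumptions and not the efficient loss of the cited work: next latents are treated as deterministic, distances are L1, and `bisim_pair_loss` is a hypothetical name.

```python
def l1(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def bisim_pair_loss(z_i, z_j, r_i, r_j, next_i, next_j, gamma=0.9):
    """Squared mismatch between latent distance and a bisimulation target.

    The target treats two states as close when they yield similar rewards
    and similar (discounted) next latents, regardless of appearance.
    """
    target = abs(r_i - r_j) + gamma * l1(next_i, next_j)
    return (l1(z_i, z_j) - target) ** 2


# Two states that behave identically but differ in a distractor dimension.
loss_with_distractor = bisim_pair_loss(
    z_i=[1.0, 5.0], z_j=[1.0, -5.0],
    r_i=1.0, r_j=1.0,
    next_i=[2.0, 0.0], next_j=[2.0, 0.0],
)
# The same pair after the encoder has dropped the distractor.
loss_without_distractor = bisim_pair_loss(
    z_i=[1.0, 0.0], z_j=[1.0, 0.0],
    r_i=1.0, r_j=1.0,
    next_i=[2.0, 0.0], next_j=[2.0, 0.0],
)
```

The first pair is penalized because the latents differ although rewards and successors agree. Minimizing the loss therefore pushes the encoder to ignore the second dimension.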

6. Applications, Future Directions, and Implications

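VWMs drive advances in robotics, embodied AI, and GUI simulation.

Emerging directions include robust planning, multimodal integration, and compositional generalization.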

Advances in VWMs are rapidly broadening their applicability, with direct implications for embodied AI, autonomous agents, simulation-based planning, and interpretable high-level reasoning.
