LLM-Based World Models
- LLM-based world models are computational frameworks that use pretrained language models to simulate state transitions and reason with compositional, high-level world knowledge.
- They integrate direct transition modeling, symbolic rule learning, and retrieval augmentation to enhance planning, decision making, and generative tasks across various domains.
- These models improve efficiency and adaptability in robotics, web navigation, and scientific reasoning while addressing challenges like long-horizon consistency and interpretability.
An LLM-based world model is a computational system in which an LLM—potentially augmented by auxiliary modules—serves as the internal simulator or predictor of environment dynamics for autonomous decision making, planning, or generative tasks. LLM-based world models have emerged as a unifying paradigm across domains including robotics, web navigation, embodied agents, scientific reasoning, and open-ended strategy games. Unlike classic world models built from symbolic logic or purely neural black boxes, LLM-based models leverage the compositional, general knowledge of pretrained LLMs to simulate state transitions (including observations and rewards), reason about partial observability, and encode high-level world knowledge.
1. Conceptual Foundations and Formalization
The LLM-based world model paradigm generalizes the classical Markov decision process (MDP) abstraction. Let $(\mathcal{S}, \mathcal{A}, T, R)$ denote the state space, action space, transition dynamics, and reward function. In an LLM-based world model, the transition function $T$ is parameterized by an LLM, which may operate over discrete symbolic states, natural language descriptions, or multimodal inputs (e.g., image tokens or accessibility trees) (Li et al., 21 Dec 2025, Zhao et al., 11 Mar 2024, Chae et al., 17 Oct 2024). For partially observable settings (POMDPs), the belief state $b_t$ is updated via LLM-based inference over the observation and action history (Hu et al., 2023, Light et al., 24 Nov 2024).
The world model takes as input a state-action context (often expressed in natural language or as structured data), and outputs a distribution over, or a direct prediction of, the next state and (optionally) reward: $(\hat{s}_{t+1}, \hat{r}_t) \sim p_\theta(\cdot \mid s_t, a_t, h_t)$, where $h_t$ denotes the interaction history. In "implicit" text-based world models, the prediction is next-state text; in generative video models, the output is high-dimensional visual data (Zhao et al., 11 Mar 2024, Li et al., 21 Dec 2025).
In many architectures, the LLM-based world model is further composed with behavioral components (a minimal sketch follows the list):
- Action proposal: Given a current context, propose the top-$k$ plausible next actions (Yang et al., 13 Nov 2024, Li et al., 21 Dec 2025).
- Reward modeling: Predict expected reward, success/failure, or value functions for candidate action sequences (Chen et al., 11 Jan 2025, Chuang et al., 28 Jan 2025).
- Planning: Enable Monte Carlo Tree Search (MCTS), model predictive control (MPC), or rollouts via repeated LLM-based simulation (Hu et al., 2023, Zhao et al., 11 Mar 2024, Tang et al., 19 Feb 2024, Zhou et al., 22 Apr 2025).
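The following Python sketch shows how these components can compose into a planning primitive. It is purely illustrative: the `llm_complete` interface, prompt wording, and reward scale are assumptions, not the API of any cited system.

```python
# Illustrative composition of the three components above; all names and
# prompt formats are expository assumptions.
def llm_complete(prompt: str) -> str:
    """Placeholder for a call to any chat/completion endpoint."""
    raise NotImplementedError

def propose_actions(state: str, history: list[str], k: int = 3) -> list[str]:
    # Action proposal: ask the LLM for k plausible next actions.
    prompt = (f"History: {history}\nState: {state}\n"
              f"List {k} plausible next actions, one per line.")
    return llm_complete(prompt).strip().splitlines()[:k]

def simulate_step(state: str, action: str, history: list[str]) -> tuple[str, float]:
    # Transition + reward modeling: predict next-state text and a scalar score.
    next_state = llm_complete(
        f"History: {history}\nState: {state}\nAction: {action}\nPredict the next state.")
    reward = float(llm_complete(
        f"State: {next_state}\nRate progress toward the goal as a number in [0, 1]."))
    return next_state, reward

def rollout(state: str, depth: int) -> float:
    # Planning primitive: greedy imagined rollout via repeated LLM simulation,
    # usable as the expansion/evaluation step inside MCTS or MPC.
    history, total = [], 0.0
    for _ in range(depth):
        action = propose_actions(state, history, k=1)[0]
        state, reward = simulate_step(state, action, history)
        history.append(action)
        total += reward
    return total
```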
2. Core Methodological Approaches
2.1 Direct Transition Modeling by LLMs
The prototypical approach is to frame world simulation as conditional language modeling. A context is compiled—often as a system prompt and dialogue—including symbolic observations, prior actions, and potentially summary hypotheses or rules, and the LLM is prompted for the next state and/or reward. This approach underpins agents in text games, gridworlds, web environments, and scientific Q&A (Li et al., 21 Dec 2025, Yang et al., 13 Nov 2024, Guo et al., 5 Nov 2025).
Fine-tuning with supervised next-state prediction objectives on experience trajectories strongly enhances step-wise fidelity and long-term trajectory consistency (Li et al., 21 Dec 2025). In structured domains, accuracy improves with both data and model scale before saturating.
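A minimal sketch of this objective using standard Hugging Face causal-LM fine-tuning, with the loss masked to the next-state tokens; the GPT-2 checkpoint and example strings are placeholders, not the setup of any cited paper.

```python
# Supervised next-state prediction: cross-entropy on the next-state tokens
# only, with the (state, action) context masked out of the loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_state_loss(context: str, next_state: str) -> torch.Tensor:
    prompt_ids = tok(context, return_tensors="pt").input_ids
    target_ids = tok(next_state, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # -100 = ignored by the CE loss
    return model(input_ids=input_ids, labels=labels).loss

loss = next_state_loss(
    "State: door locked. Action: use brass key on door. Next state:",
    " door unlocked.",
)
loss.backward()  # one step of supervised next-state fine-tuning
```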
2.2 World Alignment via Symbolic Rule Learning
The LLM's prior may be misaligned with precise environment dynamics. "World alignment" augments the LLM with a neurosymbolic knowledge layer extracted automatically by comparing LLM predictions to observed outcomes, generating deterministic rules, knowledge graphs, and scene graphs as code. These code modules 'override' the LLM's predictions where applicable, producing a composite model (Zhou et al., 9 Oct 2024, Zhou et al., 22 Apr 2025).
A typical aligned world model thus has the form
$$\hat{s}_{t+1} = \begin{cases} f_{\text{rule}}(s_t, a_t) & \text{if a learned symbolic rule covers } (s_t, a_t), \\ \text{LLM}_\theta(s_t, a_t) & \text{otherwise.} \end{cases}$$
Pruning and selection of rules are performed via maximum-coverage combinatorial optimization. This hybridization produces large gains in sample efficiency, planning stability, and token cost, particularly for embodied agents operating in partially observed, open-world settings (Zhou et al., 9 Oct 2024, Zhou et al., 22 Apr 2025).
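A compact sketch of the composite predictor and a greedy maximum-coverage rule selector. The `Rule` callable format and transition triples are assumptions for exposition; the cited systems implement rules as generated code modules.

```python
# Rule-override composite model + greedy maximum-coverage rule selection.
from typing import Callable, Optional

Rule = Callable[[str, str], Optional[str]]  # returns next state, or None if inapplicable

def aligned_predict(state: str, action: str, rules: list[Rule], llm_predict) -> str:
    for rule in rules:
        out = rule(state, action)
        if out is not None:            # a learned rule covers (s, a): override the LLM
            return out
    return llm_predict(state, action)  # otherwise fall back to the LLM prior

def select_rules(candidates: list[Rule],
                 transitions: list[tuple[str, str, str]],
                 budget: int) -> list[Rule]:
    # Greedy maximum coverage: repeatedly pick the rule that correctly
    # explains the most not-yet-covered observed transitions (standard
    # 1 - 1/e approximation to the combinatorial optimum).
    covered: set[int] = set()
    chosen: list[Rule] = []
    for _ in range(budget):
        best, gain = None, set()
        for r in candidates:
            new = {i for i, (s, a, s2) in enumerate(transitions)
                   if i not in covered and r(s, a) == s2}
            if len(new) > len(gain):
                best, gain = r, new
        if best is None:
            break
        chosen.append(best)
        covered |= gain
    return chosen
```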
2.3 Incorporating External Knowledge via Retrieval
A critical limitation of LLM-only models is hallucination and distributional drift over long rollouts, primarily due to static pretraining. Retrieval-augmented world models (R-WoM) (Mei et al., 13 Oct 2025), and more generally retrieval-augmented generation (RAG) architectures (Guo et al., 5 Nov 2025), inject up-to-date, relevant knowledge (e.g., software tutorials for digital environments, or peer-reviewed literature for scientific reasoning) into the LLM's context at each simulation step. This constrains world model predictions to remain close to factual environment dynamics, particularly across multi-step procedures.
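A minimal sketch of a retrieval-grounded simulation step in this spirit; the toy lexical retriever and prompt format are stand-ins for the cited systems' actual retrievers.

```python
# Retrieval-augmented simulation step: retrieved documents are injected into
# the context so the predicted next state stays anchored to external facts.
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    # Toy lexical retriever: rank documents by word overlap with the query.
    q = set(query.lower().split())
    ranked = sorted(corpus.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return [text for _, text in ranked[:k]]

def grounded_simulate(state: str, action: str, corpus: dict[str, str], llm_complete) -> str:
    docs = retrieve(f"{state} {action}", corpus)
    prompt = ("Reference material:\n" + "\n---\n".join(docs) +
              f"\n\nState: {state}\nAction: {action}\n"
              "Predict the next state, staying consistent with the references.")
    return llm_complete(prompt)
```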
2.4 Theory-Induced and Compositional World Models
Some frameworks, such as WorldLLM (Levy et al., 7 Jun 2025), induce explicit natural language hypotheses (theories) about transition regularities via Bayesian inference over experience, leveraging in-context LLM reasoning. These hypotheses function as an interpretable, low-dimensional world model, guiding both forward prediction and curiosity-driven exploration. Modular world models are also implemented via dynamic composition of independently trained sub-models, with prototype-based retrieval and compound attention for knowledge fusion (Yoo et al., 4 Sep 2025).
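A hedged sketch of this style of theory induction: candidate natural-language hypotheses are reweighted by how much they improve the LLM's likelihood of observed transitions. The `llm_logprob` scoring interface and the uniform prior are assumptions, not WorldLLM's exact machinery.

```python
# Bayesian reweighting of natural-language hypotheses about dynamics:
# posterior weight ∝ prior × likelihood of observed transitions under
# the LLM when the hypothesis is placed in context.
import math

def reweight_hypotheses(hypotheses: list[str],
                        transitions: list[tuple[str, str, str]],
                        llm_logprob) -> dict[str, float]:
    log_post = {}
    for h in hypotheses:
        # Log-likelihood of each observed next state given the hypothesis.
        ll = sum(llm_logprob(answer=s2,
                             prompt=f"Theory: {h}\nState: {s}\nAction: {a}\nNext state:")
                 for (s, a, s2) in transitions)
        log_post[h] = ll  # uniform prior: posterior ∝ likelihood
    z = max(log_post.values())
    weights = {h: math.exp(lp - z) for h, lp in log_post.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}
```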
3. Representative Architectures and Benchmarks
3.1 Generative Video and Embodied Agents
DriveDreamer-2 (Zhao et al., 11 Mar 2024) pioneers multi-stage LLM-based world modeling for video generation in self-driving. The pipeline:
- User queries are converted to trajectories via a Python-function-based LLM interface (fine-tuned GPT-3.5).
- Trajectories induce HDMap generation via a latent diffusion model conditioned on agent traces.
- A unified multi-view model (concatenated panorama tensor, masked generation formulation, denoising score matching loss) produces temporally and spatially coherent multi-view driving videos.
This system achieves a leading FID of 11.2 (≈30% improvement over prior work) and FVD of 55.7 (≈50% improvement), with downstream benefits for 3D detection (+3.8% mAP) and tracking (+8.3% AMOTA).
3.2 Decision Making, Planning, and Multi-Agent Collaboration
LLM-based world models have demonstrated substantial improvements in diverse text-based decision environments (Yang et al., 13 Nov 2024, Li et al., 21 Dec 2025, Light et al., 24 Nov 2024) and hard-exploration problems (Kim et al., 28 Sep 2025). Techniques include dual-scale models (global trajectory frontier + local advantage reflection), model predictive control with LLM lookahead, and explicit multi-agent POMDP decomposition.
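As one concrete pattern, model predictive control with LLM lookahead can be sketched as follows, reusing the illustrative `propose_actions`/`simulate_step` interfaces from the sketch in Section 1 (again assumptions, not a cited implementation).

```python
# MPC with LLM lookahead: imagine a short rollout for each candidate action,
# execute the best first action in the real environment, then replan.
def mpc_step(state: str, propose_actions, simulate_step, horizon: int = 3) -> str:
    best_action, best_return = None, float("-inf")
    for action in propose_actions(state, history=[], k=4):
        s, total, a = state, 0.0, action
        for t in range(horizon):                 # imagined steps, no real env calls
            s, r = simulate_step(s, a, history=[])
            total += r
            if t < horizon - 1:
                a = propose_actions(s, history=[], k=1)[0]
        if total > best_return:
            best_action, best_return = action, total
    return best_action  # act, observe, and call mpc_step again
```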
Multi-agent reasoning settings are addressed with collaborative belief worlds, which track both zeroth- and first-order beliefs in a symbolic language, updated via modular LLM prompts in a zero-shot Bayesian style (Wang et al., 26 Sep 2025). This enables consistent distributed planning and major reductions in redundant communication.
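The belief bookkeeping can be illustrated with a simple symbolic store; the fact format and update rules below are expository assumptions, whereas the cited system performs the corresponding updates via modular LLM prompts.

```python
# Zeroth-order beliefs: facts an agent holds. First-order beliefs: what the
# agent thinks each teammate believes. Communicating only the difference is
# what cuts redundant messages.
from collections import defaultdict

class BeliefWorld:
    def __init__(self):
        self.zeroth = defaultdict(set)  # agent -> {facts it believes}
        self.first = defaultdict(set)   # (agent, other) -> {facts agent thinks other believes}

    def observe(self, agent: str, fact: str, witnesses: list[str]):
        self.zeroth[agent].add(fact)
        for w in witnesses:             # co-observers are assumed to know it too
            self.first[(agent, w)].add(fact)

    def needs_telling(self, speaker: str, listener: str) -> set[str]:
        # Facts the speaker holds but does not believe the listener knows.
        return self.zeroth[speaker] - self.first[(speaker, listener)]
```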
3.3 Web and Software Environments
LLM-based world models in web navigation utilize transition-focused observation abstraction, capturing only the salient state changes with abstract natural language descriptions (Chae et al., 17 Oct 2024). Retrieval-augmented world models lead in digital workflow domains (OSWorld, WebArena), where tutorial grounding yields gains of up to +25.3% and +18.1% in procedural alignment (Mei et al., 13 Oct 2025).
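A toy sketch of transition-focused abstraction: diff consecutive observations and verbalize only the changes. The flat element dictionaries below stand in for accessibility trees and are an assumption for exposition.

```python
# Transition-focused observation abstraction: keep only what changed
# between consecutive web states and describe it in natural language.
def abstract_transition(before: dict[str, str], after: dict[str, str]) -> str:
    added = {k: v for k, v in after.items() if k not in before}
    removed = {k: v for k, v in before.items() if k not in after}
    changed = {k: (before[k], after[k])
               for k in before.keys() & after.keys() if before[k] != after[k]}
    parts = []
    parts += [f"'{v}' appeared ({k})" for k, v in added.items()]
    parts += [f"'{v}' disappeared ({k})" for k, v in removed.items()]
    parts += [f"{k} changed from '{a}' to '{b}'" for k, (a, b) in changed.items()]
    return "; ".join(parts) or "no salient change"
```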
4. Empirical Results and Performance Characteristics
Quantitative evaluation reveals that LLM-based world models can match or exceed the accuracy and sample efficiency of prior deep RL and classical symbolic approaches in a variety of domains:
- In ALFWorld, WALL-E 2.0 achieves a 98% success rate after only 4 alignment iterations, surpassing prior best (RAFA) and human baseline (Zhou et al., 22 Apr 2025).
- In hard-exploration text games, GLoW attains SOTA with 100–800× fewer interactions than RL/MCTS (Kim et al., 28 Sep 2025).
- In generative video, DriveDreamer-2 reduces FID by ∼30% and FVD by ∼50% over prior work (Zhao et al., 11 Mar 2024).
- In scientific QA with retrieval-grounded world models, expert evaluation scores for balanced perspective, factual comprehensiveness, and evidentiary support are >2× better than closed LLM baselines (Guo et al., 5 Nov 2025).
- In decision-making tasks, LLM world models' planning and verification accuracy declines steadily with horizon length, highlighting compounding error as a central limitation (e.g., GPT-4o drops from 85% at 25% of the horizon to 53% at the full horizon) (Yang et al., 13 Nov 2024, Li et al., 21 Dec 2025); a back-of-envelope illustration of this compounding follows the list.
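As an illustrative model of this compounding (an assumption for intuition, not an analysis from the cited papers), suppose each simulated step is correct independently with probability $p$. A $k$-step rollout is then fully correct with probability
$$\Pr[\text{rollout correct}] = \prod_{t=1}^{k} \Pr[\hat{s}_{t+1} = s_{t+1}] = p^{k},$$
so even a strong per-step fidelity of $p = 0.99$ gives $0.99^{15} \approx 0.86$ but $0.99^{60} \approx 0.55$, roughly mirroring the reported short- versus full-horizon gap.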
Critical bottlenecks include the need for sufficient behavioral coverage in training trajectories, model size scaling in open-ended domains, and the intrinsic instability arising from composition of action proposal, transition prediction, and planning modules.
5. Interpretability, Analysis, and Model Limitations
Analysis of LLM world-model capacity reveals both promise and fragility:
- LLMs exhibit reliable latent world-model representations sufficient for guesstimation, object volume reasoning, and coarse mechanical interpretation, as established by performance well above chance and convergence of Wisdom-of-Crowds median decoding (Chuang et al., 28 Jan 2025); a median-decoding sketch appears after this list.
- In cognitive-scientific probes, LLMs can index global spatial relations and distinguish functionally connected from jumbled mechanical systems, but fail to reason over nuanced physical connectivity or cause-effect chains, suggesting heuristic reliance rather than robust, simulation-based modeling (Robertson et al., 21 Jul 2025).
- In complex planning, long-term consistency degrades without external grounding or explicit code-induced rules (Mei et al., 13 Oct 2025, Zhou et al., 22 Apr 2025).
- Modular symbolic world models induced by LLMs (e.g., via code generation or rule learning) offer interpretability and explicit correction but require execution infrastructure and careful coverage of edge cases (Light et al., 24 Nov 2024, Tang et al., 19 Feb 2024).
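Wisdom-of-Crowds median decoding itself reduces to a simple aggregation; a minimal sketch follows, where `sample_numeric` is an assumed wrapper around temperature sampling plus numeric parsing.

```python
# Median decoding: draw many independent numeric answers and aggregate with
# the median, which is robust to occasional outlier samples.
import statistics

def median_decode(question: str, sample_numeric, n: int = 20) -> float:
    samples = [sample_numeric(question) for _ in range(n)]
    return statistics.median(samples)

# e.g., median_decode("Estimate the volume of a basketball in liters.", sampler)
```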
6. Design Patterns and Future Directions
Key design methodologies emerging across the literature include:
- Separation of symbolic (human-interpretable) and neural (LLM) modules to achieve robustness, interpretability, and efficiency (Light et al., 24 Nov 2024, Zhou et al., 22 Apr 2025).
- Retrieval-augmented and offline knowledge fusion to overcome hallucination and drift, especially over long horizons in dynamic or under-documented environments (Mei et al., 13 Oct 2025, Guo et al., 5 Nov 2025).
- Meta-level theory induction and compositional alignment for continual model refinement, with Bayesian filtering and particle-based approximations (Levy et al., 7 Jun 2025, Li et al., 21 Dec 2025).
- Multi-modal abstraction, with visual, textual, and executable code artifacts all contributing to the world model’s representational capacity (Zhao et al., 11 Mar 2024, Chen et al., 11 Jan 2025, Haijima et al., 2 Jun 2024).
- Sample-efficient, RL-free planning enabled by LLM-based optimistic code synthesis and symbolic planning (Tang et al., 19 Feb 2024), dual-scale exploration (Kim et al., 28 Sep 2025), and MPC via LLM lookahead (Zhou et al., 9 Oct 2024, Zhou et al., 22 Apr 2025).
Current limitations center on scaling to high-entropy continuous domains, multi-step visual simulation, compositional generalization, and fully end-to-end learning of planning, reasoning, and knowledge acquisition pipelines within a unified LLM-based framework.
7. Impact and Applications
LLM-based world models have demonstrably advanced the state-of-the-art in:
- Generative modeling for video and robotics, particularly in data-efficient simulation and scenario customization (Zhao et al., 11 Mar 2024, Chen et al., 11 Jan 2025).
- Complex reasoning and planning, including multi-agent collaboration with explicit theory-of-mind modeling (Wang et al., 26 Sep 2025).
- Scientific Q&A and domain-specific information retrieval, where world models equipped with retrieval and evidence tracking deliver best-in-class factual comprehension (Guo et al., 5 Nov 2025).
- Hard-exploration RL and decision making, via explicit model-based rollouts and symbolic-LLM hybrid reasoning (Kim et al., 28 Sep 2025, Yang et al., 13 Nov 2024).
A plausible implication is that further advances in alignment, retrieval, compositionality, and interpretability will expand the practical reach of LLM-based world models across high-stakes, real-world domains—including science, engineering, and autonomous agents operating in complex open worlds.