Reliability and utility conditions for LLMs as world models

Determine whether large language models can reliably serve as world models that provide simulated experience to improve learning efficiency, and identify the specific conditions under which such models meaningfully benefit agent performance.

Background

The paper targets the experience bottleneck in agentic reinforcement learning and proposes using world models to generate simulated experience. Given that LLMs are trained via next-token prediction yet encode broad world knowledge, the authors ask whether such models can act as reliable world models and under what conditions they actually benefit agents.
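The paper frames the LLM as an implicit text-based world model; while it does not prescribe an interface, the core idea can be sketched as a transition function that maps a (observation, action) pair to a predicted next observation, which an agent can roll forward to produce simulated experience. In this minimal sketch, `llm_generate`, the prompt wording, and `simulated_rollout` are all hypothetical stand-ins, not the paper's implementation:

```python
from typing import Callable

def llm_world_model_step(
    llm_generate: Callable[[str], str],  # hypothetical text-completion call
    observation: str,
    action: str,
) -> str:
    """Use an LLM as an implicit text-based world model:
    predict the next observation given the current one and an action."""
    prompt = (
        "You are simulating a text-based environment.\n"
        f"Current observation: {observation}\n"
        f"Agent action: {action}\n"
        "Next observation:"
    )
    return llm_generate(prompt).strip()

def simulated_rollout(
    llm_generate: Callable[[str], str],
    initial_observation: str,
    policy: Callable[[str], str],
    horizon: int = 5,
) -> list[tuple[str, str, str]]:
    """Generate simulated experience by rolling the world model forward."""
    obs, trajectory = initial_observation, []
    for _ in range(horizon):
        action = policy(obs)
        next_obs = llm_world_model_step(llm_generate, obs, action)
        trajectory.append((obs, action, next_obs))
        obs = next_obs
    return trajectory
```

Trajectories produced this way are only useful to the extent the model's predictions track the real environment, which is precisely the reliability question the paper poses.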

This uncertainty motivates the paper's three-level evaluation framework (fidelity/consistency, scalability/robustness, and agent utility), applied to five text-based environments to probe reliability and practical value.
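The summary names the three evaluation axes without detailing their metrics. A fidelity check of the kind the first axis implies can be sketched as comparing the world model's predicted transitions against ground-truth environment transitions; here exact string match is a deliberately crude proxy, and `world_model_step`, `env_step`, and `fidelity_score` are assumed names, not the paper's protocol:

```python
from typing import Callable, Iterable

def fidelity_score(
    world_model_step: Callable[[str, str], str],  # LLM-predicted transition
    env_step: Callable[[str, str], str],          # ground-truth transition
    transitions: Iterable[tuple[str, str]],       # (observation, action) pairs
) -> float:
    """Fraction of (observation, action) pairs for which the world model's
    predicted next observation exactly matches the real environment's."""
    matches, total = 0, 0
    for obs, action in transitions:
        matches += int(world_model_step(obs, action) == env_step(obs, action))
        total += 1
    return matches / total if total else 0.0
```

A softer similarity measure (e.g. token overlap) would likely be preferred in practice, since two textual observations can describe the same state in different words.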

References

"World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether LLMs can reliably serve this role and under what conditions they meaningfully benefit agents."

From Word to World: Can Large Language Models be Implicit Text-based World Models? (Li et al., 2512.18832, 21 Dec 2025), Abstract