Qwen-AgentWorld: Language World Models for General Agents

This presentation introduces Qwen-AgentWorld, the first family of native language world models capable of simulating agentic environments across seven diverse domains within a unified framework. We explore how explicit world modeling, the ability to predict environment state transitions, forms a necessary foundation for general agent capabilities. We then examine the three-stage training pipeline that achieves high-fidelity simulation, and demonstrate how world models enable both scalable, controllable simulation and transfer learning that improves downstream agent performance across text-based and GUI domains.

Script

Most agent research focuses on what action to take next, but what if the real bottleneck is predicting what happens when you act? Qwen-AgentWorld introduces the first native language world models trained to simulate environment responses across seven fundamentally different domains, from terminals and code editors to mobile apps and operating systems.

The core insight is architectural: every interaction, whether typing a bash command, clicking a mobile button, or calling an API, follows the same fundamental pattern of action followed by observation. This unified trajectory schema enables training a single model on over 10 million environment interactions spanning text-based terminals, GUI hierarchies, and structured tool responses.
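As a rough illustration of the action-then-observation pattern described above, the sketch below stores a trajectory as a list of steps and turns it into next-state prediction examples. The names `Step`, `Trajectory`, and `world_model_examples`, and the `ACTION:` / `OBSERVATION:` text markers, are hypothetical and chosen for this example only; the actual schema and serialization used by Qwen-AgentWorld are not given here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Step:
    action: str       # what the agent did (a command, a click, an API call)
    observation: str  # what the environment returned in response


@dataclass
class Trajectory:
    domain: str                      # e.g. "terminal", "android", "tool-api"
    steps: List[Step] = field(default_factory=list)


def world_model_examples(traj: Trajectory) -> List[Tuple[str, str]]:
    """Build (context, target) pairs for next-state prediction.

    The context is the full history plus the current action; the target
    is the observation the environment produced for that action.
    """
    examples = []
    history: List[str] = []
    for step in traj.steps:
        history.append(f"ACTION: {step.action}")
        examples.append(("\n".join(history), step.observation))
        history.append(f"OBSERVATION: {step.observation}")
    return examples


# The same structure covers very different domains.
terminal = Trajectory("terminal", [Step("ls", "a.txt"), Step("cat a.txt", "hello")])
pairs = world_model_examples(terminal)
```

Because every domain reduces to the same pair structure, one model can in principle be trained on all of them; only the content of the action and observation strings changes.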

Training unfolds in three stages. Continual pre-training injects world knowledge at scale. Supervised fine-tuning activates explicit next-state prediction as a reasoning chain, using 256-thousand-token contexts to capture deep trajectories. Reinforcement learning then sharpens fidelity through five-dimensional rubric ratings covering format, factuality, consistency, realism, and quality, combined with strict rule-based verifiers to prevent reward hacking.
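One way to picture how rubric ratings and rule-based verifiers might be combined is the minimal sketch below. It rests on assumptions that are not stated in the script: each of the five dimensions is scored in [0, 1], the dimensions are averaged with equal weight, and a failed verifier zeroes the reward outright. The function name `combined_reward` is hypothetical, and the real reward design may differ.

```python
from typing import Dict

# The five rubric dimensions named in the script.
RUBRIC_DIMENSIONS = ("format", "factuality", "consistency", "realism", "quality")


def combined_reward(rubric_scores: Dict[str, float], verifier_passed: bool) -> float:
    """Combine rubric ratings with a hard rule-based gate.

    Assumption: scores lie in [0, 1] and are weighted equally.
    Assumption: a failed verifier yields zero reward, so a prediction
    cannot earn credit from the rubric alone while breaking hard rules.
    """
    missing = [d for d in RUBRIC_DIMENSIONS if d not in rubric_scores]
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")
    if not verifier_passed:
        return 0.0
    total = sum(rubric_scores[d] for d in RUBRIC_DIMENSIONS)
    return total / len(RUBRIC_DIMENSIONS)


scores = {d: 0.5 for d in RUBRIC_DIMENSIONS}
reward_ok = combined_reward(scores, verifier_passed=True)
reward_blocked = combined_reward(scores, verifier_passed=False)
```

The hard gate is the design choice that matters here: a soft rubric score by itself can be gamed by outputs that look plausible, while a rule check that must pass first removes that shortcut.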

On AgentWorldBench, a grounded evaluation built from real environment execution, the 397-billion-parameter Qwen-AgentWorld model achieves the highest overall score, 58.71, outperforming frontier closed models. It dominates text-based domains like terminal and software engineering environments, and remains competitive in GUI simulation for Android, web, and operating systems.

World models unlock two complementary capabilities. As standalone simulators, they enable controllable, scalable training across thousands of synthetic environments, yielding double-digit percentage gains when agents train in fictional yet self-consistent worlds. As agent foundation models, explicit world modeling provides meta-reasoning: agents internalize environment dynamics, simulate outcomes before acting, and transfer robustly across domains with limited real interaction data.

Qwen-AgentWorld demonstrates that high-fidelity world modeling is not auxiliary but foundational for general agents. Explicit next-state prediction acts as transferable meta-reasoning, enabling agents to generalize across domains and scale beyond the constraints of real-environment interaction. To explore how world models reshape agent training and create your own video explanations, visit EmergentMind.com.