AgentWorldBench: Multi-Domain Evaluation Suite
- AgentWorldBench is a comprehensive evaluation suite that simulates agentic environments across seven interactive domains using strict out-of-distribution protocols.
- The accompanying Qwen-AgentWorld models are trained with a three-stage pipeline (pre-training, supervised fine-tuning, and reinforcement learning) to enhance next-state prediction, reasoning, and meta-representational capabilities.
- Empirical results show that Qwen-AgentWorld models achieve competitive scores in realism, consistency, and factual accuracy across diverse simulation tasks.
AgentWorldBench is a comprehensive evaluation suite for assessing language world models’ ability to simulate agentic environments, spanning seven interactive domains and nine widely adopted benchmarks. Developed as part of the Qwen-AgentWorld project, AgentWorldBench uses real-environment trajectories collected from frontier LLM-based agents and imposes strict out-of-distribution (OOD) protocols to quantify environment simulation fidelity, reasoning, and meta-representational capabilities of generalist LLMs (Zuo et al., 23 Jun 2026).
1. Scope and Domain Coverage
AgentWorldBench targets the simulation of diverse agentic environments characterized by language-driven actions and observations. The suite encompasses the following seven domains, each mapped to canonical academic benchmarks:
| Domain | Description and Associated Benchmarks |
|---|---|
| MCP | Multi-service remote procedure call via JSON tools (Tool Decathlon, MCPMark) |
| Search | Web search and extractor tasks (WideSearch) |
| SWE | Software engineering: code reading, editing, execution (SWE-Bench Verified/Pro) |
| Terminal | Unix shell simulation (Terminal-Bench 1.0, 2.0) |
| Android | Mobile GUI via view hierarchies (AndroidWorld) |
| Web | Browser GUI via accessibility trees (WebArena) |
| OS | Desktop GUI via window/app state (OSWorld-Verified) |
Evaluation trajectories derive from the outputs of competitive LLM-based agents: Anthropic Claude Sonnet 4.6 and Claude Opus 4.6, OpenAI GPT-5.4, and DeepMind Gemini 3.1 Pro, plus three strong Qwen-family agents added for expanded GUI diversity. This configuration ensures coverage of both text-centric and GUI-centric interaction paradigms (Zuo et al., 23 Jun 2026).
2. Dataset Construction and Task Structure
The training substrate for Qwen-AgentWorld’s language world models comprises over 10 million environment interaction trajectories distributed across all seven domains. Data sources include (1) agent infrastructure with containerized sandboxes and virtual machines for synthetic data generation, (2) public traces from terminal logs, tool invocations, and code execution, and (3) in-house agentic SFT rollouts, all normalized to a Unified Environment Trajectory Schema:
- system_prompt:
- turn:
- trajectory:
Preprocessing includes turn-level expansion (one training sample per turn), loss masking (CPT stage) using statistic-based filters for novelty, overlap, and boilerplate detection, and rejection sampling (SFT stage) guided by LLM-based judgments. Input–output pairs for world-model training are structured to predict the observation from the full system prompt, trajectory history, and current action; e.g., for the Terminal domain, predicting ls -la output after a given system state initialization (Zuo et al., 23 Jun 2026).
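The turn-level expansion described above can be sketched as follows. This is a minimal illustration, not the project's actual schema: the `Turn` and `Trajectory` classes, their field types, and the `expand_turns` helper are assumptions, since the source names the schema fields without describing their contents. The toy trajectory is invented for demonstration.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    action: str       # what the agent did, e.g. a shell command or tool call
    observation: str  # what the environment returned


@dataclass
class Trajectory:
    system_prompt: str
    turns: list[Turn] = field(default_factory=list)


def expand_turns(traj: Trajectory) -> list[dict]:
    """Emit one training sample per turn (hypothetical layout).

    Each sample asks the world model to predict the observation from the
    system prompt, the preceding history, and the current action.
    """
    samples = []
    for i, turn in enumerate(traj.turns):
        samples.append({
            "system_prompt": traj.system_prompt,
            "history": [(t.action, t.observation) for t in traj.turns[:i]],
            "action": turn.action,
            "target_observation": turn.observation,
        })
    return samples


# Toy Terminal-domain trajectory (illustrative values only).
traj = Trajectory(
    system_prompt="You simulate a Unix terminal.",
    turns=[Turn("mkdir demo", ""), Turn("ls -la", "<directory listing>")],
)
samples = expand_turns(traj)
```

Here a two-turn trajectory yields two samples, and the second sample carries the first turn as its history.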
3. Evaluation Protocols
AgentWorldBench enforces rigorous OOD splits, with no query or trajectory overlap between evaluation and training pools, yielding approximately 2,170 sampled turns reserved strictly for evaluation. Sampling strategies include:
- Text domains (MCP, Search, Terminal, SWE): Select five turns per trajectory (first, last, three intermediates)
- GUI domains (Android, Web, OS): Manual challenging-turn selection plus 50% random subsampling
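The text-domain rule (first, last, and three intermediate turns) might be implemented along these lines. The source does not say how the intermediates are chosen, so the even spacing here is an assumption, as are the function name `select_text_turns` and the choice to keep every turn of very short trajectories.

```python
def select_text_turns(num_turns: int, n_intermediate: int = 3) -> list[int]:
    """Return sorted 0-based turn indices: first, last, and intermediates.

    Assumption: intermediates are evenly spaced between the first and last
    turn. Trajectories too short to supply distinct picks are kept whole.
    """
    if num_turns <= 0:
        return []
    if num_turns <= n_intermediate + 2:
        return list(range(num_turns))
    last = num_turns - 1
    # Integer division keeps the picks strictly between 0 and `last`.
    middle = [(k * last) // (n_intermediate + 1)
              for k in range(1, n_intermediate + 1)]
    return sorted({0, last, *middle})


picked = select_text_turns(21)
```

For a 21-turn trajectory this selects turns 0, 5, 10, 15, and 20. A random choice of intermediates would satisfy the stated rule equally well.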
Zero-shot evaluation is standard: world models receive only the static system prompt with demonstrations, history, and current action as input (no dynamic few-shot context).
Predicted observations are rated by a rubric-judging LLM (GPT-5.2) along five dimensions (Format, Factuality, Consistency, Realism, Quality), each scored 1–5. The ratings are then normalized to the 0–100 scale used for the reported results.
Matching protocols differentiate deterministic, pre-existing, and metadata-based outputs for grading strictness. Judge prompts were calibrated with double-blind Turing-test A/B comparisons on real vs. simulated turns, optimizing discrimination capacity (Zuo et al., 23 Jun 2026).
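As an illustration of how five 1–5 rubric ratings could map onto a 0–100 score, the sketch below applies a min-max rescaling to the mean rating. This particular normalization is an assumption and is not confirmed by the source. The dimension keys and the function name `normalize_rubric` are likewise hypothetical.

```python
DIMENSIONS = ("format", "factuality", "consistency", "realism", "quality")


def normalize_rubric(ratings: dict[str, int]) -> float:
    """Map five 1-5 ratings to 0-100 (assumed min-max rescaling of the mean)."""
    missing = set(DIMENSIONS) - set(ratings)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"{name} rating must be in 1..5")
    mean = sum(ratings[name] for name in DIMENSIONS) / len(DIMENSIONS)
    return 100.0 * (mean - 1) / 4  # rating 1 -> 0, rating 5 -> 100


score = normalize_rubric(
    {"format": 5, "factuality": 3, "consistency": 4, "realism": 4, "quality": 4}
)
```

Under this assumed mapping, a turn rated 5/3/4/4/4 has a mean of 4.0 and a normalized score of 75.0.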
4. Empirical Results and Statistical Analysis
Table 1 reports the mean rubric score (0–100) for each model and domain.
| Model | MCP | Search | Term. | SWE | Android | Web | OS | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 70.10 | 37.26 | 53.69 | 66.29 | 60.00 | 51.80 | 68.58 | 58.25 |
| Claude Opus 4.6 | 69.90 | 29.30 | 57.51 | 64.55 | 61.74 | 51.42 | 70.20 | 57.80 |
| Qwen3.5-397B-A17B | 68.31 | 30.81 | 55.30 | 64.44 | 54.90 | 48.55 | 60.85 | 54.74 |
| Qwen-AgentWorld-397B-A17B | 68.24 | 37.82 | 57.73 | 68.49 | 60.20 | 50.98 | 67.89 | 58.71 |
| Qwen3.5-35B-A3B | 57.87 | 25.98 | 46.13 | 47.58 | 53.18 | 47.10 | 56.27 | 47.73 |
| Qwen-AgentWorld-35B-A3B | 64.79 | 36.69 | 53.96 | 65.63 | 58.17 | 49.55 | 65.92 | 56.39 |
Qwen-AgentWorld-397B-A17B attains the highest average score (58.71), narrowly ahead of GPT-5.4 (58.25) and Claude Opus 4.6 (57.80). It leads on Search, Terminal, and SWE, and trails the best proprietary model on MCP, Android, Web, and OS. The smaller Qwen-AgentWorld-35B-A3B averages 56.39. Reinforcement learning (RL, Stage 3 of training) delivers significant improvements: +8.7 points (35B) and +3.97 points (397B) relative to CPT+SFT. Targeted RL on Terminal data improves Terminal itself (+14.2) and transfers positively to held-out text domains (+11.5 on SWE, etc.). All main differences reported are statistically significant under bootstrap testing (Zuo et al., 23 Jun 2026).
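The per-domain comparison in Table 1 can be checked with a few lines of arithmetic. The sketch below copies the table values verbatim; the helper names `domain_leaders` and `row_mean` are illustrative, and the unweighted row mean is an assumption about how the Avg. column was computed.

```python
DOMAINS = ["MCP", "Search", "Term.", "SWE", "Android", "Web", "OS"]

# Scores copied from Table 1, in DOMAINS order.
TABLE_1 = {
    "GPT-5.4": [70.10, 37.26, 53.69, 66.29, 60.00, 51.80, 68.58],
    "Claude Opus 4.6": [69.90, 29.30, 57.51, 64.55, 61.74, 51.42, 70.20],
    "Qwen3.5-397B-A17B": [68.31, 30.81, 55.30, 64.44, 54.90, 48.55, 60.85],
    "Qwen-AgentWorld-397B-A17B": [68.24, 37.82, 57.73, 68.49, 60.20, 50.98, 67.89],
    "Qwen3.5-35B-A3B": [57.87, 25.98, 46.13, 47.58, 53.18, 47.10, 56.27],
    "Qwen-AgentWorld-35B-A3B": [64.79, 36.69, 53.96, 65.63, 58.17, 49.55, 65.92],
}


def domain_leaders(table: dict[str, list[float]]) -> dict[str, str]:
    """Return the top-scoring model for each domain."""
    return {
        domain: max(table, key=lambda model: table[model][i])
        for i, domain in enumerate(DOMAINS)
    }


def row_mean(scores: list[float]) -> float:
    """Unweighted mean across the seven domains."""
    return sum(scores) / len(scores)


leaders = domain_leaders(TABLE_1)
means = {model: round(row_mean(s), 2) for model, s in TABLE_1.items()}
```

Unweighted row means computed this way agree with the reported Avg. column to within 0.06 points.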
5. Training Pipeline and Loss/Reward Schemes
The three-stage training pipeline is defined as:
- CPT (Corpus Pre-Training): General-purpose world modeling pretrained over state-transition and augmented corpora. Loss masking down-weights low-novelty turns, with per-category keep-ratios determined by three metrics: Overlap (OL), Novelty (Nov), and length ratio (R). For instance, retrieval and expansion turns are fully retained, while boilerplate turns are kept at 10%.
| Category | Signature | Keep-Ratio |
|---|---|---|
| retrieval | Novelty ≥ 60%, length ratio > 1 | 100% |
| expansion | OL ≥ 50%, Nov ≥ 50%, R > 1.5 | 100% |
| action | Nov ≥ 50%, R ≤ 1 or short | 100% |
| transform | Nov < 50%, R < 1 | 50% |
| boilerplate | OL ≥ 50%, Nov < 50% | 10% |
| echo | OL ≥ 70%, Nov < 30% | 5% |
| other | uncategorized | 100% |
- SFT (Supervised Fine-Tuning): Next-state-prediction supervised with rejection sampling; three model rollouts are LLM-judged, keeping only the highest scorers above a threshold.
- RL (Reinforcement Learning): Open-ended RL with a hybrid rubric-plus-rule reward that combines the LLM-judged rubric scores on the five dimensions with a binary verifier score.
RL yields improved factual accuracy (+11.3% relative on Factuality), cross-turn consistency (+10%), and realistic behavior (Zuo et al., 23 Jun 2026).
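The CPT keep-ratio table above can be read as a small rule-based classifier followed by random subsampling. The sketch below is one possible reading under stated assumptions: several signatures in the table overlap, so the check order (most restrictive low-keep categories first) is assumed, the `SHORT_TOKENS` cutoff for "short" is hypothetical, and metrics are expressed as fractions in [0, 1].

```python
import random

SHORT_TOKENS = 16  # hypothetical cutoff for a "short" turn; not from the source

# Keep-ratios copied from the table.
KEEP_RATIO = {
    "retrieval": 1.00, "expansion": 1.00, "action": 1.00,
    "transform": 0.50, "boilerplate": 0.10, "echo": 0.05, "other": 1.00,
}


def categorize(overlap: float, novelty: float, ratio: float, length: int) -> str:
    """Assign a turn to a category from its OL, Nov, and R statistics.

    Assumed precedence: echo is a subset of boilerplate, so it is checked
    first; the fully retained categories come after the down-sampled ones.
    """
    if overlap >= 0.70 and novelty < 0.30:
        return "echo"
    if overlap >= 0.50 and novelty < 0.50:
        return "boilerplate"
    if novelty < 0.50 and ratio < 1:
        return "transform"
    if overlap >= 0.50 and novelty >= 0.50 and ratio > 1.5:
        return "expansion"
    if novelty >= 0.60 and ratio > 1:
        return "retrieval"
    if novelty >= 0.50 and (ratio <= 1 or length < SHORT_TOKENS):
        return "action"
    return "other"


def keep_in_loss(category: str, rng: random.Random) -> bool:
    """Keep the turn in the training loss with its category's keep-ratio."""
    return rng.random() < KEEP_RATIO[category]


rng = random.Random(0)
category = categorize(overlap=0.8, novelty=0.1, ratio=1.0, length=100)
kept = keep_in_loss(category, rng)
```

Because `random.Random.random()` returns values in [0, 1), categories with a keep-ratio of 100% are always retained, while an "echo" turn survives in roughly one draw out of twenty.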
6. Observed Strengths, Limitations, and Implications
AgentWorldBench demonstrates that a single language world model can simulate seven heterogeneous interactive domains with high fidelity. The staged pipeline enables progressive acquisition of causal next-state prediction, with RL acting as an effective simulation sharpener via rubric-and-verifier-based feedback.
Key strengths include the decoupled simulation of thousands of controllable real-world environments for scalable RL, and the utility of world-model pre-training as a warm-up to boost downstream RL performance across multiple benchmarks.
However, factuality remains the lowest-scoring rubric dimension (≈ 45/100), indicating challenges with up-to-date world knowledge and real-time API fidelity. GUI domains present a modest performance gap compared to proprietary multimodal models, likely due to missing visual/contextual input streams in text-only modeling.
Meta-reasoning and mental simulation behaviors emerge in next-state prediction traces, such as stepwise refinement of configuration actions in complex workflows—evidence that RL-induced chain-of-thought modeling is crucial for robust agent planning.
A plausible implication is that scalable, reference-grounded evaluation frameworks like AgentWorldBench are essential for quantifying and advancing general agent world modeling capabilities.
7. Significance for Future Agentic RL Research
AgentWorldBench, in conjunction with Qwen-AgentWorld models, establishes a rigorous, OOD evaluation standard for multi-domain, text-based world modeling. Its structure and outcomes clarify the progress and open challenges in leveraging LLMs as both decoupled simulators and unified agent foundations, paving the way for greater scaling, domain generalization, and robust downstream policy learning in RL environments (Zuo et al., 23 Jun 2026).