AgentWorldBench: Multi-Domain Evaluation Suite

Updated 24 June 2026

AgentWorldBench is a comprehensive evaluation suite that simulates agentic environments across seven interactive domains using strict out-of-distribution protocols.
The accompanying Qwen-AgentWorld models are trained with a three-stage pipeline (pre-training, supervised fine-tuning, and reinforcement learning) to enhance next-state prediction, reasoning, and meta-representational capabilities.
Empirical results show that Qwen-AgentWorld models achieve competitive scores in realism, consistency, and factual accuracy across diverse simulation tasks.

AgentWorldBench is a comprehensive evaluation suite for assessing language world models’ ability to simulate agentic environments, spanning seven interactive domains and nine widely adopted benchmarks. Developed as part of the Qwen-AgentWorld project, AgentWorldBench uses real-environment trajectories collected from frontier LLM-based agents and imposes strict out-of-distribution (OOD) protocols to quantify environment simulation fidelity, reasoning, and meta-representational capabilities of generalist LLMs (Zuo et al., 23 Jun 2026).

1. Scope and Domain Coverage

AgentWorldBench targets the simulation of diverse agentic environments characterized by language-driven actions and observations. The suite encompasses the following seven domains, each mapped to canonical academic benchmarks:

Domain	Description and Associated Benchmarks
MCP	Multi-service remote procedure call via JSON tools (Tool Decathlon, MCPMark)
Search	Web search and extractor tasks (WideSearch)
SWE	Software engineering: code reading, editing, execution (SWE-Bench Verified/Pro)
Terminal	Unix shell simulation (Terminal-Bench 1.0, 2.0)
Android	Mobile GUI via view hierarchies (AndroidWorld)
Web	Browser GUI via accessibility trees (WebArena)
OS	Desktop GUI via window/app state (OSWorld-Verified)

Evaluation trajectories derive from the outputs of competitive LLM-based agents: Anthropic Claude Sonnet 4.6, Claude Opus 4.6, OpenAI GPT-5.4, and DeepMind Gemini 3.1 Pro, plus three strong Qwen-family agents for expanded GUI diversity. This configuration ensures coverage of both text-centric and GUI-centric interaction paradigms (Zuo et al., 23 Jun 2026).

2. Dataset Construction and Task Structure

The training substrate for Qwen-AgentWorld’s language world models comprises over 10 million environment interaction trajectories distributed across all seven domains. Data sources include (1) agent infrastructure with containerized sandboxes and virtual machines for synthetic data generation, (2) public traces from terminal logs, tool invocations, and code execution, and (3) in-house agentic SFT rollouts, all normalized to a Unified Environment Trajectory Schema:

system_prompt: $\text{task\_description} \oplus \text{action\_space} \oplus \text{initial\_state} \oplus \text{demonstrations} \oplus \text{simulation\_instruction}$
$\text{turn}_t$: $(\text{action}_t, \text{observation}_t)$
trajectory: $\text{system\_prompt} \oplus [\text{turn}_1, \ldots, \text{turn}_T]$

Preprocessing includes turn-level expansion (one training sample per turn), loss masking (CPT stage) using statistic-based filters for novelty, overlap, and boilerplate detection, and rejection sampling (SFT stage) guided by LLM-based judgments. Input–output pairs for world-model training are structured to predict the observation $\hat{o}_t$ from the full system prompt, trajectory history, and current action; e.g., for the Terminal domain, predicting ls -la output after a given system state initialization (Zuo et al., 23 Jun 2026).
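The schema and the turn-level expansion step can be sketched in a few lines. This is a minimal illustration, not the project's code: the class and function names (`Turn`, `Trajectory`, `Sample`, `expand_turns`) are hypothetical, and the system prompt is treated as a single pre-concatenated string.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Turn:
    action: str
    observation: str


@dataclass(frozen=True)
class Trajectory:
    # Concatenation of task description, action space, initial state,
    # demonstrations, and simulation instruction.
    system_prompt: str
    turns: List[Turn]


@dataclass(frozen=True)
class Sample:
    system_prompt: str
    history: Tuple[Turn, ...]  # turns before the current one
    action: str                # current action
    target: str                # observation the world model must predict


def expand_turns(traj: Trajectory) -> List[Sample]:
    """Turn-level expansion: emit one training sample per turn."""
    return [
        Sample(traj.system_prompt, tuple(traj.turns[:t]), turn.action, turn.observation)
        for t, turn in enumerate(traj.turns)
    ]


# Hypothetical two-turn Terminal trajectory.
traj = Trajectory(
    system_prompt="Simulate a Unix shell.",
    turns=[Turn("mkdir demo", ""), Turn("ls", "demo")],
)
samples = expand_turns(traj)
print(len(samples), samples[1].action, repr(samples[1].target))
```

Each sample pairs the full context up to turn $t$ with the single observation to predict, so a trajectory of $T$ turns contributes $T$ samples.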

3. Evaluation Protocols

AgentWorldBench enforces rigorous OOD splits (no query or trajectory overlaps between evaluation and training pools), yielding approximately 2,170 sampled turns reserved for evaluation. Sampling strategies include:

Text domains (MCP, Search, Terminal, SWE): Select five turns per trajectory (first, last, three intermediates)
GUI domains (Android, Web, OS): Manual challenging-turn selection plus 50% random subsampling
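The two sampling strategies above can be sketched as follows. The source does not say how the three intermediate turns are chosen or exactly which pool the 50% subsampling applies to, so this sketch assumes evenly spaced intermediates and subsampling of the turns not picked manually. Both function names are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_text_turns(num_turns: int, k_mid: int = 3) -> List[int]:
    """Indices of the first turn, the last turn, and k_mid intermediates.

    Assumption: intermediates are evenly spaced; the source does not specify this.
    """
    if num_turns <= k_mid + 2:
        return list(range(num_turns))  # short trajectory: keep every turn
    mids = [(num_turns - 1) * i // (k_mid + 1) for i in range(1, k_mid + 1)]
    return sorted({0, *mids, num_turns - 1})


def subsample_gui_turns(
    manual: Sequence[T], remaining: Sequence[T], rng: random.Random, rate: float = 0.5
) -> List[T]:
    """Keep all manually chosen turns plus a random fraction of the remaining ones.

    Assumption: the 50% rate applies only to turns that were not picked manually.
    """
    k = round(len(remaining) * rate)
    return list(manual) + rng.sample(list(remaining), k)


print(select_text_turns(9))
print(subsample_gui_turns(["hard_turn"], ["a", "b", "c", "d"], random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the GUI subsample reproducible across runs.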

Zero-shot evaluation is standard: world models receive only the static system prompt with demonstrations, history, and current action as input (no dynamic few-shot context).

Predicted observations are rated by a rubric-judging LLM (GPT-5.2) along five dimensions (Format, Factuality, Consistency, Realism, Quality), each scored 1–5. The normalized evaluation formula is:

$\text{score} = 100 \times \frac{\text{mean of the five dimension scores}}{5} = 20 \times (\text{mean of the five dimension scores})$
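A minimal sketch of this normalization, assuming the intended mapping places five 1–5 ratings on the 100-point scale used in Table 1 (20 times their mean, so scores range from 20 to 100). The function name and the input format are illustrative assumptions.

```python
from statistics import mean
from typing import Mapping

DIMENSIONS = ("Format", "Factuality", "Consistency", "Realism", "Quality")


def normalized_score(ratings: Mapping[str, int]) -> float:
    """Map five 1-5 rubric ratings to a 100-point scale (20 x mean)."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    values = [ratings[d] for d in DIMENSIONS]
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("each rating must lie in 1..5")
    return 20.0 * mean(values)


example = {"Format": 5, "Factuality": 3, "Consistency": 4, "Realism": 4, "Quality": 4}
print(normalized_score(example))  # 80.0
```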

Matching protocols differentiate deterministic, pre-existing, and metadata-based outputs for grading strictness. Judge prompts were calibrated with double-blind Turing-test A/B comparisons on real vs. simulated turns, optimizing discrimination capacity (Zuo et al., 23 Jun 2026).

4. Empirical Results and Statistical Analysis

Table 1 reports the mean rubric score (100-point scale) for each model and domain; the principal findings follow the table.

Model	MCP	Search	Term.	SWE	Android	Web	OS	Avg.
GPT-5.4	70.10	37.26	53.69	66.29	60.00	51.80	68.58	58.25
Claude Opus 4.6	69.90	29.30	57.51	64.55	61.74	51.42	70.20	57.80
Qwen3.5-397B-A17B	68.31	30.81	55.30	64.44	54.90	48.55	60.85	54.74
Qwen-AgentWorld-397B-A17B	68.24	37.82	57.73	68.49	60.20	50.98	67.89	58.71
Qwen3.5-35B-A3B	57.87	25.98	46.13	47.58	53.18	47.10	56.27	47.73
Qwen-AgentWorld-35B-A3B	64.79	36.69	53.96	65.63	58.17	49.55	65.92	56.39

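Qwen-AgentWorld-397B-A17B attains the highest average score (58.71), narrowly ahead of GPT-5.4 (58.25) and Claude Opus 4.6 (57.80), and leads every model on Search, Terminal, and SWE, while the proprietary models lead on MCP, Android, Web, and OS. The smaller Qwen-AgentWorld-35B-A3B (56.39) trails both proprietary models on average but outscores the much larger Qwen3.5-397B-A17B baseline (54.74). Reinforcement learning (RL, Stage 3 of training) is reported to deliver significant improvements of +8.7 points (35B) and +3.97 points (397B) relative to CPT+SFT; these figures coincide with the average-score gaps between each Qwen-AgentWorld model and its Qwen3.5 counterpart in Table 1. Targeted RL on Terminal data improves Terminal itself (+14.2) and transfers positively to held-out text domains (e.g., +11.5 on SWE). All main differences reported are statistically significant (bootstrap $p < 0.01$, CI width $\approx \pm 0.5$) (Zuo et al., 23 Jun 2026).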
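The comparisons in Table 1 are easy to recheck. The sketch below copies the table values and computes, for each model size, the gap between the reported averages of the Qwen-AgentWorld model and its Qwen3.5 counterpart, along with the top-scoring model per domain. The helper names are hypothetical.

```python
DOMAINS = ("MCP", "Search", "Term.", "SWE", "Android", "Web", "OS")

# (per-domain scores, reported average), copied from Table 1.
TABLE = {
    "GPT-5.4": ([70.10, 37.26, 53.69, 66.29, 60.00, 51.80, 68.58], 58.25),
    "Claude Opus 4.6": ([69.90, 29.30, 57.51, 64.55, 61.74, 51.42, 70.20], 57.80),
    "Qwen3.5-397B-A17B": ([68.31, 30.81, 55.30, 64.44, 54.90, 48.55, 60.85], 54.74),
    "Qwen-AgentWorld-397B-A17B": ([68.24, 37.82, 57.73, 68.49, 60.20, 50.98, 67.89], 58.71),
    "Qwen3.5-35B-A3B": ([57.87, 25.98, 46.13, 47.58, 53.18, 47.10, 56.27], 47.73),
    "Qwen-AgentWorld-35B-A3B": ([64.79, 36.69, 53.96, 65.63, 58.17, 49.55, 65.92], 56.39),
}

PAIRS = {
    "35B": ("Qwen3.5-35B-A3B", "Qwen-AgentWorld-35B-A3B"),
    "397B": ("Qwen3.5-397B-A17B", "Qwen-AgentWorld-397B-A17B"),
}


def average_gain(size: str) -> float:
    """Gap between the reported averages of a Qwen-AgentWorld model and its base."""
    base, tuned = PAIRS[size]
    return round(TABLE[tuned][1] - TABLE[base][1], 2)


def domain_leaders() -> dict:
    """Top-scoring model in each domain column."""
    return {
        domain: max(TABLE, key=lambda model: TABLE[model][0][i])
        for i, domain in enumerate(DOMAINS)
    }


for size in PAIRS:
    print(size, average_gain(size))
print(domain_leaders())
```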

5. Training Pipeline and Loss/Reward Schemes

The three-stage training pipeline is defined as:

CPT (Corpus Pre-Training): General-purpose world-modeling pre-training over state-transition and augmented corpora. Loss masking down-weights low-novelty turns, with keep-ratios assigned per category using the metrics Overlap (OL), Novelty (Nov), and the length ratio $R=|\text{obs}|/|\text{act}|$. For instance, retrieval and expansion turns are fully retained, while boilerplate is kept at 10%.

Category	Signature	Keep-Ratio
retrieval	Nov ≥ 60%, R > 1	100%
expansion	OL ≥ 50%, Nov ≥ 50%, R > 1.5	100%
action	Nov ≥ 50%, R ≤ 1 or short	100%
transform	Nov < 50%, R < 1	50%
boilerplate	OL ≥ 50%, Nov < 50%	10%
echo	OL ≥ 70%, Nov < 30%	5%
other	uncategorized	100%
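The category table can be read as a small rule-based classifier. The sketch below is one possible reading and rests on several assumptions: the signatures overlap, so rules are checked from most to least specific; "short" is given a hypothetical length cutoff; and the keep-ratio is applied as a per-turn sampling probability. None of these details is stated in the source.

```python
import random

KEEP_RATIO = {
    "retrieval": 1.00,
    "expansion": 1.00,
    "action": 1.00,
    "transform": 0.50,
    "boilerplate": 0.10,
    "echo": 0.05,
    "other": 1.00,
}

SHORT_OBS_LEN = 40  # hypothetical cutoff for "short"; not given in the source


def categorize(overlap: float, novelty: float, obs_len: int, act_len: int) -> str:
    """Assign a turn to a category; overlap and novelty are fractions in [0, 1].

    Assumption: rules are tried from most to least specific because the
    signatures overlap and no precedence is stated.
    """
    r = obs_len / max(act_len, 1)  # length ratio R = |obs| / |act|
    if overlap >= 0.70 and novelty < 0.30:
        return "echo"
    if overlap >= 0.50 and novelty < 0.50:
        return "boilerplate"
    if novelty < 0.50 and r < 1:
        return "transform"
    if overlap >= 0.50 and novelty >= 0.50 and r > 1.5:
        return "expansion"
    if novelty >= 0.60 and r > 1:
        return "retrieval"
    if novelty >= 0.50 and (r <= 1 or obs_len <= SHORT_OBS_LEN):
        return "action"
    return "other"


def keep_loss(category: str, rng: random.Random) -> bool:
    """Keep the loss on a turn with probability equal to its keep-ratio (assumed)."""
    return rng.random() < KEEP_RATIO[category]


rng = random.Random(0)
category = categorize(overlap=0.8, novelty=0.1, obs_len=100, act_len=100)
print(category, KEEP_RATIO[category], keep_loss("retrieval", rng))
```

Because every high-novelty category keeps 100% of its turns, the ordering among retrieval, expansion, and action does not change which turns contribute to the loss; only the ordering of the three low-novelty rules matters.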

SFT (Supervised Fine-Tuning): Next-state-prediction supervised with rejection sampling; three model rollouts are LLM-judged, keeping only the highest scorers above a threshold.
RL (Reinforcement Learning): Open-ended RL with hybrid rubric-plus-rule reward:

$R_\text{total} = 0.9 \cdot R_\text{rubric} + 0.1 \cdot R_\text{rule}$

where $R_\text{rubric}$ is derived from the scores in the five LLM-judged dimensions and $R_\text{rule}$ is a binary verifier score. RL yields improved factual accuracy (+11.3% relative on Factuality), cross-turn consistency (+10%), and realistic behavior (Zuo et al., 23 Jun 2026).
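A minimal sketch of the hybrid reward with the 0.9/0.1 weighting. How the five dimension scores are aggregated into $R_\text{rubric}$ is not recoverable from the text, so the helper `rubric_from_ratings` assumes a mean of 1–5 ratings rescaled to $[0, 1]$; that helper and both function names are hypothetical.

```python
from typing import Sequence


def rubric_from_ratings(ratings: Sequence[int]) -> float:
    """Assumed aggregation: mean of 1-5 ratings rescaled to [0, 1]."""
    if not ratings or any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("expected a non-empty sequence of ratings in 1..5")
    return (sum(ratings) / len(ratings) - 1.0) / 4.0


def total_reward(
    rubric: float, rule_passed: bool, w_rubric: float = 0.9, w_rule: float = 0.1
) -> float:
    """R_total = 0.9 * R_rubric + 0.1 * R_rule, with R_rule a binary verifier score."""
    if not 0.0 <= rubric <= 1.0:
        raise ValueError("rubric reward must lie in [0, 1]")
    return w_rubric * rubric + w_rule * (1.0 if rule_passed else 0.0)


rubric = rubric_from_ratings([5, 3, 4, 4, 4])  # mean 4 rescales to 0.75
print(total_reward(rubric, rule_passed=True))
print(total_reward(rubric, rule_passed=False))
```

With these weights the verifier can shift the total by at most 0.1, so the rubric judgment dominates while the rule check still provides a signal for verifiable outputs.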

6. Observed Strengths, Limitations, and Implications

AgentWorldBench demonstrates that a single language world model can simulate seven heterogeneous interactive domains with high fidelity. The staged pipeline enables progressive acquisition of causal next-state prediction, with RL acting as an effective simulation sharpener via rubric-and-verifier-based feedback.

Key strengths include the decoupled simulation of thousands of controllable real-world environments for scalable RL, and the utility of world-model pre-training as a warm-up to boost downstream RL performance across multiple benchmarks.

However, factuality remains the lowest-scoring rubric dimension (≈ 45/100), indicating challenges with up-to-date world knowledge and real-time API fidelity. GUI domains present a modest performance gap compared to proprietary multimodal models, likely due to missing visual/contextual input streams in text-only modeling.

Meta-reasoning and mental simulation behaviors emerge in next-state prediction traces, such as stepwise refinement of configuration actions in complex workflows—evidence that RL-induced chain-of-thought modeling is crucial for robust agent planning.

A plausible implication is that scalable, reference-grounded evaluation frameworks like AgentWorldBench are essential for quantifying and advancing general agent world modeling capabilities.

7. Significance for Future Agentic RL Research

AgentWorldBench, in conjunction with Qwen-AgentWorld models, establishes a rigorous, OOD evaluation standard for multi-domain, text-based world modeling. Its structure and outcomes clarify the progress and open challenges in leveraging LLMs as both decoupled simulators and unified agent foundations, paving the way for greater scaling, domain generalization, and robust downstream policy learning in RL environments (Zuo et al., 23 Jun 2026).

References (1)

Zuo et al. (2026). Qwen-AgentWorld: Language World Models for General Agents.