Qwen-AgentWorld: Agent Simulation & Planning
- Qwen-AgentWorld is a family of large-scale language world models designed for simulating agent behavior, reasoning, and planning across seven distinct real-world domains.
- It employs a three-stage training pipeline (CPT, SFT, and RL) that leverages explicit chain-of-thought reasoning to enhance simulation fidelity and robust decision-making.
- The framework integrates transformer architectures with targeted quantization and scratchpad modifications to support both closed-loop agent training and general agent foundation modeling.
Qwen-AgentWorld is a family of large-scale language world models for agentic environment simulation, reasoning, and planning. It provides a unified framework for both scalable closed-loop agent training and general agent foundation modeling. The models simulate environment trajectories across seven real-world domains via explicit chain-of-thought (CoT) reasoning and are trained on over 10 million real-world interaction trajectories. Using a staged pipeline of continual pre-training (CPT), supervised fine-tuning (SFT), and rubric-guided reinforcement learning (RL), Qwen-AgentWorld demonstrates strong simulation fidelity, broad generalization, and practical utility across a battery of evaluator-curated agent benchmarks (Zuo et al., 23 Jun 2026).
1. Model Architectures
Qwen-AgentWorld comprises two main instantiations: Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B. Both are based on the Qwen3 decoder-only Transformer architecture, customized for world modeling and long CoT reasoning:
- Qwen-AgentWorld-35B-A3B:
  - Base: Qwen3.5-35B
  - Layers: 40
  - Hidden size: 8,192
  - Feed-forward size: 32,768
  - Attention heads: 32
  - Parameters: ~35B
  - A3B: 3-bit activation quantization; includes a sparse CoT prefix head
- Qwen-AgentWorld-397B-A17B:
  - Base: Qwen3.6-397B
  - Layers: 80
  - Hidden size: 15,360
  - Feed-forward size: 61,440
  - Attention heads: 64
  - Parameters: ~397B
  - A17B: 17-bit quantization, extended “scratchpad” CoT head after layer 20
Both configurations integrate lightweight “scratchpad” MLPs (≈20M parameters) at each 8th layer during SFT to promote explicit thought token emission and use token-type embeddings during RL to constrain reward attribution to assistant regions only. These modifications facilitate interpretable multi-step world simulation and reasoning (Zuo et al., 23 Jun 2026).
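The two mechanisms in the paragraph above can be illustrated with a minimal sketch. It assumes that "each 8th layer" means 1-indexed layers divisible by 8 and that token types are encoded as small integers; the names `scratchpad_layers`, `mask_to_assistant`, `ASSISTANT`, and `OTHER` are hypothetical and do not come from the source.

```python
from typing import List

ASSISTANT = 1  # hypothetical token-type id for assistant tokens
OTHER = 0      # hypothetical token-type id for system/user/tool tokens


def scratchpad_layers(num_layers: int, every: int = 8) -> List[int]:
    """Return the 1-indexed layers that would carry a scratchpad MLP.

    ASSUMPTION: "each 8th layer" is read as layers 8, 16, 24, ...
    """
    return [i for i in range(1, num_layers + 1) if i % every == 0]


def mask_to_assistant(values: List[float], token_types: List[int]) -> List[float]:
    """Zero out per-token values (e.g. reward or loss terms) outside assistant regions."""
    if len(values) != len(token_types):
        raise ValueError("values and token_types must have the same length")
    return [v if t == ASSISTANT else 0.0 for v, t in zip(values, token_types)]
```

Under these assumptions, `scratchpad_layers(40)` gives five insertion points for the 35B configuration and `scratchpad_layers(80)` gives ten for the 397B configuration.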
2. Three-Stage Training Pipeline
The “CPT injects ⟶ SFT activates ⟶ RL sharpens” pipeline undergirds Qwen-AgentWorld’s progressive acquisition of environment modeling capacity:
2.1 Continual Pre-Training (CPT)
- Data:
  - Trajectories from sandboxed code, containerized terminals, browsers, OS VMs, public code/script logs, professional domain corpora (law, finance, medicine)
- Objective:
  - Turn-level masking stratifies action-observation pairs into semantic bins by token overlap, novelty, and ratio, retaining high-novelty transitions at lower rates to improve coverage.
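One way to picture stratified turn-level sampling is the sketch below. It is an illustration only: the novelty measure (one minus the Jaccard overlap of whitespace tokens), the bin thresholds, and the per-bin keep rates are all assumptions, and the function names are hypothetical. The source does not specify how overlap, novelty, and ratio are computed.

```python
import random
from typing import Dict


def novelty(prev_obs: str, obs: str) -> float:
    """ASSUMED novelty score: 1 - Jaccard overlap of whitespace tokens."""
    a, b = set(prev_obs.split()), set(obs.split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def novelty_bin(score: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map a novelty score to a bin; the thresholds are placeholders."""
    if score < low:
        return "low"
    if score < high:
        return "mid"
    return "high"


def keep_turn(prev_obs: str, obs: str, keep_rates: Dict[str, float],
              rng: random.Random) -> bool:
    """Keep a turn with the caller-supplied retention rate of its bin."""
    return rng.random() < keep_rates[novelty_bin(novelty(prev_obs, obs))]
```

The keep rates are left to the caller, so the same sketch can express any retention schedule across bins.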
2.2 Supervised Fine-Tuning (SFT)
- Data:
  - 7,094 “thinking” trajectories curated via 3-beam self-critique and rejection sampling (69.2% retention), covering all seven domains.
- Objective:
  - Supervision requires a ground-truth CoT trace before each next-state prediction, following the pattern “why the next state will look like X, then X.”
  - System prompt templates are diversified (ten variants) to prevent prompt overfitting.
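The "reason first, then predict" target and the template diversification can be sketched as follows. The `<think>` tags, the `ACTION:` prefix, the template wording, and the function name `build_sft_example` are assumptions made for illustration; the source does not give the actual serialization.

```python
import random
from typing import Dict, List

# Hypothetical stand-ins for the ten system-prompt variants.
SYSTEM_TEMPLATES: List[str] = [
    f"[variant {i}] You simulate the environment for this task: {{task}}"
    for i in range(10)
]


def build_sft_example(task: str, history: List[str], action: str,
                      reasoning: str, next_state: str,
                      rng: random.Random) -> Dict[str, str]:
    """Build one training pair: the reasoning precedes the predicted next state."""
    system = rng.choice(SYSTEM_TEMPLATES).format(task=task)
    prompt = "\n".join([system, *history, f"ACTION: {action}"])
    # ASSUMED tag format for the explicit CoT segment.
    target = f"<think>{reasoning}</think>\n{next_state}"
    return {"prompt": prompt, "target": target}
```

Sampling the system prompt per example means the model sees the same trajectory content under varied instructions, which is the stated purpose of the ten variants.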
2.3 Reinforcement Learning (RL)
- Algorithm: GSPO (Group Sequence Policy Optimization)
- Data: 92,308 sampled trajectories (one turn each, context ≤128k tokens)
- Reward design (hybrid):
  - Rubric: an LLM judge scores each predicted observation on 5-point scales for Format, Factuality, Consistency, Realism, and Quality; the five scores are aggregated into a single rubric reward.
  - Rule verifier: domain-specific schema/consistency checks (0/1, scaled to [0,25])
- Loss: the GSPO objective, optimized against the hybrid reward.
- Key strategies:
  - One-turn sampling to mitigate shared-prefix collapse.
  - Five-dimensional rubric outperforms pairwise/Turing-test signals for RL reward.
  - Content-type filtering prevents reward hacking via irrelevant or self-referential outputs.
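The reward and objective described in this stage can be sketched as below. The aggregation is an assumption: the rubric reward is taken as the sum of five ratings and is simply added to the rule reward, since the source does not state how the two are combined. The objective follows the published GSPO formulation (length-normalized sequence-level importance ratio, group-normalized advantages, clipping); it is not necessarily the exact loss used for Qwen-AgentWorld, and the clip range `eps` is left to the caller.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List

RUBRIC_DIMS = ("format", "factuality", "consistency", "realism", "quality")


def hybrid_reward(rubric: Dict[str, int], rule_pass: bool) -> float:
    """ASSUMPTION: sum of the five rubric ratings plus a 0/25 rule reward."""
    return float(sum(rubric[d] for d in RUBRIC_DIMS)) + (25.0 if rule_pass else 0.0)


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def sequence_ratio(logp_new: List[float], logp_old: List[float]) -> float:
    """GSPO ratio: exp of the mean per-token log-probability difference."""
    diff = sum(n - o for n, o in zip(logp_new, logp_old))
    return math.exp(diff / len(logp_new))


def gspo_objective(ratios: List[float], advantages: List[float], eps: float) -> float:
    """Clipped sequence-level surrogate, averaged over the group (to be maximized)."""
    terms = []
    for s, a in zip(ratios, advantages):
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        terms.append(min(s * a, clipped * a))
    return mean(terms)
```

Restricting the per-token log-probabilities to assistant tokens before calling `sequence_ratio` would be one way to reflect the assistant-only reward attribution; that step is omitted here.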
3. Dataset Construction and Domains
Qwen-AgentWorld is trained on over 10 million environment interaction turns across seven domains:
| Domain | Environment Type |
|---|---|
| MCP | REST/JSON tool calls |
| Search | web_search, web_extractor |
| SWE | code edit, bash, diff |
| Terminal | bash, multiprogram pipelines |
| Android | UI touch/swipe (view hierarchies) |
| Web | click/type (accessibility trees) |
| OS | mouse/keyboard (window trees) |
Data preparation includes schema unification, trajectory expansion, aggressive filtering (to remove retry and no-change turns), and construction of system prompts (task, action space, demonstration, simulation instruction). Disjoint splits are enforced for CPT, SFT, and RL data (Zuo et al., 23 Jun 2026).
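The filtering step can be illustrated with a small sketch. The definitions used here are assumptions: a "retry" is a turn that repeats the previous action verbatim, and a "no-change" turn is one whose observation equals the previous observation. The `Turn` type and `filter_turns` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    action: str
    observation: str


def filter_turns(turns: List[Turn], initial_observation: str = "") -> List[Turn]:
    """Drop retry and no-change turns under the assumed definitions above."""
    kept: List[Turn] = []
    prev_action: Optional[str] = None
    prev_obs = initial_observation
    for turn in turns:
        is_retry = turn.action == prev_action
        is_no_change = turn.observation == prev_obs
        if not (is_retry or is_no_change):
            kept.append(turn)
        # Compare against the immediately preceding turn, kept or not.
        prev_action, prev_obs = turn.action, turn.observation
    return kept
```

For example, in the sequence `ls`, `ls`, `cat a.txt`, `touch b` where the last command leaves the observation unchanged, only the first `ls` and `cat a.txt` survive.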
4. Evaluation: AgentWorldBench
To provide a rigorous assessment, AgentWorldBench is introduced, aggregating real-world interaction traces from five leading agents across nine established benchmarks:
- Benchmarks include Tool Decathlon (32 apps), MCPMark (127 MCP tasks), Terminal-Bench 1.0/2.0, OSWorld-Verified, SWE-Bench Verified/Pro, BFCL v4, and WideSearch.
- Five “actor” agents supply trajectories: Claude Opus 4.8, Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro.
- Each predicted observation is judged by an LLM on five dimensions for reference-anchored, rubric-based evaluation.
- Main results:
| Model | MCP | Search | Term. | SWE | Android | Web | OS | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen3.5 35B | 57.9 | 26.0 | 46.1 | 47.6 | 53.2 | 47.1 | 56.3 | 47.7 |
| Q-AW 35B | 64.8 | 36.7 | 54.0 | 65.6 | 58.2 | 49.6 | 65.9 | 56.4 |
| Qwen3.5 397B | 68.3 | 30.8 | 55.3 | 64.4 | 54.9 | 48.6 | 60.9 | 54.7 |
| Q-AW 397B | 68.2 | 37.8 | 57.7 | 68.5 | 60.2 | 51.0 | 67.9 | 58.7 |
| GPT-5.4 | 70.1 | 37.3 | 53.7 | 66.3 | 60.0 | 51.8 | 68.6 | 58.2 |
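At both 35B and 397B, Qwen-AgentWorld exceeds the corresponding Qwen base model on the overall average and on every domain except MCP at 397B (68.2 vs. 68.3), and it outperforms Claude Sonnet 4.6; at 397B, it achieves the highest overall average (+0.46 over GPT-5.4; 58.7 vs. 58.2 in the rounded table) (Zuo et al., 23 Jun 2026).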
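The per-domain gains over the base models can be recomputed directly from the table. The snippet below only uses the scores listed above; the helper name `gains` is illustrative.

```python
from typing import Dict

DOMAINS = ["MCP", "Search", "Term.", "SWE", "Android", "Web", "OS"]
SCORES = {
    "Qwen3.5 35B": [57.9, 26.0, 46.1, 47.6, 53.2, 47.1, 56.3],
    "Q-AW 35B": [64.8, 36.7, 54.0, 65.6, 58.2, 49.6, 65.9],
    "Qwen3.5 397B": [68.3, 30.8, 55.3, 64.4, 54.9, 48.6, 60.9],
    "Q-AW 397B": [68.2, 37.8, 57.7, 68.5, 60.2, 51.0, 67.9],
}


def gains(model: str, base: str) -> Dict[str, float]:
    """Per-domain score difference (model minus base), rounded to one decimal."""
    return {d: round(m - b, 1)
            for d, m, b in zip(DOMAINS, SCORES[model], SCORES[base])}


g35 = gains("Q-AW 35B", "Qwen3.5 35B")
g397 = gains("Q-AW 397B", "Qwen3.5 397B")
print(g35)
print(g397)
```

At 35B the largest gain is on SWE (+18.0) and the smallest on Web (+2.5); at 397B the gains are smaller, with Search and OS tied at +7.0 and MCP at -0.1.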
5. Application Paradigms
Qwen-AgentWorld supports two principal paradigms for agentic intelligence:
5.1 Decoupled Environment Simulator
- Functions as a standalone simulator for agent RL, capable of zero-shot simulation of 4,000+ real or synthetic environments (e.g., OpenClaw).
- Demonstrates strong scalability (Claw-Eval +4.3, QwenClawBench +7.1).
- Controllability: supports natural-language instructions to induce edge conditions or rare scenarios, improving coverage and robustness.
- In Sim-RL vs. Real-RL comparisons, simulated training using Qwen-AgentWorld can surpass real-environment training in coverage and trajectory diversity.
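A decoupled simulator of this kind can be pictured as an environment whose `step` delegates to a world model. The sketch below is an assumption-laden illustration: `SimulatedEnv`, the prompt layout, and `stub_world_model` are hypothetical, and the stub stands in for an actual model call, which the source does not specify.

```python
from typing import Callable, List, Optional


class SimulatedEnv:
    """Text-only environment whose observations come from a world model."""

    def __init__(self, world_model: Callable[[str], str], task: str,
                 control_instruction: Optional[str] = None) -> None:
        self.world_model = world_model
        self.task = task
        # Optional natural-language instruction used to induce rare scenarios.
        self.control_instruction = control_instruction
        self.history: List[str] = []

    def step(self, action: str) -> str:
        parts = [f"TASK: {self.task}"]
        if self.control_instruction:
            parts.append(f"SIMULATION INSTRUCTION: {self.control_instruction}")
        parts.extend(self.history)
        parts.append(f"ACTION: {action}")
        observation = self.world_model("\n".join(parts))
        self.history.extend([f"ACTION: {action}", f"OBSERVATION: {observation}"])
        return observation


def stub_world_model(prompt: str) -> str:
    """Toy stand-in for a model call; reacts only to one injected condition."""
    return "error: no space left on device" if "disk is full" in prompt else "ok"
```

Constructing the environment with `control_instruction="Pretend the disk is full."` makes the stub return the error observation, which mirrors how an instruction could steer a real world model toward an edge case.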
5.2 Unified Agent Foundation Model
- Pre-trained under world-model RL, Qwen-AgentWorld serves as an agent foundation model, transferring to downstream agentic benchmarks without additional fine-tuning.
- Notable gains: Terminal-Bench 2.0 (+6.3 pts), SWE-Bench Verified (+3.4 pts), WideSearch (+12.8/+6.9 pts on Item/Row F1), with strong results on out-of-domain evaluation.
- Emergent “mental simulation”: The model predicts the probable effects of candidate actions in CoT before choosing an action, refining downstream planning and reliability.
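The idea of simulating candidate actions before committing to one can be shown with an explicit lookahead loop. In the source this happens inside the model's CoT; the sketch below externalizes it for illustration, and `choose_action`, `predict`, and `score` are hypothetical names for caller-supplied components.

```python
from typing import Callable, Sequence, Tuple, TypeVar

S = TypeVar("S")  # state type
A = TypeVar("A")  # action type


def choose_action(state: S, candidates: Sequence[A],
                  predict: Callable[[S, A], S],
                  score: Callable[[S], float]) -> Tuple[A, S]:
    """Pick the action whose predicted next state scores highest."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    predicted = [(a, predict(state, a)) for a in candidates]
    return max(predicted, key=lambda pair: score(pair[1]))


# Toy usage: states are integers, actions add to the state, the goal is 8.
best_action, predicted_state = choose_action(
    3, [1, 5, -2],
    predict=lambda s, a: s + a,
    score=lambda s: -abs(s - 8),
)
```

Here the action 5 is selected because its predicted state, 8, matches the goal exactly.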
6. Practical Implementation and Codebase
Reproducibility is enabled by open-source scripts and configuration files:
- GitHub: https://github.com/Qwen-AI/Qwen-AgentWorld
- Scripts:
  - run_cpt.py (CPT)
  - fine_tune_sft.sh (SFT, up to 256k tokens)
  - train_rl_gspo.sh (RL)
  - bench_evaluate.py (AgentWorldBench)
- Configurations for all stages and prompt/judge templates are versioned for traceability and extensibility.
Quantization formats (“A3B”/“A17B”) enable large-scale deployment at feasible inference costs; prompt template maintenance and LLM-judge calibration are highlighted as critical to training stability (Zuo et al., 23 Jun 2026).
7. Limitations, Extensions, and Future Directions
Stated limitations include reliance on text-only accessibility trees (which limits GUI realism), persistent factuality errors caused by knowledge gaps, and high RL compute costs from prompt processing. Proposed extensions:
- Co-evolutionary self-play between agent and world model
- Multimodal modeling integrating pixels and structure (e.g., accessibility trees/photos)
- Adaptive sim-to-real routing per query for efficient environment switching
- Dynamic tool synthesis wherein the world model invents new APIs to reflect novel affordances
These directions aim to further scale general agent capabilities, simulation fidelity, and transfer to new domains (Zuo et al., 23 Jun 2026).