Qwen-AgentWorld: Agent Simulation & Planning

Updated 24 June 2026

Qwen-AgentWorld is a family of large-scale language world models designed for simulating agent behavior, reasoning, and planning across seven distinct real-world domains.
It employs a three-stage training pipeline (CPT, SFT, and RL) that leverages explicit chain-of-thought reasoning to enhance simulation fidelity and robust decision-making.
The framework integrates transformer architectures with targeted quantization and scratchpad modifications to support both closed-loop agent training and general agent foundation modeling.

Qwen-AgentWorld is a family of large-scale language world models designed for agentic environment simulation, reasoning, and planning, establishing a unified framework for both scalable closed-loop agent training and general agent foundation modeling. These models are notable for their capability to simulate environment trajectories across seven real-world domains via explicit chain-of-thought (CoT) reasoning, trained on over 10 million real-world interaction trajectories. Through a staged CPT→SFT→RL pipeline and leveraging rubric-guided reinforcement, Qwen-AgentWorld demonstrates strong simulation fidelity, broad generalization, and practical utility across a battery of evaluator-curated agent benchmarks (Zuo et al., 23 Jun 2026).

1. Model Architectures

Qwen-AgentWorld comprises two main instantiations: Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B. Both are based on the Qwen3 decoder-only Transformer architecture, customized for world modeling and long CoT reasoning:

Qwen-AgentWorld-35B-A3B:
- Base: Qwen3.5‐35B
- Layers: 40
- Hidden size: 8,192
- Feed-forward size: 32,768
- Attention heads: 32
- Parameters: ~35B
- A3B: roughly 3B parameters activated per token (mixture-of-experts routing, following the Qwen naming convention); includes a sparse CoT prefix head

Qwen-AgentWorld-397B-A17B:
- Base: Qwen3.6‐397B
- Layers: 80
- Hidden size: 15,360
- Feed-forward size: 61,440
- Attention heads: 64
- Parameters: ~397B
- A17B: roughly 17B parameters activated per token (mixture-of-experts routing, following the Qwen naming convention); extended “scratchpad” CoT head after layer 20

Both configurations integrate lightweight “scratchpad” MLPs (≈20M parameters) at every eighth layer during SFT to promote explicit thought-token emission, and use token-type embeddings during RL to constrain reward attribution to assistant regions only. These modifications facilitate interpretable multi-step world simulation and reasoning (Zuo et al., 23 Jun 2026).
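The two mechanisms in this paragraph can be illustrated with a minimal sketch. It assumes 1-indexed layers for the scratchpad placement and a per-token string label ("system", "user", "assistant") standing in for the token-type embeddings; the function names are hypothetical and are not taken from the released code.

```python
def scratchpad_layers(num_layers: int, stride: int = 8) -> list[int]:
    """Return the 1-indexed layers that would carry a scratchpad MLP.

    Assumption: "every eighth layer" means layers 8, 16, 24, ...
    """
    return [i for i in range(1, num_layers + 1) if i % stride == 0]


def mask_reward_to_assistant(per_token_reward: list[float],
                             token_types: list[str]) -> list[float]:
    """Zero out reward on every token that is not in an assistant region."""
    if len(per_token_reward) != len(token_types):
        raise ValueError("reward and token-type sequences must align")
    return [r if t == "assistant" else 0.0
            for r, t in zip(per_token_reward, token_types)]


print(scratchpad_layers(40))
print(mask_reward_to_assistant([1.0, 2.0, 3.0], ["system", "user", "assistant"]))
```

Under these assumptions the 40-layer model would carry five scratchpad MLPs and the 80-layer model ten.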

2. Three-Stage Training Pipeline

The “CPT injects ⟶ SFT activates ⟶ RL sharpens” pipeline undergirds Qwen-AgentWorld’s progressive acquisition of environment modeling capacity:

2.1 Continual Pre-Training (CPT)

Data:
- Trajectories from sandboxed code, containerized terminals, browsers, OS VMs, public code/script logs, professional domain corpora (law, finance, medicine)
Objective:

$\mathcal{L}_{\mathrm{CPT}} = -\sum_{(c,\,a_{\le t},\,o_{<t},\,o_t)\in\mathcal{D}} \log p_\theta(o_t \mid c, a_{\le t}, o_{<t})$

Turn-level masking stratifies action-observation pairs into semantic bins by token overlap, novelty, and ratio, and downsamples low-novelty transitions so that high-novelty transitions are retained at higher rates, improving coverage.
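As a rough illustration of the CPT objective, the sketch below sums the negative log-likelihood over observation tokens only, so context and action tokens condition the prediction without contributing to the loss. The inputs (per-token probabilities and a boolean observation mask) are assumptions made for the example; a real training loop would take them from model logits.

```python
import math


def cpt_loss(token_probs: list[float], is_observation: list[bool]) -> float:
    """Negative log-likelihood summed over observation tokens.

    token_probs[i] is the model probability of the i-th target token;
    is_observation[i] marks tokens that belong to an observation o_t.
    """
    if len(token_probs) != len(is_observation):
        raise ValueError("probabilities and mask must have equal length")
    return -sum(math.log(p)
                for p, keep in zip(token_probs, is_observation) if keep)


# The middle token is an action token, so it is excluded from the loss.
print(cpt_loss([0.5, 0.25, 1.0], [True, False, True]))
```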

2.2 Supervised Fine-Tuning (SFT)

Data:
- 7,094 “thinking” trajectories curated via 3-beam self-critique and rejection sampling (69.2% retention), covering all seven domains.
Objective:

$\mathcal{L}_{\mathrm{SFT}} = -\sum_{(c,\,o_{\le t},\,a_t,\,o_{t+1})} \log p_\theta(o_{t+1}\mid c,o_{\le t},a_t)$

Supervision enforces a ground-truth CoT trace before each next-state prediction, following the pattern “why the next state will look like X, then X.”
System prompt templates are diversified (ten variants) to prevent prompt overfitting.
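A minimal sketch of how one such SFT example might be assembled is given below. The `<think>` delimiters, the `ACTION:` marker, and the placeholder prompt strings are all assumptions for illustration; the source states only that a reasoning trace precedes the predicted state and that ten system prompt variants are used.

```python
import random

# Placeholders: the actual ten prompt variants are not given in the source.
SYSTEM_PROMPT_VARIANTS = [f"[system prompt variant {i}]" for i in range(10)]


def build_sft_example(context: str, action: str, rationale: str,
                      next_observation: str, rng: random.Random) -> dict[str, str]:
    """Pair a randomly chosen system prompt with a 'reason, then predict' target."""
    system = rng.choice(SYSTEM_PROMPT_VARIANTS)
    prompt = f"{system}\n{context}\nACTION: {action}\n"
    # The rationale comes first so the model learns to explain, then predict.
    target = f"<think>{rationale}</think>\n{next_observation}"
    return {"prompt": prompt, "target": target}


example = build_sft_example("cwd=/tmp/demo", "ls",
                            "the directory holds one file", "a.txt",
                            random.Random(0))
print(example["target"])
```

Sampling the system prompt per example, rather than fixing one, is what spreads training across the variants.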

2.3 Reinforcement Learning (RL)

Algorithm: GSPO (Group Sequence Policy Optimization)
Data: 92,308 sampled trajectories (one turn each, context ≤128k tokens)
Reward design (hybrid):

$R(\tau) = \alpha R_\mathrm{rubric}(\tau) + \beta R_\mathrm{rule}(\tau)$

Rubric: LLM-judge evaluates each predicted observation on 5-point scales over Format, Factuality, Consistency, Realism, and Quality. Aggregate:

$R_\mathrm{rubric} = 5 \times \tfrac{1}{5} \sum_{i=1}^{5} \text{score}_i \in [5,25]$

Rule verifier: domain-specific schema/consistency checks (0/1, scaled to [0,25])
Weights: $\alpha:\beta = 9:1$
Loss: $\mathcal{L}_{\mathrm{RL}} = -\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]$
Key strategies:
- One-turn sampling to mitigate shared-prefix collapse.
- Five-dimensional rubric outperforms pairwise/Turing-test signals for RL reward.
- Content-type filtering prevents reward hacking via irrelevant or self-referential outputs.
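The hybrid reward can be worked through in a few lines. This sketch assumes the 9:1 ratio is normalized to α = 0.9 and β = 0.1 (the source gives only the ratio) and that the judge returns integer scores from 1 to 5; the function and dictionary key names are hypothetical.

```python
RUBRIC_DIMENSIONS = ("format", "factuality", "consistency", "realism", "quality")


def rubric_reward(scores: dict[str, int]) -> float:
    """Five times the mean of the five 1..5 judge scores, so the range is [5, 25]."""
    values = [scores[name] for name in RUBRIC_DIMENSIONS]  # KeyError if one is missing
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("each rubric score must lie in 1..5")
    return 5 * (sum(values) / len(values))


def rule_reward(passed: bool) -> float:
    """Binary verifier outcome scaled onto the same 0..25 range."""
    return 25.0 if passed else 0.0


def hybrid_reward(scores: dict[str, int], passed: bool,
                  alpha: float = 0.9, beta: float = 0.1) -> float:
    """R = alpha * R_rubric + beta * R_rule, with an assumed normalization of 9:1."""
    return alpha * rubric_reward(scores) + beta * rule_reward(passed)


scores = {"format": 5, "factuality": 3, "consistency": 4, "realism": 4, "quality": 4}
print(hybrid_reward(scores, passed=True))
```

With this normalization the total stays on the 25-point scale, and a failed rule check costs at most 2.5 points, so the rubric dominates the signal.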

3. Dataset Construction and Domains

Qwen-AgentWorld is trained on over 10 million environment interaction turns across seven domains:

Domain	Environment Type
MCP	REST/JSON tool calls
Search	web_search, web_extractor
SWE	code edit, bash, diff
Terminal	bash, multiprogram pipelines
Android	UI touch/swipe (view hierarchies)
Web	click/type (accessibility trees)
OS	mouse/keyboard (window trees)

Data preparation includes schema unification, trajectory expansion, aggressive filtering (to remove retry and no-change turns), and construction of system prompts (task, action space, demonstration, simulation instruction). Disjoint splits are enforced for CPT, SFT, and RL data (Zuo et al., 23 Jun 2026).
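The filtering step might look like the sketch below. The definitions are assumptions, since the source does not spell them out: a "retry" is taken to be a turn that repeats the previous action verbatim, and a "no-change" turn is one whose observation is identical to the previous observation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    action: str
    observation: str


def filter_turns(turns: list[Turn]) -> list[Turn]:
    """Drop retry and no-change turns, comparing each turn to its raw predecessor."""
    kept: list[Turn] = []
    prev: Turn | None = None
    for turn in turns:
        is_retry = prev is not None and turn.action == prev.action
        is_no_change = prev is not None and turn.observation == prev.observation
        if not (is_retry or is_no_change):
            kept.append(turn)
        prev = turn  # compare against the original sequence, not the filtered one
    return kept


trajectory = [
    Turn("ls", "a.txt"),
    Turn("ls", "a.txt"),         # retry: same action as the previous turn
    Turn("cat a.txt", "hello"),
    Turn("echo done", "hello"),  # no-change: observation did not move
]
print(filter_turns(trajectory))
```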

4. Evaluation: AgentWorldBench

To provide a rigorous assessment, AgentWorldBench is introduced, aggregating real-world interaction traces from five leading agents across nine established benchmarks:

Benchmarks include Tool Decathlon (32 apps), MCPMark (127 MCP tasks), Terminal-Bench 1.0/2.0, OSWorld-Verified, SWE-Bench Verified/Pro, BFCL v4, and WideSearch.
Five “actor” agents supply trajectories: Claude Opus 4.8, Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro.
Each predicted observation is judged by an LLM on five dimensions for reference-anchored, rubric-based evaluation.

Main results (Q-AW denotes Qwen-AgentWorld; per-domain scores and their average):

Model	MCP	Search	Term.	SWE	Android	Web	OS	Avg.
Qwen3.5 35B	57.9	26.0	46.1	47.6	53.2	47.1	56.3	47.7
Q-AW 35B	64.8	36.7	54.0	65.6	58.2	49.6	65.9	56.4
Qwen3.5 397B	68.3	30.8	55.3	64.4	54.9	48.6	60.9	54.7
Q-AW 397B	68.2	37.8	57.7	68.5	60.2	51.0	67.9	58.7
GPT-5.4	70.1	37.3	53.7	66.3	60.0	51.8	68.6	58.2

At both 35B and 397B, Qwen-AgentWorld exceeds the corresponding Qwen base model on average (56.4 vs. 47.7 and 58.7 vs. 54.7) and outperforms Claude Sonnet 4.6; at 397B, it achieves the highest overall average (+0.46 over GPT-5.4) (Zuo et al., 23 Jun 2026).
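The Avg. column appears to be the unweighted mean of the seven domain scores; that reading is an inference from the numbers, not something the source states. The short check below recomputes each mean from the rounded table entries and prints it next to the reported value.

```python
# Per-domain scores in table order (MCP, Search, Term., SWE, Android, Web, OS)
# followed by the reported Avg. value.
ROWS = {
    "Qwen3.5 35B":  ([57.9, 26.0, 46.1, 47.6, 53.2, 47.1, 56.3], 47.7),
    "Q-AW 35B":     ([64.8, 36.7, 54.0, 65.6, 58.2, 49.6, 65.9], 56.4),
    "Qwen3.5 397B": ([68.3, 30.8, 55.3, 64.4, 54.9, 48.6, 60.9], 54.7),
    "Q-AW 397B":    ([68.2, 37.8, 57.7, 68.5, 60.2, 51.0, 67.9], 58.7),
    "GPT-5.4":      ([70.1, 37.3, 53.7, 66.3, 60.0, 51.8, 68.6], 58.2),
}


def domain_mean(scores: list[float]) -> float:
    """Unweighted mean over the seven domains."""
    return sum(scores) / len(scores)


for name, (scores, reported) in ROWS.items():
    print(f"{name}: recomputed={domain_mean(scores):.2f} reported={reported}")
```

Each recomputed mean lands within 0.1 of the reported average. The 397B gap over GPT-5.4 recomputed this way is close to the reported +0.46 but not identical, which would be expected if the reported figure was derived from unrounded scores.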

5. Application Paradigms

Qwen-AgentWorld supports two principal paradigms for agentic intelligence:

5.1 Decoupled Environment Simulator

Functions as a standalone simulator for agent RL, capable of zero-shot simulation of 4,000+ real or synthetic environments (e.g., OpenClaw).
Demonstrates strong scalability (Claw-Eval +4.3, QwenClawBench +7.1).
Controllability: supports natural-language instructions to induce edge conditions or rare scenarios, improving coverage and robustness.
In Sim-RL vs. Real-RL comparisons, simulated training using Qwen-AgentWorld can surpass real-environment training in coverage and trajectory diversity.
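The simulator role can be pictured as an environment whose `step` is answered by a world model and not by a real system. The sketch below is a hypothetical interface: the class name, the prompt layout, and the `predict` callable (any function mapping a prompt string to a predicted observation) are illustrative assumptions and do not describe the released API.

```python
from typing import Callable

Predictor = Callable[[str], str]


class SimulatedEnvironment:
    """Environment whose observations are predicted by a language world model."""

    def __init__(self, predict: Predictor, system_prompt: str, instruction: str = ""):
        self.predict = predict
        self.system_prompt = system_prompt
        self.instruction = instruction  # optional natural-language control signal
        self.history: list[tuple[str, str]] = []

    def _render(self, action: str) -> str:
        lines = [self.system_prompt]
        if self.instruction:
            lines.append(f"SIMULATION INSTRUCTION: {self.instruction}")
        for past_action, past_observation in self.history:
            lines += [f"ACTION: {past_action}", f"OBSERVATION: {past_observation}"]
        lines.append(f"ACTION: {action}")
        return "\n".join(lines)

    def step(self, action: str) -> str:
        """Ask the world model for the next observation and record the turn."""
        observation = self.predict(self._render(action))
        self.history.append((action, observation))
        return observation


# Stub predictor standing in for a real model call.
env = SimulatedEnvironment(lambda prompt: "ok", "You simulate a bash terminal.",
                           instruction="return a timeout on the second call")
print(env.step("ls"))
```

Passing the control instruction through the prompt is one way to realize the controllability described above; an agent being trained would call `step` exactly as it would on a real environment.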

5.2 Unified Agent Foundation Model

Pre-trained under world-model RL, Qwen-AgentWorld serves as an agent foundation model, transferring to downstream agentic benchmarks without additional fine-tuning.
Notable gains: Terminal-Bench 2.0 (+6.3 pts), SWE-Bench Verified (+3.4 pts), and WideSearch (+12.8/+6.9 pts on Item F1/Row F1), with strong results on out-of-domain evaluation.
Emergent “mental simulation”: The model predicts the probable effects of candidate actions in CoT before choosing an action, refining downstream planning and reliability.
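According to the source, this behavior happens inside the model's chain of thought. An explicit, external analogue is sketched below purely to make the idea concrete: each candidate action is passed through a simulator, the predicted outcome is scored, and the best-scoring action is chosen. The `simulate` and `score` callables are hypothetical stand-ins.

```python
from typing import Callable, Sequence


def choose_action(candidates: Sequence[str],
                  simulate: Callable[[str], str],
                  score: Callable[[str], float]) -> str:
    """Pick the candidate whose simulated outcome scores highest."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    return max(candidates, key=lambda action: score(simulate(action)))


# Toy stand-ins: a lookup table as the simulator, and a penalty on errors.
predicted = {"rm -rf build": "error: permission denied", "make clean": "build removed"}
best = choose_action(list(predicted), predicted.__getitem__,
                     lambda obs: 0.0 if "error" in obs else 1.0)
print(best)
```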

6. Practical Implementation and Codebase

Reproducibility is enabled by open-source scripts and configuration files:

GitHub: https://github.com/Qwen-AI/Qwen-AgentWorld
Scripts:
- run_cpt.py (CPT)
- fine_tune_sft.sh (SFT, up to 256k tokens)
- train_rl_gspo.sh (RL)
- bench_evaluate.py (AgentWorldBench)
Configurations for all stages and prompt/judge templates are versioned for traceability and extensibility.

The sparse-activation configurations (“A3B”/“A17B”) enable large-scale deployment at feasible inference costs; prompt template maintenance and LLM-judge calibration are highlighted as critical to training stability (Zuo et al., 23 Jun 2026).

7. Limitations, Extensions, and Future Directions

Stated limitations include reliance on text-only accessibility trees (which limits GUI realism), persistent factuality errors caused by knowledge gaps, and high RL compute cost from prompt processing. Proposed extensions:

Co-evolutionary self-play between agent and world model
Multimodal modeling integrating pixels and structure (e.g., accessibility trees/photos)
Adaptive sim-to-real routing per query for efficient environment switching
Dynamic tool synthesis wherein the world model invents new APIs to reflect novel affordances

These directions aim to further scale general agent capabilities, simulation fidelity, and transfer to new domains (Zuo et al., 23 Jun 2026).

References

Zuo et al. (2026). Qwen-AgentWorld: Language World Models for General Agents.
