- The paper introduces a novel post-training pipeline that pairs SailorFog-QA data synthesis with concise reasoning-trajectory reconstruction to bring superhuman performance to open-source web agents.
- It combines an RFT cold start with the DUPO reinforcement learning algorithm, which yields a 2–3x training speedup, and delivers marked accuracy improvements on benchmarks like BrowseComp-en and BrowseComp-zh.
- Empirical results demonstrate that even smaller models outperform larger open-source baselines and rival proprietary systems on complex web navigation and reasoning tasks.
WebSailor: Advancing Superhuman Reasoning in Open-Source Web Agents
The paper "WebSailor: Navigating Super-human Reasoning for Web Agent" (2507.02592) presents a comprehensive post-training methodology for open-source web agents, targeting the persistent gap in complex information-seeking and reasoning tasks between open-source and proprietary systems. The work is motivated by the observation that proprietary agentic systems (e.g., DeepResearch) have demonstrated superhuman performance on benchmarks such as BrowseComp, primarily due to their ability to systematically reduce extreme uncertainty in vast, unstructured information spaces—a capability largely absent in open-source models.
The authors identify that existing open-source agents are limited by their training paradigms, which focus on tasks with low or easily reducible uncertainty (Levels 1 and 2). In contrast, real-world web navigation and information seeking often demand robust compositional generalization and creative exploration under high, hard-to-reduce uncertainty (Level 3). This inability is reflected in the near-zero accuracy of open-source agents on challenging benchmarks such as BrowseComp-en/zh.
Data Synthesis: SailorFog-QA
To address this, the paper introduces a scalable data synthesis pipeline, SailorFog-QA, which generates Level 3 tasks by:
- Constructing complex, densely interconnected knowledge graphs via random walks over real-world web data, seeded with rare entities.
- Sampling subgraphs to create question-answer pairs that require reasoning over novel compositions of entities and relations.
- Applying deliberate information obfuscation (e.g., vague time references, masked names, qualitative descriptors) to amplify initial uncertainty and prevent shortcut solutions.
This approach ensures that the training data is both structurally and informationally challenging, closely mirroring the complexity of real-world web tasks.
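The paper does not release its synthesis code, so the loop below is a minimal illustrative sketch. The `discover_neighbors` callable stands in for extracting relations from real web pages, and the obfuscation table is a toy example of the vague rewrites the authors describe; all names and values here are assumptions, not the paper's implementation.

```python
import random

def build_graph(seed_entity, discover_neighbors, steps=50):
    """Grow a densely interconnected graph via random walks from a rare seed.

    discover_neighbors(entity) -> list of (relation, entity) is a stand-in
    for scraping facts about the entity from real-world web data.
    """
    graph, frontier = {}, [seed_entity]
    for _ in range(steps):
        node = random.choice(frontier)  # random walk: hubs may be revisited
        for relation, neighbor in discover_neighbors(node):
            graph.setdefault(node, set()).add((relation, neighbor))
            frontier.append(neighbor)
    return graph

def sample_subgraph(graph, k=4):
    """Sample k edges; their novel composition defines one Level-3 question."""
    edges = [(h, r, t) for h, nbrs in graph.items() for r, t in nbrs]
    return random.sample(edges, min(k, len(edges)))

# Toy obfuscation table: exact literals become vague descriptors so the
# question cannot be resolved by a single lookup (illustrative values only).
VAGUE = {
    "22 July 2013": "in the early 2010s",
    "Acme Corp": "a company later embroiled in a patent dispute",
}

def obfuscate(question):
    """Blur concrete details to amplify initial uncertainty."""
    for literal, vague in VAGUE.items():
        question = question.replace(literal, vague)
    return question
```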
Supervision Signal: Reasoning Trajectory Reconstruction
A key challenge in training on such data is obtaining high-quality supervision for long-horizon, multi-step reasoning. The authors propose a two-stage trajectory construction:
- Action-Observation Trace Extraction: Expert open-source LRMs (e.g., QwQ-32B, DeepSeek-R1) are used to generate successful action-observation traces, discarding their verbose and stylized thought processes.
- Short-CoT Reasoning Reconstruction: For each action, a concise, action-oriented "thought" is generated using a separate instruction-following model, ensuring the resulting trajectories are compact and goal-oriented. This mitigates context window overflow and avoids stylistic contamination.
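As a concrete (hypothetical) rendering of this two-stage scheme: the expert's trace is kept only as (action, observation) pairs, and a separate instruction-following model, exposed here as a bare `llm(prompt) -> str` callable, writes one terse thought per step. The prompt wording and record fields are assumptions for illustration.

```python
def reconstruct_trajectory(question, trace, llm):
    """Stage 2: rebuild a compact SFT trajectory from an expert's trace.

    trace: list of (action, observation) pairs from a successful expert run,
           with the expert's verbose thoughts already discarded (stage 1).
    llm:   any instruction-following model, prompt -> text.
    """
    trajectory = []
    history = f"Task: {question}"
    for action, observation in trace:
        # Generate a short, action-oriented rationale conditioned on the
        # history so far, instead of reusing the expert's stylized reasoning.
        thought = llm(
            f"{history}\nThe agent's next action is: {action}\n"
            "In one or two sentences, state the goal this action serves."
        )
        trajectory.append(
            {"thought": thought, "action": action, "observation": observation}
        )
        history += f"\nThought: {thought}\nAction: {action}\nObservation: {observation}"
    return trajectory
```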
Training Methodology: RFT Cold Start and DUPO
The training pipeline consists of:
- Rejection Sampling Fine-Tuning (RFT) Cold Start: A modest SFT phase on filtered, high-quality trajectories to bootstrap tool use and long-horizon reasoning. Only trajectories with correct answers, manageable length (<32k tokens), and sufficient complexity (>5 tool calls) are retained (see the sketch after this list).
- Duplicating Sampling Policy Optimization (DUPO): A reinforcement learning algorithm designed for sample efficiency and training speed. DUPO employs dynamic sampling and duplication within batches to focus on informative cases, achieving a 2–3x speedup over prior methods (e.g., DAPO). The reward function combines format and answer validation, with answer correctness judged by an LLM.
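The filtering rule and DUPO's batch handling can be sketched in a few lines. The field names, the `judge_llm` callable, and the "duplicate informative groups to refill the batch" heuristic are readings of the paper's description, not its released code.

```python
import random

MAX_TOKENS, MIN_TOOL_CALLS = 32_000, 5

def keep_for_rft(traj):
    """Rejection-sampling filter for the cold start: correct answer,
    manageable length, and genuinely multi-step tool use."""
    return (traj["answer_correct"]
            and traj["num_tokens"] < MAX_TOKENS
            and traj["num_tool_calls"] > MIN_TOOL_CALLS)

def dupo_refill(groups, batch_size):
    """DUPO-style dynamic sampling (sketch): discard groups whose rollouts all
    received the same reward (zero advantage, hence no gradient signal), then
    pad the batch by duplicating the remaining informative groups instead of
    waiting on fresh rollouts, which is where the speedup comes from."""
    informative = [g for g in groups if len(set(g["rewards"])) > 1]
    batch = list(informative)
    while informative and len(batch) < batch_size:
        batch.append(random.choice(informative))
    return batch[:batch_size]

def reward(rollout, judge_llm):
    """Combined reward: hard format gate, then LLM-judged answer correctness."""
    if not rollout["format_ok"]:
        return 0.0
    verdict = judge_llm(
        f"Question: {rollout['question']}\nReference: {rollout['gold']}\n"
        f"Answer: {rollout['answer']}\nIs the answer correct? Reply yes or no."
    )
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0
```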
Empirical Results
The WebSailor family (3B, 7B, 32B, 72B) is evaluated on four challenging benchmarks: BrowseComp-en, BrowseComp-zh, GAIA, and Xbench-DeepSearch. The results demonstrate:
- Substantial gains over all open-source baselines: For example, WebSailor-32B achieves 10.5% accuracy on BrowseComp-en and 25.5% on BrowseComp-zh, compared to 2.5% and 14.1% for WebDancer-32B, and 2.8% and 7.3% for WebThinker-RL.
- Performance parity with proprietary systems: WebSailor-72B matches or surpasses Doubao and Grok-3 on BrowseComp-zh and Xbench-DeepSearch, closing the gap with DeepResearch.
- Downward compatibility: Despite being trained exclusively on high-difficulty data, WebSailor maintains strong performance on simpler tasks (e.g., SimpleQA), indicating robust generalization.
Notably, the improvements are not merely a function of model scale; WebSailor-7B outperforms much larger open-source baselines, underscoring the impact of the training methodology.
Analysis and Ablations
The paper provides detailed analyses:
- Task Complexity: The distribution of tool calls in SailorFog-QA closely matches that of BrowseComp, with a long tail of highly complex trajectories, unlike prior datasets skewed toward simplicity.
- RL vs. SFT: RL training yields significant improvements in Pass@1, especially on the most difficult tasks, by reinforcing stable, effective strategies.
- Necessity of Cold Start: Direct RL without RFT cold start fails to acquire long-horizon reasoning, as evidenced by lower tool call counts and inferior final accuracy.
Limitations and Future Directions
The authors acknowledge several limitations:
- The 32k token context limit may restrict the agent's ability to solve even more complex tasks.
- The RL process is limited to 50 steps due to synchronous training inefficiencies.
- WebSailor sometimes "over-thinks" simple questions, though this often reflects cross-verification rather than aimless exploration.
Future work will focus on asynchronous RL frameworks, further scaling of uncertainty-driven data synthesis, and extending agentic capabilities beyond information seeking.
Implications
This work demonstrates that with carefully designed data synthesis, supervision, and RL optimization, open-source web agents can approach or match the performance of proprietary systems on complex reasoning tasks. The methodology provides a blueprint for instilling superhuman reasoning patterns in LLM agents, with implications for a broad range of applications requiring adaptive, multi-step decision making in unstructured environments.
The approach also highlights the importance of training data complexity and the structure of supervision signals in eliciting advanced reasoning capabilities. As open-source models continue to scale, such post-training pipelines will be critical for closing the capability gap with closed systems and enabling transparent, reproducible research in agentic AI.