
WebSailor: Navigating Super-human Reasoning for Web Agent (2507.02592v1)

Published 3 Jul 2025 in cs.CL and cs.AI

Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

Summary

  • The paper introduces a novel post-training pipeline using SailorFog-QA and concise reasoning trajectory reconstruction to empower open-source web agents with superhuman performance.
  • It leverages RFT Cold Start and DUPO techniques to achieve 2–3x training speedups and marked accuracy improvements on benchmarks like BrowseComp-en and BrowseComp-zh.
  • Empirical results demonstrate that even smaller models outperform larger open-source baselines and rival proprietary systems on complex web navigation and reasoning tasks.

WebSailor: Advancing Superhuman Reasoning in Open-Source Web Agents

The paper "WebSailor: Navigating Super-human Reasoning for Web Agent" (2507.02592) presents a comprehensive post-training methodology for open-source web agents, targeting the persistent gap in complex information-seeking and reasoning tasks between open-source and proprietary systems. The work is motivated by the observation that proprietary agentic systems (e.g., DeepResearch) have demonstrated superhuman performance on benchmarks such as BrowseComp, primarily due to their ability to systematically reduce extreme uncertainty in vast, unstructured information spaces—a capability largely absent in open-source models.

Problem Formulation and Motivation

The authors identify that existing open-source agents are limited by their training paradigms, which focus on tasks with low or easily reducible uncertainty (Level 1 and 2). In contrast, real-world web navigation and information-seeking often require robust compositional generalization and creative exploration in scenarios with high and hard-to-reduce uncertainty (Level 3). The inability to handle such tasks is reflected in near-zero accuracy of open-source agents on challenging benchmarks like BrowseComp-en/zh.

Data Synthesis: SailorFog-QA

To address this, the paper introduces a scalable data synthesis pipeline, SailorFog-QA, which generates Level 3 tasks by:

  • Constructing complex, densely interconnected knowledge graphs via random walks over real-world web data, seeded with rare entities.
  • Sampling subgraphs to create question-answer pairs that require reasoning over novel compositions of entities and relations.
  • Applying deliberate information obfuscation (e.g., vague time references, masked names, qualitative descriptors) to amplify initial uncertainty and prevent shortcut solutions.

This approach ensures that the training data is both structurally and informationally challenging, closely mirroring the complexity of real-world web tasks.
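The three synthesis steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy graph, entity names, and string-replacement obfuscation rules are all assumptions standing in for the paper's web-scale knowledge graphs and its actual obfuscation procedure.

```python
import random

def random_walk(graph, seed, steps):
    """Collect (entity, relation, entity) edges by walking outward
    from a rare seed entity."""
    node, edges = seed, []
    for _ in range(steps):
        neighbors = graph.get(node, [])
        if not neighbors:
            break
        relation, nxt = random.choice(neighbors)
        edges.append((node, relation, nxt))
        node = nxt
    return edges

def obfuscate(text, rules):
    """Replace precise mentions with vague descriptors to amplify
    the question's initial uncertainty."""
    for precise, vague in rules.items():
        text = text.replace(precise, vague)
    return text

# Toy graph: entity -> [(relation, neighbor), ...]; hypothetical entities.
graph = {
    "RareComposer": [("born_in", "1873"), ("studied_at", "ConservatoryX")],
    "1873": [],
    "ConservatoryX": [("located_in", "CityY")],
}

edges = random_walk(graph, "RareComposer", steps=2)
question = "Which institution did RareComposer, born in 1873, attend?"
rules = {"1873": "the early 1870s", "RareComposer": "a little-known composer"}
obfuscated = obfuscate(question, rules)
```

In the real pipeline the question is derived from a sampled subgraph rather than a template, but the key property is the same: the obfuscated question no longer admits a single-lookup shortcut.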

Supervision Signal: Reasoning Trajectory Reconstruction

A key challenge in training on such data is obtaining high-quality supervision for long-horizon, multi-step reasoning. The authors propose a two-stage trajectory construction:

  1. Action-Observation Trace Extraction: Expert open-source LRMs (e.g., QwQ-32B, DeepSeek-R1) are used to generate successful action-observation traces, discarding their verbose and stylized thought processes.
  2. Short-CoT Reasoning Reconstruction: For each action, a concise, action-oriented "thought" is generated using a separate instruction-following model, ensuring the resulting trajectories are compact and goal-oriented. This mitigates context window overflow and avoids stylistic contamination.
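The two stages can be sketched as follows. The trace format and the `summarize` stub (standing in for the separate instruction-following model) are assumptions for illustration, not the paper's actual interfaces.

```python
def extract_trace(expert_steps):
    """Stage 1: keep only (action, observation) pairs, discarding the
    expert LRM's verbose, stylized thoughts."""
    return [(s["action"], s["observation"]) for s in expert_steps]

def reconstruct(trace, summarize):
    """Stage 2: prepend a concise, action-oriented thought to each step."""
    trajectory = []
    for action, observation in trace:
        trajectory.append({
            "thought": summarize(action, observation),
            "action": action,
            "observation": observation,
        })
    return trajectory

# Stub standing in for a separate instruction-following model.
def summarize(action, observation):
    return f"Next, {action} to narrow down the answer."

expert_steps = [
    {"thought": "Hmm, lengthy stylized reasoning...",
     "action": "search('rare composer 1870s')",
     "observation": "Top result: ConservatoryX alumni list."},
    {"thought": "More verbose deliberation...",
     "action": "visit('ConservatoryX')",
     "observation": "Page confirms enrollment."},
]

compact = reconstruct(extract_trace(expert_steps), summarize)
```

The point of the split is that the reconstructed trajectory keeps the expert's successful actions while the thoughts come from a different, terse model, so fine-tuning neither inherits the expert's style nor overflows the context window.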

Training Methodology: RFT Cold Start and DUPO

The training pipeline consists of:

  • Rejection Sampling Fine-Tuning (RFT) Cold Start: A modest SFT phase on filtered, high-quality trajectories to bootstrap tool-use and long-horizon reasoning. Only trajectories with correct answers, manageable length (<32k tokens), and sufficient complexity (>5 tool calls) are retained.
  • Duplicating Sampling Policy Optimization (DUPO): A reinforcement learning algorithm designed for sample efficiency and training speed. DUPO employs dynamic sampling and duplication within batches to focus on informative cases, achieving a 2–3x speedup over prior methods (e.g., DAPO). The reward function combines format and answer validation, with answer correctness judged by an LLM.
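The batch-construction idea behind DUPO can be sketched under one simplifying assumption: each task has a group of rollout rewards, and a group whose rewards show no variance contributes zero advantage and thus no gradient signal. The sketch below is illustrative only; the paper's actual algorithm, reward model, and batching details differ.

```python
import random

def build_batch(groups, batch_size):
    """Keep only groups with reward variance, then duplicate informative
    groups within the batch (instead of launching fresh rollouts) to
    fill it to size."""
    informative = [g for g in groups if len(set(g["rewards"])) > 1]
    if not informative:
        return []
    batch = list(informative)
    while len(batch) < batch_size:
        batch.append(random.choice(informative))  # duplicate, no new rollouts
    return batch[:batch_size]

groups = [
    {"task": "q1", "rewards": [1.0, 0.0, 1.0]},  # mixed: informative
    {"task": "q2", "rewards": [0.0, 0.0, 0.0]},  # all-fail: filtered out
    {"task": "q3", "rewards": [1.0, 1.0, 1.0]},  # all-pass: filtered out
]

batch = build_batch(groups, batch_size=4)
```

Duplicating already-sampled informative cases is what avoids the expensive resampling that dynamic-sampling methods such as DAPO incur, which is where the reported 2-3x speedup comes from.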

Empirical Results

The WebSailor family (3B, 7B, 32B, 72B) is evaluated on four challenging benchmarks: BrowseComp-en, BrowseComp-zh, GAIA, and Xbench-DeepSearch. The results demonstrate:

  • Substantial gains over all open-source baselines: For example, WebSailor-32B achieves 10.5% accuracy on BrowseComp-en and 25.5% on BrowseComp-zh, compared to 2.5% and 14.1% for WebDancer-32B, and 2.8% and 7.3% for WebThinker-RL.
  • Performance parity with proprietary systems: WebSailor-72B matches or surpasses Doubao and Grok-3 on BrowseComp-zh and Xbench-DeepSearch, closing the gap with DeepResearch.
  • Downward compatibility: Despite being trained exclusively on high-difficulty data, WebSailor maintains strong performance on simpler tasks (e.g., SimpleQA), indicating robust generalization.

Notably, the improvements are not merely a function of model scale; WebSailor-7B outperforms much larger open-source baselines, underscoring the impact of the training methodology.

Analysis and Ablations

The paper provides detailed analyses:

  • Task Complexity: The distribution of tool calls in SailorFog-QA closely matches that of BrowseComp, with a long tail of highly complex trajectories, unlike prior datasets skewed toward simplicity.
  • RL vs. SFT: RL training yields significant improvements in Pass@1, especially on the most difficult tasks, by reinforcing stable, effective strategies.
  • Necessity of Cold Start: Direct RL without RFT cold start fails to acquire long-horizon reasoning, as evidenced by lower tool call counts and inferior final accuracy.

Limitations and Future Directions

The authors acknowledge several limitations:

  • The 32k token context limit may restrict the agent's ability to solve even more complex tasks.
  • The RL process is limited to 50 steps due to synchronous training inefficiencies.
  • WebSailor sometimes "over-thinks" simple questions, though this often reflects cross-verification rather than aimless exploration.

Future work will focus on asynchronous RL frameworks, further scaling of uncertainty-driven data synthesis, and extending agentic capabilities beyond information seeking.

Implications

This work demonstrates that with carefully designed data synthesis, supervision, and RL optimization, open-source web agents can approach or match the performance of proprietary systems on complex reasoning tasks. The methodology provides a blueprint for instilling superhuman reasoning patterns in LLM agents, with implications for a broad range of applications requiring adaptive, multi-step decision making in unstructured environments.

The approach also highlights the importance of training data complexity and the structure of supervision signals in eliciting advanced reasoning capabilities. As open-source models continue to scale, such post-training pipelines will be critical for closing the capability gap with closed systems and enabling transparent, reproducible research in agentic AI.
