- The paper introduces a novel post-training pipeline that pairs SailorFog-QA data synthesis with concise reasoning-trajectory reconstruction to bring superhuman performance to open-source web agents.
- It combines an RFT cold start with the DUPO reinforcement learning algorithm, which yields a 2–3x training speedup, and delivers marked accuracy improvements on benchmarks like BrowseComp-en and BrowseComp-zh.
- Empirical results demonstrate that even smaller models outperform larger open-source baselines and rival proprietary systems on complex web navigation and reasoning tasks.
WebSailor: Advancing Superhuman Reasoning in Open-Source Web Agents
The paper "WebSailor: Navigating Super-human Reasoning for Web Agent" (2507.02592) presents a comprehensive post-training methodology for open-source web agents, targeting the persistent gap in complex information-seeking and reasoning tasks between open-source and proprietary systems. The work is motivated by the observation that proprietary agentic systems (e.g., DeepResearch) have demonstrated superhuman performance on benchmarks such as BrowseComp, primarily due to their ability to systematically reduce extreme uncertainty in vast, unstructured information spaces—a capability largely absent in open-source models.
The authors identify that existing open-source agents are limited by their training paradigms, which focus on tasks with low or easily reducible uncertainty (Levels 1 and 2). In contrast, real-world web navigation and information seeking often demand robust compositional generalization and creative exploration under high, hard-to-reduce uncertainty (Level 3). This inability is reflected in the near-zero accuracy of open-source agents on challenging benchmarks such as BrowseComp-en/zh.
Data Synthesis: SailorFog-QA
To address this, the paper introduces a scalable data synthesis pipeline, SailorFog-QA, which generates Level 3 tasks by:
- Constructing complex, densely interconnected knowledge graphs via random walks over real-world web data, seeded with rare entities.
- Sampling subgraphs to create question-answer pairs that require reasoning over novel compositions of entities and relations.
- Applying deliberate information obfuscation (e.g., vague time references, masked names, qualitative descriptors) to amplify initial uncertainty and prevent shortcut solutions.
This approach ensures that the training data is both structurally and informationally challenging, closely mirroring the complexity of real-world web tasks.
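The paper does not release its synthesis code, so the loop below is a minimal illustrative sketch. The `discover_neighbors` callable stands in for extracting relations from real web pages, and the obfuscation table is a toy example of the vague rewrites the authors describe; all names and values here are assumptions, not the paper's implementation.

```python
import random

def build_graph(seed_entity, discover_neighbors, steps=50):
    """Grow a densely interconnected graph via random walks from a rare seed.

    discover_neighbors(entity) -> list of (relation, entity) is a stand-in
    for scraping facts about the entity from real-world web data.
    """
    graph, frontier = {}, [seed_entity]
    for _ in range(steps):
        node = random.choice(frontier)  # random walk: hubs may be revisited
        for relation, neighbor in discover_neighbors(node):
            graph.setdefault(node, set()).add((relation, neighbor))
            frontier.append(neighbor)
    return graph

def sample_subgraph(graph, k=4):
    """Sample k edges; their novel composition defines one Level-3 question."""
    edges = [(h, r, t) for h, nbrs in graph.items() for r, t in nbrs]
    return random.sample(edges, min(k, len(edges)))

# Toy obfuscation table: exact literals become vague descriptors so the
# question cannot be resolved by a single lookup (illustrative values only).
VAGUE = {
    "22 July 2013": "in the early 2010s",
    "Acme Corp": "a company later embroiled in a patent dispute",
}

def obfuscate(question):
    """Blur concrete details to amplify initial uncertainty."""
    for literal, vague in VAGUE.items():
        question = question.replace(literal, vague)
    return question
```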
Supervision Signal: Reasoning Trajectory Reconstruction
A key challenge in training on such data is obtaining high-quality supervision for long-horizon, multi-step reasoning. The authors propose a two-stage trajectory construction:
- Action-Observation Trace Extraction: Expert open-source LRMs (e.g., QwQ-32B, DeepSeek-R1) are used to generate successful action-observation traces, discarding their verbose and stylized thought processes.
- Short-CoT Reasoning Reconstruction: For each action, a concise, action-oriented "thought" is generated using a separate instruction-following model, ensuring the resulting trajectories are compact and goal-oriented. This mitigates context window overflow and avoids stylistic contamination.
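As a concrete (hypothetical) rendering of this two-stage scheme: the expert's trace is kept only as (action, observation) pairs, and a separate instruction-following model, exposed here as a bare `llm(prompt) -> str` callable, writes one terse thought per step. The prompt wording and record fields are assumptions for illustration.

```python
def reconstruct_trajectory(question, trace, llm):
    """Stage 2: rebuild a compact SFT trajectory from an expert's trace.

    trace: list of (action, observation) pairs from a successful expert run,
           with the expert's verbose thoughts already discarded (stage 1).
    llm:   any instruction-following model, prompt -> text.
    """
    trajectory = []
    history = f"Task: {question}"
    for action, observation in trace:
        # Generate a short, action-oriented rationale conditioned on the
        # history so far, instead of reusing the expert's stylized reasoning.
        thought = llm(
            f"{history}\nThe agent's next action is: {action}\n"
            "In one or two sentences, state the goal this action serves."
        )
        trajectory.append(
            {"thought": thought, "action": action, "observation": observation}
        )
        history += f"\nThought: {thought}\nAction: {action}\nObservation: {observation}"
    return trajectory
```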
Training Methodology: RFT Cold Start and DUPO
The training pipeline consists of:
- Rejection Sampling Fine-Tuning (RFT) Cold Start: A modest SFT phase on filtered, high-quality trajectories to bootstrap tool use and long-horizon reasoning. Only trajectories with correct answers, manageable length (<32k tokens), and sufficient complexity (>5 tool calls) are retained (see the sketch after this list).
- Duplicating Sampling Policy Optimization (DUPO): A reinforcement learning algorithm designed for sample efficiency and training speed. DUPO employs dynamic sampling and duplication within batches to focus on informative cases, achieving a 2–3x speedup over prior methods (e.g., DAPO). The reward function combines format and answer validation, with answer correctness judged by an LLM.
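The filtering rule and DUPO's batch handling can be sketched in a few lines. The field names, the `judge_llm` callable, and the "duplicate informative groups to refill the batch" heuristic are readings of the paper's description, not its released code.

```python
import random

MAX_TOKENS, MIN_TOOL_CALLS = 32_000, 5

def keep_for_rft(traj):
    """Rejection-sampling filter for the cold start: correct answer,
    manageable length, and genuinely multi-step tool use."""
    return (traj["answer_correct"]
            and traj["num_tokens"] < MAX_TOKENS
            and traj["num_tool_calls"] > MIN_TOOL_CALLS)

def dupo_refill(groups, batch_size):
    """DUPO-style dynamic sampling (sketch): discard groups whose rollouts all
    received the same reward (zero advantage, hence no gradient signal), then
    pad the batch by duplicating the remaining informative groups instead of
    waiting on fresh rollouts, which is where the speedup comes from."""
    informative = [g for g in groups if len(set(g["rewards"])) > 1]
    batch = list(informative)
    while informative and len(batch) < batch_size:
        batch.append(random.choice(informative))
    return batch[:batch_size]

def reward(rollout, judge_llm):
    """Combined reward: hard format gate, then LLM-judged answer correctness."""
    if not rollout["format_ok"]:
        return 0.0
    verdict = judge_llm(
        f"Question: {rollout['question']}\nReference: {rollout['gold']}\n"
        f"Answer: {rollout['answer']}\nIs the answer correct? Reply yes or no."
    )
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0
```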
Empirical Results
The WebSailor family (3B, 7B, 32B, 72B) is evaluated on four challenging benchmarks: BrowseComp-en, BrowseComp-zh, GAIA, and Xbench-DeepSearch. The results demonstrate:
- Substantial gains over all open-source baselines: For example, WebSailor-32B achieves 10.5% accuracy on BrowseComp-en and 25.5% on BrowseComp-zh, compared to 2.5% and 14.1% for WebDancer-32B, and 2.8% and 7.3% for WebThinker-RL.
- Performance parity with proprietary systems: WebSailor-72B matches or surpasses Doubao and Grok-3 on BrowseComp-zh and Xbench-DeepSearch, closing the gap with DeepResearch.
- Downward compatibility: Despite being trained exclusively on high-difficulty data, WebSailor maintains strong performance on simpler tasks (e.g., SimpleQA), indicating robust generalization.
Notably, the improvements are not merely a function of model scale; WebSailor-7B outperforms much larger open-source baselines, underscoring the impact of the training methodology.
Analysis and Ablations
The paper provides detailed analyses:
- Task Complexity: The distribution of tool calls in SailorFog-QA closely matches that of BrowseComp, with a long tail of highly complex trajectories, unlike prior datasets skewed toward simplicity.
- RL vs. SFT: RL training yields significant improvements in Pass@1, especially on the most difficult tasks, by reinforcing stable, effective strategies.
- Necessity of Cold Start: Direct RL without RFT cold start fails to acquire long-horizon reasoning, as evidenced by lower tool call counts and inferior final accuracy.
Limitations and Future Directions
The authors acknowledge several limitations:
- The 32k token context limit may restrict the agent's ability to solve even more complex tasks.
- The RL process is limited to 50 steps due to synchronous training inefficiencies.
- WebSailor sometimes "over-thinks" simple questions, though this often reflects cross-verification rather than aimless exploration.
Future work will focus on asynchronous RL frameworks, further scaling of uncertainty-driven data synthesis, and extending agentic capabilities beyond information seeking.
Implications
This work demonstrates that with carefully designed data synthesis, supervision, and RL optimization, open-source web agents can approach or match the performance of proprietary systems on complex reasoning tasks. The methodology provides a blueprint for instilling superhuman reasoning patterns in LLM agents, with implications for a broad range of applications requiring adaptive, multi-step decision making in unstructured environments.
The approach also highlights the importance of training data complexity and the structure of supervision signals in eliciting advanced reasoning capabilities. As open-source models continue to scale, such post-training pipelines will be critical for closing the capability gap with closed systems and enabling transparent, reproducible research in agentic AI.