WebSailor: Navigating Super-human Reasoning for Web Agent (2507.02592v1)

Published 3 Jul 2025 in cs.CL and cs.AI

Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

Summary

  • The paper introduces a novel post-training pipeline that overcomes high-uncertainty challenges in web-based reasoning tasks.
  • It leverages a scalable data synthesis method, SailorFog-QA, to generate complex, non-linear tasks that require creative, multi-step reasoning.
  • Efficient RL techniques and a robust RFT cold start substantially improve agent performance, matching proprietary systems on key benchmarks.

WebSailor: Advancing Superhuman Reasoning in Open-Source Web Agents

The WebSailor framework addresses a central challenge in the development of web-based LLM agents: enabling robust, superhuman reasoning in complex, high-uncertainty information-seeking tasks. The work systematically analyzes the limitations of existing open-source agents, identifies the absence of advanced uncertainty-reduction strategies as a key bottleneck, and introduces a comprehensive post-training pipeline that closes the performance gap with proprietary systems on benchmarks such as BrowseComp-en/zh.

Problem Formulation and Motivation

The paper frames information-seeking as a process of uncertainty reduction, emphasizing that human cognitive constraints—limited memory, attention, and inability to parallelize exploration—are transcended by agentic LLM systems. Proprietary agents (e.g., DeepResearch) have demonstrated superhuman performance on complex web benchmarks, attributed to their ability to systematically reduce extreme uncertainty in vast, unstructured information spaces. In contrast, open-source agents have failed to generalize beyond Level 1 and Level 2 tasks, which are characterized by low or easily reducible uncertainty and clear solution paths. The inability to handle Level 3 tasks—those with high, hard-to-reduce uncertainty and no predefined reasoning path—has resulted in near-zero accuracy for open-source models on challenging benchmarks.

Data Synthesis: SailorFog-QA

A core contribution is the scalable synthesis of Level 3 information-seeking tasks. The authors construct complex knowledge graphs via random walks over real-world web data, ensuring emergent, non-linear structures that resist simple, linear solution strategies. Subgraphs are sampled to generate question-answer pairs, and deliberate information obfuscation is applied to amplify initial ambiguity. This process yields tasks that require creative, multi-step reasoning and compositional generalization, closely mirroring the complexity of benchmarks like BrowseComp.
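
The construction lends itself to a compact sketch. The following Python is a minimal illustration, not the authors' code: `expand_fn` is an assumed stand-in for real web browsing that returns (relation, neighbor) pairs, and `obfuscate` shows one toy obfuscation rule (blurring exact years).

```python
import random
import re

import networkx as nx


def build_knowledge_graph(seed_entities, expand_fn, steps=200):
    """Grow a knowledge graph by random walks over real-world web data.

    expand_fn(entity) -> list of (relation, neighbor) pairs; here it is
    an assumed wrapper around actual search/browse calls.
    """
    g = nx.DiGraph()
    frontier = list(seed_entities)
    for _ in range(steps):
        node = random.choice(frontier)
        for relation, neighbor in expand_fn(node):
            g.add_edge(node, neighbor, relation=relation)
            frontier.append(neighbor)  # the walk can continue from new nodes
    return g


def sample_subgraph(g, size=4):
    """Sample a small connected subgraph; its facts jointly define one QA pair."""
    nodes = {random.choice(list(g.nodes))}
    while len(nodes) < size:
        neighbors = {n for u in nodes for n in g.successors(u)} - nodes
        if not neighbors:
            break
        nodes.add(random.choice(sorted(neighbors)))
    return g.subgraph(nodes)


def obfuscate(fact: str) -> str:
    """Toy obfuscation: blur exact years into a decade to raise initial
    uncertainty, e.g. 'founded in 2017' -> 'founded in the 2010s'."""
    return re.sub(r"\b(19|20)(\d)\d\b",
                  lambda m: f"the {m.group(1)}{m.group(2)}0s", fact)
```

Question-answer pairs are then written against the sampled subgraph's facts, with the obfuscated phrasing standing in for exact identifiers.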

The resulting SailorFog-QA dataset exhibits a long-tail distribution in tool call requirements, with a significant fraction of tasks demanding more than five tool interactions—substantially exceeding the complexity of prior open-source datasets. Manual evaluation confirms that many generated tasks are intractable for human researchers within practical time constraints, underscoring the superhuman nature of the targeted reasoning.

Trajectory Reconstruction for Supervision

To provide effective supervision for post-training, the authors avoid direct imitation of verbose, stylized reasoning traces from expert open-source LRMs (e.g., QwQ-32B, DeepSeek-R1). Instead, they extract action-observation sequences from successful trajectories and reconstruct concise, action-oriented "short-CoT" thoughts for each step using a separate instruction-following model. This approach mitigates context window overload and stylistic contamination, producing compact, goal-oriented reasoning chains suitable for long-horizon agentic tasks.
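
A minimal sketch of this reconstruction loop follows, assuming a `summarize_llm` callable that wraps the separate instruction-following model; the prompt wording is illustrative, not the paper's.

```python
def reconstruct_short_cot(trajectory, summarize_llm):
    """Rebuild concise, action-oriented thoughts for an expert trajectory.

    trajectory: list of (action, observation) pairs from a successful run.
    summarize_llm: assumed callable, prompt str -> short thought str.
    """
    steps, history = [], []
    for action, observation in trajectory:
        prompt = (
            "Given the actions taken so far and the next action, write one "
            "brief, goal-oriented thought justifying that action.\n"
            f"Actions so far: {history}\n"
            f"Next action: {action}"
        )
        thought = summarize_llm(prompt)  # concise 'short-CoT', not verbose LRM style
        steps.append({"thought": thought, "action": action,
                      "observation": observation})
        history.append(action)
    return steps
```

Because only the reconstructed thoughts and the original actions are imitated, the supervision signal stays compact even for long-horizon trajectories.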

Training Pipeline: RFT Cold Start and DUPO RL

The training pipeline consists of two stages:

  1. Rejection Sampling Fine-Tuning (RFT) Cold Start: A modest set of high-quality, filtered trajectories is used to bootstrap the agent's tool-use and long-horizon reasoning skeleton. Only trajectories with correct final answers, manageable length (<32k tokens), and sufficient complexity (>5 tool calls) are retained. Observations are masked from the loss to focus learning on decision-making (the filtering criteria are sketched after this list).
  2. Duplicating Sampling Policy Optimization (DUPO) RL: The RL phase employs a novel batch construction strategy that duplicates samples with non-zero reward variance, improving rollout efficiency by 2–3x over prior methods. The reward function combines strict format validation with LLM-based answer correctness, and group-relative advantage estimation is used to stabilize policy gradients. This design addresses the extreme sparsity of rewards and the inefficiency of synchronous RL in agentic settings (see the batching sketch below).
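
Both stages can be illustrated compactly. This is a minimal sketch under the stated criteria, assuming trajectories carry `answer_correct`, `num_tokens`, and `num_tool_calls` fields and that rollouts arrive in per-task reward groups; DUPO's full batching logic is more involved, so this only shows the duplicate-sampling idea and GRPO-style group normalization.

```python
import random

MAX_TOKENS = 32_000   # trajectory length cap from the paper
MIN_TOOL_CALLS = 5    # complexity floor from the paper


def keep_for_rft(traj):
    """RFT cold-start filter: correct answer, manageable length, and
    sufficient tool-use complexity. Observation tokens are additionally
    masked out of the SFT loss so learning focuses on decisions."""
    return (traj["answer_correct"]
            and traj["num_tokens"] < MAX_TOKENS
            and traj["num_tool_calls"] > MIN_TOOL_CALLS)


def group_advantages(rewards):
    """Group-relative advantage estimation: normalize each rollout's
    reward against its group's mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]


def build_dupo_batch(groups, batch_size):
    """Duplicating sampling: drop groups with zero reward variance (no
    learning signal), then fill the batch by duplicating informative
    groups instead of waiting on fresh, slow agentic rollouts."""
    informative = [g for g in groups
                   if max(g["rewards"]) != min(g["rewards"])]
    batch = list(informative)
    while batch and len(batch) < batch_size:
        batch.append(random.choice(informative))
    return batch[:batch_size]
```

The reward itself, per the paper, gates on strict format validation before an LLM judge scores answer correctness.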

Empirical Results

WebSailor models (3B, 7B, 32B, 72B) are evaluated on BrowseComp-en/zh, GAIA, and Xbench-DeepSearch. The results demonstrate several key findings:

  • Direct inference is inadequate: All models, including GPT-4.1, achieve near-zero accuracy on BrowseComp-en/zh without external tool use.
  • WebSailor sets a new open-source SOTA: WebSailor-7B achieves 6.7% accuracy on BrowseComp-en, outperforming much larger open-source baselines (e.g., WebDancer-32B at 2.5%). WebSailor-72B reaches 30.1% on BrowseComp-zh, matching proprietary agents like Doubao.
  • RL is essential for stability and efficiency: The RL phase yields substantial improvements in Pass@1, especially on the most complex tasks, and narrows the gap between Pass@1 and Pass@3, indicating increased sample efficiency and solution path stability (the conventional Pass@k estimator is sketched after this list).
  • RFT cold start is indispensable: Direct RL without cold start fails to acquire long-horizon reasoning and tool-use patterns, particularly on the hardest benchmarks.
  • Downward compatibility: Despite being trained exclusively on high-difficulty data, WebSailor maintains strong performance on simpler tasks (e.g., SimpleQA), indicating robust generalization.
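
The Pass@1/Pass@3 comparison above is conventionally computed with the unbiased pass@k estimator from code-generation evaluation; the paper does not spell out its exact protocol, so the following is the standard formula, not a claim about the authors' scripts.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n rollouts (c correct), is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 rollouts of one task, 4 correct.
print(pass_at_k(10, 4, 1))  # 0.4
print(pass_at_k(10, 4, 3))  # ~0.833, a large Pass@1/Pass@3 gap
```

A shrinking gap after RL means the agent more often succeeds on its first attempt, which is what the stability claim refers to.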

Analysis and Implications

The work provides quantitative evidence that the complexity of training data—measured by tool call distribution and ambiguity—directly correlates with agentic reasoning capability. The explicit focus on Level 3 tasks and the avoidance of overfitting to stylized expert traces are critical for eliciting flexible, superhuman strategies. The RL pipeline, particularly the DUPO algorithm, addresses the unique challenges of agentic RL, such as slow rollouts and sparse rewards.

The findings imply that further progress in open-source agentic LLMs will require:

  • Continued development of scalable, high-uncertainty data synthesis pipelines.
  • More efficient, possibly asynchronous, RL frameworks to support longer-horizon training.
  • Exploration of reward functions and supervision signals that incentivize not only correctness but also efficiency and interpretability in reasoning.

Limitations and Future Directions

The current approach is constrained by context window limits (32k tokens) and RL step limits (50), which may cap performance on even more complex tasks. There is also a tendency for "over-thinking" on simple tasks, though this often manifests as cross-verification rather than aimless exploration. Future work should address these bottlenecks and extend the methodology to broader domains beyond information seeking, with the goal of achieving general superhuman agentic performance.

Conclusion

WebSailor demonstrates that with principled data synthesis, targeted trajectory reconstruction, and efficient RL optimization, open-source LLM agents can approach and, in some cases, match the performance of proprietary systems on the most challenging web-based reasoning tasks. The work establishes a new baseline for agentic post-training and provides a blueprint for future research in scalable, superhuman open-source web agents.
