
WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning (2509.13305v1)

Published 16 Sep 2025 in cs.LG and cs.CL

Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

Summary

  • The paper demonstrates that integrating dense, uncertainty-rich synthetic data with a dual-environment RL pipeline bridges the gap between open-source and proprietary web agents.
  • The paper presents a novel synthetic dataset, SailorFog-QA-V2, constructed through random walks on dense knowledge graphs to enhance diverse reasoning capabilities.
  • The paper validates its approach with state-of-the-art performance across benchmarks, highlighting the contributions of scalable RL and the synergy between simulated and real-world training environments.

WebSailor-V2: Advancing Open-Source Web Agents with Synthetic Data and Scalable RL

Introduction

WebSailor-V2 addresses the persistent performance gap between open-source and proprietary web agents in complex, multi-step information-seeking tasks. The system introduces a comprehensive post-training pipeline that integrates advanced synthetic data construction and a scalable reinforcement learning (RL) framework. The core contributions are twofold: (1) SailorFog-QA-V2, a dataset generated from a densely connected knowledge graph with diverse uncertainty types, and (2) a dual-environment RL setup that leverages both high-fidelity simulation and robust real-world environments, all within a closed-loop data-policy feedback system. The agent, built on Qwen3-30B-A3B, achieves state-of-the-art results on challenging benchmarks, outperforming both open-source and much larger proprietary models.

Data Construction: SailorFog-QA-V2

The SailorFog-QA-V2 dataset is constructed to overcome the limitations of prior data generation methods, which typically yield acyclic, tree-like knowledge structures and rely on narrow definitions of uncertainty (primarily obfuscation). The new pipeline constructs a dense knowledge graph by actively seeking cyclic and highly interconnected subgraphs, better reflecting real-world information webs. Subgraph extraction employs a random-walk-based approach, ensuring efficient sampling of non-isomorphic, structurally diverse subgraphs. QA generation is further diversified by introducing multiple uncertainty types beyond obfuscation, targeting a broader spectrum of reasoning skills.

This approach ensures that the agent is exposed to a wide variety of logical relationships and uncertainty scenarios, which is critical for generalization and robust reasoning in open-domain web tasks.
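
As a concrete picture of the random-walk sampling described above, here is a minimal sketch assuming the knowledge graph is held in networkx; the walk length, the deduplication by exact isomorphism checks, and the helper names are illustrative choices, not details from the paper.

```python
import random

import networkx as nx


def random_walk_subgraph(graph: nx.Graph, walk_length: int = 8) -> nx.Graph:
    """Sample a connected region of the knowledge graph via a random walk."""
    start = random.choice(list(graph.nodes))
    visited = {start}
    current = start
    for _ in range(walk_length):
        neighbors = list(graph.neighbors(current))
        if not neighbors:
            break
        current = random.choice(neighbors)
        visited.add(current)
    # The induced subgraph retains any cycles and cross-links among the visited
    # entities, unlike tree-structured expansion.
    return graph.subgraph(visited).copy()


def sample_diverse_subgraphs(graph: nx.Graph, n: int = 100) -> list[nx.Graph]:
    """Collect pairwise non-isomorphic subgraphs to seed QA generation."""
    kept: list[nx.Graph] = []
    while len(kept) < n:
        candidate = random_walk_subgraph(graph)
        # Keep only structures not already represented (exact isomorphism check;
        # fine for a sketch, expensive at scale).
        if not any(nx.is_isomorphic(candidate, g) for g in kept):
            kept.append(candidate)
    return kept
```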

Agentic Framework and Tooling

WebSailor-V2 adopts the ReAct framework for agentic reasoning, prioritizing simplicity and scalability over complex, human-engineered planning. The agent's action space comprises four primary tools: search, visit, Google Scholar, and a Python interpreter, in addition to a terminal "final answer" action. This toolkit enables the agent to perform iterative, tool-augmented reasoning, dynamically gathering and synthesizing information from the web and academic sources, and executing code for computation when necessary.
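
The ReAct control flow itself is compact enough to sketch. The four tool names mirror those listed above, but the policy callable, the tool signatures, and the stopping logic below are hypothetical placeholders rather than the paper's actual interface.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical policy signature: maps the running transcript to
# (thought, action name, action argument). Not the paper's actual interface.
PolicyFn = Callable[[List[str]], Tuple[str, str, str]]


def run_react_agent(
    question: str,
    policy: PolicyFn,
    tools: Dict[str, Callable[[str], str]],  # keys such as "search", "visit", "scholar", "python"
    max_steps: int = 100,
) -> str:
    """Minimal ReAct loop: think -> act (call a tool) -> observe -> repeat."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = policy(history)
        history.append(f"Thought: {thought}")
        if action == "final_answer":            # terminal action ends the episode
            return argument
        observation = tools[action](argument)   # search / visit / scholar / python
        history.append(f"Action: {action}({argument})\nObservation: {observation}")
    return "No answer produced within the step budget."
```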

Scalable Reinforcement Learning Pipeline

The RL pipeline is designed for both scalability and robustness. It consists of two tightly integrated environments:

  • Simulated Environment: Built on an offline Wikipedia corpus, this environment enables high-frequency, low-cost experimentation and data curation. The simulation closely mirrors real-world tool dynamics, state transitions, and reward mechanisms, facilitating rapid algorithmic iteration.
  • Real-World Environment: The agent interacts with live web APIs, including search engines and web page parsers. A unified tool execution interface abstracts away API volatility, employing concurrency management, fault tolerance, and result caching to ensure deterministic and stable tool invocation.
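
As a rough illustration of this reliability layer, a unified wrapper might retry transient failures and cache identical calls, as in the sketch below; the retry count, backoff schedule, and cache keying are assumptions made for the example, not details reported in the paper.

```python
import time
from typing import Callable, Dict, Tuple


class ToolExecutor:
    """Unified tool interface: retries transient failures and caches results."""

    def __init__(self, tools: Dict[str, Callable[[str], str]],
                 max_retries: int = 3, backoff_s: float = 1.0):
        self.tools = tools
        self.max_retries = max_retries
        self.backoff_s = backoff_s
        self.cache: Dict[Tuple[str, str], str] = {}

    def call(self, name: str, argument: str) -> str:
        key = (name, argument)
        if key in self.cache:              # identical calls are served from cache
            return self.cache[key]
        for attempt in range(self.max_retries):
            try:
                result = self.tools[name](argument)
                self.cache[key] = result
                return result
            except Exception:              # treat any failure as transient and back off
                time.sleep(self.backoff_s * (attempt + 1))
        return "TOOL_ERROR: request failed after retries"
```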

A fully automated data synthesis and filtering pipeline dynamically curates the training set based on training dynamics, closing the loop between data generation and policy learning (Figure 1).

Figure 1: The RL framework integrates simulated and real environments with an automated data synthesis and filtering pipeline for closed-loop training.

RL Algorithm and Training Dynamics

The RL algorithm is a variant of GRPO, optimized for token-level policy gradients and employing a leave-one-out strategy for advantage estimation. Negative samples are conservatively filtered to maintain training stability, and large batch/group sizes are used to reduce variance. The training regimen is strictly on-policy, with trajectories sampled from the latest policy.
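
For intuition, a leave-one-out baseline over a group of rollouts for the same question can be computed as in the sketch below; this shows the generic idea (group sizes greater than one assumed) rather than the authors' exact GRPO variant, and the rewards in the example are placeholders.

```python
def leave_one_out_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout = its reward minus the mean reward of the
    *other* rollouts in the same group (leave-one-out baseline)."""
    g = len(rewards)                      # group size, assumed > 1
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]


# Example: four rollouts of the same question with binary success rewards.
print(leave_one_out_advantages([1.0, 0.0, 0.0, 1.0]))
# -> [0.667, -0.667, -0.667, 0.667] (approximately)
```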

Empirical results indicate that data quality and environment stability are more critical than algorithmic details for successful agentic RL. Notably, training directly on human-annotated benchmarks (e.g., BrowseComp) yields inferior results compared to synthetic data, likely due to the latter's more consistent distribution and learnable structure (Figure 2).

Figure 2: RL training dynamics show consistent reward improvement and validation performance across benchmarks.

Experimental Results

WebSailor-V2-30B-A3B achieves strong pass@1 results across multiple benchmarks:

  • BrowseComp-EN: 35.3
  • BrowseComp-ZH: 44.1
  • xBench-DeepSearch: 73.7
  • GAIA: 74.1
  • HLE: 30.6

These results surpass all open-source agents and, in several cases, outperform proprietary models with significantly larger parameter counts (e.g., DeepSeek-V3.1-671B). The SFT cold-start stage is shown to be indispensable, providing a robust initial policy that enables effective RL exploration in sparse-reward, long-horizon tasks (Figure 3).

Figure 3: WebSailor-V2 achieves state-of-the-art performance on BrowseComp-EN and xBench-DeepSearch, outperforming both open-source and proprietary agents.

Analysis: Training and Inference Behavior

Training and Entropy Dynamics

Training curves reveal that RL substantially improves both pass@1 and pass@3 on difficult benchmarks, indicating genuine expansion of problem-solving capacity. On simpler tasks, RL mainly increases sampling efficiency. Policy entropy remains high throughout training, reflecting sustained exploration due to the non-stationary, stochastic nature of web environments. No explicit entropy regularization is required.
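
For reference, policy entropy is typically measured as the expected entropy of the model's next-token distribution; the paper does not give its exact formula, so the standard definition is shown here:

```latex
H(\pi_\theta) = -\,\mathbb{E}_{s}\left[\sum_{a} \pi_\theta(a \mid s)\,\log \pi_\theta(a \mid s)\right]
```

where s is the current context (prompt, prior reasoning, and tool observations) and a ranges over next tokens; persistently high values mean the policy keeps spreading probability across alternatives instead of collapsing onto a single strategy.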

Context Scaling

Increasing the context window from 32k to 128k tokens and allowing up to 100 ReAct iterations yields significant accuracy gains, with diminishing returns beyond 64k context. This demonstrates the importance of long-context modeling for complex, multi-step reasoning (Figure 4).

Figure 4: (Left) Accuracy increases with context length and tool call budget; (Right) RL yields accuracy improvements across four benchmarks.

Case Study: Complex Multi-Step Reasoning

A detailed BrowseComp case study demonstrates the agent's ability to decompose complex queries, adapt search strategies, identify key information pivots, and systematically verify hypotheses using multiple tools. The agent's workflow mirrors expert human research methodologies, including iterative hypothesis generation, targeted information retrieval, and cross-source verification (Figures 5-7).

Figure 5: Stepwise reasoning and tool use in a complex BrowseComp case.

Figure 6: Continued reasoning and evidence synthesis in the same case.

Figure 7: Final answer synthesis with comprehensive citation and justification.

Implications and Future Directions

WebSailor-V2 demonstrates that with high-quality synthetic data and a robust RL pipeline, open-source models of moderate scale can match or exceed the performance of much larger proprietary systems in agentic web research tasks. The results underscore the primacy of data and environment engineering over algorithmic novelty in current agentic RL. The modular, scalable framework paves the way for further advances in:

  • Automated, closed-loop data generation and filtering
  • More sophisticated uncertainty modeling in QA generation
  • Integration of advanced context management and memory modules
  • Transfer to other domains requiring tool-augmented, long-horizon reasoning

Conclusion

WebSailor-V2 establishes a new standard for open-source web agents by combining dense, uncertainty-rich synthetic data with a scalable, stable RL pipeline. The system achieves state-of-the-art results on challenging benchmarks, closing the gap with proprietary agents and demonstrating that careful system engineering—particularly in data and environment design—is the key determinant of agentic performance. This work provides a robust foundation for future research in scalable, tool-augmented AI agents.

Explain it Like I'm 14

Overview

This paper introduces WebSailor‑V2, an AI “web agent” designed to do deep research on the internet. Think of it as a smart assistant that can search, read webpages, check academic papers, run code, and use reasoning to answer hard, multi‑step questions. The authors present a full recipe for building such an agent: how to make the right training data, how to teach it with supervised learning, and how to improve it further using reinforcement learning. Their goal is to help open‑source systems catch up to powerful, closed‑source research tools from big companies.

Key Questions the Paper Tries to Answer

  • How can we create training data that teaches an AI to handle messy, real‑world web research, not just simple fact lookups?
  • How can we train an AI agent at scale, safely and cheaply, when real web tools are slow, unpredictable, or expensive?
  • Can a well‑trained, mid‑sized open‑source model match or beat much larger or proprietary systems at deep research tasks?

Methods and Approach

The Agent’s Brain and Behavior

  • The agent uses a framework called ReAct, which follows a simple loop: think → act (use a tool) → observe (see results) → repeat. This avoids complicated plans and lets the model learn general reasoning strategies.
  • It’s built on a 30‑billion‑parameter model (Qwen3‑30B‑A3B), a mixture‑of‑experts (MoE) type of model that is efficient but strong.

The agent can use four tools:

  • Search: ask a search engine and get results.
  • Visit: open a webpage and summarize what’s relevant.
  • Google Scholar: search for academic papers.
  • Python: run code in a safe sandbox to calculate or analyze.

Better Training Data: SailorFog‑QA‑V2

To train smart research skills, the authors made a new dataset based on a dense “knowledge graph.” Imagine a huge web of connected facts about people, companies, events, dates, and more. Instead of growing this graph in a simple, tree‑like way, they deliberately added many cross‑links and loops, like a real web of information.

From this graph, they create question‑answer tasks that force the agent to:

  • Deal with uncertainty, not just “hidden names” (obfuscation) — for example, clues might be vague or indirect, making the agent infer and verify.
  • Use different reasoning patterns, because questions target different “roles” of nodes in the graph.
  • Follow realistic research steps, including tracking which searches and links led to new facts.

They also extract subgraphs using “random walks” (like wandering through connected facts) to sample a wide variety of structures efficiently.

Two‑Stage Training: SFT then RL

  • Supervised Fine‑Tuning (SFT) cold start: First, they teach the model with high‑quality, synthetic examples from SailorFog‑QA‑V2. This gives the agent a solid “starter strategy,” so it doesn’t flail during later training.
  • Reinforcement Learning (RL): Then they let the agent practice solving tasks and learn from rewards (successes and failures), updating its policy over time.

A key innovation is the dual RL environment:

  • Simulated environment: A fast, cheap “practice internet” built from offline Wikipedia. It’s stable, controllable, and perfect for quickly testing ideas.
  • Real environment: Carefully managed connections to real web tools. The authors add caching, retries, backups, and rate limits so training isn’t ruined by slow or broken APIs.

Finally, they use a closed feedback loop: as training progresses, the system automatically generates and filters new data to match the agent’s needs, improving quality and stability over time.
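
The paper does not spell out the filtering rules, but one simple way to "match the data to the agent's needs" is to keep only questions the current policy solves some, but not all, of the time. The thresholds, rollout count, and function names below are invented for illustration.

```python
from typing import Callable, List


def filter_by_solve_rate(
    questions: List[str],
    attempt: Callable[[str], bool],   # run the current policy once; True if it solved the question
    rollouts: int = 8,
    low: float = 0.1,
    high: float = 0.9,
) -> List[str]:
    """Keep questions that are neither trivially easy nor hopelessly hard
    for the current policy (thresholds are illustrative, not from the paper)."""
    kept = []
    for q in questions:
        successes = sum(attempt(q) for _ in range(rollouts))
        rate = successes / rollouts
        if low <= rate <= high:
            kept.append(q)
    return kept
```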

A Simple View of the RL Algorithm

The RL method (based on GRPO) rewards the agent for good trajectories and carefully handles bad ones to keep training stable. It:

  • Samples fresh attempts using the newest policy (on‑policy).
  • Uses token‑level learning signals.
  • Reduces noise by comparing each attempt to the group average.
  • Excludes some “empty” failures (like answers cut off by length limits) so they don’t destabilize learning.

The authors emphasize that the algorithm matters, but the biggest wins come from great data and a stable training setup.

Main Findings and Why They Matter

The agent was tested on several tough benchmarks:

  • BrowseComp‑EN (English) and BrowseComp‑ZH (Chinese): Hard internet research tasks with multiple clues and steps.
  • xBench‑DeepSearch and GAIA: Tool‑use and reasoning tests, including academic tasks.
  • Humanity’s Last Exam (HLE): Deep academic questions from experts worldwide.

Results:

  • WebSailor‑V2 scored 35.3 (EN) and 44.1 (ZH) on BrowseComp — the best among open‑source agents.
  • It scored 73.7 on xBench‑DeepSearch and 74.1 on GAIA, competitive with top proprietary systems.
  • On HLE, it scored 30.6 — beating even some much larger models, including a 671‑billion‑parameter system (DeepSeek‑V3.1) in that benchmark.

Key takeaways:

  • A strong data pipeline plus stable RL can make a mid‑sized open‑source model perform like (or better than) giant models on deep research tasks.
  • The SFT stage is crucial. Without a good starting policy, RL fails to learn well because successful trajectories (and rewards) are rare in complex tasks.
  • RL improved first‑try accuracy most on simpler tasks (better choosing the right path immediately), and raised both first‑try and multiple‑try accuracy on harder tasks (genuinely boosting problem‑solving ability).
  • Longer context helps: increasing the context window to 128k tokens allowed the agent to track long research sessions and improved accuracy, though most correct cases fit within 64k.

Implications and Potential Impact

This work shows a practical path to building strong, open‑source research agents:

  • High‑fidelity training data that mirrors real web complexity teaches robust reasoning.
  • A scalable, dual‑environment RL setup lets teams iterate fast and then reliably train in the real world.
  • Careful tool management makes training affordable and stable.
  • Smaller models, trained well, can challenge or beat much larger ones on difficult research tasks.

If widely adopted, these ideas could:

  • Narrow the gap between open and closed AI systems in deep research.
  • Make advanced research assistants more accessible to students, journalists, scientists, and small organizations.
  • Encourage future work to add smarter context management, better report writing, and broader toolkits — on top of a strong, simple ReAct core.

In short, WebSailor‑V2 is a blueprint: build rich data, train safely at scale, and let a straightforward agent loop learn powerful, general research skills.

Knowledge Gaps

Below is a focused list of concrete knowledge gaps, limitations, and open questions left unresolved by the paper that future work could act on:

  • Data: The paper claims “a wider array of uncertainty definitions” beyond obfuscation but does not enumerate or taxonomize them, report their proportions, or ablate their individual contributions to capability gains.
  • Data: No distributional statistics are provided for SailorFog-QA-V2 (graph size, entity/type coverage, degree distributions, cyclicity metrics), making it hard to assess structural diversity or replicate the sampling regime.
  • Data: Decontamination procedures are not specified; it is unclear whether SailorFog-QA-V2 (or SFT/RL data) overlaps with evaluation benchmarks (BrowseComp, HLE, xBench-DeepSearch), or how potential leakage is prevented/measured.
  • Data: The symbiotic data-policy feedback loop is described conceptually, but concrete mechanisms (trigger criteria, sampling weights, filtering rules, and stopping conditions) are not specified or evaluated.
  • Data: The dataset’s temporal coverage and handling of time-sensitive facts (e.g., evolving web content) are unspecified; robustness to temporal drift is untested.
  • Data: Multilingual scope is limited to EN and ZH; there is no analysis of transfer to other languages or scripts, nor evaluation on region-specific search ecosystems (e.g., Baidu, Bing, Yandex).
  • Data: Modality scope is restricted—GAIA is evaluated on a text-only subset; no support or experiments for images, PDFs with tables/figures, or scanned documents are provided.
  • Simulator: The Wikipedia-only simulated environment lacks quantitative fidelity validation; no metrics quantify sim-to-real gap (e.g., action-observation distribution divergence, transfer ratio, or performance drop).
  • Simulator: Search in simulation (offline Wikipedia) does not mirror live web ranking, snippets, or noise; the effect of index differences on learned policies is not measured.
  • Simulator: Reward definitions/mechanics in simulation and real environments are not specified; it is unclear whether rewards are sparse, shaped, or aligned between environments.
  • Training environment: The managed real environment’s reliability layer (caching, retries, fallback tools) is described, but no throughput, latency, failure-rate, or cost metrics are reported to establish scalability and reproducibility.
  • RL algorithm: Key choices (token-level GRPO variant, leave-one-out baseline, negative-sample filtering) lack ablations; the contribution of each to stability and performance is not quantified.
  • RL algorithm: Criteria for excluding “negative samples” are heuristically described but not formalized; the impact on bias, credit assignment, and sample efficiency is unknown.
  • RL algorithm: No comparisons to alternative objectives (e.g., value-based critics, KL-control, entropy regularization schedules, trajectory-level rewards) or to off-policy methods are provided.
  • RL dynamics: High policy entropy is observed but not connected to inference-time behavior; how entropy/exploration translates to stability, determinism, and user-facing reliability at deployment is not evaluated.
  • SFT: The provenance and quality control of SFT trajectories (which open models generated them, acceptance rates, error types, and filtering thresholds) are not detailed; risk of codifying systematic errors is unassessed.
  • Generalization: The agent is evaluated mainly on web research benchmarks; transfer to other agentic domains (GUI interaction, APIs, forms/auth flows, multi-step workflows with state) is unexplored.
  • Tools/action space: The toolset is limited (search, visit, Google Scholar, Python); no support or evaluation for PDFs, citations export/verification, code execution with data files, spreadsheets, or structured API use.
  • Security/safety: No evaluation of prompt injection, malicious content, data exfiltration, or code-sandbox escape risks; no red-teaming or defense strategies are reported.
  • Robustness: Sensitivity to non-stationary search results, regional restrictions, CAPTCHAs/paywalls, crawler blocking, and content variability is not stress-tested.
  • Evaluation: Heavy reliance on LLM-as-a-judge lacks calibration; inter-judge reliability, adjudication protocols, and sensitivity to judge model choice are not reported.
  • Evaluation: Results are pass@1 with non-zero temperature; there is no systematic analysis across seeds/temperatures, no confidence intervals, and limited pass@k reporting—making significance and robustness unclear.
  • Evaluation: Fairness controls are missing—tool-call budgets, context limits, and inference parameters for baselines (especially proprietary systems) are not standardized or normalized by cost.
  • Compute/reporting: Training compute, token budgets, wall-clock time, and financial cost are not disclosed; reproducibility (seeds, exact checkpoints, tool versions) is limited without these details.
  • Failure analysis: There is no systematic breakdown of failure modes (by uncertainty type, graph topology, domain, or tool failure), hindering targeted data or algorithmic improvements.
  • Long-context behavior: While 128k context is used, strategies for context management (memory, summarization, retrieval, forgetting curves) and robustness to context overflow are not studied.
  • Temporal robustness: No longitudinal evaluation is presented (re-running months later) to quantify performance drift due to web changes and tool variability.
  • Ethical/legal: The paper does not discuss compliance with site terms, robots.txt, rate limits, or licensing when caching content (including Google Scholar); governance and auditability are unspecified.
  • Benchmark scope: HLE and other benchmarks may have partial or evolving content overlap with training corpora (e.g., Wikipedia); explicit contamination checks and replication artifacts are absent.
  • Presentation layer: The model underperforms slightly on report-writing vs retrieval; no concrete methods or ablations are provided to improve discourse quality, citation formatting, or factual grounding in generated reports.