WebSailor-7B/32B: Agentic LLMs for Web Research
- WebSailor-7B/32B are large language model web agents that integrate minimally modified Transformer architectures with specialized tool tokens for dynamic web information retrieval.
- They employ synthetic high-uncertainty data generation and advanced RL techniques like DUPO to optimize multi-hop reasoning and tool-assisted query processing.
- Benchmark evaluations demonstrate significant performance gains, bridging the gap between open-source models and proprietary agentic systems in complex web tasks.
WebSailor-7B and WebSailor-32B are LLM web agents engineered to address the challenge of open-domain information seeking and reasoning across vast, uncertain web landscapes. Built on the Qwen-2.5 family, these models employ a minimally modified Transformer architecture augmented for agentic tool use, incorporating modern RL approaches, novel synthetic data pipelines, and advanced test-time ensemble decoding. WebSailor agents are evaluated on challenging benchmarks (BrowseComp, GAIA, XbenchDeepSearch) and demonstrate state-of-the-art open-source performance, closing the gap toward proprietary agentic systems such as DeepResearch (Li et al., 3 Jul 2025, Wang et al., 2 Dec 2025).
1. Architectural Foundation and Agent-Specific Adaptations
WebSailor-7B is derived from Qwen-2.5-7B (28 layers, hidden size 3584, FFN dimension 18944, 28 attention heads with 4 KV heads), while WebSailor-32B uses Qwen-2.5-32B (64 layers, hidden size 5120, FFN dimension 27648, 40 attention heads with 8 KV heads). Both variants retain standard architectural elements (rotary position embeddings, RMSNorm, SwiGLU activation) and integrate FlashAttention kernels for training/inference efficiency.
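For reference, the two backbone configurations can be summarized as plain data; the dataclass below is purely illustrative, and its values simply mirror the published Qwen-2.5 configurations cited above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackboneConfig:
    layers: int
    hidden_size: int
    ffn_dim: int
    attention_heads: int
    kv_heads: int  # grouped-query attention

# Qwen-2.5 backbone dimensions reused unchanged by the two WebSailor variants.
WEBSAILOR_7B = BackboneConfig(layers=28, hidden_size=3584, ffn_dim=18944,
                              attention_heads=28, kv_heads=4)
WEBSAILOR_32B = BackboneConfig(layers=64, hidden_size=5120, ffn_dim=27648,
                               attention_heads=40, kv_heads=8)
```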
Agentic extensions are limited to the addition of special tokens for tool-reasoning tags: <think>…</think>, <tool_call>…</tool_call>, <tool_response>…</tool_response>, <answer>…</answer>. No changes are made to attention mechanisms, depth, or underlying planning components. The placement and format of these tokens enable seamless interleaving of natural-language chains-of-thought and explicit API/tool calls within the autoregressive generation process (Li et al., 3 Jul 2025).
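The resulting interleaved trajectory format can be illustrated with a short, hypothetical transcript; the rollout contents and the extraction helper below are invented for illustration, assuming tool calls are serialized as JSON inside the <tool_call> tags:

```python
import json
import re

# Hypothetical slice of a WebSailor-style rollout; the query, tool arguments,
# and answer are invented for illustration only.
trajectory = """<think>The question mentions a founder whose surname starts with 'F';
search for the company first.</think>
<tool_call>{"name": "search", "arguments": {"query": "obscure 2010s startup founder surname F"}}</tool_call>
<tool_response>[{"title": "...", "snippet": "..."}]</tool_response>
<think>The snippet is inconclusive, so a follow-up visit would come next.</think>
<answer>Example Corp</answer>"""

# Pull out every serialized tool call so it can be routed to the matching API.
tool_calls = [
    json.loads(m.group(1))
    for m in re.finditer(r"<tool_call>(.*?)</tool_call>", trajectory, re.DOTALL)
]
print(tool_calls[0]["name"])  # -> "search"
```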
2. Synthetic High-Uncertainty Data Generation
WebSailor leverages the "SailorFog-QA" engine to synthesize the high-uncertainty QA corpora central to its superhuman reasoning performance. Starting from a rare or obscure seed entity, a search-and-visit protocol constructs a knowledge graph with cycles and dense relational connectivity. Connected subgraphs of 3–6 entities are randomly sampled, and questions are generated by chaining relations within these subgraphs.
Intentional obfuscation injects information uncertainty, such as vague dates (“early 2010s”), masked names (“initial ‘F’”), or qualitative values (“<1% market share”), while the obfuscated constraints still determine the ground-truth answer uniquely. This methodology sharply biases the training distribution toward queries that require nontrivial navigation and inference, moving beyond classical lookup and simple fact extraction (Li et al., 3 Jul 2025).
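A schematic sketch of this synthesis-and-obfuscation loop is given below; the triples, helper names, and obfuscation rules are invented stand-ins for the unpublished pipeline:

```python
import random

# Hypothetical relation triples harvested by the search-and-visit crawler around
# a seed entity; the real pipeline extracts these from live web pages.
triples = [
    ("Acme Labs", "was founded in", "2012"),
    ("Acme Labs", "was founded by", "Franz Example"),
    ("Acme Labs", "holds a market share of", "0.8%"),
]

def obfuscate(value: str) -> str:
    """Blur a concrete value into a vague but still answer-determining constraint."""
    if value.isdigit() and len(value) == 4:      # a year -> a vague period
        return f"the early {value[:3]}0s"
    if value.endswith("%"):                      # an exact figure -> a bound
        return "under 1%"
    if " " in value:                             # a full name -> an initial
        return f"someone whose first name starts with '{value[0]}'"
    return value

# Chain the sampled relations into one multi-hop, high-uncertainty question.
subgraph = random.sample(triples, k=3)
clues = "; ".join(f"{rel} {obfuscate(obj)}" for _, rel, obj in subgraph)
question = f"Identify the company that {clues}."
print(question)
```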
3. Training Pipeline: RFT Cold Start and RL with DUPO
Training proceeds in two phases. First, Rejection-Sampling Fine-Tuning (RFT) cold-starts the model on expert trajectories filtered for complexity and correctness. Only records with correct final answers, moderate length (<32k tokens), and ≥6 tool calls are retained. Cross-entropy loss is computed over the model-generated tokens, with tool-response tokens masked out of the loss.
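A plausible form of this masked objective (the notation here is introduced for exposition, not taken verbatim from the paper), with $m_t = 0$ on tool-response tokens and $m_t = 1$ on model-generated tokens of a trajectory $y$ given prompt $x$:

$$
\mathcal{L}_{\text{RFT}} = -\frac{1}{\sum_{t=1}^{T} m_t}\sum_{t=1}^{T} m_t \log \pi_\theta\!\left(y_t \mid y_{<t}, x\right)
$$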
AdamW optimization is applied with a cosine-decay learning-rate schedule and weight decay (Li et al., 3 Jul 2025).
Second, RL optimization utilizes Duplicating Sampling Policy Optimization (“DUPO”, Editor's term), an in-batch duplication algorithm that improves gradient diversity and computational efficiency over vanilla PPO. For each QA pair, G parallel rollouts are generated. The reward combines format adherence and answer correctness, and policy updates use a clipped surrogate objective.
Advantage estimation is based on returns normalized across the G rollouts for each query (Li et al., 3 Jul 2025).
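A sketch of this update in standard clipped-surrogate form, assuming GRPO-style group normalization of rewards $r_1,\dots,r_G$ (the symbols $\hat{A}_i$, $\rho_{i,t}$, and $\epsilon$ are notational assumptions, not quoted from the paper):

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}, \qquad
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\!\Big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}\big(\rho_{i,t},\,1-\epsilon,\,1+\epsilon\big)\hat{A}_i\Big)\right]
$$

where $\rho_{i,t} = \pi_\theta(y_{i,t}\mid y_{i,<t},x)\,/\,\pi_{\theta_{\text{old}}}(y_{i,t}\mid y_{i,<t},x)$ is the per-token importance ratio.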
4. Test-Time Ensemble Reasoning: ThinkMerge Decoding
For deep-research scenarios, ThinkMerge logit averaging is deployed at test time to merge K parallel reasoning traces per query (K = 2, 4, 8). After each trace's “think” span, the pre-softmax logits of the K streams are averaged, and the merged next-token distribution is obtained by applying softmax to the averaged logits.
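In notation introduced here, with $z_t^{(k)}$ the pre-softmax logits of trace $k$ at decoding step $t$:

$$
\bar{z}_t = \frac{1}{K}\sum_{k=1}^{K} z_t^{(k)}, \qquad p_t = \operatorname{softmax}\!\left(\bar{z}_t\right)
$$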
Tokens are then sampled or selected greedily and broadcast to synchronize all streams. This combination preserves the diverse access paths of the parallel traces while providing continuous token-level voting, which suppresses hallucinations and keeps the traces aligned for coherent subsequent generation. Empirically, ThinkMerge yields up to double-digit absolute gains for WebSailor agents on BrowseComp, GAIA, and XbenchDeepSearch (Wang et al., 2 Dec 2025).
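A minimal decoding-loop sketch of this scheme, assuming a generic forward_logits interface (invented here) that returns next-token logits for a single trace:

```python
import torch

def thinkmerge_step(forward_logits, traces, greedy=True):
    """One merged decoding step over K parallel traces.

    forward_logits(trace) -> 1-D tensor of vocabulary logits (assumed interface);
    traces is a list of K token-id lists that share the same prompt.
    """
    # Average pre-softmax logits across the K reasoning traces.
    logits = torch.stack([forward_logits(t) for t in traces])  # (K, vocab)
    merged = logits.mean(dim=0)                                # (vocab,)

    # Convert the averaged logits into one shared next-token distribution.
    probs = torch.softmax(merged, dim=-1)
    token = int(probs.argmax()) if greedy else int(torch.multinomial(probs, 1))

    # Broadcast the chosen token so every stream stays synchronized.
    for trace in traces:
        trace.append(token)
    return token
```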
5. Empirical Evaluation and Benchmarking
WebSailor agents are rigorously evaluated using pass@1 accuracy (%, with strict tool-call and answer validation) on key multi-hop web benchmarks; TM@K denotes ThinkMerge decoding with K merged traces:
| Benchmark | WS-7B (base) | WS-7B (TM@4) | Δ (pts) | WS-32B (base) | WS-32B (TM@4/8) | Δ (pts) |
|---|---|---|---|---|---|---|
| GAIA | 35.52 | 41.26 | +5.74 | 46.64 | 51.46 | +4.82 |
| XbenchDeepSearch | 37.80 | 48.00 | +10.2 | 50.40 | 57.60 (K=8) | +7.20 |
| BrowseComp-en | 6.30 | 13.60 | +7.30 | 11.80 | 14.50 (K=8) | +2.70 |
| BrowseComp-zh | 14.01 | 24.91 | +10.9 | 21.97 | 28.37 | +6.40 |
Baseline results (pre-ThinkMerge) for WS-32B surpass all prior open-source agents and approach proprietary system performance; ThinkMerge amplifies gains further (Wang et al., 2 Dec 2025). Ablation studies demonstrate DUPO’s wall-clock efficiency and the critical importance of obfuscated, high-uncertainty training examples (Li et al., 3 Jul 2025).
6. Analysis and Deployment Considerations
WebSailor advances uncertainty reduction through a combination of hard QA synthesis, RFT-initiated chains-of-thought, and robust RL stabilization. For practical deployment, limiting tool calls per session (<30), caching repeated queries, and monitoring for excessive reasoning are recommended. Removing either the obfuscation step or the RFT cold start substantially degrades final accuracy and tool-use depth, while DUPO yields converged policies in less than half the wall-clock time of PPO (Li et al., 3 Jul 2025).
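These guardrails can be captured in a thin wrapper around the tool layer; the class and parameter names below are invented for illustration, enforcing a per-session call budget and caching repeated queries:

```python
class ToolSession:
    """Per-session guardrails around a search tool (illustrative sketch only)."""

    def __init__(self, search_fn, max_calls: int = 30):
        self.search_fn = search_fn    # underlying web-search callable
        self.max_calls = max_calls    # hard per-session tool-call budget
        self.calls = 0
        self.cache = {}

    def search(self, query: str) -> str:
        # Serve repeated queries from the cache without spending budget.
        if query in self.cache:
            return self.cache[query]
        if self.calls >= self.max_calls:
            raise RuntimeError("Tool-call budget exhausted; the agent must answer now.")
        self.calls += 1
        result = self.search_fn(query)
        self.cache[query] = result
        return result
```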
A plausible implication is that sub-7B agents may not benefit from ThinkMerge if too few high-quality traces are produced at K>2. Larger agents surpass critical reasoning thresholds, allowing logit averaging to amplify correct trajectories (Wang et al., 2 Dec 2025).
7. Limitations and Prospective Directions
While WebSailor-7B/32B approach proprietary systems on these benchmarks, model scale remains an open variable; contemporary results do not extend to MoE-style models below 30B parameters. Report style may lag behind retrieval accuracy, so further SFT or post-processing for output polish is suggested. Tool volatility and nonstationary online environments create persistent evaluation challenges, motivating research on differentiable simulators and adversarial curriculum sampling. Extensions to graph-structured planning or multi-agent debate protocols remain an undeveloped frontier (Li et al., 16 Sep 2025, Li et al., 3 Jul 2025).
In summary, WebSailor-7B and -32B combine principled architectural minimalism, sophisticated uncertainty-centric data synthesis, efficient RL, and advanced parallel decoding to set new open-source standards for agentic web research performance, moving toward parity with closed-source systems.