
WebSeer: Self-Reflective Web Search Agents

Updated 22 October 2025
  • WebSeer is a family of intelligent web search agents that combine reinforcement learning with self-reflective reasoning to extend multi-turn tool interactions and enhance search precision.
  • The system employs a two-stage training protocol, starting with supervised learning on reflection-rich multi-hop datasets followed by reinforcement learning with Group Relative Policy Optimization to boost decision-making.
  • Benchmark evaluations on datasets like HotpotQA and SimpleQA demonstrate WebSeer’s improved search completeness, robustness, and capability to mitigate error propagation through structured tool-use chains.

WebSeer refers to a family of intelligent web search agent systems, culminating in the 2025 publication "WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection" (He et al., 21 Oct 2025). WebSeer agents are designed to perform dynamic interactive retrieval within web environments, leveraging reinforcement learning (RL) and a structured self-reflection mechanism to substantially extend the depth and accuracy of their tool-use chains during information search. The system is characterized by a two-stage training protocol: an initial cold-start supervised phase on multi-hop, reflection-rich datasets, followed by a reinforcement learning stage employing Group Relative Policy Optimization techniques and explicit tool-use self-reflection. The WebSeer framework supports the use of multiple external tools in an iterative fashion and introduces reflective reasoning as a means to mitigate error propagation and improve multi-turn decision-making during search tasks.

1. Reinforcement Learning with Self-Reflective Reasoning

WebSeer adopts an RL paradigm augmented with self-reflection to address the limitations of shallow tool-use and error accumulation found in prior agentic retrieval models. The central training process involves:

  • A two-stage regime, starting with supervised fine-tuning (SFT) on trajectories annotated with explicit reflection patterns. This phase utilizes a masked autoregressive negative log-likelihood objective, optimizing only the agent's internal decision outputs.
  • The subsequent RL stage incorporates Self-Reflective Reinforcement Learning (SRRL). Here, the agent may submit multiple answer proposals per turn; each "submit-answer" tool call is scored via an F₁-based feedback mechanism relative to the ground truth answer, encouraging correction and deeper modeling of tool interactions.
  • SRRL employs Group Relative Policy Optimization (GRPO) with an asymmetric clip-higher mechanism. The RL objective is:

\mathcal{J}_{\text{DAPO}}(\theta) = \mathbb{E} \left[ \frac{ \sum_{i} \sum_{t} \min \left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \text{clip}\left( r_{i,t}(\theta),\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}} \right) \hat{A}_{i,t} \right)}{\sum_{i} |o_i| } \right]

where r_{i,t}(\theta) is the importance sampling ratio and \hat{A}_{i,t} is the estimated advantage.
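The clipped, group-normalized objective can be sketched numerically as follows. This is a minimal illustration, not the paper's implementation: ratios and advantages are assumed to be given as padded arrays, one row per sampled trajectory, and the asymmetric clip bounds are placeholder values.

```python
import numpy as np

def grpo_objective(ratios, advantages, lengths, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate, averaged over all tokens in the group.

    ratios:     (G, T) importance sampling ratios r_{i,t}
    advantages: (G, T) advantage estimates \hat{A}_{i,t}
    lengths:    (G,)   true trajectory lengths |o_i| (padding excluded)
    """
    unclipped = ratios * advantages
    # Asymmetric "clip-higher": the upper bound 1 + eps_high exceeds the
    # symmetric PPO bound, allowing larger upward policy updates.
    clipped = np.clip(ratios, 1.0 - eps_low, 1.0 + eps_high) * advantages
    per_token = np.minimum(unclipped, clipped)
    # Mask out padding positions beyond each trajectory's length, then
    # normalize by the total token count sum_i |o_i|.
    mask = np.arange(ratios.shape[1])[None, :] < lengths[:, None]
    return (per_token * mask).sum() / lengths.sum()
```

With ratios equal to 1 the objective reduces to the mean advantage; ratios above 1 + eps_high are held at the clipped value.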

This framework explicitly supports and rewards deeper self-corrective tool-use trajectories, building robustness into the agent's search strategies.
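The F₁-based feedback on each "submit-answer" call can be sketched as a bag-of-tokens overlap score. This is a hypothetical reconstruction: the paper's exact answer normalization is not specified here, so lowercase whitespace tokenization is assumed.

```python
from collections import Counter

def f1_reward(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a proposed answer and the gold answer."""
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    if not pred or not gold:
        return float(pred == gold)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A partially correct proposal thus earns partial credit, which gives the agent a graded signal to refine rather than abandon an answer.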

2. Self-Reflection Mechanism and Multi-Turn Tool-Use

The core innovation of WebSeer is its structured self-reflection paradigm, designed to promote iterative reasoning and error correction during web-based search operations. Each agent:

  • Iteratively extends its search trajectory by proposing intermediate answers, then submits them to an external verifier module for binary feedback and reflective commentary.
  • Uses multi-turn rejection sampling: if the agent’s reflection (predicate \Psi(R_t, \hat{y}, y^*)) does not align with the ground truth, additional verification samples are acquired (up to budget K), ensuring eventual convergence to a reflection pattern with \Psi = 1.
  • Concatenates each (proposal, reflection) pair into its history, enabling the model to revise, backtrack, or reformulate queries and tool invocations until the correct answer is reached.

This mechanism allows WebSeer to generate substantially longer and more robust tool-use chains and is empirically shown to improve both search depth and final answer accuracy.
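The rejection-sampling loop over reflections can be sketched as below. `sample_reflection` and `psi` are hypothetical stand-ins for the model's reflection generator and the alignment predicate \Psi; only the control flow follows the description above.

```python
def collect_reflection(sample_reflection, psi, proposal, gold, budget_k=8):
    """Sample reflections until the predicate accepts one, up to budget K.

    Returns (reflection, attempts) on success, or (None, budget_k) when the
    budget is exhausted and the trajectory is rejected.
    """
    for attempt in range(budget_k):
        reflection = sample_reflection(proposal)
        if psi(reflection, proposal, gold) == 1:
            # Valid (proposal, reflection) pair: it is appended to the
            # history so the agent can revise or reformulate from it.
            return reflection, attempt + 1
    return None, budget_k
```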

3. Training Data and Two-Stage Learning Protocol

WebSeer’s training pipeline begins with the construction of a large-scale dataset of multi-hop question answering instances, each annotated with detailed reflection patterns acquired via a rejection-sampling protocol. The protocol ensures that only valid trajectories—those for which the sequence of reflections and verifier decisions match the ground truth—are retained for supervised fine-tuning.

  • The first stage is supervised, based on sequences \mathcal{T}_i = \{ y_1^{(i)}, \ldots, y_T^{(i)} \}, with objective:

\mathcal{L}(x, \mathcal{T}; \theta) = - \frac{ \sum_{t} \mathbb{I}[y_t \notin \mathcal{O}] \log p_{\theta}(y_t \mid x, y_{<t}) }{ \sum_{t} \mathbb{I}[y_t \notin \mathcal{O}] }

where \mathcal{O} denotes tokens representing raw external tool outputs.

  • The second stage applies RL on top of the SFT initialization, incorporating explicit reward signals from the verifier feedback to guide trajectory length and reflection depth.

This dual-stage protocol is critical for stabilizing learning and fostering the agent’s self-reflective capabilities in complex web environments.
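The masked objective of the first stage can be sketched as follows: tool-output tokens (the set \mathcal{O}) are excluded so that only the agent's own decision outputs contribute to the loss. Per-token log-probabilities and the mask are assumed to be given.

```python
import numpy as np

def masked_nll(log_probs, is_tool_output):
    """Masked autoregressive NLL over the agent's own tokens.

    log_probs:      (T,) values of log p_theta(y_t | x, y_<t)
    is_tool_output: (T,) booleans, True where y_t is a raw tool output
    """
    keep = ~np.asarray(is_tool_output)
    # Average the negative log-likelihood over the kept (agent-generated)
    # tokens only, matching the normalization in the objective above.
    return -(np.asarray(log_probs)[keep]).sum() / keep.sum()
```

Masking the retrieved content prevents the model from wasting capacity memorizing web text it did not generate.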

4. Tool-Use Chain Extension and Strategic Action Depth

A defining feature of WebSeer is its ability to extend tool-use chains beyond the shallow decision horizons of prior agentic models. Quantitative analyses show progression from an average of 3 tool calls per query pre-SFT to 5–8 tool invocations per task post-RL, indicating the emergence of deliberate multi-step reasoning.

  • Tool modules include web search API, webpage reader, code executor, and a dedicated submit-answer interface.
  • The agent is incentivized, via trajectory-level RL reward discounting, to explore longer reflective chains, backtrack when necessary, and strategically select external tools based on intermediate feedback.

This extension leads directly to improved completeness and reliability in answering multi-hop tasks and is substantiated by state-of-the-art benchmark results.
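The iterative tool-dispatch loop implied by the module list above can be sketched as follows. This is a schematic under assumed interfaces: the policy is a stand-in callable, and the tool names mirror the four modules listed (search, reader, code executor, submit-answer).

```python
def run_agent(policy, tools, question, max_turns=16):
    """Run one tool-use chain: call tools until an answer is submitted.

    policy: callable mapping the history to a (tool_name, args) action
    tools:  dict of tool_name -> callable, e.g. "search", "read", "code"
    """
    history = [("question", question)]
    for _ in range(max_turns):
        tool_name, args = policy(history)
        if tool_name == "submit_answer":
            # Terminal action: the proposed answer is scored externally.
            return args, history
        observation = tools[tool_name](args)
        # Each (tool, observation) pair is appended so the policy can
        # backtrack or reformulate on the next turn.
        history.append((tool_name, observation))
    return None, history  # turn budget exhausted without an answer
```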

5. Benchmark Evaluation and Generalization Capacity

WebSeer’s performance is empirically validated on prominent multi-hop QA benchmarks, achieving:

  • HotpotQA: 72.3%
  • SimpleQA: 90.0%
  • Demonstrated high accuracy and stability on diverse out-of-distribution evaluation sets (including Bamboogle, PopQA, etc.)

The reflective training protocol and RL-guided exploration enhance generalization, with the agent adapting well to novel web search distributions and content sources not seen during training. The system is shown to outperform baselines and prior agentic retrieval systems by significant margins on both in-domain and OOD tasks.

6. System Implementation and Reproducibility

Complete source code, model checkpoints, and deployment guidelines for WebSeer are provided at https://github.com/99hgz/WebSeer. The repository supports:

  • Training from scratch using the provided two-stage framework
  • Integration with web-based environments and external tool APIs
  • Reproduction of experimental results and interaction with the agentic modules

This open-source provision supports further research, experimentation, and adaptation to real-world search environments.


WebSeer establishes a new paradigm in web search agent design by integrating self-reflective reinforcement learning with dynamic multi-tool reasoning, yielding significant advances in both tool-use chain complexity and answer accuracy. The methodology and results underscore the importance of structured reflection, multi-stage training, and staged RL policy design for building robust search agents capable of sophisticated web-based information retrieval (He et al., 21 Oct 2025).
