
WebDancer: Towards Autonomous Information Seeking Agency (2505.22648v2)

Published 28 May 2025 in cs.CL

Abstract: Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectories sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in a web agent based on the ReAct, WebDancer. Empirical evaluations on the challenging information seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The codes and demo will be released in https://github.com/Alibaba-NLP/WebAgent.

Summary

  • The paper introduces an end-to-end system for building autonomous web agents using a staged pipeline for data construction, trajectory sampling, and fine-tuning.
  • It employs advanced QA pair synthesis and multi-step reasoning via the ReAct framework to generate high-quality web browsing trajectories.
  • Reinforcement learning further refines agent decision-making, yielding significant gains over ReAct baselines and, in some cases, over GPT-4o.

This paper presents WebDancer, an agentic system designed for autonomous, multi-step information seeking on the web. The core contribution is a systematic, end-to-end pipeline for building such agents, focusing on data construction and a staged training approach.

The paper identifies key challenges in building effective web agents: acquiring high-quality browsing data, constructing reliable multi-step trajectories, and designing scalable training strategies for real-world generalization. To address these, WebDancer proposes a four-stage paradigm:

  1. Browsing Data Construction: This stage focuses on generating diverse and challenging deep information-seeking QA pairs. Two methods are introduced:
    • crawlQA: Synthesizing QA pairs by systematically crawling knowledge-rich websites (e.g., arXiv, GitHub, wikis) and using an LLM (GPT-4o) to generate questions from the collected content, mimicking human browsing behavior.
    • e2hQA (easy-to-hard QA): Iteratively transforming simple fact-seeking questions into complex, multi-step ones by replacing an entity in the question with information retrieved from search results about that entity. An LLM (GPT-4o) reformulates the question at each iteration, with complexity controlled by the number of iterations. Both methods aim to create datasets that require longer-horizon web exploration than existing shallow datasets.
  2. Trajectories Sampling: High-quality interaction trajectories are sampled from the synthesized QA pairs. The agent framework is based on ReAct, which interleaves Thought, Action, and Observation steps. Actions are limited to search, visit, and answer. The paper explores generating trajectories using two types of Chain-of-Thought (CoT):
    • Short CoT: Generated directly using a powerful LLM (GPT-4o) following the standard ReAct prompt format.
    • Long CoT: Generated by sequentially providing a Large Reasoning Model (LRM, QwQ-Plus) with historical actions and observations, allowing it to decide the next action. The LRM's internal reasoning process is recorded as the thought. A three-stage rejection filtering framework (validity, correctness, quality) is applied to ensure the sampled trajectories are high-quality, correct, and non-redundant.
  3. Supervised Fine-Tuning (SFT): The collected high-quality ReAct trajectories are used to fine-tune an LLM (the policy model $\pi_\theta$). This stage serves as a "cold start," teaching the model the fundamental ReAct behavioral paradigm of alternating reasoning and action while preserving its original reasoning capabilities. The loss is computed only over the agent's autonomous decision steps (thoughts $\tau$ and actions $\alpha$), masking out tokens corresponding to external feedback (observations $o$); a minimal sketch of this masking appears after the list below.

    $$L = -\frac{1}{\sum_{i=1}^{|\mathcal{H}|} \mathbb{I}[x_i \ne o]} \sum_{i=1}^{|\mathcal{H}|} \mathbb{I}[x_i \ne o] \cdot \log \pi_{\theta}(x_i \mid \mathbf{tc}, x_{<i})$$

  4. Reinforcement Learning (RL): Building on the SFT model, this stage further optimizes the agent's decision-making and generalization capabilities in real-world web environments using outcome-based rewards. The Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) algorithm is employed. DAPO samples candidate execution trajectories and updates the policy to maximize a reward function. A dynamic sampling mechanism is used to prioritize QA pairs that were not fully utilized during SFT, enhancing data efficiency and robustness. The reward function is a combination of format correctness and answer correctness, weighted towards answer correctness judged by an LLM-as-a-Judge model:

    $$R(\hat{y}_i, y) = 0.1 \cdot \text{score}_{\text{format}} + 0.9 \cdot \text{score}_{\text{answer}}$$

    Agentic action rollouts within the ReAct framework generate the trajectories for RL optimization.
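
To make the training signal in stages 3 and 4 concrete, below is a minimal sketch, assuming a PyTorch setup, of the observation-masked SFT loss and the weighted outcome reward. The tensor shapes, function names (masked_sft_loss, trajectory_reward), and the precomputed judge outputs are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def masked_sft_loss(logits: torch.Tensor,
                    target_ids: torch.Tensor,
                    agent_token_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over a ReAct trajectory, averaged only over tokens the
    agent itself produced (thoughts and actions). Observation tokens fed back
    from the environment (the o terms in the loss above) are masked out so the
    policy is never trained to imitate tool outputs.

    logits:           [seq_len, vocab_size] next-token logits from the policy
    target_ids:       [seq_len] token ids of the trajectory
    agent_token_mask: [seq_len] 1.0 for agent-generated tokens, 0.0 for observations
    """
    nll = F.cross_entropy(logits, target_ids, reduction="none")  # -log pi_theta(x_i | context)
    denom = agent_token_mask.sum().clamp(min=1.0)                # number of non-observation tokens
    return (nll * agent_token_mask).sum() / denom


def trajectory_reward(format_ok: bool, answer_correct: bool) -> float:
    """Outcome reward used in the RL stage: 0.1 * format score + 0.9 * answer score.
    Answer correctness is judged by an LLM-as-a-Judge in the paper; here it is
    assumed to arrive as a precomputed boolean for illustration."""
    return 0.1 * float(format_ok) + 0.9 * float(answer_correct)
```

Masking the observation tokens keeps the normalizer equal to the number of agent-generated tokens, matching the indicator terms in the SFT loss above; in the RL stage, a scalar reward of this form would be assigned to each sampled rollout before the DAPO policy update.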

WebDancer is an instantiation of this framework. The paper evaluates WebDancer on challenging web information-seeking benchmarks: GAIA [mialon2023gaia], WebWalkerQA [wu2025webwalker], and BrowseComp [weibrowsecomp]. Experimental results show that WebDancer achieves strong performance, significantly improving over vanilla ReAct baselines across different model scales (Qwen-2.5-7B, Qwen-2.5-32B, QwQ-32B) and even surpassing GPT-4o in some cases.

Analysis highlights several practical insights:

  • High-quality synthetic data (crawlQA, e2hQA) is crucial for training effective agents, and robust filtering improves performance, especially in low-data regimes.
  • SFT provides essential instruction-following capabilities for agent tasks, acting as a necessary cold start before RL.
  • RL improves agent consistency and performance on complex tasks, although gains can be limited for LRMs, possibly because long trajectories yield sparse rewards.
  • Transferring "thinking pattern" knowledge from strong reasoners (for Long CoT) to smaller instruction models is challenging and can introduce issues like increased invalid outputs. Training reasoning models on trajectories from other reasoning models is more effective.
  • RL enables longer reasoning processes and supports more complex agentic actions compared to SFT alone.
  • Performance can be highly sensitive to the dynamic nature of the web environment, suggesting inherent instability that requires more robust training and deployment strategies.

The paper concludes by summarizing the effectiveness of the proposed pipeline and discussing limitations and future work, including incorporating more complex tools, extending to document-level research tasks, improving data utilization, reducing rollout costs, developing hybrid thinking models, and addressing potential issues like tool hallucination and over-action.
