- The paper introduces an end-to-end system for building autonomous web agents using a staged pipeline for data construction, trajectory sampling, and fine-tuning.
- It employs advanced QA pair synthesis and multi-step reasoning via the ReAct framework to generate high-quality web browsing trajectories.
- Reinforcement learning further refines agent decision-making, leading to significant performance gains over baselines and even GPT-4o.
This paper presents WebDancer, an agentic system designed for autonomous, multi-step information seeking on the web. The core contribution is a systematic, end-to-end pipeline for building such agents, focusing on data construction and a staged training approach.
The paper identifies key challenges in building effective web agents: acquiring high-quality browsing data, constructing reliable multi-step trajectories, and designing scalable training strategies for real-world generalization. To address these, WebDancer proposes a four-stage paradigm:
- Browsing Data Construction: This stage focuses on generating diverse and challenging deep information-seeking QA pairs. Two methods are introduced:
- crawlQA: Synthesizing QA pairs by systematically crawling knowledge-rich websites (e.g., arXiv, GitHub, wiki) and using an LLM (GPT-4o) to generate questions grounded in the collected content, mimicking human browsing behavior.
- e2hQA (easy-to-hard QA): Iteratively transforming simple fact-seeking questions into complex, multi-step ones by replacing an entity in the question with information retrieved from search results about that entity (see the sketch after this list). An LLM (GPT-4o) reformulates the question at each iteration, with complexity controlled by the number of iterations.
These methods aim to create datasets that require longer-horizon web exploration compared to existing shallow datasets.
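As an illustration of the e2hQA loop, the sketch below shows how a simple QA pair could be iteratively hardened. It is a minimal sketch, not the paper's implementation: the `pick_entity`, `search_snippets`, and `llm_rewrite` callables are hypothetical stand-ins for the entity selection, retrieval, and GPT-4o rewriting steps.

```python
from typing import Callable

def e2h_qa(
    simple_question: str,
    answer: str,
    pick_entity: Callable[[str], str],       # hypothetical: choose an entity in the question to obscure
    search_snippets: Callable[[str], str],   # hypothetical: retrieve web evidence about that entity
    llm_rewrite: Callable[..., str],         # hypothetical: LLM call that rewrites the question
    n_iterations: int = 3,                   # complexity scales with the number of iterations
) -> str:
    """Iteratively harden an easy QA pair into a multi-step one (e2hQA sketch)."""
    question = simple_question
    for _ in range(n_iterations):
        entity = pick_entity(question)
        evidence = search_snippets(entity)
        # Replace the entity with an indirect description built from the retrieved
        # evidence, while keeping the original answer fixed.
        question = llm_rewrite(
            question=question,
            entity=entity,
            evidence=evidence,
            constraint=f"The answer must remain: {answer}",
        )
    return question
```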
- Trajectory Sampling: High-quality interaction trajectories are sampled from the synthesized QA pairs. The agent framework is based on ReAct, which interleaves Thought, Action, and Observation steps. Actions are limited to search, visit, and answer. The paper explores generating trajectories using two types of Chain-of-Thought (CoT):
- Short CoT: Generated directly using a powerful LLM (GPT-4o) following the standard ReAct prompt format.
- Long CoT: Generated by sequentially providing a Large Reasoning Model (LRM, QwQ-Plus) with historical actions and observations, allowing it to decide the next action. The LRM's internal reasoning process is recorded as the thought.
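To make the Long CoT rollout concrete, here is a minimal sketch of a ReAct sampling loop that re-prompts the reasoning model with the accumulated history at each step; `call_lrm` and `run_tool` are hypothetical placeholders, and parsing and termination details are simplified.

```python
from typing import Callable

def render_history(question: str, history: list[dict]) -> str:
    """Serialize the ReAct history into a prompt for the next reasoning step."""
    lines = [f"Question: {question}"]
    for step in history:
        lines.append(f"Thought: {step['thought']}")
        lines.append(f"Action: {step['action']}[{step['argument']}]")
        if "observation" in step:
            lines.append(f"Observation: {step['observation']}")
    return "\n".join(lines)

def sample_long_cot_trajectory(
    question: str,
    call_lrm: Callable[[str], tuple],     # hypothetical: returns (thought, action, argument)
    run_tool: Callable[[str, str], str],  # hypothetical: executes search/visit, returns an observation
    max_steps: int = 20,
) -> list[dict]:
    """Roll out one ReAct trajectory by re-prompting the LRM with the full history."""
    history: list[dict] = []
    prompt = f"Question: {question}"
    for _ in range(max_steps):
        # The LRM sees the question plus all prior Thought/Action/Observation steps
        # and decides the next action; its internal reasoning is recorded as the thought.
        thought, action, argument = call_lrm(prompt)
        step = {"thought": thought, "action": action, "argument": argument}
        if action == "answer":            # terminal action: the argument is the final answer
            history.append(step)
            break
        step["observation"] = run_tool(action, argument)  # external feedback (masked later in SFT)
        history.append(step)
        prompt = render_history(question, history)
    return history
```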
A three-stage rejection filtering framework (validity, correctness, quality) is applied to ensure the sampled trajectories are high-quality, correct, and non-redundant.
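The rejection filtering can be viewed as a simple funnel; the three predicates below are hypothetical stand-ins for the paper's validity, correctness, and quality checks.

```python
from typing import Callable, Iterable

def filter_trajectories(
    trajectories: Iterable[dict],
    is_valid: Callable[[dict], bool],        # stage 1: well-formed ReAct structure, parsable actions
    is_correct: Callable[[dict], bool],      # stage 2: final answer matches the reference
    is_high_quality: Callable[[dict], bool], # stage 3: coherent, non-redundant reasoning
) -> list[dict]:
    """Apply the three rejection stages in order; a trajectory must pass all of them."""
    return [
        traj for traj in trajectories
        if is_valid(traj) and is_correct(traj) and is_high_quality(traj)
    ]
```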
- Supervised Fine-Tuning (SFT): The collected high-quality ReAct trajectories are used to fine-tune an LLM (the policy model πθ). This stage serves as a "cold start," teaching the model the fundamental ReAct behavioral paradigm of alternating reasoning and action while preserving its original reasoning capabilities. The loss is computed over the agent's autonomous decision steps (thoughts τ and actions α), masking out tokens corresponding to external feedback (observations o):
$$\mathcal{L} = -\frac{1}{\sum_{i=1}^{|\mathcal{H}|} \mathbb{I}[x_i \neq o]} \sum_{i=1}^{|\mathcal{H}|} \mathbb{I}[x_i \neq o] \cdot \log \pi_\theta(x_i \mid t_c, x_{<i})$$
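Operationally, this objective amounts to computing token-level cross-entropy and excluding observation tokens from both the sum and the normalizer. A minimal PyTorch-style sketch (not the paper's code), assuming a boolean `observation_mask` aligned with the token sequence:

```python
import torch
import torch.nn.functional as F

def masked_sft_loss(
    logits: torch.Tensor,            # (seq_len, vocab_size) next-token logits
    targets: torch.Tensor,           # (seq_len,) target token ids
    observation_mask: torch.Tensor,  # (seq_len,) bool, True where the token belongs to an observation
) -> torch.Tensor:
    """Cross-entropy over agent-generated tokens only (thoughts and actions),
    with observation tokens excluded, mirroring the loss above."""
    nll = F.cross_entropy(logits, targets, reduction="none")  # per-token negative log-likelihood
    keep = (~observation_mask).float()                        # 1 where the token is NOT an observation
    return (nll * keep).sum() / keep.sum().clamp(min=1.0)     # average over kept tokens only
```

Masking the observations keeps the gradient focused on the agent's own decisions rather than on text returned by the environment.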
- Reinforcement Learning (RL): Building on the SFT model, this stage further optimizes the agent's decision-making and generalization in real-world web environments using outcome-based rewards. The Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) algorithm is employed: candidate execution trajectories are sampled and the policy is updated to maximize a reward function. A dynamic sampling mechanism prioritizes QA pairs that were not fully utilized during SFT, improving data efficiency and robustness. The reward combines format correctness and answer correctness, weighted heavily toward answer correctness as judged by an LLM-as-a-Judge model:
$$R(\hat{y}_i, y) = 0.1 \cdot \text{score}_{\text{format}} + 0.9 \cdot \text{score}_{\text{answer}}$$
Agentic action rollouts within the ReAct framework generate the trajectories for RL optimization.
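As a sketch of this outcome reward, assuming hypothetical `follows_format` and `judge_answer` helpers for the format check and the LLM-as-a-Judge call:

```python
from typing import Callable

def outcome_reward(
    trajectory_text: str,
    predicted_answer: str,
    reference_answer: str,
    follows_format: Callable[[str], bool],      # hypothetical ReAct/format validity check
    judge_answer: Callable[[str, str], float],  # hypothetical LLM-as-a-Judge score in [0, 1]
) -> float:
    """R = 0.1 * score_format + 0.9 * score_answer, as defined above."""
    score_format = 1.0 if follows_format(trajectory_text) else 0.0
    score_answer = judge_answer(predicted_answer, reference_answer)
    return 0.1 * score_format + 0.9 * score_answer
```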
WebDancer is an instantiation of this framework. The paper evaluates WebDancer on challenging web information-seeking benchmarks: GAIA [mialon2023gaia], WebWalkerQA [wu2025webwalker], and BrowseComp [weibrowsecomp]. Experimental results show that WebDancer achieves strong performance, significantly improving over vanilla ReAct baselines across different model scales (Qwen-2.5-7B, Qwen-2.5-32B, QwQ-32B) and even surpassing GPT-4o in some cases.
Analysis highlights several practical insights:
- High-quality synthetic data (crawlQA, e2hQA) is crucial for training effective agents, and robust filtering improves performance, especially in low-data regimes.
- SFT provides essential instruction-following capabilities for agent tasks, acting as a necessary cold start before RL.
- RL improves agent consistency and performance on complex tasks, although gains can be limited for LRMs, possibly due to sparse rewards over long trajectories.
- Transferring "thinking pattern" knowledge from strong reasoners (for Long CoT) to smaller instruction models is challenging and can introduce issues like increased invalid outputs. Training reasoning models on trajectories from other reasoning models is more effective.
- RL enables longer reasoning processes and supports more complex agentic actions compared to SFT alone.
- Performance can be highly sensitive to the dynamic nature of the web environment, suggesting inherent instability that requires more robust training and deployment strategies.
The paper concludes by summarizing the effectiveness of the proposed pipeline and discussing limitations and future work, including incorporating more complex tools, extending to document-level research tasks, improving data utilization, reducing rollout costs, developing hybrid thinking models, and addressing potential issues like tool hallucination and over-action.