
WebThinker: Empowering Large Reasoning Models with Deep Research Capability (2504.21776v1)

Published 30 Apr 2025 in cs.CL, cs.AI, and cs.IR

Abstract: Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.

WebThinker is a deep research agent designed to empower large reasoning models (LRMs) with the ability to autonomously search the web, navigate web pages, and draft research reports during their reasoning process. The paper "WebThinker: Empowering Large Reasoning Models with Deep Research Capability" (Li et al., 30 Apr 2025) addresses the limitations of traditional LRMs and existing retrieval-augmented generation (RAG) methods, which often rely on static internal knowledge or predefined retrieval workflows, hindering performance on complex, knowledge-intensive tasks and comprehensive report generation.

The core objective of WebThinker is to enable LRMs to seamlessly interleave reasoning, information gathering from the web, and output generation (either problem solutions or research reports) in a single, continuous process. This contrasts with standard RAG, which typically involves a separate retrieval step followed by generation, and iterative RAG, which follows a fixed cycle of search-then-reason. WebThinker puts the LRM in control, allowing it to decide when and how to interact with external tools based on its dynamic reasoning needs.
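This model-in-control loop can be sketched in miniature. Everything below is a hedged illustration, not the paper's actual code: the special token markers (`<|search|>`, `<|result|>`) and the `generate`/`run_search` stubs are hypothetical stand-ins for LRM decoding and the Deep Web Explorer.

```python
# Hypothetical tool-call markers; WebThinker's real prompt format may differ.
SEARCH_OPEN, SEARCH_CLOSE = "<|search|>", "<|/search|>"
RESULT_OPEN, RESULT_CLOSE = "<|result|>", "<|/result|>"

def run_search(query):
    # Stand-in for the Deep Web Explorer: returns a summary of web findings.
    return f"[summary for: {query}]"

def generate(context):
    # Stand-in for LRM decoding: emits one search call until results are
    # present in context, then produces a final answer.
    if RESULT_OPEN not in context:
        return SEARCH_OPEN + "capital of France" + SEARCH_CLOSE
    return "Final answer: Paris"

def reason(question, max_tool_calls=4):
    """Interleave reasoning and tool use: the model, not a fixed pipeline,
    decides at each step whether to search or to keep reasoning."""
    context = question + "\n"
    for _ in range(max_tool_calls):
        step = generate(context)
        context += step
        if SEARCH_OPEN in step:
            query = step.split(SEARCH_OPEN, 1)[1].split(SEARCH_CLOSE, 1)[0]
            context += RESULT_OPEN + run_search(query) + RESULT_CLOSE
        else:
            break  # model finished without requesting another tool call
    return context
```

The key contrast with standard RAG is that no retrieval step is scheduled in advance: the loop only searches when the model itself emits a search call.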

The framework operates in two main modes:

  1. Problem-Solving Mode: This mode focuses on solving complex, knowledge-intensive questions. It equips the LRM with a Deep Web Explorer module. When the LRM encounters a knowledge gap during its reasoning, it can invoke the Deep Web Explorer via a search tool call. The Deep Web Explorer, itself driven by the LRM, can perform initial web searches and, critically, navigate deeper into promising web pages by clicking links or buttons. This allows for more thorough information gathering than simple snippet retrieval. The Explorer summarizes the relevant findings and returns them to the main LRM's reasoning process. The LRM then continues reasoning using this newly acquired information.
  2. Report Generation Mode: This mode targets the creation of comprehensive research reports. Building upon the Problem-Solving Mode, it integrates an Autonomous Think-Search-and-Draft strategy. The main LRM orchestrates the entire process, deciding when to search for information (using the Deep Web Explorer) and when to compose or refine sections of the report. It utilizes a set of report writing tools ($\mathcal{T}_{\text{write}}$) implemented by an assistant LLM, including:
    • write_section: Drafts content for a specific chapter or section.
    • check_article: Retrieves the current outline of the report to allow the main LRM to assess progress and structure.
    • edit_article: Instructs the assistant LLM to modify existing report content based on specified instructions. This iterative process allows the LRM to collect information for a section, draft it, check the overall report for coherence and completeness, and edit sections as needed, all while continuing its core reasoning.
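As a rough sketch of this tool interface: the tool names below mirror the paper (write_section, check_article, edit_article), but the `Report` class and the `assistant_llm` stub are illustrative assumptions, not the actual implementation.

```python
def assistant_llm(instruction, material=""):
    # Stand-in for the assistant LLM that executes writing/editing requests.
    return f"[text for: {instruction}]"

class Report:
    """Toy report state managed by the assistant LLM on behalf of the main LRM."""

    def __init__(self):
        self.sections = {}  # section name -> drafted content, in order

    def write_section(self, name, gathered_info):
        # Draft content for one section from information the LRM collected.
        self.sections[name] = assistant_llm(f"draft '{name}'", gathered_info)

    def check_article(self):
        # Return the current outline so the main LRM can assess structure.
        return list(self.sections)

    def edit_article(self, name, instruction):
        # Revise an existing section per the main LRM's instruction.
        self.sections[name] = assistant_llm(f"edit '{name}': {instruction}")

# Example of the think-search-and-draft cycle driving these tools:
report = Report()
report.write_section("Introduction", "collected web notes ...")
report.write_section("Methods", "more collected notes ...")
outline = report.check_article()  # LRM inspects progress
report.edit_article("Introduction", "tighten the opening paragraph")
```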

To enhance the LRM's ability to effectively utilize these tools, WebThinker employs RL-based training strategies. The approach involves iterative online Direct Preference Optimization (DPO) (Wu et al., 1 May 2024; Dong et al., 19 Jun 2024). Trajectories of the LRM interacting with the tools are sampled on complex tasks. Preference pairs of better-performing and worse-performing trajectories are constructed based on criteria such as overall correctness/quality, tool-use efficiency (fewer calls for the same result), and conciseness of the reasoning. This preference data is then used to fine-tune the LRM with DPO in an iterative loop, where the updated model samples new trajectories for subsequent training rounds, enabling the model to progressively learn more effective tool usage patterns.
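The preference construction and objective can be sketched as follows. The DPO loss is the standard one from the literature; the trajectory fields, log-probability values, and ranking key are toy assumptions chosen to mirror the criteria listed above.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where logp_* are policy log-probs and ref_logp_* are reference log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def preference_key(traj):
    # Ordering used to pick chosen/rejected pairs, mirroring the criteria
    # above: correctness first, then fewer tool calls, then shorter thinking.
    return (traj["correct"], -traj["tool_calls"], -traj["tokens"])

# Toy sampled trajectories for one task (all fields illustrative).
trajs = [
    {"id": "a", "correct": True,  "tool_calls": 3, "tokens": 900},
    {"id": "b", "correct": True,  "tool_calls": 6, "tokens": 1200},
    {"id": "c", "correct": False, "tool_calls": 2, "tokens": 400},
]
ranked = sorted(trajs, key=preference_key, reverse=True)
chosen, rejected = ranked[0], ranked[-1]  # form one preference pair
```

In the iterative online variant, the fine-tuned model replaces the sampler after each round, so later rounds collect preference pairs from progressively stronger tool-use behavior.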

Implementation Details:

WebThinker is implemented using open-source LRMs like QwQ-32B [qwen_qwq] and DeepSeek-R1 series models [deepseek-r1]. The Deep Web Explorer utilizes a search engine (Bing Web Search API) and a web crawler (Crawl4AI [Crawl4AI]) for fetching web page content and simulating clicks. An assistant LLM (e.g., Qwen2.5-Instruct [qwen2.5]) handles the detailed execution of report writing and editing tools and summarizes clicked web page content. The models are trained using iterative online DPO with preference data generated from diverse datasets requiring reasoning and tool use.
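The Explorer's search-then-navigate flow might look roughly like this. All three stubs (`search_api`, `fetch_page`, `pick_link`) are hypothetical placeholders: the actual system uses the Bing Web Search API and Crawl4AI, whose real interfaces differ, and link selection is made by the LRM rather than a fixed rule.

```python
def search_api(query):
    # Stand-in for a web search call returning ranked results.
    return [{"url": "https://example.org/a", "snippet": "..."}]

def fetch_page(url):
    # Stand-in for the crawler: page text plus outgoing links to click.
    return {"text": f"content of {url}", "links": [url + "/detail"]}

def pick_link(page, query):
    # Stand-in for the LRM judging which link (if any) looks promising.
    return page["links"][0] if page["links"] else None

def explore(query, max_depth=2):
    """Search, then navigate deeper into promising pages, collecting notes."""
    notes = []
    url = search_api(query)[0]["url"]
    for _ in range(max_depth):
        page = fetch_page(url)
        notes.append(page["text"])
        nxt = pick_link(page, query)
        if nxt is None:
            break  # nothing worth clicking; stop navigating
        url = nxt
    return " | ".join(notes)  # stand-in for the Explorer's returned summary
```

The depth-limited click loop is what distinguishes this from snippet-only retrieval: the Explorer can follow links into a page's interior before summarizing.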

Evaluation:

The framework was evaluated on:

  • Complex Reasoning Benchmarks: GPQA (Rein et al., 2023) (PhD-level science QA), GAIA [gaia], WebWalkerQA (Wu et al., 13 Jan 2025) (web-traversal QA), and Humanity's Last Exam (HLE) [HLE]. Performance is measured by Pass@1 accuracy, often evaluated by a strong LLM judge (Qwen2.5-72B-Instruct [qwen2.5]).
  • Scientific Report Generation Tasks: Glaive [glaive_dataset]. Report quality is evaluated by strong LLM judges (DeepSeek-R1-671B [deepseek-r1] and GPT-4o [gpt_4o_system_card]) on criteria including Completeness, Thoroughness, Factuality, and Coherence.

Results:

Experimental results demonstrate that WebThinker significantly outperforms baseline methods, including direct reasoning, various RAG workflows, and existing autonomous search agents like Search-o1 (Li et al., 9 Jan 2025), across both complex problem-solving and report generation tasks.

  • On complex reasoning benchmarks, the RL-trained WebThinker-32B-RL achieved state-of-the-art results among comparable 32B models and surpassed stronger models like o3-mini (High) on the challenging HLE benchmark.
  • On scientific report generation, WebThinker achieved the highest average quality scores, outperforming RAG baselines and proprietary systems like Gemini 2.0 Deep Research [gemini_deep_research].
  • Ablation studies confirmed the critical roles of the Deep Web Explorer (especially link clicking), the Autonomous Think-Search-and-Draft strategy, and the RL training in achieving these results.
  • Experiments with DeepSeek-R1 backbones of different sizes (7B, 14B, 32B) showed that WebThinker's framework is effective across different LRMs, consistently improving performance over direct generation and standard RAG.

Contributions:

The paper highlights the following contributions:

  1. Introduction of WebThinker, a deep research agent for autonomous web exploration and report drafting within the LRM's thinking process.
  2. Proposal of the Deep Web Explorer module for in-depth web information gathering and navigation.
  3. Introduction of the Autonomous Think-Search-and-Draft strategy for real-time report writing integrated with reasoning and searching.
  4. Development of RL-based training strategies using iterative online DPO to enhance tool utilization.
  5. Demonstration of WebThinker's effectiveness on complex reasoning and scientific report generation tasks using open-source LRM backbones.

Practical Implications:

WebThinker's ability to conduct in-depth web research and generate comprehensive reports makes it highly applicable to tasks requiring significant external knowledge and structured output. This includes:

  • Research Assistance: Automating the process of gathering, synthesizing, and drafting reports on complex topics in science, finance, engineering, etc.
  • Complex Problem Solving: Providing AI assistants capable of tackling challenging questions that require accessing and processing diverse real-world information sources.
  • Knowledge Management: Building systems that can dynamically update their understanding by exploring the latest information online.

The framework's open-source nature (code available on GitHub) encourages further development and adoption in practical applications.

Future Work:

The authors plan to extend WebThinker by incorporating multimodal reasoning, developing more advanced tool learning mechanisms, and exploring GUI-based web exploration for more intuitive interaction with complex web interfaces.

Authors (8)
  1. Xiaoxi Li
  2. Jiajie Jin
  3. Guanting Dong
  4. Hongjin Qian
  5. Yutao Zhu
  6. Yongkang Wu
  7. Ji-Rong Wen
  8. Zhicheng Dou