WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research (2509.13312v1)

Published 16 Sep 2025 in cs.CL

Abstract: This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like "loss in the middle" and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.

Summary

The paper introduces a dual-agent framework that dynamically refines research outlines to guide targeted evidence retrieval.
It demonstrates hierarchical synthesis by retrieving and citing evidence section-by-section, significantly boosting citation accuracy and coherence.
Empirical evaluations show that iterative outline optimization outperforms static approaches in comprehensiveness, insight, and overall report quality.

WebWeaver: Dynamic Outline-Driven Deep Research Agents for Web-Scale Evidence Synthesis

Introduction and Motivation

The paper addresses the challenge of open-ended deep research (OEDR), where AI agents must autonomously synthesize information from web-scale corpora—often exceeding 100 documents—into comprehensive, insightful, and source-grounded reports. Existing approaches are limited by static research pipelines that decouple planning from evidence acquisition and by one-shot generation paradigms that are susceptible to long-context failures, such as "loss in the middle" and hallucinations. WebWeaver introduces a dual-agent framework that emulates the iterative, adaptive research process of human experts, integrating dynamic outline optimization with targeted evidence retrieval and hierarchical synthesis.

Paradigm Analysis and System Architecture

WebWeaver is positioned against two prevailing paradigms: (a) search-then-generate, which aggregates evidence before generating a report, and (b) static outline-guided search, which fixes the outline prior to evidence collection. Both are limited: the first lacks a guiding structure, and the second is rigid because its outline cannot adapt to evidence found later. Context overload compounds these problems.

Figure 1: Comparison of research paradigms, highlighting WebWeaver's dynamic, co-evolving outline and search strategy with hierarchical, evidence-focused writing.

WebWeaver's architecture consists of two specialized agents:

Planner: Iteratively interleaves evidence acquisition (web search, document parsing, evidence extraction) with outline optimization. The outline is dynamically refined as new evidence is discovered, and each section is explicitly linked to evidence in a structured memory bank.
Writer: Performs hierarchical, section-by-section synthesis, retrieving only the relevant evidence for each section as indicated by the outline's citations. This mitigates attentional failures and context-bleeding, ensuring high factuality and coherence.

Figure 2: WebWeaver workflow: the planner iteratively collects evidence and optimizes the outline; the writer composes the report hierarchically, grounded in cited evidence.
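
The link between outline sections and the memory bank can be pictured with a small data structure. The following is a minimal Python sketch, not the paper's implementation: the names `Evidence`, `MemoryBank`, and `OutlineSection`, the integer-ID citation scheme, and the example URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """One extracted snippet plus the page it came from (hypothetical schema)."""
    url: str
    text: str


@dataclass
class MemoryBank:
    """Stores evidence under integer IDs so outline sections can cite it."""
    items: dict[int, Evidence] = field(default_factory=dict)

    def add(self, evidence: Evidence) -> int:
        evidence_id = len(self.items) + 1
        self.items[evidence_id] = evidence
        return evidence_id

    def fetch(self, ids: list[int]) -> list[Evidence]:
        # Return only the cited entries, skipping IDs that are not stored.
        return [self.items[i] for i in ids if i in self.items]


@dataclass
class OutlineSection:
    """An outline entry that points at its supporting evidence by ID."""
    title: str
    citations: list[int] = field(default_factory=list)


bank = MemoryBank()
first = bank.add(Evidence("https://example.com/a", "Snippet about topic A."))
second = bank.add(Evidence("https://example.com/b", "Snippet about topic B."))

section = OutlineSection("Background on topic A", citations=[first])
cited = bank.fetch(section.citations)
print([e.url for e in cited])
```

Because each section carries only IDs, the outline stays small while the evidence itself lives in the bank, and every claim can be traced back to a stored source.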

Methodology: Dynamic Research Cycle and Memory-Grounded Synthesis

Dynamic Research Cycle

The planner operates in a ReAct-style agentic loop, alternating between three actions: search, outline optimization, and terminate. Evidence acquisition is performed via web search, followed by LLM-based URL selection, summarization, and evidence extraction. Outline optimization is not a one-off step but a continuous process, with each iteration expanding, restructuring, or refining the outline based on newly acquired evidence. Each outline section is annotated with explicit citations to the memory bank, enabling traceable provenance.
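
The three-action loop can be sketched roughly as follows. This is an illustrative Python skeleton under stated assumptions, not the authors' code: `fake_search` stands in for web search plus evidence extraction, `scripted_policy` stands in for the LLM's choice of action, and the action strings and `max_steps` cap are hypothetical.

```python
from typing import Callable


def fake_search(query: str) -> list[str]:
    # Hypothetical stub; a real system would call a search API and an LLM.
    return [f"evidence for '{query}' #{i}" for i in range(2)]


def run_planner(
    question: str,
    choose_action: Callable[[list[str], list[str], int], str],
    max_steps: int = 10,
) -> tuple[list[str], list[str]]:
    """Alternate between search, outline optimization, and terminate."""
    memory: list[str] = []   # evidence bank; list index + 1 is the citation ID
    outline: list[str] = []  # outline lines annotated with citation IDs
    for step in range(max_steps):
        action = choose_action(memory, outline, step)
        if action == "search":
            memory.extend(fake_search(question))
        elif action == "optimize_outline":
            # Rebuild the outline from all evidence gathered so far.
            outline = [
                f"Section {i}: cites [{i}]" for i in range(1, len(memory) + 1)
            ]
        elif action == "terminate":
            break
        else:
            raise ValueError(f"unknown action: {action}")
    return outline, memory


def scripted_policy(memory: list[str], outline: list[str], step: int) -> str:
    # Stand-in for the LLM's decision: search, refine, search, refine, stop.
    script = ["search", "optimize_outline", "search", "optimize_outline", "terminate"]
    return script[step] if step < len(script) else "terminate"


outline, memory = run_planner("dynamic outlines", scripted_policy)
print(len(memory), len(outline))
```

The point of the sketch is the control flow: the outline is rewritten after each batch of evidence rather than fixed up front, and the loop ends only when the policy chooses to terminate or the step cap is reached.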

Hierarchical, Memory-Grounded Synthesis

The writer agent eschews brute-force, all-at-once generation. Instead, it retrieves only the evidence relevant to the current section, performs internal reasoning, and writes the section before pruning the context and proceeding to the next. This divide-and-conquer approach is critical for managing long-context limitations and for maintaining both local and global coherence.
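
A minimal sketch of the retrieve, write, and prune cycle might look like this. It is an assumption-laden illustration rather than the paper's implementation: `write_report`, `stub_writer`, and the outline format (a list of title and citation-ID pairs) are hypothetical, and the internal reasoning step is omitted.

```python
from typing import Callable


def write_report(
    outline: list[tuple[str, list[int]]],
    memory_bank: dict[int, str],
    draft_section: Callable[[str, list[str]], str],
) -> str:
    """Compose a report one section at a time from cited evidence only."""
    sections: list[str] = []
    for title, citation_ids in outline:
        # Targeted retrieval: load only the evidence this section cites.
        context = [memory_bank[i] for i in citation_ids]
        sections.append(draft_section(title, context))
        # Pruning: drop the evidence before moving on; only finished text is kept.
        context.clear()
    return "\n\n".join(sections)


seen_context_sizes: list[int] = []


def stub_writer(title: str, evidence: list[str]) -> str:
    # Stand-in for an LLM call; records how much evidence it was shown.
    seen_context_sizes.append(len(evidence))
    return f"{title}: " + " ".join(evidence)


memory_bank = {1: "Fact A.", 2: "Fact B.", 3: "Fact C."}
outline = [("Intro", [1]), ("Methods", [2, 3])]

report = write_report(outline, memory_bank, stub_writer)
print(report)
print(seen_context_sizes)
```

Here the writer for "Intro" never sees the evidence cited by "Methods", which is the property the section-by-section design relies on to keep the working context short.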

Empirical Evaluation

Benchmark Results

WebWeaver is evaluated on DeepResearch Bench, DeepConsult, and DeepResearchGym, outperforming both proprietary and open-source baselines across all major metrics, including comprehensiveness, insight, instruction-following, readability, effective citations, and citation accuracy.

Figure 3: WebWeaver achieves state-of-the-art performance across DeepResearch Bench, DeepConsult, and DeepResearchGym.

Notably, WebWeaver achieves a citation accuracy of 93.37% and the highest overall report quality, demonstrating the efficacy of dynamic outline optimization and hierarchical synthesis. The system also achieves near-perfect scores in depth and breadth on DeepResearchGym, indicating exhaustive topic coverage and robust structural organization.

Outline Optimization and Hierarchical Writing

Ablation studies reveal that iterative outline optimization yields monotonic improvements in both end-to-end report quality and LLM-judged outline quality. Each additional round of optimization increases comprehensiveness, insight, and evidentiary support, which is strong evidence against fixing the outline before evidence is collected.

Figure 4: Distribution of outline optimization rounds, showing the prevalence and necessity of multiple refinement cycles.

Figure 5: End-to-end report scores increase with additional outline optimization rounds.

Figure 6: LLM-judged outline quality improves with each optimization round, especially in depth, breadth, and support.

Hierarchical writing outperforms brute-force writing (e.g., LongWriter-style) on every reported quality metric, with significant gains in insight, readability, and support. The targeted retrieval-and-pruning mechanism prevents context-bleeding and keeps claims tightly linked to evidence.

Figure 7: Hierarchical writing outperforms brute-force writing across all report quality metrics.

Agentic Finetuning and Data Generation

To enable smaller models to acquire the complex skills of planning, searching, and writing, the authors construct WebWeaver-3k, a high-quality SFT dataset generated by the framework. Fine-tuning on this dataset yields substantial improvements in citation accuracy (from 25% to 85.9%) and overall report quality, demonstrating that the agentic workflow is learnable and transferable.

Figure 8: Outline optimization statistics in the WebWeaver-SFT training set, reflecting the complexity and depth of the agentic trajectories.

Implications and Future Directions

WebWeaver demonstrates that OEDR can be reframed as a structured, agentic process, where dynamic planning and targeted synthesis are orchestrated via explicit tool use and memory management. This approach not only achieves superior empirical results but also provides a blueprint for future research in agentic reasoning, long-context management, and automated knowledge work.

The framework's modularity and compatibility with various LLMs suggest broad applicability. The success of agentic finetuning on smaller models indicates a path toward democratizing high-quality research agents without reliance on proprietary, large-scale systems. The explicit citation mechanism and memory-centric architecture also provide a foundation for future work in verifiable, auditable AI systems.

Conclusion

WebWeaver establishes a new state-of-the-art in open-ended deep research by integrating dynamic outline optimization with hierarchical, memory-grounded synthesis. The dual-agent framework overcomes the limitations of static pipelines and brute-force generation, achieving high factuality, coherence, and insight. The results validate the necessity of adaptive planning and focused synthesis for complex, information-intensive tasks, and the methodology provides a scalable template for future agentic systems in AI research and beyond.