- The paper introduces a dual-agent framework that dynamically refines research outlines to guide targeted evidence retrieval.
- It demonstrates hierarchical synthesis by retrieving and citing evidence section-by-section, significantly boosting citation accuracy and coherence.
- Empirical evaluations show that iterative outline optimization outperforms static approaches in comprehensiveness, insight, and overall report quality.
WebWeaver: Dynamic Outline-Driven Deep Research Agents for Web-Scale Evidence Synthesis
Introduction and Motivation
The paper addresses the challenge of open-ended deep research (OEDR), where AI agents must autonomously synthesize information from web-scale corpora, often exceeding 100 documents, into comprehensive, insightful, and source-grounded reports. Existing approaches are limited in two ways: static research pipelines decouple planning from evidence acquisition, and one-shot generation is susceptible to long-context failures such as "lost in the middle" and hallucination. WebWeaver introduces a dual-agent framework that emulates the iterative, adaptive research process of human experts, integrating dynamic outline optimization with targeted evidence retrieval and hierarchical synthesis.
Paradigm Analysis and System Architecture
WebWeaver is positioned against two prevailing paradigms: (a) search-then-generate, which aggregates evidence before generating a report, and (b) static outline-guided search, which fixes the outline prior to evidence collection. Both approaches are fundamentally limited—either by lack of structure or by rigidity and context overload.
Figure 1: Comparison of research paradigms, highlighting WebWeaver's dynamic, co-evolving outline and search strategy with hierarchical, evidence-focused writing.
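WebWeaver's architecture consists of two specialized agents: a planner, which interleaves web search with outline optimization and stores the collected evidence in a memory bank, and a writer, which composes the report section by section from that memory bank.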
Methodology: Dynamic Research Cycle and Memory-Grounded Synthesis
Dynamic Research Cycle
The planner operates in a ReAct-style agentic loop, alternating between three actions: search, outline optimization, and terminate. Evidence acquisition is performed via web search, followed by LLM-based URL selection, summarization, and evidence extraction. Outline optimization is not a one-off step but a continuous process, with each iteration expanding, restructuring, or refining the outline based on newly acquired evidence. Each outline section is annotated with explicit citations to the memory bank, enabling traceable provenance.
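The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `PlannerState`, `Section`, `run_planner`, and the three `toy_*` functions are hypothetical, and the toy functions stand in for the LLM policy, the search tool, and the outline optimizer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Section:
    title: str
    citations: List[int] = field(default_factory=list)  # ids into the memory bank


@dataclass
class PlannerState:
    query: str
    memory_bank: Dict[int, str] = field(default_factory=dict)  # id -> evidence text
    outline: List[Section] = field(default_factory=list)


def run_planner(
    state: PlannerState,
    choose_action: Callable[[PlannerState], str],
    search: Callable[[PlannerState], List[str]],
    optimize_outline: Callable[[PlannerState], List[Section]],
    max_steps: int = 10,  # assumed safety cap, not taken from the paper
) -> PlannerState:
    """Alternate between search and outline optimization until terminate."""
    for _ in range(max_steps):
        action = choose_action(state)
        if action == "search":
            # Store each new piece of evidence under a fresh id.
            for text in search(state):
                state.memory_bank[len(state.memory_bank) + 1] = text
        elif action == "optimize_outline":
            # The outline is rewritten against everything gathered so far.
            state.outline = optimize_outline(state)
        elif action == "terminate":
            break
        else:
            raise ValueError(f"unknown action: {action}")
    return state


# Toy stand-ins for the LLM policy and tools (purely illustrative).
def toy_policy(state: PlannerState) -> str:
    if len(state.memory_bank) < 2:
        return "search"
    if len(state.outline) < len(state.memory_bank):
        return "optimize_outline"
    return "terminate"


def toy_search(state: PlannerState) -> List[str]:
    return [f"evidence {len(state.memory_bank) + 1} for {state.query}"]


def toy_optimize(state: PlannerState) -> List[Section]:
    # One section per evidence item, each citing its memory-bank id.
    return [Section(f"Section {i}", [i]) for i in state.memory_bank]


final = run_planner(
    PlannerState("solid-state batteries"), toy_policy, toy_search, toy_optimize
)
```

With these toy stand-ins the loop searches twice, optimizes the outline once, and then terminates. In a real system each callable would wrap an LLM call or a web tool, and the outline would typically be revised many times rather than once.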
Hierarchical, Memory-Grounded Synthesis
The writer agent eschews brute-force, all-at-once generation. Instead, it retrieves only the evidence relevant to the current section, performs internal reasoning, and writes the section before pruning the context and proceeding to the next. This divide-and-conquer approach is critical for managing long-context limitations and for maintaining both local and global coherence.
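A rough sketch of this retrieve, write, and prune cycle follows. It is an illustration under assumptions rather than the authors' code: `write_report`, `toy_writer`, and the `Outline` alias are hypothetical names, and carrying forward only the titles of finished sections is one assumed pruning policy among several possible ones.

```python
from typing import Callable, Dict, List, Tuple

# (section title, cited memory-bank ids); a hypothetical outline representation.
Outline = List[Tuple[str, List[int]]]


def write_report(
    outline: Outline,
    memory_bank: Dict[int, str],
    write_section: Callable[[str, Dict[int, str], List[str]], str],
) -> str:
    """Write one section at a time, loading only the evidence it cites."""
    finished_titles: List[str] = []
    sections: List[str] = []
    for title, cited_ids in outline:
        # Retrieve only the evidence cited by the current section.
        evidence = {i: memory_bank[i] for i in cited_ids if i in memory_bank}
        sections.append(write_section(title, evidence, finished_titles))
        # Prune: the evidence is not carried over to the next section.
        # Only the finished titles are kept (an assumed policy).
        finished_titles.append(title)
    return "\n\n".join(sections)


# Toy stand-in for the LLM writer (purely illustrative).
def toy_writer(title: str, evidence: Dict[int, str], done: List[str]) -> str:
    cites = "".join(f"[{i}]" for i in sorted(evidence))
    return f"{title}: {len(evidence)} source(s) {cites}"


memory = {1: "a", 2: "b", 3: "c"}
outline = [("Background", [1]), ("Methods", [2, 3])]
report = write_report(outline, memory, toy_writer)
```

The point of the design is that the writer's context at any step is bounded by the evidence for a single section, not by the size of the whole memory bank.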
Empirical Evaluation
Benchmark Results
WebWeaver is evaluated on DeepResearch Bench, DeepConsult, and DeepResearchGym, outperforming both proprietary and open-source baselines across all major metrics, including comprehensiveness, insight, instruction-following, readability, effective citations, and citation accuracy.
Figure 3: WebWeaver achieves state-of-the-art performance across DeepResearch Bench, DeepConsult, and DeepResearchGym.
Notably, WebWeaver achieves a citation accuracy of 93.37% and the highest overall report quality, demonstrating the efficacy of dynamic outline optimization and hierarchical synthesis. The system also achieves near-perfect scores in depth and breadth on DeepResearchGym, indicating exhaustive topic coverage and robust structural organization.
Outline Optimization and Hierarchical Writing
Ablation studies show that iterative outline optimization yields monotonic improvements in both end-to-end report quality and LLM-judged outline quality. Each additional round of optimization increases comprehensiveness, insight, and evidentiary support, which is direct evidence against fixing the outline before evidence collection.
Figure 4: Distribution of outline optimization rounds, showing the prevalence and necessity of multiple refinement cycles.
Figure 5: End-to-end report scores increase with additional outline optimization rounds.
Figure 6: LLM-judged outline quality improves with each optimization round, especially in depth, breadth, and support.
Hierarchical writing is shown to be strictly superior to brute-force writing (e.g., LongWriter-style), with significant gains in insight, readability, and support. The targeted retrieval-and-pruning mechanism is essential for preventing context-bleeding and for ensuring claims are tightly linked to evidence.

Figure 7: Hierarchical writing outperforms brute-force writing across all report quality metrics.
Agentic Finetuning and Data Generation
To enable smaller models to acquire the complex skills of planning, searching, and writing, the authors construct WebWeaver-3k, a high-quality SFT dataset generated by the framework. Fine-tuning on this dataset yields substantial improvements in citation accuracy (from 25% to 85.9%) and overall report quality, demonstrating that the agentic workflow is learnable and transferable.

Figure 8: Outline optimization statistics in the WebWeaver-SFT training set, reflecting the complexity and depth of the agentic trajectories.
Implications and Future Directions
WebWeaver demonstrates that OEDR can be reframed as a structured, agentic process, where dynamic planning and targeted synthesis are orchestrated via explicit tool use and memory management. This approach not only achieves superior empirical results but also provides a blueprint for future research in agentic reasoning, long-context management, and automated knowledge work.
The framework's modularity and compatibility with various LLMs suggest broad applicability. The success of agentic finetuning on smaller models indicates a path toward democratizing high-quality research agents without reliance on proprietary, large-scale systems. The explicit citation mechanism and memory-centric architecture also provide a foundation for future work in verifiable, auditable AI systems.
Conclusion
WebWeaver establishes a new state-of-the-art in open-ended deep research by integrating dynamic outline optimization with hierarchical, memory-grounded synthesis. The dual-agent framework overcomes the limitations of static pipelines and brute-force generation, achieving high factuality, coherence, and insight. The results validate the necessity of adaptive planning and focused synthesis for complex, information-intensive tasks, and the methodology provides a scalable template for future agentic systems in AI research and beyond.