Overview of DeepResearch Bench: A Benchmark for LLM-Based Deep Research Agents
The paper "DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents" introduces a novel benchmark designed to evaluate the capabilities of Deep Research Agents (DRA) that transform online information into analyst-grade reports autonomously. The proposed benchmark seeks to address the absence of a systematic framework for assessing the performance of DRAs across various domains, detailing methodologies aimed at aligning closely with human judgment.
Deep Research Agents are an increasingly widely used category of LLM-based agents. They autonomously carry out complex tasks such as multi-step web exploration, targeted retrieval, and synthesis, producing citation-rich reports in far less time than manual research. Evaluating them is difficult, however: report quality is open-ended, and retrieval capability must be assessed without visibility into the agent's internal process.
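To make the workflow concrete, here is a minimal, runnable sketch of the search-read-synthesize loop such an agent typically runs. Everything below is illustrative: the helper functions are stubs standing in for real search APIs and LLM calls, not any particular agent's implementation.

```python
# Illustrative deep-research agent loop. All helpers are hypothetical
# stand-ins for real search and LLM tooling, stubbed so the control
# flow runs as-is.

def search_web(query: str) -> list[str]:
    return [f"https://example.com/{hash(query) % 100}"]   # stub: URLs for a query

def fetch_and_extract(url: str, question: str) -> list[tuple[str, str]]:
    return [(f"claim from {url}", url)]                   # stub: (statement, source)

def propose_next_query(question: str, notes: list) -> str | None:
    # Stub for the planning step: stop after a few pieces of evidence.
    return None if len(notes) >= 3 else f"{question} (follow-up {len(notes)})"

def write_report(question: str, notes: list) -> str:
    cites = "\n".join(f"- {s} [{u}]" for s, u in notes)
    return f"Report on: {question}\n{cites}"

def deep_research(question: str, max_steps: int = 5) -> str:
    notes: list[tuple[str, str]] = []          # evidence gathered so far
    query: str | None = question
    for _ in range(max_steps):
        if query is None:                      # agent decides it has enough
            break
        for url in search_web(query):          # targeted retrieval
            notes.extend(fetch_and_extract(url, question))
        query = propose_next_query(question, notes)   # multi-step planning
    return write_report(question, notes)       # citation-rich synthesis

print(deep_research("impact of LLM agents on research workflows"))
```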
DeepResearch Bench Characteristics and Construction
DeepResearch Bench comprises 100 tasks spanning 22 domains, each crafted in collaboration with domain experts to be demanding and relevant. The tasks reflect authentic research demands, derived from statistical analysis of more than 96,000 real-world user queries; this data-driven construction both ensures coverage across diverse sectors and keeps the tasks aligned with genuine user needs.
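For illustration, a single task record might be represented as below. The schema and field values are hypothetical, chosen only to show the kind of structure a domain-expert-authored task could carry; they are not the benchmark's actual format.

```python
# Hypothetical sketch of one benchmark task record (illustrative field
# names, not the benchmark's real schema).
from dataclasses import dataclass

@dataclass(frozen=True)
class ResearchTask:
    task_id: str    # unique identifier within the benchmark
    domain: str     # one of the 22 covered fields, e.g. "Finance"
    prompt: str     # the research question posed to the agent
    language: str   # language the task is posed in

task = ResearchTask(
    task_id="finance-007",
    domain="Finance",
    prompt="Analyze the impact of central bank digital currencies on retail banking.",
    language="en",
)
```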
Two evaluation methodologies, RACE and FACT, are presented to measure report quality and information-retrieval capability, respectively:
- RACE (Reference-based Adaptive Criteria-driven Evaluation with Dynamic Weighting): This framework assesses the quality of generated reports along four dynamically weighted dimensions (Comprehensiveness, Insight, Instruction-Following, and Readability), with the explicit goal of matching human judgment. By scoring each report against a high-quality reference report, using criteria generated per task, RACE avoids the common pitfalls of static evaluation checklists and of scoring reports in isolation (a scoring sketch follows this list).
- FACT (Factual Abundance and Citation Trustworthiness): This framework assesses how effectively DRAs retrieve and cite web information. It extracts statement-URL pairs from each report and judges whether each cited page actually supports its statement, yielding two metrics, citation accuracy and average effective citations, that capture the practical reliability of the cited information (metric sketches also follow the list).
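To make RACE's mechanics concrete, here is a minimal sketch of reference-relative, dynamically weighted scoring. It assumes a judge model has already produced per-dimension scores for both the evaluated report and the reference report; the normalization and example numbers are assumptions for illustration, not the paper's exact formulas.

```python
# Minimal RACE-style aggregation sketch (illustrative, not the paper's
# precise formula): per-dimension judge scores for a target report are
# normalized against a reference report, then combined with task-specific
# ("dynamic") weights.

DIMENSIONS = ("comprehensiveness", "insight", "instruction_following", "readability")

def race_score(target: dict[str, float],
               reference: dict[str, float],
               weights: dict[str, float]) -> float:
    """Weighted, reference-relative aggregate of per-dimension judge scores.

    target/reference: per-dimension scores (e.g. 0-10) from a judge model.
    weights: per-task dimension weights summing to 1; generating these per
    task rather than fixing them globally is the "dynamic weighting".
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    total = 0.0
    for dim in DIMENSIONS:
        t, r = target[dim], reference[dim]
        relative = t / (t + r) if (t + r) > 0 else 0.5   # reference-relative score
        total += weights[dim] * relative
    return total   # in [0, 1]; 0.5 means on par with the reference

# Example: a task whose dynamic weights emphasize insight.
weights = {"comprehensiveness": 0.3, "insight": 0.4,
           "instruction_following": 0.2, "readability": 0.1}
target = {"comprehensiveness": 8.0, "insight": 7.0,
          "instruction_following": 9.0, "readability": 8.5}
reference = {"comprehensiveness": 8.5, "insight": 8.0,
             "instruction_following": 8.5, "readability": 8.0}
print(f"RACE-style score: {race_score(target, reference, weights):.3f}")
```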
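Similarly, once statement-URL pairs have been extracted and a judge has labeled whether each source supports its statement, FACT-style metrics reduce to simple counts. The sketch below is a plausible rendering under those assumptions, not the paper's exact pipeline.

```python
# FACT-style metrics sketch: assumes statement-URL extraction and
# support judgment have already been performed upstream.
from dataclasses import dataclass

@dataclass
class CitedStatement:
    statement: str
    url: str
    supported: bool   # judge verdict: does the cited page back the claim?

def citation_accuracy(pairs: list[CitedStatement]) -> float:
    """Fraction of statement-URL pairs whose source supports the statement."""
    return sum(p.supported for p in pairs) / len(pairs) if pairs else 0.0

def effective_citations(pairs: list[CitedStatement]) -> int:
    """Supported-citation count for one report; averaging this count over
    a task set gives an 'average effective citations' figure."""
    return sum(p.supported for p in pairs)
```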
Experimental Evaluation and Findings
The paper evaluates a range of systems, including Gemini 2.5 Pro Deep Research, OpenAI Deep Research, Perplexity Deep Research, and several general-purpose LLMs equipped with web search. Among these, Gemini 2.5 Pro Deep Research showed notable strengths, achieving the highest scores on several dimensions, particularly Effective Citations, underscoring its capability for comprehensive information retrieval.
Furthermore, empirical validation through human-consistency studies showed that RACE aligns well with human judgment: automated scores agreed at high rates with assessments by domain experts, supporting the framework's reliability for evaluating DRA-generated reports.
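One simple way such consistency can be quantified (an assumption here, not necessarily the paper's exact protocol) is pairwise preference agreement: for every pair of evaluated systems, check whether the automated metric and the human expert rank them the same way.

```python
# Pairwise preference agreement sketch between an automated metric and
# human expert scores (hypothetical data; illustrative protocol).
from itertools import combinations

def pairwise_agreement(auto: dict[str, float], human: dict[str, float]) -> float:
    """Share of system pairs ranked the same way by metric and human."""
    pairs = list(combinations(auto, 2))
    agree = sum(
        (auto[a] - auto[b]) * (human[a] - human[b]) > 0   # same preference direction
        for a, b in pairs
    )
    return agree / len(pairs) if pairs else 0.0

auto_scores = {"agent_a": 0.71, "agent_b": 0.64, "agent_c": 0.58}
human_scores = {"agent_a": 8.2, "agent_b": 7.9, "agent_c": 6.5}
print(f"pairwise agreement: {pairwise_agreement(auto_scores, human_scores):.2f}")
```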
Implications and Future Directions
DeepResearch Bench carries significant practical and theoretical implications. By providing a benchmark closely aligned with real-world research needs, it can accelerate the development of DRAs and pave the way for greater automation in research settings. The RACE and FACT methodologies also scale beyond deep research, offering broad applicability to other LLM evaluation contexts.
Future work could expand the benchmark toward more diverse and robust task coverage and incorporate additional external review to mitigate domain bias. With greater computational and annotation capacity, the frameworks could also be refined through more extensive human-consistency studies. Overall, DeepResearch Bench represents a crucial step toward AI-driven research systems that are both effective and closely aligned with practical user expectations and needs.