DeepResearchBench: AI Research Benchmark
- DeepResearchBench is a dual-axis evaluation framework that rigorously benchmarks deep AI research systems on unsolved, open-ended scientific questions.
- It integrates expert rubrics to assess both insight quality and factual reliability, promoting methodological rigor and innovative problem-solving.
- The framework uses a diverse set of 65 realistic research tasks to drive multi-step planning, creative synthesis, and precise citation verification.
DeepResearchBench is a specialized benchmark and dual-axis evaluation framework designed to rigorously assess advanced “Deep AI Research Systems” (DARS) on genuinely open-ended, frontier scientific questions in artificial intelligence. Unlike conventional benchmarks, which emphasize retrieval and summarization of established knowledge, DeepResearchBench evaluates systems on their capability to analyze, synthesize, and generate novel insights in scenarios where answers are not predetermined and where scientific creativity is required (Xu et al., 22 Jul 2025).
1. Motivation and Scope
DeepResearchBench originates from two primary observations:
- Existing AI agent benchmarks (e.g., RAG-based QA, report-generation, web-browsing) primarily assess the ability to retrieve and summarize prior work, failing to measure understanding, deep analysis, or innovation on open research problems.
- As research assistant systems evolve toward agentic behavior—including multi-step planning, iterative tool use, and complex strategy—they require evaluation protocols that reward conceptual depth, methodical rigor, and creativity, especially on unsolved, high-impact research tasks.
The benchmark is thus specifically targeted at the frontier of scientific research in AI, where answers are emergent and systems must move beyond rote knowledge extraction to genuine scientific partnership.
2. Dataset Composition
The question set in DeepResearchBench consists of 65 research questions reflecting realistic and challenging AI research scenarios. Key construction principles:
- Sources: Laboratory discussions from AI research teams, semi-structured interviews with senior scientists, technical issues and threads from scientific forums.
- Coverage: 35 AI subfields, including but not limited to model architectures, multimodal fusion, AI ethics, training regimes, and rapidly emerging ML paradigms.
- Question Types:
  - Technical Details (12): Deep explanations of underlying algorithms, theoretical concepts, or system designs.
  - Literature Review (20): Synthesis-oriented, comparison-focused summaries of recently published research, including trend and gap identification.
  - Open Consulting (33): Forward-looking, high-level strategy or design questions requiring creative, nontrivial judgment.
These tasks are curated to reflect authentic, unmet needs in real research environments, focusing on the breadth and depth of understanding and innovation.
3. Dual Evaluation Framework
DeepResearchBench adopts a two-pronged evaluation strategy to rigorously capture both the quality of insight and factual reliability in agent output.
3.1 Rubric Assessment (Insight Quality)
- Insight Extraction: For each task, domain experts identify “key insights” from context materials.
- Rubric Creation: Insights are translated into a set of binary rubric items $r_{i,1}, \dots, r_{i,m_i}$, each assigned a positive real-valued weight $w_{i,j} > 0$ to reflect its importance.
- System Assessment: For each system response to question $q_i$, an expert-guided LLM judge marks each rubric item as satisfied or not, $s_{i,j} \in \{0, 1\}$.
- Weighted Coverage Score:

$$\mathrm{Coverage}_i = \frac{\sum_{j=1}^{m_i} w_{i,j}\, s_{i,j}}{\sum_{j=1}^{m_i} w_{i,j}}$$
This captures how comprehensively the system’s answer aligns with the manually distilled key insights, rewarding coverage of significant elements and penalizing omissions of high-priority points.
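The weighted coverage computation above can be sketched in a few lines of Python (an illustrative implementation, not the benchmark's released code; the `(weight, satisfied)` pair representation is an assumption):

```python
def weighted_coverage(rubric):
    """Weighted coverage over binary rubric items.

    `rubric` is a list of (weight, satisfied) pairs, where `weight` is a
    positive float and `satisfied` is 1 if the judge marked the item as
    covered by the system's response, else 0.
    """
    total_weight = sum(w for w, _ in rubric)
    if total_weight == 0:
        return 0.0
    return sum(w * s for w, s in rubric) / total_weight

# Example: three rubric items; the high-weight insight is missed,
# so the score drops well below the 2/3 unweighted hit rate.
print(weighted_coverage([(3.0, 0), (1.0, 1), (1.0, 1)]))  # 0.4
```

Because the score is normalized by the total weight, omitting a high-priority insight costs proportionally more than omitting a minor one, which is the intended penalty structure.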
3.2 Factual Assessment (Faithfulness & Groundedness)
- Claim Extraction: All factual claims and their associated URLs are extracted from the system response.
- Verification: For every cited claim, an LLM judge verifies whether the cited document supports the claim.
- Metrics: let $N_{\mathrm{tot}}$ denote the total number of factual claims in a response, $N_{\mathrm{cit}}$ the number of claims that carry citations, and $N_{\mathrm{sup}}$ the number of cited claims supported by their cited documents.
- Faithfulness (precision of citation support):

$$\mathrm{Faithfulness} = \frac{N_{\mathrm{sup}}}{N_{\mathrm{cit}}}$$

- Groundedness (fraction of claims that are cited):

$$\mathrm{Groundedness} = \frac{N_{\mathrm{cit}}}{N_{\mathrm{tot}}}$$
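Under these definitions, both factual metrics follow directly from per-claim verification flags. The sketch below is illustrative only; the dict shape for claim records is an assumption, not the benchmark's released format:

```python
def factual_metrics(claims):
    """Compute (faithfulness, groundedness) for one system response.

    `claims` is a list of dicts, each with:
      - "cited": True if the claim carries a citation URL
      - "supported": True if the LLM judge found that the cited
        document supports the claim (meaningful only when cited)
    """
    n_total = len(claims)
    cited = [c for c in claims if c["cited"]]
    n_cited = len(cited)
    n_supported = sum(1 for c in cited if c["supported"])
    faithfulness = n_supported / n_cited if n_cited else 0.0
    groundedness = n_cited / n_total if n_total else 0.0
    return faithfulness, groundedness

# Example: 4 claims, 2 cited, 1 of the cited ones verified as supported.
claims = [
    {"cited": True, "supported": True},
    {"cited": True, "supported": False},
    {"cited": False, "supported": False},
    {"cited": False, "supported": False},
]
print(factual_metrics(claims))  # (0.5, 0.5)
```

Note that the two metrics decouple: a system that cites rarely but accurately scores high on faithfulness and low on groundedness, which matches the pattern reported in Section 4.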
4. Experimental Evaluation and Key Findings
A systematic evaluation was conducted on five commercial DARS and two advanced LLM+search baselines. Systems included OpenAI Deep Research, Gemini Deep Research, Grok 3 DeepSearch, Perplexity Deep Research, and others.
Aggregate results (all 65 questions):
- Rubric Coverage: OpenAI Deep Research (0.703) and Gemini Deep Research (0.693) lead, outperforming all other systems by 20–30%.
- Faithfulness: All DARS achieve high citation correctness (≈0.80–0.86).
- Groundedness: Scores are lower overall (≈0.31–0.59), indicating frequent occurrence of uncited factual claims; Gemini leads the balance between faithfulness and groundedness (0.86/0.59).
Task-type breakdown:
- Open Consulting questions are consistently the easiest category for DARS, which achieve their highest rubric coverage there, demonstrating strong adaptive and strategic reasoning.
- Technical Details and Literature Review questions are more challenging; Gemini Deep Research leads in technical depth, OpenAI Deep Research in synthesized literature reviews.
| System | Rubric Coverage | Faithfulness | Groundedness |
|---|---|---|---|
| OpenAI Deep Research | 0.703 | 0.80–0.86 | 0.31–0.59 |
| Gemini Deep Research | 0.693 | 0.86 | 0.59 |
| Others | 0.53–0.57 | 0.80–0.85 | 0.31–0.52 |
Other notable points:
- For all systems, performance on citation faithfulness is consistently higher than groundedness, highlighting a tendency for DARS to omit citations even when producing factual outputs.
- Open-ended questions allow top systems to demonstrate multi-step, creative, and strategic reasoning, which aligns closely with the aspirational goals of agentic research.
5. Access, Extension, and Research Workflow
The benchmark, along with expert rubrics, judge prompts, and evaluation code, is open-sourced at https://github.com/GAIR-NLP/ResearcherBench (Xu et al., 22 Jul 2025).
With the released materials, researchers can:
- Run evaluations on new or customized DARS by using standardized scripts for data collection, insight rubric assessment, and factual verification.
- Add new tasks, extend rubrics, or experiment with alternative evaluation metrics and judge-model implementations.
- Apply the framework to emerging research domains beyond core AI by designing appropriate task and rubric structures grounded in authentic scientific workflows.
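A new task with its rubric could take a shape like the following (a hypothetical schema for illustration; the actual file format in the ResearcherBench repository may differ):

```python
import json

# Hypothetical task record: one question, its type tag, and a weighted
# binary rubric distilled from expert-identified key insights.
task = {
    "id": "custom-001",
    "question_type": "Open Consulting",
    "question": "How should a lab prioritize compute between pretraining "
                "ablations and post-training experiments?",
    "rubric": [
        {"insight": "Discusses marginal value of additional ablations",
         "weight": 2.0},
        {"insight": "Considers transferability of post-training findings",
         "weight": 1.5},
        {"insight": "Cites recent evidence rather than folklore",
         "weight": 1.0},
    ],
}

# Round-trip through JSON, as a file-based evaluation harness would.
restored = json.loads(json.dumps(task))
print(restored["question_type"], len(restored["rubric"]))  # Open Consulting 3
```

Keeping rubric items binary and individually weighted means any new task plugs directly into the coverage scoring of Section 3.1 without changes to the judge prompts.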
6. Implications, Limitations, and Future Directions
DeepResearchBench reorients the field away from purely retrieval-based or summarization-centric evaluation toward a direct measurement of conceptual understanding, methodological synthesis, and originality—factors vital for genuine scientific collaboration.
Implications:
- DARS with high performance on DeepResearchBench move closer to AI research systems capable of autonomous self-improvement and methodological innovation in alignment with long-term visions of AI-assisted science.
- The dual-metric framework provides direct feedback for improving both insight richness and citation discipline within research workflows.
Limitations and Extensions:
- The evaluation protocol centers on AI research scenarios and open-ended scientific questions; its generalizability to clinical, legal, or engineering domains requires domain-specific adaptations.
- The current dataset size (65 tasks) offers a focused but not exhaustive challenge set; continual expansion and periodic refreshment are needed to maintain coverage of new scientific frontiers and methodologies.
- The evaluation of uncited factual statements remains an open area—future benchmarks may further incentivize citation completeness or develop mechanisms for verifying ungrounded reasoning chains.
By centering the evaluation on “understanding, insight, and innovation” and automating detailed, expert-grounded assessment workflows, DeepResearchBench establishes a rigorous foundation for next-generation AI research assistants and fosters new patterns of AI-human scientific collaboration (Xu et al., 22 Jul 2025).