ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry (2507.16280v1)

Published 22 Jul 2025 in cs.AI

Abstract: The emergence of deep research systems presents significant capabilities in problem-solving, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced, agentic systems - which we refer to as Deep AI Research Systems (DARS) - on frontier AI scientific questions. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios such as laboratory discussions and interviews, spanning 35 different AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. Results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions. Such capabilities represent a meaningful step toward AI self-improvement, aligning with the vision of ASI for AI. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective in AI research evaluation for a novel pattern of scientific collaboration: https://github.com/GAIR-NLP/ResearcherBench.

Summary

  • The paper introduces ResearcherBench, a benchmark that evaluates Deep AI Research Systems on frontier scientific questions using rubric and factual assessments.
  • It details a three-step framework: dataset collection from real research scenarios, rubric assessment for insight quality, and factual assessment for citation accuracy.
  • Findings highlight superior performance in open consulting queries by systems like OpenAI Deep Research and Gemini Deep Research, setting new standards for AI research evaluation.

ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry

Introduction

"ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry" explores the evaluation of Deep AI Research Systems (DARS) using a novel benchmark designed to assess their capability to address frontier scientific questions. While existing benchmarks focus on evaluating AI as agents for web retrieval and report generation, "ResearcherBench" specifically targets the assessment of DARS on frontier AI scientific questions.

The paper presents a dataset of 65 research questions drawn from real-world scenarios such as laboratory discussions and interviews, spanning 35 AI subjects. The evaluation framework pairs rubric assessment, which scores insight quality against expert-designed criteria, with factual assessment, which measures citation accuracy (faithfulness) and citation coverage (groundedness). OpenAI Deep Research and Gemini Deep Research perform particularly well on open-ended consulting questions, which the authors frame as a meaningful step toward AI self-improvement (Figure 1).

Figure 1: ResearcherBench Framework Overview. The framework consists of three main components from top to bottom: (1) Dataset collection from authentic research scenarios leading to expert-generated rubrics, (2) Rubric assessment to evaluate coverage against rubrics, and (3) Factual assessment to measure faithfulness and groundedness scores.

ResearcherBench Framework

The ResearcherBench framework involves three primary steps: dataset collection, rubric assessment, and factual assessment.

  • Dataset Collection: The dataset comprises 65 research questions across 35 AI subjects, sourced from real-world scientific scenarios such as lab discussions and interviews. Questions are categorized into three types: technical details, literature review, and open consulting.
  • Rubric Assessment: Expert-designed criteria are used to evaluate the quality of DARS-generated insights. Each question is decomposed into weighted rubric items targeting understanding, rigor, and analytical depth, and a report is scored by how much of the weighted rubric it covers.
  • Factual Assessment: This step measures citation accuracy (faithfulness) and overall citation coverage (groundedness). Claims are extracted from each report and checked against their cited sources, and the results are aggregated into faithfulness and groundedness scores (Figure 2). A minimal scoring sketch follows the figure below.

    Figure 2: AI Benchmark Topic Distribution with Representative Examples. Left Side: Pie chart showing the distribution of AI subjects in the benchmark. Right Side: Concrete question examples from major subjects.
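
The paper's exact scoring formulas are not reproduced in this summary, but the sketch below illustrates one way the dual evaluation could be computed, assuming an LLM judge returns binary decisions per rubric item and per extracted claim. The data structures and function names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricItem:
    criterion: str   # expert-written requirement for a strong answer
    weight: float    # relative importance assigned by the expert
    covered: bool    # judge decision: does the report satisfy this criterion?


@dataclass
class Claim:
    text: str            # factual statement extracted from the report
    has_citation: bool   # does the claim cite a source?
    supported: bool      # judge decision: does the cited source support it?


def rubric_coverage(items: List[RubricItem]) -> float:
    """Weighted fraction of expert rubric items covered by the report."""
    total = sum(i.weight for i in items)
    return sum(i.weight for i in items if i.covered) / total if total else 0.0


def faithfulness(claims: List[Claim]) -> float:
    """Among cited claims, the fraction actually supported by their sources."""
    cited = [c for c in claims if c.has_citation]
    return sum(c.supported for c in cited) / len(cited) if cited else 0.0


def groundedness(claims: List[Claim]) -> float:
    """Fraction of all extracted claims that are backed by a citation."""
    return sum(c.has_citation for c in claims) / len(claims) if claims else 0.0
```

Read this way, higher faithfulness means cited claims are actually supported by their sources, while higher groundedness means a larger share of the report's claims are tied to sources at all.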

Experimental Evaluation

The paper evaluates several leading commercial DARS, including OpenAI Deep Research and Gemini Deep Research, along with baseline systems, using the dual evaluation framework. The strongest systems excel at open consulting questions but vary in performance on technical details and literature review tasks. Notably, OpenAI Deep Research achieves the best rubric-assessment results, while Gemini Deep Research shows a more balanced citation strategy.

Key findings include a limited correlation between groundedness scores and overall research quality, and a clear advantage of DARS over LLMs equipped with basic web search on frontier research tasks (Figure 3).

Figure 3: Performance Analysis by Question Type (Rubric Assessment Coverage). Performance comparison across different question types for Deep AI Research Systems. Each system shows varying strengths across open consulting, technical details, and literature review categories.
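
The limited correlation between groundedness and research quality can be illustrated with a generic rank-correlation check like the one below. This is an illustrative sketch using SciPy, not the paper's analysis code; the per-system score dictionaries are assumed inputs.

```python
from typing import Dict

from scipy.stats import spearmanr


def groundedness_vs_quality(groundedness: Dict[str, float],
                            rubric_coverage: Dict[str, float]) -> float:
    """Spearman rank correlation between per-system groundedness and rubric
    coverage; a value near zero would reflect the weak relationship the paper
    reports between citation coverage and insight quality."""
    systems = sorted(set(groundedness) & set(rubric_coverage))
    scores_g = [groundedness[s] for s in systems]
    scores_r = [rubric_coverage[s] for s in systems]
    rho, _pval = spearmanr(scores_g, scores_r)
    return rho
```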

Implications and Future Directions

The ResearcherBench initiative provides a standardized platform for advancing the development of AI research assistants. The findings point toward AI systems acting as genuine research partners, capable of surfacing novel insights and conducting complex scientific inquiry. The authors frame this as a step toward accelerated AI self-improvement, aligned with the paper's vision of Artificial Superintelligence (ASI) for AI.

Future work could involve expanding the benchmark to additional scientific domains, enabling cross-domain evaluation of DARS capabilities. Continuous updates to the benchmark, incorporating emerging scientific questions, could further maintain its relevance as frontier research evolves.

Conclusion

ResearcherBench emerges as a pivotal tool for evaluating DARS, emphasizing not only the ability to retrieve information but also the ability to engage deeply with frontier scientific questions and produce insights about them. By promoting the development of next-generation AI research assistants, ResearcherBench lays the groundwork for AI systems that serve as valued partners in scientific discovery.
