ResearchQA Benchmark
- ResearchQA is a large-scale resource that converts survey articles into 21,000 expert-level queries and 160,000 rubric items for evaluating scholarly QA.
- It systematically measures citation quality, explanation depth, and limitation coverage to diagnose and compare diverse QA system competencies.
- Ph.D.-level human assessments combined with automated rubric-based judging reveal significant gaps in current systems' ability to produce nuanced, research-level discourse.
ResearchQA is a large-scale resource for evaluating long-form question answering (QA) systems using queries and evaluation rubrics distilled from survey articles spanning 75 research fields. It addresses the need for comprehensive, multi-field QA benchmarking by systematically transforming survey literature into 21,000 research queries and 160,000 rubric items that specify granular, query-dependent evaluation criteria—including citation, explanation, and limitation coverage. The dataset, coupled with extensive Ph.D.-level human assessment and robust automated judges, enables both the diagnosis of system competency gaps and end-to-end system comparisons for scholarly QA across diverse domains (Yifei et al., 30 Aug 2025).
1. Construction of the ResearchQA Resource
The creation of ResearchQA begins by identifying survey articles from 75 distinct research areas. These survey articles provide dense coverage of field-specific trends, methods, and open challenges, making them suitable for formulating expert-level QA benchmarks. The process involves two key transformations, illustrated by the schematic sketch that follows the list:
- Query Extraction: Each survey article is segmented into distinct topical sections (e.g., “Limitations of Approach A,” “Future Directions,” “Comparisons with Alternative Methods”). These segments are individually converted into research queries designed to probe a model’s ability to generate detailed, scholarly responses.
- Rubric Design: Alongside each query, a set of rubric items is distilled. These items dictate what constitutes an ideal answer—such as whether the response should cite key papers, explain core concepts, discuss limitations, or compare multiple approaches. These rubrics reflect the depth and evaluative standards present in survey discourse.
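The sketch below models what the output of these two transformations might look like for a single survey section. It is a minimal illustration only: the class and field names, the credit taxonomy, and the example content are assumptions, not the released dataset's schema or distillation pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    """One query-dependent evaluation criterion distilled from a survey section (hypothetical schema)."""
    criterion: str    # e.g. "citation", "explanation", "limitation", "comparison"
    description: str  # what an ideal answer should contain for this item

@dataclass
class ResearchQuery:
    """A research query plus its rubric, both distilled from one survey section."""
    field_of_study: str
    source_section: str   # e.g. "Limitations of Approach A"
    question: str
    rubric: list[RubricItem] = field(default_factory=list)

# Invented example for illustration only.
example = ResearchQuery(
    field_of_study="natural language processing",
    source_section="Limitations of Approach A",
    question=("What are the main unresolved limitations of Approach A, "
              "and how do competing methods address them?"),
    rubric=[
        RubricItem("citation", "cites the key papers the survey names for Approach A"),
        RubricItem("limitation", "identifies the shortcomings the survey highlights"),
        RubricItem("comparison", "contrasts Approach A with at least one alternative method"),
    ],
)
```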
If formalized, the process may be seen as computing an answer coverage score

$$\mathrm{Coverage}(a) \;=\; \frac{\sum_{i} w_i \, s_i(a)}{\sum_{i} w_i},$$

where the $w_i$ encode the relative weights for each criterion as determined by the source survey content, and $s_i(a) \in [0, 1]$ is the judged degree to which answer $a$ satisfies rubric item $i$ (no formula appears explicitly in the source data; this is a direct reflection of the survey-driven rubric structure).
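A minimal code sketch of this score, assuming judged credits on a three-level scale (full, partial, none mapped to 1, 0.5, 0); both the scale and the function signature are illustrative assumptions rather than the benchmark's released scoring code:

```python
def coverage_score(weights: list[float], credits: list[float]) -> float:
    """Weighted answer coverage: sum_i(w_i * s_i) / sum_i(w_i).

    weights[i] is the relative weight w_i of rubric item i, and credits[i] is
    the judged satisfaction s_i in [0, 1] (e.g. 0 missed, 0.5 partial, 1 full).
    """
    if len(weights) != len(credits):
        raise ValueError("need exactly one credit per rubric item")
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, credits)) / total

# Example: three rubric items; one missed, one partially and one fully satisfied.
print(coverage_score([1.0, 1.0, 0.5], [0.0, 0.5, 1.0]))  # -> 0.4
```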
2. Multi-Dimensional Evaluation Criteria
The ResearchQA resource emphasizes multi-criteria evaluation directly mapped from survey structure:
- Citations: Answers are evaluated on their integration and accurate referencing of significant papers highlighted in the source survey.
- Explanations: Depth and quality of explanation regarding methodologies, core concepts, or mechanisms discussed in survey sections.
- Limitations: Coverage of shortcomings or unresolved issues as articulated in survey material.
- Comparisons and Analyses: When survey sections contrast methodologies or results, rubric items demand accurate comparative analysis in system outputs.
Each rubric item is explicitly anchored to specific expectations grounded in scholarly communication, ensuring that system responses are evaluated against real research standards.
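Because each rubric item is anchored to a specific expectation, automated evaluation can be framed as item-level grading. The sketch below shows one plausible shape for such a judge, assuming an LLM-as-judge protocol that assigns full, partial, or no credit per item; the prompt wording, credit scale, and function names are assumptions, not the judges released with ResearchQA.

```python
# Hypothetical rubric-level judging helpers; the downstream call to an actual
# LLM judge is omitted.
CREDIT_LEVELS = {"full": 1.0, "partial": 0.5, "none": 0.0}

def build_judge_prompt(query: str, answer: str, rubric_item: str) -> str:
    """Compose a single-item grading prompt for an automated judge."""
    return (
        "You are grading a scholarly answer against one rubric item.\n"
        f"Query: {query}\n"
        f"Rubric item: {rubric_item}\n"
        f"Answer:\n{answer}\n\n"
        "Does the answer satisfy this rubric item? "
        "Reply with exactly one of: full, partial, none."
    )

def parse_credit(judge_reply: str) -> float:
    """Map the judge's verdict onto a numeric credit s_i, defaulting to 0."""
    return CREDIT_LEVELS.get(judge_reply.strip().lower(), 0.0)
```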
3. Ph.D. Annotator Judgments and Information Needs
A large-scale human evaluation was conducted using 31 Ph.D. annotators spanning 8 fields. They assessed how effectively system outputs addressed research-level information needs. Notable findings include:
- Query Support: 96% of the distilled queries were found to support genuine Ph.D.-level information needs.
- Rubric Relevance: 87% of rubric items were judged to warrant at least a sentence of substantive coverage in an ideal response.
- System Response Quality: While some QA systems approached criteria coverage, annotators consistently found that nuanced scholarly aspects—especially expert citation, depth of analysis, and critical assessment of limitations—were often addressed only superficially.
The assessment highlights the persistent gap between automated system outputs and the rigorous response standards typical of graduate-level research.
4. Comparative Analysis of QA System Competencies
The evaluation encompasses 18 different parametric, retrieval-augmented, and agentic QA systems. Analysis revealed:
- Citation competency was consistently weak: even the highest-ranking system fully satisfied less than 11% of citation-related rubric items.
- Limitation and comparison coverage also lagged; full credit was given for fewer than half of limitation (48%) and comparison (49%) rubric items even in the best agentic system.
- Maximum rubric coverage reached 75% for the top system, with no parametric or retrieval-augmented system exceeding 70% coverage of the rubric.
System responses frequently delivered plausible generalities but failed to replicate the thorough, evidence-based argumentation found in survey literature. Coverage of nuanced scholarly criteria—especially when multiple types (e.g., explanation, comparison, citation) are required within a single response—remained incomplete.
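Figures of this kind follow from item-level judgments by simple aggregation. A hedged sketch, assuming each judged record carries a system name, a criterion label, and a numeric credit; the record schema and system names are illustrative, not the benchmark's released format:

```python
from collections import defaultdict

def per_criterion_full_credit(records):
    """Fraction of rubric items per (system, criterion) that received full credit."""
    counts = defaultdict(lambda: [0, 0])  # (system, criterion) -> [full, total]
    for r in records:
        key = (r["system"], r["criterion"])
        counts[key][1] += 1
        if r["credit"] >= 1.0:
            counts[key][0] += 1
    return {key: full / total for key, (full, total) in counts.items()}

# Toy example with invented records.
records = [
    {"system": "agentic-A", "criterion": "citation", "credit": 1.0},
    {"system": "agentic-A", "criterion": "citation", "credit": 0.5},
    {"system": "agentic-A", "criterion": "limitation", "credit": 1.0},
]
print(per_criterion_full_credit(records))
# {('agentic-A', 'citation'): 0.5, ('agentic-A', 'limitation'): 1.0}
```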
5. Error Analysis and Common Deficiencies
Analysis of rubric-specific error rates indicated:
- Persistent citation errors: Many responses omitted key references or misrepresented source papers.
- Superficial limitation treatment: System answers often glossed over or failed to acknowledge critical limitations central to the query.
- Difficulty with comparative analysis: There was a high frequency of mistakes or omissions when rubric items required explicit contrast between methods or results discussed in surveys.
These patterns show that while contemporary QA systems may generate relevant surface content, capturing the depth, granularity, and evidence-rich commentary expected in scholarly answers remains an unsolved challenge.
6. Data Release and Future Research Trajectories
The dataset and code are made available to the research community to enable transparent, multi-field benchmarking and rapid iteration in scholarly QA. Key future directions highlighted include:
- Integrating domain-specific knowledge bases for improved citation and technical detail coverage.
- Developing architectures or evaluation metrics specifically attuned to multi-criteria, rubric-driven assessment.
- Advancing hybrid models that combine deep learning with symbolic reasoning to address the breadth and depth of scholarly discourse needed for near-expert QA performance.
By enabling scalable, rubric-based evaluation aligned with genuine research needs, ResearchQA establishes a foundation for advancing LLMs that can emulate scholarly rigor and analytic depth across diverse scientific and technical domains.