
ScholarQABench: LLM Literature Synthesis

Updated 17 September 2025
  • ScholarQABench is a multi-domain benchmark that rigorously evaluates LLMs on literature synthesis and citation accuracy using expert-crafted queries.
  • It employs a retriever-generator architecture and self-feedback loops to refine long-form answers and enhance evidence integration.
  • Its open-source framework facilitates reproducible research and model comparisons, advancing retrieval-augmented synthesis in scholarly domains.

ScholarQABench is a comprehensive evaluation suite created to assess the ability of LLMs to synthesize scientific literature through robust retrieval, citation, and knowledge integration. Distinguished by its expert-written, multi-domain queries and a rigorous citation-centric evaluation framework, ScholarQABench is central to the empirical study of LLM-mediated literature review, comparison, and synthesis in complex research domains.

1. Benchmark Overview and Structure

ScholarQABench is designed as a large-scale, multi-domain benchmark for literature search and synthesis, encompassing 2,967 expert-written queries and 208 long-form answers. Its construction draws on multiple scientific disciplines—including computer science, physics, neuroscience, and biomedicine—ensuring that benchmark tasks reflect the real-world heterogeneity and rigor of literature synthesis across research fields. Benchmark tasks cover both single-paper fact verification and complex multi-paper synthesis, the latter requiring integration and comparison of evidence from multiple retrieved sources. Notably, questions are drawn from curated datasets such as SciFact, PubMedQA, and QASA (for single-document tasks) as well as newly authored multi-document synthesis queries and answers (e.g., ScholarQA-CS, ScholarQA-Bio, ScholarQA-Neuro, ScholarQA-Multi).
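To make the task structure concrete, the sketch below models a single benchmark item as it might be represented in evaluation code. The class name, field names, and example content are illustrative assumptions, not the published data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ScholarQAExample:
    """Illustrative container for one ScholarQABench item (field names are assumed)."""
    query: str                        # expert-written question
    domain: str                       # e.g. "cs", "biomed", "neuro", "physics"
    task_type: str                    # "single_paper" (SciFact, PubMedQA, QASA) or "multi_paper"
    reference_answer: Optional[str] = None                     # long-form expert answer, when available
    gold_paper_ids: List[str] = field(default_factory=list)    # papers the answer should draw on

# Example multi-paper synthesis item (content invented purely for illustration)
example = ScholarQAExample(
    query="How do retrieval-augmented LMs compare to closed-book LMs on citation accuracy?",
    domain="cs",
    task_type="multi_paper",
)
```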

2. Evaluation Protocols and Performance Metrics

The ScholarQABench evaluation suite employs a multifaceted and citation-sensitive protocol. Central metrics include:

  • Correctness: For single-document QA, accuracy measures are used (as in SciFact or PubMedQA); for long-form generation tasks (such as QASA), ROUGE-L measures overlap with reference answers.
  • Citation Accuracy: Operationalized via citation precision and citation recall, combined into a citation F1 score. Each citation-worthy statement is checked for correct attribution to evidence within the retrieved set, penalizing both fabricated and missing citations (a minimal scoring sketch follows this list).
  • Quality Dimensions: Relevance, coverage, and organization are assessed using fine-grained human and LLM-based rubrics, targeting not just factuality but also the scholarly qualities expected in research synthesis, such as logical structure and breadth of included evidence.
  • Overall Usefulness: Captured by rating how helpful the answer is for supporting a literature survey or review workflow, balancing informativeness, fidelity, and presentation.
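The following sketch shows how citation precision, recall, and F1 can be computed over a set of citation-worthy statements. The data layout and the per-statement rules are simplifying assumptions for illustration; the benchmark's official scorer may differ in detail.

```python
from typing import Dict, Set

def citation_scores(
    cited: Dict[str, Set[str]],         # statement id -> passage ids the model cites for it
    gold_support: Dict[str, Set[str]],  # statement id -> passage ids judged to support it
) -> Dict[str, float]:
    """Simplified citation metrics (data layout and exact definitions are assumptions)."""
    statements = set(gold_support)  # all citation-worthy statements

    n_citations = sum(len(c) for c in cited.values())
    n_correct = sum(len(cited.get(s, set()) & gold_support[s]) for s in statements)
    n_covered = sum(1 for s in statements if cited.get(s, set()) & gold_support[s])

    precision = n_correct / n_citations if n_citations else 0.0    # penalizes fabricated citations
    recall = n_covered / len(statements) if statements else 0.0    # penalizes missing citations
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"citation_precision": precision, "citation_recall": recall, "citation_f1": f1}
```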

These metrics collectively enable a nuanced comparison between model outputs and ground-truth human-constructed literature reviews, particularly in the dimension of citation grounding—a frequent failure mode for LLMs.

3. Model Comparisons and Empirical Observations

The benchmark supports comparative evaluation of both open and proprietary literature QA models. Notable observations include:

  • Retrieval-Augmented Models: OpenScholar-8B, a retrieval-augmented LM specifically trained for literature synthesis, outperforms proprietary models such as GPT-4o by approximately 5% in overall correctness and by 7% versus PaperQA2. When coupled with GPT-4o (OpenScholar-GPT4o), correctness increases by a further 12% over GPT-4o alone, evidencing the additive benefits of hybrid retrieval-generation architectures.
  • Citation Fidelity: While baseline LLMs like GPT-4o hallucinate up to 78–90% of citations (generating non-existent or spurious references), OpenScholar’s approach achieves citation accuracy on par with human experts, highlighting the centrality of retrieval-augmented pipelines in high-stakes academic synthesis.
  • Expert Preference: Pairwise evaluations show that responses from OpenScholar-8B and OS-GPT4o are preferred to expert-written answers in 51% and 70% of cases, respectively, in contrast to a mere 32% preference for GPT-4o, suggesting emerging parity—and occasional superiority—of these LMs over reference human answers.

4. Technical Design and Innovations

The ScholarQABench evaluation pipeline integrates several technical components:

  • Retriever-Generator Architecture: An input query $x$ is processed by a dense bi-encoder ($\theta_{\mathrm{bi}}$) and a cross-encoder reranker ($\theta_{\mathrm{cross}}$) to identify the top-$k$ relevant passages from a datastore $\mathbb{D}$ of 45 million open-access scientific papers. These passages, $\mathcal{R}(x, \mathbb{D})$, are concatenated with the query and passed to the generative LM $\mathcal{G}$, yielding an answer $y$ and citation set $\mathcal{C}$: $(y, \mathcal{C}) = \mathcal{G}(x, \mathcal{R}(x, \mathbb{D}))$.
  • Self-Feedback Inference Loop: Initial answers ($y_0$) are iteratively refined through the generation of natural language feedback ($f_1, \ldots, f_T$), which identifies gaps, missing evidence, or organizational issues; additional retrieval is triggered as needed. This loop improves both content quality and citation accuracy (a schematic sketch of the full inference loop follows this list).
  • Synthetic Training Data: The self-feedback loop generates high-quality synthetic data used to train the open-source OpenScholar-8B model, facilitating model efficiency and improved real-world generalization.
  • Human and LLM Assessment: Expert evaluations and LLM-based scoring (using models like Prometheus) provide reproducible, fine-grained assessments across multiple axes (organization, coverage, relevance), supplementing citation F1 and ROUGE-based metrics.
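A schematic view of how these components fit together is sketched below. The `retrieve`, `generate`, and `get_feedback` callables are placeholders for whatever retriever, reranker-backed search, and language models an implementation plugs in; this is not the OpenScholar API, only a minimal illustration of the inference loop described above.

```python
from typing import Callable, List, Tuple

def retrieval_augmented_answer(
    query: str,
    retrieve: Callable[[str, int], List[str]],                      # bi-encoder + reranker over datastore
    generate: Callable[[str, List[str]], Tuple[str, List[str]]],    # LM -> (answer, citations)
    get_feedback: Callable[[str, str], List[str]],                  # LM critique of the current draft
    max_rounds: int = 3,
    k: int = 10,
) -> Tuple[str, List[str]]:
    """Schematic retriever-generator inference with a self-feedback refinement loop."""
    passages = retrieve(query, k)                      # R(x, D): top-k reranked passages
    answer, citations = generate(query, passages)      # (y_0, C_0) = G(x, R(x, D))

    for _ in range(max_rounds):
        feedback = get_feedback(query, answer)         # natural-language critiques f_1..f_T
        if not feedback:
            break                                      # draft judged satisfactory
        for issue in feedback:                         # feedback may trigger additional retrieval
            passages.extend(retrieve(issue, k))
        prompt = query + "\nAddress the following feedback:\n" + "\n".join(feedback)
        answer, citations = generate(prompt, passages)

    return answer, citations
```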

5. Open-Source Availability and Community Impact

All resources are released under open-source licenses, including code, model checkpoints (OpenScholar-8B and retrievers), the evaluation framework, and the dataset (datastore variants peS2o V2/V3). The public demo (openscholar.allen.ai) supports real-time experimentation, and the full source is available at github.com/AkariAsai/OpenScholar, enabling reproducibility and extension.

This transparency and accessibility facilitate cross-lab comparisons and accelerate advancements in literature synthesis, establishing ScholarQABench as a de facto standard for benchmarking literature QA models.

6. Interpretation, Limitations, and Future Directions

Evaluation results indicate that current LLMs, despite retrieval augmentation and iterative refinement, still struggle with comprehensive evidence coverage and nuanced synthesis in multi-paper tasks. While open models have reached or exceeded human parity on certain metrics, citation grounding and relevance in complex, multi-hop synthesis remain challenging, as reflected in both automatic metrics and human evaluations. The design of ScholarQABench, spanning correctness, citation fidelity, and scholarly writing quality, lays the groundwork for targeted improvements in retrieval-augmented reasoning architectures, synthetic feedback loops, and more robust citation verification strategies.

Further, the link between self-improving inference, open re-usable infrastructure, and peer-reviewed evaluation is expected to catalyze principled research into literature-aware AI systems for both research and applied domains.

7. Technical Formulations

ScholarQABench formalizes its critical mechanisms in explicit notation. For example, citation F1 combines citation precision and recall:

$$\text{Citation-F1} = 2 \times \frac{\text{Citation-Precision} \times \text{Citation-Recall}}{\text{Citation-Precision} + \text{Citation-Recall}}$$

where each citation-worthy segment is independently checked against retrieved evidence. The main retrieval-augmented inference step is written as

$$(y, \mathcal{C}) = \mathcal{G}(x, \mathcal{R}(x, \mathbb{D}))$$

where $x$ is the query, $\mathbb{D}$ the datastore, and $\mathcal{R}$ the retrieval module.
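For concreteness, one plausible reading of the two components (stated here as an assumption rather than the benchmark's verbatim scoring rules) is: with $S$ the set of citation-worthy statements, $C$ the set of generated citations, $C^{+} \subseteq C$ the citations judged to support their statements, and $S^{+} \subseteq S$ the statements backed by at least one supporting citation,

$$\text{Citation-Precision} = \frac{|C^{+}|}{|C|}, \qquad \text{Citation-Recall} = \frac{|S^{+}|}{|S|}.$$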

These technical details ensure that the benchmark and its supporting models are grounded in objective, quantifiable methodologies, facilitating both interpretability and extensibility.


ScholarQABench, as introduced via OpenScholar and its associated ecosystem (Asai et al., 21 Nov 2024), provides a principled, open-source benchmark for evaluating LLM-driven scientific literature synthesis, with clear protocols, community-accessible tooling, and a rigorous focus on citation fidelity and domain coverage. Its adoption is expected to inform both the academic study and the practical deployment of retrieval-augmented synthesis systems in scholarly contexts.
