
PeerQA: Scientific QA Benchmark

Updated 11 August 2025
  • The dataset’s main contribution is providing authentic, peer review–derived questions with expert annotations for evaluating evidence retrieval and answer generation.
  • PeerQA uses a decontextualization pipeline to transform compound, context-bound queries into standalone questions to enhance clarity and model performance.
  • The benchmark supports multi-task evaluation including retrieval (using MRR and Recall@10), unanswerable classification, and generation assessed via ROUGE-L and AlignScore.

PeerQA is a scientific question answering dataset constructed from authentic peer review exchanges and author annotations, designed specifically for document-level QA over modern scientific articles. Unlike community-focused or classroom-oriented QA corpora, PeerQA leverages the natural dialogue between reviewer and author and targets tasks central to practical scientific QA, including evidence retrieval, unanswerable question classification, and answer generation. The dataset is positioned as a challenging benchmark for long-context and multi-task QA, given its real-world provenance, domain diversity, and annotation by the papers' authors.

1. Dataset Composition and Annotation Protocol

PeerQA consists of 579 question–answer (QA) pairs sourced from peer reviews of 208 academic articles, predominantly in Machine Learning (ML) and NLP, with additional representation from Geoscience and Public Health. An auxiliary set contains 12,546 preprocessed and filtered unlabeled questions drawn from 2,623 papers, facilitating unsupervised or transfer learning paradigms.

Questions are extracted from reviewer comments, which are typically highly contextual and compound in nature. A decontextualization pipeline, leveraging InstructGPT and constituency parsing, rephrases these questions so each stands alone and is intelligible outside its original context; sentences are split where necessary to isolate single answerable units.
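
The paper documents the exact prompts and parsing rules; the following is only a minimal sketch of the prompt-based rewriting step, assuming a hypothetical `call_llm` helper that wraps whichever instruction-tuned model is available (InstructGPT in the original pipeline).

```python
# Minimal sketch of question decontextualization (not the authors' exact pipeline).
# `call_llm` is an assumed wrapper around an instruction-tuned LLM.

DECONTEXT_PROMPT = (
    "Rewrite the reviewer comment below as one or more standalone questions. "
    "Resolve pronouns and references to 'the paper', and split compound "
    "questions into separate, self-contained ones.\n\n"
    "Paper title: {title}\n"
    "Reviewer comment: {comment}\n"
    "Standalone question(s):"
)

def decontextualize(comment: str, title: str, call_llm) -> list[str]:
    """Turn a context-bound reviewer comment into standalone questions."""
    prompt = DECONTEXT_PROMPT.format(title=title, comment=comment)
    output = call_llm(prompt)
    # Expect one rewritten question per line; drop empty lines.
    return [q.strip() for q in output.splitlines() if q.strip()]
```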

Answers are annotated by the original paper authors. Authors are contacted individually with explicit instructions, annotation guidelines, and a demonstration video. In the annotation interface, authors can modify questions if ambiguous, classify them as unanswerable (e.g., when addressed only in the rebuttal or not present in the final paper), mark answer evidence by highlighting relevant portions of the camera-ready paper (GROBID-extracted), and provide free-form text answers.

The annotated dataset, released under CC BY-NC-SA 4.0, includes the following metadata for each QA pair: document identifier, question text, author-provided answer, marked answer evidence (text spans), and an unanswerable flag where appropriate.
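
The concrete field names below are illustrative placeholders rather than the release's exact schema; they simply mirror the metadata fields listed above.

```python
# Illustrative QA record; field names and values are hypothetical placeholders,
# not the exact schema shipped with the PeerQA release.
example_record = {
    "paper_id": "example-venue-0001",        # document identifier
    "question": "How were the hyperparameters of the proposed model selected?",
    "answer_free_text": "They were tuned on the validation split ...",
    "answer_evidence": [                     # highlighted spans from the camera-ready paper
        {"text": "We tune the learning rate and batch size on the validation set ..."}
    ],
    "unanswerable": False,                   # True if, e.g., only addressed in the rebuttal
}
```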

2. Supported Tasks and Evaluation Protocols

PeerQA is explicitly designed to facilitate three critical research tasks for QA systems over scientific documents:

  • Evidence Retrieval: Given a (decontextualized) question and a document, models must rank passages (paragraphs or sentences) by relevance, identifying which spans contain the answer evidence. Baselines include cross-encoders, dense retrievers (e.g., MiniLM-L12-v2, Dragon+), multi-vector retrievers, and sparse methods such as BM25 and SPLADEv3. Performance is assessed with the information retrieval metrics Mean Reciprocal Rank (MRR) and Recall@10; a minimal metric sketch follows this list.
  • Unanswerable Question Classification: Some questions from peer reviews are inherently unanswerable within the final paper context. PeerQA supports a binary classification task: given a question and document context, models must determine answerability based on the presence (or absence) of supporting evidence. Instruction-tuned LLMs—including Llama-3, Mistral, Command-R, GPT-3.5 Turbo, and GPT-4o—are benchmarked using Macro-F1 scores and class-specific recall/precision.
  • Answer Generation: Models perform sequence-to-sequence generation, producing free-form answers based on the question and either the entire paper or retrieved evidence passages (RAG setup). Evaluation uses ROUGE-L for lexical overlap, AlignScore for factual consistency, and Prometheus (LLM-as-a-judge) for correctness. Papers average 12,000 tokens, making long-context modeling technically demanding even for contemporary LLMs.
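
As a concrete reference for the retrieval metrics above, the sketch below computes MRR and Recall@10 from a ranked passage list, assuming each question comes with a set of gold evidence passage ids; this is a simplification, not the released evaluation script.

```python
def mrr(ranked_ids: list[str], gold_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant passage (0.0 if none is retrieved)."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in gold_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int = 10) -> float:
    """Fraction of gold evidence passages retrieved within the top-k results."""
    if not gold_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in gold_ids)
    return hits / len(gold_ids)

# Benchmark-level scores are obtained by averaging both metrics over all questions.
```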

3. Experimental Findings and Task Analysis

Empirical results from PeerQA baseline experiments reveal several patterns:

  • Models systematically perform better at evidence retrieval after decontextualization, especially with simple strategies such as prepending the paper title to paragraph passages. This suggests contextual cues critically affect document-level retrieval performance.
  • For answer generation, retrieval-augmented setups (providing only top-ranked paragraphs) outperform full-document contexts, even for state-of-the-art long-context LLMs; a minimal RAG sketch follows this list. Correlation analysis shows a positive Pearson correlation (up to r = 0.42) between retrieval recall and generation quality metrics.
  • Unanswerable question classification exposes model biases: some LLMs overpredict answerability, while others lean toward unanswerable judgments, indicating a need for finer calibration and better-balanced training.
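
The retrieval-augmented setup referenced in the second bullet can be sketched as follows; `retrieve` and `call_llm` are assumed placeholders for any of the benchmarked retrievers and instruction-tuned LLMs, not the authors' exact implementation.

```python
# Minimal RAG sketch for PeerQA-style answer generation.
# `retrieve(question, passages, k)` and `call_llm(prompt)` are assumed helpers.

ANSWER_PROMPT = (
    "Answer the question using only the provided passages from the paper. "
    "If the passages do not contain the answer, reply 'unanswerable'.\n\n"
    "Passages:\n{passages}\n\nQuestion: {question}\nAnswer:"
)

def rag_answer(question: str, paper_passages: list[str],
               retrieve, call_llm, k: int = 10) -> str:
    top_passages = retrieve(question, paper_passages, k=k)  # e.g., BM25 or a dense retriever
    prompt = ANSWER_PROMPT.format(passages="\n".join(top_passages), question=question)
    return call_llm(prompt)
```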

Error analysis reveals systematic sensitivity to evaluation metrics, over-generation (answers with more detail than required), and reasoning failures caused by inadequate context. The diversity and typical length of scientific documents make PeerQA a robust, challenging testbed for probing the limits of current QA architectures.

4. Benchmarking Role and Comparative Strengths

PeerQA differentiates itself from prior scientific QA datasets with three principal features:

  • Natural Question Provenance: All questions stem directly from the peer review process, not crowdsourced or artificially generated, resulting in authentic information-seeking intent and domain coverage.
  • Expert Annotations: Answers are authored by the original paper authors, with explicit marking of answer evidence spans. This guarantees high answer quality and direct linkage to canonical document content.
  • Multi-Task Benchmarking: By supporting evidence retrieval, unanswerable classification, and answer generation, PeerQA offers a multi-faceted evaluation environment. The average paper length (~12K tokens) and the multiplicity of evidence types challenge long-context and multi-hop reasoning models.

Dense and sparse retrieval models (SPLADEv3, BM25) and recent LLMs (GPT-4o, Llama-3) are evaluated extensively, with performance trends indicating that improved retrieval and context management are necessary to enhance downstream answer accuracy.

5. Licensing, Accessibility, and Dataset Resources

PeerQA and its processing scripts are publicly released at https://github.com/UKPLab/peerqa under CC BY-NC-SA 4.0. The dataset thus supports research and academic use (with attribution and share-alike terms) but is not available for commercial purposes. In addition to the main annotated corpus, the release includes over 12,500 filtered unlabeled questions, enabling unsupervised experimentation and annotation expansion by the community.

The data structure includes full metadata, evidentiary text spans, and annotation provenance. Researchers can directly apply standard IR, classification, and generation models, or integrate the resource into multi-task learning frameworks; a hedged loading sketch follows below.
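
File names and formats should be taken from the repository documentation; the sketch below only illustrates iterating over a hypothetical JSONL export with the illustrative fields used earlier.

```python
import json

# Hypothetical JSONL layout; consult https://github.com/UKPLab/peerqa for the
# actual file names and schema shipped with the release.
def load_qa_pairs(path: str = "peerqa-qa.jsonl"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: separate answerable from unanswerable questions.
records = list(load_qa_pairs())
answerable = [r for r in records if not r.get("unanswerable", False)]
unanswerable = [r for r in records if r.get("unanswerable", False)]
```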

6. Implications for QA System Development

PeerQA’s design and analysis yield several insights for future QA system development:

  • Accurate retrieval is a precondition for high answer generation quality, necessitating advanced passage selection, decontextualization, and ranking algorithms.
  • Evaluation of answerability remains a nontrivial classification challenge, potentially requiring ensemble calibration or meta-labeling approaches to offset inherent model biases.
  • The inclusion of long scientific articles compels further research on efficient context management, including summarization, passage condensation, and memory-efficient transformers.

A plausible implication is that joint modeling of retrieval, answerability detection, and generation may be necessary to achieve human-level performance on practical document-level scientific QA, as demonstrated by observable error correlations in the PeerQA baseline experiments.

PeerQA serves as an authoritative real-world benchmark for scientific question answering, supporting rigorous evaluation and motivating further research on the practical deployment of QA systems in scholarly communication contexts (Baumgärtner et al., 19 Feb 2025).
