PeerQA: A Scientific Question Answering Dataset from Peer Reviews (2502.13668v1)

Published 19 Feb 2025 in cs.CL, cs.AI, and cs.IR

Abstract: We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining the scientific article. Answers have been annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP, as well as a subset of other scientific communities like Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: Evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval, where we find that even simple decontextualization approaches consistently improve retrieval performance across architectures. On answer generation, PeerQA serves as a challenging benchmark for long-context modeling, as the papers have an average size of 12k tokens. Our code and data is available at https://github.com/UKPLab/peerqa.

Summary

  • The paper introduces PeerQA, a dataset sourced from peer reviews to enhance scientific question answering by addressing evidence retrieval and unanswerable question classification.
  • It evaluates various retrieval and generation models using metrics like Rouge-L, AlignScore, and Prometheus Correctness, demonstrating that decontextualization improves retrieval performance.
  • The methodology highlights PeerQA’s potential to assist academic review workflows and drive innovations in AI-assisted peer review processes.

"PeerQA: A Scientific Question Answering Dataset from Peer Reviews" (2502.13668)

Introduction

The research paper "PeerQA: A Scientific Question Answering Dataset from Peer Reviews" introduces PeerQA, a dataset for scientific question answering built from questions raised in peer reviews. The dataset emphasizes document-level question answering and is intended to advance QA systems that extract meaningful insights from complex scientific texts. PeerQA defines three main tasks: evidence retrieval, unanswerable question classification, and answer generation (Figure 1).

Figure 1: Overview of the PeerQA data collection process.

Dataset Characteristics

PeerQA comprises 579 question-answer pairs sourced from 208 academic articles, primarily from Machine Learning (ML) and NLP, with additional contributions from Geosciences and Public Health. The dataset was constructed by extracting questions posed in peer reviews and having the respective paper authors annotate the answers. A distinctive aspect of PeerQA is that it includes both labeled and unlabeled questions: the 579 annotated QA pairs are accompanied by 12,000 unlabeled questions for broader experimentation (Figure 2).

Figure 2: Statistics of the PeerQA dataset.
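To make the three tasks concrete, the sketch below shows how a single PeerQA-style record could be split into an evidence-retrieval example, an answerability label, and an answer-generation example. The field names and record layout are illustrative assumptions, not the dataset's released schema (see the GitHub repository for the actual format).

```python
# Illustrative sketch only: field names are assumptions, not PeerQA's real schema.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeerQARecord:
    paper_id: str                       # identifier of the reviewed paper
    question: str                       # question extracted from a peer review
    answer: Optional[str]               # author-written answer; None if unanswerable
    evidence: List[str] = field(default_factory=list)  # supporting passages


def to_tasks(record: PeerQARecord):
    """Derive the three PeerQA tasks from one annotated record."""
    retrieval_example = (record.question, record.evidence)   # evidence retrieval
    answerability_label = record.answer is not None          # unanswerable classification
    generation_example = (record.question, record.answer)    # answer generation
    return retrieval_example, answerability_label, generation_example


example = PeerQARecord(
    paper_id="2502.13668",
    question="How were the answer annotations obtained?",
    answer="Answers were annotated by the original authors of each paper.",
    evidence=["Answers have been annotated by the original authors of each paper."],
)
print(to_tasks(example))
```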

Experimental Setup

The authors conducted baseline experiments across PeerQA's core tasks. Notably, they evaluated retrieval performance for several architectures, including cross-encoder, dense, sparse, and multi-vector retrieval models, and found that even straightforward decontextualization consistently improves retrieval accuracy. They also compared models' effectiveness at answer generation, evaluated with metrics such as Rouge-L, AlignScore, and Prometheus Correctness (Figure 3).

Figure 3: Answerability scores with different contexts.
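The following is a minimal sketch of the simple decontextualization idea: each passage is prefixed with document-level context (here, the paper title and section heading) before being embedded, so that chunks taken out of context remain interpretable to a dense retriever. The prefixing scheme and the model name are assumptions for illustration; the paper's exact decontextualization procedure and retrieval models may differ.

```python
# Sketch of passage decontextualization for dense retrieval.
# Model name and prefixing scheme are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any dense bi-encoder works here


def decontextualize(title: str, section: str, passage: str) -> str:
    # Simple string-level decontextualization: prepend document-level context.
    return f"{title} | {section} | {passage}"


passages = [
    ("PeerQA", "Dataset", "The dataset contains 579 QA pairs from 208 articles."),
    ("PeerQA", "Experiments", "Decontextualization consistently improves retrieval."),
]
question = "How many question-answer pairs does the dataset contain?"

corpus = [decontextualize(*p) for p in passages]
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(question, convert_to_tensor=True)

scores = util.cos_sim(query_emb, corpus_emb)[0]  # cosine similarity per passage
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```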

Answer Generation and Evaluation

PeerQA also benchmarks long-context modeling: the papers average roughly 12,000 tokens, which challenges current generative models. The experiments show that assembling relevant context from high-recall retrievers notably improves answer generation performance. The dataset thus serves as a useful benchmark for assessing whether models can generate coherent, evidence-grounded answers (Figure 4).

Figure 4: Evaluation metrics comparing generated and annotated responses.
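As a small illustration of the evaluation side, the sketch below scores a generated answer against an author-annotated reference with ROUGE-L, one of the metrics named above, using the `rouge_score` package. AlignScore and Prometheus-based correctness require dedicated evaluator models and are omitted here; this only covers the lexical-overlap part of the evaluation, and the example strings are invented.

```python
# Minimal ROUGE-L scoring sketch; reference and generated strings are invented.
from rouge_score import rouge_scorer

reference = "Answers were annotated by the original authors of each paper."
generated = "The original paper authors provided the answer annotations."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)  # target first, prediction second
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```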

Implications and Future Directions

By sourcing questions from authentic peer reviews, PeerQA ensures natural question-context alignment, addressing a significant drawback of prior datasets whose questions were written by less expert annotators. It makes a strong case for deploying QA systems within academic review workflows, where they could help reviewers by surfacing overlooked sections or guiding them through complex, multi-faceted papers. Future directions include expanding PeerQA to additional scientific domains, conducting cross-domain model evaluations, and using the unlabeled questions to explore unsupervised learning techniques.

Conclusion

PeerQA stands as a substantial resource for building and refining document-level scientific QA systems, grounded in the distinctive context of peer review. It supports progress toward AI-assisted peer reviewing, which must cope with a growing volume of scientific output, and its protocols and findings open pathways for both improved QA methods and richer academic collaboration.

The paper sets a precedent for aligning AI research with scientific authoring and reviewing tasks, and could drive developments in which AI not only assists reviewers but also generates supportive, interpretive content that directly aids scientific understanding.
