
SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers (2411.05338v1)

Published 8 Nov 2024 in cs.CL

Abstract: Scientific literature is typically dense, requiring significant background knowledge and deep comprehension for effective engagement. We introduce SciDQA, a new dataset for reading comprehension that challenges LLMs for a deep understanding of scientific articles, consisting of 2,937 QA pairs. Unlike other scientific QA datasets, SciDQA sources questions from peer reviews by domain experts and answers by paper authors, ensuring a thorough examination of the literature. We enhance the dataset's quality through a process that carefully filters out lower quality questions, decontextualizes the content, tracks the source document across different versions, and incorporates a bibliography for multi-document question-answering. Questions in SciDQA necessitate reasoning across figures, tables, equations, appendices, and supplementary materials, and require multi-document reasoning. We evaluate several open-source and proprietary LLMs across various configurations to explore their capabilities in generating relevant and factual responses. Our comprehensive evaluation, based on metrics for surface-level similarity and LLM judgements, highlights notable performance discrepancies. SciDQA represents a rigorously curated, naturally derived scientific QA dataset, designed to facilitate research on complex scientific text understanding.



Summary

  • The paper introduces SciDQA, a dataset featuring 2,937 expert-curated QA pairs extracted from peer reviews to assess LLM comprehension of scientific literature.
  • It employs a meticulous curation process combining LLM-based extraction with expert annotation to handle multi-modal content and reasoning challenges.
  • Experimental results indicate that even advanced LLMs like GPT-4o face difficulties with multi-document and multi-modal reasoning tasks presented by the dataset.

SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers

Introduction

The paper introduces SciDQA, a dataset designed to challenge LLMs with deep reading comprehension tasks in the domain of scientific literature. The dataset consists of 2,937 high-quality question-answer (QA) pairs. It stands out from existing QA datasets by sourcing its questions from peer reviews written by domain experts and its answers from the papers' authors, ensuring thorough engagement with the scientific texts.

Figure 1: An instance in the SciDQA dataset. The question and answer corresponding to the paper are extracted from the reviewer-author discussion on OpenReview.

Dataset Creation and Curation

SciDQA's curation captures QA pairs from reviewer-author discussions on OpenReview, focusing on ML-domain articles. Questions extracted from peer reviews reflect reviewers' requests for clarity or further explanation, making them an excellent probe of comprehensive understanding of research papers. To maintain relevance and quality, LLM-based extraction is followed by an extensive process of human expert annotation and editing.

Figure 2: Dataset curation pipeline for SciDQA. LLM-based QA extraction from peer reviews is followed by a comprehensive human expert annotation and editing.
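The LLM-based extraction step in this pipeline can be sketched as a prompt-and-parse loop over review threads. The prompt wording, the `llm` callable, and the JSON schema below are illustrative assumptions, not the paper's actual implementation:

```python
import json

# Hypothetical extraction prompt; the paper's actual prompt and model
# are not specified here.
EXTRACTION_PROMPT = """\
Below is a reviewer comment and the authors' response from OpenReview.
Extract each question the reviewer asks and the authors' answer to it.
Return a JSON list of objects with keys "question" and "answer".

Reviewer comment:
{review}

Author response:
{response}
"""

def extract_qa_pairs(review: str, response: str, llm) -> list[dict]:
    """Call an LLM (any callable str -> str) and parse its JSON output.

    Malformed or incomplete pairs are dropped, mirroring the filtering
    stage that precedes human annotation in the curation pipeline.
    """
    raw = llm(EXTRACTION_PROMPT.format(review=review, response=response))
    try:
        pairs = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return [p for p in pairs
            if isinstance(p, dict) and p.get("question") and p.get("answer")]
```

In practice the parsed pairs would then be decontextualized and routed to human annotators, as Figure 2 shows.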

Challenges Presented by SciDQA

The questions in this dataset necessitate reasoning across multi-modal content within papers, such as figures, tables, and equations. Approximately 11% of the questions require reasoning over references to multiple documents. This dataset presents a considerable challenge for LLMs, pushing the boundaries of current capabilities by requiring the generation of factual and relevant responses.

Experimental Evaluation

Several open-source and proprietary LLMs are evaluated on the dataset across different configurations, including closed-book settings and retrieval-augmented generation (RAG). The results reveal substantial performance discrepancies: proprietary models such as GPT-4o perform notably better than open-source counterparts, suggesting a persistent gap in handling nuanced scientific questions even when relevant context is provided.
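The abstract mentions surface-level similarity metrics among the evaluation criteria. As a minimal sketch of one such metric, here is a ROUGE-L F1 score between a gold answer and a model response; the paper's exact metric set and implementation may differ:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # dp value diagonally above-left
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: LCS-based overlap between gold and generated answers."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Surface metrics like this are cheap but reward lexical overlap rather than factual correctness, which is why the evaluation also relies on LLM judgements.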

Implementation Challenges

Applying SciDQA in practice requires managing long contexts efficiently, given the multi-modal content and length of full-text scientific papers. Ensuring that models can reason inferentially across multiple documents is also critical. The RAG approach implemented for this dataset provides a baseline strategy: it selectively focuses the model on the most relevant sections of a document, improving comprehension while keeping the context manageable.
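The retrieval step of such a RAG baseline can be sketched with a simple lexical ranker over a paper's sections. This TF-IDF-style scorer is an illustrative stand-in; the retriever actually used for SciDQA (e.g. a dense embedding model) may differ:

```python
import math
from collections import Counter

def top_k_sections(question: str, sections: list[str], k: int = 3) -> list[str]:
    """Rank paper sections against a question by TF-IDF term overlap.

    A minimal lexical retriever: sections sharing rare question terms
    score highest and are passed to the LLM as context.
    """
    docs = [Counter(s.lower().split()) for s in sections]
    n = len(docs)
    # Inverse document frequency computed over this paper's sections.
    idf = {t: math.log(n / sum(1 for d in docs if t in d))
           for d in docs for t in d}
    q = Counter(question.lower().split())

    def score(d: Counter) -> float:
        return sum(q[t] * d[t] * idf.get(t, 0.0) ** 2 for t in q)

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [sections[i] for i in ranked[:k]]
```

The top-k sections are then concatenated into the prompt alongside the question, trading full-paper coverage for a context that fits the model's window.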

Limitations and Future Directions

A major limitation of SciDQA is that some documents needed to answer multi-document questions may be absent from the provided context, which limits how faithfully the dataset can challenge LLMs in real-world scenarios. Future work includes expanding the dataset to scientific fields beyond ML and improving the multimodal analysis capabilities of LLMs.

Conclusion

SciDQA provides an effective testbed for evaluating deep comprehension capabilities of LLMs on scientific texts. By utilizing peer review-derived questions and answers, it ensures engagement with complex and domain-specific content, promoting advancements in the understanding of scientific materials by AI systems.
