SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation

Published 16 May 2024 in cs.CL and cs.AI | (2405.09939v2)

Abstract: We introduce SciQAG, a novel framework for automatically generating high-quality science question-answer pairs from a large corpus of scientific literature based on LLMs. SciQAG consists of a QA generator and a QA evaluator, which work together to extract diverse and research-level questions and answers from scientific papers. Utilizing this framework, we construct a large-scale, high-quality, open-ended science QA dataset containing 188,042 QA pairs extracted from 22,743 scientific papers across 24 scientific domains. We also introduce SciQAG-24D, a new benchmark task designed to evaluate the science question-answering ability of LLMs. Extensive experiments demonstrate that fine-tuning LLMs on the SciQAG dataset significantly improves their performance on both open-ended question answering and scientific tasks. To foster research and collaboration, we make the datasets, models, and evaluation codes publicly available, contributing to the advancement of science question answering and developing more interpretable and reasoning-capable AI systems.

Summary

The paper introduces the SciQAG framework, which automatically generates and evaluates scientific QA pairs from a large corpus of scholarly articles.
It employs GPT-4 and open-source models refined via expert feedback and iterative fine-tuning to achieve high scores on the RACAR metric.
Experiments show that fine-tuned models like Vicuna produce diverse, accurate QA pairs, significantly advancing the training of scientific LLMs.

SciQAG: Auto-Generating Scientific Question Answering Datasets

The paper introduces SciQAG, a framework designed for the automatic generation and evaluation of scientific Question-Answer (QA) pairs derived from scientific literature. The framework addresses the challenges posed by the increasing volume and complexity of scientific publications, providing a means to efficiently extract and assess knowledge using LLMs. By generating high-quality QA pairs at scale, SciQAG facilitates the training and evaluation of LLMs in scientific domains.

Framework Components and Functionality

SciQAG integrates three primary components: Seed QA, QA Generator, and QA Evaluator (Figure 1). The Seed QA component leverages GPT-4 to generate initial QA pairs from a subset of scientific papers, refined through domain expert feedback to optimize prompt effectiveness. The QA Generator then uses these refined prompts to fine-tune an open-source generative model, enabling the creation of QA pairs from a large corpus of scientific articles. Finally, the QA Evaluator employs another LLM to assess the generated pairs across five key dimensions.

Figure 1: The SciQAG framework links Seed QA, QA Generator, and QA Evaluator steps to generate and evaluate a scientific QA dataset from scientific literature. The dashed line represents optional fine-tuning.

The framework supports several optional fine-tuning steps, which makes it adaptable. The seed QA pairs can be used to fine-tune the generator, or the generator can prompt an LLM directly without fine-tuning. The QA Evaluator can filter the generated data by RACAR score, and the filtered data can feed further rounds of improvement. This flexibility allows implementations to be tailored to specific research needs.
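The generate, evaluate, and filter flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `toy_generate`, `toy_evaluate`, and the `min_score` threshold are hypothetical names and choices, and the toy functions stand in for the LLM calls so the sketch runs without any model.

```python
from typing import Callable, Dict, Iterable, List, Tuple

QAPair = Tuple[str, str]   # (question, answer)
Scores = Dict[str, int]    # dimension name -> score on the 1-3 scale


def run_pipeline(
    papers: Iterable[str],
    generate: Callable[[str], List[QAPair]],
    evaluate: Callable[[str, QAPair], Scores],
    min_score: int = 2,  # assumed threshold, not taken from the paper
) -> List[QAPair]:
    """Generate QA pairs per paper and keep pairs whose lowest score >= min_score."""
    kept: List[QAPair] = []
    for paper in papers:
        for pair in generate(paper):
            scores = evaluate(paper, pair)
            if min(scores.values()) >= min_score:
                kept.append(pair)
    return kept


def toy_generate(paper: str) -> List[QAPair]:
    # Stand-in for the QA Generator: one question per leading word.
    return [(f"What does the paper say about {w}?", paper) for w in paper.split()[:2]]


def toy_evaluate(paper: str, pair: QAPair) -> Scores:
    # Stand-in for the QA Evaluator: penalize questions that mention a figure.
    agnosticism = 1 if "figure" in pair[0].lower() else 3
    return {
        "relevance": 3,
        "agnosticism": agnosticism,
        "completeness": 3,
        "accuracy": 3,
        "reasonableness": 3,
    }


kept = run_pipeline(["figure perovskite stability"], toy_generate, toy_evaluate)
print(kept)
```

Passing the generator and evaluator as callables mirrors the optional nature of the fine-tuning steps: a prompted model and a fine-tuned model can be swapped in without changing the loop.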

Dataset Creation and Characteristics

The authors curated a dataset of over 6 million scientific papers from the Web of Science (WoS) Core Collection, focusing on physical science disciplines such as materials science, chemistry, physics, and energy. To ensure balanced representation, they selected the 4,000 most cited papers from each of 24 WoS categories, resulting in a dataset of 96,000 papers (Figure 2). The TopicRank algorithm was then applied to extract 20 keywords per article, facilitating guided prompting for QA generation. The final output consists of 960,000 QA pairs (an average of 10 per paper), providing a substantial resource for training and benchmarking scientific LLMs. The abstract above reports different totals: 188,042 QA pairs from 22,743 papers.

Figure 2: Distribution of 6M papers from the WoS Core Collection across 24 WoS categories selected from Chemistry, Physics, Materials Science and Energy. To ensure data balance, we obtained the most cited 4,000 papers from each category, forming a dataset of 24 × 4,000 = 96,000 papers.
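The balanced selection step (the most cited papers per category) can be illustrated with a short sketch. The `Paper` record, the function name `top_cited_per_category`, and the sample data are hypothetical; only the counts 24, 4,000, and 96,000 come from the text.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Paper(NamedTuple):
    title: str
    category: str
    citations: int


def top_cited_per_category(papers: List[Paper], n: int) -> Dict[str, List[Paper]]:
    """Group papers by category and keep the n most cited in each group."""
    by_category: Dict[str, List[Paper]] = defaultdict(list)
    for paper in papers:
        by_category[paper.category].append(paper)
    return {
        category: sorted(group, key=lambda p: p.citations, reverse=True)[:n]
        for category, group in by_category.items()
    }


# Made-up sample records for illustration only.
sample = [
    Paper("A", "Physics, Applied", 120),
    Paper("B", "Physics, Applied", 340),
    Paper("C", "Physics, Applied", 15),
    Paper("D", "Electrochemistry", 90),
]
selected = top_cited_per_category(sample, n=2)
print({cat: [p.title for p in ps] for cat, ps in selected.items()})

# Dataset size stated in the text: 24 categories x 4,000 papers each.
total_papers = 24 * 4000
print(total_papers)
```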

Evaluation Metrics: The RACAR Framework

A key contribution of this work is the introduction of the RACAR metric, a five-dimensional evaluation framework designed to assess the quality of generated QA pairs. The dimensions include:

Relevance: Measures the alignment of QA pairs with the information in the source article.
Agnosticism: Assesses the context-independence of questions, ensuring they do not rely on specific elements like figures or tables.
Completeness: Evaluates whether answers comprehensively address all relevant aspects of the question.
Accuracy: Verifies the factual correctness of answers based on evidence from the paper.
Reasonableness: Checks the internal logical consistency of answers, ensuring they are free from contradictions.

GPT-4 was employed to assign scores on a scale of 1 to 3 for each dimension, with human expert evaluations used to validate the reliability of the automated scoring. Additionally, the authors analyzed the diversity of questions, coverage rate of answers, and source validation of numeric values to provide a holistic assessment of the dataset's quality.
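One way to represent a RACAR evaluation in code is a small record holding the five dimension scores, each restricted to the 1 to 3 scale. The class name `RacarScore` and the `mean` helper are illustrative assumptions; the source does not specify how scores are stored or aggregated.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RacarScore:
    """Five RACAR dimension scores, each an integer from 1 to 3."""
    relevance: int
    agnosticism: int
    completeness: int
    accuracy: int
    reasonableness: int

    def __post_init__(self) -> None:
        # Reject values outside the 1-3 scale described in the text.
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in (1, 2, 3):
                raise ValueError(f"{f.name} must be 1, 2, or 3, got {value!r}")

    def mean(self) -> float:
        """Unweighted average across the five dimensions (an assumed aggregate)."""
        names = [f.name for f in fields(self)]
        return sum(getattr(self, name) for name in names) / len(names)


score = RacarScore(relevance=3, agnosticism=3, completeness=2, accuracy=3, reasonableness=2)
print(score.mean())
```

Validating at construction time keeps malformed evaluator output (for example, a score of 0 or 4 parsed from a model response) from silently entering later statistics.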

Experimental Results and Analysis

The authors evaluated the performance of various LLMs, including GPT-3.5, Vicuna, and LongChat, in generating QA pairs, using the RACAR metric to compare their outputs. The results indicated that a fine-tuned Vicuna model outperformed other open-source models, although GPT-3.5 achieved higher scores across all dimensions (Table 1). Spearman and Pearson correlations were computed to compare GPT-4 assigned scores and expert-annotated scores (Figure 3).

Figure 3: Spearman and Pearson correlations between GPT-4 assigned scores and expert-annotated scores.
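For readers unfamiliar with the two statistics, the sketch below computes Pearson and Spearman correlations in plain Python. The score lists are invented for illustration and are not the paper's data; Spearman is computed as the Pearson correlation of average ranks, which handles the ties that are common on a 1 to 3 scale.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example scores on the 1-3 scale (not from the paper).
gpt4_scores = [3, 2, 3, 1, 2]
expert_scores = [3, 2, 2, 1, 3]
print(pearson(gpt4_scores, expert_scores), spearman(gpt4_scores, expert_scores))
```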

Analysis of question diversity revealed that most question pairs had low similarity scores, with an average similarity of 0.31, indicating substantial diversity in the generated questions. Coverage rate analysis showed an average coverage of 68% across the evaluation set, demonstrating that answers effectively sourced information from various parts of the original papers. Furthermore, source validation of numeric values indicated that 96.7% of numerical data in the answers were present in the source text, highlighting the generator's accuracy.
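Two of these checks can be approximated with simple text processing, as sketched below. Both are assumptions for illustration: the source does not state how similarity was measured, so token-level Jaccard similarity is swapped in here, and the numeric check uses a basic regular expression with exact string matching. The function names are hypothetical.

```python
import re
from itertools import combinations
from typing import List

NUMBER = re.compile(r"\d+(?:\.\d+)?")  # integers and simple decimals only


def numeric_support_rate(answer: str, source: str) -> float:
    """Fraction of numbers in the answer that also appear verbatim in the source."""
    answer_numbers = NUMBER.findall(answer)
    if not answer_numbers:
        return 1.0  # no numeric claims, so nothing is unsupported
    source_numbers = set(NUMBER.findall(source))
    found = sum(1 for number in answer_numbers if number in source_numbers)
    return found / len(answer_numbers)


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (a stand-in for the paper's similarity measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(questions: List[str]) -> float:
    """Average similarity over all unordered question pairs; lower means more diverse."""
    pairs = list(combinations(questions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


rate = numeric_support_rate(
    "The band gap is 1.55 eV at 300 K",
    "We measured a band gap of 1.55 eV.",
)
print(rate)
print(mean_pairwise_similarity(["what is x", "what is y", "how does z work"]))
```

Exact string matching is strict: a value written as "1.550" in the source would not match "1.55" in an answer, so a production check would likely need numeric normalization.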

Practical Implications and Future Directions

The SciQAG framework offers a cost-effective solution for generating large volumes of high-quality scientific QA data. The generated dataset can be used to train and evaluate LLMs for scientific tasks, reducing the need for manual annotation and enabling the development of more knowledgeable and accurate models. The broad and deep scope of questions generated by SciQAG, along with the detailed and informative answers, makes it a valuable tool for enhancing the accessibility and understanding of complex scientific information.

Future research could focus on expanding the training dataset, incorporating Retrieval-Augmented Generation (RAG) techniques to further reduce hallucinations, and exploring additional evaluation metrics to capture nuanced aspects of QA quality. The SciQAG framework represents a significant step toward automating knowledge extraction from scientific literature, with potential applications in various domains, including scientific discovery, education, and information retrieval.

Conclusion

SciQAG provides a robust, open-source framework for generating and evaluating scientific QA pairs. By fine-tuning an open-source LLM and employing GPT-4 for quality assessment, the framework achieves high scores on the RACAR metric and outperforms the other open-source generative models evaluated, although GPT-3.5 still scores higher. The resulting dataset and evaluation methods offer valuable resources for advancing scientific LLMs and promoting knowledge discovery.