
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

Published 21 Nov 2024 in cs.CL, cs.AI, cs.IR, cs.LG, and cs.DL | (2411.14199v1)

Abstract: Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large LMs assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improves off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT4o's 32%. We open-source all of our code, models, datastore, data and a public demo.


Summary

  • The paper introduces OpenScholar, a novel retrieval-augmented language model that integrates a datastore, retriever, and generator in an iterative self-feedback pipeline.
  • It employs a bi-encoder retriever with a cross-encoder reranker to accurately identify relevant passages and ensure precise citation verification.
  • Experimental evaluations on ScholarQABench demonstrate superior correctness, citation accuracy, and human preference compared to baseline systems.


Introduction

The paper "OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs" addresses the challenge of sifting through the vast and growing body of scientific literature to synthesize pertinent information for scientific queries. OpenScholar is proposed as a retrieval-augmented language model (LM) designed to meet this challenge by pairing a large datastore of open-access papers with accurate retrieval and citation-backed synthesis.

System Architecture

OpenScholar is composed of three primary components: a datastore, a retriever, and a generator LM, integrated through a pipeline that uses self-feedback to iteratively refine responses. A minimal sketch of the retrieve-and-rerank stage appears after the component list and figure below.

  • Datastore: OpenScholar utilizes the OpenScholar-DataStore (OSDS), which contains 45 million open-access scientific papers. Each paper is processed into embeddings to aid efficient retrieval.
  • Retriever: A bi-encoder retriever selects candidate passages, and a cross-encoder reranker refines these candidates to identify the most relevant passages.
  • LLM: The generator LLM synthesizes the retrieved passages and iteratively improves its output via self-feedback generation (Figure 1).

    Figure 1: Overview of OpenScholar, illustrating its datastore, retrieval, and synthesis components, integrated with self-feedback mechanisms.
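
The retrieval stage can be pictured as a standard two-stage retrieve-then-rerank pipeline. The sketch below is a minimal illustration of that pattern using off-the-shelf sentence-transformers models as stand-ins; the model names, candidate sizes, and scoring are placeholders, not the retriever and reranker actually trained in the paper.

```python
# Minimal retrieve-then-rerank sketch (illustrative stand-in models, not the paper's).
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, passages: list[str], k_candidates: int = 100, k_final: int = 10) -> list[str]:
    # Stage 1: bi-encoder retrieval over passage embeddings (dot product of normalized vectors).
    passage_emb = bi_encoder.encode(passages, normalize_embeddings=True)
    query_emb = bi_encoder.encode(query, normalize_embeddings=True)
    scores = passage_emb @ query_emb
    candidate_ids = np.argsort(-scores)[:k_candidates]
    candidates = [passages[i] for i in candidate_ids]

    # Stage 2: cross-encoder reranking of the candidate set to pick the final passages.
    rerank_scores = reranker.predict([(query, p) for p in candidates])
    order = np.argsort(-rerank_scores)[:k_final]
    return [candidates[i] for i in order]
```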

Inference Pipeline

The inference process in OpenScholar is an iterative generation cycle comprising four steps (a schematic sketch of the loop follows Figure 2 below):

  1. Initial Response Generation: The LM generates a preliminary response based on retrieved context.
  2. Feedback Generation: The system generates feedback, which includes directives to add missing information or improve organization.
  3. Response Refinement: Iterative refinements to the initial response are made based on the generated feedback.
  4. Citation Verification: Finally, the system ensures that the generated text is adequately substantiated with correct citations (Figure 2).

    Figure 2: Detailed overview of OpenScholar inference and training mechanisms.
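
A compact way to see this loop is as repeated prompting of a single generator. The sketch below is schematic only: `llm` is a hypothetical prompt-to-text callable and the prompt templates are invented for illustration, not the paper's actual prompts or released model API.

```python
# Schematic self-feedback loop: generate, critique, refine, verify citations.
def openscholar_style_answer(llm, query: str, passages: list[str], max_rounds: int = 2) -> str:
    context = "\n\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))

    # 1. Initial response grounded in the retrieved context.
    answer = llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer with citations like [1]:")

    for _ in range(max_rounds):
        # 2. Feedback generation: ask the model what is missing or poorly organized.
        feedback = llm(f"Question: {query}\nDraft answer:\n{answer}\n"
                       "List concrete feedback (missing info, organization issues), or say DONE:")
        if "DONE" in feedback:
            break
        # 3. Response refinement based on the generated feedback.
        answer = llm(f"Context:\n{context}\n\nQuestion: {query}\nDraft:\n{answer}\n"
                     f"Feedback:\n{feedback}\nRewrite the answer addressing the feedback, keeping citations:")

    # 4. Citation verification: check that every claim is backed by a cited passage.
    answer = llm(f"Context:\n{context}\n\nAnswer:\n{answer}\n"
                 "Verify each cited passage supports its claim; fix or remove unsupported citations:")
    return answer
```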

Training Data and Model

OpenScholar-8B is trained on synthetic data generated with the inference-time pipeline itself. This synthetic data simulates complex information-synthesis scenarios over scientific literature, strengthening the model's ability to perform the same tasks at inference time. The training mix combines this domain-specific synthetic data with general-domain instruction-tuning data and fact-verification data; a schematic of the mixture follows.
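
As a rough illustration of that mixture, the snippet below loads and shuffles three hypothetical JSONL files; the file names and the uniform mixing are placeholders rather than the paper's actual recipe or ratios.

```python
# Illustrative assembly of the fine-tuning mixture (placeholder file names and ratios).
import json
import random

def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

synthetic = load_jsonl("synthetic_scientific_qa.jsonl")    # pipeline-generated answers and feedback
general = load_jsonl("general_instruction_tuning.jsonl")   # general-domain instruction data
fact_check = load_jsonl("fact_verification.jsonl")         # attribution / fact-verification data

random.seed(0)
train_mix = synthetic + general + fact_check
random.shuffle(train_mix)
```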

Evaluation Benchmark: ScholarQABench

OpenScholar's performance is assessed using ScholarQABench, a benchmark crafted for evaluating literature synthesis across multiple disciplines. The benchmark includes:

  • Task Diversity: Various task formats including long-form generation, multiple-choice, and classification tasks.
  • Domain Coverage: Spanning computer science, biomedicine, neuroscience, and physics.
  • Evaluation Protocols: Automatic evaluations and human assessments of citation accuracy, correctness, and comprehensiveness (Figure 3); a simplified sketch of the citation-accuracy metric follows the figure caption.

    Figure 3: An example from ScholarQA-CS illustrating question format and evaluation metrics.
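
Citation accuracy in this style of evaluation is typically reported as citation precision (how many citations actually support their statements) and citation recall (how many statements are supported by at least one citation). The sketch below is a simplified illustration of those two quantities; `supports` stands in for an entailment-style check and is a hypothetical helper, not the benchmark's exact implementation.

```python
# Simplified citation precision/recall over a citation-backed answer.
def citation_metrics(statements, citations, passages, supports):
    """statements: list of answer sentences; citations: one list of passage ids per
    statement; supports(passage, statement) -> bool is an entailment-style check."""
    supported_statements = 0
    correct_citations = 0
    total_citations = 0
    for stmt, cited_ids in zip(statements, citations):
        hits = [supports(passages[i], stmt) for i in cited_ids]
        total_citations += len(hits)
        correct_citations += sum(hits)
        if any(hits):
            supported_statements += 1
    precision = correct_citations / total_citations if total_citations else 0.0
    recall = supported_statements / len(statements) if statements else 0.0
    return precision, recall
```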

Experimental Results

OpenScholar-8B and OpenScholar-GPT4o demonstrated superior performance on the ScholarQABench benchmark, outperforming comparable systems in areas such as correctness, citation accuracy, and human preference.

  • Correctness: OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model; OpenScholar-GPT4o improves GPT-4o's correctness by 12%.
  • Citation Accuracy: Maintains citation precision and recall on par with human experts, whereas GPT-4o hallucinates citations 78 to 90% of the time.
  • Human Preference: In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, versus 32% for GPT-4o, largely owing to more comprehensive coverage (Figure 4).

    Figure 4: Fine-grained evaluation results highlight the effectiveness of OpenScholar in synthesizing scientific literature.

Conclusion

OpenScholar represents a significant step forward in automating the synthesis of scientific literature, leveraging retrieval-augmented LLMs to achieve high-quality responses backed by accurate citations. The system's iterative refinement approach, coupled with its robust retrieval mechanism, ensures it can meet the demanding requirements of scientific inquiry. OpenScholar is publicly released, along with its benchmarks and evaluation tools, to support further advancements in the field.
