
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs (2411.14199v1)

Published 21 Nov 2024 in cs.CL, cs.AI, cs.IR, cs.LG, and cs.DL

Abstract: Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's 32%. We open-source all of our code, models, datastore, data, and a public demo.


Summary

  • The paper introduces OpenScholar, a retrieval-augmented language model that synthesizes literature using a datastore of 45 million open-access papers.
  • It employs an iterative self-feedback mechanism to refine outputs and ensure accurate, citation-backed responses.
  • It establishes ScholarQABench, a multi-domain benchmark on which OpenScholar outperforms models such as GPT-4o in correctness while achieving citation accuracy comparable to human experts.

OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LLMs

The paper introduces OpenScholar, a retrieval-augmented LLM designed specifically for synthesizing scientific literature. Drawing on a datastore of 45 million open-access papers, OpenScholar retrieves relevant passages and uses them to ground citation-backed responses to scientific queries. The system counters a common failure mode of LLMs, citation hallucination, with an iterative self-feedback mechanism that refines initial outputs to improve reliability and accuracy; a sketch of this inference loop follows.
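
To make the pipeline concrete, here is a minimal, hedged sketch of a retrieve-generate-refine loop in Python. The functions `retrieve`, `generate`, and `critique` are hypothetical stubs standing in for OpenScholar's actual components (a dense retriever over the 45-million-paper datastore and a tuned 8B LM); the real system's interfaces may differ.

```python
# Sketch of OpenScholar-style inference: retrieve passages, draft a
# citation-backed answer, then iteratively refine it with self-generated
# feedback. All functions below are illustrative stand-ins, not the
# paper's actual implementation.

from dataclasses import dataclass

@dataclass
class Passage:
    paper_id: str
    text: str

def retrieve(query: str, k: int) -> list[Passage]:
    # Stand-in: a real system would query a dense index over the datastore.
    return [Passage(paper_id=f"paper-{i}", text=f"passage about {query}") for i in range(k)]

def generate(query: str, passages: list[Passage], feedback: str = "") -> str:
    # Stand-in: a real system would prompt the LM with query + passages (+ feedback).
    cites = ", ".join(p.paper_id for p in passages[:3])
    return f"Answer to '{query}' [{cites}]"

def critique(draft: str) -> str:
    # Stand-in: a real system would ask the LM to critique its own draft;
    # an empty string means no further issues were found.
    return ""

def answer_query(query: str, max_rounds: int = 3) -> str:
    passages = retrieve(query, k=20)                 # gather initial evidence
    draft = generate(query, passages)                # citation-backed first draft
    for _ in range(max_rounds):                      # iterative self-feedback loop
        feedback = critique(draft)
        if not feedback:                             # converged: no issues raised
            break
        passages += retrieve(feedback, k=5)          # fetch evidence targeted at the gap
        draft = generate(query, passages, feedback)  # revise with feedback in context
    return draft

print(answer_query("retrieval-augmented LMs for literature synthesis"))
```

The key design point this sketch illustrates is that feedback does double duty: it both revises the draft and drives additional, targeted retrieval, so later iterations can cite evidence the first pass missed.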

Key Contributions

  • Introduction of OpenScholar: A novel retrieval-augmented LLM framework designed to handle scientific literature queries. OpenScholar retrieves relevant passages from a large datastore and uses them to support its generated responses with accurate citations. The model’s retrieval capabilities and data infrastructure specifically cater to multi-domain scientific synthesis, spanning areas such as computer science, physics, neuroscience, and biomedicine.
  • Developed Evaluation Benchmark: The paper presents ScholarQABench, a comprehensive benchmark for evaluating literature-synthesis systems. The benchmark comprises 2,967 expert-written queries and 208 long-form expert answers, enabling robust assessment across multiple domains, with outputs judged on correctness, coverage, coherence, and citation accuracy.
  • Performance Analysis and Comparisons: In empirical evaluations on ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, which is especially notable given its open architecture and smaller size relative to GPT-4o. The research further shows that OpenScholar achieves citation accuracy comparable to human experts, a marked improvement over existing models that frequently hallucinate citations; one way such a citation check could be scored is sketched after this list.
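
The sketch below shows one plausible way to score citation accuracy: every citation-bearing sentence should be supported by the passage it cites. The helper `is_supported` is a hypothetical stand-in for an entailment judge (e.g., an NLI model or an LM prompt); ScholarQABench's actual metric may be computed differently.

```python
# Hedged sketch of a citation-accuracy score: the fraction of cited
# sentences whose cited passage actually supports them. Not the
# benchmark's exact metric, just an illustration of the idea.

import re

def citation_accuracy(answer: str, passages: dict[str, str]) -> float:
    """Fraction of (sentence, citation) pairs where the cited passage supports the sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", answer)
    checked, supported = 0, 0
    for sent in sentences:
        cited_ids = re.findall(r"\[([\w-]+)\]", sent)   # citations like "... [paper-3]."
        for pid in cited_ids:
            checked += 1
            passage = passages.get(pid, "")              # a hallucinated id has no passage
            if passage and is_supported(claim=sent, evidence=passage):
                supported += 1
    return supported / checked if checked else 0.0

def is_supported(claim: str, evidence: str) -> bool:
    # Stand-in for an entailment model; here a crude word-overlap test.
    claim_words = set(claim.lower().split())
    return len(claim_words & set(evidence.lower().split())) >= 3

demo_answer = "Dense retrieval improves literature synthesis [p1]. Untested claim [p2]."
demo_passages = {"p1": "dense retrieval improves literature synthesis quality",
                 "p2": "unrelated text"}
print(citation_accuracy(demo_answer, demo_passages))  # 0.5 under the crude overlap test
```

Note that a hallucinated citation fails in two ways under this scoring: the cited id may not resolve to any passage at all, or the passage it resolves to may not entail the claim.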

Implications and Future Directions

The implications of OpenScholar extend to both practical applications and theoretical advances in AI. The introduction of a domain-specific model for literature synthesis underscores the potential of specialized LLMs that integrate retrieval systems to optimize performance on particular tasks, and this approach may inspire future systems built on domain-specific datastores and retrieval-augmented architectures. Given OpenScholar's demonstrated success in reducing inaccuracies and improving citation validity, adapting or extending it to other domains, such as healthcare, legal research, and the digital humanities, could be highly beneficial. The work also opens opportunities to explore more advanced feedback mechanisms and adaptive learning strategies that further strengthen the model's understanding and synthesis capabilities.

In conclusion, OpenScholar exemplifies a crucial step towards reliable automated synthesis of scientific knowledge by effectively combining retrieval-augmented LLMs with domain-specific datasets and evaluation benchmarks. By offering an open framework equipped to handle the intricacies of scientific literature, OpenScholar contributes significantly to the ongoing efforts to enhance AI's role in scientific discovery and research dissemination.
