LitSearch: A Retrieval Benchmark for Scientific Literature Search (2407.18940v2)

Published 10 Jul 2024 in cs.IR, cs.AI, cs.CL, cs.DL, and cs.LG

Abstract: Literature search questions, such as "Where can I find research on the evaluation of consistency in generated summaries?" pose significant challenges for modern search engines and retrieval systems. These questions often require a deep understanding of research concepts and the ability to reason across entire articles. In this work, we introduce LitSearch, a retrieval benchmark comprising 597 realistic literature search queries about recent ML and NLP papers. LitSearch is constructed using a combination of (1) questions generated by GPT-4 based on paragraphs containing inline citations from research papers and (2) questions manually written by authors about their recently published papers. All LitSearch questions were manually examined or edited by experts to ensure high quality. We extensively benchmark state-of-the-art retrieval models and also evaluate two LLM-based reranking pipelines. We find a significant performance gap between BM25 and state-of-the-art dense retrievers, with a 24.8% absolute difference in recall@5. The LLM-based reranking strategies further improve the best-performing dense retriever by 4.4%. Additionally, commercial search engines and research tools like Google Search perform poorly on LitSearch, lagging behind the best dense retriever by up to 32 recall points. Taken together, these results show that LitSearch is an informative new testbed for retrieval systems while catering to a real-world use case.

Overview of "LitSearch: A Retrieval Benchmark for Scientific Literature Search"

The paper "LitSearch: A Retrieval Benchmark for Scientific Literature Search" introduces an advanced benchmark named LitSearch, tailored to evaluate the efficacy of retrieval systems in addressing literature search queries. Recognizing the complexity inherent in queries related to scientific literature, the researchers developed LitSearch to provide a comprehensive evaluation platform comprising 597 literature search queries. These queries are particularly centered on recent advances in ML and NLP.

Methodology

LitSearch is constructed by combining two question sources:

  1. Inline-Citation Questions: These are generated by prompting GPT-4 to transform paragraphs containing inline citations into meaningful search queries (a minimal sketch of this step follows the methodology description below). This approach attempts to circumvent the often noisy and context-dependent nature of raw inline citations by producing questions that demand a deeper understanding of the cited work.
  2. Author-Written Questions: Authors of recent conference papers were solicited to create questions about their work, which were then carefully vetted by experts to ensure quality and relevance.

Each question in LitSearch is linked to one or more scientific articles that serve as ground truth, making the benchmark both rigorous and practical for evaluating retrieval systems.
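
As a rough illustration of the inline-citation pipeline, the sketch below prompts GPT-4 through the OpenAI chat API to turn a citing paragraph into a search question. The prompt wording, model name, and helper function are assumptions made for illustration; the paper's actual prompts and filtering steps are not reproduced here.

```python
# Minimal sketch of question generation from an inline citation.
# The prompt text and model name are illustrative assumptions, not the
# paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_question(citing_paragraph: str, cited_title: str) -> str:
    """Turn a paragraph containing an inline citation into a search question."""
    prompt = (
        "The following paragraph cites the paper titled "
        f"'{cited_title}':\n\n{citing_paragraph}\n\n"
        "Write a natural literature-search question a researcher might ask, "
        "such that the cited paper answers it. Do not mention the paper by name."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
```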

Experimental Findings

The benchmark is used to test a suite of state-of-the-art retrieval models, including the traditional BM25 baseline and dense retrievers such as GritLM, Instructor, and E5. GritLM, the strongest dense retriever, achieved a recall@5 of 74.8%, outperforming BM25 by 24.8 absolute points.
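
To make the evaluation protocol concrete, here is a minimal sketch of dense retrieval over title-plus-abstract documents followed by recall@5 scoring. The sentence-transformers model name is a generic placeholder; GritLM, Instructor, and E5 each have their own loading and prompting conventions that are not shown, and the paper's exact metric definition may differ in detail for multi-answer queries.

```python
# Minimal sketch of dense retrieval and recall@k scoring. Each query is
# assumed to come with a set of ground-truth document indices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder dense encoder

def retrieve_top_k(queries, documents, k=5):
    """Return the indices of the top-k documents (title + abstract) per query."""
    q_emb = model.encode(queries, normalize_embeddings=True)
    d_emb = model.encode(documents, normalize_embeddings=True)
    scores = q_emb @ d_emb.T                   # cosine similarity matrix
    return np.argsort(-scores, axis=1)[:, :k]  # top-k document indices per query

def recall_at_k(topk_indices, gold_sets):
    """Average, over queries, of the fraction of gold documents found in the top k."""
    recalls = [len(set(row) & gold) / len(gold)
               for row, gold in zip(topk_indices, gold_sets)]
    return sum(recalls) / len(recalls)
```

On LitSearch, the documents would be the concatenated title and abstract of each paper in the retrieval corpus, and gold_sets would hold the ground-truth paper indices for each query.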

Furthermore, applying LLM-based reranking on top of the best dense retriever yielded a further 4.4-point improvement in recall@5, underscoring the potential of LLMs for improving retrieval accuracy. Notably, commercial search engines and research tools such as Google Search lag considerably behind the best dense retrievers, by up to 32 recall points, highlighting the challenging nature of LitSearch.
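
The reranking stage can be pictured as follows: take the top candidates from a dense retriever and ask an LLM to reorder them given the query. The prompt format, model name, and output parsing below are illustrative assumptions rather than the paper's exact pipeline.

```python
# Minimal sketch of LLM-based reranking of dense-retrieval candidates.
# Prompt format, model name, and output parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[int]:
    """Ask an LLM to reorder candidate papers by relevance to the query."""
    listing = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    prompt = (
        f"Query: {query}\n\nCandidate papers:\n{listing}\n\n"
        f"Return the indices of the {top_n} most relevant papers, most relevant "
        "first, as a comma-separated list of numbers."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content
    ranked = [int(tok)
              for tok in text.replace(",", " ").replace("[", " ").replace("]", " ").split()
              if tok.isdigit()]
    return ranked[:top_n]
```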

Implications and Future Work

The creation of LitSearch presents significant implications for the development of retrieval models tailored to scientific literature. Its realistic queries and exhaustive benchmarking provide a valuable testbed for researchers seeking to optimize retrieval systems in the scientific domain. By demonstrating a pronounced gap between current retrieval systems and potential LLM-enhanced systems, this paper lays the groundwork for further exploration into improved retrieval mechanisms.

Future research could expand the benchmark's scope to full-text retrieval. As observed in the paper, including more textual content did not yield consistent improvements across models, suggesting that retrievers still need to handle longer contexts more effectively. Additionally, improving model robustness across both inline-citation and author-written queries remains a critical avenue for ongoing research.

In conclusion, LitSearch stands as a pivotal contribution to the literature retrieval community, offering a nuanced and challenging benchmark that aligns closely with real-world research needs. Its relevance will likely grow as academic communities increasingly rely on automated systems to navigate ever-expanding scientific corpora.

Authors (6)
  1. Anirudh Ajith (5 papers)
  2. Mengzhou Xia (34 papers)
  3. Alexis Chevalier (10 papers)
  4. Tanya Goyal (24 papers)
  5. Danqi Chen (84 papers)
  6. Tianyu Gao (35 papers)