DAPR: A Benchmark on Document-Aware Passage Retrieval (2305.13915v4)

Published 23 May 2023 in cs.IR and cs.CL

Abstract: The work of neural retrieval so far focuses on ranking short texts and is challenged with long documents. There are many cases where the users want to find a relevant passage within a long document from a huge corpus, e.g. Wikipedia articles, research papers, etc. We propose and name this task \emph{Document-Aware Passage Retrieval} (DAPR). While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5\%) are due to missing document context. This drives us to build a benchmark for this task including multiple datasets from heterogeneous domains. In the experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find despite that hybrid retrieval performs the strongest on the mixture of the easy and the hard queries, it completely fails on the hard queries that require document-context understanding. On the other hand, contextualized passage representations (e.g. prepending document titles) achieve good improvement on these hard queries, but overall they also perform rather poorly. Our created benchmark enables future research on developing and comparing retrieval systems for the new task. The code and the data are available at https://github.com/UKPLab/arxiv2023-dapr.

Introduction to the DAPR Benchmark

Information Retrieval (IR) systems have dramatically enhanced our ability to sift through vast amounts of data, pinpointing the pieces of information that many applications depend on. While traditional methods such as BM25 rely on term-matching frequency, neural network models have shifted the paradigm toward modeling deep semantic links between texts, improving both the efficiency and the accuracy of retrieval.
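To make the term-matching baseline concrete, here is a minimal sketch of the classic BM25 scoring function over pre-tokenized documents. The parameter values `k1=1.5` and `b=0.75` are common defaults, not values taken from the paper.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document in `docs` (a list of token lists) against the query.

    Implements the standard BM25 formula: a term-frequency saturation term
    weighted by inverse document frequency, with length normalization.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores
```

Because scoring depends only on exact term overlap, BM25 cannot bridge vocabulary mismatch between query and passage, which is precisely where neural retrievers are expected to help.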

The Challenge of Long Documents

A critical obstacle for state-of-the-art neural passage retrievers is their limited ability to handle long documents. These models are typically optimized for short texts because of architectural constraints, such as the computational cost of self-attention in Transformer-based encoders, which grows quadratically with input length. This poses a significant issue in real-life scenarios where users need to retrieve information from extensive documents.
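A common workaround for this length limit (not necessarily the paper's exact preprocessing) is to split each long document into fixed-size, overlapping passages before encoding; the overlap keeps boundary sentences from losing all local context. A minimal sketch:

```python
def chunk_document(tokens, window=128, stride=96):
    """Split a long token sequence into overlapping passages.

    `window` is the passage length in tokens and `stride` the step between
    passage starts; window - stride tokens are shared between neighbors.
    The values here are illustrative, not taken from the paper.
    """
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already covers the tail of the document
        start += stride
    return chunks
```

The trade-off is that each passage is then scored in isolation, which is exactly the loss of document context that DAPR sets out to measure.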

Introducing DAPR: A New Task and Dataset

To bridge this gap, the new task of Document-Aware Passage Retrieval (DAPR) is proposed. The task challenges existing models to consider the broader context of a document when retrieving a specific passage relevant to a user's query. A key aspect of DAPR is the requirement for the retrieval system to interpret and utilize document-level context effectively.

Alongside the task specification, a benchmark comprised of multiple datasets from diverse domains has been developed. This benchmark is crucial for evaluating the performance of retrieval systems, as it includes both passage and whole-document retrieval scenarios. A distinction is made between Query-to-Passage (Q2P) and Query-to-Document (Q2D) within the benchmark to differentiate between the two levels of retrieval granularity.
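One way to picture the relationship between the two granularities: document-level (Q2D) judgments can be derived from passage-level (Q2P) ones by aggregating over a document's passages. The sketch below uses a max-over-passages rule purely for illustration; the benchmark's own construction may differ.

```python
from collections import defaultdict

def q2p_to_q2d(q2p_qrels):
    """Derive Query-to-Document judgments from Query-to-Passage ones.

    q2p_qrels maps query_id -> {(doc_id, passage_id): relevance_label}.
    A document is taken to be as relevant as its most relevant passage
    (an illustrative assumption, not the paper's definition).
    """
    q2d = defaultdict(dict)
    for qid, judged in q2p_qrels.items():
        for (doc_id, _passage_id), label in judged.items():
            q2d[qid][doc_id] = max(q2d[qid].get(doc_id, 0), label)
    return dict(q2d)
```

This also highlights why Q2P is the harder setting: a Q2D system only needs to find the right document, while a Q2P system must additionally localize the relevant passage within it.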

Experimentation and Insights

The experiments in the paper extend current neural passage retrievers with several ways of incorporating document-level context: prepending document titles or summaries to passages (contextualized passage representations), pooling over passage representations, and hybrid retrieval that combines BM25 with neural scores. The results show that prepending document summaries can hurt performance in some settings, that pooling falls short of representing the full document context, and that hybrid retrieval performs strongest on the overall mixture of easy and hard queries yet fails on the hard queries that require document-context understanding.
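The hybrid retrieval idea can be sketched as score fusion over a shared candidate list: normalize the BM25 and dense-retriever scores to a common range, then take a weighted combination. Min-max normalization with a convex weight `alpha` is one common scheme; the paper's exact fusion may differ.

```python
def minmax(scores):
    """Rescale a score list to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25_scores, dense_scores, alpha=0.5):
    """Fuse lexical and neural scores for the same ranked candidate list.

    alpha weights the BM25 side; alpha=0.5 is an illustrative default.
    """
    nb = minmax(bm25_scores)
    nd = minmax(dense_scores)
    return [alpha * b + (1 - alpha) * d for b, d in zip(nb, nd)]
```

Normalization is the key step here: raw BM25 scores and dense dot products live on incompatible scales, so combining them without rescaling would let one retriever dominate regardless of `alpha`.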

Conclusion

In conclusion, the paper's experiments highlight the complexity and potential of Document-Aware Passage Retrieval. Neural strategies implemented in isolation struggle with DAPR tasks, signaling a wide-open opportunity for future research to develop more sophisticated retrieval systems that can navigate the nuances of document-level context and long-form documents. The source code and datasets used for these studies are made publicly available, inviting further investigation and advancement in the field.

Authors (3)
  1. Kexin Wang
  2. Nils Reimers
  3. Iryna Gurevych