Introduction
Dense retrieval systems are integral to open-domain NLP applications, where they locate relevant information in large corpora. One crucial yet often overlooked design choice is the granularity of the retrieval unit: whether a document, a passage, or a sentence should be indexed and retrieved. This paper examines that choice and its effect on retrieval performance.
Propositions as Retrieval Units
While passages and sentences are routinely used as retrieval units, this paper proposes a different approach: using "propositions" as retrieval units. Propositions are defined as atomic expressions within the text, each stating a distinct factoid in a clear, standalone natural language format. Unlike a passage, which spans many facts, or a sentence, which can be complex and depend on its context, each indexed proposition is a self-contained unit carrying a single fact, which may improve retrieval quality.
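To make the three granularities concrete, the following is a minimal sketch of how one passage might be represented at each level. It is illustrative only: the `RetrievalUnit` class, the `split_sentences` helper, the `source_id` field, and the example passage are assumptions introduced here, and the propositions are written by hand rather than produced by the paper's method.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalUnit:
    """Hypothetical container for one indexed unit."""
    text: str
    granularity: str  # "passage", "sentence", or "proposition"
    source_id: str    # hypothetical id of the passage the unit came from


def split_sentences(passage: str) -> list[str]:
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]


# Illustrative passage, not taken from the paper's corpus.
passage = (
    "Marie Curie was born in Warsaw in 1867. "
    "She won the Nobel Prize in Physics in 1903 and the Nobel Prize in Chemistry in 1911."
)

# Hand-written propositions: one fact each, with the pronoun "She"
# resolved so that every unit can be read on its own.
propositions = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie was born in 1867.",
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie won the Nobel Prize in Chemistry in 1911.",
]

units = (
    [RetrievalUnit(passage, "passage", "p0")]
    + [RetrievalUnit(s, "sentence", "p0") for s in split_sentences(passage)]
    + [RetrievalUnit(p, "proposition", "p0") for p in propositions]
)

for unit in units:
    print(f"{unit.granularity:12s} {unit.text}")
```

In this sketch the second sentence is only interpretable with the first one in view, whereas each proposition names its subject explicitly. That is the "standalone" property described above.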
Empirical Evaluation of Retrieval Granularity
The paper compares retrieval granularities empirically using a processed version of the English Wikipedia corpus, termed 'FACTOID WIKI.' This corpus is indexed at three levels: 100-word passages, sentences, and propositions. Six dual-encoder retrievers were evaluated on five open-domain QA datasets. A key finding is that proposition-based retrieval substantially outperforms traditional passage-based and sentence-based retrieval in dense retrieval tasks.
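The comparison can be pictured as building one index per granularity, running the same retriever over each, and scoring the results. The sketch below is a toy version under stated assumptions: `encode` is a bag-of-words stand-in for a learned dual encoder (it is not one of the six retrievers), the corpus is a single hand-made passage, and Recall@k is assumed as the metric because this section does not name one.

```python
import math
import re
from collections import Counter


def encode(text):
    # Stand-in encoder (assumption): L2-normalised bag-of-words counts.
    # A real dual encoder would return a dense vector from a neural model.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def score(q_vec, u_vec):
    # Inner product between query and unit vectors, as in a dual encoder.
    return sum(w * u_vec.get(tok, 0.0) for tok, w in q_vec.items())


def build_index(units):
    # Units are encoded once, independently of any query.
    return [(u, encode(u)) for u in units]


def retrieve(query, index, k=1):
    q_vec = encode(query)
    ranked = sorted(index, key=lambda item: score(q_vec, item[1]), reverse=True)
    return [u for u, _ in ranked[:k]]


def recall_at_k(questions, index, k):
    # Fraction of questions whose answer string appears in a top-k unit.
    hits = 0
    for query, answer in questions:
        if any(answer.lower() in u.lower() for u in retrieve(query, index, k)):
            hits += 1
    return hits / len(questions)


# Hand-made toy corpus (illustrative, not from the paper's data).
sentences = [
    "Marie Curie was born in Warsaw in 1867.",
    "She won the Nobel Prize in Physics in 1903 and the Nobel Prize in Chemistry in 1911.",
]
propositions = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie was born in 1867.",
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie won the Nobel Prize in Chemistry in 1911.",
]
indexes = {
    "passage": build_index([" ".join(sentences)]),
    "sentence": build_index(sentences),
    "proposition": build_index(propositions),
}

questions = [("Which year did Marie Curie win the Nobel Prize in Chemistry?", "1911")]
for name, index in indexes.items():
    print(name, recall_at_k(questions, index, k=1))
```

A corpus this small cannot show any difference between granularities; the sketch only illustrates the shape of the experiment, in which the retriever and questions are held fixed while the indexed unit changes.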
Downstream Task Performance and Contributions
Proposition-based retrieval improves not only retrieval itself but also performance on downstream QA tasks. Because propositions are more condensed, they carry a higher density of relevant information, so fewer input tokens are needed and less irrelevant content is included. The main contributions are the proposition as a novel retrieval unit for dense retrieval and the introduction of 'FACTOID WIKI.' The paper also shows that proposition retrieval generalizes and yields higher accuracy in downstream question answering under the same input token limit, which supports the practical value of propositions for giving dense retrievers more efficient access to information.
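The effect of a fixed input limit can be sketched as follows. The function name `fill_context`, the use of whitespace-separated words as a proxy for tokens, and the rule of keeping only whole units are all assumptions made for illustration; the section does not describe how the limit is applied.

```python
def fill_context(ranked_units, max_words):
    """Keep top-ranked units, in order, while the word budget allows.

    Assumptions: words are counted by whitespace splitting, and a unit
    that does not fit ends the selection instead of being truncated.
    """
    selected, used = [], 0
    for unit in ranked_units:
        n_words = len(unit.split())
        if used + n_words > max_words:
            break
        selected.append(unit)
        used += n_words
    return selected, used


# Illustrative ranked lists for the same question at two granularities.
ranked_propositions = [
    "Marie Curie won the Nobel Prize in Chemistry in 1911.",
    "Marie Curie won the Nobel Prize in Physics in 1903.",
]
ranked_sentences = [
    "She won the Nobel Prize in Physics in 1903 and the Nobel Prize in Chemistry in 1911.",
]

print(fill_context(ranked_propositions, max_words=12))
print(fill_context(ranked_sentences, max_words=12))
```

With a 12-word budget, the top proposition (10 words) fits and already contains the answer, while the 17-word sentence does not fit at all. A truncating policy would behave differently, but the example shows why denser units leave more of a fixed budget for relevant content.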