- The paper introduces a novel Progressive Query Expansion algorithm that merges PRF with LLMs to refine queries iteratively.
- The methodology balances API retrieval costs against improved document ranking, achieving an average 37% improvement in MRR and R@1.
- ProQE performs robustly across both sparse (BM25) and dense retrieval systems, offering practical cost savings for real-world applications.
Insights on "Progressive Query Expansion for Retrieval Over Cost-constrained Data Sources"
The paper "Progressive Query Expansion for Retrieval Over Cost-constrained Data Sources" by Rashid et al. presents a novel approach to enhancing retrieval efficiency in systems where data retrieval incurs a per-access cost. The proposed solution, Progressive Query Expansion (ProQE), aims to retrieve relevant documents efficiently when those documents are accessed through APIs that charge per call.
Core Contributions and Methodology
The central innovation of this paper lies in a progressive query expansion algorithm that judiciously combines pseudo-relevance feedback (PRF) techniques with the generative capabilities of large language models (LLMs). This combination is designed to mitigate the shortcomings of both methods: the noise introduced by irrelevant documents in PRF, and the hallucination issues associated with LLMs.
The ProQE algorithm iteratively refines the initial query, progressively integrating additional terms extracted from the documents retrieved in each iteration. This lets the system balance the cost of each retrieval call against the benefit of a potentially more accurate ranking. A distinctive feature of ProQE is its adaptability: it functions effectively across both sparse and dense retrieval systems. It employs LLMs not only to generate candidate expansion terms but also to assess their relevance, adding robustness against non-factual content.
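The iterative expand-retrieve loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the toy corpus, the term-overlap `retrieve` function (standing in for a paid API call), and the `llm_score` heuristic (standing in for an LLM relevance judgment) are all hypothetical stand-ins.

```python
from collections import Counter

# Toy corpus standing in for a cost-per-call retrieval API (hypothetical).
CORPUS = {
    "d1": "query expansion adds terms to the original query",
    "d2": "pseudo relevance feedback extracts terms from top documents",
    "d3": "large language models can hallucinate non factual terms",
}

def retrieve(query, k=2):
    """Stand-in for a paid retrieval call: rank documents by term overlap."""
    q_terms = set(query.split())
    scored = sorted(CORPUS.items(),
                    key=lambda kv: -len(q_terms & set(kv[1].split())))
    return [doc_id for doc_id, _ in scored[:k]]

def llm_score(term, query):
    """Stub for an LLM relevance judgment; here, a trivial heuristic."""
    return 1.0 if term not in query.split() and len(term) > 3 else 0.0

def proqe_sketch(query, iterations=2, terms_per_round=2):
    """Progressively expand the query, paying one retrieval call per round."""
    for _ in range(iterations):
        top_docs = retrieve(query)             # one paid API call per round
        candidates = Counter()
        for doc_id in top_docs:                # PRF: mine terms from feedback docs
            candidates.update(CORPUS[doc_id].split())
        # Keep only the terms the (stubbed) LLM judges relevant.
        picked = [t for t, _ in candidates.most_common()
                  if llm_score(t, query) > 0][:terms_per_round]
        query = query + " " + " ".join(picked)
    return query

expanded = proqe_sketch("query expansion")
print(expanded)
```

The key design point the sketch captures is that expansion terms are added a few at a time, one retrieval round each, so the system can stop as soon as the ranking is good enough rather than paying for all retrievals up front.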
Experimental Design and Results
The researchers evaluated ProQE's effectiveness on four well-recognized retrieval datasets: Natural Questions, Web Questions, and TREC DL 19 and DL 20. The experimental setup contrasted ProQE with traditional PRF methods such as RM3 and Rocchio, and with generative expansion techniques like query2doc and chain-of-thought (CoT) prompting. The results consistently show that ProQE outperforms these baselines, achieving an average improvement of 37% in Mean Reciprocal Rank (MRR) and Recall at 1 (R@1). This gain in retrieval accuracy translates directly into savings for practical systems where each document retrieval incurs a financial charge.
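For reference, the two reported metrics are straightforward to compute. The helper below is illustrative, not from the paper; it takes, for each query, the 1-based rank of the first relevant document (or `None` if none was retrieved).

```python
def mrr_and_r_at_1(rankings):
    """Compute Mean Reciprocal Rank and Recall@1 from first-relevant ranks.

    rankings: for each query, the rank (1-based) of the first relevant
    document, or None if no relevant document was retrieved.
    """
    rr = [1.0 / r if r is not None else 0.0 for r in rankings]
    mrr = sum(rr) / len(rankings)
    r_at_1 = sum(1 for r in rankings if r == 1) / len(rankings)
    return mrr, r_at_1

# Example: first relevant doc at rank 1, rank 3, and missing, respectively.
mrr, r1 = mrr_and_r_at_1([1, 3, None])
print(round(mrr, 3), round(r1, 3))  # → 0.444 0.333
```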
Moreover, the algorithm performs solidly across both BM25-based sparse retrieval and dense retrieval methods like DPR and TCT-ColBERT, showcasing its flexibility and utility in diverse retrieval contexts.
Implications and Future Directions
The implications of this research are manifold. Practically, it provides a cost-efficient solution for systems reliant on external API-based document retrieval, such as legal and academic databases, where the minimization of unnecessary queries can lead to substantial cost savings. Theoretically, this approach opens pathways for integrating LLMs into retrieval systems not merely as generative tools but as active components in query optimization processes.
Future work might further refine the balance between LLM utilization and cost constraints, potentially incorporating adaptive learning algorithms to improve the relevance of expansion terms over time. Additionally, integrating ProQE into live commercial systems could provide further insight into its scalability and efficacy in real-world scenarios.
This paper stands as a substantive contribution to information retrieval, particularly in scenarios where cost considerations are pivotal. It combines theoretical innovation with practical applicability, offering a robust approach to improving retrieval effectiveness while respecting cost constraints.