
Exploring the Best Practices of Query Expansion with Large Language Models (2401.06311v3)

Published 12 Jan 2024 in cs.IR

Abstract: LLMs are foundational in language technologies, particularly in information retrieval (IR). Previous studies have utilized LLMs for query expansion, achieving notable improvements in IR. In this paper, we thoroughly explore the best practice of leveraging LLMs for query expansion. To this end, we introduce a training-free, straightforward yet effective framework called Multi-Text Generation Integration (MuGI). It leverages LLMs to generate multiple pseudo-references, integrating them with queries to enhance both sparse and dense retrievers. Our empirical findings reveal that: (1) increasing the number of samples from LLMs benefits IR systems; (2) a balance between the query and pseudo-documents, and an effective integration strategy, are critical for high performance; (3) contextual information from LLMs is essential, even boosting a 23M model to outperform a 7B baseline model; (4) pseudo relevance feedback can further calibrate queries for improved performance; and (5) query expansion is widely applicable and versatile, consistently enhancing models ranging from 23M to 7B parameters. Our code and all generated references are available at https://github.com/lezhang7/Retrieval_MuGI

Analyzing MuGI: Multi-Text Generation Integration in Information Retrieval Systems

The paper "Exploring the Best Practices of Query Expansion with Large Language Models" proposes a novel framework for augmenting Information Retrieval (IR) systems with LLMs. The authors address the limitations of traditional IR methods by introducing Multi-Text Generation Integration (MuGI), which uses LLMs to generate multiple pseudo-references that enrich the original query.

Core Methodology

MuGI is designed to improve both sparse and dense retrievers without any additional training requirements. The framework enhances IR by dynamically integrating multiple generative text samples with the original input query. This approach provides two primary functions: boosting the retrieval phase with an enriched query that carries more context and relevant keywords, and enabling a re-ranking phase that better captures document relevance.

  1. MuGI for Sparse Retrieval: The approach utilizes lexical-based methods like BM25, augmented with LLM-generated pseudo references. Instead of merely expanding the query with static terms, MuGI employs an adaptive query repetition strategy, dynamically balancing the weight of the original query against the generated content based on pseudo-reference length.
  2. MuGI for Dense Retrieval: For dense retrieval, MuGI augments query embeddings by concatenating multiple generated passages. This enhances the semantic richness of queries, thereby improving the alignment with relevant document embeddings in high-dimensional space.
  3. MuGI Pipeline: This comprehensive application combines sparse and dense retrieval enhancements. Initially, MuGI-influenced queries retrieve a broad set of potential matches, which are subsequently refined in a dense re-ranking phase.
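The two expansion strategies above can be sketched in a few lines. This is a minimal illustration only: the function names, the balancing heuristic, and the `ratio` parameter are assumptions for exposition, not the paper's exact implementation.

```python
def expand_query_sparse(query, pseudo_refs, ratio=5):
    """Adaptive query repetition (illustrative): repeat the original query
    so its token mass stays roughly 1/ratio of the pseudo-reference length,
    then append the generated references for a lexical retriever like BM25."""
    ref_text = " ".join(pseudo_refs)
    n_query_tokens = max(len(query.split()), 1)
    repeats = max(1, len(ref_text.split()) // (ratio * n_query_tokens))
    return " ".join([query] * repeats + [ref_text])


def expand_query_dense(query, pseudo_refs):
    """Dense-side expansion: concatenate query and references before
    embedding, letting the encoder fuse the added context."""
    return query + " " + " ".join(pseudo_refs)
```

The sparse variant scales the original query's weight with the length of the generated text, so long pseudo-references do not drown out the user's own terms; the dense variant leaves the balancing to the embedding model.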

Experimental Results

The authors conducted extensive evaluations on both in-domain and out-of-distribution datasets using benchmarks like TREC DL19/DL20 and BEIR. Here are some critical findings:

  • Improved Performance: MuGI substantially enhances BM25, achieving improvements of 19.8% on TREC DL19 and up to 7.6% on the BEIR benchmark. This demonstrates the efficacy of MuGI for sparse retrieval, even outperforming advanced dense retrievers such as ANCE in certain settings.
  • Robust Re-ranking: MuGI also strengthens the re-ranking capabilities of dense retrieval models. Compared to traditional re-ranking methods, it delivers superior results against baselines including MonoT5 and Cohere Rerank v2, especially under in-domain conditions.
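TREC DL and BEIR results are conventionally reported as nDCG@10; this summary does not state which metric the percentages above refer to, so the following is a generic sketch of how such a score is computed from graded relevance labels.

```python
import math

def dcg_at_k(rels, k=10):
    # Discounted cumulative gain over the top-k graded relevance labels,
    # taken in the order the system ranked the documents.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```

Note that some implementations use an exponential gain (2^rel - 1) in place of the linear gain shown here; the normalization step is the same either way.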

Theoretical and Practical Implications

The research underscores the potential of generative models in bridging semantic gaps in IR tasks. By integrating multiple pseudo references, MuGI supplies additional context and vocabulary, thus enhancing both lexical and semantic retrieval dimensions. The implication is a more refined and effective IR pipeline capable of handling diverse queries with greater accuracy.

The approach not only offers a straightforward method to augment existing IR systems without necessitating dataset-dependent retraining but also paves the way for more sophisticated retrieval systems that capitalize on the capabilities of LLMs to provide enriched query contexts.
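Because the expansion happens entirely on the query side, it can be dropped in front of any existing retriever without touching the retriever itself. A toy illustration, with a simple term-frequency scorer standing in for BM25 and made-up documents and pseudo-reference:

```python
from collections import Counter

def score(query, doc):
    # Toy term-frequency overlap scorer standing in for BM25.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(q[t] * d[t] for t in q)

docs = [
    "paris eiffel tower seine louvre",   # relevant, but shares no term with the raw query
    "berlin capital germany reichstag",  # irrelevant, but matches the word "capital"
]

raw_q = "capital of france"
# A pseudo-reference as an LLM might generate it (invented for this sketch):
pseudo_ref = "paris eiffel tower city of lights on the seine france"
expanded_q = raw_q + " " + pseudo_ref

def best(query):
    # Index of the top-scoring document; the scorer is identical either way.
    return max(range(len(docs)), key=lambda i: score(query, docs[i]))
```

With the raw query, lexical overlap favors the wrong document; the pseudo-reference injects exactly the vocabulary needed to match the relevant one, which is the semantic gap the paper targets.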

Future Directions

The paper suggests multiple avenues for further research. One pertinent direction is investigating the scalability of MuGI across varying domains and datasets. Additionally, the adaptation of this framework to incorporate continuous advancements in LLM architectures could further amplify retrieval precision and scalability. Exploring the integration of MuGI with emerging IR paradigms, like Retrieval-Augmented Generation models, may present opportunities for more nuanced and comprehensive information discovery systems.

In conclusion, this paper presents a significant contribution to the field of Information Retrieval, offering new insights into the application of LLMs for query expansion and retrieval enhancement. It highlights the potential of MuGI as a highly adaptable and efficient enhancement for both sparse and dense retrieval frameworks.

Authors
  1. Le Zhang
  2. Yihong Wu
  3. Qian Yang
  4. Jian-Yun Nie