
Corpus-Steered Query Expansion with Large Language Models (2402.18031v1)

Published 28 Feb 2024 in cs.IR and cs.CL

Abstract: Recent studies demonstrate that query expansions generated by LLMs can considerably enhance information retrieval systems by generating hypothetical documents that answer the queries as expansions. However, challenges arise from misalignments between the expansions and the retrieval corpus, resulting in issues like hallucinations and outdated information due to the limited intrinsic knowledge of LLMs. Inspired by Pseudo Relevance Feedback (PRF), we introduce Corpus-Steered Query Expansion (CSQE) to promote the incorporation of knowledge embedded within the corpus. CSQE utilizes the relevance assessing capability of LLMs to systematically identify pivotal sentences in the initially-retrieved documents. These corpus-originated texts are subsequently used to expand the query together with LLM-knowledge empowered expansions, improving the relevance prediction between the query and the target documents. Extensive experiments reveal that CSQE exhibits strong performance without necessitating any training, especially with queries for which LLMs lack knowledge.


Summary

  • The paper introduces a novel hybrid CSQE method that leverages corpus-derived texts to counteract LLM hallucinations and enhance retrieval accuracy.
  • It demonstrates a two-step process where an LLM first selects relevant documents and then extracts key sentences from the corpus for query enrichment.
  • Experimental results show that CSQE outperforms state-of-the-art baselines across varied datasets, achieving robust performance without additional training.

Enhancing Information Retrieval with Corpus-Steered Query Expansion Using LLMs

Introduction to Corpus-Steered Query Expansion (CSQE)

In information retrieval, LLMs have opened a novel approach to query expansion, improving the relevance and accuracy of retrieved documents. However, such improvements often come at the cost of expansions that are misaligned with the retrieval corpus, leading to hallucinations and the inclusion of outdated information. To address these challenges, the paper presents Corpus-Steered Query Expansion (CSQE), a method that combines the relevance-assessment strength of LLMs with the factual correctness and recency inherent in the corpus itself. This approach mitigates the limitations of relying solely on the intrinsic knowledge of LLMs while still leveraging that knowledge to identify pivotal sentences in the corpus and incorporate them into the expansion process.

CSQE Methodology

CSQE is a two-step process: an LLM first identifies the relevant documents in an initial retrieval set, then extracts the key sentences that make those documents relevant to the query. These corpus-derived expansions are combined with expansions generated from the LLM's intrinsic knowledge to enrich the original query. Grounded in both corpus-originated texts and LLM-generated expansions, this hybrid approach improves the relevance and factuality of the expanded query, outperforming methods that depend on LLMs alone, as sketched below.
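
To make the pipeline concrete, here is a minimal Python sketch of the two-step process. It is an illustration under stated assumptions, not the paper's implementation: `retrieve` and `llm` are hypothetical interfaces standing in for a BM25-style retriever and a text-in, text-out LLM call, and the prompts are loose paraphrases of the roles described above.

```python
# Minimal sketch of a CSQE-style pipeline. All helper names are hypothetical.
from typing import Callable, List

def csqe_expand(
    query: str,
    retrieve: Callable[[str, int], List[str]],  # assumed: returns top-k document texts
    llm: Callable[[str], str],                  # assumed: prompt in, completion out
    k: int = 10,
) -> str:
    # Step 1: initial retrieval, then ask the LLM which documents are relevant.
    docs = retrieve(query, k)
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs))
    rel = llm(
        f"Query: {query}\n\nDocuments:\n{numbered}\n\n"
        "List the indices of the documents relevant to the query, comma-separated."
    )
    relevant = [
        docs[int(i)]
        for i in rel.split(",")
        if i.strip().isdigit() and int(i) < len(docs)
    ]

    # Step 2: extract the pivotal sentences from each relevant document.
    sentences = []
    for d in relevant:
        sentences.append(llm(
            f"Query: {query}\n\nDocument: {d}\n\n"
            "Copy the sentences from the document that are most important "
            "for answering the query."
        ))

    # LLM-knowledge expansion: a hypothetical passage answering the query.
    hypothetical = llm(f"Write a short passage that answers the query: {query}")

    # Expanded query = original query (repeated to preserve its term weights
    # under lexical scoring) + LLM expansion + corpus-extracted sentences.
    return " ".join([query] * 5 + [hypothetical] + sentences)
```

Repeating the original query in the concatenation follows common practice in LLM-based expansion (e.g., Query2doc), so that the expansion text does not drown out the original query terms under BM25 weighting; the repetition factor of 5 here is illustrative.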

Experimental Verification and Results

CSQE was evaluated on both high-resource web search datasets and low-resource retrieval datasets spanning a variety of domains. Comparisons against state-of-the-art models and traditional pseudo-relevance feedback (PRF) methods demonstrated CSQE's superior performance, highlighting its robustness and generalizability across settings. Notably, pairing CSQE with a basic BM25 retriever yielded significant improvements over purely LLM-knowledge-empowered expansions and even surpassed the ContrieverFT model across all evaluated metrics, without requiring any training.
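
To show how an expanded query feeds into lexical retrieval, here is a small, self-contained sketch using the open-source rank_bm25 package. The toy corpus, the expansion text, and the repetition factor are illustrative assumptions, not the paper's experimental setup, which uses standard IR benchmarks.

```python
# Illustrative scoring of an expanded query with BM25 (rank_bm25 package).
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 is a bag-of-words ranking function used in information retrieval.",
    "Large language models can generate hypothetical answer passages.",
    "Pseudo relevance feedback expands queries with terms from top documents.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# An expanded query appends expansion text to the original query; repeating
# the original query boosts its term weights relative to the expansions.
query = "query expansion with feedback"
expansion = "pseudo relevance feedback expands queries with top documents"
expanded = " ".join([query] * 5 + [expansion])

scores = bm25.get_scores(expanded.lower().split())
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.3f}  {doc}")
```

Because no model is trained or fine-tuned, this kind of plug-in expansion is the reason CSQE can be layered on top of an off-the-shelf BM25 index.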

Future Implications and Prospects

CSQE represents a promising advance in information retrieval, showcasing the synergy between LLMs and corpus-derived evidence. Its demonstrated ability to mitigate hallucinations and surface up-to-date information from the corpus opens avenues for further hybrid models that combine the broad knowledge of LLMs with the factual grounding of specific corpora. Moreover, because CSQE adapts to varied datasets without intensive training or domain-specific fine-tuning, it offers an accessible way to improve query expansion in retrieval systems.

Conclusion

Corpus-Steered Query Expansion takes a significant step toward resolving the limitations of LLM-based query expansion by strategically incorporating corpus-originated texts. The approach capitalizes on the extensive knowledge of LLMs while keeping expansions grounded in the factual, relevant content of the corpus, improving both the effectiveness and reliability of retrieval systems. The promising results and the method's adaptability across domains underscore CSQE's potential as a versatile tool in the evolving landscape of search technologies.
