
The Power of Noise: Redefining Retrieval for RAG Systems (2401.14887v4)

Published 26 Jan 2024 in cs.IR and cs.CL

Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a method to extend beyond the pre-trained knowledge of LLMs by augmenting the original prompt with relevant passages or documents retrieved by an Information Retrieval (IR) system. RAG has become increasingly important for Generative AI solutions, especially in enterprise settings or in any domain in which knowledge is constantly refreshed and cannot be memorized in the LLM. We argue here that the retrieval component of RAG systems, be it dense or sparse, deserves increased attention from the research community, and accordingly, we conduct the first comprehensive and systematic examination of the retrieval strategy of RAG systems. We focus, in particular, on the type of passages IR systems within a RAG solution should retrieve. Our analysis considers multiple factors, such as the relevance of the passages included in the prompt context, their position, and their number. One counter-intuitive finding of this work is that the retriever's highest-scoring documents that are not directly relevant to the query (e.g., do not contain the answer) negatively impact the effectiveness of the LLM. Even more surprising, we discovered that adding random documents in the prompt improves the LLM accuracy by up to 35%. These results highlight the need to investigate the appropriate strategies when integrating retrieval with LLMs, thereby laying the groundwork for future research in this area.

Introduction

Advancements in LLMs have brought remarkable capabilities in text generation and understanding, yet their limited context windows remain a constraint. RAG systems aim to overcome this limitation by giving the model access to external, dynamically retrieved information during response generation. In a comprehensive analysis, the paper examines the integral role of the IR phase in a RAG setup, posing an essential research question: what should a retriever return to build an effective RAG prompt? The study focuses in particular on three types of retrieved documents: relevant, related, and irrelevant.

Retrieval-Augmented Generation Systems

A RAG system enhances factual text generation by supplementing the LLM with an IR component. A key advantage is that retrieval effectively extends the context available to the LLM: dynamically retrieved passages enrich the input to the generative module and directly affect the accuracy of its responses. The core inquiry concerns the role of the retriever and the characteristics it should have to produce optimal prompts. The paper breaks new ground by considering not only the relevance of retrieved documents but also their position in the prompt, as well as the surprising benefits of including irrelevant documents.
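The prompt-construction step described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline; the function name and prompt template are hypothetical assumptions chosen for clarity.

```python
def build_rag_prompt(query: str, passages: list[str]) -> str:
    """Concatenate retrieved passages ahead of the query so the LLM
    can ground its answer in them (hypothetical template)."""
    context = "\n\n".join(
        f"Document [{i + 1}]: {p}" for i, p in enumerate(passages)
    )
    return (
        "Answer the question using the documents below.\n\n"
        f"{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_rag_prompt(
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy written by William Shakespeare.",
     "The Globe Theatre opened in 1599."],
)
```

The paper's central question is precisely what should populate the `passages` list: only relevant documents, or a mix that includes related or even irrelevant ones, and in what order.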

Experimental Insights

The paper methodically assesses the impact of the IR phase, revealing that related-but-not-relevant documents harm RAG performance more than entirely unrelated ones. The counterintuitive discovery is that irrelevant (random) documents, when included in the context, can improve accuracy by up to 35%. Different placements of the gold document relative to the query are also explored, and placing it near the query is observed to enhance LLM performance. These findings challenge conventional assumptions about the utility of retrieved documents and advocate for rethinking information retrieval strategies when optimizing RAG systems.
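The experimental setup varies two factors: where the gold document sits in the context, and how many random documents pad it. A minimal sketch of such a configuration generator follows; the function name, parameters, and seeding scheme are assumptions for illustration, not the paper's actual code.

```python
import random

def make_context(gold: str, distractors: list[str],
                 gold_position: int, n_random: int,
                 random_pool: list[str], seed: int = 0) -> list[str]:
    """Place the gold passage at a chosen index among distractor
    passages, then pad the context with randomly sampled unrelated
    passages (hypothetical experiment helper)."""
    rng = random.Random(seed)  # fixed seed for reproducible runs
    docs = list(distractors)
    docs.insert(min(gold_position, len(docs)), gold)
    docs += rng.sample(random_pool, n_random)
    return docs

# Gold document placed first, two random passages appended.
docs = make_context("GOLD", ["related-1", "related-2"],
                    gold_position=0, n_random=2,
                    random_pool=["rnd-1", "rnd-2", "rnd-3"])
```

Sweeping `gold_position` and `n_random` over such configurations, and measuring answer accuracy at each setting, is the kind of systematic comparison the paper reports.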

Future Directions

These insights demand a systematic rethinking of IR strategies within RAG frameworks. Given that LLMs can attend to only a finite number of documents, retrievers should supply a minimal set, balancing relevant content with a certain allowance for irrelevant material, which surprisingly tends to increase accuracy. Moreover, the paper calls for research into why random, irrelevant documents appear to improve the accuracy of LLM responses within RAG systems. The work encourages future research to explain why this noise can be beneficial and to characterize the properties that contribute to its unexpected utility.

Authors (8)
  1. Florin Cuconasu
  2. Giovanni Trappolini
  3. Federico Siciliano
  4. Simone Filice
  5. Cesare Campagnano
  6. Yoelle Maarek
  7. Nicola Tonellotto
  8. Fabrizio Silvestri