Mechanism underlying the effectiveness of noisy prompts in RAG

Investigate and elucidate why adding random, unrelated documents to the prompt context in Retrieval-Augmented Generation (RAG) for open-domain question answering improves large language model accuracy, and identify the properties and mechanisms of the resulting "noisy state" that make it effective.
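To make the setup concrete, here is a minimal sketch of the prompt layout under study: retrieved documents followed by random "noise" documents placed adjacent to the query. The function name and template are our own illustration, not the paper's code.

```python
# Hypothetical sketch (our own naming): random documents are appended
# after the retrieved ones, so they end up nearest the query.
def build_noisy_prompt(retrieved_docs, random_docs, question):
    context = "\n\n".join(list(retrieved_docs) + list(random_docs))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_noisy_prompt(
    ["Gold passage about the answer."],
    ["Unrelated Reddit snippet.", "Nonsensical word sequence."],
    "Who wrote the passage?",
)
```

The ordering matters because the paper reports the accuracy gain specifically when the random documents sit near the query.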

Background

The paper shows that including random documents—drawn from Wikipedia, Reddit, or even nonsensical word sequences—can improve the accuracy of LLMs in RAG settings, particularly when positioned near the query. The authors measure attention entropy and observe a threefold increase when random documents are added, connecting this observation to prior work on entropy collapse, but they refrain from asserting a definitive causal explanation.
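The entropy measurement above can be illustrated with a toy calculation: the Shannon entropy of an attention distribution grows as probability mass spreads over more context tokens, which is the direction of the shift the authors report. The example distributions below are invented for illustration, not taken from the paper.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of a single attention distribution."""
    return -sum(p * math.log(p) for p in weights if p > 0)

# Hypothetical distributions (ours): attention concentrated on one
# relevant passage vs. attention diluted across added random documents.
sharp = [0.85, 0.05, 0.05, 0.05]
diluted = [0.25, 0.15, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]

print(attention_entropy(sharp) < attention_entropy(diluted))  # True
```

In practice one would average this quantity over heads, layers, and positions of a real model's attention maps; the toy example only shows why spreading attention raises the measured entropy.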

They explicitly acknowledge that, despite observed patterns, they cannot yet answer why this noisy state is advantageous and call for future studies to determine the reasons and characteristics that make noise beneficial.

References

Although these experiments show a pattern, we cannot yet answer this question in a definitive manner. While out of the scope of this work, which focuses on the retriever component of RAG systems, we believe it is highly important to investigate the reasons for which the LLM shows this behavior. Future studies should aim to elucidate why this noisy state is more advantageous and identify the characteristics that contribute to its effectiveness.

The Power of Noise: Redefining Retrieval for RAG Systems (Cuconasu et al., arXiv:2401.14887, 26 Jan 2024), Results, Subsection "On The Unreasonable Effectiveness Of Random Documents"