
Formal theory for the RAG retrieval trade-off

Establish a formal and comprehensive theoretical explanation for the observed trade-off between the number of relevant documents and the number of totally irrelevant (random) documents included in the context of Retrieval-Augmented Generation prompts for open-domain question answering. In particular, the theory should clarify why accuracy improves when a minimal set of retrieved documents is supplemented with random documents, and why performance degrades when many semantically related but non-answer-bearing (distractor) documents are included.


Background

The paper reports counterintuitive findings in Retrieval-Augmented Generation (RAG) for open-domain question answering: adding random documents to the prompt can increase LLM accuracy, whereas adding high-scoring but non-answer-bearing documents (distractors) tends to degrade it. The authors observe a trade-off between relevant and random documents and note that the best effectiveness is obtained by retrieving a small number of documents (about 3–5) and filling the remaining context with random documents.
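As a purely illustrative aid (not part of the paper), the following Python sketch shows one way the retrieve-few-then-pad-with-random heuristic could be implemented. The `retrieve` function, the `corpus` list of passage strings, and all parameter names are assumptions introduced here, and the ordering of documents in the context is a design choice to be validated on the target LLM rather than a prescription from the paper.

```python
import random


def build_rag_context(question, retrieve, corpus, k_relevant=5, total_docs=10, seed=0):
    """Assemble a RAG prompt context: a few retrieved documents plus random filler.

    Assumptions: `retrieve(question, top_k)` returns the top_k passages (strings)
    ranked by relevance to `question`; `corpus` is the full passage collection
    from which random filler documents are sampled.
    """
    rng = random.Random(seed)

    # Keep only a small number of top-ranked documents (the paper reports
    # roughly 3-5 as the sweet spot).
    relevant = retrieve(question, top_k=k_relevant)

    # Fill the remaining context slots with documents drawn uniformly at random
    # from the corpus, excluding the ones already retrieved.
    pool = [doc for doc in corpus if doc not in relevant]
    n_random = max(0, total_docs - len(relevant))
    random_fill = rng.sample(pool, min(n_random, len(pool)))

    # Place the retrieved documents last, i.e. closest to the question; this
    # ordering is an assumption of the sketch, not something it validates.
    context_docs = random_fill + relevant
    return "\n\n".join(context_docs) + f"\n\nQuestion: {question}"
```

In an experiment following this recipe, the returned string would be prepended to the model's QA instruction, with `total_docs` chosen to fill the available context window.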

They explicitly state that building a formal or comprehensive theory to explain these findings is still an open research challenge, motivating the need for principled understanding beyond empirical heuristics.

References

While establishing a formal or comprehensive theory behind these findings remains an open research challenge, we can still infer that there seems to be a trade-off between the number of relevant and totally irrelevant documents.

The Power of Noise: Redefining Retrieval for RAG Systems (Cuconasu et al., arXiv:2401.14887, 26 Jan 2024), Results, subsection "Retriever Trade-Off".