In Defense of RAG in the Era of Long-Context Language Models
Abstract: By working around the limited context windows of early-generation LLMs, retrieval-augmented generation (RAG) has long been a reliable solution for context-based answer generation. Recently, the emergence of long-context LLMs allows models to incorporate much longer text sequences, making RAG appear less attractive, and recent studies report that long-context LLMs significantly outperform RAG in long-context applications. Unlike existing works favoring long-context LLMs over RAG, we argue that feeding an extremely long context to an LLM dilutes its focus on relevant information and can degrade answer quality. This paper revisits RAG for long-context answer generation. We propose an order-preserve retrieval-augmented generation (OP-RAG) mechanism, which significantly improves the performance of RAG in long-context question-answering applications. With OP-RAG, as the number of retrieved chunks increases, the answer quality first rises and then declines, forming an inverted U-shaped curve. There exist sweet spots where OP-RAG achieves higher answer quality with far fewer tokens than a long-context LLM that takes the whole context as input. Extensive experiments on public benchmarks demonstrate the superiority of OP-RAG.
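The abstract names the mechanism but does not spell it out, so the following is a minimal Python sketch of what an order-preserving retrieval step could look like, assuming "order-preserve" means the top-k retrieved chunks are concatenated in their original document order rather than by relevance score. The helper names (split_into_chunks, op_rag_context), the cosine-similarity retriever, and the chunk_size and top_k parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def split_into_chunks(document: str, chunk_size: int = 128) -> list[str]:
    """Split a document into fixed-size word chunks, kept in their original order."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def op_rag_context(chunks: list[str],
                   chunk_embeddings: np.ndarray,   # shape (num_chunks, dim)
                   query_embedding: np.ndarray,    # shape (dim,)
                   top_k: int = 8) -> str:
    """Select the top-k most relevant chunks, then concatenate them in
    their original document order (not in descending relevance order)."""
    # Cosine similarity between the query and every chunk.
    sims = chunk_embeddings @ query_embedding / (
        np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(query_embedding) + 1e-8
    )
    # Indices of the k highest-scoring chunks ...
    top_ids = np.argsort(-sims)[:top_k]
    # ... re-sorted by chunk position so the prompt preserves the document's order.
    ordered_ids = sorted(top_ids)
    return "\n\n".join(chunks[i] for i in ordered_ids)
```

Under this reading, the only difference from a standard RAG pipeline is the final re-sort by chunk position, which keeps the retrieved evidence coherent with the source narrative as more chunks are added to the prompt.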