In Defense of RAG in the Era of Long-Context Language Models (2409.01666v1)

Published 3 Sep 2024 in cs.CL

Abstract: Overcoming the limited context window of early-generation LLMs, retrieval-augmented generation (RAG) has been a reliable solution for context-based answer generation in the past. Recently, the emergence of long-context LLMs allows models to incorporate much longer text sequences, making RAG less attractive. Recent studies show that long-context LLMs significantly outperform RAG in long-context applications. Unlike existing work favoring long-context LLMs over RAG, we argue that the extremely long context in LLMs suffers from a diminished focus on relevant information and leads to potential degradation in answer quality. This paper revisits RAG for long-context answer generation. We propose an order-preserve retrieval-augmented generation (OP-RAG) mechanism, which significantly improves the performance of RAG for long-context question-answer applications. With OP-RAG, as the number of retrieved chunks increases, the answer quality initially rises and then declines, forming an inverted U-shaped curve. There exist sweet spots where OP-RAG achieves higher answer quality with far fewer tokens than a long-context LLM taking the whole context as input. Extensive experiments on public benchmarks demonstrate the superiority of OP-RAG.

In Defense of RAG in the Era of Long-Context LLMs

The paper "In Defense of RAG in the Era of Long-Context LLMs" by Tan Yu, Anbang Xu, and Rama Akkiraju presents a critical examination of the ongoing evolution in natural language processing, specifically focusing on the efficacy of Retrieval-Augmented Generation (RAG) versus contemporary long-context LLMs. This work revisits the relevance and potential superiority of RAG in addressing long-context question-answering tasks within the field.

Introduction

Retrieval-Augmented Generation (RAG) has long served as a solution to the context window limitations of earlier generations of LLMs. By injecting specific, external knowledge into LLMs, RAG systems enhance factual accuracy and reduce hallucinations. With recent LLMs supporting much larger context windows (e.g., up to 1M tokens), recent studies suggest that long-context LLMs generally outperform RAG in extended-context scenarios. The authors of this paper, however, contend that excessively long contexts dilute the model's focus on pertinent information and can degrade the quality of generated answers.

Proposed Method: Order-Preserve RAG (OP-RAG)

The authors introduce Order-Preserve Retrieval-Augmented Generation (OP-RAG), a novel mechanism to enhance the performance of traditional RAG systems for long-context applications. Traditional RAG sorts retrieved chunks purely based on relevance scores in descending order. In contrast, OP-RAG maintains the original order of these chunks as they appear in the source document. This order-preservation is critical because it aligns more naturally with the narrative flow and intrinsic structure of the document, which are often disrupted by pure relevance-based ordering.
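
To make the mechanism concrete, the following minimal sketch implements order-preserving retrieval on top of a generic embedding-based retriever. The `Chunk` dataclass, the cosine-similarity scoring, and the `op_rag_retrieve` helper are illustrative assumptions for this summary, not the authors' released code.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Chunk:
    position: int        # index of the chunk in the original document order
    text: str
    embedding: np.ndarray


def op_rag_retrieve(query_emb: np.ndarray, chunks: List[Chunk], top_k: int) -> str:
    """Retrieve the top-k most relevant chunks, then restore document order."""
    # Cosine similarity between the query and every chunk embedding.
    scores = [
        float(np.dot(query_emb, c.embedding)
              / (np.linalg.norm(query_emb) * np.linalg.norm(c.embedding)))
        for c in chunks
    ]
    # Standard RAG step: select the k highest-scoring chunks.
    top_indices = np.argsort(scores)[::-1][:top_k]
    selected = [chunks[i] for i in top_indices]
    # Order-preserve step: concatenate the selected chunks by their original
    # position in the document, not by descending relevance score.
    selected.sort(key=lambda c: c.position)
    return "\n\n".join(c.text for c in selected)
```

The only difference from vanilla RAG is the final sort: the same chunks are retrieved, but they are presented to the model in the order they appear in the source document.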

Experimental Results

The experimental results are compelling, showing that OP-RAG outperforms both vanilla RAG and long-context LLMs without RAG in terms of quality and efficiency. Using the En.QA and En.MC datasets from the ∞Bench benchmark, OP-RAG demonstrated notable improvements:

  • On the En.QA dataset, OP-RAG with 48K retrieved tokens achieves a 47.25 F1 score using the Llama3.1-70B model, compared to a 34.26 F1 score for the long-context Llama3.1-70B with 117K tokens as input.
  • Similarly, on the En.MC dataset, OP-RAG achieves an accuracy of 88.65% with 24K retrieved tokens, outperforming the other evaluated long-context LLMs without RAG.

These results underscore the claim that efficient retrieval and focused context utilization can significantly elevate model performance without necessitating massive token inputs.

Theoretical and Practical Implications

The OP-RAG mechanism challenges the current trend favoring long-context LLMs by illustrating that a well-designed retrieval mechanism can make more efficient use of the available context. The research also highlights an optimal balance between including relevant information and excluding distracting content, evidenced by the inverted U-shaped performance curve as the number of retrieved chunks increases.

Practically, this means that NLP systems can achieve better performance not merely by expanding the context window but by refining how context is retrieved and presented to the model. This finding is crucial for applications constrained by computational resources or seeking to reduce inference costs without sacrificing answer quality.
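
Because the quality curve is non-monotonic, the practical question becomes which retrieval budget to use. The sketch below is a hypothetical tuning loop rather than anything from the paper: it sweeps the number of retrieved chunks over a validation set and records the budget where the metric peaks. `answer_fn` and `metric_fn` are assumed stand-ins for an OP-RAG pipeline and an answer-quality metric such as F1.

```python
from typing import Callable, List, Sequence, Tuple


def find_sweet_spot(
    questions: List[str],
    references: List[str],
    answer_fn: Callable[[str, int], str],                 # (question, top_k) -> answer
    metric_fn: Callable[[List[str], List[str]], float],   # (answers, references) -> score
    chunk_counts: Sequence[int] = (8, 16, 32, 64, 128),
) -> Tuple[int, float]:
    """Sweep the retrieval budget and return the chunk count with the best score."""
    best_k, best_score = chunk_counts[0], float("-inf")
    for k in chunk_counts:
        answers = [answer_fn(q, k) for q in questions]
        score = metric_fn(answers, references)
        # Per the paper, quality rises with k, peaks, then declines,
        # so the maximum over the sweep marks the sweet spot.
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```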

Future Prospects

The paper opens up several avenues for further exploration. Future research could delve into refining OP-RAG methods to dynamically adapt to different types of queries and documents, developing more sophisticated chunking strategies, or integrating OP-RAG with other state-of-the-art LLMs to explore symbiotic enhancements.

In conclusion, this paper provides a robust argument and empirical evidence supporting the sustained relevance and potential superiority of RAG, particularly through the proposed OP-RAG mechanism. It asserts that efficient, contextually aware retrieval of information can match, if not surpass, the performance of even the most advanced long-context LLMs. Research in this direction could catalyze more optimized approaches to context handling in NLP.

References (23)
  1. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.
  2. Mistral AI. 2024. Mistral large 2.
  3. Anthropic. 2024. Claude 3.5 sonnet.
  4. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
  5. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  6. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.
  7. Tri Dao. 2024. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR).
  8. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS).
  9. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR.
  10. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
  11. Retrieval augmented generation or long-context llms? a comprehensive study and hybrid approach. arXiv preprint arXiv:2407.16833.
  12. Meta. 2024a. Introducing llama 3.1: Our most capable models to date.
  13. Meta. 2024b. Llama 3.1 models.
  14. Augmented language models: a survey. arXiv preprint arXiv:2302.07842.
  15. OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  16. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.
  17. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  18. A length-extrapolatable transformer. arXiv preprint arXiv:2212.10554.
  19. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.
  20. xAI. 2024. Grok-2 beta release.
  21. C-pack: Packaged resources to advance general Chinese embedding. arXiv preprint arXiv:2309.07597.
  22. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297.
  23. ∞Bench: Extending long context evaluation beyond 100K tokens. arXiv preprint arXiv:2402.13718.
Authors (3)
  1. Tan Yu (17 papers)
  2. Anbang Xu (10 papers)
  3. Rama Akkiraju (9 papers)
Citations (8)