Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
Abstract: Retrieval Augmented Generation (RAG) has been a powerful tool for LLMs to efficiently process overly lengthy contexts. However, recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities in understanding long contexts directly. We conduct a comprehensive comparison between RAG and long-context (LC) LLMs, aiming to leverage the strengths of both. We benchmark RAG and LC across various public datasets using three of the latest LLMs. Results reveal that, when resourced sufficiently, LC consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage. Based on this observation, we propose Self-Route, a simple yet effective method that routes queries to RAG or LC based on model self-reflection. Self-Route significantly reduces the computation cost while maintaining performance comparable to LC. Our findings provide a guideline for long-context applications of LLMs using RAG and LC.
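
The abstract describes the routing idea only at a high level, so the following is a minimal, hypothetical Python sketch of how a Self-Route-style pipeline could be wired up: a cheap RAG call that is allowed to decline, with a fallback to the full long-context call only for declined queries. The `retrieve_top_k` and `call_llm` callables, the prompt wording, and the "unanswerable" convention are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a Self-Route-style pipeline (not the authors' code).
# Step 1: answer with RAG, letting the model reply "unanswerable" if the
#         retrieved chunks are insufficient.
# Step 2: only for declined queries, fall back to the full long context.

from typing import Callable, List

RAG_PROMPT = (
    "Answer the question based on the provided passages. "
    "If the question cannot be answered from the passages, "
    "reply exactly with 'unanswerable'.\n\n"
    "Passages:\n{passages}\n\nQuestion: {question}\nAnswer:"
)

LC_PROMPT = (
    "Answer the question based on the full document below.\n\n"
    "Document:\n{document}\n\nQuestion: {question}\nAnswer:"
)


def self_route(
    question: str,
    document: str,
    retrieve_top_k: Callable[[str, str, int], List[str]],  # placeholder retriever
    call_llm: Callable[[str], str],                         # placeholder LLM API
    k: int = 5,
) -> str:
    # Step 1: RAG-and-Route -- a cheap call over only the top-k retrieved chunks.
    chunks = retrieve_top_k(question, document, k)
    rag_answer = call_llm(
        RAG_PROMPT.format(passages="\n\n".join(chunks), question=question)
    )

    # If the model believes the chunks suffice, keep the inexpensive RAG answer.
    if "unanswerable" not in rag_answer.lower():
        return rag_answer

    # Step 2: fall back to the expensive long-context call only when needed.
    return call_llm(LC_PROMPT.format(document=document, question=question))
```

The cost saving in this design comes from most queries terminating at Step 1, where the prompt contains only the retrieved chunks rather than the entire document.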