Benchmarking Large Language Models in Retrieval-Augmented Generation (2309.01431v2)

Published 4 Sep 2023 in cs.CL

Abstract: Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of LLMs. However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different LLMs, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on LLMs. We analyze the performance of different LLMs in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.

Benchmarking LLMs in Retrieval-Augmented Generation: An Analysis of Key Abilities

Introduction

Retrieval-Augmented Generation (RAG) represents a significant stride in addressing the limitations of LLMs, notably by mitigating hallucinations, countering knowledge obsolescence, and improving domain-specific expertise. Despite the promising benefits of integrating external knowledge through retrieval, challenges persist in how effectively and reliably LLMs use that external information. This paper presents a systematic examination of the impact of RAG on LLMs, focusing on four critical abilities: noise robustness, negative rejection, information integration, and counterfactual robustness. To evaluate these abilities, it introduces the Retrieval-Augmented Generation Benchmark (RGB), which comprises both English and Chinese testbeds.

Critical Abilities of RAG in LLMs

RAG introduces a paradigm in which models consult external information to bolster their responses. This approach, however, underscores the need for models to possess specific capabilities in order to use external knowledge effectively (a minimal sketch of how such a capability might be probed follows the list):

  • Noise Robustness: The ability of an LLM to sift useful information from noisy, irrelevant details in retrieved documents.
  • Negative Rejection: The model's capacity to withhold responses when reliable information is absent in the retrieved documents.
  • Information Integration: Competence in synthesizing answers from multiple documents, especially for complex queries requiring cross-referencing of information.
  • Counterfactual Robustness: The adeptness at identifying and disregarding incorrect facts within the external knowledge, especially when preemptively warned about potential inaccuracies.
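
To make the first two abilities concrete, the following sketch assembles a RAG-style prompt that pairs a question with a mix of relevant and irrelevant retrieved passages. It is a minimal illustration under assumed conventions: the prompt wording, the refusal phrase, and the noise ratio are not the paper's exact setup.

```python
import random

def build_rag_prompt(question, positive_docs, negative_docs, noise_ratio=0.4, k=5):
    """Assemble an illustrative noise-robustness test prompt.

    A fraction `noise_ratio` of the k retrieved passages is drawn from
    irrelevant (negative) documents; the rest come from relevant ones.
    This mirrors the idea of the noise-robustness testbed, not its exact format.
    """
    n_noise = int(round(k * noise_ratio))
    docs = random.sample(negative_docs, n_noise) + random.sample(positive_docs, k - n_noise)
    random.shuffle(docs)
    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    # Instructing the model to refuse when the context is insufficient probes
    # negative rejection as well as noise robustness.
    return (
        "Answer the question using only the documents below. "
        "If they do not contain the answer, reply with 'insufficient information'.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Setting noise_ratio=1.0 yields a document set containing only noise, the condition used to probe negative rejection.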

The Retrieval-Augmented Generation Benchmark (RGB)

The introduction of RGB constitutes a new way to assess these capabilities in LLMs. RGB is built from recent news reports to keep the benchmark relevant and challenging for current models. It comprises four testbeds, each tailored to one of the critical RAG abilities defined above. This design supports a nuanced understanding of how LLMs perform RAG, identifying their strengths and shortcomings across different linguistic and task-specific contexts.
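
To make the testbed structure concrete, the snippet below sketches one plausible layout for a single benchmark instance. The field names and example values are illustrative assumptions, not necessarily the schema released with RGB.

```python
# Hypothetical layout of one evaluation instance; the released RGB files may
# use different field names.
instance = {
    "query": "Who won the 2022 Nobel Prize in Literature?",
    "answer": ["Annie Ernaux"],
    "positive": [  # passages that contain the answer
        "The 2022 Nobel Prize in Literature was awarded to French author Annie Ernaux ...",
    ],
    "negative": [  # topically related but irrelevant (noise) passages
        "The 2022 Nobel Prize in Physics was shared by Aspect, Clauser, and Zeilinger ...",
    ],
    "ability": "noise_robustness",  # one of the four testbeds
    "lang": "en",                   # RGB covers English and Chinese
}
```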

Evaluation Findings

The analysis of six leading LLMs on RGB yielded substantial insights into the current state of RAG. The models showed a degree of noise robustness, though performance declined markedly as the noise ratio increased. In contrast, negative rejection proved a substantial challenge: models struggled to withhold answers when given document sets containing only noise. In information integration tasks, models demonstrated limited capabilities, suggesting a need for better reasoning and synthesis over multiple documents. Counterfactual robustness proved particularly difficult, with models often misled by incorrect retrieved information even when their internal knowledge contradicted it.
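
These findings can be summarized with two simple measures: accuracy under a given noise ratio, and the rejection rate on noise-only document sets. The sketch below shows one way such measures might be computed; the exact-match criterion and the refusal phrase are assumptions, not the paper's exact protocol.

```python
def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    # Count a prediction as correct if any gold answer string appears in it.
    return any(ans.lower() in prediction.lower() for ans in gold_answers)

def accuracy(results: list[tuple[str, list[str]]]) -> float:
    """results: (prediction, gold_answers) pairs collected at one noise ratio."""
    return sum(exact_match(p, a) for p, a in results) / len(results)

def rejection_rate(predictions: list[str],
                   refusal: str = "insufficient information") -> float:
    """Share of noise-only cases where the model declines to answer."""
    return sum(refusal in p.lower() for p in predictions) / len(predictions)
```

Reporting accuracy at increasing noise ratios exposes the decline described above, while the rejection rate on noise-only sets quantifies negative rejection.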

Discussion

The insights distilled from evaluating LLMs against RGB underscore the urgency of targeted improvements in RAG methodologies. While LLMs show remarkable potential, their current use of RAG reveals critical vulnerabilities, specifically in handling information noise, integrating multifaceted evidence, and discerning factual inaccuracies. Addressing these challenges is pivotal for enhancing LLMs' reliability and applicability across diverse applications.

Future Directions

The findings advocate for a more nuanced approach to developing and refining RAG techniques in LLMs. Future research may focus on:

  • Enhancing models' comprehension and contextual discrimination to improve noise robustness and negative rejection rates.
  • Developing advanced reasoning algorithms for better information integration from disparate sources.
  • Implementing robust fact-checking mechanisms within LLMs to bolster counterfactual robustness.

These directions outline a promising pathway toward harnessing the full potential of RAG, enabling LLMs to offer more accurate, reliable, and contextually aware responses.

Conclusion

The comprehensive evaluation conducted through RGB shows that although RAG is a promising avenue for advancing LLM capabilities, substantial hurdles remain. By systematically dissecting the abilities necessary for effective RAG, this paper lays a foundational blueprint for future advancements in LLM development. As the field evolves, focused efforts on the outlined challenges will be instrumental in unlocking the transformative potential of retrieval-augmented generation within LLMs.

Authors (4)
  1. Jiawei Chen
  2. Hongyu Lin
  3. Xianpei Han
  4. Le Sun