Seven Failure Points When Engineering a Retrieval Augmented Generation System (2401.05856v1)

Published 11 Jan 2024 in cs.SE and cs.AI

Abstract: Software engineers are increasingly adding semantic search capabilities to applications using a strategy known as Retrieval Augmented Generation (RAG). A RAG system involves finding documents that semantically match a query and then passing those documents to an LLM such as ChatGPT to extract the right answer. RAG systems aim to: a) reduce the problem of hallucinated responses from LLMs, b) link sources/references to generated responses, and c) remove the need for annotating documents with meta-data. However, RAG systems suffer from limitations inherent to information retrieval systems and from reliance on LLMs. In this paper, we present an experience report on the failure points of RAG systems drawn from three case studies in separate domains: research, education, and biomedical. We share the lessons learned and present seven failure points to consider when designing a RAG system. The two key takeaways arising from our work are: 1) validation of a RAG system is only feasible during operation, and 2) the robustness of a RAG system evolves rather than being designed in at the start. We conclude with a list of potential research directions on RAG systems for the software engineering community.

Introduction

In light of the evolution of LLMs and their integration into various applications, the use of Retrieval-Augmented Generation (RAG) systems has become increasingly significant. RAG systems are designed to augment the capabilities of LLMs by incorporating a retrieval component that sources relevant documents in response to queries. These systems are instrumental in overcoming challenges associated with directly utilizing LLMs, such as the production of hallucinated content and the difficulty in updating the knowledge base that these models draw from. This paper explores the specific engineering challenges encountered in RAG system implementation across three distinct domains.
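
To make the pipeline concrete, the sketch below shows the retrieve-augment-generate loop in miniature. It is a toy illustration, not the paper's implementation: the bag-of-words scorer stands in for a real embedding model and vector index, and `call_llm` is a hypothetical stand-in for whatever LLM client the application uses.

```python
# Minimal RAG loop: retrieve matching documents, place them in the prompt,
# and let the LLM answer from that context. All components are toy stand-ins.
from collections import Counter
import math

def similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (toy stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents most similar to the query."""
    return sorted(docs, key=lambda d: similarity(query, d), reverse=True)[:k]

def answer(query: str, docs: list[str], call_llm) -> str:
    """Retrieve context, then ask the LLM to answer strictly from it."""
    context = "\n\n".join(retrieve(query, docs))
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return call_llm(prompt)
```

A production system would swap the scorer for dense embeddings and an approximate-nearest-neighbour index, but the retrieve-augment-generate shape stays the same, and so do the places it can fail.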

Failure Points in RAG Systems

The authors outline seven critical failure points, identified through empirical experimentation with the BioASQ dataset of 15,000 documents and 1,000 question-and-answer pairs:

  1. Missing content: the system fails to answer because the corpus lacks the needed documents.
  2. Ranking errors: relevant documents exist but are not surfaced effectively.
  3. Consolidation strategy limitations: retrieved documents are not consolidated into the context passed to the LLM.
  4. Extraction failures: the correct answer is in the provided context but is not extracted.
  5. Incorrect response formatting.
  6. Specificity issues: answers are too general or too specific for the question.
  7. Incomplete answers.

Accurate identification and resolution of these failure points is crucial for the reliable operation of RAG systems in practical settings; a minimal mitigation sketch for the first failure point follows.
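
One concrete guard against missing content is to gate generation on retrieval confidence, so the system declines to answer instead of letting the LLM hallucinate. The sketch below is an illustrative assumption, not the paper's method: the 0.25 threshold is arbitrary, and `score` and `call_llm` stand in for any similarity function and LLM client (for example, the toy ones sketched above).

```python
# Gate generation on retrieval confidence (toy mitigation for failure point 1).
# The 0.25 threshold is an illustrative assumption, not a value from the paper.
def answer_with_gate(query, docs, score, call_llm, threshold=0.25, k=3):
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    if not ranked or score(query, ranked[0]) < threshold:
        # No sufficiently relevant document was retrieved: admit it.
        return "I don't know: the indexed documents do not cover this question."
    context = "\n\n".join(ranked[:k])
    prompt = ("Answer the question using only the context below. If the "
              f"context is insufficient, say so.\n\nContext:\n{context}\n\n"
              f"Question: {query}")
    return call_llm(prompt)
```

This echoes one of the paper's lessons: a system that cannot retrieve relevant content should say so rather than generate a plausible-sounding answer.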

Case Studies and Practical Observations

Beyond theoretical and empirical insights, the paper chronicles key lessons from three case studies in the research, education, and biomedical domains. For instance, the Cognitive Reviewer assists researchers in scientific document analysis by ranking papers according to the research objective. The AI Tutor, integrated within a learning management system, indexes various course materials to provide students with contextually accurate answers. The BioASQ case study, operating at a larger scale over biomedical documents, reinforces the importance of meticulous inspection and the limitations of automated evaluation methods. Together, these applications form a rich experience report that is invaluable to practitioners in this space.

Looking Forward: Recommendations and Research Areas

The paper culminates in pertinent lessons for the future engineering of RAG systems, supported by an extensive review of key learnings across the case studies. It underscores the need for ongoing system calibration to address the dynamic nature of input data and system interaction in real-life scenarios. It also lays out future research directions that hold promise for improving the robustness and efficiency of RAG systems: exploring optimal chunking techniques for documents (a toy chunker is sketched below), investigating the trade-offs between RAG and fine-tuning LLMs, and developing software testing and monitoring practices tailored to RAG systems.
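
As one example of the chunking question, the function below splits a document into fixed-size overlapping character windows before indexing. The 512-character window and 64-character overlap are arbitrary assumptions; choosing such parameters well, and whether to chunk by characters, sentences, or semantics at all, is exactly the open problem the paper points to.

```python
# Toy fixed-size chunker with overlap; 512/64 are illustrative values only.
def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping character windows for indexing."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```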

In essence, this paper serves as a comprehensive experience report on engineering RAG systems, offering both a practitioner's guide and a research roadmap. It shines a light on the intricacies of implementing RAG systems and paves the way for subsequent advancements in the field.

Authors (5)
  1. Scott Barnett (20 papers)
  2. Stefanus Kurniawan (5 papers)
  3. Srikanth Thudumu (12 papers)
  4. Zach Brannelly (1 paper)
  5. Mohamed Abdelrazek (24 papers)
Citations (50)