Introduction to Retrieval-Augmented Generation
The landscape of AI language comprehension and response generation is being transformed by Retrieval-Augmented Generation (RAG). This technique augments LLMs with the ability to retrieve external data, aiming to produce more accurate and relevant answers, particularly when a query involves information absent from the training data. However, integrating RAG into applications brings a spectrum of challenges: seamlessly incorporating retrieval models, learning efficient representations, managing diverse data, optimizing computational efficiency, conducting evaluations, and improving text generation quality.
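To make the retrieve-then-generate loop concrete, here is a minimal sketch in Python. The lexical scoring function and the `call_llm` stub are illustrative placeholders, not the paper's implementation; a real system would use the sparse or dense retrievers and LLM APIs discussed below.

```python
def score(query: str, document: str) -> float:
    """Toy lexical overlap score standing in for a real retriever."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k highest-scoring documents for the query."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, contexts: list[str]) -> str:
    """Prepend retrieved context so the LLM can ground its answer."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"

def call_llm(prompt: str) -> str:
    """Placeholder for an API call to a model such as GPT-4 or Gemini."""
    return f"[LLM response conditioned on]\n{prompt}"

corpus = [
    "Brasília is the capital of Brazil.",
    "Portuguese is the official language of Brazil.",
    "The Amazon rainforest spans much of northern Brazil.",
]
query = "What is the capital of Brazil?"
print(call_llm(build_prompt(query, retrieve(query, corpus))))
```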
Experimenting with RAG in Brazilian Portuguese
To tackle these challenges, a series of experiments was conducted focusing on Brazilian Portuguese. The researchers evaluated a diverse set of retrieval methods, including sparse and dense retrievers, and investigated various chunking strategies to refine how retrieved content feeds into response generation. Their experiments also examined how the position of retrieved documents within the prompt influences the quality of the generated content. A notable part of the paper compares the response quality of two popular LLMs, GPT-4 and Gemini, when conditioned on the retrieved data.
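One common family of chunking strategies splits documents into fixed-size windows with overlap, so that sentences near chunk boundaries are not lost. The paper explores several strategies; the sketch below shows only this general idea, with illustrative parameter values.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word windows of `chunk_size` words, overlapping by `overlap`."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

document = "word " * 500  # stand-in for a real document
print(len(chunk_text(document)))  # 3 overlapping chunks
```

Smaller chunks tend to improve retrieval precision but give the generator less context per hit, which is why chunk construction is treated as a tunable design choice rather than a fixed preprocessing step.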
Understanding Evaluation Metrics and Strategies
Evaluating a RAG system cannot rely on traditional metrics alone, since naive comparisons between two text samples can miss semantic similarity. The paper therefore recommends a more nuanced evaluation scheme based on a graded scale of relevance and accuracy. The authors also introduce a "relative maximum score" metric that captures the peak performance a RAG system could achieve, making it clearer how far a given configuration sits from an ideal system.
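The paper's exact formula for the relative maximum score is not reproduced here; a plausible reading, sketched below under that assumption, is to normalize the observed quality score by the score an ideal (oracle-retrieval) system would obtain on the same questions.

```python
def relative_maximum_score(observed: float, ideal: float) -> float:
    """Fraction of an ideal system's score achieved by the real system.

    Assumes the intuitive definition: observed score divided by the
    score attainable with perfect (oracle) retrieval.
    """
    if ideal <= 0:
        raise ValueError("ideal score must be positive")
    return observed / ideal

# Example: a system scoring 3.4 on a 0-4 relevance scale, where an
# oracle with perfect retrieval would score 3.8 on the same questions.
print(f"{relative_maximum_score(3.4, 3.8):.2%}")  # ~89.47%
```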
Advances and Conclusions
The researchers found that improvements in retrieval significantly enhance RAG performance, with their best approach yielding a notable gain in Mean Reciprocal Rank computed over the top 10 results (MRR@10). They also observed that tuning the number of retrieved chunks could yield further gains. Their extensive testing led to recommendations for implementing RAG systems, highlighting how tightly the retriever's quality is coupled to final RAG performance. The paper culminates in a strategy that dramatically reduced performance degradation, from a baseline degradation of over 50% down to roughly 1.4% to 2.3%.
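MRR@10 is a standard retrieval metric: for each query, take the reciprocal of the rank of the first relevant document among the top 10 results (0 if none appears), then average over all queries. A small reference implementation:

```python
def mrr_at_k(results: list[list[bool]], k: int = 10) -> float:
    """`results[i][j]` is True if the j-th ranked result for query i is relevant."""
    total = 0.0
    for ranked in results:
        for rank, relevant in enumerate(ranked[:k], start=1):
            if relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(results)

# Three queries: first relevant hit at ranks 1, 3, and never.
queries = [
    [True, False, False],
    [False, False, True, False],
    [False] * 10,
]
print(mrr_at_k(queries))  # (1 + 1/3 + 0) / 3 ≈ 0.444
```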
Looking Ahead
This research, though rooted in a specific dataset, highlights the universal importance of data quality for RAG applications. Future work may include expanding the dataset landscape, exploring new segmentation and chunk-construction techniques, and continuing to refine retrieval methods. The paper illustrates the dynamic nature of RAG research and offers contributions applicable to AI systems serving languages other than English, such as Brazilian Portuguese.