Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework
Abstract
The paper addresses the evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems, focusing on domain-specific knowledge and the difficulty of building gold-standard benchmarks for internal company tasks. The authors propose an evaluation framework that uses LLMs to generate synthetic queries and employs an LLM-as-a-judge to rate retrieved documents and generated answers, culminating in RAGElo, an Elo-based tournament that ranks RAG variants. The evaluation surfaces hallucinations in generated answers, shows that RAG-Fusion (RAGF) produces more complete answers, and finds a positive correlation between LLM-as-a-judge ratings and human annotations.
Introduction
The rapid progress of LLM-driven conversational QA systems is constrained by hallucinations and by the inability to reliably handle unanswerable questions, problems that are exacerbated in enterprise settings where domain-specific knowledge is pivotal. The paper details the challenges Infineon Technologies faced in operating its RAG-based conversational QA system and introduces RAGF, which combines retrieval results from multiple query variations to improve answer quality.
Related Work
Evaluation methodologies for RAG systems continue to evolve, with several initiatives aiming to improve both factual evaluation and overall quality assessment. The paper references approaches such as FaaF, SelfCheckGPT, and CRAG, which contribute to the fidelity of RAG evaluations, and notes that the LLM-as-a-judge paradigm offers a robust alternative to traditional metrics that rely on ground-truth benchmarks.
Retrieval Augmented QA with Rank Fusion
RAGF differs from traditional RAG by generating multiple variations of the user query, retrieving documents for each, and merging the resulting rankings through Reciprocal Rank Fusion (RRF). This is intended to reduce the retrieval of irrelevant documents and improve the quality of the final answers. The paper details the comparative advantages of RAGF over standard RAG configurations; a sketch of the fusion step is shown below.
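As an illustration, here is a minimal sketch of RRF, assuming each query variation returns an ordered list of document IDs; the constant k = 60 follows the original RRF formulation and may differ from the paper's setting.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF) over the rankings returned
# for several LLM-generated query variations. Illustrative only.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists into one by summing 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score means the document appeared near the top of more lists.
    return sorted(scores, key=scores.get, reverse=True)

# Example: rankings retrieved for three query variations.
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
])
print(fused)  # doc_b and doc_a rise to the top
```

Documents that rank highly across several query variations accumulate the largest fused scores, which is what allows RAGF to suppress results surfaced by only a single variation.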
Development of a Synthetic Test Set
Because extensive real-world query logs are scarce, the authors use LLMs to generate synthetic queries from in-domain corporate documentation. This yields a broad range of high-quality queries that better reflect real user interactions; a sketch of such a generation step follows.
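A minimal sketch of synthetic query generation, assuming access to the OpenAI chat API; the model name, prompt wording, and document chunking below are illustrative and not the paper's exact setup.

```python
# Generate one synthetic in-domain query per documentation passage.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a domain expert at a semiconductor company. "
    "Given the following documentation passage, write one realistic "
    "question an engineer might ask that this passage can answer.\n\n"
    "Passage:\n{passage}"
)

def generate_query(passage: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM to produce a synthetic query grounded in the passage."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
```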
LLM-as-a-Judge for RAG Pipelines
The introduction of RAGElo is a central contribution of the paper. Evaluation proceeds in two steps: an LLM judge first scores the relevance of the retrieved documents, then compares the pipelines' answers pairwise, and the pairwise outcomes feed an Elo-style ranking of the RAG variants. This structure mitigates the limitations of conventional evaluation methods, and the LLM-as-a-judge ratings align well with those of human experts. A sketch of the Elo-style aggregation is given below.
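A minimal sketch of an Elo-style update over pairwise judge verdicts, assuming each verdict names a winning pipeline or a tie; the K-factor and initial ratings are illustrative defaults rather than RAGElo's actual settings.

```python
# Replay pairwise LLM-judge verdicts into Elo ratings for two pipelines.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that pipeline A beats pipeline B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b - k * (outcome - e_a)

# Example: start both pipelines at 1000 and replay three verdicts.
ratings = {"RAG": 1000.0, "RAGF": 1000.0}
verdicts = [("RAGF", "RAG", 1.0), ("RAGF", "RAG", 0.5), ("RAG", "RAGF", 1.0)]
for a, b, outcome in verdicts:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
print(ratings)
```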
Experimental Results
Quality of Retrieved Documents
The evaluation using Mean Reciprocal Rank@5 (MRR@5) scores reveals that RAGF consistently outperforms RAG. The BM25 retrieval method emerged as the strongest baseline, underscoring the potential limitations of unrefined embedding-based searches.
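For reference, a minimal sketch of MRR@5, assuming binary relevance labels (e.g., from the LLM judge) for each query's ranked documents; this mirrors the metric named above, not the paper's exact implementation.

```python
# Mean Reciprocal Rank at cutoff k over judged relevance labels.
def mrr_at_k(ranked_labels_per_query: list[list[int]], k: int = 5) -> float:
    """Average of 1/rank of the first relevant document within the top k."""
    total = 0.0
    for labels in ranked_labels_per_query:
        for rank, label in enumerate(labels[:k], start=1):
            if label:
                total += 1.0 / rank
                break
    return total / len(ranked_labels_per_query)

# Example: three queries with judged relevance of their top-5 documents.
print(mrr_at_k([[0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 0, 0]]))  # 0.5
```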
Pairwise Evaluation of Answers
RAGElo's pairwise assessments highlight a preference for RAGF over traditional RAG, particularly in terms of completeness. Statistical analyses reaffirm this preference while noting that RAG excels in precision, revealing a nuanced trade-off between these two dimensions.
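One common way to check whether such a pairwise preference is statistically meaningful is a sign test over the non-tied verdicts; the sketch below is illustrative, with hypothetical counts, and may differ from the paper's actual analysis.

```python
# Two-sided sign test: are RAGF's pairwise wins more frequent than chance?
from scipy.stats import binomtest

ragf_wins, rag_wins = 37, 21  # hypothetical counts of non-tied verdicts
result = binomtest(ragf_wins, ragf_wins + rag_wins, p=0.5)
print(result.pvalue)  # small p-value -> preference unlikely to be chance
```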
Discussion and Conclusion
The paper concludes that RAGElo's evaluation framework aligns well with human annotators and supports the superiority of RAGF in delivering complete, detailed answers. However, the authors recommend weighing domain-specific nuances and the precision demands of certain applications. Future work will explore the applicability of RAGF across varying contexts and refine its implementation for better retrieval and response accuracy.
The authors demonstrate that RAGElo, with its Elo-based ranking and LLM-as-a-judge methodology, offers a robust, scalable way to evaluate complex RAG systems, enabling more nuanced, efficient, and reliable performance assessments and supporting further advances in retrieval-augmented generation.