
RAGAS: Automated Evaluation of Retrieval Augmented Generation (2309.15217v1)

Published 26 Sep 2023 in cs.CL

Abstract: We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With RAGAs, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.

Retrieval Augmented Generation (RAG) systems, which combine LLMs with external knowledge sources, have become popular for reducing hallucinations and providing up-to-date information. However, evaluating the quality of RAG outputs is challenging, especially in real-world scenarios where ground truth answers are often unavailable. Traditional evaluation methods, such as measuring perplexity or using datasets with short extractive answers, may not fully capture RAG performance or are incompatible with black-box LLM APIs.

The RAGAS framework (James et al., 2023) introduces a suite of reference-free metrics for automatically evaluating different aspects of a RAG pipeline's output. It focuses on three key dimensions: Faithfulness, Answer Relevance, and Context Relevance. The core idea behind RAGAS is to leverage the capabilities of an LLM itself to perform the evaluation tasks, thus removing the dependency on human-annotated ground truth.

Here's a breakdown of the RAGAS metrics and how they are implemented:

  1. Faithfulness: This metric assesses the degree to which the generated answer is supported by the retrieved context. It helps identify hallucinations where the LLM generates claims not present in the provided documents.
    • Implementation:

      • Given a question q and the generated answer a_s(q), an LLM is first prompted to break the answer down into a set of individual statements S(a_s(q)).
      • For each statement s_i in S, the LLM is prompted again to verify whether s_i can be inferred from the retrieved context c(q). This verification step asks the LLM to provide a verdict (Yes/No) and a brief explanation for each statement based on the context.
      • The Faithfulness score F is calculated as the ratio of statements supported by the context, |V|, to the total number of statements, |S|:

        F = \frac{|V|}{|S|}

    • Practical Application: Use this metric to evaluate different prompt engineering strategies, fine-tuned models, or generation parameters to see which configuration results in more factually consistent answers grounded in the source material. A low faithfulness score indicates the RAG system is prone to hallucination.

  2. Answer Relevance: This metric measures how well the generated answer directly addresses the user's question. It penalizes incomplete answers or those containing irrelevant information.
    • Implementation:

      • Given the generated answer a_s(q), an LLM is prompted to generate n potential questions (q_1, q_2, \dots, q_n) that the given answer could be responding to.
      • Embeddings are obtained for the original question q and each of the generated questions q_i using an embedding model (such as OpenAI's text-embedding-ada-002).
      • The cosine similarity sim(q, q_i) is computed between the original question embedding and each generated question embedding.
      • The Answer Relevance score AR is the average of these similarities:

        AR = \frac{1}{n} \sum_{i=1}^{n} \text{sim}(q, q_i)

    • Practical Application: This metric is useful for comparing RAG systems based on how focused and responsive their answers are. If an answer relevance score is low, it might suggest issues with the LLM's instruction following or that the retrieved context doesn't contain enough information to fully answer the question, leading to an incomplete or off-topic response.

  3. Context Relevance: This metric evaluates the quality of the retrieved context by measuring the extent to which it contains only information relevant to answering the question. It helps identify issues in the retrieval phase, such as retrieving overly long or noisy passages.
    • Implementation:

      • Given a question q and the retrieved context c(q), an LLM is prompted to extract the sentences from c(q) that are essential for answering q.
      • The Context Relevance score CR is calculated as the ratio of the number of extracted relevant sentences to the total number of sentences in the retrieved context:

        CR = \frac{\text{number of extracted sentences}}{\text{total number of sentences in } c(q)}

      • The prompt specifically instructs the LLM not to alter the sentences and to return "Insufficient Information" if no relevant sentences are found or the question cannot be answered from the context.

    • Practical Application: Use this metric to tune your retrieval system (e.g., vector database chunk size, embedding model choice, retriever algorithm). A low context relevance score indicates the retriever is pulling in too much irrelevant noise, which can dilute relevant information and negatively impact the LLM's generation. A minimal sketch of how all three scores can be computed follows this list.
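
The sketch below shows one way to compute the three scores with the OpenAI Python client. The prompt wordings, the response parsing, and the complete/embed helper functions are illustrative assumptions, not the exact prompts used by the paper or the ragas package; only the overall structure (statement extraction and verification, question generation plus embedding similarity, sentence extraction) follows the descriptions above.

```python
# Sketch of RAGAS-style metrics; prompts and parsing are illustrative, not the paper's exact ones.
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # evaluation LLM used in the paper; swappable
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content or ""

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def faithfulness(question: str, answer: str, context: str) -> float:
    # 1) break the answer into atomic statements
    statements = [
        s.strip("- ").strip()
        for s in complete(
            f"Break the following answer to the question '{question}' into "
            f"simple standalone statements, one per line:\n{answer}"
        ).splitlines()
        if s.strip()
    ]
    # 2) verify each statement against the retrieved context
    supported = 0
    for s in statements:
        verdict = complete(
            f"Context:\n{context}\n\nCan the statement below be inferred from "
            f"the context? Answer Yes or No.\nStatement: {s}"
        )
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(statements) if statements else 0.0  # F = |V| / |S|

def answer_relevance(question: str, answer: str, n: int = 3) -> float:
    # generate n questions the answer could be addressing, then compare them to q
    generated = [
        g.strip("- ").strip()
        for g in complete(
            f"Write {n} questions, one per line, that the following answer "
            f"could be answering:\n{answer}"
        ).splitlines()
        if g.strip()
    ][:n]
    q_emb = embed(question)
    sims = [cosine(q_emb, embed(g)) for g in generated]
    return sum(sims) / len(sims) if sims else 0.0  # AR = mean of sim(q, q_i)

def context_relevance(question: str, context: str) -> float:
    # extract the context sentences needed to answer q, then take the ratio
    extracted = complete(
        f"Question: {question}\nContext:\n{context}\n\nCopy, verbatim and one per "
        f"line, only the sentences needed to answer the question. If none are "
        f"relevant, reply 'Insufficient Information'."
    )
    if "insufficient information" in extracted.lower():
        return 0.0
    total = max(1, context.count("."))  # crude sentence count, good enough for a sketch
    kept = len([s for s in extracted.splitlines() if s.strip()])
    return min(1.0, kept / total)  # CR = extracted sentences / total sentences
```

In practice the ragas package uses carefully tuned prompts and batches these calls; the sketch only mirrors the structure of the computation.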

Implementation Considerations:

  • LLM Dependency: RAGAS relies heavily on the performance and availability of the LLM used for evaluation (e.g., gpt-3.5-turbo). The quality of the RAGAS scores is dependent on the LLM's ability to follow instructions precisely and perform the required tasks (statement extraction, verification, question generation, sentence extraction). Results may vary with different LLM providers or models.
  • Computational Cost and Latency: Evaluating a RAG output involves multiple API calls to the LLM per metric. For example, Faithfulness requires one call to extract statements and then potentially several more calls to verify each statement. This can be computationally expensive and slow, especially when evaluating a large dataset of questions.
  • Prompt Sensitivity: The metrics are defined by specific prompts used to interact with the evaluation LLM. Changes to these prompts could potentially alter the resulting scores.
  • Integration: RAGAS provides integrations with popular RAG frameworks like LlamaIndex and LangChain, making it easier to incorporate automated evaluation into development workflows.
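
For teams that prefer not to hand-roll the prompts, the ragas package exposes these metrics directly. The sketch below reflects the import paths and dataset column names (question, answer, contexts) of early ragas releases; they may differ in the version you install, so treat it as indicative rather than definitive.

```python
# Illustrative use of the ragas package on a tiny evaluation set.
# Import paths and column names follow early ragas releases and may have changed since.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy

eval_data = Dataset.from_dict({
    "question": ["Who proposed the theory of general relativity?"],
    "answer": ["Albert Einstein proposed general relativity in 1915."],
    "contexts": [["General relativity was published by Albert Einstein in 1915."]],
})

# Each metric issues its own LLM calls, so expect cost and latency to scale
# with the number of rows and metrics.
scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(scores)
```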

Practical Usage:

RAGAS is valuable during the development and iteration phase of a RAG system. You can use it to:

  • Compare different retrieval strategies (e.g., different chunk sizes, different embedding models, keyword search vs. vector search).
  • Evaluate the impact of different LLM models or prompting techniques on the generation quality.
  • Identify which component of your RAG pipeline (retrieval or generation) is underperforming. A low Context Relevance might point to retrieval issues, while low Faithfulness or Answer Relevance might indicate problems with the LLM's processing of the context or its ability to answer the question effectively.
  • Automate regression testing as you make changes to your RAG pipeline.
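
Because the scores are reference-free, they can also serve as a CI regression gate. The pytest-style check below is a hypothetical sketch: run_ragas_eval() stands in for whatever evaluation harness your project uses, and the threshold values are placeholders, not recommendations from the paper.

```python
# Hypothetical regression gate: fail the build if any averaged RAGAS-style
# score drops below its previously accepted floor. run_ragas_eval() is a
# placeholder for your own evaluation harness; thresholds are illustrative.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80, "context_relevancy": 0.60}

def test_rag_pipeline_has_not_regressed():
    scores = run_ragas_eval()  # placeholder: returns {metric_name: averaged score}
    for metric, floor in THRESHOLDS.items():
        assert scores[metric] >= floor, (
            f"{metric} regressed: {scores[metric]:.2f} < {floor:.2f}"
        )
```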

While the paper focuses on evaluation using OpenAI models, the RAGAS framework is designed to be adaptable, allowing practitioners to substitute other LLMs or embedding models, although empirical validation would be necessary to understand the impact on metric reliability. The WikiEval dataset (James et al., 2023) provides a benchmark for testing agreement with human judgment across different evaluation setups.

Authors (4)
  1. Jithin James (2 papers)
  2. Luis Espinosa-Anke (35 papers)
  3. Steven Schockaert (67 papers)
  4. Shahul ES (2 papers)