
RAGAS: Automated Evaluation of Retrieval Augmented Generation (2309.15217v1)

Published 26 Sep 2023 in cs.CL

Abstract: We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With RAGAs, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.

Retrieval Augmented Generation (RAG) systems, which combine LLMs with external knowledge sources, have become popular for reducing hallucinations and providing up-to-date information. However, evaluating the quality of RAG outputs is challenging, especially in real-world scenarios where ground truth answers are often unavailable. Traditional evaluation methods, such as measuring perplexity or using datasets with short extractive answers, may not fully capture RAG performance or are incompatible with black-box LLM APIs.

The RAGAS framework (James et al., 2023) introduces a suite of reference-free metrics for automatically evaluating different aspects of a RAG pipeline's output. It focuses on three key dimensions: Faithfulness, Answer Relevance, and Context Relevance. The core idea behind RAGAS is to leverage the capabilities of an LLM itself to perform the evaluation tasks, thus removing the dependency on human-annotated ground truth.

Here's a breakdown of the RAGAS metrics and how they are implemented:

  1. Faithfulness: This metric assesses the degree to which the generated answer is supported by the retrieved context. It helps identify hallucinations where the LLM generates claims not present in the provided documents.
    • Implementation:

      • Given a question q and the generated answer a_s(q), an LLM is first prompted to break the answer down into a set of individual statements S(a_s(q)).
      • For each statement s_i in S, the LLM is prompted again to verify whether s_i can be inferred from the retrieved context c(q). This verification step asks the LLM to provide a verdict (Yes/No) and a brief explanation for each statement based on the context.
      • The Faithfulness score F is calculated as the ratio of statements supported by the context, |V|, to the total number of statements, |S|:

        F = \frac{|V|}{|S|}

    • Practical Application: Use this metric to evaluate different prompt engineering strategies, fine-tuned models, or generation parameters to see which configuration results in more factually consistent answers grounded in the source material. A low faithfulness score indicates the RAG system is prone to hallucination.

  2. Answer Relevance: This metric measures how well the generated answer directly addresses the user's question. It penalizes incomplete answers or those containing irrelevant information.
    • Implementation:

      • Given the generated answer a_s(q), an LLM is prompted to generate n potential questions (q_1, q_2, \dots, q_n) that the given answer could be responding to.
      • Embeddings are obtained for the original question q and each of the generated questions q_i using an embedding model (such as OpenAI's text-embedding-ada-002).
      • The cosine similarity sim(q, q_i) is computed between the original question embedding and each generated question embedding.
      • The Answer Relevance score AR is the average of these similarities:

        AR = \frac{1}{n} \sum_{i=1}^{n} \text{sim}(q, q_i)

    • Practical Application: This metric is useful for comparing RAG systems based on how focused and responsive their answers are. If an answer relevance score is low, it might suggest issues with the LLM's instruction following or that the retrieved context doesn't contain enough information to fully answer the question, leading to an incomplete or off-topic response.

  3. Context Relevance: This metric evaluates the quality of the retrieved context by measuring the extent to which it contains only information relevant to answering the question. It helps identify issues in the retrieval phase, such as retrieving overly long or noisy passages.
    • Implementation:

      • Given a question q and the retrieved context c(q), an LLM is prompted to extract the sentences from c(q) that are essential for answering q.
      • The Context Relevance score CR is calculated as the ratio of the number of extracted relevant sentences to the total number of sentences in the retrieved context:

        CR = \frac{\text{number of extracted sentences}}{\text{total number of sentences in } c(q)}

      • The prompt specifically instructs the LLM not to alter the sentences and to return "Insufficient Information" if no relevant sentences are found or the question cannot be answered from the context.

    • Practical Application: Use this metric to tune your retrieval system (e.g., vector database chunk size, embedding model choice, retriever algorithm). A low context relevance score indicates the retriever is pulling in too much irrelevant noise, which can dilute relevant information and negatively impact the LLM's generation. A minimal sketch of how all three scores can be computed follows this list.
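
The sketch below shows one way to compute the three scores with the OpenAI Python client. The prompt wordings, the response parsing, and the complete/embed helper functions are illustrative assumptions, not the exact prompts used by the paper or the ragas package; only the overall structure (statement extraction and verification, question generation plus embedding similarity, sentence extraction) follows the descriptions above.

```python
# Sketch of RAGAS-style metrics; prompts and parsing are illustrative, not the paper's exact ones.
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # evaluation LLM used in the paper; swappable
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content or ""

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def faithfulness(question: str, answer: str, context: str) -> float:
    # 1) break the answer into atomic statements
    statements = [
        s.strip("- ").strip()
        for s in complete(
            f"Break the following answer to the question '{question}' into "
            f"simple standalone statements, one per line:\n{answer}"
        ).splitlines()
        if s.strip()
    ]
    # 2) verify each statement against the retrieved context
    supported = 0
    for s in statements:
        verdict = complete(
            f"Context:\n{context}\n\nCan the statement below be inferred from "
            f"the context? Answer Yes or No.\nStatement: {s}"
        )
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(statements) if statements else 0.0  # F = |V| / |S|

def answer_relevance(question: str, answer: str, n: int = 3) -> float:
    # generate n questions the answer could be addressing, then compare them to q
    generated = [
        g.strip("- ").strip()
        for g in complete(
            f"Write {n} questions, one per line, that the following answer "
            f"could be answering:\n{answer}"
        ).splitlines()
        if g.strip()
    ][:n]
    q_emb = embed(question)
    sims = [cosine(q_emb, embed(g)) for g in generated]
    return sum(sims) / len(sims) if sims else 0.0  # AR = mean of sim(q, q_i)

def context_relevance(question: str, context: str) -> float:
    # extract the context sentences needed to answer q, then take the ratio
    extracted = complete(
        f"Question: {question}\nContext:\n{context}\n\nCopy, verbatim and one per "
        f"line, only the sentences needed to answer the question. If none are "
        f"relevant, reply 'Insufficient Information'."
    )
    if "insufficient information" in extracted.lower():
        return 0.0
    total = max(1, context.count("."))  # crude sentence count, good enough for a sketch
    kept = len([s for s in extracted.splitlines() if s.strip()])
    return min(1.0, kept / total)  # CR = extracted sentences / total sentences
```

In practice the ragas package uses carefully tuned prompts and batches these calls; the sketch only mirrors the structure of the computation.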

Implementation Considerations:

  • LLM Dependency: RAGAS relies heavily on the performance and availability of the LLM used for evaluation (e.g., gpt-3.5-turbo). The quality of the RAGAS scores is dependent on the LLM's ability to follow instructions precisely and perform the required tasks (statement extraction, verification, question generation, sentence extraction). Results may vary with different LLM providers or models.
  • Computational Cost and Latency: Evaluating a RAG output involves multiple API calls to the LLM per metric. For example, Faithfulness requires one call to extract statements and then potentially several more calls to verify each statement. This can be computationally expensive and slow, especially when evaluating a large dataset of questions.
  • Prompt Sensitivity: The metrics are defined by specific prompts used to interact with the evaluation LLM. Changes to these prompts could potentially alter the resulting scores.
  • Integration: RAGAS provides integrations with popular RAG frameworks like LlamaIndex and LangChain, making it easier to incorporate automated evaluation into development workflows.
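
For teams that prefer not to hand-roll the prompts, the ragas package exposes these metrics directly. The sketch below reflects the import paths and dataset column names (question, answer, contexts) of early ragas releases; they may differ in the version you install, so treat it as indicative rather than definitive.

```python
# Illustrative use of the ragas package on a tiny evaluation set.
# Import paths and column names follow early ragas releases and may have changed since.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy

eval_data = Dataset.from_dict({
    "question": ["Who proposed the theory of general relativity?"],
    "answer": ["Albert Einstein proposed general relativity in 1915."],
    "contexts": [["General relativity was published by Albert Einstein in 1915."]],
})

# Each metric issues its own LLM calls, so expect cost and latency to scale
# with the number of rows and metrics.
scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(scores)
```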

Practical Usage:

RAGAS is valuable during the development and iteration phase of a RAG system. You can use it to:

  • Compare different retrieval strategies (e.g., different chunk sizes, different embedding models, keyword search vs. vector search).
  • Evaluate the impact of different LLM models or prompting techniques on the generation quality.
  • Identify which component of your RAG pipeline (retrieval or generation) is underperforming. A low Context Relevance might point to retrieval issues, while low Faithfulness or Answer Relevance might indicate problems with the LLM's processing of the context or its ability to answer the question effectively.
  • Automate regression testing as you make changes to your RAG pipeline.
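
Because the scores are reference-free, they can also serve as a CI regression gate. The pytest-style check below is a hypothetical sketch: run_ragas_eval() stands in for whatever evaluation harness your project uses, and the threshold values are placeholders, not recommendations from the paper.

```python
# Hypothetical regression gate: fail the build if any averaged RAGAS-style
# score drops below its previously accepted floor. run_ragas_eval() is a
# placeholder for your own evaluation harness; thresholds are illustrative.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80, "context_relevancy": 0.60}

def test_rag_pipeline_has_not_regressed():
    scores = run_ragas_eval()  # placeholder: returns {metric_name: averaged score}
    for metric, floor in THRESHOLDS.items():
        assert scores[metric] >= floor, (
            f"{metric} regressed: {scores[metric]:.2f} < {floor:.2f}"
        )
```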

While the paper focuses on evaluation using OpenAI models, the RAGAS framework is designed to be adaptable, allowing practitioners to substitute other LLMs or embedding models, although empirical validation would be necessary to understand the impact on metric reliability. The WikiEval dataset (James et al., 2023) provides a benchmark for testing agreement with human judgment across different evaluation setups.

Authors (4)
  1. Jithin James (2 papers)
  2. Luis Espinosa-Anke (35 papers)
  3. Steven Schockaert (67 papers)
  4. Shahul ES (2 papers)