
Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need (2406.18064v3)

Published 26 Jun 2024 in cs.CL

Abstract: We present a comprehensive study of answer quality evaluation in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grading of these quality aspects into a binary score indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business contexts where a clear decision is essential. Our assessment applies vRAG-Eval to two LLMs, evaluating the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations require significant resources.
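The abstract's binary mapping (per-aspect grades collapsed into an accept/reject decision) and the agreement metric against human experts can be illustrated with a minimal sketch. The 1-5 aspect scale, the acceptance threshold, and every name below (Grade, to_binary, agreement) are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Grade:
    """Per-aspect grades in the spirit of vRAG-Eval.
    A 1-5 scale is assumed here; the paper's exact scale may differ."""
    correctness: int
    completeness: int
    honesty: int

def to_binary(grade: Grade, threshold: int = 4) -> bool:
    """Collapse aspect grades into an accept (True) / reject (False)
    decision. Requiring every aspect to clear a threshold is a
    hypothetical rule, not the paper's published mapping."""
    return min(grade.correctness, grade.completeness, grade.honesty) >= threshold

def agreement(llm_decisions: list[bool], human_decisions: list[bool]) -> float:
    """Fraction of answers on which the LLM evaluator and the human
    expert reach the same accept/reject decision (the paper reports
    83% for GPT-4 under its own protocol)."""
    assert len(llm_decisions) == len(human_decisions)
    matches = sum(a == b for a, b in zip(llm_decisions, human_decisions))
    return matches / len(llm_decisions)
```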

Authors (4)
  1. Yang Wang (670 papers)
  2. Alberto Garcia Hernandez (1 paper)
  3. Roman Kyslyi (3 papers)
  4. Nicholas Kersting (4 papers)