Evaluation of Multimodal Retrieval Augmented Generation Systems with RAG-Check Framework
The paper "RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance" presents an approach to evaluating and improving Retrieval-Augmented Generation (RAG) systems, which integrate external knowledge sources to enhance the performance of large language models (LLMs). The paper specifically addresses hallucinations, that is, incorrect or irrelevant responses generated by these models, which are especially harmful in contexts requiring high precision, such as medicine, insurance, and autonomous systems. The researchers introduce RAG-check, a comprehensive framework for assessing the reliability of multi-modal RAG systems, built around two novel performance measures: the relevancy score (RS) and the correctness score (CS).
Overview of RAG-Check Framework
The RAG-check framework systematically evaluates multi-modal RAG systems from two perspectives: the quality of the retrieval step and the accuracy of the generated output. Specifically, the framework is composed of three core components (sketched in code after this list):
- Partitioning and Categorization: The response generated by the RAG system is partitioned into distinct segments referred to as spans. Each span is categorized as either an "objective" fact that can be checked against the retrieved context or a "subjective" statement (e.g., conjecture or personal opinion) that is excluded from evaluation.
- Relevancy Score (RS) Model: This model quantifies the relevance of each retrieved piece of context (text or image) to the query. Unlike conventional methods that compute cosine similarity between independently embedded queries and documents, the RS model scores each query-document pair jointly with a cross-attention mechanism, improving query-to-image relevance detection by over 20% compared to existing methods such as CLIP.
- Correctness Score (CS) Model: After a response has been generated from the potentially multi-modal retrieved context, this model assesses the accuracy of the generated text, checking each objective span for consistency with the retrieved context. Built on an architecture similar to the RS model's and trained on a dataset of validated model annotations, the CS model agrees with human judgments 91% of the time.
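The paper does not ship reference code, so the following Python sketch is purely illustrative of how these three components might fit together. Every name in it (`Span`, `partition_and_classify`, the `rs_model` and `cs_model` callables) is a hypothetical stand-in for the trained models described above, and the cosine baseline appears only to contrast with the cross-attention scorer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np


# Hypothetical span container: RAG-check partitions the generated
# response into spans and labels each as objective or subjective.
@dataclass
class Span:
    text: str
    objective: bool  # only objective spans are fact-checked


def partition_and_classify(response: str) -> List[Span]:
    """Naive stand-in for the paper's partitioner/categorizer:
    one span per sentence, with a crude subjectivity heuristic."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    hedges = ("i think", "perhaps", "in my opinion")
    return [Span(s, objective=not s.lower().startswith(hedges)) for s in sentences]


def cosine_relevancy(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """The CLIP-style baseline the paper compares against: cosine
    similarity between independently computed embeddings."""
    return float(
        query_emb @ doc_emb / (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb))
    )


def evaluate_rag_output(
    query: str,
    retrieved_docs: List[str],
    response: str,
    rs_model: Callable[[str, str], float],  # trained cross-attention relevancy scorer
    cs_model: Callable[[str, str], float],  # trained correctness scorer
) -> Tuple[List[float], List[float]]:
    """RAG-check-style evaluation: score every retrieved document for
    relevancy to the query, then score every objective span of the
    response for consistency with the retrieved context."""
    relevancy = [rs_model(query, doc) for doc in retrieved_docs]
    context = "\n".join(retrieved_docs)
    correctness = [
        cs_model(span.text, context)
        for span in partition_and_classify(response)
        if span.objective  # subjective spans are left unscored
    ]
    return relevancy, correctness
```

Passing the scorers in as callables keeps the pipeline's shape visible without pretending to reimplement the trained RS and CS models.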
Technical Contribution and Results
The technical contributions of this paper are noteworthy: it not only proposes RS and CS as measures of retrieval and generation reliability but also provides a methodology for training these models on a comprehensive dataset combining automated and human-evaluated annotations. Notably, the RAG-check framework was evaluated across a variety of RAG configurations built from different vision-language models (VLMs) and LLMs, demonstrating the general applicability and robustness of the proposed scores.
In the paper's empirical studies, the RS model achieves an average relevancy score of roughly 89% for the top-5 retrieved images, whereas baselines such as CLIP score significantly lower. For the CS model, the strong agreement with human evaluators underscores its effectiveness in detecting contextual accuracy, setting a new standard for evaluating multi-modal RAG systems.
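For clarity, the 89% figure can be read as the mean RS over the top-5 retrieved images per query, averaged across queries. Below is a minimal sketch of that aggregation, assuming a precomputed matrix of per-query, per-image scores; the paper does not specify this exact interface.

```python
import numpy as np


def mean_topk_relevancy(scores: np.ndarray, k: int = 5) -> float:
    """scores: (num_queries, num_retrieved) relevancy values in [0, 1],
    with columns ordered by retrieval rank. Returns the mean relevancy
    over each query's top-k images, i.e. the kind of aggregate the
    paper reports (~0.89 for RS vs. a lower CLIP baseline)."""
    return float(scores[:, :k].mean())
```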
Implications and Future Work
This paper has significant implications for the development and evaluation of RAG systems, providing a reliability framework that can be leveraged to minimize hallucinations in applications where correctness is critical. Beyond the immediate improvements suggested by the findings, the work opens avenues for applying similar evaluation frameworks to other modalities and to more varied AI contexts. As direct multi-modal models such as GPT-4o become more broadly integrated into RAG pipelines, there remain opportunities for further refinement, especially in improving computational efficiency given the trade-offs noted in the paper.
Overall, this research represents a valuable contribution to the field of artificial intelligence and may serve as a foundation for future work on evaluating and improving RAG system performance in multi-modal AI applications.