RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering
Introduction
The paper introduces Long-form RobustQA (Lfrqa), a dataset designed to close gaps in current evaluation methodologies for LLM-based Retrieval-Augmented Generation Question Answering (RAG-QA) systems. Unlike existing datasets, which often rely on a single-source corpus and short extractive answers, Lfrqa provides human-written long-form answers synthesized from multiple documents. It comprises 26,000 queries spanning seven domains, making it a comprehensive benchmark for assessing the cross-domain robustness of RAG-QA systems.
Dataset Characteristics
Lfrqa is constructed to address specific shortcomings of previous datasets:
- Grounding in the Corpus: Reference answers are explicitly annotated against passages from the underlying corpus, ensuring they are relevant and verifiable against source documents.
- Long-form Answers: The dataset includes paragraph-length answers, making it more suitable for evaluating the generative capabilities of modern LLMs.
- Multi-document Integration: Answers are derived from multiple documents, requiring models to synthesize information from various sources.
- Coherence and Completeness: Annotators integrate conflicting information into coherent narratives, enhancing the quality and applicability of responses.
- Diverse Domains: The dataset spans seven domains: biomedical, finance, lifestyle, recreation, technology, science, and writing, providing a robust benchmark for cross-domain performance.
- Human Quality Control: High-quality, human-annotated answers are integral to Lfrqa, ensuring the reliability of the benchmark.
- Large-scale Evaluation: A large evaluation set enables extensive experimentation and benchmarking.
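To make these characteristics concrete, the sketch below shows one plausible way to represent a Lfrqa-style record in code. The field names (question, answer, domain, cited_doc_ids) are illustrative assumptions for exposition, not the dataset's actual release schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative record layout for a Lfrqa-style example.
# Field names are assumptions for exposition, not the released schema.
@dataclass
class LfrqaExample:
    question: str                 # user query
    answer: str                   # human-written, paragraph-length answer
    domain: str                   # e.g. "biomedical", "finance", "lifestyle", ...
    cited_doc_ids: List[str] = field(default_factory=list)  # corpus documents the answer draws on

def is_multi_document(example: LfrqaExample) -> bool:
    """True if the reference answer synthesizes more than one source document."""
    return len(set(example.cited_doc_ids)) > 1

# Example usage with made-up content:
ex = LfrqaExample(
    question="How do index funds differ from actively managed funds?",
    answer="Index funds passively track a market index, which keeps fees low, "
           "while actively managed funds rely on managers selecting securities...",
    domain="finance",
    cited_doc_ids=["doc_112", "doc_487"],
)
print(is_multi_document(ex))  # True
```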
RAG-QA Task Formulation
In the RAG-QA pipeline, two primary components are considered:
- Passage Retrieval: A retriever such as ColBERTv2 selects the most relevant passages from a large document collection.
- Answer Generation: A leading LLM generates a coherent, accurate long-form answer conditioned on the retrieved passages.
The paper focuses on the performance of various LLMs, evaluating their robustness and effectiveness in generating long-form answers across diverse domains.
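A minimal sketch of this two-stage pipeline is given below. It assumes a generic retriever object exposing search(query, k) (standing in for ColBERTv2, whose actual API differs) and uses an OpenAI-style chat client for generation; the prompt wording and the top_k default are illustrative, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_question(question, retriever, top_k=5, model="gpt-4-turbo"):
    """Two-stage RAG-QA: retrieve top_k passages, then generate a long-form answer.

    `retriever` is assumed to expose .search(query, k) -> list[str]; this mirrors
    the role ColBERTv2 plays in the pipeline but is not its real interface.
    The top_k parameter corresponds to the 5- or 10-passage settings discussed
    in the experimental results.
    """
    passages = retriever.search(question, k=top_k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the passages below. "
        "Write a complete, coherent paragraph.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```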
Evaluation Framework
The paper introduces a novel evaluation framework termed RAG-QA Arena, which implements a pairwise comparison approach:
- Human and Model-based Evaluations: Both human judges and LLM-based evaluators compare generated answers head-to-head against Lfrqa's human-written answers, enabling a scalable yet reliable assessment of model performance.
- Evaluation Metrics: Pairwise preference metrics such as win-rate and win+tie rate are used to gauge the quality of the generated answers.
Results highlight a strong correlation between human judgments and model-based evaluations, validating the efficacy of using LLMs as evaluators in this benchmark.
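As a rough illustration of these metrics, the helper below computes win-rate and win+tie rate from a list of pairwise preference labels ("win", "tie", or "loss" for the model answer versus the Lfrqa answer). It is a sketch of the arithmetic only, not the paper's evaluation code.

```python
from collections import Counter
from typing import Iterable

def pairwise_metrics(judgments: Iterable[str]) -> dict:
    """Compute win-rate and win+tie rate from pairwise preference labels.

    Each judgment is 'win' (model answer preferred over the Lfrqa answer),
    'tie', or 'loss'. Illustrative helper, not the paper's evaluation code.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        return {"win_rate": 0.0, "win_tie_rate": 0.0}
    return {
        "win_rate": counts["win"] / total,
        "win_tie_rate": (counts["win"] + counts["tie"]) / total,
    }

# Example: 3 wins, 1 tie, 6 losses out of 10 comparisons
print(pairwise_metrics(["win"] * 3 + ["tie"] + ["loss"] * 6))
# {'win_rate': 0.3, 'win_tie_rate': 0.4}
```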
Experimental Results
The paper conducts extensive experiments with multiple leading LLMs, including:
- GPT-4-turbo
- GPT-4-0125-preview
- Mixtral-8x22B-Instruct
- Llama-3-70b-Instruct
- Command R+
- Qwen1.5-110b-chat
Increasing the number of retrieved passages from 5 to 10 significantly improves performance, particularly for the GPT-4 models. The best result was a 41.3% win rate against Lfrqa, achieved by GPT-4o, underscoring the high quality of Lfrqa's human-written long-form answers.
Implications and Future Directions
The development of Lfrqa and the RAG-QA Arena framework marks a significant step in the evaluation of RAG-QA systems:
- Benchmarking Robustness: Lfrqa provides a robust benchmark for evaluating the cross-domain performance of LLM-based RAG-QA systems, making it an essential tool for researchers.
- Improving QA Models: High-quality annotations and a comprehensive evaluation framework can guide the fine-tuning and improvement of generative models.
- Future Research: The dataset and evaluation framework open avenues for future research in prompt engineering, model training, and further exploration of retrieval mechanisms to enhance answer generation quality.
Conclusion
Lfrqa and the RAG-QA Arena framework set a new standard for evaluating RAG-QA systems, emphasizing the need for coherence, completeness, and domain robustness in long-form answer generation. The high-quality annotations and robust evaluation metrics make this work a valuable resource for advancing the field of NLP and improving the capabilities of generative QA models.