OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain (2412.13018v2)

Published 17 Dec 2024 in cs.CL

Abstract: As a typical and practical application of LLMs, Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations of generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4) robust evaluation metrics derived from both rule-based and LLM-based approaches, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open-source the code of our benchmark at https://github.com/RUC-NLPIR/OmniEval.

Summary

  • The paper introduces a matrix-based evaluation framework that categorizes queries into five distinct RAG tasks across 16 financial subtopics.
  • The paper employs a hybrid approach combining GPT-4 data generation with human annotation, achieving an 87.47% acceptance rate for evaluation instances.
  • The paper demonstrates that RAG systems can outperform closed-book LLMs in finance while revealing areas for improvement in complex reasoning and conversational tasks.

OmniEval: An Omnidirectional and Automatic Evaluation Benchmark for RAG in Finance

This paper introduces OmniEval, a comprehensive and automated benchmarking framework designed to evaluate Retrieval-Augmented Generation (RAG) systems specifically within the financial domain. The framework addresses the challenge of assessing RAG models' performance in specialized domains through a multidimensional approach that combines automated data generation and systematic evaluation metrics.

Core Features of OmniEval

  1. Matrix-based Evaluation Framework: OmniEval introduces a structured evaluation matrix that categorizes queries into five distinct RAG tasks—extractive QA, multi-hop reasoning, long-form QA, contrast QA, and conversational QA—across 16 financial subtopics. This allows for a detailed assessment of RAG systems' abilities to process varied query scenarios within the financial domain.
  2. Automated Data Generation: The framework utilizes a hybrid approach combining GPT-4 for automatic generation of evaluation instances with human annotation to ensure data quality; a minimal sketch of this generation step follows the list. The 87.47% acceptance rate for human-evaluated instances underscores the efficacy of this automated approach.
  3. Comprehensive Multi-stage Evaluation: OmniEval assesses both the retrieval and generation phases of the RAG pipeline, recognizing that effective retrieval is essential for generating accurate, domain-specific answers.
  4. Robust Evaluation Metrics: The framework employs both rule-based metrics (e.g., Rouge-L and MAP) and advanced LLM-based metrics, such as hallucination detection and numerical accuracy, to provide a nuanced assessment of model performance; the rule-based metrics are sketched in the second example after this list.
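
The generation step in item 2 can be illustrated roughly as below. This is a minimal sketch, not the paper's actual pipeline: the prompt wording, output schema, and the helper name generate_instance are assumptions; only the use of GPT-4 followed by human review is taken from the paper.

```python
# Illustrative sketch of the GPT-4-based instance generation step (assumed
# prompt wording and JSON schema; the paper's actual prompts are not shown here).
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_instance(doc_chunk: str, task: str, topic: str) -> dict:
    """Ask the model for one (question, answer) pair grounded in doc_chunk."""
    prompt = (
        "You are building a financial RAG benchmark.\n"
        f"Task type: {task}. Topic: {topic}.\n"
        "Using only the passage below, write one question of that task type "
        "and its reference answer. Reply as JSON with keys 'question' and 'answer'.\n\n"
        f"Passage:\n{doc_chunk}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return json.loads(resp.choices[0].message.content)

# Generated instances would then go to human annotators for acceptance or
# rejection (the paper reports an 87.47% acceptance rate).
```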
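
The rule-based metrics in item 4 (Rouge-L and MAP) can be computed as in the following sketch, which relies on the rouge_score package and a hand-rolled average-precision helper; the exact metric implementations in the OmniEval repository may differ.

```python
# Minimal sketch of the rule-based metrics: Rouge-L via the rouge_score
# package and mean average precision (MAP) computed by hand. The OmniEval
# repository may implement these differently.
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l(reference: str, prediction: str) -> float:
    """Rouge-L F1 between a reference answer and a generated answer."""
    return _rouge.score(reference, prediction)["rougeL"].fmeasure

def average_precision(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Average precision for one query's ranked retrieval results."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank  # precision at each relevant hit
    return total / max(len(relevant_ids), 1)

def mean_average_precision(queries: list[tuple[list[str], set[str]]]) -> float:
    """MAP over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)
```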

Experimental Insights

Experiments conducted with a variety of retrievers (e.g., GTE-Qwen2-1.5b) and LLMs (e.g., Qwen2.5-72B-Instruct) highlight substantial variability in RAG system performance across tasks and topics. Notably, GTE-Qwen2-1.5b consistently achieves the strongest retrieval performance, suggesting that retrievers built on pre-trained LLM backbones benefit from their linguistic knowledge. RAG models also generally outperform closed-book LLMs on domain-specific tasks, although there remains ample room for improvement, particularly on complex reasoning and conversational tasks.
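
To make the multi-stage setup concrete, the following is a rough sketch of an evaluation loop that scores retrieval and generation separately and aggregates results per (task, topic) cell of the matrix. The retrieve and generate callables are placeholders for whatever retriever and generator are being evaluated, not part of OmniEval's published API.

```python
# Rough sketch of a two-stage evaluation loop over the task x topic matrix.
# retrieve(), generate(), rouge_l(), and average_precision() are stand-ins
# for the retriever, the generator LLM, and the metrics chosen for a run.
from collections import defaultdict
from statistics import mean

def evaluate(instances, retrieve, generate, rouge_l, average_precision, k=5):
    """instances: dicts with 'task', 'topic', 'question', 'answer', 'relevant_ids'."""
    cells = defaultdict(lambda: {"map": [], "rougeL": []})
    for ex in instances:
        cell = (ex["task"], ex["topic"])
        docs = retrieve(ex["question"], top_k=k)        # stage 1: retrieval
        ranked_ids = [d["id"] for d in docs]
        cells[cell]["map"].append(
            average_precision(ranked_ids, set(ex["relevant_ids"])))
        answer = generate(ex["question"], docs)         # stage 2: generation
        cells[cell]["rougeL"].append(rouge_l(ex["answer"], answer))
    # One row per (task, topic) cell, exposing per-task and per-topic variation.
    return {cell: {name: mean(vals) for name, vals in scores.items()}
            for cell, scores in cells.items()}
```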

Implications and Future Directions

The introduction of OmniEval represents a significant step toward standardized, comprehensive evaluation methodologies for RAG systems in specialized domains. The ability to automatically and systematically assess model capabilities across diverse scenarios paves the way for more targeted advancements in RAG technology. Future work may explore enhancing the capabilities of retrievers and generators in handling complex, domain-specific queries, as well as extending OmniEval's framework to verticals beyond finance.

Overall, OmniEval provides valuable insights into the current state and limitations of RAG systems in the financial domain, offering a structured path forward for their continued development and optimization.
