- The paper introduces a novel evaluation approach that hybridizes heuristic metrics with LLM-based pairwise comparisons.
- It presents a multilingual benchmark that extends RAG system evaluation beyond English-centric benchmarks, covering 18 languages.
- The surrogate judge, a random forest learning-to-rank model, achieves a Kendall Tau correlation of 0.909 with LLM-based rankings.
Mirage: Advancements in Multilingual Retrieval-Augmented Generation Systems
The paper "Mirage: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems" introduces Mirage, an innovative benchmark designed to evaluate Retrieval-Augmented Generation (RAG) systems across 18 diverse languages. This work presents a method that combines heuristic evaluation features and arena-based competition to offer an efficient evaluation of multilingual RAG systems.
Key Contributions
- Novel Evaluation Approach: Traditional RAG benchmarks either focus on heuristic metrics requiring human ground truth or use LLMs for head-to-head model comparisons, which can be resource-intensive. The authors propose a hybrid evaluation method that trains a learning-to-rank model to act as a surrogate judge, leveraging both heuristic features and LLM evaluations.
- Multilingual Benchmark: Mirage extends the evaluation of RAG systems beyond the prevalent English-focused benchmarks, incorporating 18 languages built on the MIRACL dataset for multilingual generation evaluation.
- Surrogate Judging Model: A surrogate judge, a random forest learning-to-rank model, is trained to predict rankings from heuristic features, achieving a high correlation with rankings obtained from expensive LLM-based pairwise comparisons judged by GPT-4o (a minimal sketch follows below).
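To make the surrogate-judge idea concrete, here is a minimal sketch assuming scikit-learn's RandomForestRegressor, with invented per-system feature vectors and Bradley-Terry targets; the paper's exact feature set, model configuration, and training data are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: one row of heuristic features per RAG system
# (e.g., language-detection rate, citation quality, support, reranker score,
# fluency), and one Bradley-Terry coefficient per system obtained from the
# LLM-judged arena. All numbers are invented for illustration.
X_train = np.array([
    [0.98, 0.71, 0.83, 0.65, 0.90],   # system A
    [0.95, 0.55, 0.74, 0.60, 0.84],   # system B
    [0.99, 0.80, 0.88, 0.72, 0.93],   # system C
])
bt_coefficients = np.array([0.42, -0.35, 0.87])  # arena-derived strengths

# Fit a random forest as the surrogate judge: it learns to map cheap
# heuristic features to the expensive arena-based strength scores.
surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(X_train, bt_coefficients)

# Score a new, unseen system without running any LLM pairwise comparisons.
new_system_features = np.array([[0.97, 0.66, 0.80, 0.70, 0.88]])
predicted_strength = surrogate.predict(new_system_features)
print(predicted_strength)
```

Once fitted, ranking an additional system costs only a feature-extraction pass rather than a new round of LLM comparisons, which is the efficiency gain the paper targets.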
Methodology
The methodology introduces a three-step evaluation process:
- Heuristic Feature Extraction: Seven heuristic features are extracted, including language detection, citation quality, support, reranker score, and fluency. They are computed with a mix of deterministic metrics and LLM-based scoring (e.g., with Llama-3) for the more qualitative measures.
- Arena-Based Evaluation with LLMs: With GPT-4o as the judge, pairwise comparisons are run on a selected subset of queries, and the outcomes are aggregated into a reliable ranking with the Bradley-Terry model (see the sketch after this list).
- Training the Surrogate Judge: Random forest models are trained to predict the Bradley-Terry coefficients, enabling scalable evaluation of new RAG systems without repeated LLM-based assessments.
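As an illustration of the arena aggregation step, the sketch below fits Bradley-Terry strengths from a pairwise win matrix using a standard minorization-maximization update; the solver and the win counts are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fit_bradley_terry(wins, n_iters=100):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times system i beat system j in the arena
    (e.g., as judged by GPT-4o on a subset of queries).
    Returns log-strength coefficients, one per system.
    """
    n = wins.shape[0]
    p = np.ones(n)                      # initial strengths
    comparisons = wins + wins.T         # total games between each pair
    for _ in range(n_iters):
        for i in range(n):
            mask = comparisons[i] > 0
            denom = np.sum(comparisons[i, mask] / (p[i] + p[mask]))
            p[i] = wins[i].sum() / denom
        p /= p.sum()                    # normalize for identifiability
    return np.log(p)

# Hypothetical arena results for three RAG systems.
wins = np.array([
    [0, 7, 3],
    [2, 0, 1],
    [6, 8, 0],
])
print(fit_bradley_terry(wins))
```

The resulting coefficients serve as the regression targets for the surrogate judge described above.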
Results and Insights
- Correlation with LLM Rankings: The surrogate judge showed high agreement with LLM-based rankings (Kendall Tau coefficient of 0.909), indicating that heuristic features can closely approximate the LLM judge's verdicts (a sketch of this correlation check follows the list).
- Performance of Models: Proprietary and large open-source LLMs performed best in multilingual RAG scenarios, highlighting the gap that smaller models still face.
- Impact of Data and Models: Instruction fine-tuning on the Mirage training data notably improved smaller models, such as 7B-parameter models, suggesting that targeted training can strengthen multilingual RAG capabilities.
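For concreteness, the agreement between the arena ranking and the surrogate ranking can be measured as sketched below; the scores are invented, and only scipy.stats.kendalltau is an actual library call.

```python
from scipy.stats import kendalltau

# Hypothetical strength scores for the same set of RAG systems, once from
# the GPT-4o arena (Bradley-Terry coefficients) and once from the
# random-forest surrogate judge.
arena_scores = [0.87, 0.42, 0.10, -0.35, -1.04]
surrogate_scores = [0.91, 0.35, 0.15, -0.40, -1.01]

tau, p_value = kendalltau(arena_scores, surrogate_scores)
print(f"Kendall Tau: {tau:.3f} (p = {p_value:.3g})")
```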
Implications and Future Directions
The paper underscores the need for better multilingual evaluation frameworks, offering Mirage as a step towards more inclusive and diverse RAG system assessments. The hybrid method, combining heuristic features with arena-based evaluation, gives researchers a practical tool for evaluating and improving multilingual RAG systems efficiently. The authors point to future work on expanding the set of covered languages, incorporating other high-performing LLMs as reference judges, and investigating additional features to further refine the surrogate judge.
This work positions Mirage to drive progress in multilingual RAG systems, a critical area given the global reach of LLM technologies. By enabling efficient and comprehensive multilingual evaluation, Mirage could aid the development of more robust and versatile RAG systems.