- The paper introduces a novel evaluation approach that hybridizes heuristic metrics with LLM-based pairwise comparisons.
- It presents a multilingual benchmark that extends RAG system evaluation beyond English-centric benchmarks, covering 18 languages.
- The surrogate judge, a random forest learning-to-rank model, achieves a Kendall Tau correlation of 0.909 with LLM-based rankings.
Mirage: Advancements in Multilingual Retrieval-Augmented Generation Systems
The paper "Mirage: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems" introduces Mirage, an innovative benchmark designed to evaluate Retrieval-Augmented Generation (RAG) systems across 18 diverse languages. This work presents a method that combines heuristic evaluation features and arena-based competition to offer an efficient evaluation of multilingual RAG systems.
Key Contributions
- Novel Evaluation Approach: Traditional RAG benchmarks either focus on heuristic metrics requiring human ground truth or use LLMs for head-to-head model comparisons, which can be resource-intensive. The authors propose a hybrid evaluation method that trains a learning-to-rank model to act as a surrogate judge, leveraging both heuristic features and LLM evaluations.
- Multilingual Benchmark: Mirage extends the evaluation of RAG systems beyond the prevalent English-focused benchmarks, incorporating 18 languages built on the MIRACL dataset for multilingual generation evaluation.
- Surrogate Judging Model: A surrogate judge, a random forest learning-to-rank model, is trained to predict rankings from heuristic features, achieving a high correlation with rankings obtained from expensive LLM-based pairwise comparisons judged by GPT-4o (a minimal sketch follows below).
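To make the surrogate-judge idea concrete, here is a minimal sketch assuming scikit-learn's RandomForestRegressor, with invented per-system feature vectors and Bradley-Terry targets; the paper's exact feature set, model configuration, and training data are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: one row of heuristic features per RAG system
# (e.g., language-detection rate, citation quality, support, reranker score,
# fluency), and one Bradley-Terry coefficient per system obtained from the
# LLM-judged arena. All numbers are invented for illustration.
X_train = np.array([
    [0.98, 0.71, 0.83, 0.65, 0.90],   # system A
    [0.95, 0.55, 0.74, 0.60, 0.84],   # system B
    [0.99, 0.80, 0.88, 0.72, 0.93],   # system C
])
bt_coefficients = np.array([0.42, -0.35, 0.87])  # arena-derived strengths

# Fit a random forest as the surrogate judge: it learns to map cheap
# heuristic features to the expensive arena-based strength scores.
surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(X_train, bt_coefficients)

# Score a new, unseen system without running any LLM pairwise comparisons.
new_system_features = np.array([[0.97, 0.66, 0.80, 0.70, 0.88]])
predicted_strength = surrogate.predict(new_system_features)
print(predicted_strength)
```

Once fitted, ranking an additional system costs only a feature-extraction pass rather than a new round of LLM comparisons, which is the efficiency gain the paper targets.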
Methodology
The methodology introduces a three-step evaluation process:
- Heuristic Feature Extraction: Seven heuristic features are extracted, including language detection, citation quality, support, reranker score, and fluency. They are computed with a mix of deterministic metrics and LLM-based scoring (e.g., with Llama-3) for the more qualitative measures.
- Arena-Based Evaluation with LLMs: With GPT-4o as the judge, pairwise comparisons are run on a selected subset of queries, and the outcomes are aggregated into a reliable ranking with the Bradley-Terry model (see the sketch after this list).
- Training the Surrogate Judge: Random forest models are trained to predict the Bradley-Terry coefficients, enabling scalable evaluation of new RAG systems without repeated LLM-based assessments.
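As an illustration of the arena aggregation step, the sketch below fits Bradley-Terry strengths from a pairwise win matrix using a standard minorization-maximization update; the solver and the win counts are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fit_bradley_terry(wins, n_iters=100):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times system i beat system j in the arena
    (e.g., as judged by GPT-4o on a subset of queries).
    Returns log-strength coefficients, one per system.
    """
    n = wins.shape[0]
    p = np.ones(n)                      # initial strengths
    comparisons = wins + wins.T         # total games between each pair
    for _ in range(n_iters):
        for i in range(n):
            mask = comparisons[i] > 0
            denom = np.sum(comparisons[i, mask] / (p[i] + p[mask]))
            p[i] = wins[i].sum() / denom
        p /= p.sum()                    # normalize for identifiability
    return np.log(p)

# Hypothetical arena results for three RAG systems.
wins = np.array([
    [0, 7, 3],
    [2, 0, 1],
    [6, 8, 0],
])
print(fit_bradley_terry(wins))
```

The resulting coefficients serve as the regression targets for the surrogate judge described above.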
Results and Insights
- Correlation with LLM Rankings: The surrogate judge showed high agreement with LLM-based rankings (Kendall Tau coefficient of 0.909), indicating that heuristic features can closely approximate the LLM judge's verdicts (a sketch of this correlation check follows the list).
- Performance of Models: Proprietary and large open-source LLMs performed best in multilingual RAG scenarios, highlighting the gap that smaller models still face.
- Impact of Data and Models: Instruction fine-tuning on the Mirage training data notably improved smaller models, such as 7B-parameter models, suggesting that targeted training can strengthen multilingual RAG capabilities.
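For concreteness, the agreement between the arena ranking and the surrogate ranking can be measured as sketched below; the scores are invented, and only scipy.stats.kendalltau is an actual library call.

```python
from scipy.stats import kendalltau

# Hypothetical strength scores for the same set of RAG systems, once from
# the GPT-4o arena (Bradley-Terry coefficients) and once from the
# random-forest surrogate judge.
arena_scores = [0.87, 0.42, 0.10, -0.35, -1.04]
surrogate_scores = [0.91, 0.35, 0.15, -0.40, -1.01]

tau, p_value = kendalltau(arena_scores, surrogate_scores)
print(f"Kendall Tau: {tau:.3f} (p = {p_value:.3g})")
```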
Implications and Future Directions
The paper underscores the need for better multilingual evaluation frameworks, offering Mirage as a step towards more inclusive and diverse RAG system assessments. The hybrid method, combining heuristic features with arena-based evaluation, gives researchers a practical tool for evaluating and improving multilingual RAG systems efficiently. The authors point to future work on expanding the set of covered languages, incorporating other high-performing LLMs as reference judges, and investigating additional features to further refine the surrogate judge.
This work positions Mirage to drive progress in multilingual RAG systems, a critical area given the global reach of LLM technologies. By enabling efficient and comprehensive multilingual evaluation, Mirage could aid the development of more robust and versatile RAG systems.