M-RewardBench: Evaluating Reward Models in Multilingual Settings
The paper presents M-RewardBench, a substantial step forward in the evaluation of reward models (RMs) in multilingual contexts. Reward models are crucial for aligning LLMs with human preferences, yet their evaluation has focused almost entirely on English, leaving multilingual capabilities underexplored. This paper addresses that gap by introducing a comprehensive benchmark spanning 23 diverse languages, enabling a systematic evaluation of RMs beyond English.
Construction and Evaluation
M-RewardBench is meticulously curated, comprising 2,870 preference instances that test the chat, safety, reasoning, and translation capabilities of RMs. The dataset covers 23 languages spanning multiple scripts and language families, including Afro-Asiatic and Sino-Tibetan, providing a broad linguistic testing ground. The evaluation framework assesses generative, classifier-based, and implicit RMs using a weighted average accuracy metric across tasks.
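To make the metric concrete, the sketch below shows one way such a pairwise-preference evaluation can be scored: each instance pairs a chosen and a rejected response, the RM assigns each a scalar score, an instance counts as correct when the chosen response scores higher, and per-category accuracies are combined by a weighted average. The function names, data layout, and equal category weights here are illustrative assumptions, not the paper's exact implementation.

```python
from collections import defaultdict

def weighted_accuracy(instances, score_response, category_weights):
    """Score a reward model on pairwise preference instances.

    instances: iterable of dicts with keys 'prompt', 'chosen', 'rejected', 'category'
    score_response: callable (prompt, response) -> float, the RM's scalar score
    category_weights: dict mapping each category name to its weight
    """
    correct = defaultdict(int)
    total = defaultdict(int)

    for ex in instances:
        # An instance is correct if the RM scores the chosen response higher.
        chosen_score = score_response(ex["prompt"], ex["chosen"])
        rejected_score = score_response(ex["prompt"], ex["rejected"])
        total[ex["category"]] += 1
        if chosen_score > rejected_score:
            correct[ex["category"]] += 1

    # Per-category accuracy, then a weighted average across categories.
    per_category = {c: correct[c] / total[c] for c in total}
    weight_sum = sum(category_weights[c] for c in per_category)
    overall = sum(per_category[c] * category_weights[c] for c in per_category) / weight_sum
    return per_category, overall

# Hypothetical usage with equal weights over the four task categories:
# weights = {"chat": 1.0, "safety": 1.0, "reasoning": 1.0, "translation": 1.0}
# per_cat, overall = weighted_accuracy(dataset, my_rm_score, weights)
```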
Key Findings
- Performance Disparities: A significant performance gap exists between English and non-English languages. Generative RMs show the strongest multilingual generalization, with a smaller performance drop than classifier-based and implicit RMs.
- Translation Quality: Higher-quality translations lead to improved RM performance, highlighting the importance of translation fidelity in multilingual evaluation.
- Linguistic Dimensions: RMs perform better on resource-rich languages and on languages written in Latin or Cyrillic scripts. Indo-European and Sino-Tibetan languages see higher performance than Afro-Asiatic and Turkic languages, suggesting resource availability is a critical factor.
- Translation Tasks: Models struggle more with hard translation tasks, especially in resource-poor language directions. Models consistently perform better when translating from English than into English, and the gap is most evident in the harder translation scenarios.
- Label Consistency: Some models assign consistent preference labels across languages for the same underlying instance, indicating robustness in specific RMs (a simple way to quantify this is sketched after this list).
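The paper's exact consistency measure is not reproduced here; one straightforward proxy, shown below as a minimal sketch under assumed data structures, is the fraction of instances for which a model's preferred response is identical across every language version of that instance.

```python
from collections import defaultdict

def label_consistency(verdicts):
    """Fraction of instances whose preference label is identical across all
    language versions evaluated.

    verdicts: iterable of (instance_id, language, label) triples, where label
        identifies which response the RM preferred (e.g. 'chosen' or 'rejected').
    """
    labels_by_instance = defaultdict(set)
    languages_by_instance = defaultdict(set)
    for instance_id, language, label in verdicts:
        labels_by_instance[instance_id].add(label)
        languages_by_instance[instance_id].add(language)

    # Only instances evaluated in more than one language are informative.
    multilingual = [i for i in labels_by_instance if len(languages_by_instance[i]) > 1]
    consistent = sum(1 for i in multilingual if len(labels_by_instance[i]) == 1)
    return consistent / len(multilingual) if multilingual else float("nan")
```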
Implications
The findings imply a pressing need for multilingual RM development, given the growing role of RMs in aligning LLM applications worldwide. The performance gap on low-resource languages calls for better multilingual preference data and targeted modeling improvements so that RM quality is more even across languages.
Future Prospects
Future research should explore correlations between RM benchmarks and practical LLM performance in downstream applications. Additionally, expanding human-written translations could further refine multilingual evaluations. The paper also underscores the potential for tailoring RMs to culturally specific preferences, advancing the alignment of LLMs with diverse global users.
In conclusion, M-RewardBench is a pivotal tool for advancing the evaluation of RMs in multilingual contexts. By providing this benchmark, the paper encourages broader research into developing models that can meet the language alignment needs of a globally diverse population. The release of M-RewardBench is expected to facilitate ongoing progress in multilingual reward model development and evaluation.