M-RewardBench: Evaluating Reward Models in Multilingual Settings
The paper presents M-RewardBench, a substantial step forward in the evaluation of reward models (RMs) in multilingual contexts. Reward models are crucial for aligning LLMs with human preferences, yet their evaluation has focused almost entirely on English, leaving multilingual capabilities underexplored. This paper addresses that gap by introducing a comprehensive benchmark spanning 23 diverse languages, enabling a systematic evaluation of RMs beyond English.
Construction and Evaluation
M-RewardBench is meticulously curated, comprising 2,870 preference instances that test the chat, safety, reasoning, and translation capabilities of RMs. The dataset covers 23 languages spanning multiple scripts and language families, including Afro-Asiatic and Sino-Tibetan, providing a broad linguistic testing ground. The evaluation framework assesses generative, classifier-based, and implicit RMs using a weighted average accuracy metric across tasks.
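To make the metric concrete, the sketch below shows one way such a pairwise-preference evaluation can be scored: each instance pairs a chosen and a rejected response, the RM assigns each a scalar score, an instance counts as correct when the chosen response scores higher, and per-category accuracies are combined by a weighted average. The function names, data layout, and equal category weights here are illustrative assumptions, not the paper's exact implementation.

```python
from collections import defaultdict

def weighted_accuracy(instances, score_response, category_weights):
    """Score a reward model on pairwise preference instances.

    instances: iterable of dicts with keys 'prompt', 'chosen', 'rejected', 'category'
    score_response: callable (prompt, response) -> float, the RM's scalar score
    category_weights: dict mapping each category name to its weight
    """
    correct = defaultdict(int)
    total = defaultdict(int)

    for ex in instances:
        # An instance is correct if the RM scores the chosen response higher.
        chosen_score = score_response(ex["prompt"], ex["chosen"])
        rejected_score = score_response(ex["prompt"], ex["rejected"])
        total[ex["category"]] += 1
        if chosen_score > rejected_score:
            correct[ex["category"]] += 1

    # Per-category accuracy, then a weighted average across categories.
    per_category = {c: correct[c] / total[c] for c in total}
    weight_sum = sum(category_weights[c] for c in per_category)
    overall = sum(per_category[c] * category_weights[c] for c in per_category) / weight_sum
    return per_category, overall

# Hypothetical usage with equal weights over the four task categories:
# weights = {"chat": 1.0, "safety": 1.0, "reasoning": 1.0, "translation": 1.0}
# per_cat, overall = weighted_accuracy(dataset, my_rm_score, weights)
```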
Key Findings
- Performance Disparities: A significant performance gap exists between English and non-English languages. Generative RMs show the strongest multilingual generalization, with a smaller performance drop than classifier-based and implicit RMs.
- Translation Quality: Higher-quality translations lead to improved RM performance, highlighting the importance of translation fidelity in multilingual evaluation.
- Linguistic Dimensions: RMs perform better on resource-rich languages and on languages written in Latin or Cyrillic scripts. Indo-European and Sino-Tibetan languages see higher performance than Afro-Asiatic and Turkic languages, suggesting resource availability is a critical factor.
- Translation Tasks: Models struggle more with hard translation tasks, especially in resource-poor language directions. Models consistently perform better when translating from English than into English, and the gap is most evident in the harder translation scenarios.
- Label Consistency: Some models assign consistent preference labels across languages for the same underlying instance, indicating robustness in specific RMs (a simple way to quantify this is sketched after this list).
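The paper's exact consistency measure is not reproduced here; one straightforward proxy, shown below as a minimal sketch under assumed data structures, is the fraction of instances for which a model's preferred response is identical across every language version of that instance.

```python
from collections import defaultdict

def label_consistency(verdicts):
    """Fraction of instances whose preference label is identical across all
    language versions evaluated.

    verdicts: iterable of (instance_id, language, label) triples, where label
        identifies which response the RM preferred (e.g. 'chosen' or 'rejected').
    """
    labels_by_instance = defaultdict(set)
    languages_by_instance = defaultdict(set)
    for instance_id, language, label in verdicts:
        labels_by_instance[instance_id].add(label)
        languages_by_instance[instance_id].add(language)

    # Only instances evaluated in more than one language are informative.
    multilingual = [i for i in labels_by_instance if len(languages_by_instance[i]) > 1]
    consistent = sum(1 for i in multilingual if len(labels_by_instance[i]) == 1)
    return consistent / len(multilingual) if multilingual else float("nan")
```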
Implications
The findings imply a pressing need for multilingual RM development, given the growing role of RMs in aligning LLM applications worldwide. The performance gap on low-resource languages calls for better multilingual preference data and targeted modeling improvements so that RM quality is more even across languages.
Future Prospects
Future research should explore correlations between RM benchmarks and practical LLM performance in downstream applications. Additionally, expanding human-written translations could further refine multilingual evaluations. The paper also underscores the potential for tailoring RMs to culturally specific preferences, advancing the alignment of LLMs with diverse global users.
In conclusion, M-RewardBench is a pivotal tool for advancing the evaluation of RMs in multilingual contexts. By providing this benchmark, the paper encourages broader research into developing models that can meet the language alignment needs of a globally diverse population. The release of M-RewardBench is expected to facilitate ongoing progress in multilingual reward model development and evaluation.