- The paper introduces BenchMAX as a comprehensive multilingual benchmark assessing LLM capabilities in instruction following, reasoning, long context, and code generation.
- It employs a rigorous data pipeline combining machine translation, post-editing by three native speakers per sample, and LLM-based quality control to ensure high-quality data across 17 languages.
- Experimental results reveal persistent language disparities, emphasizing that simply increasing model size does not bridge performance gaps in multilingual settings.
Title: BenchMAX: A Comprehensive Multilingual Evaluation Suite for LLMs
Authors: Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan
Abstract Summary:
The paper introduces BenchMAX, a new multilingual benchmark designed to evaluate the advanced capabilities of LLMs across multiple languages. Unlike previous benchmarks that focus on basic understanding, BenchMAX emphasizes instruction following, reasoning, long context understanding, code generation, and tool use. The dataset covers 17 languages and uses a rigorous annotation pipeline involving machine translation, independent annotation by three native speakers, and LLM-based quality control. Experiments using BenchMAX revealed performance variations across languages, suggesting that simply scaling model size doesn't eliminate performance gaps in multilingual settings. The authors provide the dataset and code publicly.
1. Introduction & Motivation:
- LLMs demonstrate proficiency in various tasks, including instruction following, reasoning, long context understanding, and code generation.
- These capabilities are considered language-agnostic in principle. For example, mathematical or coding logic shouldn't depend on the language used to express the problem.
- However, empirical studies show that LLMs exhibit unbalanced performance across different languages on the same tasks.
- Existing benchmarks have limitations:
- Many use multiple-choice formats that don't fully evaluate generative capabilities.
- Limited language overlap across benchmarks makes it difficult to compare performance on a consistent set of languages.
- Some benchmarks, such as P-MMEval, have a narrow task focus on which LLMs already achieve very high scores, creating a gap between research evaluation and real-world applications.
- BenchMAX aims to address these limitations by providing a comprehensive and challenging multilingual evaluation suite.
Key Contributions & Features of BenchMAX:
- Comprehensive: BenchMAX covers a broader range of language-agnostic capabilities (6 crucial capabilities, 10 diverse tasks).
- Multi-way Multilingual: Evaluates LLMs across 17 parallel languages representing diverse language families and writing systems, with a higher proportion of languages using non-Latin scripts than prior suites.
- Challenging: Tasks include instruction following, reasoning (math and science), code generation (function completion, problem solving), long context understanding, tool use, and general/domain translation.
- High-Quality: The dataset construction involves:
- Machine translation from English to 16 other languages.
- Post-editing by three native-speaking annotators per sample.
- LLM-based selection of the best translation version to reduce annotator bias.
- Novelty: Introduces a "domain translation" challenge requiring fine-grained control and domain-specific terminology understanding.
2. Related Work:
- Discusses the limitations of older multilingual benchmarks designed for discriminative models (classification tasks) like XNLI, XCOPA, and XCSQA, pointing out their limited complexity and diversity.
- Mentions that MGSM is commonly used by LLM teams, but BenchMAX extends the language coverage.
- Addresses issues with translated versions of MMLU, including the lack of a unified version and the presence of ground-truth errors. BenchMAX uses GPQA instead of MMLU because of its higher-quality annotations.
- Highlights that BenchMAX incorporates a broader range of capabilities compared to aggregated benchmarks like SeaEval and P-MMEval, and emphasizes human post-editing for improved data quality.
3. Benchmark Construction:
- Language Selection: 17 languages were chosen to represent diverse language families and writing systems (Table 2 lists the languages, ISO codes, families, and script systems).
- Capabilities Selection:
- Instruction Following: Evaluated with rule-based (IFEval) and model-based (Arena-hard) assessments.
- Reasoning: Assesses math reasoning (MGSM) and science reasoning (GPQA).
- Code Generation: Evaluates function completion (HumanEval+) and problem solving (LiveCodeBench).
- Long Context Modeling: Evaluates question answering with long documents (RULER).
- Tool Use: Assesses the ability to select and invoke functions (Nexus).
- Translation: Evaluates general (Flores, TED, WMT24) and domain translation.
- Construction Pipeline (Figure 2):
1. Machine Translation: English data is translated into the 16 non-English languages using Google Translate or an LLM-based translator such as GPT-4o, depending on whether the sample contains extractable constraints.
2. Human Post-Editing: Three native-speaking annotators post-edit each sample independently. Automatic verifiers (rule-based, plus model-based checks using Qwen2.5-72B) assess annotation quality, and post-editing is repeated for at least three iterations.
3. LLM-Based Final Version Selection: GPT-4o-mini selects the final translation through pairwise comparisons while mitigating position bias (see the selection sketch after this list).
- Handling Complex Constraints: Extracting keywords from translated instructions for rule-based verification is non-trivial, so keywords are enclosed in special symbols to facilitate extraction (Table 4 and Table 5 illustrate this; a minimal sketch follows this list).
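The final-version selection step can be illustrated with a short sketch. This is a minimal illustration, not the paper's implementation: the judge prompt, the tie handling, and the assumption that position bias is mitigated by judging in both presentation orders are ours, and `judge()` is a placeholder for an actual GPT-4o-mini API call.

```python
# Minimal sketch of LLM-based final-version selection (illustrative only).
from itertools import combinations

def judge(source: str, cand_a: str, cand_b: str) -> str:
    """Ask the judge LLM which candidate better translates `source`.

    Returns 'A' or 'B'. Placeholder: wire this to the GPT-4o-mini API.
    """
    raise NotImplementedError

def consistent_preference(source: str, x: str, y: str) -> int:
    """Judge in both presentation orders; accept a win only if consistent."""
    first = judge(source, x, y)    # x shown in position A
    second = judge(source, y, x)   # y shown in position A (order swapped)
    if first == "A" and second == "B":
        return 0                   # x preferred in both orders
    if first == "B" and second == "A":
        return 1                   # y preferred in both orders
    return -1                      # inconsistent verdicts -> treat as a tie

def select_final(source: str, candidates: list[str]) -> str:
    """Pick the post-edited version with the most consistent pairwise wins."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        winner = consistent_preference(source, candidates[i], candidates[j])
        if winner == 0:
            wins[i] += 1
        elif winner == 1:
            wins[j] += 1
    return candidates[max(range(len(candidates)), key=wins.__getitem__)]
```

Requiring the same verdict under both presentation orders is one simple way to discount position-biased judgments; inconsistent verdicts are treated as ties and award no win.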
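The keyword-marking trick for rule-based constraints might look like the sketch below; the `<<` / `>>` markers and the example sentences are hypothetical, since the paper's actual symbols and prompts are the ones illustrated in Tables 4 and 5.

```python
# Minimal sketch of keyword marking for rule-based verification (illustrative).
import re

MARK_L, MARK_R = "<<", ">>"   # hypothetical special symbols

def mark_keywords(instruction: str, keywords: list[str]) -> str:
    """Wrap constraint keywords before translation so they stay locatable."""
    for kw in keywords:
        instruction = instruction.replace(kw, f"{MARK_L}{kw}{MARK_R}")
    return instruction

def extract_keywords(translated: str) -> list[str]:
    """Recover the (now translated) keywords for rule-based verification."""
    pattern = re.escape(MARK_L) + r"(.+?)" + re.escape(MARK_R)
    return re.findall(pattern, translated)

marked = mark_keywords("Write a story that includes the word dream.", ["dream"])
# ... `marked` goes through machine translation, then the verifier extracts:
print(extract_keywords("Écris une histoire qui inclut le mot <<rêve>>."))  # ['rêve']
```

Wrapping the keyword before translation keeps IFEval-style rule-based verifiers usable even though the keyword itself ends up translated.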
4. Experimental Results:
- Evaluation Setup:
- Evaluated Models: Llama3.1, Qwen2.5, Gemma2, InternLM2.5, Aya-Expanse, DeepSeek-V3, and GPT-4o-mini.
- Inference Configuration: Greedy decoding is used for all tasks except problem solving, where the temperature is set to 0.2. Standard chat templates and system prompts are applied (a brief configuration sketch follows the results below).
- Multilingual Benchmark Results (Table 6, Figure 4):
- Larger models generally improve performance, but language disparities persist.
- Language-agnostic capabilities are not exploited equally well across languages; reasoning performance in particular varies significantly.
- Model performance is often biased toward high-resource languages.
- DeepSeek-V3 narrows the gap between open-source and closed-source models.
- Translation capabilities are positively correlated with other evaluated capabilities.
- Models within the same family exhibit consistent performance patterns across languages.
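As a concrete illustration of the inference configuration, a minimal sketch is shown below, assuming a vLLM-style offline setup; the model name and token limit are placeholders, and only the greedy-versus-temperature-0.2 split comes from the paper.

```python
# Configuration sketch only; the serving stack and limits are assumptions.
from vllm import LLM, SamplingParams

GREEDY = SamplingParams(temperature=0.0, max_tokens=2048)            # all tasks ...
PROBLEM_SOLVING = SamplingParams(temperature=0.2, max_tokens=2048)   # ... except problem solving

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model name

def run_task(prompts: list[str], task: str):
    """Prompts are assumed to already carry the chat template and system prompt."""
    params = PROBLEM_SOLVING if task == "problem_solving" else GREEDY
    return llm.generate(prompts, params)
```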
5. Analysis:
- High Correctness Agreement: High F1 scores (>0.9) indicate strong agreement between problem-solving correctness in English and in other languages (Figure 6), though the scores are lower for low-resource languages (a computation sketch appears after this list).
- Challenges in Evaluating Domain-Specific Translation: Surface metrics such as BLEU and TER are unreliable because much of the text remains unchanged; model-based metrics such as XCOMET show high variance; and a proposed performance-retention rate does not behave as expected (Table 7).
- Machine-Translated Task Data: Evaluating on machine-translated task data can lead to both overestimation and underestimation of model performance (Figure 8). Human post-edited data yields better performance, especially on rule-based instruction following.
- Conventional Understanding Tasks: Insufficient for evaluating multilingual capabilities. BenchMAX reveals different performance patterns compared to traditional understanding tasks.
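The correctness-agreement analysis can be sketched in a few lines, assuming per-sample 0/1 correctness flags aligned by problem id across languages; the variable names and example values below are illustrative.

```python
# Minimal sketch of the cross-lingual correctness-agreement analysis.
from sklearn.metrics import f1_score

def correctness_agreement(english_correct: list[int], target_correct: list[int]) -> float:
    """F1 between per-sample correctness in English and in a target language."""
    return f1_score(english_correct, target_correct)

en = [1, 1, 0, 1, 0, 1]   # which problems the model solved in English
zh = [1, 1, 0, 1, 1, 1]   # which problems it solved in the target language
print(correctness_agreement(en, zh))  # ≈ 0.89: high agreement across languages
```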
6. Conclusion:
- BenchMAX is a comprehensive multilingual benchmark designed to assess LLM capabilities across diverse languages and tasks.
- State-of-the-art LLMs still exhibit uneven performance across different languages.
- Simply increasing model size does not eliminate performance gaps, highlighting the need for further research to achieve balanced multilingual capabilities.
Impact Statement:
The authors state that the goal of the paper is to advance the field of machine learning and that there are no specific societal consequences to highlight.
In essence, the paper presents a valuable new resource for the multilingual NLP community. It provides a more rigorous and comprehensive way to evaluate LLMs across a wider range of languages and tasks than previous benchmarks, and offers insights into the challenges of building truly multilingual models.