- The paper introduces BenchMAX, a multilingual evaluation suite that assesses advanced LLM capabilities including generation, reasoning, and code synthesis across 17 languages.
- The methodology integrates machine translation, multi-round human annotation, and LLM-based adjudication to produce high-fidelity parallel datasets for diverse tasks.
- Experimental results highlight persistent performance disparities in low-resource languages, underscoring the need for innovative training and prompt engineering strategies.
BenchMAX: A Comprehensive Multilingual Benchmark for Advanced LLM Evaluation
Introduction
BenchMAX addresses a critical deficiency in the evaluation of LLMs: the lack of comprehensive, high-quality, and genuinely multilingual benchmarks for advanced, language-agnostic capabilities. While prior multilingual benchmarks focus predominantly on basic understanding or multiple-choice tasks, they fail to comprehensively assess generative and reasoning competencies, particularly in low-resource or non-Latin-script languages. BenchMAX introduces an extensive parallel evaluation suite across 17 diverse languages, carefully curated through a hybrid pipeline of machine translation, multi-round human annotation, and LLM-based adjudication. The suite encompasses ten tasks covering six crucial advanced LLM capabilities, including multi-paradigm instruction following, mathematical and scientific reasoning, executable code synthesis, long-context understanding, tool use, and both general and domain-specific translation.
Benchmark Suite Design and Task Coverage
BenchMAX explicitly targets advanced LLM evaluation with an emphasis on generation and reasoning. Languages were selected to span typologically distinct families and varied writing systems, including Indo-European, Sino-Tibetan, Dravidian, Kra-Dai, Japonic, Uralic, Afro-Asiatic, Koreanic, Nilo-Saharan, and Niger-Congo (Bantu), with a significant proportion using non-Latin scripts. This breadth enables robust analysis of how script, typology, and resource level affect cross-lingual transfer in LLMs.
The suite assesses:
- Instruction Following, with both rule-based verifiable constraints (from IFEval) and model-based generation (Arena-Hard), requiring compliance with explicit constraints and nuanced generative adherence.
- Reasoning Tasks, including math (MGSM) and high-difficulty, Google-proof scientific QA (GPQA, targeting advanced physics, chemistry, biology).
- Code Generation, split into function completion (HumanEval+) and competitive problem-solving (LiveCodeBench), both demanding executable, correct Python programs.
- Long-context Reading, via RULER-derived synthetic QA on up to 128k-token documents.
- Tool Use, based on Nexus, evaluating correct mapping from natural queries to function calls among multiple distractors.
- Translation (General and Domain), including robust evaluation of fine-grained domain adaptation and segment selection in translation.
Each task is carefully adapted for multilingual transfer, using prompt engineering and post-editing to ensure that constraints, technical terminology, and task instructions are faithfully rendered, minimizing signal loss caused by poor translation.
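One way to preserve constraints and technical terminology through machine translation is to wrap them in placeholder tokens before translation and restore them afterwards. The sketch below illustrates this idea; the `[[KWi]]` placeholder format and the function names are assumptions for illustration, not the paper's exact bracketing scheme.

```python
def protect_keywords(text: str, keywords: list[str]) -> tuple[str, dict]:
    """Replace each keyword with a placeholder token that machine
    translation is unlikely to alter. Returns the masked text and a
    token-to-keyword mapping for later restoration."""
    mapping = {}
    for i, kw in enumerate(keywords):
        token = f"[[KW{i}]]"
        mapping[token] = kw
        text = text.replace(kw, token)
    return text, mapping

def restore_keywords(translated: str, mapping: dict) -> str:
    """Swap placeholders back to the original (untranslated) keywords."""
    for token, kw in mapping.items():
        translated = translated.replace(token, kw)
    return translated

# Example: keep the literal constraint token "JSON" intact through MT.
masked, mapping = protect_keywords("Reply in JSON format.", ["JSON"])
# masked == "Reply in [[KW0]] format."
# ... machine-translate `masked` here ...
restored = restore_keywords(masked, mapping)
# restored == "Reply in JSON format."
```

In practice the placeholder would survive translation of the surrounding sentence, so the constraint keyword reappears verbatim in the target language.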
Annotation and Evaluation Protocol
BenchMAX sets a new standard for parallel data quality. Raw English tasks are first machine-translated (using Google Translate or GPT-4o for tasks with complex constraints), with keyword and constraint preservation aided by special symbol/bracketing schemes. Each instance is post-edited by three native-speaking annotators with disciplinary expertise as appropriate. Automatic verifiers (rule-based and model-based, including GEMBA-SQM and Qwen2.5-Instruct) provide iterative feedback and identify problematic translations, which are refined in multiple rounds. The selection of the final version is delegated to an LLM (GPT-4o-mini), using randomization and pairwise battles to reduce annotator positional bias and cost.
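The final-version selection via randomized pairwise battles can be sketched as a simple single-elimination loop; the judge interface and tournament structure below are illustrative assumptions, not the paper's exact protocol.

```python
import random

def pick_final_version(candidates: list[str], judge, seed: int = 0) -> str:
    """Select one translation from post-edited candidates via pairwise
    'battles'. `judge(a, b)` is a hypothetical LLM call returning 0 if the
    first presented candidate wins, 1 otherwise. Presentation order is
    randomized in each battle to reduce positional bias."""
    rng = random.Random(seed)
    pool = candidates[:]
    rng.shuffle(pool)
    while len(pool) > 1:
        a, b = pool.pop(), pool.pop()
        # Randomize which candidate the judge sees first.
        if rng.random() < 0.5:
            a, b = b, a
        winner = a if judge(a, b) == 0 else b
        pool.append(winner)
    return pool[0]
```

With a consistent judge, the same candidate wins regardless of presentation order, so the randomization only neutralizes position effects rather than changing the outcome.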
This process yields a genuinely parallel, high-fidelity dataset, eliminating many known flaws of prior multilingual benchmarks (e.g., ground truth errors, format mismatches, missing constraint fidelity, poor preservation of technical structure).
Experimental Findings
Multilingual Evaluation Results
State-of-the-art multilingual LLMs (Llama3.1, Qwen2.5, DeepSeek-V3, Gemma2, InternLM2.5, Aya-Expanse, GPT-4o-mini) are evaluated across all BenchMAX tasks. Results demonstrate that:
- Scaling model size generally improves average performance across tasks and languages.
- Performance disparities (GAP) between English and non-English languages remain substantial, often with high-resource (French, Chinese, Russian) languages performing well and low-resource (Swahili, Telugu, Bengali) languages trailing, even for the largest models.
- Scaling alone does not guarantee reduced disparity: Within-family analysis (e.g., Gemma2) shows that larger models do not uniformly yield smaller cross-lingual GAPs. In some instances, smaller variants outperform their larger counterparts on GAP minimization.
- Language-agnostic capabilities, such as logical and scientific reasoning, tool use, and code generation, are highly modulated by language proficiency: Performance swings of over 30 points are observed between dominant and low-resource/complex script languages on identical tasks.
- Certain models (e.g., Qwen2.5) show anomalous strengths in specific non-dominant languages (e.g., outperforming English on Korean science reasoning), evidencing complex and nontrivial cross-lingual generalization behaviors.
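The English-versus-other-language disparity analysis above can be expressed as a simple gap statistic; the exact definition used in the paper may differ, so the function below is an illustrative assumption (English score minus the mean non-English score).

```python
def language_gap(scores: dict[str, float], pivot: str = "en") -> float:
    """Gap between the pivot language's task score and the average score
    over all other languages. `scores` maps language code -> score.
    This particular GAP formula is an assumption for illustration."""
    others = [v for lang, v in scores.items() if lang != pivot]
    return scores[pivot] - sum(others) / len(others)

# Example: English at 90, French at 85, Swahili at 60.
gap = language_gap({"en": 90.0, "fr": 85.0, "sw": 60.0})
# gap == 90.0 - (85.0 + 60.0) / 2 == 17.5
```

Comparing this statistic across model sizes within one family is what reveals that scaling does not uniformly shrink the gap.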
Task and Metric-Specific Observations
- Domain-specific translation evaluation is nontrivial: Traditional corpus-similarity metrics (spBLEU, TER) poorly reflect translation quality for domain- and code-mixed content, frequently overestimating model ability. Model-based metrics (XCOMET-XXL) are sensitive to domain/low-resource settings and display high variance.
- Human-annotated reference translations are essential for accurate evaluation of instruction following and code tasks. Machine-translated data can both over- and under-estimate model performance; in many cases, human translation yields gains of >4 points, especially for constraint-heavy tasks and typologically complex languages.
- Correctness agreement (F1) between English and other language outputs is high (often >0.9) for reasoning/code tasks among strong LLMs, indicating that for well-supported languages, semantic equivalence is largely preserved.
- Heavy reliance on conventional discriminative understanding tasks (XCOPA, XWinograd) leads to qualitatively different, and often misleading, model rankings compared to BenchMAX's generative tasks. For example, models ranked highly on discriminative metrics underperform on generative and executable tasks, reinforcing the necessity of multi-paradigm evaluation.
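The correctness agreement reported above can be computed by treating per-instance English correctness as reference labels and the other language's correctness as predictions, then scoring F1. This framing is an assumption for illustration; the paper's exact agreement computation may differ.

```python
def correctness_f1(eng: list[bool], other: list[bool]) -> float:
    """F1 agreement between English and another language's per-instance
    correctness on the same parallel items. English correctness serves as
    the reference; the other language's correctness as the prediction."""
    tp = sum(e and o for e, o in zip(eng, other))          # correct in both
    fp = sum((not e) and o for e, o in zip(eng, other))    # correct only in other
    fn = sum(e and (not o) for e, o in zip(eng, other))    # correct only in English
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 4 parallel items, disagreement on one.
f1 = correctness_f1([True, True, False, True], [True, False, False, True])
# f1 == 0.8
```

Values above 0.9 on reasoning and code tasks would indicate that, for well-supported languages, a model tends to succeed or fail on the same underlying problems regardless of the prompt language.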
Correlation and Family Analysis
- Strong intra-family performance correlations are observed across the majority of tasks and languages, showing that family-specific training and tokenization choices propagate systematically.
- Translation capability (particularly domain translation) correlates positively with other advanced competencies. For certain instruction-following tasks, prompt injection and task confusion become nontrivial in large models, producing negative scaling effects.
Implications and Future Directions
BenchMAX represents a step-change in the rigor and breadth of multilingual LLM evaluation. Key implications include:
- Current SOTA LLMs, including both proprietary (GPT-4o-mini) and open-source (DeepSeek-V3), remain highly unbalanced in advanced capability transfer across languages. Model scaling narrows the average gap but does not resolve systemic deficiencies for non-dominant languages.
- Translation quality for technical and domain-specific contexts is inadequately measured by existing metrics. New automatic evaluation paradigms and targeted human evaluation will be required to track meaningful cross-lingual performance.
- Human-in-the-loop benchmark construction for all languages, especially for code-related and constraint-following tasks, is essential for accurate measurement and realistic deployment scenarios.
- Open-source models such as DeepSeek-V3 are approaching, or in some areas exceeding, the performance of closed-source models, establishing open models as viable candidates for multilingual NLP applications.
- There remains significant low-hanging fruit in training, prompt engineering, and fine-tuning strategies for low-resource and morphologically complex languages. BenchMAX provides a foundational testbed to drive such innovations.
Conclusion
BenchMAX establishes a comprehensive, high-fidelity multilingual evaluation platform addressing the shortcomings of prior LLM benchmarking efforts. Its extensive linguistic and functional coverage, rigorous multilingual annotation, and focus on generative, executable, and advanced reasoning capabilities enable accurate, fine-grained diagnosis of modern multilingual LLMs. The findings demonstrate that even top-tier models exhibit persistent, nontrivial weaknesses in low-resource and non-dominant languages, particularly on advanced generative tasks, and that closing these gaps will require innovations beyond simple parameter scaling or naive data augmentation. BenchMAX serves as a practical and theoretical platform for driving the next phase of research in robust, equitable multilingual AI.
Reference: "BenchMAX: A Comprehensive Multilingual Evaluation Suite for LLMs" (arXiv:2502.07346)