
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation (2503.10497v2)

Published 13 Mar 2025 in cs.CL

Abstract: Existing LLM evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-linguistic reasoning abilities. This dual limitation makes it challenging to comprehensively assess LLMs' performance in the multilingual setting. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs. The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, with gaps of up to 24.3%. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.

Evaluation of LLMs through MMLU-ProX: A Multilingual Perspective

The paper "MMLU-ProX: A Multilingual Benchmark for Advanced LLM Evaluation" presents MMLU-ProX, a rigorously constructed multilingual benchmark aimed at assessing the performance of state-of-the-art LLMs across diverse linguistic and cultural contexts. This paper addresses the considerable challenges traditional benchmarks face when evaluating sophisticated LLMs in a multilingual framework. MMLU-ProX aims to bridge the gap by incorporating 13 languages and evaluating performance based on a stringent multilingual metric. This venture not only highlights the persisting inadequacies of LLMs in handling typologically diverse languages but also proposes a structured method to measure and improve their cross-lingual reasoning capabilities.

Conceptual Framework and Methodology

MMLU-ProX builds upon the challenging, reasoning-focused framework of MMLU-Pro, with specific enhancements aimed at extending the benchmark to a multilingual setting. The intrinsic complexity of the questions ensures robust model evaluation. The paper describes a semi-automated translation process in which state-of-the-art LLMs produce initial translations that are then meticulously reviewed by experts. This process enforces conceptual accuracy, terminological consistency, and cultural relevance across languages, qualities often lacking in purely automated translation pipelines.
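
The paper does not publish its exact translation prompts or review tooling, so the following is only a minimal sketch of what such a translate-then-review pipeline could look like; the function names, data fields, and placeholder logic are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TranslatedItem:
    question_id: str
    source_text: str
    draft_translation: str
    reviewer_notes: list[str] = field(default_factory=list)
    approved: bool = False


def draft_translate(text: str, target_lang: str) -> str:
    """Stand-in for an LLM translation call (hypothetical; the paper only
    says 'multiple powerful LLMs' are used for the initial drafts)."""
    return f"[{target_lang} draft] {text}"


def expert_review(item: TranslatedItem) -> TranslatedItem:
    """Stand-in for the human pass: experts check terminology, conceptual
    accuracy, and cultural relevance before approving an item."""
    item.reviewer_notes.append("terminology and phrasing verified")
    item.approved = True
    return item


def build_language_split(questions: dict[str, str], target_lang: str) -> list[TranslatedItem]:
    """Translate every English question, then gate each one on expert review."""
    drafts = [
        TranslatedItem(qid, text, draft_translate(text, target_lang))
        for qid, text in questions.items()
    ]
    return [expert_review(item) for item in drafts]


if __name__ == "__main__":
    sample = {"q1": "Which planet has the largest mass in the Solar System?"}
    for item in build_language_split(sample, "sw"):  # e.g. a Swahili split
        print(item.question_id, item.approved, item.draft_translation)
```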

The authors employ both 5-shot chain-of-thought (CoT) and zero-shot strategies to evaluate the LLMs, offering a comprehensive view of performance across linguistic and cultural spectrums. This strategic combination allows for nuanced insights into the reasoning processes of LLMs, reflecting their adaptability and structured decision-making skills across languages.
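
As an illustration of the 5-shot chain-of-thought setup, the sketch below assembles a prompt from up to five solved demonstrations followed by the target question. The prompt template, field names, and the "Let's think step by step" phrasing are assumptions for illustration; the paper's exact prompt format may differ.

```python
def format_question(question: str, options: list[str]) -> str:
    """Render an MMLU-Pro-style item; questions carry up to ten options."""
    letters = "ABCDEFGHIJ"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def build_cot_prompt(demos: list[dict], target: dict) -> str:
    """demos: solved examples with a written rationale and gold label;
    target: the unsolved question the model must answer."""
    blocks = []
    for d in demos[:5]:  # 5-shot
        blocks.append(
            format_question(d["question"], d["options"])
            + f"\nAnswer: Let's think step by step. {d['rationale']}"
            + f" The answer is ({d['label']})."
        )
    blocks.append(
        format_question(target["question"], target["options"])
        + "\nAnswer: Let's think step by step."
    )
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demo = {
        "question": "What is 2 + 2?",
        "options": ["3", "4", "5", "22"],
        "rationale": "Adding two and two gives four.",
        "label": "B",
    }
    target = {"question": "What is 3 * 3?", "options": ["6", "9", "12", "33"]}
    print(build_cot_prompt([demo], target))
```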

Experimental Findings

The paper provides an extensive evaluation of 36 cutting-edge LLMs, identifying a clear degradation in performance when models are applied to low-resource languages compared with high-resource languages such as English. Notably, models such as Qwen2.5-72B exceed 70% accuracy in English but drop to around 40% in Swahili, underscoring persistent disparities in multilingual capability. The results also indicate that larger models tend to perform better multilingually, highlighting the importance of scale and of diverse training data for equitable results across languages.
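
To make this kind of disparity concrete, the short sketch below computes the accuracy gap between English and the other languages for a single model. The numbers are placeholders for illustration, not values copied from the paper's result tables.

```python
# Per-language accuracies for one hypothetical model (placeholder values).
accuracies = {
    "en": 0.72,  # high-resource example
    "zh": 0.68,
    "sw": 0.41,  # low-resource example
}

english = accuracies["en"]
for lang, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    gap = english - acc  # positive gap means the language trails English
    print(f"{lang}: accuracy={acc:.1%}, gap vs. English={gap:+.1%}")
```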

An analysis of reasoning-enhanced models such as QwQ-32B demonstrates varying impacts on performance, suggesting that specific enhancements may influence LLM capabilities differentially across languages. The systematic evaluation with MMLU-ProX reveals LLMs' potential when properly trained and fine-tuned, but also the stark need for continuous improvements, particularly in lower-resource languages.

Implications and Future Directions

By establishing MMLU-ProX, the paper sets a new standard for multilingual LLM evaluation, aiming to promote further research and development of models that are capable and reliable across various languages and cultures. The implications of such research are profound, offering pathways to more equitable and inclusive AI technologies that can serve a diverse global user base.

Future work, as outlined in the paper, will focus on expanding MMLU-ProX to more languages and on evaluating newly emerging models. Such efforts are critical for addressing the biases of current models and improving their cross-lingual applicability. The benchmark provides a robust tool that can guide future model training and development toward truly global and inclusive language solutions.

Overall, the paper is a pivotal step toward advancing multilingual LLM evaluation and development. It offers comprehensive insights and actionable data, enabling researchers and practitioners to better understand and improve the cross-lingual reasoning capabilities of LLMs and to push them toward broader applicability and reliability.

Authors (32)
  1. Weihao Xuan
  2. Rui Yang
  3. Heli Qi
  4. Qingcheng Zeng
  5. Yunze Xiao
  6. Yun Xing
  7. Junjue Wang
  8. Huitao Li
  9. Xin Li
  10. Kunyu Yu
  11. Nan Liu
  12. Qingyu Chen
  13. Douglas Teodoro
  14. Edison Marrese-Taylor
  15. Shijian Lu
  16. Yusuke Iwasawa
  17. Yutaka Matsuo
  18. Irene Li
  19. Aosong Feng
  20. Dairui Liu