
Multilingual MMLU: Global LLM Benchmarking

Updated 3 February 2026
  • Multilingual MMLU is a comprehensive evaluation framework that adapts MMLU benchmarks to multiple languages while preserving cultural and linguistic nuances.
  • It employs varied dataset construction modalities—including machine translation, hybrid human review, and native authoring—to reduce errors and enhance authenticity.
  • Empirical evaluations reveal significant cross-lingual performance gaps, emphasizing the need for culturally aware and dialect-sensitive evaluation protocols.

Multilingual MMLU refers to the extension and adaptation of the Massive Multitask Language Understanding (MMLU) benchmark, originally developed for English, into multiple languages, supporting rigorous evaluation of LLMs' general-knowledge reasoning across diverse linguistic, cultural, and typological contexts. This paradigm integrates translation, native question curation, and domain alignment to assess how well LLMs handle multiple-choice questions beyond Anglocentric and Western domains, and it surfaces cross-lingual gaps, translation artifacts, and culture-specific knowledge barriers.

1. Dataset Construction Modalities

Multilingual MMLU construction follows distinct strategies that crucially impact evaluation quality and validity. The principal modalities are:

  • Direct Machine Translation: Many benchmarks employ fully automatic translation of the original English MMLU questions into target languages, using systems such as Google Translate, DeepL, or LLM-driven pipelines. For example, EU20-MMLU translates all 57 subjects into 21 European languages via DeepL, with post-editing for low-COMET segments, resulting in overall high parallelism (COMET-KIWI ≈0.82) and a maximum error rate <2.3% in any language (Thellmann et al., 2024). However, such translation can systematically introduce errors—proper name mistranslations (“John Constable” → “Juan Alguacil”), meaning shifts, and idiomatic losses—accounting for 4–9pp of apparent accuracy drop in GPT-4’s Spanish MMLU, as shown by manual audits (Plaza et al., 2024). A sketch of this quality-triage step appears after this list.
  • Human and Hybrid Translation with Post-editing: Global-MMLU (Singh et al., 2024) and MMLU-ProX (Xuan et al., 13 Mar 2025) employ a hybrid LLM+human pipeline. Google Translate is used for bulk translation, but professional and community annotators review at least 50 samples per language for fluency, clarity, and semantic fidelity. MMLU-ProX introduces LLM “self-reflection” and cross-verification stages, with all flagged translations reviewed by domain experts against acceptability thresholds (mean Likert >4.0), producing mean expert ratings >4.2.
  • Natively Authored Benchmarks: Increasingly, new multilingual MMLU variants forgo translation altogether in favor of native question curation. Benchmarks such as SinhalaMMLU (Pramodya et al., 3 Sep 2025), TurkishMMLU (Yüksel et al., 2024), TUMLU (Isbarov et al., 16 Feb 2025), and HKMMLU (Cao et al., 4 May 2025) extract or write questions from real curricula, government exams, and e-learning platforms, preserving cultural context, correct terminology, and authentic educational framing. This approach yields superior naturalness and cultural resonance (e.g., SinhalaMMLU: naturalness of native STEM items 97.3% vs. 71.1% for translated [GlobalMMLU-si]) and is fundamental for dialectal coverage (e.g., DialectalArabicMMLU for five Arabic dialects (Altakrori et al., 31 Oct 2025)).
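
To make the post-editing triage concrete, the sketch below flags machine-translated items whose estimated quality falls below a threshold, in the spirit of the low-COMET post-editing step described in the first bullet. The `quality_score` function is a hypothetical stand-in for a reference-free quality estimator such as COMET-KIWI; the threshold and data fields are illustrative, not those of any specific benchmark.

```python
from dataclasses import dataclass

def quality_score(source: str, translation: str) -> float:
    # Hypothetical stand-in for a reference-free quality estimator
    # (e.g., a COMET-KIWI-style model); plug in a real QE model here.
    raise NotImplementedError("provide a quality-estimation model")

@dataclass
class MCQItem:
    question_en: str          # original English question
    question_mt: str          # machine translation into the target language
    needs_post_edit: bool = False

def triage_for_post_editing(items: list[MCQItem], threshold: float = 0.75) -> list[MCQItem]:
    """Flag low-quality segments for human post-editing, mirroring the
    machine-translation + post-edit pipelines described above."""
    flagged = []
    for item in items:
        if quality_score(item.question_en, item.question_mt) < threshold:
            item.needs_post_edit = True
            flagged.append(item)
    return flagged
```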

Table: Construction Approaches by Benchmarks

| Benchmark | Approach | Languages |
| --- | --- | --- |
| Global-MMLU | LLM + human hybrid | 42 |
| EU20-MMLU | Machine translation + post-editing | 21 (European) |
| MMLU-ProX | LLM + expert review | 13 (expanding) |
| SinhalaMMLU | Native authoring | Sinhala |
| TurkishMMLU | Native authoring | Turkish |
| TUMLU | Native authoring | 8 Turkic languages |
| DialectalArabicMMLU | Hybrid + native | 5 dialects + MSA |
| HKMMLU | Native/mixed | Hong Kong Traditional Chinese |

Notably, benchmarks using natively curated questions with explicit cultural alignment avoid pervasive translationese artifacts, as evidenced by marked differences in both fluency and model accuracy (Pramodya et al., 3 Sep 2025, Yüksel et al., 2024).

2. Evaluation Protocols and Prompt Engineering

Multilingual MMLU evaluation typically involves multiple-choice question answering, with each item offering several answer options (A–E, or fewer). Prompts conform to standardized templates based on the original MMLU, but critical details can affect model outputs:

  • Contextual Prompts: Few-shot prompting (e.g., k=5 in EU20-MMLU and Teuken-7B (Thellmann et al., 2024, Ali et al., 2024)) is common, as is chain-of-thought (CoT) scaffolding in reasoning-centered evaluations (Xuan et al., 13 Mar 2025, Yüksel et al., 2024). For example, TurkishMMLU shows substantial CoT gains in Mathematics (GPT-4o: +25pp) (Yüksel et al., 2024).
  • Language and Dialect Priming: DialectalArabicMMLU finds that explicit “oracle” prompts specifying the dialect degrade performance (–2pp), likely due to inconsistency in dialectal data exposure and prompt-template alignment (Altakrori et al., 31 Oct 2025). Chain-of-thought and few-shot strategies may yield variable effects across languages.
  • Accuracy and Macro-averaging: The canonical metric is accuracy, the fraction of questions answered correctly per language, $\mathrm{Accuracy} = \frac{\#\text{correct}}{\#\text{total}}$, with macro-averaging across languages in multilingual settings to control for resource bias (an illustrative sketch follows this list).
  • Automated Harnesses: Benchmarks utilize frameworks like LM Eval Harness for parallel, reproducible batch evaluation (Thellmann et al., 2024, Pramodya et al., 3 Sep 2025).
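
The sketch below illustrates two of the protocol elements above: a few-shot MMLU-style prompt template and macro-averaged accuracy with equal weight per language. The template wording and field names are illustrative assumptions, not the exact format used by any particular benchmark or harness.

```python
from collections import defaultdict

LETTERS = "ABCDE"

def build_prompt(few_shot_examples, question, choices):
    """Format an MMLU-style few-shot multiple-choice prompt
    (question, lettered options, 'Answer:'); exact wording varies by benchmark."""
    parts = []
    for ex in few_shot_examples:
        opts = "\n".join(f"{LETTERS[i]}. {o}" for i, o in enumerate(ex["choices"]))
        parts.append(f"{ex['question']}\n{opts}\nAnswer: {ex['answer']}")
    opts = "\n".join(f"{LETTERS[i]}. {o}" for i, o in enumerate(choices))
    parts.append(f"{question}\n{opts}\nAnswer:")
    return "\n\n".join(parts)

def macro_averaged_accuracy(results):
    """results: iterable of (language, is_correct) pairs. Each language's
    accuracy gets equal weight, so high-resource languages with more items
    do not dominate the aggregate score."""
    per_lang = defaultdict(list)
    for lang, correct in results:
        per_lang[lang].append(bool(correct))
    per_lang_acc = {lang: sum(v) / len(v) for lang, v in per_lang.items()}
    return sum(per_lang_acc.values()) / len(per_lang_acc)
```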

3. Empirical Results and Cross-Lingual Disparities

Results from multilingual MMLU evaluations reveal a pronounced resource gap:

  • High-Resource Languages: Top models (e.g., Qwen2.5-72B, GPT-4o, Claude 3.5 Sonnet) achieve 75–90% accuracy in English and major European languages (Thellmann et al., 2024, Yüksel et al., 2024, Xuan et al., 13 Mar 2025). EU21-MMLU reports Meta-Llama-3.1-8B-Instruct at 57.6% on average across 21 EU languages (Ali et al., 2024).
  • Low-Resource and Culturally Specific Cases: Accuracy drops substantially in under-served languages and dialects—e.g., Giriama (OpenAI o1: 70.8%, Llama-70B: 41%, Mistral-large: 35.6%) vs. Latvian (OpenAI o1: 88.8%, Llama-70B: 57.3%) (Etori et al., 14 Mar 2025); in Sinhala, even leading models plateau at 67% (Claude 3.5 Sonnet) (Pramodya et al., 3 Sep 2025). Similar degradation is visible in HKMMLU (HK Traditional Chinese): the best open-source model achieves 69.1% (Qwen 2.5-72B), vs. 74.8% (DeepSeek-V3) (Cao et al., 4 May 2025).
  • Dialectal Gaps: In DialectalArabicMMLU, mean scores across five dialects are ~15pp lower than English, with further loss compared to MSA. Oracle priming for dialects further degrades accuracy (Altakrori et al., 31 Oct 2025).
  • Translation Artifacts: Systematic translation errors (name mistranslation, cultural context loss) can account for a nontrivial fraction of failing items; up to 88% of Spanish mismatches in “Miscellaneous” MMLU were attributed to translation, and post-editing recovered 41% of previously failed items (Plaza et al., 2024).
  • Cross-lingual knowledge barrier: Even when surface translation is perfect, models trained on English exhibit a ∼10–13pp accuracy drop when evaluated in a “mixup” format (languages mixed within a single question and its options), reflecting limited internal cross-lingual transfer (Chua et al., 2024); a sketch of this format follows below.
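
As a rough illustration of the “mixup” format, the following sketch assembles a mixed-language item from parallel translations of the same question, with the stem and each option possibly drawn from different languages. This is an interpretation for illustration only; the exact sampling procedure in Chua et al. (2024) may differ.

```python
import random

def make_mixup_item(parallel_item, languages, rng=random.Random(0)):
    """Build one mixed-language MMLU item.

    parallel_item: dict mapping language code -> {"question": str, "choices": [str, ...]},
    i.e., parallel translations of the same item. The question stem and each
    answer option are drawn from independently sampled languages.
    """
    q_lang = rng.choice(languages)
    question = parallel_item[q_lang]["question"]
    n_choices = len(parallel_item[q_lang]["choices"])
    choices = []
    for i in range(n_choices):
        c_lang = rng.choice(languages)          # option i may come from any language
        choices.append(parallel_item[c_lang]["choices"][i])
    return {"question": question, "choices": choices}
```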

4. Cross-Lingual Evaluation Methods and Predictive Modelling

  • Pivot-based Evaluation (MEXA): MEXA (Kargaran et al., 2024) assesses MMLU performance indirectly, measuring alignment between model hidden states for English and non-English sentences using parallel corpora. A cosine-based retrieval accuracy ($\mu$) is computed per layer, then aggregated (mean or max pooling). MEXA’s mean alignment score ($\mu_{\text{mean}}$) correlates with actual m-MMLU accuracy at $\rho = 0.95$ (FLORES-200) or $\rho = 0.88$ (Bible data) across models, supporting linear prediction of multilingual performance from latent representation similarity (a simplified sketch follows this list).
  • Data Quality and Selection: Classifier-driven filtering (MLP or FastText) of multilingual pretraining corpora, trained on knowledge-rich MCQ data, can yield LLMs that reach a given MMLU accuracy using as little as 15% of the training tokens (Messmer et al., 14 Feb 2025), showing that stringent selection of structured, knowledge-rich content can substitute for sheer data volume.
  • Trigger Prompts (PolyPrompt): Language-specific learned trigger tokens, dynamically prepended to input, boost held-out Global MMLU scores by 3.7–19.9% on 1B-parameter Llama variants, outstripping translation pipelines and naive prompting (Roll, 27 Feb 2025).
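
A simplified sketch of the alignment computation behind MEXA-style scores: per-layer cosine retrieval accuracy over parallel sentence embeddings, pooled across layers. The code assumes precomputed per-layer sentence embeddings and is illustrative rather than a reimplementation of MEXA; details such as how sentence embeddings are derived from token hidden states follow the original paper.

```python
import numpy as np

def retrieval_accuracy(eng_embs: np.ndarray, tgt_embs: np.ndarray) -> float:
    """Fraction of parallel pairs for which the English sentence's nearest
    target-language neighbour (by cosine similarity) is its true translation.
    eng_embs, tgt_embs: (n_sentences, hidden_dim) embeddings from one layer."""
    eng = eng_embs / np.linalg.norm(eng_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = eng @ tgt.T                                   # (n, n) cosine similarities
    return float((sims.argmax(axis=1) == np.arange(len(sims))).mean())

def mexa_style_score(eng_layers, tgt_layers, pooling="mean"):
    """Aggregate per-layer retrieval accuracies into a single alignment score
    (mean or max pooling), in the spirit of MEXA's mu_mean / mu_max.
    eng_layers, tgt_layers: lists of per-layer embedding matrices."""
    per_layer = [retrieval_accuracy(e, t) for e, t in zip(eng_layers, tgt_layers)]
    return max(per_layer) if pooling == "max" else sum(per_layer) / len(per_layer)
```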

5. Cultural and Dialectal Coverage

  • Cultural Sensitivity: Global-MMLU provides cultural sensitivity (CS) and culturally agnostic (CA) labels, revealing that ∼28% of MMLU questions require culture- or geography-specific knowledge, with >84% of CS questions being Western-oriented (Singh et al., 2024). Model rankings shift between the CA and CS subsets, indicating that translation-only evaluation overestimates global competence (a breakdown sketch follows this list).
  • Dialectal and Script Inclusion: Benchmarks such as TUMLU/TUMLU-mini (Isbarov et al., 16 Feb 2025), HKMMLU (Cao et al., 4 May 2025), and DialectalArabicMMLU (Altakrori et al., 31 Oct 2025) introduce non-Latin scripts (e.g., Sinhala, Uyghur Arabic), local dialects, and within-language orthographic variants, with dual-script evaluation for quality control.
  • Recommendations: SinhalaMMLU and TurkishMMLU highlight the necessity of sourcing questions from local curricula, involving native experts, and anchoring difficulty to real-world education levels—for genuine assessment beyond translationese (Pramodya et al., 3 Sep 2025, Yüksel et al., 2024).
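
The following minimal sketch shows how CS/CA-labelled results could be aggregated into per-model subset accuracies so that ranking shifts between culturally sensitive and culturally agnostic questions can be inspected. The record schema is an assumption for illustration, not Global-MMLU's release format.

```python
from collections import defaultdict

def subset_accuracies(records):
    """records: iterable of dicts with keys 'model', 'label' ('CS' or 'CA'),
    and 'correct' (bool). Returns per-model accuracy on each subset, so the
    CA vs. CS rankings discussed above can be compared directly."""
    totals = defaultdict(lambda: {"CS": [0, 0], "CA": [0, 0]})
    for r in records:
        hits_total = totals[r["model"]][r["label"]]
        hits_total[0] += int(r["correct"])    # correct answers in this subset
        hits_total[1] += 1                    # total questions in this subset
    return {
        model: {label: (hits / n if n else float("nan"))
                for label, (hits, n) in buckets.items()}
        for model, buckets in totals.items()
    }
```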

6. Persistent Limitations and Best Practices

  • Translation-induced confounds: Naive translation introduces proper-name, terminology, and context errors, systematically confounding benchmark interpretation (Plaza et al., 2024). Linguistically or culturally divergent items especially need human adaptation.
  • Low-Resource Language Gaps: Even the best LLMs exhibit 20–30pp performance drop in Swahili or Giriama relative to English, with scale effects unable to close the full gap (Xuan et al., 13 Mar 2025, Etori et al., 14 Mar 2025). Mixed-language fine-tuning (random chunk translation during further training) recovers ∼2pp of cross-lingual deficit (Chua et al., 2024).
  • Evaluation Recommendations:
  1. Prefer natively authored, balanced, and culturally validated benchmarks for under-represented languages.
  2. Where translation is unavoidable, combine LLM translation with professional post-editing, domain expert review, and spot human checks (minimum 50 samples/language).
  3. Publish macro-averaged per-language and per-domain scores to avoid population and domain bias.
  4. Report CA and CS breakdowns to avoid conflating Western cultural competence with universal reasoning.
  5. Leverage multi-script and dialectal variants, matching real-world linguistic practice.
  • Research Impact: Multilingual MMLU benchmarks have shifted the field toward more equitable evaluation of LLM capabilities worldwide, exposing resource, script, and cultural disparities previously hidden under English-centric metrics (Pomerenke et al., 11 Jul 2025). Consortium projects such as the AI Language Proficiency Monitor (Pomerenke et al., 11 Jul 2025), Global-MMLU (Singh et al., 2024), and domain-specific efforts (SinhalaMMLU (Pramodya et al., 3 Sep 2025), TurkishMMLU (Yüksel et al., 2024), TUMLU (Isbarov et al., 16 Feb 2025)) now make rigorous cross-language, cross-domain diagnostics possible on up to 200 languages.
  • Toward Equitable Evaluation: Emphasis is shifting toward releasing natively constructed, script- and dialect-balanced benchmarks and toward robust human-in-the-loop QA of translation pipelines. Such benchmarks are increasingly required for low-resource, script-diverse, and culture-specific contexts.
  • Open Problems: Addressing the “cross-lingual knowledge barrier,” scaling high-quality native question curation, and developing robust methods for dialectal and code-mixed contexts remain open research frontiers.

In sum, multilingual MMLU constitutes a comprehensive suite of protocols and resources for both breadth- and depth-oriented evaluation of LLMs across global linguistic and cultural diversity. Recent advances emphasize the necessity of moving beyond translation, focusing on native, culturally grounded content, and leveraging careful protocol engineering so that evaluation reflects true reasoning and knowledge transfer rather than translation proficiency or resource bias (Pramodya et al., 3 Sep 2025, Singh et al., 2024, Plaza et al., 2024).
