Measuring MMLU: Assessing Language Models
- MMLU is a benchmark that measures LLMs’ general knowledge and reasoning across diverse fields using standardized multiple-choice questions.
- It employs zero-shot, few-shot, and chain-of-thought prompting to evaluate models’ performance, highlighting gaps relative to human experts.
- Recent extensions include local-language adaptations and contamination control techniques, enhancing cultural inclusivity and benchmark integrity.
Massive Multitask Language Understanding (MMLU) quantifies the breadth and depth of a model’s general knowledge and reasoning ability over a wide range of expert-level domains and disciplines. An MMLU evaluation typically presents a series of multiple-choice questions (MCQs) drawn from highly varied subject matter—such as humanities, STEM, social sciences, law, and medicine—to probe the factual, procedural, and reasoning competencies of LLMs under standardized testing conditions. Since its introduction, MMLU has become the canonical metric for tracking progress and competitive ranking in LLM development. In response to its widespread adoption, the core methodologies, design choices, weaknesses, and extensions (including rigorous contamination control and multilingual adaptation) have become active topics of technical inquiry and benchmark innovation.
1. Foundational Principles and Benchmark Construction
The original MMLU benchmark, developed by Hendrycks et al., comprises 57 distinct tasks (subjects) spanning humanities, social sciences, STEM fields, and other domains, totaling over 15,000 rigorously curated multiple-choice questions (Hendrycks et al., 2020). Each subject typically provides at least 100 held-out test items, with exactly four answer choices per question. Question pools are sourced from official practice exams (e.g., AP tests, GRE subject exams, USMLE, bar exam), university-level assessments, and textbook material, ensuring a mix of rote knowledge, interdisciplinary application, and multi-step reasoning items.
For every test question $x_i$, with gold answer $y_i$ and model prediction $\hat{y}_i$, accuracy is defined as

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{y}_i = y_i\right],$$

where $N$ is the total number of test instances. Macro-averaged accuracy across subjects is given by

$$\mathrm{Acc}_{\mathrm{macro}} = \frac{1}{S}\sum_{s=1}^{S}\mathrm{Acc}_s,$$

with $\mathrm{Acc}_s$ the per-subject accuracy and $S$ the number of subjects.
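As a minimal illustration of these two aggregation schemes (the function and variable names below are ours, not from any official MMLU harness), micro- and macro-averaged accuracy can be computed from per-item predictions grouped by subject:

```python
from collections import defaultdict

def micro_macro_accuracy(records):
    """records: iterable of (subject, gold_label, predicted_label) tuples.

    Returns (micro, macro): micro pools all items into one accuracy;
    macro is the unweighted mean of per-subject accuracies.
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [n_correct, n_total]
    for subject, gold, pred in records:
        per_subject[subject][0] += int(pred == gold)
        per_subject[subject][1] += 1

    n_correct = sum(c for c, _ in per_subject.values())
    n_total = sum(n for _, n in per_subject.values())
    micro = n_correct / n_total
    macro = sum(c / n for c, n in per_subject.values()) / len(per_subject)
    return micro, macro

# Toy example: two subjects, three items.
records = [
    ("high_school_biology", "A", "A"),
    ("high_school_biology", "C", "B"),
    ("professional_law",    "D", "D"),
]
print(micro_macro_accuracy(records))  # (0.666..., 0.75)
```

Macro-averaging weights every subject equally regardless of its item count, which is why the two numbers diverge whenever subjects differ in size or difficulty.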
Recent local-language adaptations (e.g. CMMLU (Li et al., 2023), MMCU (Zeng, 2023), ArabicMMLU (Koto et al., 2024), TurkishMMLU (Yüksel et al., 2024), SinhalaMMLU (Pramodya et al., 3 Sep 2025), GreekMMLU (Zhang et al., 5 Feb 2026), BnMMLU (Joy, 25 May 2025), TUMLU (Isbarov et al., 16 Feb 2025), and Global-MMLU (Singh et al., 2024)) extend this framework by sourcing native, curriculum-aligned MCQs annotated and verified by subject-matter experts, or by applying rigorous, culturally adapted translations with community review.
2. Evaluation Protocols and Scoring Paradigms
The MMLU evaluation protocol relies on zero-shot and few-shot prompting—no gradient updates or train-time exposure to the test set—mirroring practical deployment scenarios and highlighting the model's out-of-distribution generalization.
- Zero-shot: The model is prompted with only the question stem and possible answers, with a domain-appropriate instruction (e.g., "Please select the correct answer:") and required to emit a single letter or label.
- Few-shot: A small number of complete example Q&A pairs (drawn from a separate development set) are prepended; five is typical. This probes in-context learning and prompt conditioning (a minimal prompt-construction sketch follows this list).
- Chain-of-Thought (CoT) prompting: For select variants (e.g., MMLU-Pro (Wang et al., 2024), TurkishMMLU (Yüksel et al., 2024), TUMLU (Isbarov et al., 16 Feb 2025)), annotated step-by-step solutions or rationales are appended to encourage multi-hop reasoning.
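To make the few-shot protocol concrete, the following sketch assembles a 5-shot MMLU-style prompt. The exact instruction wording and answer-extraction logic vary across evaluation harnesses, so this template is illustrative rather than canonical:

```python
CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_question(question, choices, answer=None):
    """Render one MCQ in a typical MMLU prompt layout."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICE_LETTERS, choices)]
    # Few-shot exemplars include the gold letter; the test item leaves it blank.
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(subject, dev_examples, test_item, k=5):
    """Prepend k dev-set exemplars (with answers) to the unanswered test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(
        format_question(q, choices, answer=gold) for q, choices, gold in dev_examples[:k]
    )
    test = format_question(test_item["question"], test_item["choices"])
    return header + shots + "\n\n" + test
```

The model's continuation is then parsed for a single answer letter and scored by exact match against the gold label; in the zero-shot setting, the exemplar block is simply omitted.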
Scoring follows strict exact-match evaluation on single- or multiple-answer MCQs:
- An answer is counted as correct only if the predicted choice(s) exactly match the gold label set; no partial credit is awarded.
- For multi-select questions (less common), the unordered set of predicted choices must match the gold set exactly (see the sketch below).
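A minimal sketch of this exact-match rule, including the unordered-set comparison for multi-select items (helper names are illustrative):

```python
def _as_set(x):
    """Normalize a single letter or an iterable of letters to a set."""
    return {x} if isinstance(x, str) else set(x)

def exact_match(predicted, gold):
    """Strict exact-match scoring: the predicted choice(s) must equal the
    gold label set exactly; no partial credit is awarded."""
    return int(_as_set(predicted) == _as_set(gold))

assert exact_match("B", "B") == 1
assert exact_match(["A", "C"], ["C", "A"]) == 1   # multi-select: order-insensitive
assert exact_match(["A"], ["A", "C"]) == 0        # missing a gold choice scores zero
```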
Recent extensions propose additional metrics to capture robustness and calibration:
- Micro- vs macro-averaging: Global accuracy (micro) vs. unweighted average over subject accuracies (macro).
- Accuracy drops under shuffling: Measuring performance decline when answer options are permuted, diagnosing label-position bias (Gupta et al., 2024).
- Chance-adjusted metrics: Rescale accuracy to account for pure chance, e.g., $\mathrm{Acc}_{\mathrm{adj}} = \frac{\mathrm{Acc} - 1/k}{1 - 1/k}$ for questions with $k$ answer choices (Gupta et al., 2024).
- Multi-shuffle robustness: a binary indicator that is 1 only if the model answers correctly under all $k$ independent shuffles of the answer options, and 0 otherwise (a sketch of these metrics follows this list).
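The chance-adjusted and multi-shuffle metrics above can be computed from a per-item correctness matrix over $k$ independent answer-order shuffles; the sketch below assumes such a matrix has already been collected (all names are illustrative):

```python
import numpy as np

def chance_adjusted_accuracy(acc, n_choices=4):
    """Rescale raw accuracy so random guessing maps to 0 and perfect accuracy to 1."""
    chance = 1.0 / n_choices
    return (acc - chance) / (1.0 - chance)

def shuffle_metrics(correct):
    """correct: boolean array of shape (n_items, n_shuffles), one column per
    independent permutation of the answer options for the same items."""
    correct = np.asarray(correct, dtype=bool)
    per_shuffle_acc = correct.mean(axis=0)    # accuracy under each shuffle
    robustness = correct.all(axis=1).mean()   # fraction correct under every shuffle
    return per_shuffle_acc.mean(), per_shuffle_acc.std(), robustness

# Example: 3 items evaluated under 2 answer-order shuffles.
correct = [[1, 1], [1, 0], [0, 0]]
print(chance_adjusted_accuracy(0.5))  # 0.333... for 4-choice questions
print(shuffle_metrics(correct))       # (0.5, 0.166..., 0.333...)
```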
3. Empirical Findings and Performance Trends
Early GPT-3 class models (175B parameters) achieved micro-average accuracy of roughly 44% (few-shot) on MMLU, well above the 25% random baseline yet far below estimated human-expert accuracy (Hendrycks et al., 2020). Closed-source models such as GPT-4o now reach 88–92% (Dunham et al., 2024), but performance remains highly variable per domain:
- Strongest domains: broad factual recall (e.g., History, Biology, General Knowledge); weaker on STEM procedural tasks, calculation-heavy topics, and areas demanding formal legal, mathematical, or symbolic reasoning.
- Per-task performance exhibits high variance: law, ethical reasoning, and multi-step physics or mathematics often hover at or just above chance in open-source or non-instruction-tuned models (Hendrycks et al., 2020, Zeng, 2023, Koto et al., 2024).
Performance on local-language MMLU variants is systematically lower:
- On MMCU (Chinese): Top zero-shot accuracy per domain ranges from 23.9% (Law) to 51.2% (Medicine), with overall gaps >18% between best and worst models (Zeng, 2023).
- ArabicMMLU: Even the best open-source Arabic-centric model (Jais-chat 30B) achieves only 62.3% accuracy (random baseline: 29%) (Koto et al., 2024).
- BnMMLU (Bengali): Open-source models trail top proprietary LLMs by 15–20%, with reasoning and procedural categories hardest (Joy, 25 May 2025).
- GreekMMLU, TurkishMMLU, TUMLU, SinhalaMMLU, and LAG-MMLU all report analogous results: performance drops with increased cultural specificity and decreased resource status, and native human annotation is critical for validity (Zhang et al., 5 Feb 2026, Yüksel et al., 2024, Isbarov et al., 16 Feb 2025, Pramodya et al., 3 Sep 2025, Etori et al., 14 Mar 2025).
4. Robustness, Contamination, and Benchmark Integrity
Multiple studies have exposed both linguistic and methodological artifacts that compromise benchmark fidelity:
- Contamination: Many LLMs are trained on republished MMLU data, leading to memorization rather than genuine understanding. MMLU-CF addresses this by procedural decontamination: question rephrasing, shuffling, distractor replacement, and withholding a private test set (Zhao et al., 2024); a generic overlap-check sketch follows this list. GPT-4o's accuracy drops from 88.0% (original MMLU) to 73.4% (MMLU-CF).
- Prompt Sensitivity and Label-position Bias: Models demonstrate significant performance drops (6–27%) when answer order is shuffled, particularly smaller or non-instruction-tuned models (Gupta et al., 2024). To mitigate such answer-position lottery effects, best practice includes reporting average accuracy and variance across shuffles, as well as chance-corrected performance.
- Dataset Noise: Manual re-annotation (MMLU-Redux) shows that up to 9% of MMLU questions are plagued by ambiguity, label error, or multiple correct answers—up to 57% error rate in Virology and double-digit error rates in Chemistry, Logical Fallacies, and Law (Gema et al., 2024). Excluding or correcting these yields 2–5 point boosts in exact-match accuracy and affects model ranking.
- Translation Artifacts and Cultural Bias: Automatic translation degrades accuracy (3 point loss is typical) and introduces content- and culture-specific errors, especially in disciplines with idiomatic, symbolic, or region-bound knowledge (e.g., US Foreign Policy questions translated to Spanish) (Plaza et al., 2024, Singh et al., 2024). Global-MMLU implements post-editing, native speaker calibration, and explicit annotation of culturally sensitive (CS) versus agnostic (CA) questions—28% of MMLU is labeled as CS, predominantly Western-centric (Singh et al., 2024).
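As an illustration of the style of check used to flag potential contamination, the sketch below implements a generic word-level n-gram overlap heuristic between a benchmark question and training-corpus chunks. This is a simplified, assumed approach for exposition; it is not the MMLU-CF decontamination pipeline described above:

```python
def ngram_set(text, n=8):
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_overlap(question, corpus_chunks, n=8):
    """Fraction of the question's n-grams that appear verbatim in any corpus chunk.

    A high overlap suggests the item (or a close paraphrase) may have been seen
    during training and should be rephrased, replaced, or held out.
    """
    q_grams = ngram_set(question, n)
    if not q_grams:
        return 0.0
    corpus_grams = set().union(*(ngram_set(chunk, n) for chunk in corpus_chunks))
    return len(q_grams & corpus_grams) / len(q_grams)
```

In practice the threshold for flagging an item, the n-gram length, and the tokenization scheme are all benchmark-specific design choices.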
5. Extensions, Localizations, and Methodological Innovations
To accommodate the limits of classic MMLU, variant benchmarks have been introduced:
- MMLU-Pro: Increases the option count from 4 to 10, eliminates trivial and noisy items, and injects reasoning-focused questions; model accuracies drop sharply relative to MMLU (reported drops of roughly 16–33 percentage points), prompt-induced score variance shrinks from about 4–5% to around 2%, and chain-of-thought prompting delivers double-digit gains, confirming genuine reasoning demand (Wang et al., 2024).
- Global-MMLU: Translates and human-edits MMLU to 42 languages, with rigorous annotation of translation quality and cultural/geographic dependence. Model rankings shift substantially when comparing CS and CA subsets. CS items are heavily concentrated in humanities, social sciences, and Western-centric topics; STEM remains largely CA (Singh et al., 2024).
- Language- and Culture-Specific MMLUs: MMCU, CMMLU, GreekMMLU, TurkishMMLU, ArabicMMLU, BnMMLU, TUMLU, SinhalaMMLU, and LAG-MMLU demonstrate that natively authored, curriculum-aligned, and locally validated MCQs are essential to measure true model proficiency in non-English domains (Zeng, 2023, Li et al., 2023, Zhang et al., 5 Feb 2026, Yüksel et al., 2024, Koto et al., 2024, Joy, 25 May 2025, Isbarov et al., 16 Feb 2025, Pramodya et al., 3 Sep 2025, Etori et al., 14 Mar 2025).
- Scaling and Prompting Strategies: Increasing parameter count consistently improves macro-accuracy; five-shot and chain-of-thought prompting benefit larger models and reasoning-centric domains, but can degrade performance in instruction-tuned models already aligned to the test format (Li et al., 2023, Wang et al., 2024, Zhang et al., 5 Feb 2026).
6. Benchmark Limitations, Ongoing Challenges, and Future Directions
Persistent limitations of the MMLU methodology include:
- Dataset Noise and Ground Truth Flaws: Systematic error annotation and MMLU-Redux are now required for high-fidelity evaluations, as label noise can obscure or invert comparative rankings (Gema et al., 2024).
- Contamination: Widespread training set leakage inflates headline metrics and confounds subtle model comparisons; MMLU-CF and closed-source test sets have become critical for assessing true generalization (Zhao et al., 2024).
- Translation and Cultural Misalignment: Benchmarks created by machine translation misrepresent both linguistic and cultural realities; full-scale adaptation or expert post-editing is necessary for robust cross-lingual evaluation (Plaza et al., 2024, Singh et al., 2024).
- Prompt and Option Sensitivity: Model performance remains volatile under minor prompt or answer-order shifts; metrics like prompt variance and multi-shuffle robustness are now standard (Gupta et al., 2024, Wang et al., 2024).
- Domain Imbalance and Western-centricity: Global MMLU shows that 28% of questions are culturally sensitive, 86.5% of which are Western-centric, limiting their diagnostic value for non-Western LLMs (Singh et al., 2024).
Best practices now demand:
- Rigorous dataset annotation, error correction, and continuous community curation (e.g., MMLU-Redux).
- Decontaminated, closed-source (or rotating) test splits with public validation and cross-validation protocols (e.g., MMLU-CF).
- Culturally and linguistically native question pools for new languages, rather than naïve translation (e.g., MMCU, GreekMMLU, TUMLU, SinhalaMMLU).
- Reporting of accuracy by subject, domain, cognitive type, and robustness to prompt/option perturbation.
7. Significance and Outlook
Measuring Massive Multitask Language Understanding via robust, contamination-controlled, and culturally-inclusive benchmarks remains central for tracking the epistemic and reasoning frontiers of LLMs. Current “headline” accuracy numbers must be interpreted with careful attention to contamination, dataset noise, regional and cultural alignment, and prompt sensitivity. As models approach saturation on standard MMLU, advances in benchmark difficulty (MMLU-Pro), dataset rigor (MMLU-Redux), and cross-lingual, curriculum-grounded expansion (Global MMLU, MMCU, ArabicMMLU, TurkishMMLU, TUMLU, GreekMMLU, BnMMLU, SinhalaMMLU, LAG-MMLU) are setting the agenda for the next phase of evaluation science (Hendrycks et al., 2020, Zeng, 2023, Li et al., 2023, Wang et al., 2024, Gupta et al., 2024, Singh et al., 2024, Zhao et al., 2024, Gema et al., 2024, Joy, 25 May 2025, Isbarov et al., 16 Feb 2025, Pramodya et al., 3 Sep 2025, Koto et al., 2024, Yüksel et al., 2024, Zhang et al., 5 Feb 2026, Etori et al., 14 Mar 2025). The future of MMLU measurement will require continuous refinement of benchmark integrity, expansion to underrepresented languages and domains, and more nuanced metrics that capture not just breadth, but the precision, cultural fluency, and reasoning depth of next-generation LLMs.