MMLU-Pro Benchmark

Updated 28 November 2025
  • MMLU-Pro is a reasoning-focused benchmark that filters trivial questions, expands option sets, and emphasizes multi-step reasoning in LLM evaluations.
  • It introduces refined metrics like Accuracy Drop, Prompt Robustness, and CoT Gain to capture deeper model reasoning and expose shortcut learning.
  • The benchmark extends to multilingual and domain-specific variants, ensuring robust, global evaluation of advanced language models.

The MMLU-Pro benchmark is a robust, reasoning-focused extension of the original Massive Multitask Language Understanding (MMLU) benchmark. MMLU-Pro addresses the rapid performance saturation and superficiality of MMLU by filtering trivial questions, expanding option sets, and systematically emphasizing complex, multi-step reasoning. It forms the foundation for a suite of high-discrimination language understanding benchmarks, including multilingual and domain-specialized variants, and has catalyzed new approaches for evaluating shortcut learning, anchoring bias, and the true reasoning capacities of LLMs across diverse domains and languages.

1. Motivation and Benchmark Philosophy

MMLU-Pro was created in response to critical limitations of the original MMLU. By mid-2024, state-of-the-art LLMs (e.g., GPT-4, GPT-4-Turbo) had reached 86–87% accuracy on MMLU, leaving little discriminative headroom and producing unstable results under prompt variation (Wang et al., 3 Jun 2024). Core weaknesses included:

  • Knowledge Recall Dominance: Four-option MCQs allowed models to excel through retrieval and surface heuristics, rarely requiring genuine reasoning.
  • Prompt Sensitivity: Leaderboard positions varied by 4–11% with natural prompt rewrites.
  • Trivial and Noisy Data: Many questions were too easy or contained annotation errors.

MMLU-Pro directly addresses these with three design decisions: (a) more challenging and reasoning-intensive questions, (b) expanded distractor sets (up to 10 per item), and (c) expert-driven filtering of trivial or ambiguous content (Wang et al., 3 Jun 2024). The overarching goal is to generate a stable and discriminative evaluation bedrock for measuring meaningful progress in LLM reasoning, not just knowledge retrieval.

2. Dataset Construction and Properties

MMLU-Pro’s construction entails removal of easy/noisy questions, infusion of new, reasoning-demanding items, and expansion of answer choices.

  • Filtering: Any MMLU item answered correctly by more than 4 of 8 open-source LLMs is discarded (~42% filtering rate), targeting items solvable through superficial cues (Wang et al., 3 Jun 2024); a minimal sketch of this rule appears at the end of this section.
  • Augmentation: New questions emphasizing chain-of-thought inference, multidimensional logic, and advanced numerical derivation are imported from STEM Website, TheoremQA, and SciBench, then converted to MCQ format via GPT-4-Turbo followed by manual review.
  • Option Expansion: Up to 10 plausible distractors are included per question, minimizing random-guess probability and increasing model uncertainty.
    • 83% of items retain all 10 options; mean options per item is 9.47.
  • Expert Review: Multi-stage annotation corrects answer keys, purges non-verbal questions (tables, images), and eliminates false negatives flagged by Gemini-1.5-Pro.
  • Domain Coverage: 14 academic/professional domains: mathematics, physics, chemistry, engineering, biology, computer science, economics, business, health, law, psychology, philosophy, history, and "other".
Property | MMLU | MMLU-Pro
# Questions | 13,937 | 12,032
# Domains | 57 | 14 ("Pro" core)
# Options per item | 4 | Up to 10
Focus | Fact recall | Reasoning

Key conceptual types in MMLU-Pro include multi-step numeric derivation, theorem application, conceptual inference, and complex legal or philosophical logic.
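
The difficulty filter described above can be illustrated with a short sketch. The following is a minimal, hypothetical Python implementation of the rule; the record layout and field names are assumptions for illustration, not the released pipeline format.

```python
# Illustrative sketch of the MMLU-Pro difficulty filter: an item is discarded
# when more than 4 of the 8 open-source filter models answer it correctly.
# The record format ("correct_by" vectors) is assumed here for illustration.

def keep_item(correct_by_model: list[bool], max_correct: int = 4) -> bool:
    """Keep the item only if at most `max_correct` of the filter models solve it."""
    return sum(correct_by_model) <= max_correct

items = [
    {"question": "Q1", "correct_by": [True] * 7 + [False]},         # too easy: dropped
    {"question": "Q2", "correct_by": [True, False, True, False,
                                      False, False, True, False]},  # kept
]

filtered = [item for item in items if keep_item(item["correct_by"])]
print([item["question"] for item in filtered])  # ['Q2']
```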

3. Evaluation Metrics and Scoring Regimes

Beyond standard accuracy, MMLU-Pro and its extensions introduce metrics to expose reasoning and bias:

  • Accuracy: Proportion of correctly answered items in zero-shot, multiple-choice setups.
  • Accuracy Drop ($\Delta\mathrm{Accuracy} = \mathrm{Accuracy}_{\text{MMLU}} - \mathrm{Accuracy}_{\text{MMLU-Pro}}$): Quantifies increased challenge.
  • Prompt Robustness: Maximum score change across 24 prompt variants.
  • Chain-of-Thought (CoT) Gain: Difference in performance under CoT prompting versus direct answering; CoT proves essential on MMLU-Pro but yields little benefit on MMLU.
  • Multilingual/Domain Extensions: Accuracy micro/macro-averaged across language splits (Xuan et al., 13 Mar 2025, KJ et al., 27 Jan 2025).

Summary table of key formulas:

Metric | Formula
Accuracy | $\mathrm{Accuracy} = \frac{\#\,\mathrm{correct}}{\#\,\mathrm{total}}$
Prompt Sensitivity | $\mathrm{Sensitivity} = \max_{i,j} |\mathrm{Acc}_i - \mathrm{Acc}_j|$
CoT Gain | $\Delta_{\mathrm{CoT}}(m) = \mathrm{Acc}_{m,\mathrm{CoT}} - \mathrm{Acc}_{m,\mathrm{DA}}$
Macro Average | $\mathrm{MacroAcc} = \frac{1}{L}\sum_{i=1}^{L} \mathrm{Acc}_i$
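
These metrics are simple to compute from per-model results. Below is a minimal Python sketch; the numbers and function names are hypothetical, chosen only to mirror the formulas above.

```python
# Illustrative computation of the evaluation metrics tabulated above.

def accuracy(n_correct: int, n_total: int) -> float:
    return n_correct / n_total

def prompt_sensitivity(accs_over_prompts: list[float]) -> float:
    # Maximum score change across prompt variants (24 variants in MMLU-Pro).
    return max(accs_over_prompts) - min(accs_over_prompts)

def cot_gain(acc_cot: float, acc_direct: float) -> float:
    # Gain from chain-of-thought prompting over direct answering for one model.
    return acc_cot - acc_direct

def macro_average(per_split_accs: dict[str, float]) -> float:
    # Unweighted mean over language (or domain) splits.
    return sum(per_split_accs.values()) / len(per_split_accs)

# Hypothetical values for a single model:
print(accuracy(8710, 12032))                          # overall accuracy
print(prompt_sensitivity([0.71, 0.72, 0.70, 0.72]))   # ~0.02
print(cot_gain(acc_cot=0.726, acc_direct=0.535))      # CoT gain
print(macro_average({"en": 0.72, "de": 0.65, "sw": 0.41}))
```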

4. Experimental Findings and Discriminative Power

Multiple empirical results substantiate MMLU-Pro’s value:

  • Difficulty Increase: The leading model (GPT-4o) drops from 88.7% (MMLU) to 72.6% (MMLU-Pro), and other models lose 16–33 percentage points (a worked figure follows this list). This sharper spread directly resolves leaderboard saturation (Wang et al., 3 Jun 2024).
  • Chain-of-Thought Superiority: On MMLU-Pro, CoT delivers +19.1% (GPT-4o), +15.3% (GPT-4-Turbo), and similar double-digit gains for open LLMs. On MMLU, CoT is often flat or negative (–3% to +1.5%), underscoring that MMLU-Pro requires deeper reasoning.
  • Improved Robustness: Sensitivity to prompt template shrinks from ~5% (MMLU) to ~2% (MMLU-Pro).
  • Discriminative Gap: For top models, accuracy gaps are much wider—GPT-4o vs. GPT-4-Turbo gap is ~9% on MMLU-Pro (versus ~1% on MMLU), increasing contrast between close-performing LLMs.
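
Applying the Accuracy Drop metric from Section 3 to the GPT-4o figures above:

$\Delta\mathrm{Accuracy}_{\text{GPT-4o}} = \mathrm{Accuracy}_{\text{MMLU}} - \mathrm{Accuracy}_{\text{MMLU-Pro}} = 88.7\% - 72.6\% = 16.1$ percentage points.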

5. Restricting Shortcut Learning and Advancing Benchmark Methodology

Extensions and related work demonstrate the benchmark's flexibility and capacity for nuanced analysis of LLM reasoning:

MMLU-Pro+ (Taghanaki et al., 3 Sep 2024):

  • Introduces multi-correct answer items by augmenting two-thirds of MMLU-Pro with "Both X and Y are correct" options.
  • Shortcut Selection Ratio (SSR): Measures how often models stick to their original (possibly wrong) single-answer choice even when a "Both" correct option is offered.
    • $\mathrm{SSR}_{\mathrm{wrong}} = \frac{N_{\mathrm{stayed\_wrong}}}{N_{\mathrm{total\_TPP}}}$
    • $\mathrm{SSR}_{\mathrm{partial}} = \frac{N_{\mathrm{stayed\_partial}}}{N_{\mathrm{total\_TPP}}}$
  • Correct Pair Identification Ratio (CPIR): Measures the ability to distinguish true answer pairs from false ones:
    • $\mathrm{CPIR} = \frac{N_{\mathrm{TPP}}}{N_{\mathrm{PFPP}} + N_{\mathrm{CFPP}}}$
  • Results highlight major reasoning gaps and anchoring biases in current SOTA LLMs.
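
A minimal Python sketch of SSR and CPIR as defined above, using hypothetical counts; the variable names are illustrative, not drawn from the MMLU-Pro+ release.

```python
# Illustrative computation of the Shortcut Selection Ratio (SSR) and the
# Correct Pair Identification Ratio (CPIR) from hypothetical counts.

def ssr(n_stayed: int, n_total_tpp: int) -> float:
    # Fraction of true-positive-pair (TPP) items where the model sticks with its
    # original single-answer choice instead of the "Both ... are correct" option.
    return n_stayed / n_total_tpp

def cpir(n_tpp: int, n_pfpp: int, n_cfpp: int) -> float:
    # True pairs identified, relative to falsely selected pair options
    # (partially false and completely false pairs).
    return n_tpp / (n_pfpp + n_cfpp)

# Hypothetical counts from a paired evaluation run:
print(ssr(n_stayed=120, n_total_tpp=400))        # SSR_wrong (or SSR_partial)
print(cpir(n_tpp=400, n_pfpp=150, n_cfpp=250))
```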

Base-Rate and Label Bias Handling (Moore et al., 17 Jun 2024):

  • The Nvr-X-MMLU protocol (also proposed as “MMLU-Pro”) systematically permutes correct answer positions to measure and nullify base-rate token guessing biases, using the minimum accuracy across splits as the definitive score. This exposes models’ dependence on label-position heuristics and corrects for conflation between test-taking strategy and reasoning ability.
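
A minimal sketch of the permute-and-take-the-minimum idea is shown below. The `evaluate` callable and the item layout are assumptions for illustration; the actual Nvr-X-MMLU protocol is defined in Moore et al. (17 Jun 2024).

```python
# Illustrative sketch of position-permuted scoring: rotate which option slot
# holds the gold answer, evaluate each permuted split, and report the minimum
# accuracy as the definitive score.
from typing import Callable

def rotate_gold(item: dict, shift: int) -> dict:
    """Rotate the option list left by `shift`, updating the gold answer index."""
    options = item["options"]
    n = len(options)
    return {
        **item,
        "options": options[shift:] + options[:shift],
        "answer_index": (item["answer_index"] - shift) % n,
    }

def min_accuracy_over_permutations(evaluate: Callable[[list[dict]], float],
                                   items: list[dict]) -> float:
    n_options = len(items[0]["options"])
    accs = [evaluate([rotate_gold(it, s) for it in items]) for s in range(n_options)]
    return min(accs)  # the worst-case split exposes label-position guessing biases
```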

Synthetic Benchmark Generation (Yuan et al., 2 Feb 2025):

  • The BenchMaker approach creates cost-effective, LLM-generated benchmarks that mirror the human-written MMLU-Pro in discriminative power (Pearson $r = 0.967$ for model accuracy rankings) and robustness ($r = 0.982$ under rephrased assessment demands), supporting rapid, unbiased expansion of benchmark suites for tracking LLM progress.
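
Agreement of this kind between two benchmarks can be quantified with an ordinary Pearson correlation over per-model accuracies; a minimal sketch using SciPy with hypothetical accuracy vectors:

```python
# Illustrative benchmark-agreement check: Pearson correlation between per-model
# accuracies on a human-written benchmark and its synthetic counterpart.
from scipy.stats import pearsonr

human_bench = [0.726, 0.635, 0.557, 0.489]   # hypothetical per-model accuracies
synthetic   = [0.701, 0.612, 0.540, 0.470]   # hypothetical synthetic-benchmark accuracies

r, p = pearsonr(human_bench, synthetic)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```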

6. Multilingual and Specialized Variants

MMLU-Pro’s structure supports native-language and domain-adapted benchmarks:

IndicMMLU-Pro (KJ et al., 27 Jan 2025):

  • Adapts the full MMLU-Pro structure into nine Indic languages via aligned pipelines (IndicTrans2, back-translation validation, and expert review).
  • Supplies culturally and linguistically validated tasks, maintaining statistical parity and quality thresholds (chrF++, BLEU, TER).
  • Baseline results show SOTA models (GPT-4o, multilingual LLMs) outperform traditional Indic and multilingual BERTs, with persistent performance gaps that reflect model design and data distribution.
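
Translation-quality thresholds like these can be checked with standard machine-translation metrics. Below is a minimal sketch using the sacrebleu library on a hypothetical source/back-translation pair; the example sentences and any thresholds are assumptions, not the published pipeline.

```python
# Illustrative back-translation validation: score a back-translated question
# against its English source with chrF++ (CHRF with word_order=2), BLEU, and TER.
from sacrebleu.metrics import BLEU, CHRF, TER

sources = ["Which theorem guarantees a fixed point for a continuous map on a compact convex set?"]
back_translations = ["Which theorem guarantees a fixed point for continuous maps on compact convex sets?"]

refs = [sources]  # sacrebleu expects a list of reference streams
chrf_pp = CHRF(word_order=2).corpus_score(back_translations, refs)
bleu = BLEU().corpus_score(back_translations, refs)
ter = TER().corpus_score(back_translations, refs)

print(chrf_pp, bleu, ter)  # items scoring below the chosen thresholds get re-reviewed
```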

MMLU-ProX (Xuan et al., 13 Mar 2025):

  • Translates MMLU-Pro into 13 languages (later extended to 29) using multi-stage LLM translation plus expert validation.
  • Ensures consistent question semantics and technical accuracy, enabling direct model comparisons across language groups.
  • Empirical findings reveal large (20–35%) accuracy gaps between high- and low-resource languages, highlighting LLMs’ limitations in global generalization.
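
Cross-lingual results of this kind are typically reported as micro- and macro-averaged accuracies over the language splits (see Section 3). A minimal sketch with hypothetical per-language counts:

```python
# Illustrative micro vs. macro averaging over language splits: micro pools all
# items together, macro weights every language equally regardless of size.

splits = {  # hypothetical (n_correct, n_total) per language
    "en": (8710, 12032),
    "sw": (2100, 5000),
    "te": (2600, 5200),
}

micro = sum(c for c, _ in splits.values()) / sum(t for _, t in splits.values())
macro = sum(c / t for c, t in splits.values()) / len(splits)

# A gap between micro and macro averages reveals uneven per-language performance.
print(f"micro = {micro:.3f}, macro = {macro:.3f}")
```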

Mobile-MMLU-Pro (Bsharat et al., 26 Mar 2025):

  • Distills high-difficulty, mobile-relevant items via two-stage model ensemble filtering and strong-model consensus.
  • Adds on-device deployment metrics (latency, energy, memory) and enforces answer-order invariance, targeting realistic mobile-usage scenarios while maintaining discriminative difficulty.

7. Impact, Limitations, and Future Directions

MMLU-Pro and its descendants are widely adopted by the LLM research community as the principal reference for “hard” language understanding evaluation. Key aspects include:

  • Resolution of Leaderboard Stagnation: Prevents premature ceiling effects found in MMLU, sharpening performance gaps for tracking progress.
  • Greater Faith in Model Advances: Chain-of-thought is now reliably beneficial, shifting the evaluation focus from factual surface retrieval to stepwise, compositional reasoning.
  • Richness of Evaluation: Metrics such as SSR and CPIR reveal not only accuracy but the model’s flexibility, anchoring bias, and ability to break from shortcuts.
  • Dataset Stability: Robustness to prompt style and systematic removal of noisy/trivial questions support consistent rankings and fair benchmarks.
  • Room for Growth: SOTA models score ≲73% (English); headroom remains for substantial development.
  • Ecosystem Expansion: MMLU-Pro structure propagates into language-specific (IndicMMLU-Pro), mobile-optimized (Mobile-MMLU-Pro), and global/multilingual (MMLU-ProX) benchmarks.

Key limitations and open challenges include addressing residual label and answer-length biases, further improving multi-language alignment, and maintaining regular dataset refreshes to forestall overfitting. Future MMLU-Pro extensions advocate more systematic inclusion of multi-answer items, adversarial distractor generation, and ensemble scoring across prompt and context variations (Taghanaki et al., 3 Sep 2024).

