MMLU-Pro Dataset Benchmark
- MMLU-Pro is an advanced multi-task language understanding benchmark that demands multi-step deductive, quantitative, and logical reasoning.
- It expands the answer set from four to ten options using GPT-4-Turbo generated distractors and rigorous multi-tier filtering to remove trivial content.
- Empirical results reveal significant accuracy drops compared to earlier benchmarks, highlighting its increased discriminative power and diagnostic stability.
The MMLU-Pro dataset is a reasoning-oriented multi-task language understanding benchmark designed to supersede earlier datasets such as MMLU in assessing the capabilities of large language models (LLMs) across a broad range of domains and reasoning styles. It was created to mitigate the saturation seen on earlier benchmarks, eliminate trivial content, and demand multi-step deductive, quantitative, abductive, and theorem-based reasoning from models. MMLU-Pro gains discriminative power through question reformulation, rigorous filtering, and an expansion of the answer set from four to ten options (Wang et al., 2024).
1. Dataset Composition and Construction
MMLU-Pro is built as an extension and reengineering of the original MMLU benchmark. It merges 57 fine-grained subject areas spanning STEM, humanities, social sciences, and specialized disciplines into 14 broader disciplines: Mathematics, Physics, Chemistry, Law, Engineering, Psychology, History, Economics, Biology, Business, Computer Science, Health, Philosophy, and “Other.” The final corpus contains 12,032 validated multiple-choice questions.
The dataset incorporates three major new sources to diversify and raise the challenge level:
- STEM Website: 4,083 college-level, reasoning-rich problems with detailed solutions.
- TheoremQA: 598 questions emphasizing theorem-driven reasoning.
- SciBench: 541 advanced science exam items focusing on chain-of-thought problem solving.
Triviality and noise were addressed via a multi-tier filtering process. First, MMLU questions answered correctly by at least five of eight strong open-source LLMs (62.5% or more of the screening models) were eliminated, removing 42.2% of the base set. A subsequent two-stage expert review removed ambiguous, context-dependent, ill-posed, or otherwise unreliable items and eliminated false-negative distractors, that is, options labeled incorrect that are in fact valid answers.
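The first filtering step can be illustrated with a minimal sketch. The function name, the data layout (one boolean per screening model), and the toy question IDs are assumptions made for illustration; they are not taken from the benchmark's actual tooling.

```python
from typing import Dict, List


def filter_easy_questions(
    correctness: Dict[str, List[bool]],
    max_models_correct: int = 4,
) -> List[str]:
    """Return the IDs of questions that survive the difficulty filter.

    `correctness` maps a question ID to one boolean per screening model
    (True means that model answered the question correctly).
    """
    kept = []
    for qid, results in correctness.items():
        # With eight models and a cap of four, a question solved by
        # five or more models is treated as too easy and dropped.
        if sum(results) <= max_models_correct:
            kept.append(qid)
    return kept


# Toy data: eight hypothetical screening models per question.
toy = {
    "q1": [True] * 5 + [False] * 3,  # solved by 5 of 8: removed
    "q2": [True] * 4 + [False] * 4,  # solved by 4 of 8: kept
    "q3": [False] * 8,               # solved by none: kept
}
print(filter_easy_questions(toy))  # ['q2', 'q3']
```

The threshold is exposed as a parameter so that a stricter or looser cut can be tried on the same correctness matrix.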
A critical innovation is the expansion of the answer set from four to ten options per question, reducing the random-guess baseline to 10%. For 83% of items, GPT-4-Turbo was used to generate six additional high-plausibility distractors, each validated through human annotation.
2. Reasoning Types and Question Taxonomy
MMLU-Pro significantly elevates the proportion and complexity of reasoning demanded. It encompasses:
- Quantitative Reasoning: Multi-step calculations typical of undergraduate STEM curricula.
- Theorem-Based Reasoning: Requiring the application and synthesis of formal mathematical theorems.
- Abductive and Causal Reasoning: Inferring mechanisms or explanations from observed evidence.
- Logical Deduction and Definition Application: Common in law and philosophy.
- Scientific Problem Solving: Curriculum-aligned, typically requiring explicit reasoning chains.
As a result, chain-of-thought (CoT) prompting yields tangible improvements on MMLU-Pro, unlike on the original MMLU, where knowledge retrieval dominated.
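Scoring a CoT response requires pulling the final option letter out of free-form reasoning text. The sketch below shows one way to do this for ten-option questions; the phrase "answer is (X)", the regular expression, and the function name are assumptions chosen for illustration, not a description of the official evaluation script.

```python
import re
from typing import Optional

# Ten-option questions use the letters A through J. The negative lookahead
# avoids treating the first letter of an ordinary word as an option.
ANSWER_PATTERN = re.compile(r"[Aa]nswer is \(?([A-J])\)?(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated in a response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    # Taking the last match lets a model revise an earlier tentative answer.
    return matches[-1] if matches else None


print(extract_choice("2 + 2 = 4, so the answer is (C)."))  # C
print(extract_choice("I cannot decide."))                   # None
```

A response with no parsable letter returns `None`, which an evaluator would typically count as incorrect rather than replace with a random guess.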
3. Quantitative Properties and Evaluation Metrics
MMLU-Pro’s design enables fine-grained, reproducible, and cross-model evaluation. Key metrics include:
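- Accuracy: The fraction of questions for which the model's selected option matches the correct answer.
- Prompt Sensitivity: Measures the variation in accuracy across 24 prompt styles.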
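Both metrics can be sketched in a few lines. The source does not give the exact statistic used for prompt sensitivity, so the standard deviation and range reported below are assumptions, as are the function names and the toy scores.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions whose predicted letter equals the gold letter."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def prompt_sensitivity(acc_by_prompt: Dict[str, float]) -> Dict[str, float]:
    """Summarize how much accuracy moves across prompt styles.

    Assumption: spread is reported as population standard deviation
    and as max-minus-min range over the per-prompt accuracies.
    """
    scores = list(acc_by_prompt.values())
    return {
        "mean": mean(scores),
        "std": pstdev(scores),
        "range": max(scores) - min(scores),
    }


print(accuracy(["A", "C", "J", "B"], ["A", "C", "D", "B"]))  # 0.75

# Hypothetical accuracies for three of the prompt styles.
summary = prompt_sensitivity({"p1": 0.70, "p2": 0.72, "p3": 0.71})
print({name: round(value, 3) for name, value in summary.items()})
```

In practice `acc_by_prompt` would hold one entry per prompt style, each computed with `accuracy` on the same question set.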
Per-discipline counts ensure broad subject coverage, with, for example, 1,351 questions in mathematics, 1,299 in physics, and 1,101 in law.
The reformulated distractor set and question pool offer robust resistance to pattern exploitation, with an average of 9.47 answer choices per item.
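Because not every item has exactly ten options, the random-guess baseline for a given question set is the mean of 1/k over the per-item option counts k, which sits slightly above 10% when some items have fewer options. A minimal sketch, with a hypothetical function name and toy counts:

```python
from fractions import Fraction
from typing import Iterable


def chance_accuracy(option_counts: Iterable[int]) -> Fraction:
    """Expected accuracy of uniform random guessing over a question set."""
    counts = list(option_counts)
    if not counts or any(k < 1 for k in counts):
        raise ValueError("every question needs at least one option")
    # Each question contributes 1/k, where k is its number of options.
    return sum(Fraction(1, k) for k in counts) / len(counts)


# Toy set: three ten-option items and one five-option item.
baseline = chance_accuracy([10, 10, 10, 5])
print(baseline, float(baseline))  # 1/8 0.125
```

Exact fractions are used here only to keep the toy arithmetic free of floating-point noise.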
4. Empirical Benchmarks and Discrimination Capacity
MMLU-Pro produces a substantial absolute drop in accuracy of 16 to 33 percentage points (pp) compared to the original MMLU across leading LLMs. Selected results:
| Model | MMLU (CoT) | MMLU-Pro (CoT) | Δ (pp) |
|---|---|---|---|
| GPT-4o | 88.7% | 72.6% | 16.1 |
| GPT-4-Turbo | 86.5% | 63.7% | 22.8 |
| Phi-3-medium | 79.4% | 55.7% | 23.7 |
| Llama-3-8B | 62.7% | 35.4% | 27.3 |
| Gemma-7B | 62.4% | 33.7% | 28.7 |
Prompt sensitivity is also reduced by half versus MMLU (2% vs. 4–5%), enhancing test–retest stability. Notably, CoT prompting has a pronounced positive effect, with gains of 4–19 percentage points, confirming the intended increase in reasoning depth and the inadequacy of direct answer strategies for this dataset.
MMLU-Pro thus achieves greater separation between models: the accuracy gap between the top two LLMs (GPT-4o and GPT-4-Turbo) widens from about 2 pp on MMLU to about 9 pp on MMLU-Pro. It also leaves more headroom for future model improvement.
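The drop and separation figures follow directly from the table. The short sketch below recomputes them from the reported scores; the variable and function names are illustrative only.

```python
# (MMLU CoT accuracy, MMLU-Pro CoT accuracy) in percent, from the table above.
scores = {
    "GPT-4o": (88.7, 72.6),
    "GPT-4-Turbo": (86.5, 63.7),
    "Phi-3-medium": (79.4, 55.7),
    "Llama-3-8B": (62.7, 35.4),
    "Gemma-7B": (62.4, 33.7),
}


def drop_pp(mmlu: float, mmlu_pro: float) -> float:
    """Absolute accuracy drop in percentage points, rounded to one decimal."""
    return round(mmlu - mmlu_pro, 1)


drops = {model: drop_pp(*pair) for model, pair in scores.items()}
print(drops)

# Separation between the two strongest models on each benchmark.
gap_mmlu = round(scores["GPT-4o"][0] - scores["GPT-4-Turbo"][0], 1)
gap_pro = round(scores["GPT-4o"][1] - scores["GPT-4-Turbo"][1], 1)
print(gap_mmlu, gap_pro)  # 2.2 8.9
```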
5. Practical Considerations and Recommendations
MMLU-Pro is positioned as a stress test for substantive reasoning abilities. Recommended usages include:
- Employing both direct answer and CoT prompting styles to reveal models’ strengths and weaknesses.
- Adopting the rich distractor format in training/fine-tuning regimes to immunize against answer-pattern bias.
- Using aggregate and per-discipline metrics, including prompt sensitivity, to assess model robustness and susceptibility to superficial cues.
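Aggregate and per-discipline reporting can be sketched as follows. The record layout (`category`, `answer`, and `prediction` keys) and the function name are assumptions for illustration; adapt them to however the evaluation results are stored.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def per_discipline_accuracy(records: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Accuracy per discipline, plus micro and macro averages."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        correct[rec["category"]] += int(rec["prediction"] == rec["answer"])
    if not totals:
        raise ValueError("no records to score")

    report = {cat: correct[cat] / totals[cat] for cat in totals}
    per_cat = list(report.values())
    # Micro average weights every question equally; macro average weights
    # every discipline equally, which matters when counts are uneven.
    report["micro_avg"] = sum(correct.values()) / sum(totals.values())
    report["macro_avg"] = sum(per_cat) / len(per_cat)
    return report


toy_records = [
    {"category": "math", "answer": "C", "prediction": "C"},
    {"category": "math", "answer": "J", "prediction": "A"},
    {"category": "law", "answer": "B", "prediction": "B"},
]
print(per_discipline_accuracy(toy_records))
```

Reporting both averages makes it visible when a strong aggregate score is driven mainly by the largest disciplines.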
Researchers are encouraged to leverage the dataset’s diversity (discipline range, reasoning style, distractor quality) for developing specialized evaluation tools, conducting curriculum learning experiments, and tracking targeted advancements in higher-order and domain-specific reasoning.
6. Extensions, Variants, and Limitations
MMLU-Pro serves as the foundation for further advanced benchmarks:
- MMLU-Pro+ expands the core with multi-correct answer settings, analyzing shortcut learning and higher-order reasoning (Taghanaki et al., 2024).
- MMLU-ProX extends evaluation cross-lingually with parallel questions in 13 languages, supporting nuanced studies of multilingual model performance and disparity (Xuan et al., 2025).
Noted limitations of MMLU-Pro include its exclusively multiple-choice format (up to ten options), which precludes open-ended response evaluation, and a risk of domain imbalance that distractor validation alone does not address. Planned future enhancements seek to address these issues by incorporating adversarially filtered questions and alternative evaluation formats.
7. Significance and Impact
MMLU-Pro provides a challenging, wide-coverage, reasoning-focused benchmark that addresses shortcomings of prior benchmarks in discriminative power, artifact resistance, and diagnostic stability. It has become a recommended evaluation tool for tracking progress in large-scale language modeling, particularly in multi-step, theorem-based, and causal reasoning, and it serves as the basis for follow-up multilingual and shortcut-learning research.