MMLU-Redux: Refined LLM Benchmark Suite
- MMLU-Redux is an error-corrected benchmark suite that systematically addresses flawed question formulations to enhance LLM evaluation accuracy.
- It employs a manual error taxonomy and annotation framework, quantifying error rates and correcting mislabelled answers across subjects.
- The refined datasets enable more reliable model ranking and performance re-evaluation in diverse domains including industrial, multilingual, and mobile contexts.
The term "MMLU-Redux" denotes a suite of rigorous, error-corrected benchmarks derived from the original Massive Multitask Language Understanding (MMLU) dataset, with the aim of addressing and mitigating systematic flaws in widely adopted LLM evaluation standards. The term now also extends, by analogy, to related efforts targeting industrial, multilingual, and mobile contexts, exemplified by KMMLU-Redux (Korean) and Mobile-MMLU (mobile intelligence). Each applies dataset denoising protocols, error annotation taxonomies, and contamination controls to yield cleaner evaluation sets with improved reliability for model assessment across diverse domains (Gema et al., 2024, Hong et al., 11 Jul 2025, Bsharat et al., 26 Mar 2025).
1. Origin and Motivation
The MMLU benchmark, introduced by Hendrycks et al. (ICLR 2021), encompasses 57 distinct academic and professional subjects, each represented by four-choice multiple-choice questions drawn from textbooks, exams, or online educational resources. Typical evaluation employs few-shot prompting and reports performance as Exact Match (EM), defined as the fraction of questions for which the model’s top prediction matches the annotated ground truth. Despite widespread adoption, subsequent analysis uncovered pervasive dataset issues, notably errors in question/option formulation and mislabelled answer keys, that distort LLM performance measurement and systematically penalize certain models (Gema et al., 2024).
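The EM metric described above reduces to a one-line comparison of predicted and annotated option letters; the following is a minimal sketch with illustrative helper names, not the evaluation harness used in the original work:

```python
def exact_match(predictions, answer_keys):
    """Exact Match: fraction of questions where the model's top choice
    equals the annotated ground-truth option letter ("A".."D")."""
    assert len(predictions) == len(answer_keys)
    return sum(p == a for p, a in zip(predictions, answer_keys)) / len(answer_keys)
```

For example, `exact_match(["A", "C", "B"], ["A", "C", "D"])` scores two of three items correct and returns 2/3.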
2. Error Taxonomy and Annotation Framework
"MMLU-Redux" is constructed through manual error annotation by domain experts, guided by a hierarchical taxonomy:
- Type 1: Question Assessment
- Bad Question Clarity
- Bad Options Clarity
- Type 2: Ground-Truth Verification
- No Correct Answer
- Multiple Correct Answers
- Wrong Ground Truth
Each item undergoes sequential evaluation for presentation clarity and answer key validity. Questions failing any criterion are flagged, categorized, and—where possible—corrected by consulting the original provenance (textbook, exam, or website). This systematic protocol ensures transparent error identification and correction, setting a new standard for dataset hygiene in LLM benchmarks (Gema et al., 2024).
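The two-level taxonomy and its sequential check can be modelled as plain data; the enum values below paraphrase the taxonomy labels and are not an official release schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ErrorType(Enum):
    OK = "ok"
    # Type 1: Question Assessment
    BAD_QUESTION_CLARITY = "bad_question_clarity"
    BAD_OPTIONS_CLARITY = "bad_options_clarity"
    # Type 2: Ground-Truth Verification
    NO_CORRECT_ANSWER = "no_correct_answer"
    MULTIPLE_CORRECT_ANSWERS = "multiple_correct_answers"
    WRONG_GROUNDTRUTH = "wrong_groundtruth"

@dataclass
class AnnotatedItem:
    question: str
    options: list[str]
    published_label: str                   # original MMLU answer key
    error_type: ErrorType = ErrorType.OK
    corrected_label: Optional[str] = None  # filled when the source permits a fix

def is_presentation_error(item: AnnotatedItem) -> bool:
    """Type 1 errors concern how the item is presented, checked before
    the Type 2 ground-truth verification stage."""
    return item.error_type in (ErrorType.BAD_QUESTION_CLARITY,
                               ErrorType.BAD_OPTIONS_CLARITY)
```

Keeping the original label, the error flag, and any correction side by side mirrors the released dataset's approach of preserving provenance rather than silently overwriting items.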
3. Quantitative Error Analysis
A random sample of 3,000 questions (100 per subject from 30 MMLU subjects) underwent manual review. The observed error rates are:
- Overall error rate: 9% across the 30 sampled subjects (approximately one in ten questions contains a significant error).
- Subject-level variability: Certain categories are highly problematic: Virology exhibits a 57% error rate (including 33% Wrong Ground Truth), Logical Fallacies 26%, College Chemistry 25%, Professional Law 18%, and Business Ethics 14%.
- Global estimate: Extrapolating across the full MMLU suggests that a substantial fraction of all questions are erroneous.
| Subject | Error Rate (%) |
|---|---|
| Virology | 57 |
| Logical Fallacies | 26 |
| College Chemistry | 25 |
| Professional Law | 18 |
| Business Ethics | 14 |
| Overall (30 subjects) | 9 |
These findings reveal substantial risk of spurious benchmarking and misinterpretation of LLM capabilities when relying on error-ridden benchmarks (Gema et al., 2024).
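The per-subject rates above are simple ratios of flagged items to sampled items (100 per subject); an aggregation sketch over illustrative annotation records:

```python
from collections import Counter

def subject_error_rates(annotations):
    """annotations: iterable of (subject, is_erroneous) pairs.
    Returns {subject: fraction of sampled items flagged as erroneous}."""
    totals, flagged = Counter(), Counter()
    for subject, is_err in annotations:
        totals[subject] += 1
        flagged[subject] += bool(is_err)
    return {s: flagged[s] / totals[s] for s in totals}
```

For instance, 57 flagged items out of 100 sampled Virology questions yields a rate of 0.57.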
4. Construction and Release of MMLU-Redux
MMLU-Redux was built by selecting a balanced subset of MMLU subjects encompassing both STEM and humanities, prioritizing those with the highest suspected error rates. For each, 100 questions were sampled, re-annotated by 14 domain experts, and assigned error flags and corrections as required. The released dataset contains original questions, published labels, error metadata, and—when needed—corrected answers. It is available for further community auditing and expansion, serving both as an improved benchmark and as a gold-standard resource for developing error detection methodologies (Gema et al., 2024).
5. Model Performance Re-evaluation
Re-evaluation of leading LLMs on MMLU-Redux yields significant insights:
- Discrepancies between Original and Adjusted EM: Scores computed only on validated ("OK") questions are consistently higher than those measured on the original, error-prone sets. For instance, in the Virology subset, Palm-X v3 improves from 56% (Original EM) to 93% (Adjusted EM), jumping in rank from 4th to 1st.
- Rank Instability: Rank-swaps of up to three positions were observed in several domains, demonstrating distortion of leaderboards and misrepresentation of model capabilities.
- Item-level performance: On flagged erroneous questions, models performed 30–40 points lower than on validated items, except in rare cases where memorization of a wrong key paradoxically boosted erroneous question scores.
| Subject | Model | Original EM → Adjusted EM (Rank → Rank) |
|---|---|---|
| Virology | Palm-X v3 | 56% → 93% (4 → 1) |
| Virology | Claude 3 Opus | 54% → 88% (9 → 2) |
| Virology | GPT-4o | 56% → 91% (1 → 1) |
| College Chemistry | GPT-4 Turbo | 60% → 72% (2 → 1) |
| College Chemistry | Llama 3 70B | 56% → 67% (8 → 8) |
| Logical Fallacies | GPT-4 Turbo | 90% → 96% (1 → 3) |
| Logical Fallacies | Gemini 1.5 Pro | 88% → 96% (7 → 5) |
Results advocate for comprehensive dataset auditing, as even minimal error frequencies can materially alter LLM ranking and performance attribution (Gema et al., 2024).
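Adjusted EM is simply EM restricted to items annotated as error-free; a minimal sketch using hypothetical record fields:

```python
def adjusted_em(records):
    """records: dicts with 'error_type', 'prediction', and 'label' keys.
    Scores only items annotated "ok"; returns None if none qualify."""
    valid = [r for r in records if r["error_type"] == "ok"]
    if not valid:
        return None
    return sum(r["prediction"] == r["label"] for r in valid) / len(valid)

def leaderboard(model_scores):
    """model_scores: {model_name: EM score}. Returns names, best first."""
    return sorted(model_scores, key=model_scores.get, reverse=True)
```

Recomputing `leaderboard` over adjusted rather than original scores is what produces the rank shifts reported in the table above.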
6. Implications and Best Practices for Benchmark Design
Key recommendations derived from the MMLU-Redux experience:
- Routine systematic auditing with explicit error taxonomies should precede widespread adoption of any benchmark.
- Provenance preservation: Retain original source documentation for all items to aid downstream verification and extension.
- Multi-stage annotation pipelines: Engage both subject matter experts and cross-validators for accuracy and consistency.
- Release explicit error taxonomies: Enable end-users to filter, correct, or audit faulty benchmark items efficiently.
- Augmented pre-screening: Explore (but do not solely rely on) semi-automated LLM-based error flagging; current best models achieve only modest F2 scores in error detection.
- Open community benchmarks: Public datasets should remain open for ongoing correction and be positioned as evolving resources (Gema et al., 2024).
A plausible implication is that benchmarking protocols will increasingly require external validation, cross-lingual and cross-modal coverage, and adaptation to new use-cases as LLMs diversify.
7. Extensions: KMMLU-Redux and Mobile-MMLU
The principles of MMLU-Redux have inspired domain- and context-specific denoised benchmarks:
- KMMLU-Redux (Hong et al., 11 Jul 2025): Focused on Korean National Technical Qualification (KNTQ) exams, it applies stringent error removal (manual and GPT-4o-assisted annotation, filtering of trivially easy questions, decontamination) to yield a 2,587-item, 14-domain expert-level test set. This denoising procedure removes 38.6% of trivially easy items and 7.66% of critically erroneous items from the original KMMLU. The resulting industrially relevant set upholds strong label fidelity (manual review against official source PDFs), tight contamination controls, and fine-grained domain statistics.
- Mobile-MMLU (Bsharat et al., 26 Mar 2025): Designed for mobile AI evaluation, it features 16,186 four-way multiple-choice questions across 80 mobile-centric domains, with Mobile-MMLU-Pro providing a more challenging 9,497-question subset. Construction emphasizes real-world mobile relevance (leveraging WikiHow, Stack Exchange, Reddit, and LLM refinement), order-invariant question design, explicit reporting of mobile-specific efficiency metrics (latency, energy, RAM), and privacy-by-design principles (required on-device inference, no server data leaks). This line of work underscores the need for benchmarks tailored to deployment environment, resource constraints, and application-specific reliability.
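The shared denoising recipe behind both extensions (drop trivially easy items, drop flagged errors, decontaminate) can be sketched as a filter chain; the predicate names are illustrative stand-ins, not the papers' actual pipelines:

```python
def denoise(items, is_easy, is_erroneous, is_contaminated):
    """Keep only items that survive all three screens.

    items: iterable of question records; the three arguments are
    predicates, e.g. a high solve-rate check, an annotation-flag
    lookup, and an overlap check against known pretraining corpora.
    """
    return [q for q in items
            if not is_easy(q) and not is_erroneous(q) and not is_contaminated(q)]
```

Applying the screens in this order also lets a curator report per-stage removal statistics (such as the 38.6% and 7.66% figures above) by comparing set sizes between stages.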
These Redux-family benchmarks establish a rigorous paradigm for future dataset curation across languages, industries, and device-specific scenarios. Their collective methodology scales from academic to industrial to consumer-grade LLM deployments, solidifying Redux as a foundational approach to robust LLM evaluation (Gema et al., 2024, Hong et al., 11 Jul 2025, Bsharat et al., 26 Mar 2025).