MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (2406.01574v6)

Published 3 Jun 2024 in cs.CL

Abstract: In the age of large-scale LLMs, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.

The paper "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" addresses certain limitations in the existing MMLU benchmark. Given the rapid progress in the performance of LLMs, the initial MMLU benchmark has faced saturation, making it increasingly challenging to distinguish between the capabilities of different models. This paper introduces MMLU-Pro, designed to elevate the benchmark's difficulty and discriminative power significantly.

Motivation and Contribution

As LLMs like GPT-4, Gemini, and Claude continue to improve, their scores on MMLU have plateaued, leaving little headroom to measure further progress. MMLU-Pro addresses this by integrating more complex, reasoning-focused questions and expanding the choice set from four to ten options. It also aims to make evaluations more robust by reducing sensitivity to prompt variations.

Dataset Enhancements and Construction

MMLU-Pro spans 14 disciplines, including mathematics, physics, chemistry, law, and psychology, and contains over 12,000 questions. The dataset was curated with three major improvements:

  1. Expanded Options: Each question in MMLU-Pro has ten answer options instead of MMLU's original four, cutting the random-guess baseline from 25% to 10% and increasing the challenge (see the short illustration after this list).
  2. Complexity and Reasoning: The dataset features a higher proportion of college-level exam questions that require deeper reasoning and comprehension.
  3. Noise Reduction: A two-round expert review process was conducted to filter out trivial or erroneous questions, further ensuring the dataset's quality.
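
The expanded option set directly lowers the floor set by random guessing. The short Python snippet below is illustrative only, not code from the paper; it simply makes the arithmetic behind item 1 explicit.

```python
# Expected accuracy of uniform random guessing as a function of option count.
# Illustrative only; not taken from the paper's evaluation code.

def random_guess_accuracy(num_options: int) -> float:
    """Probability of picking the correct option uniformly at random."""
    return 1.0 / num_options

print(f"MMLU (4 options):      {random_guess_accuracy(4):.0%}")   # 25%
print(f"MMLU-Pro (10 options): {random_guess_accuracy(10):.0%}")  # 10%
```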

The questions were sourced from the original MMLU, STEM websites, TheoremQA, and SciBench. These additions were complemented by verification steps to ensure the accuracy and validity of the questions and their answers.
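
For readers who want to inspect the corpus directly, the sketch below loads it with the Hugging Face `datasets` library. It assumes the public release under the `TIGER-Lab/MMLU-Pro` identifier and field names such as `question`, `options`, `answer`, and `category`; consult the dataset card if these differ.

```python
# Sketch: inspecting MMLU-Pro via the Hugging Face `datasets` library.
# Assumes the dataset is published as "TIGER-Lab/MMLU-Pro" with fields
# "question", "options", "answer", and "category"; verify against the
# dataset card before relying on these names.
from collections import Counter
from datasets import load_dataset

test_set = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

print(f"{len(test_set)} questions across "
      f"{len(set(test_set['category']))} disciplines")
print(Counter(test_set["category"]).most_common(5))

example = test_set[0]
print(example["question"])
for letter, option in zip("ABCDEFGHIJ", example["options"]):
    print(f"  {letter}. {option}")
print("Gold answer:", example["answer"])
```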

Empirical Results

Several experiments were conducted with more than 50 LLMs, including both open-source and closed-source models such as GPT-4o, Claude-3-Opus, and LLaMA-3. The results can be summarized as follows:

  1. Performance Drop: Models experienced a significant drop in accuracy when evaluated on MMLU-Pro compared to MMLU, with performance declines ranging from 16% to 33%. For instance, the top model, GPT-4o, showed an accuracy of 72.6% on MMLU-Pro.
  2. Robustness: Across the 24 prompt styles tested, the variability in model scores dropped from 4-5% on MMLU to roughly 2% on MMLU-Pro, indicating greater stability under prompt variation.
  3. Chain of Thought (CoT) Reasoning: Models using CoT reasoning performed significantly better on MMLU-Pro than with direct answering, in contrast to findings on the original MMLU, indicating that MMLU-Pro incorporates more complex reasoning tasks (a simplified evaluation sketch follows this list).
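
As a concrete illustration of the comparison in item 3, the sketch below builds a direct prompt and a CoT prompt for a single question and extracts the chosen letter from a free-form response. The prompt templates, the extraction regex, and the `model_fn` callable are simplified placeholders rather than the paper's official evaluation harness.

```python
# Sketch: direct vs. Chain-of-Thought prompting for MMLU-Pro questions.
# `model_fn` is a placeholder for an actual inference call; the paper's
# exact prompt templates and answer-extraction rules may differ.
import re
from typing import Callable, Optional

LETTERS = "ABCDEFGHIJ"

def format_question(question: str, options: list) -> str:
    lines = [question] + [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

def direct_prompt(question: str, options: list) -> str:
    return format_question(question, options) + "\nAnswer with a single letter (A-J)."

def cot_prompt(question: str, options: list) -> str:
    return (format_question(question, options)
            + '\nLet\'s think step by step, then finish with "The answer is (X)".')

def extract_choice(response: str) -> Optional[str]:
    """Pull the final answer letter out of a free-form model response."""
    match = re.search(r"answer is \(?([A-J])\)?", response, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    letters = re.findall(r"\b([A-J])\b", response)  # fallback: last bare letter
    return letters[-1] if letters else None

def accuracy(model_fn: Callable[[str], str], items: list, use_cot: bool = True) -> float:
    """Fraction of items answered correctly under the chosen prompting style."""
    build = cot_prompt if use_cot else direct_prompt
    correct = sum(
        extract_choice(model_fn(build(item["question"], item["options"]))) == item["answer"]
        for item in items
    )
    return correct / len(items)
```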

Error Analysis

A detailed error analysis of GPT-4o, the top-performing model, revealed that 39% of errors were due to flawed reasoning processes, 35% were due to a lack of specific domain expertise, and 12% stemmed from computational errors. This analysis underscores the benchmark's ability to pinpoint specific areas needing improvement, making MMLU-Pro a more informative tool for model diagnostics.

Comparison with MMLU

MMLU-Pro improves upon MMLU in several key aspects:

  • Difficulty Level: Expanded options and more complex questions increase the benchmark's difficulty, ensuring that models have room for improvement.
  • Reasoning Requirement: CoT prompting yields substantial performance gains, reflecting the dataset's emphasis on multi-step reasoning.
  • Robustness: Reduced variability in scores due to prompt phrasing signifies a more reliable benchmark.

Implications and Future Directions

MMLU-Pro's introduction represents a significant step forward in benchmark design for LLMs. By raising the bar for model evaluation, it creates room to track continued progress in model capabilities. Practically, it offers a more discriminative tool for assessing model performance, helping researchers and developers fine-tune models more efficiently and identify distinct areas for improvement.

Theoretically, MMLU-Pro sets a precedent for future benchmarks, emphasizing the need for rigorous, diversified, and challenging evaluation criteria. Future developments in AI will likely see the integration of multi-modal capabilities, and benchmarks like MMLU-Pro could pave the way for evolving standards in model assessment.

In conclusion, MMLU-Pro provides a significant enhancement over the original MMLU benchmark, offering a more robust, challenging, and reliable tool for evaluating advanced LLMs. It holds promise for driving further advancements in AI by addressing key areas of model performance and resilience.

Authors (17)
  1. Yubo Wang (53 papers)
  2. Xueguang Ma (36 papers)
  3. Ge Zhang (170 papers)
  4. Yuansheng Ni (14 papers)
  5. Abhranil Chandra (8 papers)
  6. Shiguang Guo (4 papers)
  7. Weiming Ren (12 papers)
  8. Aaran Arulraj (2 papers)
  9. Xuan He (37 papers)
  10. Ziyan Jiang (16 papers)
  11. Tianle Li (25 papers)
  12. Max Ku (11 papers)
  13. Kai Wang (624 papers)
  14. Alex Zhuang (5 papers)
  15. Rongqi Fan (5 papers)
  16. Xiang Yue (72 papers)
  17. Wenhu Chen (134 papers)
Citations (93)