Analysis of "Are We Done with MMLU?"
The paper, "Are We Done with MMLU?", critically examines the reliability of the Massive Multitask Language Understanding (MMLU) benchmark in evaluating LLMs. This benchmark, despite its wide adoption, demonstrates significant shortcomings in its ground truth quality, which the authors argue undermines the reliability of LLM evaluations.
Identification of Errors in MMLU
The authors highlight that a substantial portion of the MMLU dataset contains errors. In the Virology subset, for example, 57% of the analysed questions contain errors, ranging from simple parsing mistakes to more complex issues such as context and interpretation errors. This raises serious concerns about the validity of performance metrics derived from the MMLU benchmark.
Introduction of MMLU-Redux
To address the identified issues, the authors introduce MMLU-Redux, a meticulously re-annotated subset comprising 3,000 questions across 30 subjects. The subset was manually curated by 14 expert annotators using a novel error taxonomy to categorize and correct errors. The taxonomy divides issues into two primary groups: Question Assessment, covering Bad Question Clarity and Bad Options Clarity, and Ground Truth Verification, covering No Correct Answer, Multiple Correct Answers, and Wrong Ground Truth.
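To make the taxonomy concrete, the sketch below encodes it as a small Python structure. The category names follow the paper, but the `ErrorType` enum and `TAXONOMY` mapping are illustrative naming choices, not artefacts released with MMLU-Redux.

```python
from enum import Enum

class ErrorType(Enum):
    # Question Assessment: problems with the question text or its options
    BAD_QUESTION_CLARITY = "bad question clarity"
    BAD_OPTIONS_CLARITY = "bad options clarity"
    # Ground Truth Verification: problems with the labelled answer
    NO_CORRECT_ANSWER = "no correct answer"
    MULTIPLE_CORRECT_ANSWERS = "multiple correct answers"
    WRONG_GROUND_TRUTH = "wrong ground truth"
    # Questions with none of the above issues
    OK = "ok"

# Illustrative grouping of the subcategories under the two top-level groups.
TAXONOMY = {
    "Question Assessment": [
        ErrorType.BAD_QUESTION_CLARITY,
        ErrorType.BAD_OPTIONS_CLARITY,
    ],
    "Ground Truth Verification": [
        ErrorType.NO_CORRECT_ANSWER,
        ErrorType.MULTIPLE_CORRECT_ANSWERS,
        ErrorType.WRONG_GROUND_TRUTH,
    ],
}
```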
Impact on Model Evaluation
The re-evaluation of prominent LLMs on MMLU-Redux revealed significant discrepancies relative to the originally reported performance metrics. For instance, Palmyra X v3, which ranked fourth on the original MMLU Virology subset, rose to first place when evaluated on the corrected instances in MMLU-Redux. This indicates that rankings based on MMLU's erroneous data can be misleading, which leads the authors to advocate for a revision of the MMLU dataset.
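A minimal sketch of how such rank shifts arise, assuming each model's per-question predictions are stored as lists of option letters: score the same predictions once against the original MMLU labels and once against the corrected MMLU-Redux labels, then compare the resulting orderings. The model names and label values below are toy placeholders, not the paper's evaluation data.

```python
def accuracy(predictions, labels):
    """Fraction of questions where the predicted option matches the label."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def rank_models(model_predictions, labels):
    """Return model names sorted from best to worst accuracy on the given labels."""
    scores = {name: accuracy(preds, labels) for name, preds in model_predictions.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: identical predictions can yield different rankings once
# erroneous labels are corrected.
original_labels  = ["A", "B", "C", "D"]   # as shipped in MMLU (some wrong)
corrected_labels = ["A", "B", "A", "D"]   # after re-annotation
model_predictions = {
    "model_x": ["A", "B", "C", "C"],
    "model_y": ["A", "D", "A", "D"],
}
print(rank_models(model_predictions, original_labels))   # ['model_x', 'model_y']
print(rank_models(model_predictions, corrected_labels))  # ['model_y', 'model_x']
```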
Automatic Detection of Dataset Errors
The paper also explores the feasibility of automating the error-detection process, evaluating zero-shot prompting, few-shot prompting, Chain-of-Thought (CoT) prompting, and retrieval-augmented generation (RAG). Despite these efforts, even the best-performing strategy, few-shot CoT prompting with Claude-3-Opus with an F2 score of 40.29, falls short of reliably automating the process.
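For context on the reported metric, the sketch below computes an F-beta score from raw counts; with beta = 2, recall is weighted more heavily than precision, which fits a setting where missing an erroneous question is costlier than flagging a clean one. The counts are invented for illustration and do not correspond to the paper's results.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score; beta > 1 favours recall over precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Toy counts: a detector that correctly flags 60 erroneous questions (tp),
# wrongly flags 90 clean ones (fp), and misses 40 erroneous ones (fn).
print(round(100 * f_beta(tp=60, fp=90, fn=40), 2))  # 54.55
```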
Future Implications and Speculations
The findings underscore the necessity for ongoing scrutiny and reassessment of benchmarks like MMLU. The immediate implication is the availability of MMLU-Redux as a more reliable subset, paving the way for future contributions to enhance the quality of MMLU and similar datasets. Furthermore, the paper illustrates that current automated methods are insufficient for error detection, suggesting that future developments should focus on more sophisticated techniques or the integration of human-in-the-loop systems to achieve higher accuracy.
Conclusion
In conclusion, the paper "Are We Done with MMLU?" provides a comprehensive analysis highlighting critical deficiencies in the MMLU benchmark. Through their meticulous re-annotation effort, the authors present MMLU-Redux, a more reliable subset for evaluating LLMs. The disparity in model performance between the original and corrected datasets emphasizes the need for stringent dataset validation protocols. While the paper contributes significantly to our understanding of dataset quality, it also opens avenues for further research to enhance automatic error detection methods, highlighting an essential area for future advancements in AI dataset curation.