Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks (2103.14749v4)

Published 26 Mar 2021 in stat.ML, cs.AI, and cs.LG

Abstract: We identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of at least 3.3% errors across the 10 datasets, where for example label errors comprise at least 6% of the ImageNet validation set. Putative label errors are identified using confident learning algorithms and then human-validated via crowdsourcing (51% of the algorithmically-flagged candidates are indeed erroneously labeled, on average across the datasets). Traditionally, machine learning practitioners choose which model to deploy based on test accuracy - our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. Surprisingly, we find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by just 5%. Test set errors across the 10 datasets can be viewed at https://labelerrors.com and all label errors can be reproduced by https://github.com/cleanlab/label-errors.

Citations (480)

Summary

  • The paper reveals pervasive label errors in test sets, with an average rate of at least 3.3% and up to 10% in certain datasets, confirmed by crowdsourcing.
  • It demonstrates that lower capacity models can outperform higher capacity ones on corrected labels, challenging traditional benchmark evaluations.
  • The study advocates for improved data curation and robust model development to enhance the reliability of ML evaluation practices.

Analysis of Label Errors in Benchmark Datasets

The paper "Pervasive Label Errors in Test Sets" by Curtis G. Northcutt, Anish Athalye, and Jonas Mueller examines the prevalence and implications of label errors in widely-used datasets for computer vision, NLP, and audio processing. This detailed paper identifies label errors across ten benchmark datasets and evaluates their impact on ML model performance, providing substantial insights into both dataset reliability and model selection.

Overview of Findings

The authors applied confident learning algorithms to estimate label errors in test sets and validated the flagged candidates via crowdsourcing. They report an average label error rate of at least 3.3% across the ten datasets, with ImageNet and QuickDraw showing particularly high error rates of at least 6% and roughly 10%, respectively. On average across datasets, 51% of the algorithmically flagged candidates were confirmed by human reviewers to be mislabeled.
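The confident-learning step can be reproduced, at a high level, with the open-source cleanlab library that accompanies the paper's code release. The sketch below is a minimal illustration assuming cleanlab 2.x's `find_label_issues`; the tiny `labels` and `pred_probs` arrays are made-up placeholders, and the paper's full pipeline (out-of-sample probabilities from cross-validation plus crowdsourced validation) involves more than this single call.

```python
import numpy as np
from cleanlab.filter import find_label_issues  # assumes cleanlab 2.x

# Toy placeholder inputs: given (possibly noisy) labels and out-of-sample
# predicted class probabilities from any trained classifier.
labels = np.array([0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.95, 0.05],  # the model strongly disagrees with the given label (1)
])

# Flag indices of likely label errors, ranked by how confidently the
# model disagrees with the given label.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)  # candidates to send to human reviewers for validation
```

In the paper's workflow, the indices returned by this kind of call are the algorithmically flagged candidates that were then human-validated via crowdsourcing.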

The findings highlight a pivotal concern: ML benchmarks are susceptible to distortion from mislabeled test data. The authors show that lower-capacity models can outperform their higher-capacity counterparts in real-world settings where label errors are prevalent. For instance, on ImageNet with corrected labels, ResNet-18 surpasses ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%; on CIFAR-10 with corrected labels, VGG-11 outperforms VGG-19 if that prevalence increases by just 5%.
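To make the ranking flip concrete, here is a small synthetic simulation (not the paper's data or code; every quantity is invented): a higher-capacity model that fits the original noisy labels slightly better can top the benchmark on those labels, yet fall behind a lower-capacity model once accuracy is measured against corrected labels.

```python
import numpy as np

# Synthetic illustration only; all quantities are made up.
rng = np.random.default_rng(0)
n, n_classes = 10_000, 10

corrected = rng.integers(0, n_classes, size=n)   # labels after correction
original = corrected.copy()
flip = rng.random(n) < 0.06                      # ~6% of given labels are wrong
original[flip] = rng.integers(0, n_classes, size=flip.sum())

# Hypothetical predictions: the smaller model tends to agree with the
# corrected labels, the larger model with the original (noisy) labels.
preds_small = np.where(rng.random(n) < 0.70, corrected, original)
preds_large = np.where(rng.random(n) < 0.72, original, corrected)

def accuracy(pred, labels):
    return (pred == labels).mean()

print("on original labels :", accuracy(preds_small, original), accuracy(preds_large, original))
print("on corrected labels:", accuracy(preds_small, corrected), accuracy(preds_large, corrected))
# Expected pattern: the large model ranks first on the original labels,
# while the small model ranks first on the corrected labels.
```

The point of the toy is only that the benchmark ranking depends on which label set is used for scoring, which is the effect the paper measures on real datasets.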

Implications for Practitioners and Researchers

The implications of this research are twofold:

  1. Practical Implications: Machine learning practitioners should exercise caution when selecting models based solely on test accuracy, particularly when applying models in noisy, real-world environments. Ensuring models are evaluated on correctly labeled datasets could lead to better model selection outcomes.
  2. Theoretical Implications: From a research perspective, this paper suggests reconsideration of traditional benchmarking practices. Over-reliance on test accuracy derived from potentially erroneous labels may lead to misguided conclusions about model performance and robustness.

Future Directions

This research opens several avenues for future exploration:

  • Improved Data Curation: Development of more sophisticated data curation and label verification techniques could mitigate issues of noisy labels in both training and test datasets.
  • Robust Model Development: Researchers could focus on models and methodologies that are inherently robust to label noise, reducing dependency on costly data corrections.
  • Benchmark Redefinition: Rethinking benchmark datasets to emphasize label correctness over model performance on noisy data may provide a more accurate representation of model capabilities.

Overall, the paper is an essential contribution to the discourse on data quality in machine learning, advocating a shift in focus from model complexity to data reliability when assessing progress. It underscores the need for vigilant data practices to maintain integrity in model evaluation and deployment.