
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

Published 26 Mar 2021 in stat.ML, cs.AI, and cs.LG | (2103.14749v4)

Abstract: We identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of at least 3.3% errors across the 10 datasets, where for example label errors comprise at least 6% of the ImageNet validation set. Putative label errors are identified using confident learning algorithms and then human-validated via crowdsourcing (51% of the algorithmically-flagged candidates are indeed erroneously labeled, on average across the datasets). Traditionally, machine learning practitioners choose which model to deploy based on test accuracy - our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. Surprisingly, we find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by just 5%. Test set errors across the 10 datasets can be viewed at https://labelerrors.com and all label errors can be reproduced by https://github.com/cleanlab/label-errors.

Citations (480)

Summary

  • The paper shows that pervasive label errors in test sets can mislead performance evaluations, with an estimated average error rate of at least 3.3% across the 10 benchmarks studied.
  • Using Confident Learning algorithms and human validation, the study quantifies the impact of label errors and demonstrates shifts in model rankings.
  • Findings suggest that lower-capacity models may outperform higher-capacity ones in noisy conditions, urging re-evaluation of model selection criteria.

Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

The paper "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks" identifies significant label errors within the test sets of widely used datasets. These errors have implications for model evaluation, challenging the accuracy of machine learning benchmarks used to assess and compare models.

Introduction

Label errors are inherent in the construction of datasets used for supervised learning across domains such as computer vision, natural language processing, and audio classification. Although these datasets drive ML progress, they often contain label inaccuracies introduced by automated labeling pipelines or crowdsourcing mistakes. This paper shows that such errors are not only present in training sets but also pervade test sets, potentially invalidating reported benchmark results (Figure 1).

Figure 1: An example label error from each category for image datasets, illustrating the diversity of label errors across different datasets.

Methodology

The paper uses Confident Learning (CL) algorithms to algorithmically flag potential label errors in test sets, which are then validated by human reviewers via crowdsourcing; on average, 51% of the algorithmically flagged candidates are confirmed to be mislabeled. Across the 10 benchmarks analyzed, an estimated average of at least 3.3% of test-set labels are incorrect, including at least 6% of the ImageNet validation set. These inaccuracies call for caution when selecting models based solely on conventional test-set performance (Figure 2).

Figure 2: Examples of difficult cases where Confident Learning identified potential errors, but no actual errors existed upon human verification.
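
The flagging step can be sketched in a few lines of NumPy. The snippet below is a minimal, illustrative version of the confident-learning rule (per-class probability thresholds computed from out-of-sample predicted probabilities); the function name and simplifications are this summary's own, not the paper's released implementation, which builds on the cleanlab library.

```python
import numpy as np

def find_candidate_label_errors(labels, pred_probs):
    """Flag candidate label errors with a simplified confident-learning rule.

    labels:     (n,) integer array of given (possibly noisy) labels.
    pred_probs: (n, k) array of *out-of-sample* predicted probabilities,
                e.g. obtained via cross-validation.
    Assumes every class appears at least once in `labels`.
    """
    n, k = pred_probs.shape

    # Per-class threshold t_j: mean predicted probability of class j over
    # the examples that are labeled j (the model's "self-confidence" for j).
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])

    # A class counts as a confident prediction for example i if its predicted
    # probability meets that class's threshold.
    above = pred_probs >= thresholds                 # (n, k) boolean
    masked = np.where(above, pred_probs, -np.inf)
    confident_class = masked.argmax(axis=1)          # best above-threshold class
    has_confident = above.any(axis=1)

    # Candidate error: the confidently predicted class disagrees with the label.
    return has_confident & (confident_class != labels)
```

In the paper, examples flagged in this manner were then shown to crowd workers, and on average 51% of the flagged candidates were confirmed as genuine label errors.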

Findings and Case Studies

Through empirical analysis, the paper finds that lower-capacity models can be more robust than their higher-capacity counterparts on noisy real-world data. For example, on ImageNet with corrected labels, ResNet-18 outperforms ResNet-50 once the prevalence of originally mislabeled test examples increases by just 6%; similarly, on CIFAR-10, VGG-11 overtakes VGG-19 once that prevalence increases by just 5% (Figure 3).

Figure 3: Analysis of ImageNet top-1 accuracy for models using original and corrected labels under different noise prevalence thresholds.
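
The ranking-crossover analysis can be illustrated with a small calculation. The sketch below is a simplified stand-in for the paper's benchmark-stability experiment, not its exact resampling procedure: given each model's per-example correctness against corrected labels and a flag for which test examples were originally mislabeled, it computes expected accuracy as the noise prevalence varies and reports where a lower-capacity model overtakes a higher-capacity one. The function names are illustrative.

```python
import numpy as np

def expected_accuracy(correct, mislabeled, prevalence):
    """Expected accuracy w.r.t. corrected labels on a test set whose fraction
    of originally-mislabeled examples equals `prevalence`.

    correct:    (n,) bool, model prediction matches the corrected label.
    mislabeled: (n,) bool, the original label for this example was wrong.
    """
    acc_on_mislabeled = correct[mislabeled].mean()
    acc_on_clean = correct[~mislabeled].mean()
    return prevalence * acc_on_mislabeled + (1 - prevalence) * acc_on_clean

def crossover_prevalence(correct_small, correct_large, mislabeled,
                         grid=np.linspace(0.0, 1.0, 101)):
    """Smallest noise prevalence at which the smaller model's expected
    accuracy exceeds the larger model's, or None if it never does."""
    for p in grid:
        if (expected_accuracy(correct_small, mislabeled, p)
                > expected_accuracy(correct_large, mislabeled, p)):
            return p
    return None
```

The paper's version of this analysis is what yields the ResNet-18 versus ResNet-50 and VGG-11 versus VGG-19 crossovers quoted above.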

Discussion and Implications

The presence of label errors calls into question the stability and reliability of benchmarks, particularly in noise-prone practical settings. This challenges the traditional practice of selecting models on test accuracy alone and argues for evaluation against corrected test sets. The paper suggests that ML practitioners carefully curate test-set labels so that benchmarks more accurately reflect model performance in real-world deployments (Figure 4).

Figure 4: The impact of label errors in CIFAR-10 on model accuracy under various noise prevalence thresholds.
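
As a concrete illustration of evaluating against corrected labels, the hypothetical helper below compares a single model's accuracy under the original and corrected test labels. The corrected labels for the 10 datasets are released at labelerrors.com and github.com/cleanlab/label-errors; the function and its names here are a sketch, not part of that release.

```python
import numpy as np

def benchmark_gap(preds, original_labels, corrected_labels):
    """Report a model's test accuracy under original vs. corrected labels."""
    acc_original = float(np.mean(preds == original_labels))
    acc_corrected = float(np.mean(preds == corrected_labels))
    return {
        "accuracy_original": acc_original,
        "accuracy_corrected": acc_corrected,
        "gap": acc_corrected - acc_original,
    }
```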

Conclusion

The paper concludes that significant label errors exist in the test sets commonly used for model benchmarking and that these errors can alter perceived model rankings. It advises the ML community to incorporate label-correction processes before model evaluation so that benchmarks better reflect true model performance. Future research should examine how these errors affect model generalization and explore strategies for mitigating errors in data-labeling practices.
