An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets (2312.02200v1)

Published 2 Dec 2023 in cs.CV, cs.AI, and stat.AP

Abstract: Major advancements in computer vision can primarily be attributed to the use of labeled datasets. However, acquiring labels for datasets often results in errors which can harm model performance. Recent works have proposed methods to automatically identify mislabeled images, but developing strategies to effectively implement them in real world datasets has been sparsely explored. Towards improved data-centric methods for cleaning real world vision datasets, we first conduct more than 200 experiments carefully benchmarking recently developed automated mislabel detection methods on multiple datasets under a variety of synthetic and real noise settings with varying noise levels. We compare these methods to a Simple and Efficient Mislabel Detector (SEMD) that we craft, and find that SEMD performs similarly to or outperforms prior mislabel detection approaches. We then apply SEMD to multiple real world computer vision datasets and test how dataset size, mislabel removal strategy, and mislabel removal amount further affect model performance after retraining on the cleaned data. With careful design of the approach, we find that mislabel removal leads to per-class performance improvements of up to 8% for a retrained classifier in smaller data regimes.


Summary

  • The paper introduces SEMD, a novel method that speeds up mislabel detection while matching or surpassing traditional techniques in accuracy.
  • It systematically benchmarks over 200 experiments across varied noise levels in datasets like CheXpert and METER-ML.
  • Results demonstrate that removing mislabeled data can significantly enhance classification performance, especially in multi-label scenarios.

In recent years, the field of computer vision has seen impressive advancements, largely attributed to the use of labeled datasets. However, these datasets often contain labeling errors that can impede the performance of machine learning models. These labeling errors are especially problematic in critical areas such as medical diagnosis, where precise and accurate data labeling is crucial.

Label errors in datasets occur for many reasons, such as human error during manual labeling or inaccuracies in auto-labeling algorithms. To counter this, various automated mislabel detection techniques have been developed. Despite their potential, these methods were predominantly validated on datasets containing synthetically introduced noise, and their effectiveness on real-world data has remained largely unexplored.

The authors conducted more than 200 experiments benchmarking these automated mislabel detection methods on multiple datasets, covering both synthetically introduced and real noise at varying noise levels. Among the methods compared is a new approach crafted for this paper, the Simple and Efficient Mislabel Detector (SEMD), which performed comparably to or better than existing techniques while being substantially faster.
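The summary does not spell out SEMD's internals, but detectors in this family typically rank examples by how strongly a held-out model disagrees with the given label. A minimal sketch of that generic idea (not the paper's exact method), assuming out-of-sample predicted class probabilities are already available; the `probs` and `labels` names are illustrative:

```python
def mislabel_scores(probs, labels):
    """Score each example by 1 - p(given label); higher = more suspect.

    probs:  per-example lists of class probabilities from held-out predictions
    labels: integer class labels as given in the (possibly noisy) dataset
    """
    return [1.0 - p[y] for p, y in zip(probs, labels)]

def rank_suspects(probs, labels):
    """Return example indices sorted from most to least suspect."""
    scores = mislabel_scores(probs, labels)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy example: example 1's given label receives almost no probability
# mass from the held-out model, so it ranks as the most likely mislabel.
probs = [[0.9, 0.1], [0.95, 0.05], [0.2, 0.8]]
labels = [0, 1, 1]
print(rank_suspects(probs, labels))  # -> [1, 2, 0]
```

Ranking rather than hard-thresholding is convenient here because, as the next paragraphs discuss, the removal amount is itself a design choice.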

In an applied context, SEMD was tested on CheXpert, a dataset containing chest X-rays, and METER-ML, a multi-sensor image dataset labeled for methane emissions, which both come with their own challenges in terms of labeling errors. The findings indicate that removing mislabeled data using SEMD can lead to significant improvements in classification accuracy, especially in datasets that are not exceedingly large.
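Because the paper studies how the removal amount affects performance after retraining, one knob in such a pipeline is the fraction of top-ranked suspects to drop before retraining. A hypothetical helper for that step (the fraction shown is a tuning value, not one from the paper):

```python
def remove_top_fraction(scores, fraction):
    """Return indices to keep after dropping the top `fraction` of
    examples ranked by mislabel score (higher score = more suspect)."""
    n_remove = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    to_drop = set(ranked[:n_remove])
    return [i for i in range(len(scores)) if i not in to_drop]

# Drop the 25% most suspect of four examples (here, just example 1).
print(remove_top_fraction([0.1, 0.95, 0.2, 0.05], 0.25))  # -> [0, 2, 3]
```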

In scenarios of multi-label tasks, where an example could be associated with multiple labels, the paper explored different strategies for mislabel detection and removal. These strategies ranged from per-image to per-label approaches, and the optimal performance was often achieved by combining these strategies and tailoring them to the specific needs of the task at hand.
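The per-image and per-label removal granularities described above can be sketched as follows, assuming a `flagged[i][c]` lookup marking suspect (image, label) pairs; all names are illustrative, not from the paper:

```python
def per_label_removal(labels, flagged):
    """Drop only the flagged label annotations, keeping every image."""
    return [
        [c for c in img_labels if not flagged[i][c]]
        for i, img_labels in enumerate(labels)
    ]

def per_image_removal(labels, flagged):
    """Drop the whole image if any one of its labels is flagged."""
    return [
        img_labels
        for i, img_labels in enumerate(labels)
        if not any(flagged[i][c] for c in img_labels)
    ]

# Image 0 carries a suspect label (class 2): per-label keeps the image
# minus that label, while per-image discards the image entirely.
labels = [[0, 2], [1]]
flagged = [{0: False, 2: True}, {1: False}]
print(per_label_removal(labels, flagged))   # -> [[0], [1]]
print(per_image_removal(labels, flagged))   # -> [[1]]
```

The trade-off mirrors the paper's observation: per-label removal preserves more training data, while per-image removal is more aggressive about suspected noise, so the better choice depends on the task.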

The paper is both extensive and detailed, offering a comprehensive analysis across a variety of settings: synthetic and real noise levels, dataset sizes, and differing removal strategies. It further contributes an effective and efficient approach for mislabel detection that is well suited to the complexities of real-world datasets. These insights help practitioners design better data-cleaning methods, improving the robustness and accuracy of the resulting machine learning models.
