Towards Reliable Dermatology Evaluation Benchmarks (2309.06961v2)
Abstract: Benchmark datasets for digital dermatology unwittingly contain inaccuracies that reduce trust in model performance estimates. We propose a resource-efficient data-cleaning protocol to identify issues that escaped previous curation. The protocol leverages an existing algorithmic cleaning strategy and is followed by a confirmation process terminated by an intuitive stopping criterion. Based on confirmation by multiple dermatologists, we remove irrelevant samples and near duplicates and estimate the percentage of label errors in six dermatology image datasets for model evaluation promoted by the International Skin Imaging Collaboration. Along with this paper, we publish revised file lists for each dataset which should be used for model evaluation. Our work paves the way for more trustworthy performance assessment in digital dermatology.