Disparities in Dermatology AI Performance
The paper "Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set" thoroughly investigates the performance discrepancies of dermatology AI algorithms when evaluated on a diverse dataset, particularly focusing on differential accuracy across various skin tones and uncommon diseases. It highlights how previously established AI models exhibit a substantial performance degeneration when applied to the new Diverse Dermatology Images (DDI) dataset, which was meticulously curated to encompass a wide spectrum of skin tones and confirmed pathological lesions.
The DDI dataset comprises 656 images spanning skin tones classified by the Fitzpatrick skin type (FST) scale, specifically enabling comparison between FST V-VI (darker tones) and FST I-II (lighter tones). The paper finds that the three evaluated state-of-the-art algorithms, ModelDerm, DeepDerm, and HAM 10000, which had previously reported high ROC-AUC scores of 0.88 to 0.94, declined markedly when tested on the DDI dataset, with ROC-AUC scores dropping to between 0.56 and 0.67. The models performed worst on images of darker skin tones and rare diseases, with reduced sensitivity in detecting malignancies: on dark skin tones, ModelDerm's sensitivity fell to 0.12 and DeepDerm's to 0.23, pointing to a lack of diversity in the models' training data as a driver of the degradation.
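To make the subgroup analysis concrete, here is a minimal sketch of how ROC-AUC and malignancy sensitivity can be computed per Fitzpatrick group; the DataFrame columns (`fst`, `label`, `score`) and the 0.5 operating threshold are illustrative assumptions, not the authors' code.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def evaluate_by_skin_tone(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Compute ROC-AUC and malignancy sensitivity per Fitzpatrick subgroup.

    Assumes df has columns: "fst" ("I-II" or "V-VI"), "label" (1 = malignant,
    biopsy-confirmed), and "score" (model probability of malignancy).
    """
    rows = []
    for fst_group, sub in df.groupby("fst"):
        auc = roc_auc_score(sub["label"], sub["score"])
        malignant = sub[sub["label"] == 1]
        # Sensitivity: fraction of biopsy-proven malignancies the model flags.
        sensitivity = float((malignant["score"] >= threshold).mean())
        rows.append({"fst": fst_group, "roc_auc": auc, "sensitivity": sensitivity})
    return pd.DataFrame(rows)
```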
Moreover, the paper compares these AI algorithms with dermatologists using consensus labeling, observing that clinicians labeling from images alone are also less accurate than biopsy-derived ground truth annotations, particularly for dark skin tones. This raises questions about the reliability of visual consensus labeling as a method for curating dermatology datasets and highlights the inherent difficulty of capturing diagnostic nuances in these populations.
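A minimal sketch of the kind of comparison involved, assuming a table of per-rater visual labels alongside biopsy ground truth; the majority-vote rule and the column layout are assumptions for illustration, not the paper's exact protocol.

```python
import pandas as pd

def consensus_accuracy(ratings: pd.DataFrame, truth: pd.Series, fst: pd.Series) -> pd.Series:
    """Compare majority-vote consensus of visual labels against biopsy truth.

    Assumes ratings has one column per dermatologist (1 = malignant), and
    truth / fst are aligned Series of biopsy labels and Fitzpatrick groups.
    """
    # Majority vote across raters; ties are counted as malignant here.
    consensus = (ratings.mean(axis=1) >= 0.5).astype(int)
    correct = consensus == truth
    # Accuracy of visual consensus relative to biopsy, per skin-tone subgroup.
    return correct.groupby(fst).mean()
```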
The authors propose fine-tuning AI models on the DDI data as a remedial approach to narrow these performance gaps. Fine-tuning improved performance across all skin tones, reducing the disparity and yielding performance comparable to, or better than, that of dermatologists on darker skin tones. For instance, after fine-tuning, DeepDerm's ROC-AUC for FST V-VI improved to 0.74.
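The sketch below shows what such fine-tuning might look like in PyTorch, assuming a local copy of DDI arranged as class folders; the ResNet-50 backbone, hyperparameters, and `ddi/` path are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Hypothetical local copy of DDI arranged as class folders: ddi/benign/, ddi/malignant/.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
ddi_loader = DataLoader(datasets.ImageFolder("ddi/", transform=tfm),
                        batch_size=32, shuffle=True)

# Start from an ImageNet-pretrained backbone and replace the classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)  # benign vs. malignant

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for images, labels in ddi_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```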
The theoretical and practical implications of this research are significant: it advocates including diverse data in training sets to avoid exacerbating existing healthcare disparities. Underrepresentation of darker skin tones and uncommon conditions in AI training data risks perpetuating bias, thereby impeding equitable medical diagnostics. The authors' public release of the DDI dataset is a constructive step, enabling further work while underscoring the importance of dataset transparency and diversity.
Future developments in AI should account for the disparities identified here, with attention to task adaptation and domain-specific tuning as clinical applications evolve. Model developers are encouraged to adopt fairness-aware training methods and more diverse datasets to improve algorithmic equity in dermatological care, with the aim of reducing the racial and ethnic disparities prevalent in clinical practice.
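The paper does not prescribe a specific fairness-aware method; one common option is group-balanced resampling, sketched below, which oversamples underrepresented Fitzpatrick groups during training. The helper and its arguments are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, fst_groups, batch_size=32):
    """Oversample underrepresented Fitzpatrick groups during training.

    fst_groups is a list giving the FST group of each sample, e.g. "I-II"
    or "V-VI"; each sample is weighted inversely to its group's frequency
    so batches see skin-tone groups at roughly equal rates.
    """
    counts = {g: fst_groups.count(g) for g in set(fst_groups)}
    weights = torch.tensor([1.0 / counts[g] for g in fst_groups], dtype=torch.double)
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```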
Ultimately, the research offers a critical examination of the pitfalls of current dermatology AI models, prompting valuable discourse on modeling practices and paving the way for more inclusive and effective AI systems in healthcare.