Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset (2104.09957v1)

Published 20 Apr 2021 in cs.CV

Abstract: How does the accuracy of deep neural network models trained to classify clinical images of skin conditions vary across skin color? While recent studies demonstrate computer vision models can serve as a useful decision support tool in healthcare and provide dermatologist-level classification on a number of specific tasks, darker skin is underrepresented in the data. Most publicly available data sets do not include Fitzpatrick skin type labels. We annotate 16,577 clinical images sourced from two dermatology atlases with Fitzpatrick skin type labels and open-source these annotations. Based on these labels, we find that there are significantly more images of light skin types than dark skin types in this dataset. We train a deep neural network model to classify 114 skin conditions and find that the model is most accurate on skin types similar to those it was trained on. In addition, we evaluate how an algorithmic approach to identifying skin tones, individual typology angle, compares with Fitzpatrick skin type labels annotated by a team of human labelers.

Authors (8)

Matthew Groh
Caleb Harris
Luis Soenksen
Felix Lau
Rachel Han
Aerin Kim
Arash Koochek
Omar Badri

Summary

The paper "Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset" examines biases in dermatological datasets and their impact on machine learning model performance. The research evaluates deep neural networks (DNNs) in dermatology, particularly how their accuracy varies across skin tones, and is centered on the newly introduced Fitzpatrick 17k dataset. The dataset contains 16,577 clinical images labeled with Fitzpatrick skin types, which makes it possible to measure the underrepresentation of darker skin tones that most publicly available datasets, lacking such labels, cannot expose.

Contributions and Findings

The key contributions of the paper include:

Fitzpatrick 17k Dataset: The dataset comprises clinical images sourced from two dermatology atlases and labeled with 114 different skin conditions. The authors add skin type labels based on the Fitzpatrick scale and open-source these annotations, advancing the study of algorithmic fairness in skin condition classification.
Analysis of Skin Type Distribution: The paper highlights the imbalance in skin type distribution, revealing a predominance of lighter skin types. Only 2,168 images are of the darkest skin types (Fitzpatrick types 5 and 6), compared to 7,755 images from the lightest skin types (Fitzpatrick types 1 and 2).
Model Training and Evaluation: Using a model based on the VGG-16 architecture, the paper evaluates the accuracy of skin condition classification across Fitzpatrick skin types. The model is most accurate on skin types similar to those it was trained on, underlining the importance of diverse training data for robust model performance.
Comparative Analysis of Skin Tone Annotation Methods: The paper contrasts human annotations of Fitzpatrick skin types with a computational approach using Individual Typology Angle (ITA). The paper finds that although ITA is a promising technique, it exhibits limitations in consistency and accuracy compared to human labeling, especially when evaluating images with the darkest and lightest skin tones.
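
The summary above does not spell out how ITA is computed. As a minimal sketch, assuming the standard colorimetric definition ITA = arctan((L* - 50) / b*) × 180/π over CIELAB values, the angle can be computed as follows. The function name is hypothetical, and the sketch takes L* and b* directly because the summary does not describe how the authors extract skin pixels or average them.

```python
import math


def individual_typology_angle(l_star: float, b_star: float) -> float:
    """Return the ITA in degrees for one CIELAB colour (hypothetical helper).

    Assumes the standard definition ITA = arctan((L* - 50) / b*) * 180 / pi.
    Higher angles correspond to lighter skin tones.
    """
    if b_star == 0:
        # The ratio is undefined when b* is exactly zero.
        raise ValueError("b* must be non-zero to compute ITA")
    return math.degrees(math.atan((l_star - 50.0) / b_star))


# Illustrative inputs only; these are not values from the dataset.
lighter = individual_typology_angle(70.0, 20.0)
darker = individual_typology_angle(30.0, 20.0)
```

Because ITA depends only on measured lightness and the yellow-blue component, lighting conditions in a photograph shift it directly, which is one plausible reason an automated estimate can disagree with human Fitzpatrick labels.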

Implications and Future Directions

The findings underscore the necessity for balanced and diverse training datasets to mitigate biases in machine learning applications in dermatology. The introduction of the Fitzpatrick 17k dataset is a step towards ensuring that DNNs in dermatological applications can be evaluated fairly across different populations, potentially reducing disparities in healthcare outcomes.
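
Evaluating a classifier "fairly across different populations" amounts, at its simplest, to reporting accuracy separately for each skin type rather than as a single aggregate. The sketch below illustrates that idea under stated assumptions: the function name, the condition labels, and the toy predictions are all made up for illustration, and this is not the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for parallel sequences of labels and groups."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Toy data: made-up condition labels and Fitzpatrick types, not paper results.
y_true = ["psoriasis", "eczema", "psoriasis", "eczema", "acne", "acne"]
y_pred = ["psoriasis", "eczema", "eczema", "eczema", "acne", "psoriasis"]
skin_type = [1, 1, 5, 5, 3, 3]

per_type = accuracy_by_group(y_true, y_pred, skin_type)
print(per_type)
```

An aggregate accuracy over these six toy examples would hide the gap between groups, which is why a per-group breakdown is the more informative report when group sizes are as unequal as they are in this dataset.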

Future research should explore methods that account for the variable appearance of skin conditions across skin types, which contributes to performance disparities. This requires developing datasets and annotations that reflect the diverse manifestations of skin conditions.

Furthermore, the paper suggests improvements for computational methods, such as ITA, which could help automate skin type annotation in large datasets and contribute to more refined models in the future.

In closing, this research highlights an essential dimension in the evaluation of AI systems in healthcare, advocating for fairness and representativeness in dataset curation and model evaluation. This line of inquiry improves the rigor of AI research and is also crucial for ethical and equitable healthcare delivery.

GitHub

GitHub - mattgroh/fitzpatrick17k (119 stars)