Fitzpatrick17k Dataset Overview
- The Fitzpatrick17k dataset is a collection of 16,577 clinical images, each labeled with both a dermatological diagnosis and a Fitzpatrick Skin Type, enabling diagnostic AI and fairness studies.
- The dataset employs detailed annotation protocols and embedding-based cleaning to address issues like duplicates and label inconsistencies.
- Model evaluations reveal significant performance disparities across skin types, underscoring the need for balanced clinical imaging benchmarks.
Fitzpatrick17k is a large-scale, publicly available dermatology image dataset annotated for both skin condition and Fitzpatrick Skin Type (FST), supporting research in diagnostic deep learning and fairness evaluations across skin tones. Developed to address critical gaps in representation for darker skin types in medical image datasets, Fitzpatrick17k consists of 16,577 clinical photographs sourced from two specialized web atlases. The dataset's annotation protocol and subsequent audits have facilitated its use in benchmarking the performance and bias of machine learning models in dermatological imaging, fostering significant advances and scrutiny in algorithmic fairness, data quality practices, and subgroup-aware clinical AI.
1. Dataset Composition and Annotation
Fitzpatrick17k comprises 16,577 color clinical photographs, each labeled with both a dermatological diagnosis (among 114 distinct categories) and an FST label (ranging from Type I to VI). The dataset is constructed from two sources: DermaAmin (12,672 images) and Atlas Dermatologico (3,905 images) (Groh et al., 2021). FST assignment follows a six-point scale, with per-image consensus labels generated by 2–5 Scale AI annotators using a dynamic agreement protocol weighted by past accuracy against a 312-image dermatologist-defined gold standard.
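The vendor-side weighting scheme itself is not published; the minimal sketch below illustrates accuracy-weighted majority voting of the general kind described, with fixed per-annotator reliabilities standing in for the dynamic weights (the function, its inputs, and the majority threshold are illustrative assumptions, not the actual protocol).

```python
from collections import defaultdict

def weighted_consensus(votes, annotator_accuracy):
    """Accuracy-weighted majority vote over per-annotator FST labels (1-6).

    votes: dict annotator_id -> FST vote.
    annotator_accuracy: dict annotator_id -> historical accuracy against a
        gold-standard set (an illustrative stand-in for the vendor's
        dynamic reliability weighting).
    Returns the winning FST, or None (unknown, coded -1 in the annotations)
    when no label reaches a weighted majority.
    """
    scores = defaultdict(float)
    for annotator, fst in votes.items():
        scores[fst] += annotator_accuracy.get(annotator, 0.5)
    if not scores:
        return None
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score <= 0.5 * sum(scores.values()):
        return None  # no weighted majority: flag the image as unknown
    return best_label

# Two moderately reliable annotators outvote one more-reliable dissenter here.
print(weighted_consensus({"a1": 3, "a2": 4, "a3": 4},
                         {"a1": 0.90, "a2": 0.60, "a3": 0.55}))  # -> 4
```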
Skin-type distribution is skewed, with ~47% of images labeled as light skin types (I+II), ~37% as medium (III+IV), and ~13% as dark (V+VI) (Groh et al., 2021). Every image also has an accompanying diagnosis label, and about 3% of images remain with unknown FST due to annotator disagreement. Inter-annotator agreement on FST is moderate; exact consensus with the gold standard is lowest for intermediate types and highest for Type VI. A dermatologist audit estimates that about 3.4% of diagnosis labels are erroneous, aligning with error rates in similar computer vision benchmarks.
Table: Skin Type Distribution in Fitzpatrick17k
| FST Type | Approx. Count |
|---|---|
| 1 (Lightest) | ~3,200 |
| 2 | ~5,200 |
| 3 | ~2,600 |
| 4 | ~1,800 |
| 5 | ~1,500 |
| 6 (Darkest) | ~1,100 |
| Unknown / −1 | ~1,200 |
The dataset is distributed with CSV annotation files listing image IDs, diagnosis labels, consensus and per-annotator FST votes, and an unknown-FST flag. Scripts for reproducing the reported calculations and evaluations are included (Groh et al., 2021).
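As an example of working with these annotations, the short sketch below loads the CSV and recomputes the skin-type distribution; the column names (`fitzpatrick`, `label`) follow the commonly distributed annotation file but should be verified against the release actually downloaded.

```python
import pandas as pd

# Column names ("fitzpatrick", "label") are assumptions based on the commonly
# distributed annotation CSV; verify them against the release you download.
df = pd.read_csv("fitzpatrick17k.csv")

# FST distribution; -1 marks images whose annotators did not reach consensus.
print(df["fitzpatrick"].value_counts().sort_index())

# The ten most frequent of the 114 diagnosis labels.
print(df["label"].value_counts().head(10))

# Share of labeled images in the light (1-2), medium (3-4), and dark (5-6) groups.
known = df[df["fitzpatrick"] > 0]
groups = pd.cut(known["fitzpatrick"], bins=[0, 2, 4, 6],
                labels=["light", "medium", "dark"])
print(groups.value_counts(normalize=True).round(3))
```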
2. Identified Data Quality Issues
Systematic analysis revealed substantial quality concerns in the original release. Using the fastdup library, 6,622 image pairs (similarity ≥ 0.90) and 1,425 (≥ 0.95) were flagged as duplicates, with manual inspection affirming over 98% as true duplicates (Abhishek et al., 2024). Some clusters contained up to 10 near-identical images, with duplicates indiscriminately distributed across candidate train/test partitions. This introduces feature-level data leakage, artificially inflating model performance by allowing memorization of near-identical examples (Abhishek et al., 2024).
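Abhishek et al. perform this step with fastdup; the library-agnostic sketch below shows the same embedding-plus-cosine-similarity idea with an ImageNet-pretrained backbone, using the 0.90 threshold from the text (the backbone choice and all other details are illustrative, not fastdup's internals).

```python
from pathlib import Path

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models

# Embed every image with an ImageNet-pretrained backbone (an illustrative
# choice; fastdup uses its own internal embedding model).
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.fc = torch.nn.Identity()          # keep the 512-d pooled features
model.eval()
preprocess = weights.transforms()

paths = sorted(Path("images/").glob("*.jpg"))
with torch.no_grad():
    emb = torch.stack([
        model(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0)
        for p in paths
    ])
emb = F.normalize(emb, dim=1)

# Pairwise cosine similarity; flag pairs at or above the 0.90 threshold.
sim = emb @ emb.T
i, j = torch.triu_indices(len(paths), len(paths), offset=1)
mask = sim[i, j] >= 0.90
pairs = [(paths[a], paths[b], sim[a, b].item())
         for a, b in zip(i[mask].tolist(), j[mask].tolist())]
print(f"{len(pairs)} candidate duplicate pairs flagged for manual review")
```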
Label inconsistency is present in both diagnosis and skin type: dozens of high-similarity duplicate pairs contain conflicting diagnosis or FST annotations, including some with more than a single-step discrepancy in FST. Label errors are further confirmed by dermatologist audits and label clustering analysis.
Additionally, the absence of a rigorously segregated held-out test partition in early benchmarks enabled overoptimistic model assessment. Initial splits repurposed a "validation" set for both model selection and reporting, a protocol that, coupled with data leakage, undermines reproducibility and generalization estimates (Abhishek et al., 2024).
3. Data Cleaning, Partitioning, and Best Practices
To correct these issues, a cleaning protocol combining embedding-based duplicate detection, clustering, and manual review was implemented. Union-find clustering on high-similarity pairs was used to collapse duplicate clusters, retaining only the highest-resolution member when all labels were consistent, or discarding the entire cluster when label conflict was detected. Erroneous images were further removed based on similarity outlier scores relative to their nearest neighbors (Abhishek et al., 2024).
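A compact sketch of the cluster-collapse logic just described, assuming the high-similarity pairs and per-image metadata are already in hand (the field names and data structures are illustrative):

```python
def find(parent, x):
    # Path-halving find for the union-find structure over image IDs.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def collapse_duplicates(pairs, meta):
    """pairs: iterable of (img_a, img_b) with similarity above the threshold.
    meta: dict img_id -> {"label": ..., "fst": ..., "resolution": (w, h)}.
    Returns the set of image IDs to keep after collapsing duplicate clusters."""
    parent = {img: img for img in meta}
    for a, b in pairs:
        parent[find(parent, a)] = find(parent, b)   # union the two clusters

    clusters = {}
    for img in meta:
        clusters.setdefault(find(parent, img), []).append(img)

    keep = set()
    for members in clusters.values():
        annotations = {(meta[m]["label"], meta[m]["fst"]) for m in members}
        if len(annotations) > 1:
            continue  # conflicting diagnosis/FST labels: drop the whole cluster
        # Otherwise keep only the highest-resolution member of the cluster.
        keep.add(max(members,
                     key=lambda m: meta[m]["resolution"][0] * meta[m]["resolution"][1]))
    return keep
```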
Post-cleaning, the revised dataset (Fitzpatrick17k-C) contains 11,394 images. The recommended split stratifies by diagnosis into 70% train, 10% validation, and 20% test, ensuring robust model assessment. Best practices established for future use include patient- or image-cluster-level stratification, exclusive test-set holdout for final evaluation, audit of label consistency by examining high-similarity pairs, and detailed per-FST and per-diagnosis reporting to expose any latent or explicit subgroup biases (Abhishek et al., 2024).
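A minimal scikit-learn sketch of the recommended diagnosis-stratified 70/10/20 split, applied to the cleaned annotation table so that no duplicate cluster spans partitions:

```python
from sklearn.model_selection import train_test_split

# df_clean: the post-cleaning annotation table, one row per retained image.
# Very rare diagnoses (fewer than ~5 images) may need merging before stratifying.
train_df, test_df = train_test_split(
    df_clean, test_size=0.20, stratify=df_clean["label"], random_state=42)
# Carve the 10% validation set out of the remaining 80% (0.125 * 0.80 = 0.10).
train_df, val_df = train_test_split(
    train_df, test_size=0.125, stratify=train_df["label"], random_state=42)

print(len(train_df), len(val_df), len(test_df))   # roughly 70% / 10% / 20%
```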
4. Model Training Protocols and Evaluation Benchmarks
The Fitzpatrick17k dataset supports both discriminative and generative modeling. For classification, transfer learning pipelines use a VGG-16 backbone (pre-trained on ImageNet) with a custom fully connected classification head for the 114-way task or for aggregated taxonomies. Inputs undergo resizing, color-jitter augmentation, and normalization, with a weighted random sampler addressing disease label imbalance. Model evaluation employs stratified holdouts by skin type, source, and diagnosis, as well as controlled FST training/testing experiments. Top-1 accuracy on the full 114-way random holdout is approximately 20.2% (random baseline 0.9%), increasing to 62.4% for 3-class tasks (Groh et al., 2021).
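A minimal PyTorch sketch of such a pipeline is shown below; the head replacement, augmentations, and hyperparameters are illustrative choices rather than the exact configuration reported by Groh et al., and `train_df`/`train_dataset` are assumed to come from the split sketched earlier.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import models, transforms

NUM_CLASSES = 114

# ImageNet-pretrained VGG-16 with the final layer swapped for a 114-way head.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

train_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Weighted random sampling against diagnosis imbalance: each image is drawn
# with probability inversely proportional to its class frequency.
labels = torch.as_tensor(train_df["label_idx"].to_numpy())   # integer-encoded diagnoses (assumed column)
class_counts = torch.bincount(labels, minlength=NUM_CLASSES).float()
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)  # train_dataset wraps train_df (assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
```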
Generative fairness studies implement a variational autoencoder with residual blocks and a VGG19-based perceptual loss, optimized with Adam at a batch size of 64 and trained for 15 epochs (López-Pérez et al., 2025). Training sets are composed using various light/dark FST ratios for controlled subgroup fairness assessment.
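The published architecture is not reproduced here; the sketch below shows only a VGG19 feature-space perceptual loss of the kind described and how it would combine with the usual VAE reconstruction and KL terms (the layer cutoff and loss weights are illustrative assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class VGG19PerceptualLoss(nn.Module):
    """MSE between frozen VGG19 feature maps of reconstruction and target."""
    def __init__(self, layer_index=16):   # cutoff inside the third conv block; illustrative
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(vgg.features.children())[:layer_index]).eval()
        for p in self.features.parameters():
            p.requires_grad_(False)        # the loss network stays frozen

    def forward(self, reconstruction, target):
        return F.mse_loss(self.features(reconstruction), self.features(target))

perceptual = VGG19PerceptualLoss()

def vae_loss(x, x_hat, mu, logvar, beta=1.0, lam=0.1):   # beta/lam weights are illustrative
    recon = F.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl + lam * perceptual(x_hat, x)
```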
Fairness and subgroup accuracy evaluations consistently reveal substantial performance disparities aligned with skin type representation. For instance, when training exclusively on light (FST 1–2) data, MSE on dark skin (FST 5–6) is 0.04–0.05 higher than on light skin, while a balanced split reduces but does not eliminate this gap (López-Pérez et al., 2025). Classification models generalize best to skin types seen during training; models trained only on FST V–VI exhibit severely degraded accuracy when tested on lighter skin types and vice versa (Groh et al., 2021).
5. Measurement of Fairness and Uncertainty
Fairness assessment relies primarily on the mean squared error gap (Δ_MSE = MSE_dark − MSE_light) in generative settings and on per-skin-type or per-diagnosis classification accuracy in discriminative tasks. No formal statistical parity tests (e.g., Kolmogorov–Smirnov) are reported. Visual evaluation corroborates that generative models trained on predominantly light skin data yield reconstructions of dark skin lesions with reduced sharpness and tonal fidelity, even when representation is increased.
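Computing that gap from per-image reconstruction errors is straightforward; the helper below assumes each test image carries an FST label and uses the light (1–2) versus dark (5–6) grouping from the text.

```python
import numpy as np

def mse_gap(errors, fst):
    """Delta_MSE = MSE_dark (FST 5-6) - MSE_light (FST 1-2).

    errors: per-image reconstruction MSE on the test set.
    fst: per-image Fitzpatrick type (1-6) for the same images."""
    errors, fst = np.asarray(errors), np.asarray(fst)
    return errors[fst >= 5].mean() - errors[fst <= 2].mean()

print(mse_gap([0.02, 0.03, 0.07, 0.08], [1, 2, 5, 6]))   # -> 0.05
```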
Algorithmic alternatives to human labeling of FST, such as the Individual Typology Angle (ITA), exhibit high variability within each Fitzpatrick category and lower concordance with human annotations (53–70% within ±1 class), suggesting limited utility as a direct fairness proxy (Groh et al., 2021).
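For reference, ITA is computed from CIELAB values as ITA = arctan((L* − 50) / b*) · 180/π over (ideally non-lesional) skin pixels. The sketch below computes it with scikit-image; the category thresholds are the commonly cited Chardon-style bands and are an assumption here, since this section does not state which mapping was used.

```python
import numpy as np
from skimage import color, io

def ita_degrees(rgb_image):
    """ITA = arctan((L* - 50) / b*) * 180 / pi over the image's CIELAB pixels.
    In practice a non-lesional skin mask would be applied first; omitted here."""
    lab = color.rgb2lab(rgb_image)
    L, b = lab[..., 0], lab[..., 2]
    return np.degrees(np.arctan2(L - 50.0, b))   # arctan2 tolerates b* near zero

def ita_to_category(ita):
    # Commonly cited Chardon-style bands; an assumption, not taken from this section.
    bands = [(55, "very light"), (41, "light"), (28, "intermediate"),
             (10, "tan"), (-30, "brown")]
    for threshold, name in bands:
        if ita > threshold:
            return name
    return "dark"

img = io.imread("example_skin_patch.jpg") / 255.0   # hypothetical image path
print(ita_to_category(np.median(ita_degrees(img))))
```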
Conventional uncertainty quantification, such as the mean σ of VAE latent variables, is ineffective at signaling subgroup failure modes; average confidence does not correlate with accuracy differentials between light and dark skin (López-Pérez et al., 2025). This motivates the exploration of more advanced uncertainty estimation and out-of-distribution detection mechanisms in future work.
6. Dataset Limitations and Implications for Fairness in Clinical AI
The principal limitations of Fitzpatrick17k stem from persistent class imbalance and annotation challenges. Darker skin tones (FST 4–6) remain severely underrepresented (~20% of images) and are covered in fewer skin conditions, mirroring real-world data acquisition disparities (Groh et al., 2021). Ambiguity exists at the FST 3–4 boundary, often requiring manual exclusion to avoid confounded group definitions in fairness studies. Consensus-based annotation can also introduce label noise, with diagnostic and skin-type mislabeling present even in high-similarity duplicate pairs (Abhishek et al., 2024).
Performance benchmarks and fairness analyses show that, for both generative and discriminative models, performance is strongly correlated with representation frequency. Notably, even after balancing skin types during training, residual bias favoring lighter skin types persists, indicating that representation alone is insufficient to eliminate subgroup disparities (López-Pérez et al., 2025).
Key recommendations include the enrichment of the dataset with more images from underrepresented FSTs, improved and clinically validated annotation protocols, and the development of uncertainty quantification techniques that are sensitive to subgroup-specific failure. The persistent gap in generative and classification performance for darker skin images underscores the ethical and clinical urgency for larger, more balanced data sources and fairness-aware algorithmic development.
7. Data Release, Access, and Ongoing Community Impact
Fitzpatrick17k annotations, containing complete label and metadata information, are publicly available at https://github.com/mattgroh/fitzpatrick17k (Groh et al., 2021). Accompanying documentation and scripting resources support reproducible experiments and foster continued community engagement in data auditing and benchmarking. Subsequent efforts, such as the Fitzpatrick17k-C cleaned dataset, advance the baseline for rigorous dermatology AI evaluation by eliminating duplicates, erroneous annotations, and data leakage, and by advocating for robust model assessment methodologies (Abhishek et al., 2024).
The dataset has become an established reference for investigating algorithmic fairness, model generalizability, and representation bias in clinical image analysis. Its documented shortcomings catalyze ongoing improvements in medical dataset construction, labeling, and evaluation, with implications extending to the broader domain of high-stakes healthcare AI.