Fitzpatrick17k: Skin Diagnosis Dataset
- Fitzpatrick17k is an open-source clinical dermatology image benchmark with 16,577 photos across 114 skin conditions, annotated with Fitzpatrick skin types (I–VI).
- It enables studies on diagnostic accuracy and fairness by addressing demographic imbalances and evaluating deep learning model performance across diverse skin tones.
- Rigorous data curation, including duplicate removal, outlier filtering, and stratified splitting, enhances reliability and ensures leakage-free evaluations.
The Fitzpatrick17k dataset is an open-source, large-scale clinical dermatology image benchmark designed to facilitate the study of both diagnostic accuracy and demographic fairness in deep learning-based skin condition classification. It consists of 16,577 clinical photographs, each labeled with one of 114 skin disease classes and annotated by Fitzpatrick skin type (I–VI), supporting research into model performance across diverse skin tones. The dataset has been the basis for several influential fairness, benchmarking, and data quality studies in dermatological AI (Groh et al., 2021; Abhishek et al., 2024; Aayushman et al., 2024).
1. Dataset Construction and Annotation Protocols
Fitzpatrick17k was constructed from two major online clinical atlases:
- DermaAmin: 12,672 images.
- Atlas Dermatologico: 3,905 images.

All images are "in the wild" clinical photos, not limited to dermoscopic or histopathologic settings, and carry 114 distinct skin condition labels (each with at least 53 examples). Labels are further subsumed under three major clinical categories: malignant, benign, and non-neoplastic (Aayushman et al., 2024; Groh et al., 2021).
Fitzpatrick skin type (FST) annotations were provided per image. Each image was reviewed by 2–5 non-expert annotators under a dynamic consensus protocol, with annotation weights determined using a gold-standard subset of 312 images labeled by a board-certified dermatologist. Exact-match accuracy of these annotators (vs. dermatologist ground truth) ranged from 25% to 59% depending on FST, with off-by-one agreement between 71% and 85%, indicating moderate reliability (see the table below; a weighted-vote sketch follows the table) (Groh et al., 2021).
| Fitzpatrick Type | Exact Accuracy | Off-by-One Accuracy |
|---|---|---|
| I | 49% | 79% |
| II | 38% | 84% |
| III | 25% | 71% |
| IV | 26% | 71% |
| V | 34% | 85% |
| VI | 59% | 83% |
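The exact consensus-weighting scheme is not fully specified here; the snippet below is a minimal sketch of one plausible weighted-vote aggregation, assuming per-annotator weights derived from exact-match accuracy on the 312-image gold-standard subset. The function name and annotator IDs are hypothetical.

```python
from collections import defaultdict

def aggregate_fst(votes, annotator_weights):
    """Weighted vote over Fitzpatrick skin type (FST) labels for one image.

    votes: list of (annotator_id, fst_label) pairs.
    annotator_weights: dict of annotator_id -> weight, e.g. that annotator's
    exact-match accuracy on the gold-standard subset (assumed scheme).
    Returns the FST label with the highest total weight.
    """
    scores = defaultdict(float)
    for annotator_id, fst in votes:
        scores[fst] += annotator_weights.get(annotator_id, 1.0)
    return max(scores, key=scores.get)

# Hypothetical example: three annotators rate one image.
weights = {"a1": 0.49, "a2": 0.38, "a3": 0.59}
print(aggregate_fst([("a1", 2), ("a2", 3), ("a3", 2)], weights))  # -> 2
```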
Diagnosis labels were inherited from atlas metadata and were not rigorously confirmed by histopathology. For a subset (504/16,577), two board-certified dermatologists provided an independent review, finding clear diagnostic agreement for 69% of images and an outright mislabel rate of 3.4% (Abhishek et al., 2024).
2. Skin Type Distribution and Representation Bias
The dataset's Fitzpatrick distribution is heavily imbalanced:
- Types I–II (lightest): ~7,760 images (~47%)
- Types III–IV (mid): ~6,089 images (~37%)
- Types V–VI (darkest): ~2,170 images (~13%)
- Unknown: 565 images (~3%)
A more granular breakdown (weighted averages) yields:
- Type I: 2,960 (~18%)
- Type II: 4,810 (~29%)
- Type III: 3,310 (~20%)
- Type IV: 2,780 (~17%)
- Type V: 1,530 (~9%)
- Type VI: 640 (~4%)
This pronounced skew yields roughly 3.6 times more images of Types I–II than of Types V–VI, prompting critical scrutiny of fairness and performance disparities for underrepresented groups (Groh et al., 2021; Aayushman et al., 2024).
3. Data Quality Issues and Curation Approaches
Systematic analysis uncovered several quality concerns impacting the reliability of Fitzpatrick17k benchmarks (Abhishek et al., 2024):
- Duplicate Images: Using deep feature clustering (fastdup, cleanvision), over 6,600 image pairs at cosine similarity ≥0.90 were flagged, with 1,400+ at ≥0.95 (98.4% confirmed true duplicates, Cohen's κ = 0.87). Duplicates arose from near-exact copies, crop/zoom variants, and other clinical redundancies (a minimal similarity-thresholding sketch follows this list).
- Label Inconsistencies: Duplicate clusters sometimes carried different diagnosis or skin-type labels; 2,498 diagnosis-discordant pairs were found at similarity ≥0.90, along with over 4,000 pairs whose skin-type labels differed by ≥1.
- Outlier Content: Outlier analysis based on nearest-neighbor cosine distances revealed non-skin images, radiographs, histopathology slides, and other irrelevant cases.
- Data Partition Leakage: The commonly used "train" and "validation" partitions shared all images, violating standard train/validation/test separation and allowing models to "see" their evaluation samples during training.
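The internals of fastdup and cleanvision are not reproduced here; the following is a minimal sketch of the underlying idea, flagging near-duplicate pairs by thresholding cosine similarity over precomputed deep features. The brute-force pairwise comparison is for illustration only; the actual tools use scalable nearest-neighbor search.

```python
import numpy as np

def flag_duplicate_pairs(embeddings, threshold=0.90):
    """Flag image pairs whose deep-feature cosine similarity meets a threshold.

    embeddings: (n_images, d) array, e.g. features from an ImageNet-pretrained CNN.
    Returns a list of (i, j, similarity) tuples with i < j.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    return [(i, j, float(sims[i, j]))
            for i in range(n) for j in range(i + 1, n)
            if sims[i, j] >= threshold]
```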
To address these, a rigorous cleaning protocol was adopted:
- Merge all duplicates (retain one image per homogeneous cluster; remove the entire cluster if label conflicts are observed).
- Remove outliers whose embeddings show low similarity to skin images.
- Redefine partitions to yield "Fitzpatrick17k-C" (Clean):
- 11,394 images remain: 7,975 train (70%), 1,139 validation (10%), 2,280 test (20%), fully disjoint by image and patient and stratified by diagnosis (see the splitting sketch below).
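A minimal sketch of the 70/10/20 stratified split using scikit-learn, under the assumption that duplicates have already been merged so each image appears exactly once. Note that the published Fitzpatrick17k-C protocol is additionally disjoint by patient, which this simple image-level sketch does not enforce.

```python
from sklearn.model_selection import train_test_split

def stratified_splits(image_ids, labels, seed=42):
    """70/10/20 train/validation/test split, stratified by diagnosis label."""
    # Hold out 30% of images, stratified so class proportions are preserved.
    train_ids, rest_ids, _, rest_y = train_test_split(
        image_ids, labels, test_size=0.30, stratify=labels, random_state=seed)
    # Split the held-out 30% into 10% validation and 20% test (2/3 of the rest).
    val_ids, test_ids, _, _ = train_test_split(
        rest_ids, rest_y, test_size=2 / 3, stratify=rest_y, random_state=seed)
    return train_ids, val_ids, test_ids
```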
4. Benchmarking Methodologies and Preprocessing Pipelines
Benchmarking with Fitzpatrick17k involves both classification and fairness evaluations under various deep learning regimes:
- Image preprocessing: Images are resized to 224×224×3 and normalized by the ImageNet mean and standard deviation. Data augmentations include random resized crops (scale 0.8–1.0), ±15° rotations, and horizontal and vertical flips (Aayushman et al., 2024; Groh et al., 2021). A sketch of this pipeline, together with the balanced sampler below, follows this list.
- Label representations: Disease names and "eudermic skin" are embedded once using OpenAI's text-embedding-3-large model (Aayushman et al., 2024).
- Sampling strategies: Balanced random samplers weight samples ∝ 1/frequency; partitions ensure stratified holdouts by diagnosis, skin type, and source site.
- Modeling protocols: Standard architectures include ImageNet-pretrained VGG-16, with later works such as PatchAlign deploying transformer-based patch encodings and cross-domain text-image alignment.
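A minimal PyTorch/torchvision sketch of the described preprocessing and balanced sampling; the transform parameters mirror the list above, while the sampler helper and its name are illustrative.

```python
import torch
from torch.utils.data import WeightedRandomSampler
from torchvision import transforms

# Training-time augmentations matching the pipeline described above.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop, scale 0.8-1.0
    transforms.RandomRotation(15),                        # +/-15 degree rotations
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def balanced_sampler(labels):
    """Sampler weighting each image by 1 / (frequency of its diagnosis class)."""
    labels = torch.as_tensor(labels)
    counts = torch.bincount(labels)
    weights = 1.0 / counts[labels].float()
    return WeightedRandomSampler(weights, num_samples=len(labels),
                                 replacement=True)
```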
5. Performance Disparities and Fairness Metrics
Substantial performance disparities have been documented:
- Baseline results: A standard deep neural network (ImageNet-pretrained VGG-16) achieves overall top-1 accuracy of 20.2% on the 114-way task (random holdout), with the highest per-type accuracy on Type V (28.9%) and the lowest on both the lightest (Type I: 15.8%) and darkest (Type VI: 15.5%) skin types (Groh et al., 2021).
- Fairness-focused approaches: The PatchAlign framework introduces a Graph Optimal Transport (GOT) loss for cross-domain (image patch to text label) alignment, and an MGOT extension that predicts patch-based masks to down-weight non-lesion regions. The learning objective combines cross-entropy, a confusion loss (to purge FST signals), and the GOT/MGOT alignment term:
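The paper's exact loss weights are not reproduced here; schematically, with the $\lambda$ coefficients assumed to be tunable hyperparameters, the combined objective takes the form

$$
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}} \;+\; \lambda_{\mathrm{conf}}\,\mathcal{L}_{\mathrm{conf}} \;+\; \lambda_{\mathrm{align}}\,\mathcal{L}_{\mathrm{GOT/MGOT}},
$$

where $\mathcal{L}_{\mathrm{CE}}$ is the classification cross-entropy, $\mathcal{L}_{\mathrm{conf}}$ is the confusion loss penalizing FST-predictive features, and $\mathcal{L}_{\mathrm{GOT/MGOT}}$ aligns image-patch embeddings with label-text embeddings.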
For representative metrics on Fitzpatrick17k (Aayushman et al., 2024):
| Split | Model | Accuracy | EOM (True Positive Parity) | DPM (Demographic Parity) |
|---|---|---|---|---|
| In-domain | FairDisCo | 85.1% | 68.1% | 48.3% |
| In-domain | PatchAlign | 88.6% | 74.8% | 55.5% |
| Out-domain | FairDisCo | 79.5–71.5% | 64.6% (hardest) | – |
| Out-domain | PatchAlign | 84.0–77.6% | 75.1% (hardest) | – |
PatchAlign outperformed FairDisCo by 3.5 percentage points (pp) in in-domain accuracy (and by 6.7 pp in EOM), and by up to 6.2 pp in out-of-domain setups, showing improved fairness and accuracy in the presence of FST imbalance.
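Definitions of EOM and DPM vary slightly across papers; the sketch below assumes the common minimum-to-maximum ratio form: per class, the smallest group true positive rate divided by the largest (EOM), and the analogous ratio of positive prediction rates (DPM), each averaged over classes. The function name and interface are illustrative.

```python
import numpy as np

def eom_dpm(y_true, y_pred, groups, n_classes):
    """Group-fairness scores under assumed min/max-ratio definitions.

    EOM: per class, min over FST groups of the true positive rate divided by
    the max over groups (true positive parity). DPM: the same ratio computed
    on positive prediction rates (demographic parity). Both averaged over classes.
    """
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    eom_terms, dpm_terms = [], []
    for c in range(n_classes):
        tprs, rates = [], []
        for g in np.unique(groups):
            in_g = groups == g
            pos = in_g & (y_true == c)
            if pos.any():
                tprs.append((y_pred[pos] == c).mean())
            rates.append((y_pred[in_g] == c).mean())
        if tprs and max(tprs) > 0:
            eom_terms.append(min(tprs) / max(tprs))
        if max(rates) > 0:
            dpm_terms.append(min(rates) / max(rates))
    return float(np.mean(eom_terms)), float(np.mean(dpm_terms))
```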
6. Data-Driven Recommendations and Best Practices
Analyses across versions of Fitzpatrick17k and its cleaned derivative have emphasized several critical dataset usage recommendations (Abhishek et al., 2024):
- Use subject-level or lesion-level splits to avoid duplicate/near-duplicate leakage across partitions.
- Ensure a strictly held-out test set for all final evaluations; eliminate overlap with early stopping or validation samples.
- Employ deep embedding-based methods (such as fastdup) for duplicate detection, combined with manual review.
- Leverage embedding-based outlier filtering to remove non-clinical and spurious images.
- Quantitative reports should include per-FST metrics (accuracy, macro-F1), not only overall outcomes, and should supplement these with fairness-oriented measures (e.g., Equality of Opportunity, Demographic Parity); a per-FST reporting sketch follows this list.
- Diagnosis and FST labels should be periodically (re-)verified by clinical experts, with tolerance protocols (e.g., off-by-one FST) to characterize annotator agreement.
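A minimal sketch of per-FST reporting with scikit-learn, as recommended above; the function name and dictionary layout are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def per_fst_report(y_true, y_pred, fst):
    """Accuracy and macro-F1 computed separately for each Fitzpatrick type."""
    y_true, y_pred, fst = map(np.asarray, (y_true, y_pred, fst))
    report = {}
    for t in np.unique(fst):
        mask = fst == t
        report[int(t)] = {
            "n": int(mask.sum()),
            "accuracy": accuracy_score(y_true[mask], y_pred[mask]),
            "macro_f1": f1_score(y_true[mask], y_pred[mask], average="macro"),
        }
    return report
```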
Collectively, these analyses indicate that robust data curation and partitioning protocols are imperative for reproducible, fair, and clinically relevant dermatological image analysis benchmarks.
7. Impact, Limitations, and Ongoing Evolution
Fitzpatrick17k and its cleaned variant ("Fitzpatrick17k-C") have become canonical benchmarks for evaluating both mainstream diagnostic accuracy and subgroup robustness in dermatology AI (Groh et al., 2021; Abhishek et al., 2024; Aayushman et al., 2024). Key impacts and caveats include:
- Explicit demonstration of demographic fairness gaps in current deep learning models, especially for underrepresented skin types (V–VI).
- Rigorous documentation and replicable benchmarks enabling comparative studies and the quantification of real-world bias.
- Limitations remain due to diagnosis label noise, non-expert FST labeling, and persistent underrepresentation of the darkest skin tones, which can impair generalizability.
- The dataset is an evolving resource, with the Fitzpatrick17k-C release offering improved benchmarking validity by removing duplicates, correcting mislabels, and introducing stratified, leakage-free evaluation splits (Abhishek et al., 2024).
Fitzpatrick17k thus serves as both a foundational benchmark and a continual case study in best practices for equity, validity, and transparency in medical imaging AI.