Fitzpatrick17k Dataset for Dermatology Fairness Research
- The Fitzpatrick17k dataset pairs 16,577 clinical dermatology images with skin-condition and Fitzpatrick Skin Type (FST) labels, enabling fairness analyses across skin tones.
- Independent audits employ consensus review and embedding-based duplicate detection to expose annotation noise, near-duplicate images, and train-test leakage in the original release.
- Experimental results across discriminative and generative models reveal persistent disparities in diagnostic and reconstruction performance for underrepresented (darker) skin types.
The Fitzpatrick17k dataset is a large-scale, annotated collection of clinical dermatology images built to support machine learning research in skin-condition diagnosis and fairness analyses across diverse skin tones. It addresses a critical gap in publicly available medical image data by providing per-image Fitzpatrick Skin Type (FST) labels and spanning a wide range of dermatological conditions, enabling rigorous investigation of algorithmic performance disparities and representation bias in clinical AI.
1. Dataset Composition and Annotation
Fitzpatrick17k consists of 16,577 color clinical photographs collected from two public web-based atlases: DermaAmin (12,672 images) and Atlas Dermatologico (3,905 images) (Groh et al., 2021, Abhishek et al., 2024). Each image is assigned two primary annotations: a skin-condition label (covering 114 distinct diagnostic categories such as "psoriasis," "melanoma," "eczema," etc.) and a Fitzpatrick skin type label (FST ∈ {1–6}). The Fitzpatrick classification follows the canonical six-point dermatological scale (Type I: very fair, through Type VI: deeply pigmented) (Groh et al., 2021).
Skin-type annotation was performed via the Scale AI platform, leveraging a consensus workflow in which 2–5 human raters labeled each image. Annotator votes were weighted by historical performance against a dermatologist-vetted gold standard. This process resulted in 72,277 total annotations, with consensus labels produced for 16,012 images and 565 instances (3.4%) marked "unknown" due to disagreement (Groh et al., 2021). Inter-annotator agreement rates showed exact-match frequencies varying by skin type (e.g., 49% for Type I, 59% for Type VI), with "off by one" agreement reaching 71–85% (Groh et al., 2021).
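The exact Scale AI weighting scheme is not public; the following is a minimal sketch of performance-weighted consensus voting of the kind described, with the weight values and tie-handling rule purely illustrative:

```python
from collections import defaultdict

def weighted_consensus(votes, weights, min_margin=0.0):
    """Aggregate annotator votes for one image into a consensus FST label.

    votes:   list of (annotator_id, fst_label) pairs
    weights: annotator_id -> historical accuracy against the gold standard
    Returns the winning label, or None (i.e., "unknown") on a near-tie.
    """
    tally = defaultdict(float)
    for annotator, label in votes:
        tally[label] += weights.get(annotator, 1.0)
    ranked = sorted(tally.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= min_margin:
        return None  # unresolved disagreement -> "unknown"
    return ranked[0][0]

# Hypothetical raters: a1 is historically the most reliable
weights = {"a1": 0.90, "a2": 0.60, "a3": 0.55}
print(weighted_consensus([("a1", 2), ("a2", 3), ("a3", 3)], weights))  # -> 3
```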
The skin-type distribution reveals pronounced class imbalance:
| Fitzpatrick Type | Image Count (approx.) | Share of Dataset (%) |
|---|---|---|
| I (lightest) | 3,200 | ~19% |
| II | 5,200 | ~31% |
| III | 2,600 | ~16% |
| IV | 1,800 | ~11% |
| V | 1,500 | ~9% |
| VI (darkest) | 1,100 | ~7% |
| Missing ("−1") | 1,200 | ~7% |
Types IV–VI (darker) are strongly underrepresented, with Types I–II comprising more than half of the non-missing-labeled images (López-Pérez et al., 20 Jan 2025, Groh et al., 2021, Abhishek et al., 2024). Each image's FST and condition labels are provided in a publicly released CSV, with associated annotation metadata and scripts for reproducibility (Groh et al., 2021).
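A short sketch of reproducing the distribution above from the released CSV, assuming the column names `fitzpatrick` (coded 1–6, with −1 for missing) and `label` used in the public file:

```python
import pandas as pd

# Public annotation file from the Fitzpatrick17k release (path assumed)
df = pd.read_csv("fitzpatrick17k.csv")

# Per-type image counts and dataset shares
counts = df["fitzpatrick"].value_counts().sort_index()
shares = (counts / len(df) * 100).round(1)
print(pd.DataFrame({"images": counts, "percent": shares}))

# Condition coverage per skin type (e.g., Type VI spans 89 of 114 labels)
print(df.groupby("fitzpatrick")["label"].nunique())
```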
Per the original release and subsequent audits, label quality is limited: in a gold-standard subset (n = 504) reviewed by dermatologists, only 69% of labels were unambiguously correct, with a 3.4% explicit error rate (Abhishek et al., 2024, Groh et al., 2021). Condition coverage also skews towards lighter types; for example, Type VI is represented in only 89 of 114 diagnostic categories (Groh et al., 2021).
2. Data Quality: Duplicates, Leakage, and Label Error
Independent assessments have identified critical data quality challenges in Fitzpatrick17k that directly impact benchmark validity (Abhishek et al., 2024):
- Duplicate Images:
Approximately 1,425 pairs (cosine similarity ≥ 0.95 in embedding space) were confirmed as near-duplicates, with an additional 6,622 pairs at a lower threshold (≥ 0.90). Union-find clustering yielded 2,297 duplicate clusters, some containing up to 10 images, which were largely unfiltered in the original dataset.
- Label Inconsistencies:
Within these duplicate sets, 93 image pairs were found to have conflicting diagnosis labels, and hundreds more showed FST discrepancies of at least one step; a subset disagreed by more than one FST unit (Abhishek et al., 2024).
- Train-Test Leakage:
Random or stratified splits in the presence of duplicates can result in image pairs with the same subject/lesion being present in both training and test partitions. This feature-level leakage enables models to "memorize" test data, artificially inflating accuracy metrics (Abhishek et al., 2024).
- Partitioning Flaws:
The initial Fitzpatrick17k benchmarks did not provide a strictly disjoint test set; the same "validation" split was used for checkpoint selection and reporting, violating protocols for generalization assessment (Abhishek et al., 2024).
The cumulative effect of these issues is substantial. After duplicate and erroneous-image removal, overall classifier accuracy on 114-way classification in a held-out test partition drops from 22.25% (original, inflated) to 11.48% (cleaned) (Abhishek et al., 2024).
Recommendations include applying embedding-based duplicate detection and cluster-based deduplication pipelines (sketched below), reviewing label consistency within clusters, and standardizing partitions (e.g., 70:10:20 train:val:test) stratified by diagnosis (Abhishek et al., 2024).
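A minimal sketch of such a deduplication pass, assuming L2-normalized image embeddings have already been computed (the encoder used in the audit is not reproduced here):

```python
import numpy as np

def find_duplicate_clusters(embeddings: np.ndarray, threshold: float = 0.95):
    """Group near-duplicate images via cosine similarity and union-find.

    embeddings: (n, d) array of L2-normalized image features.
    Returns clusters (lists of image indices) with more than one member.
    """
    n = len(embeddings)
    parent = list(range(n))

    def find(i):  # root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Dense similarity matrix; for larger corpora, compute in chunks
    sims = embeddings @ embeddings.T
    for i, j in zip(*np.where(np.triu(sims, k=1) >= threshold)):
        union(i, j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return [c for c in clusters.values() if len(c) > 1]
```

Resulting clusters can then be reviewed for diagnosis or FST discordance and assigned wholesale to a single partition so that no near-duplicate straddles the train-test boundary.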
3. Benchmarking: Model Architectures and Evaluation Strategies
Fitzpatrick17k has been utilized as a benchmark for both discriminative and generative neural network models in clinical dermatology (Groh et al., 2021, López-Pérez et al., 20 Jan 2025):
- Discriminative Models:
A VGG-16 backbone (ImageNet-pretrained) with a modified classification head was used for 114-way diagnosis, with strong data augmentation and weighted sampling to address class imbalance (see the data-pipeline sketch below) (Groh et al., 2021). Default splits included stratified random holdouts, source-wise splits (per atlas), and skin-type-based splits, enabling assessment of model generalization to underrepresented groups.
- Generative Models:
For fairness analysis, a convolutional VAE with ResNet-style blocks was employed. The encoder and decoder utilized stacks of residual blocks with BatchNorm and ELU activations, generating an 8×8×64 latent code. Training optimized the standard VAE evidence lower bound (ELBO) loss plus a perceptual loss (derived from VGG19 features) for sharpened reconstructions; the full objective was ℒ = ℒ_ELBO + λ·ℒ_perceptual, trained using Adam (β₁ = 0.9, β₂ = 0.999, lr = 1×10⁻⁴, batch size 64, 15 epochs) (López-Pérez et al., 20 Jan 2025).
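A condensed sketch of this objective, assuming an MSE reconstruction term and a λ-weighted perceptual term taken from an intermediate VGG19 feature map (the exact layer and weighting are assumptions):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Frozen VGG19 feature extractor for the perceptual term
# (cutting at layer 16, roughly relu3_3, is an assumption)
vgg_feats = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_feats.parameters():
    p.requires_grad_(False)

def vae_loss(x, x_hat, mu, logvar, lam=1.0):
    """Negative ELBO (reconstruction + KL) plus VGG19 perceptual loss."""
    recon = F.mse_loss(x_hat, x)                           # pixel-space term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    perceptual = F.mse_loss(vgg_feats(x_hat), vgg_feats(x))
    return recon + kl + lam * perceptual                   # lam is assumed
```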
Preprocessing included image resizing/cropping (commonly 128×128 for generative models or 224×224 for classification), normalization, and, in fairness studies, exclusion of ambiguous FSTs (typically FST 3–4) to compare "light" (FST 1–2) and "dark" (FST 5–6) cohort outcomes (López-Pérez et al., 20 Jan 2025).
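A sketch of this pipeline, combining the cohort filtering with the imbalance-aware sampling used for the discriminative models; the file path, column names, and transform parameters are assumptions:

```python
import pandas as pd
import torch
from torchvision import transforms

# 128x128 for the generative pipeline; swap in 224 for classification
preprocess = transforms.Compose([
    transforms.Resize(144),
    transforms.CenterCrop(128),
    transforms.ToTensor(),  # also rescales pixel values to [0, 1]
])

df = pd.read_csv("fitzpatrick17k.csv")

# Fairness-study cohorts; ambiguous FST 3-4 excluded
light = df[df["fitzpatrick"].isin([1, 2])]
dark = df[df["fitzpatrick"].isin([5, 6])]

# Imbalance-aware sampling: weight each image inversely to the frequency
# of its diagnosis so rare conditions are seen as often as common ones
freq = df["label"].value_counts()
sample_weights = torch.tensor(
    (1.0 / freq[df["label"]]).to_numpy(), dtype=torch.double
)
sampler = torch.utils.data.WeightedRandomSampler(
    sample_weights, num_samples=len(df)
)
```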
4. Performance Disparities and Fairness Analysis
Model performance exhibits marked dependence on both the skin-type distribution in the training set and the FST of test images (Groh et al., 2021, López-Pérez et al., 20 Jan 2025):
- Discriminative Tasks:
Exact accuracy on the full 114-class task (stratified holdout): 20.2%, with per-type accuracy varying from 15.5% (Type VI) to 28.9% (Type V) (Groh et al., 2021). Median per-condition accuracy: 20.0%. When models are trained on Types I–II and tested on Types V–VI, top-1 accuracy drops to 9.0%, reflecting poor transfer to darker skin with sparse representation (Groh et al., 2021).
- Generative Tasks:
In VAE-based reconstruction, MSE (mean-squared error) is lowest for skin types overrepresented in training. For example, under "100% light" training, MSE_dark ≈ 0.06–0.07 vs. MSE_light ≈ 0.02; balanced (50/50) training reduces but does not eliminate the performance gap (MSE_dark ≈ 0.05, MSE_light ≈ 0.03) (López-Pérez et al., 20 Jan 2025). Even under balanced training, residual bias persists (Δ_MSE ≈ 0.02).
VAE uncertainty measures (average latent σ) do not reliably flag regions of fairness failure; σ̄(z) shows no systematic difference between light and dark cohorts (López-Pérez et al., 20 Jan 2025).
- Visual Quality:
Reconstructions of darker-skinned lesions often appear blurrier or exhibit color shifting, particularly when such images are underrepresented in the training split (López-Pérez et al., 20 Jan 2025).
- Proxy Skin-Tone Labeling:
Alternative approaches using the Individual Typology Angle (ITA) derived from color metrics achieve only moderate concordance with human-annotated FSTs (overall ±1 agreement: 60–70%), with high intra-category variance limiting their utility in fairness-critical settings (Groh et al., 2021).
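ITA is computed from CIELAB coordinates as ITA = arctan((L* − 50) / b*) · 180/π; a minimal scikit-image sketch follows, with patch selection and FST-binning thresholds (which vary across studies) left out:

```python
import numpy as np
from skimage import color

def individual_typology_angle(rgb_patch: np.ndarray) -> float:
    """Median ITA (degrees) over a lesion-free skin patch.

    rgb_patch: (h, w, 3) float array in [0, 1]; higher ITA = lighter skin.
    """
    lab = color.rgb2lab(rgb_patch)
    L, b = lab[..., 0], lab[..., 2]
    ita = np.degrees(np.arctan2(L - 50.0, b))
    return float(np.median(ita))  # median resists specular highlights
```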
5. Limitations, Best Practices, and Recommendations
The Fitzpatrick17k dataset's structure and curation introduce several well-documented limitations:
- Severe Class Imbalance:
FST 4–6 (darker skin) are systematically underrepresented, impacting both model training and condition coverage (e.g., only 89 of 114 conditions are represented for Type VI) (Groh et al., 2021, López-Pérez et al., 20 Jan 2025).
- Annotation Ambiguities and Noise:
Annotator disagreement and evidence of label errors (unambiguously correct in only 69% of gold-standard reviews) necessitate future upgrades in labeling workflows, such as increased expert review and leveraging clinical metadata (Abhishek et al., 2024, López-Pérez et al., 20 Jan 2025).
- Data Leakage and Duplicates:
Random partitioning in the presence of duplicates and the lack of test-set isolation undermine the validity of prior published benchmarks. Stringent deduplication, outlier detection, and robust, cluster-level stratification are advised for future users (Abhishek et al., 2024).
Best practices and corrective protocols include:
- Deduplicate using learned-embedding similarity followed by union-find clustering and cluster-level curation (Abhishek et al., 2024).
- Exclude clusters showing label discordance (in diagnosis or FST).
- Reserve strictly held-out test splits not used for model selection or validation (Abhishek et al., 2024).
- Report all key metrics (accuracy, AUC, F1) stratified by FST and diagnosis (Abhishek et al., 2024); a reporting sketch follows this list.
- Expand representation of FST 4–6 and supplement the dataset with richer clinical context (López-Pérez et al., 20 Jan 2025).
- Improve uncertainty quantification tools (e.g., hierarchical VAEs, ensembles) to capture subgroup failure modes (López-Pérez et al., 20 Jan 2025).
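A compact sketch of the stratified reporting recommended above, assuming a per-image results table with hypothetical column names:

```python
import pandas as pd
from sklearn.metrics import f1_score

def report_by_fst(results: pd.DataFrame) -> pd.DataFrame:
    """Per-skin-type accuracy and macro-F1 from per-image predictions.

    results: columns 'fitzpatrick', 'y_true', 'y_pred' (names assumed).
    """
    rows = []
    for fst, grp in results.groupby("fitzpatrick"):
        rows.append({
            "fst": fst,
            "n": len(grp),
            "accuracy": (grp["y_true"] == grp["y_pred"]).mean(),
            "macro_f1": f1_score(grp["y_true"], grp["y_pred"],
                                 average="macro"),
        })
    return pd.DataFrame(rows).set_index("fst")
```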
6. Summary of Impact and Future Directions
Fitzpatrick17k is the principal large-scale, richly annotated dermatological dataset publicly available with FST labeling. Analyses across several studies reveal that it enables quantitative evaluation of fairness in both discriminative and generative models but also reproduces, and sometimes amplifies, existing clinical representation biases (Groh et al., 2021, López-Pérez et al., 20 Jan 2025). Even after balanced sampling, generative models (e.g., a VAE with perceptual loss) show elevated error when reconstructing darker skin, and their built-in uncertainty estimates do not signal subpopulation risk (López-Pérez et al., 20 Jan 2025).
A plausible implication is that richer, more balanced datasets are needed for rigorous fairness assessment in medical imaging, and that improved annotation protocols, stratified evaluation, and sophisticated uncertainty estimation must become standard practice in the field. Ongoing dataset refinement, including deduplication, error correction, and balanced expansion, is crucial for producing trustworthy, generalizable dermatological AI benchmarks (Abhishek et al., 2024, López-Pérez et al., 20 Jan 2025, Groh et al., 2021).