The Fitzpatrick17k Dataset is a large-scale, open-source benchmark for analyzing performance and fairness properties of deep learning algorithms in clinical skin-condition classification across diverse skin tones. It is specifically designed to address disparities in algorithmic dermatology and to foster equitable computer vision models for skin disease diagnosis.
1. Dataset Composition and Annotation Protocol
Fitzpatrick17k consists of 16,577 clinical photographs representing 114 distinct skin-condition classes. These conditions span three broad categories: malignant, benign, and non-neoplastic diseases. Image sources include DermaAmin (12,672 images) and Atlas Dermatologico (3,905 images), with all photographs taken "in the wild" rather than under controlled dermoscopic or histopathologic conditions (Groh et al., 2021; Abhishek et al., 2024).
Skin tone labeling follows the Fitzpatrick I–VI scale, a categorical representation of skin types widely used in dermatology. Non-expert annotators assigned skin-type labels to each image; a dynamic consensus aggregation was employed, enhanced by a gold-standard reference set of 312 images labeled by board-certified dermatologists. In total, 72,277 individual skin-type assignments yielded consensus at the image level, with exact-match accuracy against dermatologists ranging from 25% (Type III/IV) to 59% (Type VI), and “off-by-one” concordance rates between 71% and 85%. Diagnosis annotations are inherited from atlas metadata; a sub-study found that 69.0% of dermatologist-verified samples were "clearly diagnostic," while 3.4% were confirmed as mislabeled (Groh et al., 2021; Abhishek et al., 2024).
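These agreement figures are simple rates over paired labels; the sketch below shows how exact-match and off-by-one concordance can be computed (the toy labels are illustrative, not the actual Fitzpatrick17k annotations):

```python
import numpy as np

def agreement_rates(annotator_labels, gold_labels):
    """Exact-match and off-by-one agreement between Fitzpatrick labels (integers 1-6)."""
    a = np.asarray(annotator_labels)
    g = np.asarray(gold_labels)
    exact = float(np.mean(a == g))
    off_by_one = float(np.mean(np.abs(a - g) <= 1))  # neighboring types count as concordant
    return exact, off_by_one

# Toy example with hypothetical labels:
exact, near = agreement_rates([2, 3, 5, 6, 1], [2, 4, 5, 5, 3])
print(f"exact-match: {exact:.0%}, off-by-one: {near:.0%}")  # 40%, 80%
```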
Table: Fitzpatrick17k Skin-Type Breakdown
| Fitzpatrick Type | Approximate Image Count | Percentage of Dataset |
|---|---|---|
| I (very fair) | ~3,200 | ~17.8–19.3% |
| II (fair) | ~2,900–4,800 | ~17.5–29.0% |
| III (medium) | ~3,000–3,300 | ~18.1–19.9% |
| IV (olive) | ~2,780–3,300 | ~16.8–19.9% |
| V (brown) | ~1,530–2,600 | ~9.2–15.7% |
| VI (dark) | ~640–2,200 | ~3.8–13.3% |
Reported counts and percentages vary slightly between sources due to annotation and stratification protocols.
2. Data Quality: Duplicates, Leakage, and Label Issues
Subsequent analysis uncovered significant data quality concerns (Abhishek et al., 2024). Deep-feature cosine similarity (e.g., 960-dimensional fastdup embeddings) revealed extensive duplication: at a permissive similarity threshold, 6,622 candidate pairs were flagged; at a stricter threshold, 1,425 pairs, of which 98.4% were true duplicates (agreement assessed via Cohen’s κ). Clustering revealed 2,297 duplicate groups, some containing as many as 10 images.
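A minimal sketch of this style of embedding-based duplicate flagging follows (cosine similarity over L2-normalized deep features; the threshold value is illustrative, not the exact one used by Abhishek et al., 2024):

```python
import numpy as np

def flag_duplicate_pairs(embeddings, threshold=0.9):
    """Return index pairs whose cosine similarity meets or exceeds `threshold`.

    embeddings: (n_images, dim) deep-feature matrix, e.g. 960-dimensional
    fastdup vectors. O(n^2) memory; chunk the matrix product for large n.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    x /= np.linalg.norm(x, axis=1, keepdims=True)    # L2-normalize rows
    sim = x @ x.T                                    # full cosine-similarity matrix
    i, j = np.where(np.triu(sim, k=1) >= threshold)  # upper triangle: each pair once
    return list(zip(i.tolist(), j.tolist()))

# Toy example: five random feature vectors, one a near-copy of another.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 960))
emb[4] = emb[0] + 0.01 * rng.normal(size=960)
print(flag_duplicate_pairs(emb))  # -> [(0, 4)]
```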
Diagnosis and skin-type label inconsistencies emerged within duplicate clusters: among the flagged pairs, 2,498 disagreed on diagnosis and 4,030 disagreed on skin type. Furthermore, images unrelated to skin (animals, plants, radiographs) were detected using outlier embedding scores.
Partitioning flaws were also noted: the original split used identical “validation” and “test” partitions, violating standard evaluation protocols and inflating reported performance.
3. Cleaning and Standardization: Fitzpatrick17k-C
A rigorous data cleaning pipeline produced Fitzpatrick17k-C, a revised benchmark. All duplicates were merged via transitive closure; within each cluster, only the highest-resolution sample with consistent diagnosis and Fitzpatrick skin type (FST) labels was retained. Clusters with conflicting annotations were removed entirely, and non-skin outliers were excluded by embedding-based heuristics (Abhishek et al., 2024).
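Merging flagged pairs via transitive closure is a standard union-find computation; the following is an illustrative sketch, not the authors' released pipeline:

```python
def duplicate_clusters(n_images, duplicate_pairs):
    """Group images into duplicate clusters via transitive closure (union-find)."""
    parent = list(range(n_images))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)          # union the two clusters

    clusters = {}
    for img in range(n_images):
        clusters.setdefault(find(img), []).append(img)
    return [c for c in clusters.values() if len(c) > 1]  # only multi-image groups

# Pairs (0,1) and (1,2) merge transitively into one cluster:
print(duplicate_clusters(5, [(0, 1), (1, 2)]))  # -> [[0, 1, 2]]
```

Each surviving cluster then contributes at most one image to Fitzpatrick17k-C, chosen as described above.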
Table: Fitzpatrick17k-C Partition Statistics
| Partition | Image Count | Percentage | Characteristics |
|---|---|---|---|
| Train | 7,975 | 70% | Disjoint at image/patient level |
| Validation | 1,139 | 10% | Stratified on diagnosis |
| Test | 2,280 | 20% | Stratified on diagnosis |
All splits are non-overlapping and designed to prevent information leakage.
4. Preprocessing and Algorithmic Benchmarks
Preprocessing employs random cropping (scale 0.8–1.0), random rotations (±15°), horizontal/vertical flipping, resizing to 224×224×3, and normalization with ImageNet statistics. Balanced sampling based on class frequency and targeted data augmentation mitigate class imbalance and the under-representation of Fitzpatrick Types V and VI (Groh et al., 2021; Aayushman et al., 2024).
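A minimal torchvision sketch of this pipeline (parameter values follow the description above; the papers' released training code may differ):

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop covering 80-100% of the image
    T.RandomRotation(degrees=15),                # rotations within +/-15 degrees
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.ToTensor(),                                # PIL image -> float tensor in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet channel statistics
                std=[0.229, 0.224, 0.225]),
])
```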
Text label representations for the 114 disease names plus "eudermic skin" are generated using OpenAI's text-embedding-3-large model. In PatchAlign experiments, alignment between image-patch features and text-label embeddings is performed via Graph Optimal Transport (GOT) regularization, extended by Masked GOT (MGOT) to reduce irrelevant patch noise (Aayushman et al., 2024).
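Generating such label embeddings is a single API call; a minimal sketch with the OpenAI Python client (assumes `OPENAI_API_KEY` is set in the environment; the exact label strings used by PatchAlign are an assumption here):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Hypothetical subset of the 114 disease names plus the healthy-skin label:
labels = ["psoriasis", "melanoma", "eudermic skin"]
response = client.embeddings.create(model="text-embedding-3-large", input=labels)
label_embeddings = [item.embedding for item in response.data]
print(len(label_embeddings), len(label_embeddings[0]))  # 3 vectors, 3072-dim each
```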
PatchAlign’s loss integrates cross-entropy, a confusion loss that removes skin-type dependency from the learned features, and the GOT/MGOT alignment term; schematically, $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{conf}}\mathcal{L}_{\mathrm{conf}} + \lambda_{\mathrm{GOT}}\mathcal{L}_{\mathrm{GOT/MGOT}}$, where the $\lambda$ coefficients weight the auxiliary terms (Aayushman et al., 2024).
5. Performance, Fairness, and Skin-Tone Bias
Benchmarks reveal substantial performance disparities across skin types. A VGG-16 trained on the original splits achieves overall top-1 accuracy of 20.2% (random holdout; 114 classes). Per-type accuracies range from 15.5% (Type VI) to 28.9% (Type V); in holdout experiments that partition by skin type, models perform best on the skin tones prevalent in their training data (Groh et al., 2021).
PatchAlign achieves results superior to FairDisCo and standard baselines:
- In-domain accuracy (80/20 split): PatchAlign 88.6%, FairDisCo 85.1% (Δ+3.5 pp).
- Out-of-domain accuracy (leave-two-skin-types-out): an average improvement of +6.2% for PatchAlign.
- True-positive-rate parity (Equality of Opportunity): PatchAlign 74.8% vs. FairDisCo 68.1% (Δ+6.7 pp).
- Predictive Quality Disparity and Demographic Parity also show improvement.
Bias in representation (Types I–II comprise ~47% of images versus ~13% for Types V–VI) directly translates to prediction disparities. Even with “fair” algorithms, true-positive parity and accuracy lag for darker skin tones, underscoring the need for improved sampling and unbiased modeling (Groh et al., 2021; Abhishek et al., 2024; Aayushman et al., 2024).
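A minimal sketch of the groupwise computations behind these comparisons, i.e. per-skin-type accuracy and the true-positive-rate gap underlying equality of opportunity (the arrays are toy data, not published predictions):

```python
import numpy as np

def per_group_accuracy(y_true, y_pred, groups):
    """Top-1 accuracy stratified by group (here, Fitzpatrick skin type)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {int(g): float(np.mean(y_pred[groups == g] == y_true[groups == g]))
            for g in np.unique(groups)}

def tpr_parity_gap(y_true, y_pred, groups, positive_class):
    """Equality-of-opportunity gap: max minus min true-positive rate across groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    tprs = []
    for g in np.unique(groups):
        mask = (groups == g) & (y_true == positive_class)
        if mask.any():  # skip groups with no positives
            tprs.append(float(np.mean(y_pred[mask] == positive_class)))
    return max(tprs) - min(tprs)

# Toy binary example (1 = malignant) over two skin-type groups:
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
groups = [1, 1, 1, 6, 6, 6]
print(per_group_accuracy(y_true, y_pred, groups))  # {1: 0.67, 6: 1.0} (approx.)
print(tpr_parity_gap(y_true, y_pred, groups, 1))   # 0.5 (perfect parity would be 0.0)
```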
6. Towards Robust, Equitable Dermatology AI
The Fitzpatrick17k dataset exposes critical challenges in skin disease image analysis: severe class and skin-type imbalance, annotation noise, duplication, and test leakage, each of which undermines the reliability and fairness of learned models. Corrective steps (duplicate cleaning, non-overlapping validated splits, embedding-based outlier filtering, and standardized fairness reporting such as per-FST accuracy, F1, and equality-of-opportunity and demographic-parity metrics) now define best practice for medical image benchmarks (Abhishek et al., 2024).
Research using Fitzpatrick17k has motivated new algorithmic solutions, notably cross-domain alignment with rich clinical text embeddings (PatchAlign) and domain-adaptation techniques that purge skin-type information from learned representations, both of which improve accuracy and fairness metrics beyond conventional convolutional networks.
7. Access, Use, and Recommendations
Fitzpatrick17k is publicly available at https://github.com/mattgroh/fitzpatrick17k. The cleaned Fitzpatrick17k-C splits and the PatchAlign implementation (https://github.com/aayushmanace/PatchAlign24) provide robust baselines for future reproducible work.
Recommended best practices include:
- Subject-level or lesion-level splitting to prevent train/test leakage (see the sketch after this list).
- Systematic duplicate removal via feature embedding similarity.
- Annotation refinement via “off-by-one” tolerances and expert review.
- Reporting all results using standardized, split-defined partitions and fairness metrics.
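For the first recommendation, scikit-learn's GroupShuffleSplit yields leakage-free splits directly; a minimal sketch (the `subject_ids` grouping is hypothetical, since Fitzpatrick17k does not ship patient identifiers; duplicate clusters can stand in as groups):

```python
from sklearn.model_selection import GroupShuffleSplit

def grouped_split(image_paths, subject_ids, test_size=0.2, seed=0):
    """Split so that no subject (patient/lesion/duplicate cluster) spans both sides."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(image_paths, groups=subject_ids))
    return train_idx, test_idx

# Toy example: a.jpg and b.jpg share a subject, so they stay on the same side.
train_idx, test_idx = grouped_split(
    ["a.jpg", "b.jpg", "c.jpg", "d.jpg"], subject_ids=[0, 0, 1, 2])
print(train_idx, test_idx)
```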
This benchmark now serves as a foundational reference for advances in unbiased, clinically valid deep learning in dermatology, and continues to underpin both technical and equity-focused innovations in medical AI (Groh et al., 2021; Abhishek et al., 2024; Aayushman et al., 2024).