SkinGenBench: Synthetic Skin Image Benchmark
- SkinGenBench is a comprehensive biomedical image synthesis benchmark that standardizes protocols, curated datasets, and metrics for synthetic dermoscopic image augmentation.
- It compares state-of-the-art generative models, including StyleGAN2-ADA and diffusion approaches, focusing on fine marker-level synthesis and diagnostic improvements.
- The benchmark integrates diverse datasets and robust preprocessing pipelines to advance melanoma diagnosis, skin-tone recognition, and overall clinical AI reliability.
SkinGenBench is a comprehensive biomedical image synthesis benchmark designed to rigorously evaluate generative models and preprocessing strategies for synthetic skin (dermoscopic) image augmentation and diagnosis. By establishing principled protocols, curated datasets, and standardized metrics for both realism and diagnostic utility—across melanoma subtyping, skin-tone diversity, and lesion annotation—SkinGenBench has become a cornerstone for reproducible and fair experimentation in medical image generation, data augmentation, and downstream clinical AI (Pritam et al., 19 Dec 2025, Lu, 13 Sep 2025).
1. Benchmark Scope and Dataset Foundations
SkinGenBench encompasses two primary biomedical image analysis domains: dermoscopic skin lesion synthesis for melanoma diagnosis and skin-tone recognition/generation for fairness benchmarking. The benchmark consolidates large, curated datasets:
- Lesion diagnosis data: Aggregates HAM10000 (10,015 images) and MILK10K (5,421 images) into a five-class stratification: Nevus (NV), Basal Cell Carcinoma (BCC), Benign Keratosis-Like (BKL), Melanoma (MEL), and Squamous Cell Carcinoma (SCC), totaling 14,116 samples and quantifying severe melanoma underrepresentation (~11%) (Pritam et al., 19 Dec 2025).
- Skin-tone diversity: Integrates the TrueSkin dataset (7,299 images), which provides six visual-perception classes mapped to the Fitzpatrick skin type scale, ensuring coverage across the spectrum and balancing via synthetic augmentation (Lu, 13 Sep 2025).
The dataset design emphasizes realism, annotation quality, and demographically relevant diversity. Protocols for lighting, camera settings, demographic balancing, and annotator consensus ensure high-fidelity ground truth for both recognition and generation tasks.
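As an illustration of the imbalance audit above, a short sketch that recomputes class shares from per-class counts; the counts used here are hypothetical splits chosen only to be consistent with the reported 14,116 total and ~11% melanoma share:

```python
from collections import Counter

# Hypothetical per-class label list for illustration only; the benchmark's
# actual counts come from the merged HAM10000 + MILK10K metadata.
labels = ["NV"] * 7800 + ["BKL"] * 2100 + ["BCC"] * 1700 + ["MEL"] * 1550 + ["SCC"] * 966

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls}: {n} ({n / total:.1%})")
# A melanoma share near 11% is the underrepresentation the benchmark targets.
```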
2. Generative Model Architectures and Preprocessing Approaches
SkinGenBench systematically compares state-of-the-art generative models and preprocessing strategies:
- Generative Models:
  - GANs: StyleGAN2-ADA, with an 8-layer mapping network, a 512-d latent space, progressive synthesis from a 4×4×256 base, and Adaptive Discriminator Augmentation (ADA), achieves superior mode coverage with low overfitting (Pritam et al., 19 Dec 2025). Earlier conditional GANs (pix2pixHD) conditioned on semantic plus instance maps (lesion+marker maps, superpixels) enable fine-grained, marker-coherent skin lesion synthesis for clinical annotation tasks (Bissoto et al., 2019).
  - Diffusion Models: Denoising Diffusion Probabilistic Models (DDPMs) with U-Net backbones, 1,000 cosine-scheduled timesteps, and AdamW optimization. Enhanced with domain-specific LoRA fine-tuning, Stable Diffusion 2.0–based methods enable one-pass, prompt-driven joint generation of images and binary masks (Xu, 26 Jul 2025).
- Preprocessing Pipelines (a minimal sketch of both follows this list):
  - Pipeline A: Basic geometric augmentation (resize to 256×256, random flips/rotations).
  - Pipeline B: Advanced artifact removal (DullRazor inpainting of hair and ruler marks) followed by the same geometric augmentations (Pritam et al., 19 Dec 2025).
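A minimal sketch of the two pipelines, assuming OpenCV and uint8 BGR inputs; the blackhat-plus-inpainting step approximates DullRazor-style artifact removal rather than reproducing the benchmark's exact code:

```python
import cv2
import numpy as np

def pipeline_a(img: np.ndarray) -> np.ndarray:
    """Pipeline A: resize to 256x256, then random flip and 90-degree rotation."""
    img = cv2.resize(img, (256, 256))
    if np.random.rand() < 0.5:
        img = cv2.flip(img, 1)  # horizontal flip
    k = np.random.randint(4)
    return np.rot90(img, k).copy()  # random rotation by k * 90 degrees

def remove_hair(img: np.ndarray) -> np.ndarray:
    """DullRazor-style removal: a blackhat filter highlights dark hair and
    ruler strokes, which are then masked and inpainted."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 17))
    blackhat = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, kernel)
    _, mask = cv2.threshold(blackhat, 10, 255, cv2.THRESH_BINARY)
    return cv2.inpaint(img, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)

def pipeline_b(img: np.ndarray) -> np.ndarray:
    """Pipeline B: artifact removal first, then the geometric augmentations."""
    return pipeline_a(remove_hair(img))
```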
Empirical evidence indicates that model architecture choice exerts greater influence on synthesis fidelity and utility than preprocessing complexity. The conditional GAN with semantic/instance control (Bissoto et al., 2019) is distinguished for fine marker-level synthesis, while StyleGAN2-ADA outperforms both diffusion models and older GANs on global fidelity and downstream task uplift (Pritam et al., 19 Dec 2025).
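The 1,000-step cosine schedule cited for the DDPMs above is standardly the Nichol-Dhariwal formulation (an assumption about the exact variant used); a minimal sketch:

```python
import numpy as np

def cosine_alphas_cumprod(T: int = 1000, s: float = 0.008) -> np.ndarray:
    """Cosine noise schedule: cumulative signal retention alpha-bar_t,
    f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2), normalized so alpha-bar_0 = 1."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]

alphas_cumprod = cosine_alphas_cumprod()
betas = 1 - alphas_cumprod[1:] / alphas_cumprod[:-1]
betas = np.clip(betas, 0, 0.999)  # standard clipping to avoid singularities at t = T
```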
3. Evaluation Metrics and Protocols
SkinGenBench employs a multidimensional set of quantitative metrics:
- Image Quality:
  - Fréchet Inception Distance (FID): Measures feature-distribution alignment; lower values (e.g., StyleGAN2-ADA FID ≈ 65.5) indicate closer realism. A minimal computation sketch follows this list.
  - Kernel Inception Distance (KID): Polynomial-kernel feature discrepancy; KID ≈ 0.055 for the best GANs.
  - Inception Score (IS): Evaluates sample diversity and clarity; values are lower than for natural images but consistent with medical data (Pritam et al., 19 Dec 2025).
- Feature Analysis: 2,048-d ResNet-50 activations are projected via t-SNE; Euclidean centroid distances quantify synthetic-real dispersion (e.g., GT–GN = 34.5–48.5) (Pritam et al., 19 Dec 2025).
- Downstream Diagnostic Utility:
  - Macro-F1, accuracy, and ROC-AUC evaluated on five standard classifiers (EffNet-B0, ResNet18/50, VGG16, ViT-B/16). Synthetic augmentation yields 3–5 pp global F1 gains and 8–15 pp absolute improvement in melanoma F1/ROC-AUC (melanoma F1 0.7401 → 0.8831 for ViT-B/16); all improvements are statistically significant (Pritam et al., 19 Dec 2025).
  - Segmentation and Mask Overlap (for models generating masks): Dice coefficient, IoU, LPIPS, and MS-SSIM (Xu, 26 Jul 2025).
- Skin-Tone Fairness and Semantic Control (from TrueSkin integration) (Lu, 13 Sep 2025):
  - Overall/per-class accuracy, and MSE between predicted and true tone index.
  - Fairness: Statistical Parity Difference (SPD) and Equal Opportunity Difference (EOD).
  - Generative alignment: Generation Skin-tone Accuracy (GSA), FID, and CLIP-Score for text/image matching.
  - Contextual bias assessment: Attribute-conditional disparity and bias amplification scores (BAS).
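As referenced above, a minimal FID computation over pre-extracted 2,048-d features, using the standard Fréchet formula; a sketch, not necessarily the benchmark's exact implementation:

```python
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Frechet Inception Distance between feature sets of shape (N, 2048):
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})."""
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2 * covmean))
```

FID is sensitive to sample count, so comparable scores require matched numbers of real and synthetic features.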
All protocols are detailed in open-source repositories for maximal reproducibility.
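The fairness metrics above likewise admit compact definitions. A minimal sketch of SPD and EOD, assuming binarized predictions (e.g., one-vs-rest per tone class) and a binary group indicator:

```python
import numpy as np

def spd(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Statistical Parity Difference: P(y_hat=1 | g=1) - P(y_hat=1 | g=0)."""
    return float(y_pred[group == 1].mean() - y_pred[group == 0].mean())

def eod(y_pred: np.ndarray, y_true: np.ndarray, group: np.ndarray) -> float:
    """Equal Opportunity Difference: true-positive-rate gap between groups."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return float(tpr(1) - tpr(0))
```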
4. Quantitative Results and Comparative Analysis
SkinGenBench findings decisively map the landscape of synthetic image augmentation:
| Model / Pipeline | FID ↓ | KID ↓ | IS ↑ | Melanoma F1 (ViT-B/16) ↑ | Macro-F1 (EffNet-B0) ↑ |
|---|---|---|---|---|---|
| StyleGAN2-ADA, Adv. | 65.5 | 0.055 | 2.77 | 0.8831 | 0.7977 |
| DDPM, Adv. | 90.2 | 0.077 | 2.45 | 0.8678 | 0.7504 |
| GAN, Basic | 79.4 | 0.066 | 3.22 | — | — |
| DDPM, Basic | 83.0 | 0.068 | 2.50 | — | — |
- StyleGAN2-ADA demonstrates the tightest clustering to real data and highest diagnostic uplift across pipelines.
- Advanced artifact removal delivers only marginal metric gains (e.g., a FID reduction of ~15 for GANs) and may attenuate clinically relevant textures.
- For prompt-driven diffusion synthesis (SkinDualGen), FID remains below 100, and hybrid training (50% synthetic + 50% real) achieves +8–15 pp uplift in classification/segmentation performance across architectures (Xu, 26 Jul 2025).
- GAN-based marker-controlled synthesis raises classifier AUC to 84.7% when synthetic images are combined with real and PGAN data (Bissoto et al., 2019).
- In skin-tone tasks, TrueSkin-trained recognition models reach 74.18% accuracy (vs. ~40–50% for large multimodal models, LMMs); SDXL diffusion models fine-tuned on TrueSkin data raise generation skin-tone accuracy from 61.1% to 64.8% (Lu, 13 Sep 2025).
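The hybrid-training regimes above, and the "real + synthetic ratio" sweeps recommended in Section 7, reduce to a simple dataset-mixing step. A PyTorch-flavored sketch with hypothetical real_ds and synth_ds dataset objects:

```python
import torch
from torch.utils.data import ConcatDataset, Dataset, Subset

def mix_datasets(real_ds: Dataset, synth_ds: Dataset,
                 synth_fraction: float, seed: int = 0) -> Dataset:
    """Build a hybrid training set in which synth_fraction of all samples
    are synthetic; synth_fraction=0.5 reproduces the 50/50 regime above."""
    assert 0.0 <= synth_fraction < 1.0
    g = torch.Generator().manual_seed(seed)  # report this seed with results
    n_synth = int(len(real_ds) * synth_fraction / (1.0 - synth_fraction))
    n_synth = min(n_synth, len(synth_ds))
    idx = torch.randperm(len(synth_ds), generator=g)[:n_synth].tolist()
    return ConcatDataset([real_ds, Subset(synth_ds, idx)])

# Ratio sweep as recommended in Section 7:
# for frac in (0.0, 0.25, 0.5, 0.75):
#     hybrid = mix_datasets(real_ds, synth_ds, frac)
#     ... train classifier, log macro-F1 and melanoma F1 ...
```

Keeping all real images and topping up with sampled synthetic ones ensures each sweep point differs only in the synthetic share.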
5. Tasks, Protocols, and Best Practices
SkinGenBench operationalizes benchmarking through explicit task delineations:
- Recognition: Six-way skin-tone classification (EffNet-B1), with per-class and cross-dataset generalization, evaluated on accuracy, MSE, SPD, and EOD (Lu, 13 Sep 2025).
- Generation:
  - Skin-tone-conditioned image synthesis (e.g., SDXL diffusion, prompt injection via JoyCaption).
  - Lesion/mask joint synthesis (e.g., SkinDualGen's four-channel output for simultaneous image-mask sampling; see the sketch after this list).
  - Marker-aware lesion synthesis from semantic and instance maps (pix2pixHD pipeline) (Bissoto et al., 2019).
- Downstream clinical impact: Assessments of classifier and segmenter performance when trained under real-only, synthetic-only, or hybrid regimes.
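As referenced in the list above, joint image-mask generation and its mask-overlap evaluation can be illustrated compactly. A sketch that assumes an RGB+mask channel layout for the four-channel output:

```python
import numpy as np

def split_joint_sample(sample: np.ndarray, thresh: float = 0.5):
    """Split a generated four-channel sample (H, W, 4) into an RGB image
    and a binary lesion mask; the RGB+mask channel order is an assumption."""
    image = sample[..., :3]
    mask = (sample[..., 3] > thresh).astype(np.uint8)
    return image, mask

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between binary masks, per the mask-overlap metrics
    listed in Section 3."""
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```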
Protocols urge the use of standardized prompt templates and random seed reporting, explicit code and weight release, multi-annotator consensus, and the adoption of weighted or ordinal losses where task structure warrants.
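A minimal helper illustrating the seed-fixing and config-logging practice; a sketch of common practice, not the benchmark's prescribed tooling:

```python
import json
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Fix all common RNGs so a run is reproducible from its reported seed."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def log_run_config(path: str, **config) -> None:
    """Persist seed, prompt template, and hyperparameters alongside results."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)

seed_everything(42)
log_run_config("run_config.json", seed=42,
               prompt_template="a dermoscopic image of {diagnosis}",  # hypothetical template
               synth_fraction=0.5)
```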
6. Limitations, Bias, and Future Directions
SkinGenBench studies identify multiple axes for further advancement:
- Model bias: Generative models exhibit contextual spurious correlations (e.g., hairstyle, environment) that influence skin-tone synthesis, with attribute-conditional disparity metrics quantifying these effects (Lu, 13 Sep 2025). Recognition models collapse intermediate tones, and both diffusion models and GANs struggle with rare lesion types and boundary fidelity.
- Dataset limitations: Predominantly institution-specific training data, potential deficiencies in global generalization, and reliance on expensive manual marker annotation (Bissoto et al., 2019).
- Preprocessing: Artifact removal can suppress subtle but diagnostically important features; conservative use is advised.
- Fairness and causal generalization: Open research includes disentangling lighting from intrinsic tone, causal modeling of domain shifts, and benchmarking under extreme acquisition conditions (Lu, 13 Sep 2025, Pritam et al., 19 Dec 2025).
- Future benchmarks: Proposed extensions include 3D modalities (CT/MRI), multiclass masks, boundary-aware loss regularization, privacy-preserving synthesis, and systematized, automated self-supervised annotation.
7. Integration and Reproducibility
All key component methods and datasets are open-source:
- SkinGenBench framework and pretrained checkpoints: https://github.com/adarsh-crafts/SkinGenBench (Pritam et al., 19 Dec 2025).
- SkinDualGen code and LoRA weights: https://github.com/JaspinXu/SkinDualGen (Xu, 26 Jul 2025).
- TrueSkin dataset and code: as per (Lu, 13 Sep 2025).
- GAN-skin-lesion (pix2pixHD benchmark pipeline): https://github.com/alceubissoto/gan-skin-lesion (Bissoto et al., 2019).
The benchmark establishes a rigorous ecosystem for standardized assessment, with recommended "real + synthetic ratio" sweeps, strict evaluation protocols, and clear best-practice guidelines for medical-image generative modeling and diagnostic AI.