SkinGenBench: Synthetic Skin Image Benchmark

Updated 26 December 2025
  • SkinGenBench is a comprehensive biomedical image synthesis benchmark that standardizes protocols, curated datasets, and metrics for synthetic dermoscopic image augmentation.
  • It compares state-of-the-art generative models, including StyleGAN2-ADA and diffusion approaches, focusing on fine marker-level synthesis and diagnostic improvements.
  • The benchmark integrates diverse datasets and robust preprocessing pipelines to advance melanoma diagnosis, skin-tone recognition, and overall clinical AI reliability.

SkinGenBench is a comprehensive biomedical image synthesis benchmark designed to rigorously evaluate generative models and preprocessing strategies for synthetic skin (dermoscopic) image augmentation and diagnosis. By establishing principled protocols, curated datasets, and standardized metrics for both realism and diagnostic utility—across melanoma subtyping, skin-tone diversity, and lesion annotation—SkinGenBench has become a cornerstone for reproducible and fair experimentation in medical image generation, data augmentation, and downstream clinical AI (Pritam et al., 19 Dec 2025, Lu, 13 Sep 2025).

1. Benchmark Scope and Dataset Foundations

SkinGenBench encompasses two primary biomedical image analysis domains: dermoscopic skin lesion synthesis for melanoma diagnosis and skin-tone recognition/generation for fairness benchmarking. The benchmark consolidates large, curated datasets:

  • Lesion diagnosis data: Aggregates HAM10000 (10,015 images) and MILK10K (5,421 images) for a five-class stratification: Nevus (NV), Basal Cell Carcinoma (BCC), Benign Keratosis-Like (BKL), Melanoma (MEL), and Squamous Cell Carcinoma (SCC), totaling 14,116 samples and quantifying severe melanoma underrepresentation (~11%; see the counting sketch below) (Pritam et al., 19 Dec 2025).
  • Skin-tone diversity: Integrates the TrueSkin dataset (7,299 images), which provides six visual-perception classes mapped to the Fitzpatrick skin type scale, ensuring coverage across the spectrum and balancing via synthetic augmentation (Lu, 13 Sep 2025).

The dataset design emphasizes realism, annotation quality, and demographically relevant diversity. Protocols for lighting, camera settings, demographic balancing, and annotator consensus ensure high-fidelity ground truth for both recognition and generation tasks.
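
The class imbalance that motivates augmentation can be tallied in a few lines over the pooled metadata. The sketch below is illustrative only: the CSV file names and the `dx` label column are assumptions, not the benchmark's published schema.

```python
# Illustrative tally of the pooled five-class distribution; file names
# and the `dx` label column are assumed, not taken from SkinGenBench.
import pandas as pd

CLASSES = ["NV", "BCC", "BKL", "MEL", "SCC"]

ham = pd.read_csv("HAM10000_metadata.csv")    # 10,015 images (assumed path)
milk = pd.read_csv("MILK10K_metadata.csv")    # 5,421 images (assumed path)
pooled = pd.concat([ham, milk], ignore_index=True)

# Restrict to the five benchmark classes and report per-class shares.
labels = pooled["dx"].str.upper()
counts = labels[labels.isin(CLASSES)].value_counts()
shares = counts / counts.sum()

for cls in CLASSES:
    print(f"{cls}: {counts.get(cls, 0):5d} ({shares.get(cls, 0.0):.1%})")
# Melanoma (MEL) sits near 11% of the 14,116 pooled samples, the
# underrepresentation SkinGenBench quantifies.
```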

2. Generative Model Architectures and Preprocessing Approaches

SkinGenBench systematically compares state-of-the-art generative models and preprocessing strategies:

  • Generative Models:
    • StyleGAN2-ADA, a GAN with adaptive discriminator augmentation.
    • DDPM-style diffusion models.
    • Conditional GANs driven by semantic and instance maps (pix2pixHD-style) (Bissoto et al., 2019).
  • Preprocessing Pipelines (both sketched in code below):
    • Pipeline A: Basic geometric augmentation (resize to 256×256, random flips/rotations).
    • Pipeline B: Advanced artifact removal (DullRazor for hair/ruler-mark inpainting) followed by the same geometric augmentations (Pritam et al., 19 Dec 2025).
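
A minimal sketch of both pipelines, assuming OpenCV-style BGR uint8 images; the blackhat kernel size and threshold in the DullRazor step are illustrative choices rather than the published parameters.

```python
# Sketch of Pipelines A and B; DullRazor parameters are illustrative.
import cv2
import numpy as np

def pipeline_a(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Pipeline A: resize to 256x256 plus random flips/rotations."""
    img = cv2.resize(img, (256, 256), interpolation=cv2.INTER_AREA)
    if rng.random() < 0.5:
        img = cv2.flip(img, 1)              # random horizontal flip
    k = int(rng.integers(0, 4))             # rotate by 0/90/180/270 deg
    return np.rot90(img, k).copy()

def dullrazor(img: np.ndarray) -> np.ndarray:
    """DullRazor-style artifact removal: detect dark hair/ruler marks
    with a morphological blackhat, then inpaint the detected pixels."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 17))
    blackhat = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, kernel)
    _, mask = cv2.threshold(blackhat, 10, 255, cv2.THRESH_BINARY)
    return cv2.inpaint(img, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)

def pipeline_b(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Pipeline B: artifact removal first, then Pipeline A's geometry."""
    return pipeline_a(dullrazor(img), rng)
```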

Empirical evidence indicates that model architecture choice exerts greater influence on synthesis fidelity and utility than preprocessing complexity. The conditional GAN with semantic/instance control (Bissoto et al., 2019) is distinguished for fine marker-level synthesis, while StyleGAN2-ADA outperforms both diffusion models and older GANs on global fidelity and downstream task uplift (Pritam et al., 19 Dec 2025).

3. Evaluation Metrics and Protocols

SkinGenBench employs a multidimensional set of quantitative metrics (several of which are sketched in code after this list):

  • Image Quality:
    • Fréchet Inception Distance (FID): Measures feature distribution alignment; lower values (e.g., StyleGAN2-ADA FID ≈ 65.5) indicate closer alignment with the real-image feature distribution.
    • Kernel Inception Distance (KID): Polynomial-kernel-based feature discrepancy; KID ≈ 0.055 for best GANs.
    • Inception Score (IS): Evaluates sample diversity and clarity; lower than natural image IS but consistent with medical data (Pritam et al., 19 Dec 2025).
  • Feature Analysis: 2,048-d ResNet-50 activations are projected via t-SNE; Euclidean centroid distances quantify synthetic-real dispersion (e.g., GT–GN = 34.5–48.5) (Pritam et al., 19 Dec 2025).
  • Downstream Diagnostic Utility:
    • Macro-F1, accuracy, and ROC-AUC evaluated on five standard classifiers (EffNet-B0, ResNet18/50, VGG16, ViT-B/16). Synthetic augmentation yields 3–5 pp global F1 gains and 8–15% absolute improvement in melanoma F1/ROC-AUC (A1: 0.7401 → A2: 0.8831 for ViT-B/16); all improvements are statistically significant (p < 0.01) (Pritam et al., 19 Dec 2025).
  • Segmentation and Mask Overlap (for models generating masks): Dice coefficient, IoU, LPIPS, and MS-SSIM (Xu, 26 Jul 2025).
  • Skin-Tone Fairness and Semantic Control (from TrueSkin integration) (Lu, 13 Sep 2025):
    • Overall/per-class accuracy, MSE between predicted and true tone index.
    • Fairness: Statistical Parity Difference (SPD) and Equal Opportunity Difference (EOD).
    • Generative alignment: Generation Skin-tone Accuracy (GSA), FID, CLIP-Score for text/image matching.
    • Contextual bias assessment: Attribute-conditional disparity Δ(a, k), bias amplification scores (BAS).
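
The image-quality and overlap metrics operate on precomputed feature matrices and binary masks. A minimal NumPy/SciPy sketch follows; production implementations additionally average KID over random subsets.

```python
# Minimal FID / KID / Dice-IoU sketch over precomputed features
# (shape (n, d), e.g. 2048-d Inception activations) and binary masks.
import numpy as np
from scipy import linalg

def fid(real: np.ndarray, fake: np.ndarray) -> float:
    """Frechet Inception Distance between two feature sets."""
    mu_r, mu_f = real.mean(0), fake.mean(0)
    cov_r = np.cov(real, rowvar=False)
    cov_f = np.cov(fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real   # drop tiny imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

def kid(real: np.ndarray, fake: np.ndarray) -> float:
    """Unbiased MMD^2 with the standard degree-3 polynomial kernel;
    practical implementations average this over random subsets."""
    d = real.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3
    m, n = len(real), len(fake)
    k_rr, k_ff = k(real, real), k(fake, fake)
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return float(term_rr + term_ff - 2.0 * k(real, fake).mean())

def dice_iou(pred: np.ndarray, target: np.ndarray) -> tuple[float, float]:
    """Dice coefficient and IoU for binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    denom = pred.sum() + target.sum()
    return (2.0 * inter / denom if denom else 1.0,
            inter / union if union else 1.0)
```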

All protocols are detailed in open-source repositories for maximal reproducibility.
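
In the same spirit, the fairness metrics reduce to a few lines. The sketch below assumes binary predictions and a binary protected group encoded 0/1; multi-group settings typically report the maximum pairwise gap.

```python
# Fairness-metric sketch for binary predictions and a 0/1 group label.
import numpy as np

def spd(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Statistical Parity Difference: gap in positive-prediction rates."""
    return float(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def eod(y_true: np.ndarray, y_pred: np.ndarray, group: np.ndarray) -> float:
    """Equal Opportunity Difference: gap in true-positive rates."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return float(tpr(0) - tpr(1))

def tone_mse(pred_idx: np.ndarray, true_idx: np.ndarray) -> float:
    """MSE between predicted and true ordinal skin-tone indices."""
    return float(np.mean((pred_idx - true_idx) ** 2))
```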

4. Quantitative Results and Comparative Analysis

SkinGenBench findings decisively map the landscape of synthetic image augmentation:

Model / Pipeline      FID    KID    IS    Melanoma F1 (ViT-B/16)  Macro-F1 (EffNet-B0)
StyleGAN2-ADA, Adv.   65.5   0.055  2.77  0.8831                  0.7977
DDPM, Adv.            90.2   0.077  2.45  0.8678                  0.7504
GAN, Basic            79.4   0.066  3.22  n/a                     n/a
DDPM, Basic           83.0   0.068  2.50  n/a                     n/a
  • StyleGAN2-ADA demonstrates the tightest clustering to real data and highest diagnostic uplift across pipelines.
  • Advanced artifact removal delivers only marginal metric gains (e.g., FID reduction ~15 for GANs), with possible attenuation of clinically relevant textures.
  • For prompt-driven diffusion synthesis (SkinDualGen), FID remains below 100, and hybrid training (50% synthetic + 50% real) achieves +8–15 pp uplift in classification/segmentation performance across architectures (Xu, 26 Jul 2025); the mixing recipe is sketched after this list.
  • GAN-based marker-controlled synthesis further augments classifier AUC up to 84.7% when combined with real and PGAN data (Bissoto et al., 2019).
  • In skin-tone tasks, TrueSkin-trained recognition models reach 74.18% accuracy (vs. ~40–50% for LMMs); SDXL diffusion generation models fine-tuned on TrueSkin data raise skin-tone accuracy from 61.1% to 64.8% (Lu, 13 Sep 2025).
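
The hybrid regime above amounts to subsampling the two pools at a fixed synthetic fraction. A minimal PyTorch sketch, assuming `real_ds` and `synth_ds` are any map-style datasets of (image, label) pairs:

```python
# Hybrid real+synthetic mixing at a fixed synthetic fraction; the total
# size is held at the real dataset's size so ratios compare fairly.
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset

def hybrid_dataset(real_ds, synth_ds, synth_ratio: float, seed: int = 0):
    g = torch.Generator().manual_seed(seed)     # report the seed used
    n_total = len(real_ds)
    n_synth = int(round(synth_ratio * n_total))
    real_idx = torch.randperm(len(real_ds), generator=g)[: n_total - n_synth]
    synth_idx = torch.randperm(len(synth_ds), generator=g)[:n_synth]
    return ConcatDataset([Subset(real_ds, real_idx.tolist()),
                          Subset(synth_ds, synth_idx.tolist())])

# Ratio sweep as the benchmark protocols recommend, e.g.:
# for r in (0.0, 0.25, 0.5, 0.75):
#     loader = DataLoader(hybrid_dataset(real_ds, synth_ds, r),
#                         batch_size=64, shuffle=True)
#     ...train EffNet-B0 / ViT-B/16 and log macro-F1...
```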

5. Tasks, Protocols, and Best Practices

SkinGenBench operationalizes benchmarking through explicit task delineations:

  • Recognition: Six-way skin-tone classification (EffNet-B1), with per-class and cross-dataset generalization, evaluated on accuracy, MSE, SPD, and EOD (Lu, 13 Sep 2025).
  • Generation:
    • Skin-tone-conditioned image synthesis (e.g., SDXL diffusion, prompt injection via JoyCaption).
    • Lesion/mask joint synthesis (e.g., SkinDualGen’s four-channel output for simultaneous image-mask sampling).
    • Marker-aware lesion synthesis from semantic and instance maps (pix2pixHD pipeline) (Bissoto et al., 2019).
  • Downstream clinical impact: Assessments of classifier and segmenter performance given training under pure, synthetic, or hybrid regimes.

Protocols urge standardized prompt templates, random-seed reporting, explicit code and weight release, multi-annotator consensus, and the adoption of weighted or ordinal losses where task structure warrants them.
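
A minimal sketch of the seed-reporting and prompt-template conventions; the template wording itself is illustrative, not the benchmark's published prompt.

```python
# Reproducibility sketch: fixed, reported seeds plus a standardized
# prompt template for tone-conditioned generation (wording illustrative).
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed all RNGs so augmentation, sampling, and training repeat."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

PROMPT_TEMPLATE = ("a close-up clinical photograph of human skin, "
                   "Fitzpatrick type {tone}, {context}")

set_seed(42)  # report this value alongside every result
prompt = PROMPT_TEMPLATE.format(tone="IV", context="neutral indoor lighting")
```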

6. Limitations, Bias, and Future Directions

SkinGenBench studies identify multiple axes for further advancement:

  • Model bias: Generative models show contextual spurious correlations (e.g., hairstyle, environment) influencing skin-tone synthesis, with the attribute-conditional disparity Δ(a, k) quantifying the effects (Lu, 13 Sep 2025). Recognition models collapse intermediate tones; diffusion models and GANs struggle with rare lesion types and boundary fidelity.
  • Dataset limitations: Predominantly institution-specific training data, potential deficiencies in global generalization, and reliance on expensive manual marker annotation (Bissoto et al., 2019).
  • Preprocessing: Artifact removal can suppress subtle but diagnostically important features; suggested use is conservative.
  • Fairness and causal generalization: Open research includes disentangling lighting from intrinsic tone, causal modeling of domain shifts, and benchmarking under extreme acquisition conditions (Lu, 13 Sep 2025, Pritam et al., 19 Dec 2025).
  • Future benchmarks: Expansion to 3D modalities (CT/MRI), multiclass masks, boundary-aware loss regularization, privacy-preserving synthesis, and the systematization of automated, self-supervised annotation mechanisms are proposed.

7. Integration and Reproducibility

All key component methods and datasets are open-source.

The benchmark establishes a robust ecosystem for standardized assessment, with recommended “real + synthetic ratio” sweeps, rigorous evaluation protocols, and clear best-practice guidelines for medical-image generative modeling and diagnostic AI.
