SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis (2512.17585v1)

Published 19 Dec 2025 in eess.IV, cs.CV, and cs.LG

Abstract: This work introduces SkinGenBench, a systematic biomedical imaging benchmark that investigates how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis. Using a curated dataset of 14,116 dermoscopic images from HAM10000 and MILK10K across five lesion classes, we evaluate two representative generative paradigms, StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs), under basic geometric augmentation and advanced artifact removal pipelines. Synthetic melanoma images are assessed using established perceptual and distributional metrics (FID, KID, IS), feature space analysis, and their impact on diagnostic performance across five downstream classifiers. Experimental results demonstrate that generative architecture choice has a stronger influence on both image fidelity and diagnostic utility than preprocessing complexity. StyleGAN2-ADA consistently produced synthetic images more closely aligned with real data distributions, achieving the lowest FID (~65.5) and KID (~0.05), while diffusion models generated higher-variance samples at the cost of reduced perceptual fidelity and class anchoring. Advanced artifact removal yielded only marginal improvements in generative metrics and provided limited downstream diagnostic gains, suggesting possible suppression of clinically relevant texture cues. In contrast, synthetic data augmentation substantially improved melanoma detection, with 8-15% absolute gains in melanoma F1-score and ViT-B/16 achieving F1~0.88 and ROC-AUC~0.98, representing an improvement of approximately 14% over non-augmented baselines. Our code can be found at https://github.com/adarsh-crafts/SkinGenBench

Summary

  • The paper presents SkinGenBench as a benchmark assessing how preprocessing pipelines and generative models affect synthetic dermoscopic image fidelity and diagnostic performance.
  • The study compares StyleGAN2-ADA and DDPMs, showing that GAN-generated images improve class coherence and yield up to 15% absolute gains in melanoma F1-score.
  • The analysis indicates that while advanced preprocessing marginally refines image quality, preserving clinical texture cues is crucial for reliable diagnostics.

SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis

Introduction

The study titled "SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis" (2512.17585) introduces SkinGenBench, a comprehensive benchmark designed to evaluate the interaction between preprocessing complexity and generative model choice for synthetic dermoscopic image augmentation and melanoma diagnosis. Melanoma remains a critical public health challenge, with early detection significantly improving survival rates. The paper leverages a curated dataset of 14,116 dermoscopic images from HAM10000 and MILK10K, covering five lesion classes: Nevus (NV), Basal Cell Carcinoma (BCC), Benign Keratosis-Like (BKL), Melanoma (MEL), and Squamous Cell Carcinoma (SCC) (Figure 1).

The study evaluates two prominent generative paradigms, StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs), under varying preprocessing conditions. The focus of this research is to understand how preprocessing and generative model choices influence image quality and downstream diagnostic performance.

Figure 1: Types of skin lesions used in this study and their distribution in the curated dataset: Nevus (NV, 52.60%), Basal Cell Carcinoma (BCC, 21.43%), Benign Keratosis-Like (BKL, 11.60%), Melanoma (MEL, 11.03%), and Squamous Cell Carcinoma (SCC, 3.34%).

Methodology

The study's experimental design is depicted in Figure 2. It begins with two preprocessing pipelines, basic and advanced, applied to the curated dataset. The basic pipeline involves geometric augmentations such as rotations and flips, while the advanced pipeline adds artifact removal aimed at eliminating common artifacts such as hair and ruler marks using the DullRazor algorithm (Figure 3). The objective is to provide cleaner input data for training the generative models.

Figure 2: Overall design of the experimental study.

Figure 3: Stages of the DullRazor algorithm: (a) original image with artifacts; (b) blackhat mask; (c) binary mask; (d) final image free of artifacts.
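
This summary does not include the preprocessing code, but DullRazor's published formulation follows the stages in Figure 3: a blackhat morphological filter highlights thin dark hair structures, thresholding turns the response into a binary mask, and the masked pixels are filled in from their surroundings. A minimal OpenCV sketch of that pipeline; the kernel size, threshold, and inpainting radius are illustrative assumptions, not the paper's reported settings:

```python
import cv2

def dullrazor(image_bgr, kernel_size=17, threshold=10, inpaint_radius=3):
    """DullRazor-style hair removal for a dermoscopic image (BGR, uint8).

    All parameter defaults are illustrative, not the paper's settings.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # (b) Blackhat morphology highlights thin dark structures such as hairs.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    blackhat = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, kernel)
    # (c) Threshold the blackhat response into a binary hair mask.
    _, mask = cv2.threshold(blackhat, threshold, 255, cv2.THRESH_BINARY)
    # (d) Inpaint the masked pixels from their neighborhood.
    return cv2.inpaint(image_bgr, mask, inpaint_radius, cv2.INPAINT_TELEA)

clean = dullrazor(cv2.imread("lesion.jpg"))  # hypothetical input path
```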

StyleGAN2-ADA uses a style-based architecture to generate high-fidelity images through adversarial training, while DDPMs generate diverse samples by iteratively denoising images initialized from Gaussian noise. Each generative model was trained on data processed through both pipelines, and the resulting synthetic images were used to augment the melanoma training data. These augmented datasets were then used to train five deep learning classifiers, ResNet18, ResNet50, VGG16, ViT-B/16, and EfficientNet-B0, to evaluate the impact of synthetic data on melanoma diagnosis.
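
The augmentation step described here amounts to concatenating real and synthetic images before classifier training. A minimal PyTorch sketch of that step for ViT-B/16; the directory layout, hyperparameters, and single-epoch loop are illustrative assumptions, not the paper's training recipe:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms, models

# Illustrative paths and hyperparameters; not the paper's reported settings.
tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
real = datasets.ImageFolder("data/real_train", transform=tf)        # curated dataset
synthetic = datasets.ImageFolder("data/synthetic_mel", transform=tf) # GAN/DDPM outputs
loader = DataLoader(ConcatDataset([real, synthetic]), batch_size=32, shuffle=True)

# ViT-B/16 with its classification head replaced for the five lesion classes.
model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = torch.nn.Linear(model.heads.head.in_features, 5)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one illustrative epoch
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```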

Results

The results demonstrated that the choice of generative architecture has a stronger influence on image fidelity and diagnostic utility than preprocessing complexity: StyleGAN2-ADA achieved the best Fréchet Inception Distance (FID ~65.5) and Kernel Inception Distance (KID ~0.05), indicating the closest alignment with real data distributions, while DDPMs generated higher-variance, more diverse images at the cost of perceptual fidelity and class anchoring.
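
For readers wanting to reproduce the fidelity comparison, FID and KID can be computed with torchmetrics; the paper's exact implementation is not specified in this summary, and the tensor shapes below are stand-ins for real and synthetic image batches:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)  # illustrative subset size

# Stand-ins for real dermoscopic images and StyleGAN2-ADA/DDPM samples;
# by default both metrics expect uint8 batches of shape (N, 3, H, W).
real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real, real=True)
fid.update(fake, real=False)
kid.update(real, real=True)
kid.update(fake, real=False)

print("FID:", fid.compute().item())   # lower = closer to the real distribution
kid_mean, kid_std = kid.compute()     # KID reports mean and std over subsets
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```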

Evaluation of preprocessed and synthetic data via t-SNE embeddings further revealed distinct clustering patterns (Figure 4), confirming that GAN-generated images better maintain class coherence. Advanced preprocessing yielded only marginal improvements, suggesting that aggressive artifact removal may suppress clinically relevant texture cues.

Figure 4: t-SNE embeddings for the basic (left) and advanced (right) pipelines, showing the distributions of ground-truth (GT), GAN-generated (GN), and diffusion-generated (DF) samples.
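
A hedged sketch of how such a feature-space comparison can be produced: extract pooled features with a pretrained backbone (the paper's feature extractor is not specified here; ResNet-18 and random tensors are used purely for illustration) and project them with scikit-learn's t-SNE:

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
from torchvision import models

# Pretrained backbone as a feature extractor; classifier head dropped.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # keep the 512-d pooled features
backbone.eval()

# Stand-ins for ground-truth (GT), GAN (GN), and diffusion (DF) image batches.
gt = torch.rand(100, 3, 224, 224)
gn = torch.rand(100, 3, 224, 224)
df = torch.rand(100, 3, 224, 224)

with torch.no_grad():
    feats = torch.cat([backbone(x) for x in (gt, gn, df)]).numpy()

# Project the pooled features to 2-D for visual cluster comparison.
emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
labels = np.repeat(["GT", "GN", "DF"], 100)  # group labels for a scatter plot
```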

The incorporation of synthetic data substantially improved melanoma detection, with absolute F1-score gains of 8–15%; ViT-B/16 achieved an F1-score of approximately 0.88 and a ROC-AUC of approximately 0.98, an improvement of approximately 14% over the non-augmented baseline.

Discussion

The findings highlight that augmenting dermoscopic datasets with GAN-generated synthetic images substantially benefits classification models, particularly for underrepresented classes like melanoma. Transformer models, specifically ViT-B/16, exhibited robust performance improvements, cementing their role in contemporary diagnostic pipelines.

Further analysis of visual interpretability through Grad-CAM highlighted inherent differences in classifier focus, with ViT-B/16 exhibiting diffuse activation patterns potentially attributable to attention-based image interpretation. These results underscore the potential of GAN-based augmentation strategies in clinical dermatology, providing a balanced improvement across realism, diversity, and diagnostic performance.
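
Grad-CAM itself is straightforward to reproduce for the CNN classifiers in the study. A self-contained sketch using forward/backward hooks on ResNet-50's last convolutional block; the layer choice and random input are illustrative, as the paper's exact Grad-CAM configuration is not given here:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["maps"] = output.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["maps"] = grad_out[0].detach()

# Hook the last convolutional block to capture feature maps and gradients.
layer = model.layer4[-1]
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in dermoscopic image
logits = model(x)
logits[0, logits.argmax()].backward()  # gradient of the predicted class score

# Channel weights = global-average-pooled gradients; CAM = ReLU(weighted sum).
weights = gradients["maps"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["maps"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```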

Conclusion

The study effectively establishes SkinGenBench as a robust benchmark for assessing generative models and preprocessing methodologies in synthetic dermoscopic image augmentation. With implications extending to clinical melanoma detection, the research affirms the need for careful selection of preprocessing and generative strategies to maximize image fidelity and diagnostic accuracy. Future work may explore more advanced diffusion models and incorporate multi-institutional datasets to further generalize findings across diverse populations.
