- The paper introduces a distribution-based pixel-level measurement pipeline to quantify skin pigment contrast in dermoscopic images.
- It evaluates state-of-the-art segmentation networks (UNet, DeepLabV3, DINOv2) on HAM10000 and ISIC2017 using metrics like IoU, Dice, and AUC.
- Findings reveal that intra-image pigment contrast, not global skin tone, robustly predicts segmentation errors, guiding improvements in fairness and robustness.
Quantitative Analysis of Skin Color Effects on Skin Lesion Segmentation
Introduction
The paper "Exploring the Impact of Skin Color on Skin Lesion Segmentation" (2603.29694) presents an empirical study of how skin color, including its quantitative gradations and local contrast, relates to the performance of contemporary skin lesion segmentation networks. Unlike previous work that discretizes skin tone into coarse categories, it proposes a pixel-level, distribution-based measurement pipeline. The pipeline shows how continuous variations in skin pigmentation and, crucially, lesion-skin contrast modulate the performance of UNet, DeepLabV3, and DINOv2 architectures on the widely used HAM10000 and ISIC2017 dermoscopic datasets.
Methods: From Pixel-wise Segmentation to Distribution-based Skin Tone Assessment
The study develops an analytic methodology that computes pixel-wise Individual Typology Angle (ITA), a CIELab-based colorimetric proxy, after semantic segmentation and artifact removal. This allows distinct ITA distributions to be constructed for the entire image, skin-only regions, and lesion-only regions. Pairwise Wasserstein distances between these empirical distributions then operationalize nuanced measurements of both absolute skin tone (relative to a fixed reference) and intra-image pigment contrast (e.g., skin-vs-lesion).
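The measurement described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the image is already converted to CIELab and that lesion and artifact masks are available, and the function names (`ita_degrees`, `wasserstein_1d`, `skin_lesion_contrast`) are hypothetical. ITA uses the standard definition arctan((L* - 50) / b*) in degrees, and the 1-D Wasserstein distance is computed from the two empirical CDFs.

```python
import numpy as np


def ita_degrees(L, b):
    """Per-pixel Individual Typology Angle (degrees) from CIELab L* and b*."""
    L = np.asarray(L, dtype=float)
    b = np.asarray(b, dtype=float)
    # ITA = arctan((L* - 50) / b*) * 180 / pi. arctan2 gives the same value
    # for b* > 0 (assumed here to hold for skin pixels) and avoids dividing by zero.
    return np.degrees(np.arctan2(L - 50.0, b))


def wasserstein_1d(u, v):
    """First-order Wasserstein distance between two 1-D empirical samples."""
    u = np.sort(np.asarray(u, dtype=float).ravel())
    v = np.sort(np.asarray(v, dtype=float).ravel())
    if u.size == 0 or v.size == 0:
        raise ValueError("both samples must be non-empty")
    all_vals = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_vals)
    # Empirical CDFs evaluated at every pooled sample point except the last.
    u_cdf = np.searchsorted(u, all_vals[:-1], side="right") / u.size
    v_cdf = np.searchsorted(v, all_vals[:-1], side="right") / v.size
    # Area between the two CDFs.
    return float(np.sum(np.abs(u_cdf - v_cdf) * deltas))


def skin_lesion_contrast(lab_image, lesion_mask, valid_mask=None):
    """Distance between skin-only and lesion-only ITA distributions.

    lab_image: (H, W, 3) array holding L*, a*, b* (assumed precomputed).
    lesion_mask: (H, W) boolean array, True on lesion pixels.
    valid_mask: optional (H, W) boolean array, False on removed artifacts.
    """
    ita = ita_degrees(lab_image[..., 0], lab_image[..., 2])
    lesion = np.asarray(lesion_mask, dtype=bool)
    if valid_mask is None:
        valid = np.ones_like(lesion)
    else:
        valid = np.asarray(valid_mask, dtype=bool)
    return wasserstein_1d(ita[valid & ~lesion], ita[valid & lesion])


# Toy image: uniform skin (ITA 45 degrees) with a darker 2x2 lesion (ITA 0 degrees).
lab = np.zeros((4, 4, 3))
lab[..., 0] = 70.0
lab[..., 2] = 20.0
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
lab[mask, 0] = 50.0
print(skin_lesion_contrast(lab, mask))  # about 45 for this toy image
```

The same `wasserstein_1d` helper could compare a skin-only distribution against a fixed reference sample to obtain an absolute skin tone measure. The exact reference and the definitions of the six patterns are given in the paper and are not reproduced here.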
Figure 1: The pipeline for measuring and comparing the distributions of skin and lesion color via pixel-wise ITA extraction in the context of lesion segmentation.
This approach sharply contrasts with Fitzpatrick scalar grouping, which fails to preserve within-image variation and can obscure local pigment differences that impact boundary detectability during segmentation. Six distributional comparison patterns are defined, ranging from global skin-type proxies (Patterns 1–3) to intra-image contrast signals (Patterns 4–6). Segmentation performance is then assessed across nine metrics, including IoU, Dice, Sensitivity, Specificity, and AUC.
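As a rough sketch of the overlap and error metrics named here, the function below computes IoU, Dice, Sensitivity, Specificity, pixel accuracy, FNR, and FPR from a pair of binary masks. The function name and the returned dictionary keys are illustrative choices, not the paper's code. AUC is omitted because it requires soft probability maps rather than binary masks.

```python
import numpy as np


def segmentation_metrics(pred, target):
    """Confusion-matrix metrics for one predicted and one ground-truth binary mask."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = int(np.sum(pred & target))    # lesion pixels predicted as lesion
    fp = int(np.sum(pred & ~target))   # skin pixels predicted as lesion
    fn = int(np.sum(~pred & target))   # lesion pixels predicted as skin
    tn = int(np.sum(~pred & ~target))  # skin pixels predicted as skin

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "pixel_accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "fnr": ratio(fn, tp + fn),
        "fpr": ratio(fp, tn + fp),
    }


m = segmentation_metrics([[1, 1], [0, 0]], [[1, 0], [0, 0]])
print(m["iou"], m["dice"])  # 0.5 and about 0.667
```

Computing these per image, rather than pooling pixels over the dataset, yields one value per image that can then be paired with that image's color measurement.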
Experimental Design
Three segmentation networks are considered: UNet, DeepLabV3 (ResNet-50 backbone), and self-supervised DINOv2 (ViT-B/14 backbone). Each is trained and validated on HAM10000 and ISIC2017 after artifact removal and standardized resizing. The models are evaluated on both overall accuracy and per-class breakdowns.
Figure 2: Per-class segmentation accuracy and error rates for HAM10000, illustrating class-dependent variability but consistent trends across architectures.
Skin and lesion color distributions are visualized and compared for each class, revealing narrow ranges for skin-only and lesion-only ITA, but substantially wider ranges for contrast-based patterns.
Figure 3: Distribution of measured skin and lesion color (as ITA or contrast distance) by disease class, highlighting the increased discriminative power from contrast-based patterns.
Results
All three segmentation architectures achieve high accuracy on both datasets (IoU > 0.88 on HAM10000, IoU > 0.73 on ISIC2017; mean Pixel Accuracy above 0.93 on HAM10000). DINOv2 achieves slightly higher scores, but error profiles are aligned across architectures. Notably, performance discrepancies across disease classes do not correlate simply with class frequency, suggesting that intrinsic lesion characteristics matter more than sample count.
Skin Tone Distributions and Limitations
HAM10000 and ISIC2017, when analyzed via Fitzpatrick or global mean ITA, show a dominance of lighter skin types (Fitzpatrick I–II), restricting statistical power for assessing effects in darker skin tones. However, the contrast-sensitive Wasserstein distance (WD) patterns span a much wider range of values than scalar skin color measures reveal, motivating their use for auditing performance variation.
Crucially, global skin color metrics (mean ITA, Fitzpatrick) exhibit negligible correlation with segmentation performance (Spearman |ρ| < 0.2). In contrast, intra-image pigment contrast (as in lesion-vs-skin Wasserstein distance; Patterns 4–6) demonstrates statistically significant and robust correlation with multiple segmentation metrics across all tested architectures.
Figure 4: Range of Spearman correlations between segmentation metrics and various skin color distance measures, showing strong associations only for contrast-based (e.g., Pattern 6) signals.
For instance, reduced lesion-skin pigment contrast predicts increased error rates (FNR, FPR), reduced accuracy, and lower AUC across both datasets, with correlation coefficients up to |ρ| ≈ 0.66 for critical metrics (Pattern 6; Table CI). Notably, this effect remains strong in malignant subgroups (MEL, BCC), with bootstrap-derived confidence intervals indicating statistical reliability.
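A correlation analysis of this kind can be sketched as follows. The code computes Spearman's ρ as the Pearson correlation of average ranks and attaches a percentile bootstrap confidence interval. The resampling scheme, the number of resamples, and the function names are assumptions made for illustration; the paper's exact bootstrap procedure is not specified in this summary.

```python
import numpy as np


def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(x.size, dtype=float)
    ranks[order] = np.arange(1, x.size + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    tie_sums = np.bincount(inverse, weights=ranks)
    return tie_sums[inverse] / counts[inverse]


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    # A constant input yields NaN; silence the resulting floating-point warning.
    with np.errstate(invalid="ignore", divide="ignore"):
        return float(np.corrcoef(rx, ry)[0, 1])


def bootstrap_spearman_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate plus percentile bootstrap interval over image pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.size
    rhos = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample images with replacement
        rhos[i] = spearman_rho(x[idx], y[idx])
    # nanpercentile skips the rare degenerate resamples that produced NaN.
    lo, hi = np.nanpercentile(rhos, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return spearman_rho(x, y), float(lo), float(hi)


# Synthetic stand-in data: per-image contrast and a noisy, increasing IoU.
rng = np.random.default_rng(1)
contrast = rng.uniform(0.0, 60.0, size=200)
iou = 0.5 + 0.006 * contrast + rng.normal(0.0, 0.05, size=200)
rho, lo, hi = bootstrap_spearman_ci(contrast, iou, n_boot=500)
print(f"rho={rho:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

The synthetic data only exercises the code and says nothing about the paper's reported values. In practice `contrast` would hold the per-image lesion-vs-skin distance and `iou` the corresponding per-image metric.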
These results empirically ground and quantify the long-hypothesized effect where low-contrast lesions—regardless of global skin tone—present ambiguous or indistinct boundaries, causing systematic segmentation failures.
Figure 5: Disease-class-specific absolute Spearman correlation heatmaps (DINOv2) highlighting the strength of contrast-driven performance variations.
Qualitative Validation
Qualitative exemplars reinforce these trends: high-contrast lesions are more cleanly segmented, while low-contrast lesions lead to significant boundary ambiguity and missegmentation.
Figure 6: Image samples with pre- and post-segmentation visualization under Pattern 6, illustrating the impact of skin-lesion contrast on segmentation output.
Discussion and Implications
This study comprehensively deconstructs the presumed link between “skin tone bias” and segmentation fairness in dermoscopic pipelines. The evidence presented indicates:
- Global skin tone is poorly predictive of segmentation quality when using state-of-the-art models and commonly available data distributions, counter to assertions based on classification-oriented fairness audits.
- Lesion-skin pigment contrast is the principal determinant of segmentation error, with low-contrast lesions at highest risk for missegmentation due to indistinct, hard-to-detect boundaries and increased ambiguity.
- Discretization of skin tone (via Fitzpatrick or scalar ITA) is intrinsically lossy for segmentation tasks and should not be relied upon for fairness auditing or bias mitigation.
- The distribution-based, within-image contrast metric provides a technically meaningful, model-agnostic audit signal for deployment monitoring, clinical triage safety, and the targeted development of robust segmentation pipelines for “difficult” cases.
Future research and clinical pipelines should emphasize contrast-stratified reporting, targeted augmentation for low-contrast lesion enrichment, and the development of adaptive feature fusion or uncertainty-triggered review mechanisms. Broader validation on datasets with more diverse pigment representation is necessary to generalize these findings.
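Contrast-stratified reporting could look roughly like the sketch below, which splits images into quantile bins of the contrast measure and reports the image count and mean metric per bin. The number of bins, the use of quantile edges, and the function name `contrast_stratified_report` are illustrative assumptions rather than a procedure taken from the paper.

```python
import numpy as np


def contrast_stratified_report(contrast, metric, n_bins=3):
    """Mean of a per-image metric within quantile bins of per-image contrast."""
    contrast = np.asarray(contrast, dtype=float)
    metric = np.asarray(metric, dtype=float)
    # Quantile edges give bins with roughly equal numbers of images.
    edges = np.quantile(contrast, np.linspace(0.0, 1.0, n_bins + 1))
    # Assign each image to a bin; the maximum value falls into the last bin.
    bin_idx = np.clip(
        np.searchsorted(edges, contrast, side="right") - 1, 0, n_bins - 1
    )
    rows = []
    for k in range(n_bins):
        sel = bin_idx == k
        rows.append(
            {
                "bin": k,
                "contrast_low": float(edges[k]),
                "contrast_high": float(edges[k + 1]),
                "n_images": int(sel.sum()),
                "mean_metric": float(metric[sel].mean()) if sel.any() else float("nan"),
            }
        )
    return rows


report = contrast_stratified_report(
    [1, 2, 3, 4, 5, 6], [0.5, 0.6, 0.7, 0.8, 0.9, 1.0], n_bins=3
)
for row in report:
    print(row)
```

Reporting the lowest-contrast bin separately makes a drop in performance on hard cases visible even when the dataset-level average looks high.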
Conclusion
This study supplies dataset-anchored evidence that, for the models and datasets examined, intra-image pigment contrast rather than global skin tone is the dominant driver of segmentation accuracy and error. Methods embracing pixel-level, distributional pigment measurement expose previously overlooked sources of algorithmic vulnerability and inform technical and clinical risk management. These insights support auditing, data curation, and fairness-aware design for AI-driven dermatological systems, with implications for safety, robustness, and the reliable use of such pipelines in diverse patient populations.