Fitzpatrick Skin Type Scale Overview
- The Fitzpatrick Skin Type (FST) scale is a six-category dermatological taxonomy that classifies skin based on its reaction to UV exposure, guiding clinical assessments and research.
- Manual FST annotation, often conducted via consensus protocols, shows moderate inter-rater reliability and is affected by lighting, context, and subjective judgments.
- Automated FST estimation methods using ITA and neural networks achieve high within-one-category agreement but reveal bias in darker types and challenges in calibration.
The Fitzpatrick Skin Type (FST) scale is a six-category dermatological taxonomy originally devised to quantify human skin’s reactivity to ultraviolet (UV) radiation, with subsequent widespread adoption for phenotypic skin tone annotation in clinical, algorithmic fairness, and computer vision studies. Despite its prevalence as a reporting standard, FST’s subjective origins, coarse quantization, and limited alignment with objective colorimetric measures have significant methodological and fairness implications.
1. Definition and Clinical Basis
The FST scale, introduced by Thomas B. Fitzpatrick in 1975 and formalized in 1988, stratifies human skin into six discrete types according to baseline pigmentation and propensity to sunburn or tan under UV exposure (Krishnapriya et al., 2021, Benčević et al., 10 Feb 2026, Shah et al., 14 Sep 2025):
| FST Type | UV Response | Common Descriptor |
|---|---|---|
| I | Always burns, never tans | Very fair (pale white) |
| II | Burns easily, tans minimally | Fair (white) |
| III | Sometimes mild burn, tans uniformly | Medium (cream/beige) |
| IV | Burns minimally, always tans well | Olive/moderate brown |
| V | Rarely burns, tans profusely | Brown (dark brown) |
| VI | Never burns, deeply pigmented | Black/very dark brown |
Clinically, dermatologists assign FST through in-person assessment combining sunburn/tan history and non-lesional skin inspection, which is considered the gold standard (Benčević et al., 10 Feb 2026). FST was designed to stratify phototherapy risk, not as an objective measure of constitutive skin color (Cook et al., 2024).
2. Manual Annotation Protocols and Inter-Rater Reliability
Manual FST assignment in image datasets is typically performed by clinical experts or trained raters using the canonical six-point scale, often presented with exemplar reference images to improve anchoring (Krishnapriya et al., 2021, Groh et al., 2021, Groh et al., 2022). Recent protocols emphasize dynamic consensus—where multiple independent raters’ decisions are fused via majority voting and accuracy-weighted aggregation—to maximize reliability (Groh et al., 2021, Groh et al., 2022).
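The two fusion rules named above, majority voting and accuracy-weighted aggregation, can be sketched as follows. This is a minimal illustration and not the protocol of the cited studies: the function names, the per-rater weights, and the tie-breaking rule (prefer the lower type) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Most frequent FST label; ties go to the lower type (an arbitrary assumption)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)

def weighted_vote(labels, weights):
    """Each rater's label counts in proportion to that rater's weight,
    e.g. a hypothetical accuracy score measured on gold-standard images."""
    scores = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight
    top = max(scores.values())
    return min(label for label, s in scores.items() if s == top)

ratings = [3, 4, 4]                # three raters, one image
rater_accuracy = [0.9, 0.4, 0.4]   # hypothetical weights

print(majority_vote(ratings))                  # 4
print(weighted_vote(ratings, rater_accuracy))  # 3
```

The example shows how the two rules can disagree: a single highly weighted rater outvotes two low-weighted raters under weighted aggregation, which is one reason such cases may be worth routing to expert review.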
Quantitative evaluations demonstrate that manual annotation displays moderate inter-rater variability, even with clinical exemplars and color correction:
- Exact agreement among three raters is 30–36%; two-out-of-three agreement reaches ~90%; within-one-category concordance is ≥89% (Krishnapriya et al., 2021).
- In gold-standard evaluations versus board-certified dermatologists, crowd raters’ exact accuracy on FST is 38–59%, but off-by-one agreement is 71–85% depending on FST type (Groh et al., 2021).
- Dynamic consensus with expert review identifies problematic cases and brings crowd-expert agreement in line with expert-expert reliability, with Pearson ρ ≈ 0.84–0.88 (Groh et al., 2022).
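The agreement statistics quoted in this list (exact agreement, at least two raters agreeing, and within-one-category concordance) can be computed from a rating matrix roughly as below. This is a sketch under assumptions: the function name and the toy ratings are hypothetical, and the cited studies may define concordance slightly differently.

```python
import numpy as np

def agreement_stats(ratings):
    """ratings: (n_images, n_raters) integer array of FST labels in 1..6."""
    r = np.asarray(ratings, dtype=int)
    spread = r.max(axis=1) - r.min(axis=1)
    # Exact agreement: every rater chose the same type.
    exact = float(np.mean(spread == 0))
    # Within-one concordance: all ratings fall within one category of each other.
    within_one = float(np.mean(spread <= 1))
    # At least two raters agree: some label occurs two or more times in the row.
    two_agree = float(np.mean([np.bincount(row, minlength=7).max() >= 2 for row in r]))
    return {"exact": exact, "two_agree": two_agree, "within_one": within_one}

toy = [[2, 2, 2],
       [3, 4, 4],
       [1, 3, 5],
       [5, 6, 5]]
print(agreement_stats(toy))
```

On the toy matrix, only the first image has exact agreement, while three of four images satisfy the two looser criteria, mirroring the pattern of low exact agreement and high within-one concordance described above.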
Manual assignment is susceptible to context effects, background lighting, device differences, and subjective application of category boundaries (Cook et al., 2024, Krishnapriya et al., 2021).
3. Automated FST Estimation and Colorimetric Mapping
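Automated methods for skin type estimation leverage pixel-level colorimetry and machine learning. The most widely adopted approach uses the Individual Typology Angle (ITA), a continuous feature derived from the CIE-LAB color space (Benčević et al., 6 Apr 2025, Benčević et al., 10 Feb 2026, Groh et al., 2021, Groh et al., 2022):

$$\mathrm{ITA} = \arctan\!\left(\frac{L^* - 50}{b^*}\right) \times \frac{180^\circ}{\pi}$$

where $L^*$ is CIE-LAB lightness and $b^*$ is blue–yellow chromaticity. ITA is binned into six FST intervals using empirically derived thresholds; for the Kinyanjui et al. mapping (Groh et al., 2022):

$$\mathrm{FST}(\mathrm{ITA}) = \begin{cases} \mathrm{I} & \mathrm{ITA} > 55^\circ \\ \mathrm{II} & 41^\circ < \mathrm{ITA} \le 55^\circ \\ \mathrm{III} & 28^\circ < \mathrm{ITA} \le 41^\circ \\ \mathrm{IV} & 19^\circ < \mathrm{ITA} \le 28^\circ \\ \mathrm{V} & 10^\circ < \mathrm{ITA} \le 19^\circ \\ \mathrm{VI} & \mathrm{ITA} \le 10^\circ \end{cases}$$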
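The ITA computation and its binning into FST categories can be sketched in a few lines. The snippet assumes the standard definition ITA = arctan((L* − 50) / b*) expressed in degrees, and the threshold values (55, 41, 28, 19, 10) are an assumption based on the thresholds commonly attributed to the Kinyanjui et al. mapping; they should be checked against the cited sources before use. Function and constant names are illustrative.

```python
import math

# (lower bound in degrees, FST type); assumed Kinyanjui-style thresholds.
ITA_BINS = [(55.0, 1), (41.0, 2), (28.0, 3), (19.0, 4), (10.0, 5)]

def ita_degrees(l_star, b_star):
    """ITA = arctan((L* - 50) / b*) in degrees.

    atan2 is used to avoid division by zero at b* = 0; it matches the
    arctan form whenever b* > 0, which is the usual case for skin.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))

def ita_to_fst(ita):
    """Map an ITA value to an FST type 1..6 (lighter skin has higher ITA)."""
    for lower, fst in ITA_BINS:
        if ita > lower:
            return fst
    return 6

ita = ita_degrees(65.0, 15.0)   # L* - 50 equals b*, so the angle is 45 degrees
print(round(ita, 2), ita_to_fst(ita))
```

In practice $L^*$ and $b^*$ would be averaged over segmented, non-lesional skin pixels after color conversion; that step is omitted here.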
Neural network-based models for FST prediction commonly employ ordinal regression (e.g., CORAL heads on EfficientNet or VGG backbones) to mimic the ordered nature of FST categories (Benčević et al., 10 Feb 2026, Benčević et al., 6 Apr 2025). These models are pre-trained on large-scale clinical and synthetic datasets annotated by humans, fine-tuned on real images with colorimeter references or expert FST labels, and evaluated, e.g., by Cohen’s κ, mean absolute error, and within-one-category accuracy.
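To illustrate how an ordinal head treats FST as ordered rather than nominal, the sketch below shows CORAL-style label encoding and decoding in NumPy: a label is expanded into five binary "is the type greater than k?" targets, a single shared score is combined with per-threshold biases, and the prediction is one plus the number of thresholds passed. This covers only the encoding and decoding logic, not the training of the cited models; all names and the example biases are hypothetical.

```python
import numpy as np

NUM_TYPES = 6

def encode_ordinal(fst):
    """FST label 1..6 -> 5 binary targets; target k is 1 when fst > k (k = 1..5)."""
    return (fst > np.arange(1, NUM_TYPES)).astype(int)

def coral_probs(score, biases):
    """One shared scalar score plus a bias per threshold, passed through a sigmoid."""
    logits = score + np.asarray(biases, dtype=float)
    return 1.0 / (1.0 + np.exp(-logits))

def decode_ordinal(probs):
    """Predicted type = 1 + number of thresholds with probability above 0.5."""
    return 1 + int(np.sum(np.asarray(probs) > 0.5))

print(encode_ordinal(4))                           # [1 1 1 0 0]
probs = coral_probs(0.0, [2.0, 1.0, 0.5, -1.0, -2.0])
print(decode_ordinal(probs))                       # 4
```

Because the score is shared and the biases are ordered, the threshold probabilities decrease monotonically, which keeps the decoded type consistent.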
In validated settings, automated ITA-based FST predictions agree with expert consensus within one category in 84–97% of cases, and CIELAB regression models predict ITA with intraclass correlation coefficients (ICC₃) above 0.93 against colorimeter measurements (Benčević et al., 10 Feb 2026, Krishnapriya et al., 2021). However, agreement with clinical experts for discrete FST assignment is consistently lower (Pearson ρ ≈ 0.52–0.57) than human inter-rater agreement, and is sensitive to segmentation, lighting, and calibration artifacts (Groh et al., 2022, Benčević et al., 6 Apr 2025).
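The evaluation measures mentioned in this section (mean absolute error, within-one-category accuracy, and Cohen's κ) can be computed for predicted versus reference FST labels as sketched below. The κ here is the unweighted form; the cited studies may use a weighted variant, so treat this as an illustrative assumption. Function names and the toy labels are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error in FST categories."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def within_one(y_true, y_pred):
    """Share of predictions at most one category away from the reference."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= 1))

def cohen_kappa(y_true, y_pred, n_classes=6):
    """Unweighted Cohen's kappa for labels in 1..n_classes.
    Undefined (division by zero) when expected agreement equals 1."""
    t = np.asarray(y_true) - 1
    p = np.asarray(y_pred) - 1
    conf = np.zeros((n_classes, n_classes))
    np.add.at(conf, (t, p), 1)              # confusion matrix counts
    conf /= conf.sum()                      # convert to proportions
    p_observed = np.trace(conf)
    p_expected = conf.sum(axis=1) @ conf.sum(axis=0)
    return float((p_observed - p_expected) / (1.0 - p_expected))

reference = [1, 2, 3, 4, 5, 6]
predicted = [1, 2, 4, 4, 6, 6]
print(mae(reference, predicted), within_one(reference, predicted),
      cohen_kappa(reference, predicted))
```

On the toy labels every prediction is within one category even though exact agreement is only moderate, which is why the two figures are best reported together.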
4. Statistical Properties and Biases of the FST Scale
Empirical studies reveal that FST categories correspond only coarsely to measured skin color or objective pigmentation:
- In self-assignments, FST exhibits low colorimetric sensitivity: a one-step FST difference corresponds to ≈14.7 units of CIE-LAB $L^*$, compared to 7.4 for MST (palette-based) and 4.9 for CST (colorimetric) (Cook et al., 2024).
- Regression of FST against measured $L^*$, hue, chroma, and self-identified race yields a goodness of fit much lower than for palette-based alternatives. FST is systematically influenced by hue, chroma, and race beyond $L^*$ alone; for equal lightness, White-identifying individuals select lighter FST categories than Black participants by ≈4.7 units (Cook et al., 2024).
- Annotators utilize only 66% of the FST range in self-assessment (types III–V dominate), reflecting limited practical differentiation.
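The sensitivity figures above can be turned into a quick back-of-the-envelope comparison: given a lightness gap between two people, how many scale steps apart would each scale be expected to place them? The sketch uses only the per-step values quoted from Cook et al. (2024) and assumes, for illustration, that the relationship is linear across the scale.

```python
# CIE-LAB lightness units per one-step difference, as quoted above.
L_STAR_PER_STEP = {"FST": 14.7, "MST": 7.4, "CST": 4.9}

def steps_for_lightness_gap(delta_l):
    """Expected number of scale steps separating two skin tones that
    differ by delta_l lightness units (linear approximation)."""
    return {scale: delta_l / per_step for scale, per_step in L_STAR_PER_STEP.items()}

for scale, steps in steps_for_lightness_gap(14.7).items():
    print(f"{scale}: {steps:.2f} steps")
```

A gap that FST resolves as a single step is resolved as roughly two MST steps and three CST steps, which is the sense in which FST is the coarsest of the three.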
The scale over-represents granularity in lighter types (I–III), while collapsing variation in darker skin into two undifferentiated categories (V–VI). This asymmetry risks label bias in algorithmic contexts, especially in downstream fairness evaluations (Shah et al., 14 Sep 2025).
5. Impact on AI Fairness and Dataset Composition
The FST scale is the de facto standard for stratifying dataset diversity and assessing algorithmic fairness in dermatology and face recognition (Groh et al., 2021, Benčević et al., 10 Feb 2026, Shah et al., 14 Sep 2025, Krishnapriya et al., 2021). However, large public datasets are severely imbalanced:
- In the Fitzpatrick 17k dataset, skin types V–VI collectively comprise 13.7% of images, with types I–IV dominating.
- In ISIC 2020 and MILK10k dermatoscopic benchmarks, types V–VI account for <1% of annotated samples when algorithmic estimators are applied (Benčević et al., 10 Feb 2026).
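A composition audit of this kind reduces to counting labels per type and reporting the share of types V and VI. The sketch below assumes integer FST labels 1 to 6 and uses made-up labels; it does not reproduce the figures for any of the datasets named above.

```python
from collections import Counter

def fst_shares(labels):
    """Fraction of samples per FST type 1..6 (types with no samples get 0.0)."""
    counts = Counter(labels)
    total = len(labels)
    return {fst: counts.get(fst, 0) / total for fst in range(1, 7)}

def dark_share(labels):
    """Combined fraction of FST types V and VI."""
    shares = fst_shares(labels)
    return shares[5] + shares[6]

# Hypothetical label list for illustration only.
toy_labels = [1] * 2 + [2] * 3 + [3] * 2 + [4] * 1 + [5] * 1 + [6] * 1
print(fst_shares(toy_labels))
print(f"Types V-VI: {dark_share(toy_labels):.1%}")
```

When the labels themselves come from an automated estimator, the resulting shares inherit that estimator's errors, so they are better read as estimates than as ground truth.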
Quantitative fairness audits show that AI lesion classifiers are most accurate on FST categories present in training data, and that reducing FST granularity in lighter types (I/II vs. III/IV) reduces both accuracy and fairness (Shah et al., 14 Sep 2025). The fairness gap (max–min difference in accuracy or error rates across FSTs) can be reduced with protected-group ERM (training models per FST group), but only if granularity is preserved (Shah et al., 14 Sep 2025).
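The fairness gap described here, the difference between the best and worst per-group accuracy, can be computed as in the following sketch. The function names and toy data are hypothetical, and accuracy is used as the metric purely for illustration; an error rate could be substituted.

```python
import numpy as np

def per_group_accuracy(y_true, y_pred, groups):
    """Accuracy of the classifier within each FST group present in `groups`."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    correct = y_true == y_pred
    return {int(g): float(correct[groups == g].mean()) for g in np.unique(groups)}

def fairness_gap(acc_by_group):
    """Max minus min accuracy across groups."""
    values = list(acc_by_group.values())
    return max(values) - min(values)

# Toy binary lesion labels, predictions, and FST group per image.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
fst    = [1, 1, 3, 3, 6, 6]

acc = per_group_accuracy(y_true, y_pred, fst)
print(acc, fairness_gap(acc))
```

Groups with very few samples yield noisy accuracy estimates, so a gap computed this way on an imbalanced dataset should be read alongside the group sizes.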
Weak concordance between FST and actual colorimetric skin tone amplifies the uncertainty of “bias” estimates in algorithmic evaluations. Many studies now advocate reporting uncertainty intervals (e.g., ±1 FST step) or transitioning to continuous measures of pigmentation (e.g., ITA, CST) for robust fairness assessment (Cook et al., 2024, Krishnapriya et al., 2021, Benčević et al., 10 Feb 2026).
6. Methodological Limitations and Alternative Scales
The main criticisms and known limitations of FST center on three axes (Cook et al., 2024, Krishnapriya et al., 2021, Shah et al., 14 Sep 2025):
- Coarseness and Subjectivity: Only six categories are available, with weak anchoring to device-independent pigmentation; substantial intra- and inter-rater variability persists despite consensus protocols and color-corrected exemplars.
- Bias: Assignment is influenced by context, chromaticity, and self-identified race, introducing systematic misplacement driven by factors unrelated to the scale’s intended purpose.
- Uneven Spacing: Greater nuance exists in lighter types compared to darker, embedding structural bias in annotation and subsequent algorithmic training and testing.
Alternatives such as the Monk Skin Tone (MST) and Colorimetric Skin Tone (CST) scales provide finer granularity and improved alignment with measured $L^*$ and chromaticity (Cook et al., 2024). CST, in particular, partitions skin tone into ten palette-based swatches generated to represent evenly spaced lightness steps and quadratic chroma/hue variation, delivering a better fit to measured color in self-rating and vastly reduced race-dependent misclassification.
A plausible implication is that future research in AI fairness should transition to continuous, colorimetrically-calibrated scales and augment all annotation pipelines with explicit uncertainty quantification and bias correction protocols.
7. Practical Guidelines and Directions for Future Work
Best practices emerging from recent studies include:
- Use dynamic consensus protocols with explicit expert review to maximize reproducibility and identify ambiguous images needing expert arbitration (Groh et al., 2022, Groh et al., 2021).
- Deploy calibration objects or rely on in-person assessment for gold-standard labels; avoid inferring FST from uncontrolled images without calibration or white-balancing (Krishnapriya et al., 2021, Benčević et al., 10 Feb 2026).
- Prefer segmentation-based or color quantization pipelines for automated colorimetry, and validate model-based predictions against physical colorimeter or spectrophotometric measurements (Benčević et al., 6 Apr 2025, Benčević et al., 10 Feb 2026).
- Evaluate uncertainty intervals and report both exact and ±1-category concordance for FST, particularly in fairness and benchmarking studies (Krishnapriya et al., 2021, Groh et al., 2021).
- Collect and curate datasets spanning the full FST and ITA range to support robust algorithmic audits; address the underrepresentation of darker types through targeted acquisition or synthetic augmentation (Benčević et al., 10 Feb 2026, Shah et al., 14 Sep 2025).
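As one concrete reading of the calibration guideline above, the sketch below white-balances an image using a neutral reference patch (for example, a gray card placed in the frame). It is a simplified illustration under assumptions: the image is taken to be linear RGB scaled to [0, 1], the patch is assumed to be truly neutral, and the function name and patch coordinates are hypothetical. Full color calibration with a multi-patch chart is more involved and is not shown.

```python
import numpy as np

def white_balance_from_patch(image, patch):
    """Scale each channel so the reference patch becomes neutral gray.

    image: float array of shape (H, W, 3) with values in [0, 1]
    patch: (row_slice, col_slice) covering the neutral reference card
    """
    rows, cols = patch
    patch_mean = image[rows, cols].reshape(-1, 3).mean(axis=0)
    gray_level = patch_mean.mean()          # keep overall patch brightness
    gains = gray_level / patch_mean         # per-channel correction factors
    return np.clip(image * gains, 0.0, 1.0)

# Synthetic image with a warm color cast; the top-left corner is the "card".
cast = np.ones((4, 4, 3)) * np.array([0.6, 0.5, 0.4])
balanced = white_balance_from_patch(cast, (slice(0, 2), slice(0, 2)))
print(balanced[0, 0])
```

Applying such a correction before computing ITA removes one source of device- and lighting-dependent error, though it does not replace validation against colorimeter or spectrophotometric measurements.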
Expanding the adoption of continuous, colorimetrically grounded alternatives (e.g., ITA and CST) and reporting uncertainty within all FST-based analyses remain important future directions, alongside rigorous evaluation of lighting, device, and algorithmic sources of error in skin-tone annotation.