DermoBench: Clinical Benchmarking in Dermatology
- DermoBench is a multi-format benchmarking suite that rigorously evaluates AI models in dermatology using expert-reviewed clinical protocols.
- It encompasses diverse tasks including chain-of-thought reasoning, diagnostic narrative assessment, and fairness testing across multiple clinical axes.
- Dedicated cleaning protocols and contamination controls ensure high data integrity, enhancing reproducibility and performance reliability.
DermoBench is a rigorously designed, multi-format benchmarking suite developed to enable reproducible, clinically grounded evaluation of computer vision models and multimodal LLMs (MLLMs) for dermatological reasoning. Across its lineage, DermoBench frameworks share a commitment to expert-reviewed ground truth, explicit multi-dimensional clinical rubrics, and high-fidelity, contamination-controlled test splits. Principal versions serve complementary purposes, ranging from chain-of-thought (CoT) explanation assessment (Shen et al., 19 Nov 2025) and robust diagnostic narrative evaluation with LLM-based judges (Shen et al., 12 Nov 2025) to multi-task morphology-diagnosis-reasoning pipelines and fairness quantification (Ru et al., 5 Jan 2026). Additionally, data cleaning protocols published as DermoBench methodology (Gröger et al., 2023) address trustworthiness and performance stability in dermoscopic model evaluation.
1. Definitions and Rationale
DermoBench was established as an expert-verified benchmark for assessing the quality and safety of algorithm-generated dermatologic interpretations. In the SkinGPT-R1 framework (Shen et al., 19 Nov 2025), DermoBench consists of 3,000 image-CoT pairs, each locked to a gold reference narrative and scored by board-certified dermatologists along six dimensions: Accuracy, Safety, Medical Groundedness, Clinical Coverage, Reasoning Coherence, and Description Precision. Its clinical motivation derives from the need to move beyond short-answer correctness toward explicit, multi-step, evidence-based reasoning pathways traceable to typical diagnostic workflows.
Subsequent extensions, such as those presented in DermoGPT (Ru et al., 5 Jan 2026), formalize DermoBench as a unified, multi-axis evaluation suite comprising 11 subtasks distributed over four clinical axes (Morphology, Diagnosis, Reasoning, Fairness). These include open-ended description, structured attribute extraction, multiple-choice question answering (MCQA), chain-of-thought reasoning, and fairness testing by Fitzpatrick skin type.
A DermoBench data cleaning protocol (Gröger et al., 2023) focuses on eliminating duplicates, irrelevant samples, and label errors from ISIC-recommended evaluation splits to yield reliable model comparisons.
2. Dataset Construction and Composition
Multiple versions of DermoBench have been constructed to serve independent benchmarking objectives:
- SkinGPT-R1 DermoBench (Shen et al., 19 Nov 2025): Created from a "certified" slice of 3,000 cases hand-audited and corrected by board-certified dermatologists, with zero overlap with training data. Each sample comprises an image, a gold reference CoT narrative, and full dimension scores.
- MLLM Narrative DermoBench (Shen et al., 12 Nov 2025): Contains 4,000 real-world dermatology images sourced from clinical and public corpora, spanning 12 major diagnostic categories. Certified diagnostic narratives are authored via independent dual annotation and consensus adjudication. Metadata includes age, sex, site, and Fitzpatrick skin type.
- DermoGPT Benchmark (Ru et al., 5 Jan 2026): Integrates 3,600 expert-verified open-ended tasks and 30,399 closed-ended MCQA pairs, split across 900 core cases and broader pools for in-distribution and out-of-distribution assessment. All open-ended instances undergo line-by-line revision for factual fidelity.
- Cleaning Protocol (Gröger et al., 2023): Applies algorithmic and human curation to six ISIC datasets, producing revised test-set manifests for each, ensuring removal of irrelevant and duplicate images, with expert-verified corrections.
| Version/Protocol | Images/Cases | Annotation Mode |
|---|---|---|
| SkinGPT-R1 DermoBench | 3,000 | Certified narrative, CoT |
| MLLM Narrative DermoBench | 4,000 | Dual dermatologist review |
| DermoGPT Multi-axis | 900 core, 33,999 tasks total | Expert and MCQA |
| Cleaning Protocol | 6 datasets (varied N) | Algorithmic + expert flag |
3. Evaluation Rubrics and Scoring Protocols
DermoBench employs strictly defined evaluation rubrics tailored to clinical reasoning and narrative generation:
- Six Dimensions (SkinGPT-R1 and MLLM Narrative Benchmarks):
- Accuracy: Diagnostic correctness
- Safety: Absence of harmful or misleading advice
- Medical Groundedness: Alignment with established dermatologic knowledge
- Clinical Coverage: Completeness of clinical description and recommendations
- Reasoning Coherence: Logical, stepwise justification from findings to diagnosis
- Description Precision: Terminological specificity and language clarity
Each dimension is scored on a discrete 1-5 scale, with standardized sub-criteria mapping grades (A-E) to clinical quality gradations.
The overall score for a case is calculated as the arithmetic mean across the six dimensions, $\text{Overall} = \tfrac{1}{6}\sum_{d=1}^{6} s_d$, where $s_d$ denotes the 1-5 score assigned to dimension $d$.
- DermoGPT Multi-axis Metrics (Ru et al., 5 Jan 2026):
- Closed-ended MCQA accuracy: the fraction of items answered correctly, $\mathrm{Acc} = \tfrac{1}{N}\sum_{i=1}^{N}\mathbf{1}[\hat{y}_i = y_i]$.
- Hierarchical similarity (auxiliary): a taxonomy-aware agreement measure between predicted and reference diagnoses along the diagnostic hierarchy.
- Fairness: an accuracy-disparity measure across Fitzpatrick skin-type strata, where $g$ indexes the Fitzpatrick skin groups.
- Open-ended LLM-as-a-Judge score (0-100): combines a recall-like term and a precision penalty, each computed via atomic claim support/contradiction matching against the reference.
- Morphological semantics: a PMI-weighted Tversky overlap between predicted and reference morphological attribute sets.
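The sketch below illustrates how the simpler of these metrics can be computed; the function names, data layout, and the worst-group-gap form of the fairness measure are assumptions for illustration, not the published DermoBench implementation.

```python
"""Illustrative scoring helpers for DermoBench-style evaluation.

A minimal sketch only: function names, the worst-group fairness gap,
and the data layout are assumptions, not the published implementation.
"""
from collections import defaultdict
from statistics import mean


def overall_score(dimension_scores: dict[str, float]) -> float:
    """Arithmetic mean over the six rubric dimensions (1-5 scale)."""
    return mean(dimension_scores.values())


def mcqa_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of closed-ended MCQA items answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def fairness_gap(records: list[dict]) -> float:
    """One plausible disparity measure: best-group minus worst-group accuracy
    over Fitzpatrick skin-type strata (assumed, not the paper's exact formula).
    Each record carries 'group', 'pred', and 'answer' keys."""
    by_group = defaultdict(list)
    for r in records:
        by_group[r["group"]].append(r["pred"] == r["answer"])
    accuracies = [mean(v) for v in by_group.values()]
    return max(accuracies) - min(accuracies)


if __name__ == "__main__":
    # Example: six rubric scores averaged into an overall case score.
    print(overall_score({"accuracy": 4, "safety": 5, "groundedness": 4,
                         "coverage": 3, "coherence": 4, "precision": 4}))
```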
4. Benchmarking Protocols and Data Integrity
Evaluator Workflow (SkinGPT-R1) (Shen et al., 19 Nov 2025): Each candidate system's CoT is scored against the locked reference by DermEval-calibrated human raters using standardized prompts; inference settings are normalized across systems.
LLM-based Judging (MLLM Benchmarks) (Shen et al., 12 Nov 2025): Scoring is performed by an LLM-based judge (DermEval), combining image embeddings with narrative evaluation to output six-dimensional ratings.
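As a rough illustration of what a six-dimension judging request can look like, the sketch below builds a rubric prompt and parses a JSON score object; the prompt wording, output format, and parsing are assumptions, not the actual DermEval interface.

```python
"""Hedged sketch of a rubric-style judging prompt; not the actual DermEval
prompt or model interface, only the shape of a six-dimension request."""
import json

DIMENSIONS = ["Accuracy", "Safety", "Medical Groundedness",
              "Clinical Coverage", "Reasoning Coherence", "Description Precision"]


def build_judge_prompt(reference: str, candidate: str) -> str:
    """Assemble a prompt asking a judge model for one 1-5 score per dimension."""
    rubric = "\n".join(f"- {d}: integer 1-5" for d in DIMENSIONS)
    return (
        "You are a dermatology evaluation judge. Compare the candidate narrative "
        "to the locked reference and return JSON scores.\n"
        f"Rubric:\n{rubric}\n\nReference:\n{reference}\n\nCandidate:\n{candidate}\n"
        "Return only a JSON object mapping each dimension to its score."
    )


def parse_scores(model_output: str) -> dict[str, int]:
    """Parse the judge's JSON response into a dimension-to-score mapping."""
    return {dim: int(score) for dim, score in json.loads(model_output).items()}
```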
Leakage Prevention (DermoGPT) (Ru et al., 5 Jan 2026): Patient-level split, perceptual-hash deduplication (Hamming distance ≤ 2), and prompt/template isolation are strictly enforced; gold references for evaluation are never exposed during training.
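A minimal sketch of such perceptual-hash deduplication is shown below, assuming the third-party `imagehash` and `Pillow` packages; the benchmark specifies the Hamming threshold but not the tooling, so the library choice and file handling are illustrative.

```python
"""Sketch of perceptual-hash deduplication at a Hamming-distance threshold.

Assumes the `imagehash` and `Pillow` packages; illustrative only.
"""
from pathlib import Path

import imagehash
from PIL import Image

HAMMING_THRESHOLD = 2  # pairs at or below this distance count as near-duplicates


def deduplicate(image_dir: str) -> list[Path]:
    """Keep one representative per perceptual-hash near-duplicate cluster."""
    kept: list[tuple[Path, imagehash.ImageHash]] = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        # `h - other` returns the Hamming distance between the two hashes.
        if all(h - other > HAMMING_THRESHOLD for _, other in kept):
            kept.append((path, h))
    return [p for p, _ in kept]
```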
Cleaning Protocol Steps (DermoBench Cleaning) (Gröger et al., 2023):
- Stage 1: Self-supervised Vision Transformer (ViT-Tiny) pretraining, latent embedding extraction, clustering and outlier/duplicate ranking, and label error scoring via an embedding-based anomaly metric (a minimal sketch of this ranking step appears after this list).
- Stage 2: Human confirmation, requiring a set number of consecutive "no-issue" annotations per item, with conservative unanimous agreement by three experts.
- Empirical results indicate up to 3% near-duplicates and 1.2% label errors across the surveyed datasets; cleaning substantially increases performance reliability.
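The sketch below illustrates the Stage 1 duplicate- and outlier-ranking step over precomputed embeddings; the cosine-similarity and k-NN-distance heuristics are stand-ins for the protocol's exact scoring functions.

```python
"""Illustrative Stage-1 ranking over precomputed ViT embeddings.

Assumes embeddings are already extracted (shape [n, d]); the ranking
heuristics below are assumptions, not the published scoring functions.
"""
import numpy as np


def duplicate_candidates(emb: np.ndarray, top_k: int = 100):
    """Return index pairs ranked by descending cosine similarity."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)            # ignore self-similarity
    i, j = np.triu_indices_from(sim, k=1)     # each unordered pair once
    order = np.argsort(-sim[i, j])[:top_k]
    return list(zip(i[order], j[order], sim[i, j][order]))


def outlier_scores(emb: np.ndarray, k: int = 10) -> np.ndarray:
    """Mean distance to the k nearest neighbours; larger means more anomalous."""
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)           # exclude self-distance
    knn = np.sort(dists, axis=1)[:, :k]
    return knn.mean(axis=1)
```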
5. Task Structure and Clinical Axes
DermoBenchās subtasks, as formalized by DermoGPT (Ru et al., 5 Jan 2026), evaluate distinct components of dermatologic reasoning:
- Axis I: Morphology
- Open-ended detailed and morph-grounded descriptions, structured attribute extraction (using schemas like Derm7pt, SkinCon), and attribute MCQAs.
- Axis II: Diagnosis
- Fine- and coarse-grained MCQAs, hierarchical root-to-leaf classification, and out-of-distribution assessment.
- Axis III: Reasoning
- Chain-of-thought diagnosis, morphology-anchored reasoning and diagnosis, free-form text, and JSON-structured outputs.
- Axis IV: Fairness
- Stratified MCQAs to measure accuracy stability across skin types.
All open-ended tasks receive gold-standard expert references; closed-ended items use standardized distractor sampling to mirror clinical confusability.
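One way such confusability-aware distractor sampling can be realized is sketched below; grouping diagnoses by a shared parent category is an assumed proxy for clinical confusability and may differ from the benchmark's actual procedure.

```python
"""Sketch of confusability-aware distractor sampling for MCQA items.

The parent-category grouping is an assumed proxy; illustrative only.
"""
import random


def sample_distractors(answer: str, taxonomy: dict[str, str],
                       n: int = 3, seed: int = 0) -> list[str]:
    """Prefer distractors sharing the answer's parent category, then back-fill."""
    rng = random.Random(seed)
    parent = taxonomy[answer]
    same_parent = [d for d, p in taxonomy.items() if p == parent and d != answer]
    others = [d for d in taxonomy if d != answer and d not in same_parent]
    pool = rng.sample(same_parent, min(n, len(same_parent)))
    if len(pool) < n:
        pool += rng.sample(others, n - len(pool))
    return pool


# Example: melanoma's distractors are drawn first from other melanocytic lesions.
taxonomy = {"melanoma": "melanocytic", "nevus": "melanocytic",
            "BCC": "keratinocytic", "SCC": "keratinocytic",
            "dermatofibroma": "fibrous"}
print(sample_distractors("melanoma", taxonomy))
```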
6. Empirical Results and Impact
Across implementations, DermoBench demonstrates high measurement reliability and exposes model strengths and failure modes:
- SkinGPT-R1 (Shen et al., 19 Nov 2025): Achieves the highest overall score (4.031/5) across all six dimensions, a 41% gain over the Vision-R1 baseline, with marked improvements in accuracy, clinical coverage, reasoning, and precision.
- MLLM Narrative Benchmarks (Shen et al., 12 Nov 2025): Expert and automatic judge ratings show close per-dimension alignment (mean deviations ≤ 0.30 on the 5-point scale), with strongest agreement in safety and accuracy.
- DermoGPT (Ru et al., 5 Jan 2026): Human performance baselines are reported for all subtasks (e.g., T1.1: 73.36, T3.1 CoT: 82.15), enabling quantification of the humanāAI gap and identification of systematic weaknesses in attribute extraction and reasoning coherence.
- Cleaning Protocol (Gröger et al., 2023): Cleaning produces measurable shifts in AUROC and AUPRG, underscoring the necessity of rigorous artifact removal for stable model selection.
A plausible implication is that the inclusion of explicit, multi-axis rubrics and human expert grounding can substantially enhance both the reproducibility and clinical relevance of model performance estimates in dermatology.
7. Usage Guidelines and Extension
Adoption of DermoBench requires adherence to published cleaned manifests, dimension-specific rubrics, and isolation protocols to prevent cross-contamination and leakage. Evaluation scripts use provided CSV manifest files and standardized metrics (e.g., AUROC, AUPRG) (Gröger et al., 2023), and open-ended outputs are scored by LLM-judge workflows or human experts as defined by the version in use. The DermoBench suite, along with supporting open-source data (where available), forms the foundation for next-generation dermatologic model assessment, supporting both general-purpose medical VLMs and specialty-trained MLLMs (Ru et al., 5 Jan 2026).
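As a minimal usage sketch, the snippet below evaluates a binary classifier restricted to a cleaned manifest; the column names and manifest path are assumptions (the released manifests define the actual schema), and AUPRG is omitted because it requires its own implementation.

```python
"""Minimal evaluation harness over a cleaned test-set manifest.

Column names ('image_id', 'label') and the manifest schema are assumptions;
consult the released manifests for the actual format.
"""
import pandas as pd
from sklearn.metrics import roc_auc_score


def evaluate(manifest_csv: str, scores: dict[str, float]) -> float:
    """Compute AUROC for a binary classifier restricted to the cleaned manifest."""
    manifest = pd.read_csv(manifest_csv)          # one row per retained test image
    y_true = manifest["label"].to_numpy()
    y_score = [scores[image_id] for image_id in manifest["image_id"]]
    return roc_auc_score(y_true, y_score)
```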