DermCoT Corpus: Dermatologic Diagnosis Dataset

Updated 13 April 2026

The DermCoT corpus is a standardized dermatologic dataset of image–chain-of-thought narrative pairs that facilitate explicit, step-wise reasoning for skin diagnosis.
It includes 10,000 training cases and 3,000 test cases certified by board-certified dermatologists, curated through a rigorous automated and human review process.
The dataset employs a structured three-layer annotation (Observation, Reasoning, Diagnosis) to support explainable vision-language model research in dermatology.

The DermCoT corpus is a standardized dermatologic dataset composed of image–chain-of-thought (CoT) narrative pairs, specifically designed to facilitate explicit, step-wise reasoning for skin condition diagnosis. Developed to support SkinGPT-R1 and broader vision-language model (VLM) research in dermatology, DermCoT provides a curated benchmark emphasizing clinical correctness, safety, and explainability across a diverse range of skin pathologies (Shen et al., 19 Nov 2025).

1. Composition, Scale, and Curation

DermCoT comprises 13,000 total cases, divided into:

10,000 DermEval-filtered training cases: These are automatically constructed and selected via a standardized VLM-driven process incorporating rigorous automated and human-aligned scoring.
3,000 certified test cases: Independently reviewed and corrected by board-certified dermatologists, these are held out for evaluation and are distinct from the training set.

Sampling practices enforce balance by diagnosis category and anatomic region, although per-condition frequency distributions are not reported. All data are derived from the DermNet image repository, leveraging its ground-truth diagnostic labels. No proprietary clinical write-ups are included; all text is synthesized and subsequently audited.
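The source does not say how balance is enforced. As an illustration only, one simple approach is to cap the number of cases kept per (diagnosis, anatomic site) pair. The function name `cap_per_group`, the cap value, and the toy records below are assumptions made for this sketch, not part of the DermCoT release.

```python
from collections import defaultdict

def cap_per_group(cases, max_per_group):
    """Keep at most `max_per_group` cases per (diagnosis, site) pair, in input order."""
    counts = defaultdict(int)
    kept = []
    for case in cases:
        key = (case["diagnosis_label"], case["anatomic_site"])
        if counts[key] < max_per_group:
            counts[key] += 1
            kept.append(case)
    return kept

# Toy records (hypothetical); real entries would come from the corpus.
cases = [
    {"image_id": "a", "diagnosis_label": "rosacea", "anatomic_site": "nose"},
    {"image_id": "b", "diagnosis_label": "rosacea", "anatomic_site": "nose"},
    {"image_id": "c", "diagnosis_label": "rosacea", "anatomic_site": "cheek"},
    {"image_id": "d", "diagnosis_label": "psoriasis", "anatomic_site": "elbow"},
]
balanced = cap_per_group(cases, max_per_group=1)
print([c["image_id"] for c in balanced])  # ['a', 'c', 'd']
```

A cap like this limits over-represented groups but cannot create cases for under-represented ones, so reporting per-group counts alongside the sample is still advisable.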

2. Data Generation, Filtering, and Certification

Automated Pipeline

The 15,000 candidate training cases are generated via a three-stage pipeline:

Stage 1: A pretrained “observation-only” VLM (Gemini-2.5) generates image captions constrained to morphological and anatomical descriptions, intentionally omitting any diagnostic claims.
Stage 2: A label-aware drafting model (OpenAI O1) constructs a hierarchical reasoning draft, linking visual findings to logical inferences that culminate in the established diagnosis.
Stage 3: A normalization component rewrites outputs into a canonical three-layer CoT form: Observation → Reasoning → Diagnosis.
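The three stages can be sketched as a chain of functions. This is a structural outline only: every function below is a hypothetical placeholder that returns canned text, and no real model API is called. The actual prompts, model interfaces, and normalization rules are not described in the source.

```python
from dataclasses import dataclass

@dataclass
class CoTNarrative:
    observation: str
    reasoning: str
    diagnosis: str

def describe_image(image_id: str) -> str:
    """Stage 1 placeholder: observation-only caption, no diagnostic claims."""
    return f"[morphology and anatomic site for {image_id}]"

def draft_reasoning(observation: str, label: str) -> str:
    """Stage 2 placeholder: label-aware draft linking findings to the known diagnosis."""
    return f"Findings: {observation} These findings support {label}."

def normalize(observation: str, draft: str, label: str) -> CoTNarrative:
    """Stage 3 placeholder: rewrite into Observation -> Reasoning -> Diagnosis."""
    return CoTNarrative(
        observation=observation.strip(),
        reasoning=draft.strip(),
        diagnosis=label,
    )

def build_case(image_id: str, label: str) -> CoTNarrative:
    """Run the three stages in order for one image and its ground-truth label."""
    observation = describe_image(image_id)
    draft = draft_reasoning(observation, label)
    return normalize(observation, draft, label)

case = build_case("DN12345", "Papulopustular rosacea")
print(case.diagnosis)  # Papulopustular rosacea
```

Note that only Stage 2 and Stage 3 receive the label; keeping Stage 1 label-free mirrors the stated intent that observations omit diagnostic claims.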

Filtering Procedure

Cases are scored by DermEval, a six-dimensional LLaVA-based evaluator reflecting clinician-defined criteria. Only instances with a mean DermEval score of at least 4.5/5 are retained (10,000 total), subject to sampling balance constraints. Certified test cases are not subject to this selection, being reviewed independently.
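A minimal sketch of the threshold step, assuming each candidate carries a `dermeval_scores` mapping with six integer scores. The helper names and the toy candidates are illustrative, and the sampling balance constraints mentioned above are not modeled here.

```python
from statistics import mean

THRESHOLD = 4.5  # minimum mean DermEval score stated in the text

def mean_dermeval(scores: dict) -> float:
    """Unweighted mean of the per-dimension scores."""
    return mean(scores.values())

def filter_cases(cases, threshold=THRESHOLD):
    """Keep cases whose mean score is at least `threshold`."""
    return [c for c in cases if mean_dermeval(c["dermeval_scores"]) >= threshold]

# Hypothetical candidates: means are 27/6 = 4.5 and 25/6 (about 4.17).
candidates = [
    {"image_id": "keep", "dermeval_scores": {
        "Accuracy": 5, "Safety": 5, "MedicalGroundedness": 4,
        "ClinicalCoverage": 4, "ReasoningCoherence": 5, "DescriptionPrecision": 4}},
    {"image_id": "drop", "dermeval_scores": {
        "Accuracy": 4, "Safety": 4, "MedicalGroundedness": 4,
        "ClinicalCoverage": 5, "ReasoningCoherence": 4, "DescriptionPrecision": 4}},
]
retained = filter_cases(candidates)
print([c["image_id"] for c in retained])  # ['keep']
```

The comparison is inclusive (`>=`), matching the "at least 4.5/5" wording, so a case sitting exactly on the threshold is retained.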

Certification Process

The certified test set consists of 3,000 entries, each subjected to rigorous clinical audit. Board-certified dermatologists review, correct, or remove items as necessary, with all cases judged according to the six-dimensional DermBench rubric. This set is locked prior to any model tuning or evaluation.

3. Annotation Schema and Chain-of-Thought Standardization

Each DermCoT narrative is structured in a standardized three-layer CoT format:

Observation: Structured description of anatomical site, primary/secondary morphology, distribution, color, and surface alterations.
Reasoning: Sequential, evidence-first logical progression connecting observations to candidate differentials.
Diagnosis: Final diagnostic conclusion, calibrated to available findings.

Clinician annotation guidelines mandate use of controlled dermatologic vocabulary, explicit evidentiary logic, hierarchical structuring, and avoidance of unsupported claims. Layer lengths are typically constrained to approximately 3–5 sentences to maintain narrative focus.
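A lightweight structural check for the three layers might look like the following. It is a sketch under stated assumptions: the sentence counter is deliberately naive (abbreviations such as "e.g." will be miscounted), only an upper bound is checked because the guideline is described as approximate, and `check_narrative` is a hypothetical helper rather than part of any released tooling.

```python
import re

LAYERS = ("observation", "reasoning", "diagnosis")

def count_sentences(text: str) -> int:
    """Naive count: split on ., !, or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])

def check_narrative(narrative: dict, max_sentences: int = 5) -> list:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for layer in LAYERS:
        text = narrative.get(layer, "")
        if not text.strip():
            problems.append(f"{layer}: missing or empty")
        elif count_sentences(text) > max_sentences:
            problems.append(f"{layer}: longer than {max_sentences} sentences")
    return problems

narrative = {
    "observation": "Erythema on the nasal dorsum. Scattered pustules.",
    "reasoning": "",
    "diagnosis": "Papulopustular rosacea",
}
print(check_narrative(narrative))  # ['reasoning: missing or empty']
```

Checks of this kind cover structure only; controlled vocabulary and evidentiary logic still require clinical review.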

4. Evaluation Framework and Scoring

DermEval provides per-case, six-dimensional scoring:

Dimension	Definition	Score Range
Accuracy	Correctness of observations and diagnosis	1–5
Safety	Absence of harmful or misleading recommendations	1–5
Medical Groundedness	Factual alignment with dermatologic knowledge	1–5
Clinical Coverage	Completeness (findings, differentials, follow-up)	1–5
Reasoning Coherence	Logical, internally consistent progression	1–5
Description Precision	Clarity and correctness of terminology	1–5

A common overall metric is the mean of the six scores:

$\text{OverallScore} = \frac{1}{6}\sum_{i=1}^6 s_i$

where $s_i$ is the score for the $i$-th criterion. Standard inter-rater reliability statistics (e.g., Cohen’s κ) are not reported. DermEval is trained to align with physician ratings.
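As a worked example with hypothetical scores (Accuracy 5, Safety 5, Medical Groundedness 4, Clinical Coverage 4, Reasoning Coherence 5, Description Precision 4):

$\text{OverallScore} = \frac{5 + 5 + 4 + 4 + 5 + 4}{6} = \frac{27}{6} = 4.5$

A case with these scores would sit exactly at the 4.5/5 retention threshold used in the filtering procedure.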

5. Data Structure and Representation

DermCoT samples are presented in structured JSON, with each entry comprising:

Unique image identifier and anatomical site
Diagnosis label
CoT narrative partitioned into observation, reasoning, and diagnosis sections
Six-dimensional DermEval scores
Train/test split indicator

Sample entry:

{
  "image_id": "DN12345",
  "anatomic_site": "nose",
  "diagnosis_label": "Papulopustular rosacea",
  "CoT_narrative": {
    "observation": "Close-up of the nasal dorsum showing diffuse erythema, telangiectasias, and scattered pustules on an edematous background.",
    "reasoning": "Erythema plus pustules and telangiectasia in this distribution strongly suggests rosacea; differential includes acneiform drug reaction or lupus, but lack of comedones and photodistribution favors rosacea.",
    "diagnosis": "Papulopustular rosacea"
  },
  "dermeval_scores": {
    "Accuracy": 5,
    "Safety": 5,
    "MedicalGroundedness": 5,
    "ClinicalCoverage": 5,
    "ReasoningCoherence": 5,
    "DescriptionPrecision": 5
  },
  "split": "train"
}
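Entries in this shape can be parsed and sanity-checked with the standard library. The sketch below assumes the key names shown in the sample entry; the `"test"` split value is an assumption (only `"train"` appears in the sample), the narrative strings are abbreviated, and `validate_entry` is a hypothetical helper.

```python
import json

REQUIRED_KEYS = {"image_id", "anatomic_site", "diagnosis_label",
                 "CoT_narrative", "dermeval_scores", "split"}
NARRATIVE_KEYS = {"observation", "reasoning", "diagnosis"}
SCORE_KEYS = {"Accuracy", "Safety", "MedicalGroundedness",
              "ClinicalCoverage", "ReasoningCoherence", "DescriptionPrecision"}

def validate_entry(entry: dict) -> None:
    """Raise ValueError if the entry does not match the expected layout."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if set(entry["CoT_narrative"]) != NARRATIVE_KEYS:
        raise ValueError("CoT_narrative must have exactly three layers")
    scores = entry["dermeval_scores"]
    if set(scores) != SCORE_KEYS:
        raise ValueError("dermeval_scores must have exactly six dimensions")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("scores must lie in the range 1 to 5")
    if entry["split"] not in {"train", "test"}:  # "test" is assumed
        raise ValueError("split must be 'train' or 'test'")

raw = """{
  "image_id": "DN12345", "anatomic_site": "nose",
  "diagnosis_label": "Papulopustular rosacea",
  "CoT_narrative": {"observation": "Diffuse erythema with pustules.",
                    "reasoning": "Pattern suggests rosacea.",
                    "diagnosis": "Papulopustular rosacea"},
  "dermeval_scores": {"Accuracy": 5, "Safety": 5, "MedicalGroundedness": 5,
                      "ClinicalCoverage": 5, "ReasoningCoherence": 5,
                      "DescriptionPrecision": 5},
  "split": "train"
}"""
entry = json.loads(raw)
validate_entry(entry)  # passes silently for a well-formed entry
print(entry["CoT_narrative"]["diagnosis"])  # Papulopustular rosacea
```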

6. Example Narratives and Clinical Fidelity

Certified cases exhibit high dermatologic specificity and reasoning quality. Two illustrative CoT examples include:

Papulopustular rosacea on nose: Observation details morphological features and distribution, reasoning distinguishes from acne and lupus based on comedones and photodistribution, culminating in a precise diagnosis.
Superficial basal cell carcinoma on lower leg: Observation describes plaque morphology and border, reasoning considers differential with melanoma and SCC, resolved through features such as lack of pigmentation.

Both cases receive perfect scores across all six DermEval dimensions, demonstrating the intended level of narrative clarity and clinical rigor.

7. Limitations, Best Practices, and Usage Guidance

DermCoT is restricted to images and diagnoses as cataloged in DermNet; generalizability to external data (e.g., photographs from varied devices, non-curated patient populations) is untested. Potential dataset biases may arise from disproportionate skin tone, device, or anatomical site representation.

Recommended use includes augmentation with additional cohorts representing greater skin-type and geographic diversity, performance stratification across subpopulations, and active monitoring for generalization errors. The corpus remains a curated slice of dermatologic practice and should be situated within broader clinical validation pipelines (Shen et al., 19 Nov 2025).
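Performance stratification can be as simple as grouping predictions by a metadata field and reporting accuracy per group. The sketch below groups by `anatomic_site`, which appears in the corpus schema; stratifying by skin type would require external metadata that the source does not describe. The `predicted` field and the toy records are hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(records, group_key):
    """Fraction of records per group where the prediction matches the label."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        correct[group] += int(record["predicted"] == record["diagnosis_label"])
    return {group: correct[group] / totals[group] for group in totals}

# Hypothetical model outputs joined with corpus labels.
records = [
    {"anatomic_site": "nose", "diagnosis_label": "rosacea", "predicted": "rosacea"},
    {"anatomic_site": "nose", "diagnosis_label": "rosacea", "predicted": "acne"},
    {"anatomic_site": "lower leg", "diagnosis_label": "BCC", "predicted": "BCC"},
]
print(accuracy_by_group(records, "anatomic_site"))
# {'nose': 0.5, 'lower leg': 1.0}
```

Per-group sample sizes should be reported next to these figures, since accuracy on a handful of cases is not informative on its own.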


References (1)

SkinGPT-R1: Adapter-Only Dual Distillation for Efficient Dermatology Reasoning (2025)
