
DermCoT Corpus: Dermatologic Diagnosis Dataset

Updated 13 April 2026
  • DermCoT corpus is a standardized dermatologic dataset comprising image-chain-of-thought narratives that facilitate explicit, step-wise reasoning for skin diagnosis.
  • It includes 10,000 training cases and 3,000 test cases certified by board-certified dermatologists, curated through automated filtering and human review.
  • The dataset employs a structured three-layer annotation (Observation, Reasoning, Diagnosis) to support explainable vision-language model research in dermatology.

The DermCoT corpus is a standardized dermatologic dataset composed of image–chain-of-thought (CoT) narrative pairs, designed to facilitate explicit, step-wise reasoning for skin condition diagnosis. Developed to support SkinGPT-R1 and broader vision-language model (VLM) research in dermatology, DermCoT provides a curated benchmark emphasizing clinical correctness, safety, and explainability across a diverse range of skin pathologies (Shen et al., 19 Nov 2025).

1. Composition, Scale, and Curation

DermCoT comprises 13,000 total cases, divided into:

  • 10,000 DermEval-filtered training cases: These are automatically constructed and selected via a standardized VLM-driven process incorporating rigorous automated and human-aligned scoring.
  • 3,000 certified test cases: Independently reviewed and corrected by board-certified dermatologists, these are held out for evaluation and are distinct from the training set.

Sampling practices enforce balance by diagnosis category and anatomic region, although per-condition frequency distributions are not reported. All data are derived from the DermNet image repository, leveraging its ground-truth diagnostic labels. No proprietary clinical write-ups are included; all text is synthesized and subsequently audited.

2. Data Generation, Filtering, and Certification

Automated Pipeline

The 15,000 candidate training cases are generated via a three-stage pipeline:

  1. Stage 1: A pretrained “observation-only” VLM (Gemini-2.5) generates image captions constrained to morphological and anatomical descriptions, intentionally omitting any diagnostic claims.
  2. Stage 2: A label-aware drafting model (OpenAI O1) constructs a hierarchical reasoning draft, linking visual findings to logical inferences that culminate in the established diagnosis.
  3. Stage 3: A normalization component rewrites outputs into a canonical three-layer CoT form: Observation → Reasoning → Diagnosis.
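
The three stages above can be sketched as a simple function composition. This is a minimal illustration, not the authors' implementation: the function names `build_narrative`, `describe_stub`, `draft_stub`, and `normalize_stub` are hypothetical, and the stubs return placeholder strings instead of calling any real model.

```python
from typing import Callable, Dict


def build_narrative(
    image_path: str,
    label: str,
    describe: Callable[[str], str],
    draft: Callable[[str, str], str],
    normalize: Callable[[str, str, str], Dict[str, str]],
) -> Dict[str, str]:
    """Chain the three stages; each stage is passed in as a callable."""
    observation = describe(image_path)            # Stage 1: description only, no diagnosis
    reasoning_draft = draft(observation, label)   # Stage 2: label-aware reasoning draft
    return normalize(observation, reasoning_draft, label)  # Stage 3: canonical form


# Placeholder stages (hypothetical); a real pipeline would call a model here.
def describe_stub(image_path: str) -> str:
    return "<morphology and anatomic site description>"


def draft_stub(observation: str, label: str) -> str:
    return f"<reasoning linking '{observation}' to {label}>"


def normalize_stub(observation: str, reasoning: str, label: str) -> Dict[str, str]:
    return {
        "observation": observation.strip(),
        "reasoning": reasoning.strip(),
        "diagnosis": label,
    }


narrative = build_narrative(
    "image.jpg", "<ground-truth label>", describe_stub, draft_stub, normalize_stub
)
print(list(narrative))
```

Passing the stages as callables keeps the sketch independent of any particular model API, which the source does not specify.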

Filtering Procedure

Cases are scored by DermEval, a six-dimensional LLaVA-based evaluator reflecting clinician-defined criteria. Only instances with a mean DermEval score of at least 4.5/5 are retained (10,000 total), subject to sampling balance constraints. Certified test cases are not subject to this selection, being reviewed independently.
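
The retention rule can be illustrated with a short sketch. It assumes each case carries a `dermeval_scores` mapping keyed by the six dimension names used in the corpus's JSON representation; the helper names `passes_filter` and `filter_cases` are hypothetical, and the sampling balance constraints are not modeled because the source does not describe them in detail.

```python
DIMENSIONS = (
    "Accuracy",
    "Safety",
    "MedicalGroundedness",
    "ClinicalCoverage",
    "ReasoningCoherence",
    "DescriptionPrecision",
)
THRESHOLD = 4.5  # minimum mean DermEval score stated in the text


def passes_filter(scores: dict, threshold: float = THRESHOLD) -> bool:
    """Return True when the mean of the six dimension scores meets the threshold."""
    mean_score = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return mean_score >= threshold


def filter_cases(cases: list, threshold: float = THRESHOLD) -> list:
    """Keep only cases whose mean DermEval score meets the threshold."""
    return [c for c in cases if passes_filter(c["dermeval_scores"], threshold)]


# Toy candidates with made-up scores (not taken from the corpus).
candidates = [
    {"image_id": "toy-1", "dermeval_scores": dict(zip(DIMENSIONS, (5, 5, 4, 5, 4, 4)))},
    {"image_id": "toy-2", "dermeval_scores": dict(zip(DIMENSIONS, (5, 4, 4, 5, 4, 4)))},
]
kept = filter_cases(candidates)
print([c["image_id"] for c in kept])
```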

Certification Process

The certified test set consists of 3,000 entries, each subjected to rigorous clinical audit. Board-certified dermatologists review, correct, or remove items as necessary, with all cases judged according to the six-dimensional DermBench rubric. This set is locked prior to any model tuning or evaluation.

3. Annotation Schema and Chain-of-Thought Standardization

Each DermCoT narrative is structured in a standardized three-layer CoT format:

  1. Observation: Structured description of anatomical site, primary/secondary morphology, distribution, color, and surface alterations.
  2. Reasoning: Sequential, evidence-first logical progression connecting observations to candidate differentials.
  3. Diagnosis: Final diagnostic conclusion, calibrated to available findings.

Clinician annotation guidelines mandate use of controlled dermatologic vocabulary, explicit evidentiary logic, hierarchical structuring, and avoidance of unsupported claims. Each layer is typically limited to about 3–5 sentences to maintain narrative focus.
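
A lightweight structural check for the three layers might look like the sketch below. It is an advisory heuristic under stated assumptions: the narrative is a dict with `observation`, `reasoning`, and `diagnosis` keys, sentences are counted by a naive punctuation split, and the 3–5 range is treated as a soft guideline since the source describes it as typical, not strict. The names `count_sentences` and `check_narrative` are hypothetical.

```python
import re
from typing import List

LAYERS = ("observation", "reasoning", "diagnosis")


def count_sentences(text: str) -> int:
    """Naive sentence count: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def check_narrative(narrative: dict, lo: int = 3, hi: int = 5) -> List[str]:
    """Return advisory messages for missing layers or atypical layer lengths."""
    issues = []
    for layer in LAYERS:
        text = narrative.get(layer, "")
        if not text.strip():
            issues.append(f"{layer}: missing or empty")
            continue
        n = count_sentences(text)
        if not lo <= n <= hi:
            issues.append(f"{layer}: {n} sentence(s), outside the typical {lo}-{hi} range")
    return issues


# Toy narrative with placeholder text (not taken from the corpus).
toy = {
    "observation": "First finding. Second finding. Third finding.",
    "reasoning": "Single linking sentence.",
}
issues = check_narrative(toy)
for message in issues:
    print(message)
```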

4. Evaluation Framework and Scoring

DermEval provides per-case, six-dimensional scoring:

| Dimension | Definition | Score Range |
| --- | --- | --- |
| Accuracy | Correctness of observations and diagnosis | 1–5 |
| Safety | Absence of harmful or misleading recommendations | 1–5 |
| Medical Groundedness | Factual alignment with dermatologic knowledge | 1–5 |
| Clinical Coverage | Completeness (findings, differentials, follow-up) | 1–5 |
| Reasoning Coherence | Logical, internally consistent progression | 1–5 |
| Description Precision | Clarity and correctness of terminology | 1–5 |

A common overall metric is the mean of the six scores:

$$\text{OverallScore} = \frac{1}{6}\sum_{i=1}^{6} s_i$$

where $s_i$ is the score for the $i$th criterion. Standard inter-rater reliability statistics (e.g., Cohen’s κ) are not reported. DermEval is trained to align with physician ratings.
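
As a worked illustration of the mean, consider a case with hypothetical scores (not taken from the corpus) of 5, 5, 4, 5, 4, and 4 across the six dimensions:

```latex
\[
\text{OverallScore} = \frac{5 + 5 + 4 + 5 + 4 + 4}{6} = \frac{27}{6} = 4.5
\]
```

Such a case would just meet the 4.5 retention threshold used in filtering, whereas lowering any single score by one point gives 26/6 ≈ 4.33 and the case would be excluded.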

5. Data Structure and Representation

DermCoT samples are presented in structured JSON, with each entry comprising:

  • Unique image identifier and anatomical site
  • Diagnosis label
  • CoT narrative partitioned into observation, reasoning, and diagnosis sections
  • Six-dimensional DermEval scores
  • Train/test split indicator

Sample entry:

{
  "image_id": "DN12345",
  "anatomic_site": "nose",
  "diagnosis_label": "Papulopustular rosacea",
  "CoT_narrative": {
    "observation": "Close-up of the nasal dorsum showing diffuse erythema, telangiectasias, and scattered pustules on an edematous background.",
    "reasoning": "Erythema plus pustules and telangiectasia in this distribution strongly suggests rosacea; differential includes acneiform drug reaction or lupus, but lack of comedones and photodistribution favors rosacea.",
    "diagnosis": "Papulopustular rosacea"
  },
  "dermeval_scores": {
    "Accuracy": 5,
    "Safety": 5,
    "MedicalGroundedness": 5,
    "ClinicalCoverage": 5,
    "ReasoningCoherence": 5,
    "DescriptionPrecision": 5
  },
  "split": "train"
}
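
An entry in this layout can be parsed and sanity-checked with the standard library. The sketch below is illustrative only: `validate_entry` is a hypothetical helper, the narrative strings are placeholders, and `"test"` as a split value is an assumption, since the sample shows only `"train"`.

```python
import json

REQUIRED_KEYS = {
    "image_id", "anatomic_site", "diagnosis_label",
    "CoT_narrative", "dermeval_scores", "split",
}
NARRATIVE_KEYS = {"observation", "reasoning", "diagnosis"}
ALLOWED_SPLITS = {"train", "test"}  # "test" is assumed; only "train" appears in the sample


def validate_entry(entry: dict) -> None:
    """Raise ValueError if the entry does not match the assumed layout."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"missing top-level keys: {sorted(missing)}")
    if set(entry["CoT_narrative"]) != NARRATIVE_KEYS:
        raise ValueError("CoT_narrative needs observation, reasoning, diagnosis")
    scores = entry["dermeval_scores"]
    if len(scores) != 6:
        raise ValueError(f"expected 6 DermEval scores, found {len(scores)}")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score {value} is outside the 1-5 range")
    if entry["split"] not in ALLOWED_SPLITS:
        raise ValueError(f"unexpected split value: {entry['split']!r}")


raw = """
{
  "image_id": "DN12345",
  "anatomic_site": "nose",
  "diagnosis_label": "Papulopustular rosacea",
  "CoT_narrative": {"observation": "...", "reasoning": "...", "diagnosis": "..."},
  "dermeval_scores": {
    "Accuracy": 5, "Safety": 5, "MedicalGroundedness": 5,
    "ClinicalCoverage": 5, "ReasoningCoherence": 5, "DescriptionPrecision": 5
  },
  "split": "train"
}
"""
entry = json.loads(raw)
validate_entry(entry)  # raises ValueError on a malformed entry
print(entry["image_id"], entry["split"])
```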

6. Example Narratives and Clinical Fidelity

Certified cases exhibit high dermatologic specificity and reasoning quality. Two illustrative CoT examples include:

  • Papulopustular rosacea on nose: The observation details morphological features and distribution; the reasoning distinguishes rosacea from acneiform drug reaction and lupus based on the absence of comedones and of a photodistributed pattern, culminating in a precise diagnosis.
  • Superficial basal cell carcinoma on lower leg: The observation describes plaque morphology and border; the reasoning weighs melanoma and squamous cell carcinoma (SCC) as differentials and resolves them through features such as lack of pigmentation.

Both cases receive perfect scores across all six DermEval dimensions, demonstrating the intended level of narrative clarity and clinical rigor.

7. Limitations, Best Practices, and Usage Guidance

DermCoT is restricted to images and diagnoses as cataloged in DermNet; generalizability to external data (e.g., photographs from varied devices, non-curated patient populations) is untested. Potential dataset biases may arise from disproportionate skin tone, device, or anatomical site representation.

Recommended use includes augmentation with additional cohorts representing greater skin-type and geographic diversity, performance stratification across subpopulations, and active monitoring for generalization errors. The corpus remains a curated slice of dermatologic practice and should be situated within broader clinical validation pipelines (Shen et al., 19 Nov 2025).
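
Performance stratification across subpopulations can be sketched as a grouped accuracy computation. The example assumes each evaluation record holds a ground-truth `diagnosis_label`, a model output under a hypothetical `predicted` key, and a hypothetical subgroup field such as `skin_type`; DermCoT itself is not described as providing such a field, so it would have to come from an external annotation.

```python
from collections import defaultdict
from typing import Dict, List


def stratified_accuracy(records: List[dict], group_key: str) -> Dict[str, float]:
    """Compute diagnostic accuracy separately for each value of group_key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        correct[group] += int(record["predicted"] == record["diagnosis_label"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy records with placeholder groups and labels (not real data).
records = [
    {"skin_type": "A", "diagnosis_label": "x", "predicted": "x"},
    {"skin_type": "A", "diagnosis_label": "y", "predicted": "y"},
    {"skin_type": "B", "diagnosis_label": "x", "predicted": "x"},
    {"skin_type": "B", "diagnosis_label": "y", "predicted": "x"},
]
by_group = stratified_accuracy(records, "skin_type")
print(by_group)
```

A large gap between groups in such a breakdown would be one signal of the generalization errors the guidance above recommends monitoring.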
