DermoInstruct: A Dermatology MLLM Corpus
- DermoInstruct is a morphology-anchored instruction corpus that integrates richly annotated image–text trajectories with structured clinical and dermoscopic data.
- It synthesizes diverse datasets and applies a soft concept bottleneck to capture detailed morphological features and diagnostic reasoning in a unified framework.
- The corpus supports state-of-the-art model training using both supervised fine-tuning and reinforcement learning to achieve high diagnostic accuracy and fairness.
DermoInstruct is a large-scale, morphology-anchored instruction corpus and modeling paradigm for dermatological reasoning with multimodal LLMs (MLLMs). Designed to reflect the clinical diagnostic workflow, DermoInstruct synthesizes richly annotated image–text trajectories that couple visual evidence with structured morphological attributes, stepwise reasoning, and open-domain QA, facilitating high-fidelity training and benchmarking for dermatology-focused MLLMs (Ru et al., 5 Jan 2026). DermoInstruct builds on and extends methodologies established in prior resources such as DermaSynth (Yilmaz et al., 31 Jan 2025) and VL-MedGuide (Yu et al., 8 Aug 2025), incorporating advances in data diversity, annotation rigor, and clinically grounded supervision.
1. Corpus Construction and Scope
DermoInstruct comprises 211,243 distinct images sourced from fourteen public dermatology datasets. These include both clinical (e.g., DermNet, Fitzpatrick17k, MIDAS, PAD-UFES-20, PASSION, PUMCH, SCIN, SD-198) and dermoscopic (e.g., ISIC Archive, HAM10000, BCN20000, Derm12345) modalities, enabling broad coverage of practice-relevant phenotypes. After strict deduplication—using patient-level splits and perceptual hashing (pHash, Hamming distance ≤ 2)—the corpus contains approximately 82,000 dermoscopic and 129,243 clinical/smartphone images.
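The deduplication criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the corpus pipeline itself: it assumes 64-bit perceptual hashes have already been computed (e.g., with a pHash implementation such as the `imagehash` library, omitted here), and the `deduplicate` helper and its greedy keep-first strategy are illustrative assumptions.

```python
def hamming(h1: int, h2: int) -> int:
    """Hamming distance between two perceptual hashes encoded as integers."""
    return bin(h1 ^ h2).count("1")

def deduplicate(hashes: dict[str, int], threshold: int = 2) -> list[str]:
    """Keep an image only if no already-kept image is within `threshold` bits,
    mirroring the pHash, Hamming distance <= 2 criterion."""
    kept: list[str] = []
    for name, h in hashes.items():
        if all(hamming(h, hashes[k]) > threshold for k in kept):
            kept.append(name)
    return kept

# Two near-identical hashes (1 bit apart) collapse to a single entry.
hashes = {"a.jpg": 0b1010_1100, "a_dup.jpg": 0b1010_1101, "b.jpg": 0b0101_0011}
print(deduplicate(hashes))  # → ['a.jpg', 'b.jpg']
```

Note that pairwise comparison is quadratic in the number of images; at corpus scale one would bucket hashes (e.g., by hash prefix) before comparing, in addition to the patient-level splits the corpus applies first.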
Images are preprocessed according to stage-specific area constraints, with normalization to [0,1] in RGB and center cropping applied for aspect ratios exceeding 1.2. These procedures are designed to meet the requirements of both supervised fine-tuning and RL-based objectives.
The corpus supports 772,675 multi-turn instruction trajectories, divided across training and evaluation subsets (646,018 and 126,657, respectively) (Ru et al., 5 Jan 2026).
2. Annotation Schema and Morphological Anchoring
DermoInstruct adopts a soft concept bottleneck paradigm: each lesion is mapped to a binary morphology vector $c \in \{0,1\}^K$, where $K$ is the number of curated morphological attributes. Clinical images receive SkinCon-based annotations (48 features), while dermoscopic images receive attributes according to the expanded seven-point checklist. Each attribute’s clinical relevance is quantified via Pointwise Mutual Information (PMI) with respect to diagnostic outcomes:

$$\mathrm{PMI}(a, d) = \log \frac{P(a, d)}{P(a)\,P(d)}$$

These PMI scores are normalized into weights $w_a$ for use in reward shaping and similarity metrics.
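The PMI computation and weight normalization can be sketched from co-occurrence counts. The counts, attribute names, and the particular normalization (clipping negative PMI to zero, then dividing by the sum) are illustrative assumptions; the source states only that PMI scores are normalized into weights.

```python
import math

def pmi(n_ad: int, n_a: int, n_d: int, n: int) -> float:
    """PMI(a, d) = log[ P(a, d) / (P(a) P(d)) ] from co-occurrence counts."""
    return math.log((n_ad / n) / ((n_a / n) * (n_d / n)))

# Toy counts: attribute occurrences vs. a single diagnosis, n = 200 images.
scores = {"pigment_network": pmi(40, 50, 60, 200),   # strongly associated
          "scale": pmi(10, 80, 60, 200)}             # below-chance co-occurrence

# Assumed scheme: clip negative scores, normalize the rest to sum to 1.
pos = {a: max(s, 0.0) for a, s in scores.items()}
total = sum(pos.values())
weights = {a: s / total for a, s in pos.items()}
print(weights)
```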
Lesions are also mapped to a hierarchical ontology (nine superclasses, 325 fine-grained subclasses). Hierarchical similarity between true and predicted class labels is calculated by the Wu–Palmer method:

$$\mathrm{sim}_{\mathrm{WP}}(c_1, c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}$$

where $\mathrm{LCS}$ denotes the least common subsumer of the two ontology nodes.
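For a tree-shaped ontology, Wu–Palmer similarity reduces to a shared-prefix computation over root-to-node paths. A minimal sketch (the ontology paths shown are hypothetical; depth here counts the root as 1):

```python
def wu_palmer(path1: list[str], path2: list[str]) -> float:
    """Wu-Palmer similarity: 2*depth(LCS) / (depth(c1) + depth(c2)).
    Each path runs from the ontology root down to the class node."""
    lcs_depth = 0
    for a, b in zip(path1, path2):
        if a != b:
            break
        lcs_depth += 1
    return 2 * lcs_depth / (len(path1) + len(path2))

# Hypothetical root -> superclass -> subclass paths.
pred = ["skin", "melanocytic", "dysplastic_nevus"]
true = ["skin", "melanocytic", "melanoma"]
print(wu_palmer(pred, true))  # shared prefix of depth 2 over total depth 6 → 0.666…
```

Predictions that miss the subclass but land in the right superclass thus still earn partial credit, which is what makes the metric useful for hierarchical diagnosis rewards.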
Annotation entries typically contain detailed free-text descriptions, JSON-formatted morphological features, stepwise chain-of-thought reasoning, and a final diagnosis attributable to a recognized ontology node.
3. Task Formats and Instruction Trajectories
DermoInstruct instruction trajectories span five principal formats, each reflecting different stages of the clinical reasoning pipeline:
- Free-text morphology (T1.1): Models are prompted to generate unstructured descriptive reports.
- Structured attribute generation (T1.2): Annotation follows JSON schemas for morphological features.
- Chain-of-Thought reasoning (T3.1, T3.2): Given image and attributes (JSON), models perform stepwise inference and produce textual rationales plus a final diagnosis.
- Diagnosis VQA, flat and hierarchical (T2.1–T2.4): Models answer multiple-choice questions or engage in sequential taxonomic classification.
Sampling across these formats ensures heterogeneity; task assignment per image is either randomized or distributionally targeted, depending on experimental requirements. All five formats are subsumed under the multi-task training regime.
Pseudocode for Instruction Generation
```
for image in DermoInstruct.images:
    choose task t ~ Uniform({1, …, 5})
    if t in {1, 2}:                        # free-text / structured morphology
        prompt ← sample_morphology_prompt(t)
        response ← VLM.generate(image, prompt)
        store(image, prompt, response)
    elif t in {3, 4}:                      # chain-of-thought reasoning
        json ← generate_morph_JSON(image)
        prompt ← CoT_prompt(json, candidate_diagnoses)
        response ← VLM.generate(image, json, prompt)
        store(image, json, prompt, response)
    else:                                  # flat/hierarchical diagnosis
        if flat:
            options ← sample_distractors(ontology, image, true_label)
            prompt ← MCQA_prompt(options)
            answer ← VLM.answer(image, prompt)
            store(image, prompt, answer)
        else:                              # hierarchical
            path ← ontology_path(true_label)
            for level, opts in path.levels:
                prompt ← hierarchical_prompt(level, opts)
                ans ← VLM.answer(image, prompt)
                store(image, prompt, ans)
                if ans ≠ path[level]:
                    prompt ← correction_prompt(ans, path[level])
                    …  # interactive correction
```
4. Diversity, Balance, and Bias Assessment
DermoInstruct’s diagnostic label distribution conforms to realistic clinical prevalence patterns: the top 15 fine-grained classes cover approximately 30% of images, while the nine superclasses remain relatively balanced at 5–15% per group.
Morphology coverage is comprehensive: each of the seven-point dermoscopic attribute fields appears in over 50% of dermoscopy cases, and every one of the 48 SkinCon features occurs in at least 0.1% of the clinical subset.
Representation of Fitzpatrick skin types (I–V) is achieved through inclusion of diverse sources (e.g., Fitzpatrick17k, SCIN, PASSION). Fairness is assessed in DermoBench via MCQA-based evaluation, using the ratio of worst- to best-performing skin-type accuracy to quantify skin-tone predictive parity. This reflects a focus on evaluating, as well as promoting, equity across skin types in diagnostic reasoning.
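A worst-to-best accuracy ratio across skin-type groups is one common formulation of such a parity metric; the exact definition used in DermoBench is not reproduced in this excerpt, so the helper name, grouping, and numbers below are illustrative assumptions.

```python
def predictive_parity_ratio(acc_by_group: dict[str, float]) -> float:
    """Worst-to-best accuracy ratio across groups; 1.0 indicates perfect parity."""
    return min(acc_by_group.values()) / max(acc_by_group.values())

# Illustrative MCQA accuracies per Fitzpatrick skin-type bucket.
mcqa_accuracy = {"I-II": 0.84, "III-IV": 0.80, "V": 0.76}
print(round(predictive_parity_ratio(mcqa_accuracy), 3))  # → 0.905
```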
5. Model Training and Reinforcement Learning Objectives
DermoInstruct is used to train advanced MLLMs in dermatology, notably DermoGPT (Ru et al., 5 Jan 2026), following a two-phase protocol:
- Supervised Fine-Tuning (SFT): Multi-task cross-entropy over all five formats, typically in a single epoch. The base model (e.g., Qwen3-VL-8B-Instruct) receives LoRA adapters (rank 64, α=64, dropout 0.05). The vision tower and fusion merger are fine-tuned; the LLM head remains frozen.
- Morphologically-Anchored Visual-Inference-Consistent (MAVIC) RL: Post-SFT, Group Relative Policy Optimization (GRPO) shapes the policy with a composite reward that combines $R_{\text{acc}}$ (MCQA correctness) with hierarchical and morphology alignment terms $R_{\text{hier}}$ and $R_{\text{morph}}$; a sigmoid gate modulates the morphology reward. KL-penalty regularization (β = 0.1) constrains drift from the SFT policy.
This approach closely enforces consistency among visual cues, morphological interpretation, and diagnostic prediction, reflecting clinical workflow.
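The excerpt does not reproduce the exact reward composition, so the following is only a plausible sketch of the described structure: the mixing weights, the sigmoid gate's input, and the function name are all assumptions. (The KL penalty is applied by GRPO at the policy-update level and is therefore not part of this per-sample reward.)

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def mavic_reward(r_acc: float, r_hier: float, r_morph: float,
                 lam_h: float = 0.5, lam_m: float = 0.5) -> float:
    """Hypothetical composition: correctness plus hierarchy alignment, with the
    morphology term sigmoidally gated (weights and gate input are assumptions)."""
    gate = sigmoid(r_acc)  # assumed: morphology alignment counts more when correct
    return r_acc + lam_h * r_hier + gate * lam_m * r_morph

# A correct, morphology-consistent answer outscores a wrong one with the
# same morphology alignment.
print(mavic_reward(r_acc=1.0, r_hier=0.8, r_morph=0.9))
```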
6. Comparative Foundations and Related Methodologies
DermoInstruct synthesizes and expands upon approaches established in DermaSynth and VL-MedGuide:
- From DermaSynth: Incorporation of extensive, CC-BY-4.0–licensed dermatological images, metadata-grounded prompts targeting hallucination mitigation, and self-instruct loops for curriculum generation (Yilmaz et al., 31 Jan 2025).
- From VL-MedGuide: Modular prompt engineering, concept perception “bottlenecks”, and explainable chain-of-thought disease reasoning (Yu et al., 8 Aug 2025).
- Distinctiveness: DermoInstruct’s integration of a unified ontology (nine superclasses, 325 subclasses), dual morphology schema, explicit fairness evaluation, and reinforcement learning with visually anchored rewards distinguishes it as a uniquely comprehensive corpus for dermatological LLM training (Ru et al., 5 Jan 2026).
7. Applications and Impact
DermoInstruct serves as the foundation for DermoGPT, which is trained via SFT and the MAVIC RL objective to deliver state-of-the-art performance in dermatological morphology recognition, diagnostic accuracy, clinical reasoning, and fairness (Ru et al., 5 Jan 2026). The corpus additionally underpins the DermoBench suite—a benchmarking protocol with 11 tasks across Morphology, Diagnosis, Reasoning, and Fairness axes, including a subset of 3,600 expert-verified open-ended cases and human performance baselines.
DermoInstruct thus supports not only model development but also rigorous comparative evaluation, facilitating progress toward expert-level, explainable, and unbiased dermatology AI systems.