
DermaVQA-DAS: Dermatological Image Benchmark

Updated 6 January 2026
  • The paper introduces DermaVQA-DAS, a benchmark framework for closed-ended QA and lesion segmentation using patient-generated data and expert-designed assessments.
  • It employs a bilingual Dermatology Assessment Schema with 36 high-level questions and 27 detailed subquestions to ensure clinically meaningful interpretation of lesion characteristics.
  • Experimental results show BiomedParse’s superior performance in segmentation metrics while highlighting challenges in color and combination queries for future model improvements.

DermaVQA-DAS is a benchmark and framework for dermatological image understanding grounded in patient-generated data and focused on two rigorously defined tasks: closed-ended question answering (QA) and lesion segmentation. Anchored by the expert-designed Dermatology Assessment Schema (DAS), DermaVQA-DAS introduces a structured, bilingual (English and Chinese) assessment protocol that emphasizes clinically meaningful attributes across both tasks. By extending prior work on patient-authored queries in DermaVQA, this resource addresses limitations of conventional dermatoscopic datasets by incorporating real-world clinical context, expert annotation, and unified QA–segmentation linkage (Yim et al., 30 Dec 2025).

1. Dermatology Assessment Schema (DAS) – Structure and Clinical Grounding

Central to DermaVQA-DAS is the Dermatology Assessment Schema (DAS), co-developed with board-certified dermatologists to ensure comprehensive representation of clinically salient dermatological features. DAS comprises:

  • 36 high-level assessment questions: These span key dermatological domains (e.g., “Anatomic Location of Problem,” “Primary Morphology,” “Color of Lesion”) and are structured for direct clinical relevance.
  • 27 fine-grained subquestions: The nine most frequently occurring high-level categories branch into detailed subquestions, supporting nuanced assessment of specific attributes.
  • Bilingual presentation: All questions and multiple-choice answer options are available in English and Chinese, fostering inclusivity and cross-lingual research applicability.

DAS enforces standardized description of lesion characteristics by regulating attributes such as anatomic location (including duplicate-slot design for multi-site documentation), size, border, distribution, and color. This structured schema directly mirrors the features routinely interrogated in clinical settings.
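
As a minimal sketch of how such a structured, bilingual schema entry might be represented in code (all field names, the placeholder identifier, and the Chinese strings here are illustrative assumptions, not taken from the released schema):

```python
from dataclasses import dataclass, field

@dataclass
class DASQuestion:
    """Illustrative container for one DAS assessment question (hypothetical fields)."""
    qid: str                     # DAS-style identifier, e.g. "CQID025"
    text_en: str                 # English question text
    text_zh: str                 # Chinese question text
    options_en: list[str]        # multiple-choice answers, English
    options_zh: list[str]        # multiple-choice answers, Chinese
    subquestions: list["DASQuestion"] = field(default_factory=list)  # fine-grained branches

# Example entry mirroring the "Anatomic Location of Problem" category.
loc = DASQuestion(
    qid="CQID0XX",  # placeholder identifier
    text_en="Anatomic Location of Problem",
    text_zh="问题的解剖位置",
    options_en=["head/neck", "trunk", "upper extremity", "lower extremity"],
    options_zh=["头/颈", "躯干", "上肢", "下肢"],
)
```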

2. Dataset Construction and Annotation Protocols

DermaVQA-DAS instantiates its schema into two formally curated splits:

  • Closed-ended QA split: Contains 456 instances (300 train, 56 validation, 100 test), pairing a patient-generated image and free-text query with a single DAS assessment question. Annotation utilized three independent board-certified dermatologists, with consensus resolved by majority voting for each instance’s gold answer label.
  • Segmentation split: Comprises 2,474 patient-generated images, annotated with 7,448 expert binary lesion masks. Four medical annotators contributed three non-overlapping masks per image, synthesized via per-pixel majority vote to create robust ground truths. Each segmentation is indexed to DAS identifiers, providing unified linkage to the QA schema.
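
The per-pixel majority vote used to synthesize consensus masks can be sketched as follows (a minimal pure-Python illustration over nested lists; the actual annotation pipeline is not specified in this detail):

```python
def majority_vote_mask(masks):
    """Fuse K binary masks (each an H x W nested list of 0/1) into one
    consensus mask: a pixel is foreground iff a strict majority of
    annotators marked it (e.g. at least 2 of 3)."""
    k = len(masks)
    h, w = len(masks[0]), len(masks[0][0])
    return [[1 if 2 * sum(m[i][j] for m in masks) > k else 0
             for j in range(w)]
            for i in range(h)]

# Three toy 2x2 annotator masks; only pixels with >= 2 votes survive.
annotators = [[[1, 0], [1, 1]],
              [[1, 0], [0, 1]],
              [[0, 1], [0, 1]]]
consensus = majority_vote_mask(annotators)  # [[1, 0], [0, 1]]
```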

This dual annotation protocol ensures methodological rigor, consensus formation, and tight coupling between visual and textual features for downstream multimodal benchmarking.

3. Task Definitions and Model Input Structure

DermaVQA-DAS defines two primary tasks:

  • Closed-ended Question Answering (QA): Models receive (1) the original patient query (including both title and context) and (2) a specific DAS question prompt. Outputs consist of a selection among predefined multiple-choice answers. For queries involving multiple images, responses are aggregated using explicit rule-based schemes (e.g., union for location pairs, majority rule for color).
  • Segmentation: Models are presented with a single image and, optionally, an associated textual prompt. Outputs are binary lesion masks. Prompts can vary from default DAS questions to augmented patient query content.
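
The rule-based aggregation for multi-image QA queries can be sketched as below (a simplified illustration of the union and majority rules mentioned above; function and rule names are assumptions):

```python
from collections import Counter

def aggregate_answers(per_image_answers, rule):
    """Combine per-image multiple-choice answers for a multi-image query.

    rule="union"    -> sorted set of all selected options (location-style questions)
    rule="majority" -> most frequent option (e.g. color questions)
    """
    if rule == "union":
        return sorted(set(per_image_answers))
    if rule == "majority":
        return Counter(per_image_answers).most_common(1)[0][0]
    raise ValueError(f"unknown rule: {rule}")

aggregate_answers(["arm", "leg", "arm"], "union")       # ['arm', 'leg']
aggregate_answers(["red", "red", "brown"], "majority")  # 'red'
```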

Both tasks follow standardized evaluation protocols, enabling rigorous cross-model comparison and assessment.

4. Benchmarking Methodologies and Evaluation Metrics

Segmentation evaluation utilizes three aggregation schemes:

  • Mean-of-Max: Aggregates performance via the average maximal overlap score per image.
  • Mean-of-Mean: Reflects average overlap across all annotators per image.
  • Majority-Vote Microscore: Assesses per-pixel agreement against a consensus mask.
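
The first two schemes can be sketched directly over per-annotator overlap scores (a minimal illustration; the majority-vote microscore instead pools per-pixel agreement against the consensus mask and is omitted here):

```python
def mean_of_max(scores):
    """scores: one list per image, each holding per-annotator overlap scores.
    Averages the best per-image score across the dataset."""
    return sum(max(s) for s in scores) / len(scores)

def mean_of_mean(scores):
    """Averages the mean per-annotator score of each image across the dataset."""
    return sum(sum(s) / len(s) for s in scores) / len(scores)

per_image = [[0.8, 0.6, 0.7], [0.5, 0.9, 0.4]]
mean_of_max(per_image)   # (0.8 + 0.9) / 2 = 0.85
mean_of_mean(per_image)  # (0.7 + 0.6) / 2 = 0.65
```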

Key metrics include:

  • Jaccard Index: $\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$, where $P$ is the predicted mask and $G$ is the ground truth.
  • Dice Score: $\mathrm{Dice} = \frac{2\,|P \cap G|}{|P| + |G|}$.
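
Both metrics follow directly from their set-overlap definitions; for flat binary masks they can be computed as (a minimal sketch, with empty-mask pairs conventionally scored as perfect overlap):

```python
def iou_and_dice(pred, gold):
    """Jaccard (IoU) and Dice for flat binary masks given as lists of 0/1."""
    inter = sum(p & g for p, g in zip(pred, gold))  # |P ∩ G|
    p_sum, g_sum = sum(pred), sum(gold)             # |P|, |G|
    union = p_sum + g_sum - inter                   # |P ∪ G|
    iou = inter / union if union else 1.0
    dice = 2 * inter / (p_sum + g_sum) if (p_sum + g_sum) else 1.0
    return iou, dice

iou_and_dice([1, 1, 0, 0], [1, 0, 1, 0])  # (0.333..., 0.5)
```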

Closed-ended QA is measured by average model accuracy across all benchmarked questions.

5. Experimental Results and Comparative Model Performance

Segmentation was tested with MedSAM and BiomedParse architectures. BiomedParse demonstrated superior overlap metrics in all aggregation protocols:

| Model | Prompt Type | Jaccard (IoU) | Dice Score | Aggregation Scheme |
|---|---|---|---|---|
| BiomedParse | Default (I) | 0.5088 | 0.6133 | Mean-of-Max |
| BiomedParse | Default (I) | 0.4253 | 0.5315 | Mean-of-Mean |
| BiomedParse | Augmented (III) | ≈0.395 | ≈0.566 | Majority-Vote Microscore |
| MedSAM | Any | ≈0.410 | ≈0.525 | Mean-of-Max |

Augmented prompts comprising both patient query title and content maximized performance under consensus-based microscore evaluation. MedSAM’s segmentation output remained invariant to prompt design, underscoring the importance of model architecture and prompting for clinical imaging tasks.

Closed-ended QA benchmarking among six state-of-the-art multimodal LLMs yielded strong results:

  • o3 (2025-04-16): Best overall accuracy, 0.798
  • GPT-4.1: 0.796
  • Gemini-1.5-Pro: 0.783 (outperforming Gemini-2.0-Flash at 0.768)

Performance varied substantially by DAS question type; visually salient queries (e.g., CQID025) routinely exceeded 0.90 accuracy, while color and combination attributes (CQID034, CQID036) presented broader variance and constituted unresolved challenges.

6. Dataset Release, Use Cases, and Prospective Directions

The entirety of DermaVQA-DAS—including bilingual DAS schema, expert annotations, and evaluation code—is publicly accessible (https://osf.io/72rp3). The structured linkage of QA and segmentation supports diverse research directions such as clinical decision support systems and patient-facing teledermatology assistants.

Prospective extensions include:

  • Expansion of DAS to encompass rare skin conditions and broader skin-tone categories.
  • Integration of longitudinal clinical histories.
  • Development of open-ended or interactive QA paradigms reflective of real-world clinical reasoning.

This suggests a path toward more versatile, context-aware dermatological AI systems and a comprehensive foundation for vision–language modeling throughout patient-centered care workflows (Yim et al., 30 Dec 2025).
