DermaVQA-DAS: Dermatological Image Benchmark
- The paper introduces DermaVQA-DAS, a benchmark framework for closed-ended QA and lesion segmentation using patient-generated data and expert-designed assessments.
- It employs a bilingual Dermatology Assessment Schema (DAS) with 36 high-level questions and 27 fine-grained subquestions to ensure clinically meaningful interpretation of lesion characteristics.
- Experimental results show BiomedParse outperforming MedSAM on segmentation overlap metrics, while color and combination queries remain open challenges for future models.
DermaVQA-DAS is a benchmark and framework for dermatological image understanding grounded in patient-generated data and focused on two rigorously defined tasks: closed-ended question answering (QA) and lesion segmentation. Anchored by the expert-designed Dermatology Assessment Schema (DAS), DermaVQA-DAS introduces a structured, bilingual (English and Chinese) assessment protocol that emphasizes clinically meaningful attributes across both tasks. By extending prior work on patient-authored queries in DermaVQA, this resource addresses limitations of conventional dermatoscopic datasets by incorporating real-world clinical context, expert annotation, and unified QA–segmentation linkage (Yim et al., 30 Dec 2025).
1. Dermatology Assessment Schema (DAS) – Structure and Clinical Grounding
Central to DermaVQA-DAS is the Dermatology Assessment Schema (DAS), co-developed with board-certified dermatologists to ensure comprehensive representation of clinically salient dermatological features. DAS comprises:
- 36 high-level assessment questions: These span key dermatological domains (e.g., “Anatomic Location of Problem,” “Primary Morphology,” “Color of Lesion”) and are structured for direct clinical relevance.
- 27 fine-grained subquestions: The nine most frequently occurring high-level categories branch into detailed subquestions, supporting nuanced assessment of specific attributes.
- Bilingual presentation: All questions and multiple-choice answer options are available in English and Chinese, fostering inclusivity and cross-lingual research applicability.
DAS enforces standardized description of lesion characteristics by regulating attributes such as anatomic location (including duplicate-slot design for multi-site documentation), size, border, distribution, and color. This structured schema directly mirrors the features routinely interrogated in clinical settings.
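For concreteness, a single bilingual DAS entry could be represented as the sketch below. The field names, identifier, and option values are illustrative placeholders, not the released schema format:

```python
# Hypothetical in-memory representation of one bilingual DAS question.
# All identifiers and option lists here are invented for illustration.
das_question = {
    "id": "CQID-LOC",                                  # hypothetical ID
    "question_en": "Anatomic Location of Problem",
    "question_zh": "问题的解剖位置",
    "options_en": ["head/neck", "trunk", "upper limb", "lower limb"],
    "options_zh": ["头/颈", "躯干", "上肢", "下肢"],
    "slots": 2,  # duplicate-slot design for multi-site documentation
    "subquestions": ["CQID-LOC-1", "CQID-LOC-2"],      # fine-grained attributes
}
```

A schema organized this way keeps the English and Chinese renderings of each question and answer option in lockstep, which is what makes cross-lingual evaluation against a single gold label straightforward.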
2. Dataset Construction and Annotation Protocols
DermaVQA-DAS instantiates its schema into two formally curated splits:
- Closed-ended QA split: Contains 456 instances (300 train, 56 validation, 100 test), each pairing a patient-generated image and free-text query with a single DAS assessment question. Three independent board-certified dermatologists annotated each instance, with the gold answer label resolved by majority vote.
- Segmentation split: Comprises 2,474 patient-generated images annotated with 7,448 expert binary lesion masks. Each image received three non-overlapping masks, contributed by a pool of four medical annotators and fused via per-pixel majority vote into robust ground truths. Each segmentation is indexed to DAS identifiers, providing unified linkage to the QA schema.
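The per-pixel majority vote used to fuse annotator masks into a consensus ground truth can be sketched as follows (a minimal illustration, not the released annotation tooling):

```python
import numpy as np

def majority_vote_mask(masks):
    """Fuse equal-shape binary annotator masks into a consensus mask:
    a pixel is foreground when more than half of the annotators marked it."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    votes = stack.sum(axis=0)          # per-pixel foreground vote count
    return votes > (len(masks) / 2)    # strict majority

# Three toy 2x2 annotator masks; pixel (0,1) is marked by 2 of 3 annotators.
a = np.array([[1, 1], [0, 0]])
b = np.array([[1, 0], [0, 1]])
c = np.array([[1, 1], [0, 0]])
consensus = majority_vote_mask([a, b, c])
# → [[True, True], [False, False]]
```

With an odd number of annotators the strict-majority threshold always produces a decisive label per pixel; with an even number, ties resolve to background.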
This dual annotation protocol ensures methodological rigor, consensus formation, and tight coupling between visual and textual features for downstream multimodal benchmarking.
3. Task Definitions and Model Input Structure
DermaVQA-DAS defines two primary tasks:
- Closed-ended Question Answering (QA): Models receive (1) the original patient query (including both title and context) and (2) a specific DAS question prompt. Outputs consist of a selection among predefined multiple-choice answers. For queries involving multiple images, responses are aggregated using explicit rule-based schemes (e.g., union for location pairs, majority rule for color).
- Segmentation: Models are presented with a single image and, optionally, an associated textual prompt. Outputs are binary lesion masks. Prompts can vary from default DAS questions to augmented patient query content.
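The rule-based aggregation for multi-image queries can be sketched as below; the function name and answer encodings are assumptions for illustration, matching the rules the paper names (union for location, majority for color):

```python
from collections import Counter

def aggregate_answers(per_image_answers, rule):
    """Combine per-image predictions for a multi-image query.
    rule='union': keep every selected option (e.g. anatomic location slots).
    rule='majority': keep the most frequent option (e.g. lesion color)."""
    if rule == "union":
        merged = set()
        for ans in per_image_answers:
            merged.update(ans if isinstance(ans, (set, list)) else [ans])
        return sorted(merged)
    if rule == "majority":
        return Counter(per_image_answers).most_common(1)[0][0]
    raise ValueError(f"unknown aggregation rule: {rule}")

aggregate_answers([["trunk"], ["trunk", "upper limb"]], "union")
# → ['trunk', 'upper limb']
aggregate_answers(["red", "brown", "red"], "majority")
# → 'red'
```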
Both modalities align with standardized evaluation protocols enabling rigorous cross-model comparison and assessment.
4. Benchmarking Methodologies and Evaluation Metrics
Segmentation evaluation utilizes three aggregation schemes:
- Mean-of-Max: For each image, the maximum overlap score against any single annotator mask, averaged over images.
- Mean-of-Mean: For each image, the mean overlap score across all annotator masks, averaged over images.
- Majority-Vote Microscore: Per-pixel agreement between the prediction and the annotator-consensus mask, pooled over all pixels of all images.
Key metrics include:
- Jaccard Index (IoU): $J(P, G) = \frac{|P \cap G|}{|P \cup G|}$, where $P$ is the predicted mask and $G$ is the ground-truth mask.
- Dice Score: $D(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$.
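As an illustrative sketch (not the released evaluation code), the two metrics and the three aggregation schemes above can be implemented for binary masks as follows, assuming the Jaccard index as the per-mask overlap score:

```python
import numpy as np

def jaccard(pred, gt):
    """J(P, G) = |P ∩ G| / |P ∪ G| for binary masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def dice(pred, gt):
    """D(P, G) = 2|P ∩ G| / (|P| + |G|) for binary masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    total = pred.sum() + gt.sum()
    return float(2 * np.logical_and(pred, gt).sum() / total) if total else 1.0

def mean_of_max(preds, annotator_masks, score=jaccard):
    """Per image: best score against any single annotator; mean over images."""
    return float(np.mean([max(score(p, m) for m in ms)
                          for p, ms in zip(preds, annotator_masks)]))

def mean_of_mean(preds, annotator_masks, score=jaccard):
    """Per image: mean score across all annotators; mean over images."""
    return float(np.mean([np.mean([score(p, m) for m in ms])
                          for p, ms in zip(preds, annotator_masks)]))

def majority_vote_microscore(preds, annotator_masks):
    """Per-pixel agreement with the annotator-consensus mask,
    micro-averaged (pooled) over every pixel of every image."""
    inter = union = 0
    for p, ms in zip(preds, annotator_masks):
        stack = np.stack([np.asarray(m, bool) for m in ms])
        consensus = stack.sum(axis=0) > (len(ms) / 2)  # per-pixel majority
        p = np.asarray(p, bool)
        inter += int(np.logical_and(p, consensus).sum())
        union += int(np.logical_or(p, consensus).sum())
    return inter / union if union else 1.0

p = np.array([[1, 1], [0, 0]])  # toy prediction
g = np.array([[1, 0], [0, 0]])  # toy ground truth
jaccard(p, g)  # → 0.5  (|P ∩ G| = 1, |P ∪ G| = 2)
dice(p, g)     # → 2/3  (2·1 / (2 + 1))
```

The micro variant pools pixels across images before computing the ratio, so large lesions weigh more than in the per-image (macro) averages.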
Closed-ended QA is measured by average model accuracy across all benchmarked questions.
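The QA metric reduces to plain accuracy over the benchmarked questions, e.g.:

```python
def qa_accuracy(preds, golds):
    """Fraction of closed-ended questions answered with the gold label."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

qa_accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"])  # → 0.75
```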
5. Experimental Results and Comparative Model Performance
Segmentation was tested with MedSAM and BiomedParse architectures. BiomedParse demonstrated superior overlap metrics in all aggregation protocols:
| Model | Prompt Type | Jaccard (IoU) | Dice Score | Aggregation Scheme |
|---|---|---|---|---|
| BiomedParse | Default (I) | 0.5088 | 0.6133 | Mean-of-Max |
| BiomedParse | Default (I) | 0.4253 | 0.5315 | Mean-of-Mean |
| BiomedParse | Augmented (III) | ≈0.395 | ≈0.566 | Majority-Vote Microscore |
| MedSAM | Any | ≈0.410 | ≈0.525 | Mean-of-Max |
Augmented prompts comprising both the patient query title and content maximized performance under consensus-based microscore evaluation. MedSAM's segmentation output, by contrast, remained invariant to prompt design, underscoring that the benefit of prompting depends on model architecture in clinical imaging tasks.
Closed-ended QA benchmarking among six state-of-the-art multimodal LLMs yielded strong results:
- o3 (2025-04-16): Best overall accuracy, 0.798
- GPT-4.1: 0.796
- Gemini-1.5-Pro: 0.783 (outperformed Gemini-2.0-Flash at 0.768)
Performance varied substantially by DAS question type; visually salient queries (e.g., CQID025) routinely exceeded 0.90 accuracy, while color and combination attributes (CQID034, CQID036) presented broader variance and constituted unresolved challenges.
6. Dataset Release, Use Cases, and Prospective Directions
The entirety of DermaVQA-DAS—including bilingual DAS schema, expert annotations, and evaluation code—is publicly accessible (https://osf.io/72rp3). The structured linkage of QA and segmentation supports diverse research directions such as clinical decision support systems and patient-facing teledermatology assistants.
Prospective extensions include:
- Expansion of DAS to encompass rare skin conditions and broader skin-tone categories.
- Integration of longitudinal clinical histories.
- Development of open-ended or interactive QA paradigms reflective of real-world clinical reasoning.
This suggests a path toward more versatile, context-aware dermatological AI systems and a comprehensive foundation for vision–language modeling throughout patient-centered care workflows (Yim et al., 30 Dec 2025).