MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks (2505.06152v1)

Published 9 May 2025 in cs.CV and cs.AI

Abstract: Medical vision-language models (VLMs) have shown promise as clinical assistants across various medical fields. However, specialized dermatology VLM capable of delivering professional and detailed diagnostic analysis remains underdeveloped, primarily due to less specialized text descriptions in current dermatology multimodal datasets. To address this issue, we propose MM-Skin, the first large-scale multimodal dermatology dataset that encompasses 3 imaging modalities, including clinical, dermoscopic, and pathological and nearly 10k high-quality image-text pairs collected from professional textbooks. In addition, we generate over 27k diverse, instruction-following vision question answering (VQA) samples (9 times the size of current largest dermatology VQA dataset). Leveraging public datasets and MM-Skin, we developed SkinVL, a dermatology-specific VLM designed for precise and nuanced skin disease interpretation. Comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT) and zero-shot classification tasks across 8 datasets, reveal its exceptional performance for skin diseases in comparison to both general and medical VLM models. The introduction of MM-Skin and SkinVL offers a meaningful contribution to advancing the development of clinical dermatology VLM assistants. MM-Skin is available at https://github.com/ZwQ803/MM-Skin

Enhancing Dermatology Vision-Language Models with the MM-Skin Dataset

AI-assisted diagnosis in dermatology has been explored only gradually, and the specialized use of vision-language models (VLMs) remains underdeveloped. The paper "MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks" addresses this gap by introducing MM-Skin, a multimodal dermatology dataset, and SkinVL, a domain-specific VLM built for skin disease interpretation.

Dataset and Methodology

MM-Skin is presented as the first large-scale multimodal dermatology dataset. It contains nearly 10,000 high-quality image-text pairs collected from professional dermatology textbooks and covers three imaging modalities: clinical, dermoscopic, and pathological. To extend the dataset's utility, the authors used an LLM to reformat the image-text pairs into over 27,000 instruction-following visual question answering (VQA) samples. That is nine times the size of the current largest dermatology VQA dataset, which gives dermatology-specific models a substantial training foundation.

MM-Skin's value comes not only from its scale but also from its detailed, professionally written descriptions, which are more granular than those in existing datasets that often lack textual richness or multimodal imaging diversity.
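
The summary does not describe the exact prompts or schema used to turn image-text pairs into VQA samples, so the following is only a minimal sketch of how such a reformatting step could be organized. The record fields, the function name `build_vqa_prompt`, the prompt wording, and the default of three questions per pair (chosen because 27,000 samples over roughly 10,000 pairs is a little under three per pair) are all assumptions for illustration, not the authors' pipeline. The sketch only builds the instruction text; calling an actual LLM is left out.

```python
from dataclasses import dataclass
from typing import Literal

# The three imaging modalities named in the paper summary.
Modality = Literal["clinical", "dermoscopic", "pathological"]


@dataclass(frozen=True)
class ImageTextPair:
    """One textbook-derived record. Field names are hypothetical."""
    image_path: str
    modality: Modality
    caption: str


def build_vqa_prompt(pair: ImageTextPair, num_questions: int = 3) -> str:
    """Build an instruction asking an LLM to rewrite a caption as QA pairs.

    This is an illustrative prompt, not the one used for MM-Skin.
    """
    if num_questions < 1:
        raise ValueError("num_questions must be at least 1")
    return (
        f"You are given the textbook description of a {pair.modality} skin image.\n"
        f"Description: {pair.caption.strip()}\n"
        f"Write {num_questions} question-answer pairs that can be answered "
        "from the description alone. Do not add facts that are not in it."
    )


if __name__ == "__main__":
    example = ImageTextPair(
        image_path="img_0001.jpg",  # hypothetical file name
        modality="dermoscopic",
        caption="Asymmetric lesion with irregular pigment network.",
    )
    print(build_vqa_prompt(example))
```

Keeping the modality in the record lets the same prompt template serve clinical, dermoscopic, and pathological images, and restricting answers to the caption is one plausible way to limit invented details in the generated samples.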

Development of SkinVL

By combining public datasets with MM-Skin, the authors developed SkinVL, a dermatology-specific VLM. The model is designed for precise and nuanced interpretation of skin diseases, and its specialized training data distinguishes it from general medical VLMs.

The model was evaluated on three kinds of task: VQA, supervised fine-tuning (SFT), and zero-shot classification, spanning eight datasets. In these evaluations SkinVL outperforms both general and medical VLM baselines. In particular, its BLEU-4, METEOR, and ROUGE-L scores improved substantially, which points to better understanding and generalization in dermatological contexts.
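
To make one of the metrics above concrete, here is a small sketch of ROUGE-L, which scores a generated answer by the longest common subsequence (LCS) of tokens it shares with a reference answer. This is a simplified variant that assumes whitespace tokenization, lowercasing, and a balanced F1 combination of precision and recall; the paper's actual evaluation tooling and settings are not stated in this summary, and the example sentences are invented.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (simplified)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the lesion shows an irregular pigment network"
    candidate = "the lesion shows a regular pigment network"
    # Five of seven tokens align in order, so the score is 5/7.
    print(round(rouge_l_f1(reference, candidate), 3))
```

The example also shows a known limitation of overlap metrics: "irregular" versus "regular" changes the clinical meaning, yet the score stays fairly high, which is one reason such metrics are usually reported alongside classification results.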

Implications and Future Directions

The introduction of MM-Skin and the development of SkinVL mark a meaningful step forward in creating clinical assistant VLMs in dermatology. By offering a publicly available, large-scale, multimodal dataset coupled with a specialized VLM, this research contributes valuable resources that can underpin future data-driven advancements in dermatological AI.

The implications of this work extend beyond dermatology: it offers a potential framework for developing specialized VLMs in other medical domains where image-text datasets are often fragmented or inaccessible. Future work may involve expanding the dataset with real-world data from electronic health records or integrating additional modalities for broader applicability.

The paper's methodology might also inspire adaptive AI systems for facial analysis or digital skin mapping that rely on models trained on diverse, rich datasets. Additionally, future research may focus on improving the model's performance for underrepresented demographics and under-resourced regions, so that AI-driven diagnostics contribute to global health equity.

Authors (5)

Wenqi Zeng (4 papers)
Yuqi Sun (16 papers)
Chenxi Ma (11 papers)
Weimin Tan (27 papers)
Bo Yan (98 papers)
