Biomedical Visual Instruction Tuning with Clinician Preference Alignment (2406.13173v3)

Published 19 Jun 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resultant datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both stages of generating and selecting instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations for preference-aligned data candidate generation. Then, during the selection phase, we train a separate selection model, which explicitly distills clinician and policy-guided model preferences into a rating function to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method demonstrates a significant improvement in open visual chat (18.5% relatively) and medical VQA (win rate up to 81.73%). Our instruction-following data and models are available at BioMed-VITAL.github.io.

Summary

The paper introduces BioMed-VITAL, a framework designed to align the tuning of biomedical multimodal foundation models with clinician preferences through curated instruction datasets.
BioMed-VITAL employs a three-stage process: data generation using clinician-selected demonstrations, preference distillation for high-quality data selection, and visual instruction tuning.
Experimental results show BioMed-VITAL models outperform baselines on biomedical benchmarks, demonstrating data efficiency, better clinician preference alignment, and improved performance in medical VQA and visual chat.

Overview of Biomedical Visual Instruction Tuning with Clinician Preference Alignment

The paper presents BioMed-VITAL, a framework for tuning biomedical multimodal foundation models in line with clinician preferences. It addresses the challenge of adapting general-purpose multimodal models, such as LLaVA, to the specialized biomedical domain by generating and curating large-scale, domain-specific instruction datasets. BioMed-VITAL comprises three stages: data generation with clinician-selected demonstrations, data selection with a preference-distilled model, and visual instruction tuning.

Key Components of BioMed-VITAL

Data Generation with Diverse Expert-Selected Demonstrations:
- The framework begins by generating instruction-following data using GPT-4V, guided by diverse clinician-selected demonstrations. These demonstrations are strategically sampled to ensure diversity and relevance. Clinicians annotate this data, creating a robust set of examples that reflect clinical preferences.
Distilling Mixed Clinician Preference for Data Selection:
- Because automatically generated datasets carry limitations and biases, BioMed-VITAL adds a data selection phase. Clinician preferences are distilled into a selection model, which then ranks and filters the GPT-4V-generated data. The selection model is trained on both human-annotated preferences and model-based judgments from GPT-4V, which rates candidate data against criteria curated by clinicians.
Instruction-Tuning:
- Finally, a visual instruction tuning phase adapts the general multimodal model (LLaVA) using the selected high-quality, clinician-preferred dataset. This ensures that the tuned model can effectively handle specific biomedical tasks, such as open-ended visual chat and medical VQA, with improved performance across various assessment metrics.
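The selection stage described above can be read as a pairwise ranking problem. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each candidate is represented by a hypothetical numeric feature vector, that preferences arrive as (preferred, rejected) pairs from either clinicians or GPT-4V, and that a linear rating function is fit with a weighted pairwise logistic loss. The names `train_rater`, `human_weight`, and `model_weight`, and the default weights, are illustrative choices rather than values from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]  # (preferred candidate, rejected candidate)


def score(weights: Sequence[float], features: Vector) -> float:
    """Linear rating function: a higher score means more preferred."""
    return sum(w * x for w, x in zip(weights, features))


def train_rater(
    human_pairs: List[Pair],
    model_pairs: List[Pair],
    dim: int,
    human_weight: float = 1.0,  # assumption: clinician pairs weigh more
    model_weight: float = 0.5,  # assumption: model-judged pairs weigh less
    lr: float = 0.1,
    epochs: int = 200,
) -> List[float]:
    """Fit weights by gradient descent on a weighted pairwise logistic loss."""
    weights = [0.0] * dim
    data = [(p, human_weight) for p in human_pairs]
    data += [(p, model_weight) for p in model_pairs]
    for _ in range(epochs):
        grad = [0.0] * dim
        for (good, bad), pair_weight in data:
            margin = score(weights, good) - score(weights, bad)
            # Derivative of -log(sigmoid(margin)) w.r.t. margin is
            # -1 / (1 + exp(margin)); the clamp avoids float overflow.
            coeff = -pair_weight / (1.0 + math.exp(min(margin, 50.0)))
            for i in range(dim):
                grad[i] += coeff * (good[i] - bad[i])
        for i in range(dim):
            weights[i] -= lr * grad[i] / max(len(data), 1)
    return weights


# Toy usage with made-up two-dimensional features.
clinician_pairs = [([1.0, 0.0], [0.0, 1.0])]
gpt4v_pairs = [([0.8, 0.1], [0.2, 0.9])]
rater = train_rater(clinician_pairs, gpt4v_pairs, dim=2)
print(score(rater, [1.0, 0.0]), score(rater, [0.0, 1.0]))
```

Weighting the two preference sources separately is one simple way to let scarce clinician annotations dominate while still using the larger pool of model judgments; the summary does not say how the two sources are actually combined.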

Experimental Validation and Outcomes

BioMed-VITAL was empirically validated on several standard benchmarks in the biomedical domain, namely VQA-RAD, SLAKE, and PathVQA. The results demonstrated that the models trained with BioMed-VITAL consistently outperformed baseline models not utilizing clinician preference alignment. Key outcomes from the experiments include:

Data Efficiency: Models trained on only the top 10% of data ranked by the BioMed-VITAL selection model outperformed baseline models trained on larger, unfiltered datasets, indicating more efficient use of data.
Human Preference Alignment: The selection model displayed improved alignment with clinician preferences compared to judgments made by GPT-4V, underscoring the importance of clinician input in the curation process.
Open Visual Chat and Medical VQA Performance: BioMed-VITAL models improved on both task types, with a relative gain of up to 18.5% in open visual chat and win rates of up to 81.73% on medical VQA.
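To make these quantities concrete, the sketch below shows how a top-fraction selection, a relative improvement, and a win rate could be computed. The function names are hypothetical and no scores from the paper are reproduced; only the 10% default fraction mirrors the text.

```python
import math
from typing import List, Sequence, Tuple


def select_top_fraction(
    rated: Sequence[Tuple[str, float]], fraction: float = 0.10
) -> List[str]:
    """Keep the highest-rated fraction of (sample_id, rating) pairs."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not rated:
        return []
    keep = max(1, math.ceil(len(rated) * fraction))
    ranked = sorted(rated, key=lambda item: item[1], reverse=True)
    return [sample_id for sample_id, _ in ranked[:keep]]


def relative_improvement(baseline: float, tuned: float) -> float:
    """Gain of the tuned score over the baseline, as a percentage."""
    return (tuned - baseline) / baseline * 100.0


def win_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of pairwise comparisons won by the tuned model."""
    return sum(outcomes) / len(outcomes) * 100.0
```

A relative improvement is measured against the baseline score, so it is not the same as an absolute difference in points, while a win rate counts head-to-head comparisons rather than averaging scores.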

Implications and Future Directions

The development of a structured, clinician-aligned framework for dataset generation and selection in the biomedical domain has several implications:

Enhanced Model Adaptation: By incorporating expert preferences, BioMed-VITAL enhances the alignment of general multimodal models with domain-specific applications, paving the way for improved performance in medical diagnosis and consultation tasks.
Data-Centric Approaches in AI: This research highlights the potential of data-centric approaches over purely model-centric paradigms in achieving greater model efficacy, particularly in specialized domains.
Scalability and Extensibility: While the current framework demonstrates considerable success, further work could expand BioMed-VITAL to encompass a wider range of medical modalities and integrate more sophisticated model evaluation techniques to refine the selection model’s accuracy.

In conclusion, the BioMed-VITAL framework represents an important step forward in the tuning of multimodal models for specialized domains. It combines clinician expertise with advanced AI capabilities, thus addressing the critical need for domain-specific model adaptation in the healthcare industry. Future efforts in this direction can lead to enhanced utility of AI in clinical settings and further bridge the gap between general AI capabilities and specialized domain applications.
