
RadVLM: A Multitask Conversational Vision-Language Model for Radiology (2502.03333v1)

Published 5 Feb 2025 in cs.CV and cs.AI

Abstract: The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks -- such as report generation, abnormality classification, and visual grounding -- and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.

An Evaluation of RadVLM: A Multitask Conversational Vision-Language Model for Radiology

The paper introduces RadVLM, a vision-language model designed for comprehensive interpretation of chest X-rays (CXRs), addressing both single-task performance and multi-turn conversational capabilities in radiology. Demand for CXR evaluations continues to grow while radiologists remain in short supply, making automated CXR diagnostics a pivotal domain: it promises to offload routine reads while supplementing medical expertise. RadVLM extends traditional approaches by integrating multiple radiological tasks with a conversational interface that supports interactive dialogue between clinicians and the model.

Dataset Creation and Model Architecture

The foundational dataset for RadVLM comprises over one million image-instruction pairs, derived from publicly available CXR sources and enriched to reflect diverse levels of diagnostic complexity. The dataset is systematically organized into: 1) free-text report generation, 2) abnormality classification, 3) visual grounding of anatomical regions and abnormalities, and 4) multi-turn conversational exchanges. This stratification ensures comprehensive coverage of clinically relevant scenarios, fostering a model capable of supporting detailed radiological assessments.
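To make this organization concrete, the sketch below shows one plausible shape for a single-turn and a multi-turn instruction record. The field names and example values are illustrative assumptions, not the paper's published schema.

```python
# Hypothetical instruction-pair records; field names and values are
# illustrative assumptions, not RadVLM's actual data schema.
single_turn_example = {
    "image": "cxr_00123.jpg",              # path to the chest X-ray
    "task": "abnormality_classification",  # one of the single-turn tasks
    "instruction": "List the abnormalities visible in this chest X-ray.",
    "answer": "Cardiomegaly; mild pulmonary edema.",
}

multi_turn_example = {
    "image": "cxr_00456.jpg",
    "task": "conversation",
    "turns": [
        {"role": "user", "content": "Is there evidence of pleural effusion?"},
        {"role": "assistant", "content": "Yes, a small right-sided pleural effusion."},
        {"role": "user", "content": "Where exactly is it located?"},
        {"role": "assistant", "content": "There is blunting of the right costophrenic angle."},
    ],
}
```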

RadVLM builds on the LLaVA-OneVision-7B architecture, which couples a vision encoder with a transformer-based, autoregressive LLM. This choice reflects recent advances in multimodal learning, where pre-trained models handle cross-domain nuances well once fine-tuned on highly specific datasets such as the one curated here.
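As an illustration of how a LLaVA-OneVision backbone can be loaded and prompted, the snippet below uses the publicly available Hugging Face base checkpoint. The checkpoint name, image path, and prompt are assumptions for illustration; this is the base model, not the fine-tuned RadVLM weights.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Base LLaVA-OneVision-7B checkpoint (assumption: not the RadVLM fine-tune).
model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chest_xray.jpg")  # placeholder path
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Generate a findings report for this chest X-ray."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```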

Evaluation Metrics and Baseline Comparisons

For report generation, RadVLM's performance is gauged using natural language generation (NLG) metrics such as BERTScore and ROUGE-L, alongside domain-specific measures like RadGraph F1 and GREEN. The latter metrics capture how well the model reproduces clinically significant entities and relationships in generated reports. Against well-established medical-specific models like RaDialog and CheXagent, RadVLM performs commendably, improving on lexical evaluations while maintaining strong clinical metrics.
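As a sketch of how the lexical metrics can be computed, the snippet below uses the common `bert-score` and `rouge-score` packages; the report strings are placeholders, and the domain-specific metrics (RadGraph F1, GREEN) require dedicated tooling not shown here.

```python
# pip install bert-score rouge-score
from bert_score import score as bertscore
from rouge_score import rouge_scorer

reference = "Mild cardiomegaly. No focal consolidation or pleural effusion."
candidate = "The heart is mildly enlarged. Lungs are clear without effusion."

# BERTScore: token-level similarity in contextual embedding space.
P, R, F1 = bertscore([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")

# ROUGE-L: longest-common-subsequence overlap with the reference report.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(f"ROUGE-L F1: {scorer.score(reference, candidate)['rougeL'].fmeasure:.3f}")
```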

In abnormality classification tasks, RadVLM outperforms its peers with a superior macro-averaged F1 score, indicating a refined capability to distinguish among various thoracic conditions. Visual grounding tasks, pivotal for spatially localizing pathologies, provide another setting in which the model excels: across anatomical grounding, abnormality grounding, and phrase grounding subtasks, it consistently surpasses models such as MAIRA-2 and CheXagent, as measured by mean average precision (mAP).
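A minimal sketch of how individual grounding predictions can be scored, assuming boxes in (x1, y1, x2, y2) format and a standard IoU threshold; this is the generic evaluation pattern underlying mAP-style metrics, not the paper's exact protocol.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction typically counts as a hit when IoU exceeds a threshold (e.g. 0.5).
pred = (120, 80, 260, 210)   # predicted box (placeholder values)
gold = (115, 90, 250, 220)   # ground-truth box (placeholder values)
print(f"IoU = {iou(pred, gold):.3f}, hit at 0.5: {iou(pred, gold) > 0.5}")
```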

Conversational Interactions and Insights

RadVLM's conversational ability is evaluated in contextually relevant, multi-turn interactions, leveraging LLMs to simulate clinician-model dialogues. A crucial differentiator is its competence in maintaining conversational context while providing accurate, interpretable answers without veering into overconfidence or error. These qualities, rated highly under a GPT-4o-based evaluation, underscore RadVLM's flexibility in accommodating both specific and exploratory clinical queries.
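The sketch below shows one common way a GPT-4o-based evaluation of this kind can be set up (an LLM-as-judge scoring an assistant's answer against a reference report). The prompt wording and the 1-5 scale are assumptions for illustration, not the paper's actual rubric.

```python
# pip install openai  -- requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical judging prompt; the paper's rubric may differ.
JUDGE_PROMPT = """You are grading a radiology assistant's answer.
Reference report: {reference}
Question: {question}
Assistant answer: {answer}
Rate factual accuracy from 1 (wrong) to 5 (fully correct). Reply with the number only."""

def judge(reference: str, question: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            reference=reference, question=question, answer=answer)}],
    )
    return response.choices[0].message.content

print(judge(
    reference="Small right pleural effusion. No pneumothorax.",
    question="Is there a pneumothorax?",
    answer="No pneumothorax is identified.",
))
```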

Implications and Future Directions

The research positions RadVLM as a tangible step forward in the development of radiology AI assistants, demonstrating how multimodal learning and interactive AI can be integrated into clinical practice. While currently tailored to CXRs, the methodologies demonstrated here could be extended to other imaging modalities given enriched datasets and tailored pre-training.

The paper further elucidates the merits of joint multitask training over isolated single-task training, an insight pivotal for refining multitask learning approaches. It advocates a "single-agent" model design: one model that remains versatile across interconnected medical tasks.

Future advancements could focus on augmenting RadVLM with additional clinical context, encompassing a broader spectrum of patient-specific data and historical imaging. This could transform it into an indispensable tool in radiology, streamlining diagnostic workflows, mitigating clinician burnout, and enhancing patient care, especially in underserved regions.

In summary, RadVLM substantively enriches the AI toolkit available for radiological interpretation, offering clinicians a model characterized by notable versatility and conversational intelligence. As AI technologies continue to evolve, such contributions become central to reimagining the intersections between medicine and machine learning.

Authors (15)
  1. Nicolas Deperrois
  2. Hidetoshi Matsuo
  3. Samuel Ruipérez-Campillo
  4. Moritz Vandenhirtz
  5. Sonia Laguna
  6. Alain Ryser
  7. Koji Fujimoto
  8. Mizuho Nishio
  9. Thomas M. Sutter
  10. Julia E. Vogt
  11. Jonas Kluckert
  12. Thomas Frauenfelder
  13. Christian Blüthgen
  14. Farhad Nooralahzadeh
  15. Michael Krauthammer