Overview of LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation
The paper presents LLM-CXR, an approach for instruction-finetuning LLMs to acquire vision-language capabilities in the domain of medical imaging. The focus is on chest X-ray (CXR) images, whose analysis demands tight integration of visual and textual information. The authors aim to improve image-text alignment so that a single LLM can both understand and generate CXR images and their associated diagnostic reports.
Core Contributions
- Instruction-Finetuning for Multimodal Capabilities: The paper introduces a method for instruction-finetuning an LLM, initially pretrained exclusively on text, to perform vision-language tasks without modifying its core architecture. This approach enables the model to handle CXR images and generate appropriate text and image responses based on diverse and complex instructions.
- VQ-GAN for Image Tokenization: A Vector-Quantized Generative Adversarial Network (VQ-GAN) tokenizes each CXR image into a sequence of discrete codebook indices. These image tokens are added to the LLM's text vocabulary, so visual and textual content share one token space and the model can both read and emit images (a minimal sketch follows this list).
- Comprehensive Training through Synthetic VQA: The authors generate synthetic visual question-answering (VQA) pairs from existing CXR text reports, enlarging the training set and strengthening the model's vision-language alignment and its ability to answer image-grounded questions (see the illustrative sketch after this list).
- Two-Stage Fine-Tuning: The model undergoes an initial broad training phase using a large dataset to establish basic image-text relations, followed by a targeted fine-tuning phase with a curated dataset to refine its performance on specific diagnostic tasks.
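The following minimal sketch illustrates the vocabulary-extension idea behind the VQ-GAN tokenization step. It is not the authors' released code: the checkpoint path, the codebook size, the number of codes per image, and the `vqgan_encode` stub are placeholders and assumptions; the Hugging Face `transformers` API is used only to show how image tokens can be appended to a text-only LLM.

```python
# A minimal sketch (not the paper's released code) of adding VQ-GAN image tokens
# to a text-only LLM's vocabulary so it can read and emit images as token sequences.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CODEBOOK_SIZE = 1024   # assumed VQ-GAN codebook size
NUM_IMAGE_CODES = 256  # assumed number of codes per image (e.g., a 16x16 latent grid)

def vqgan_encode(pixel_tensor: torch.Tensor) -> torch.Tensor:
    """Stand-in for a pretrained VQ-GAN encoder: returns discrete codebook indices."""
    return torch.randint(0, CODEBOOK_SIZE, (NUM_IMAGE_CODES,))  # hypothetical output

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-checkpoint")     # placeholder
model = AutoModelForCausalLM.from_pretrained("path/to/llama-checkpoint")  # placeholder

# 1. Add one new token per VQ-GAN codebook entry, plus boundary markers, and grow
#    the embedding matrix so the new rows can be learned during finetuning.
tokenizer.add_tokens([f"<img_{i}>" for i in range(CODEBOOK_SIZE)] + ["<img_start>", "<img_end>"])
model.resize_token_embeddings(len(tokenizer))

def image_to_token_string(pixel_tensor: torch.Tensor) -> str:
    """Encode a CXR image as a string of image tokens the LLM can consume or produce."""
    codes = vqgan_encode(pixel_tensor)
    return "<img_start>" + "".join(f"<img_{int(c)}>" for c in codes) + "<img_end>"

# 2. Build an instruction-style example that interleaves text and image tokens.
prompt = (
    "Instruction: Describe the findings in the following chest X-ray.\n"
    f"Image: {image_to_token_string(torch.randn(1, 1, 256, 256))}\n"
    "Response:"
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```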
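Likewise, here is a rough illustration of how question-answer pairs might be derived from report text. The keyword list and yes/no templates below are assumptions for illustration only; the paper's actual VQA-generation procedure may differ.

```python
# Illustrative sketch of turning CXR report text into synthetic VQA training pairs.
# The finding keywords and the naive negation check are assumptions, not the paper's method.
import re

FINDING_KEYWORDS = ["cardiomegaly", "pleural effusion", "pneumothorax", "edema", "consolidation"]

def report_to_vqa_pairs(report: str) -> list[dict]:
    """Create yes/no question-answer pairs keyed on findings mentioned in the report."""
    text = report.lower()
    pairs = []
    for finding in FINDING_KEYWORDS:
        mentioned = finding in text
        negated = bool(re.search(rf"\bno\b[^.]*\b{finding}\b", text))  # crude negation check
        pairs.append({
            "question": f"Is there evidence of {finding} in this chest X-ray?",
            "answer": "yes" if (mentioned and not negated) else "no",
        })
    return pairs

example_report = "Mild cardiomegaly. No pleural effusion or pneumothorax. Lungs are clear."
for qa in report_to_vqa_pairs(example_report):
    print(qa["question"], "->", qa["answer"])
```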
Experimental Results
The effectiveness of LLM-CXR is evaluated through several tasks, including CXR-to-report generation, CXR-VQA, and report-to-CXR generation. Key findings demonstrate:
- Superior or comparable performance to existing models in generating clinically accurate radiology reports from CXR images, as measured by AUROC and F1 scores (a small metric-computation sketch follows this list).
- Stronger vision-language alignment in report-to-CXR generation: the generated images outperform those of comparable models in quality and clinical relevance, as measured by FID and class-wise recognition of diagnostic features.
- Notable gains over competing models on VQA tasks, demonstrating the model's ability to answer complex questions grounded in image content.
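As a point of reference, label-level scoring of the kind cited above can be computed as follows. This is an illustrative sketch only: the number of pathology classes and the source of the reference labels, probability scores, and thresholded predictions (e.g., an automatic report labeler or an image classifier) are assumptions, not details from the paper.

```python
# Illustrative computation of macro AUROC and F1 over multi-label pathology predictions.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(0)
n_samples, n_classes = 200, 5                              # assumed: 5 pathology classes
y_true = rng.integers(0, 2, size=(n_samples, n_classes))   # reference labels (placeholder)
y_score = rng.random(size=(n_samples, n_classes))          # model probabilities (placeholder)
y_pred = (y_score >= 0.5).astype(int)                      # thresholded predictions

print("macro AUROC:", roc_auc_score(y_true, y_score, average="macro"))
print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))
```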
Implications and Future Directions
LLM-CXR offers a promising framework for developing multimodal AI systems capable of sophisticated medical image reasoning. This research paves the way for:
- Deployment of more integrated AI systems in healthcare settings, facilitating improved diagnostic accuracy and efficiency.
- Further exploration into better aligning visual and textual modalities, particularly considering the subtleties inherent in medical imaging.
- Future iterations trained on larger datasets or built on larger model architectures, potentially improving both performance and viability for real-time clinical use.
Overall, LLM-CXR marks a significant step in extending LLM capabilities to medical imaging, demonstrating a methodology that yields effective and efficient models with practical applications in healthcare diagnostics.