LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation (2305.11490v5)

Published 19 May 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Following the impressive development of LLMs, vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual IO. This direction of research is particularly relevant to medical imaging because medical image analysis and generation consist of reasoning based on a combination of visual features and prior knowledge. Many recent works have focused on training adapter networks that serve as an information bridge between image processing networks and LLMs; but presumably, in order to achieve maximum reasoning potential of LLMs on visual information as well, visual and language features should be allowed to interact more freely. This is especially important in the medical domain because understanding and generating medical images such as chest X-rays (CXR) require not only accurate visual and language-based reasoning but also a more intimate mapping between the two modalities. Thus, taking inspiration from previous work on the transformer and VQ-GAN combination for bidirectional image and text generation, we build upon this approach and develop a method for instruction-tuning an LLM pre-trained only on text to gain vision-language capabilities for medical images. Specifically, we leverage a pretrained LLM's existing question-answering and instruction-following abilities to teach it to understand visual inputs by instructing it to answer questions about image inputs and, symmetrically, output both text and image responses appropriate to a given query by tuning the LLM with diverse tasks that encompass image-based text-generation and text-based image-generation. We show that our model, LLM-CXR, trained in this approach shows better image-text alignment in both CXR understanding and generation tasks while being smaller in size compared to previously developed models that perform a narrower range of tasks. The code is at https://github.com/hyn2028/LLM-cxr.

Overview of LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation

The paper presents LLM-CXR, a novel approach for instruction-finetuning LLMs to enhance their vision-language capabilities, specifically tailored for the domain of medical imaging. The focus is on chest X-ray (CXR) images, which require sophisticated integration of visual data and textual information for effective analysis and generation. The authors aim to significantly improve image-text alignment, enabling LLMs to excel in both understanding and generating CXR images and their associated diagnostic reports.

Core Contributions

  1. Instruction-Finetuning for Multimodal Capabilities: The paper introduces a method for instruction-finetuning an LLM, initially pretrained exclusively on text, to perform vision-language tasks without modifying its core architecture. This approach enables the model to handle CXR images and generate appropriate text and image responses based on diverse and complex instructions.
  2. VQ-GAN for Image Tokenization: Using a vector-quantized generative adversarial network (VQ-GAN), the approach tokenizes images into discrete codebook indices. This lets the model embed image tokens directly in the LLM's text token space, enabling seamless multimodal interaction (see the tokenization sketch following this list).
  3. Comprehensive Training through Synthetic VQA: By generating synthetic visual question-answer (VQA) pairs from existing CXR text reports, the authors enrich the training set, strengthening the model's vision-language alignment (an illustrative generation sketch also follows this list).
  4. Two-Stage Fine-Tuning: The model undergoes an initial broad training phase using a large dataset to establish basic image-text relations, followed by a targeted fine-tuning phase with a curated dataset to refine its performance on specific diagnostic tasks.
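
As a concrete illustration of contributions 1 and 2, the sketch below shows how a frozen VQ-GAN could turn a CXR into discrete code indices, how those codes could be exposed to the LLM as extra vocabulary tokens, and how a single instruction-tuning example might be laid out in each direction. The token format, prompt wording, and the `encode_to_indices` call are illustrative assumptions, not the authors' published API.

```python
# Minimal sketch (assumed names) of treating VQ-GAN codes as LLM tokens
# and assembling bidirectional instruction-tuning examples.
import torch
from transformers import AutoTokenizer

CODEBOOK_SIZE = 1024                               # assumed VQ-GAN codebook size
IMAGE_TOKENS = [f"<img_{i}>" for i in range(CODEBOOK_SIZE)]

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.add_tokens(IMAGE_TOKENS)                 # image codes become ordinary tokens
# model.resize_token_embeddings(len(tokenizer))    # grow embeddings to match

def image_to_token_string(vqgan, image: torch.Tensor) -> str:
    """Encode a CXR with a pretrained, frozen VQ-GAN and render its
    discrete code indices as a flat string of image tokens."""
    with torch.no_grad():
        code_indices = vqgan.encode_to_indices(image)   # hypothetical encoder API
    return "".join(IMAGE_TOKENS[i] for i in code_indices.flatten().tolist())

def cxr_to_report_example(image_tokens: str, report: str) -> str:
    """Understanding direction: image tokens in the prompt, report as target."""
    return ("### Instruction:\nDescribe the findings in this chest X-ray.\n"
            f"### Input:\n{image_tokens}\n### Response:\n{report}")

def report_to_cxr_example(report: str, image_tokens: str) -> str:
    """Generation direction: report in the prompt, image tokens as target,
    so the same LLM learns both mappings."""
    return ("### Instruction:\nGenerate a chest X-ray matching this report.\n"
            f"### Input:\n{report}\n### Response:\n{image_tokens}")
```

At inference time, generated image tokens would be mapped back to codebook indices and passed through the VQ-GAN decoder to reconstruct a CXR.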
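
For contribution 3, one hedged way to picture the synthetic VQA construction is to prompt an off-the-shelf LLM with a report and parse question-answer pairs out of its completion; the prompt wording and the `generate_vqa_pairs` helper below are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch (assumed prompt and helper) of deriving VQA pairs
# from a chest X-ray report for instruction tuning.
VQA_PROMPT_TEMPLATE = (
    "You are given a radiology report for a chest X-ray.\n"
    "Report:\n{report}\n\n"
    "Write three question-answer pairs that could be answered by looking at\n"
    "the image alone, one per line, in the form 'Q: ... A: ...'."
)

def generate_vqa_pairs(report: str, query_llm) -> list[tuple[str, str]]:
    """query_llm is any callable mapping a prompt string to a completion
    string; parsing assumes the 'Q: ... A: ...' format requested above."""
    completion = query_llm(VQA_PROMPT_TEMPLATE.format(report=report))
    pairs = []
    for line in completion.splitlines():
        if line.startswith("Q:") and " A:" in line:
            question, answer = line[2:].split(" A:", 1)
            pairs.append((question.strip(), answer.strip()))
    return pairs
```

Each resulting pair would then be combined with the token string of the matching image to form an image-grounded VQA instruction example.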

Experimental Results

The effectiveness of LLM-CXR is evaluated through several tasks, including CXR-to-report generation, CXR-VQA, and report-to-CXR generation. Key findings demonstrate:

  • Superior or comparable performance to existing models in generating accurate radiology reports from CXR images, as evidenced by AUROC and F1 scores (a minimal metric-computation sketch follows this list).
  • Enhanced vision-language alignment in report-to-CXR generation, with the model consistently outperforming counterparts in image quality and clinical relevance, as measured by FID and class-specific diagnostic feature recognition.
  • Notable improvements over competing models on VQA tasks, demonstrating proficiency at answering complex questions grounded in the image.
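
The AUROC and F1 numbers above are typically computed over per-finding labels extracted from generated and reference reports by an external labeler, while FID for generated images is computed separately over image features and is not shown here. The label matrices and the `report_label_metrics` helper below are therefore assumed inputs and names, sketched with scikit-learn:

```python
# Minimal sketch: clinical-label metrics for generated CXR reports.
# y_true / y_score are assumed binary labels and per-class scores of shape
# (num_reports, num_findings), produced by an external report labeler.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def report_label_metrics(y_true: np.ndarray, y_score: np.ndarray,
                         threshold: float = 0.5) -> dict:
    """Macro-averaged F1 (thresholded predictions) and AUROC (scores)."""
    y_pred = (y_score >= threshold).astype(int)
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "macro_auroc": roc_auc_score(y_true, y_score, average="macro"),
    }

# Example with random placeholder data (100 reports, 14 findings):
rng = np.random.default_rng(0)
print(report_label_metrics(rng.integers(0, 2, size=(100, 14)),
                           rng.random(size=(100, 14))))
```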

Implications and Future Directions

LLM-CXR offers a promising framework for developing multimodal AI systems capable of sophisticated medical image reasoning. This research paves the way for:

  • Deployment of more integrated AI systems in healthcare settings, facilitating improved diagnostic accuracy and efficiency.
  • Further exploration into better aligning visual and textual modalities, particularly considering the subtleties inherent in medical imaging.
  • Future iterations built on larger training datasets or larger model architectures, potentially improving performance and real-time deployment viability.

Overall, LLM-CXR marks a significant stride in expanding the capabilities of LLMs into the domain of medical imagery, showcasing a methodology that yields both effective and efficient models with practical applications in healthcare diagnostics.

Authors (4)
  1. Suhyeon Lee
  2. Won Jun Kim
  3. Jinho Chang
  4. Jong Chul Ye