
MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation (2312.02233v2)

Published 4 Dec 2023 in cs.CV

Abstract: Multimodal LLMs (MLLMs) have shown success in various general image processing tasks, yet their application in medical imaging is nascent and lacks tailored models. This study investigates the potential of MLLMs to improve the understanding and generation of Chest X-Rays (CXRs). We introduce MedXChat, a unified framework facilitating seamless interactions between medical assistants and users for diverse CXR tasks, including text report generation, visual question-answering (VQA), and Text-to-CXR generation. By taking natural language as input, our MLLM breaks task boundaries and simplifies medical professional training by allowing diverse tasks within a single environment. For CXR understanding, we leverage powerful off-the-shelf visual encoders (e.g., ViT) and LLMs (e.g., mPLUG-Owl) to convert medical imagery into language-like features, and subsequently fine-tune our large pre-trained models for medical applications using a visual adapter network and a delta-tuning approach. For CXR generation, we introduce an innovative synthesis approach that utilizes instruction-following capabilities within the Stable Diffusion (SD) architecture. This technique integrates smoothly with the existing model framework and requires no extra parameters, thereby maintaining SD's generative strength while also giving it the capacity to render fine-grained medical images with high fidelity. Through comprehensive experiments, our model demonstrates exceptional cross-task adaptability, displaying adeptness across all three defined tasks. Our MedXChat model and the instruction dataset utilized in this research will be made publicly available to encourage further exploration in the field.

Authors (5)
  1. Ling Yang (88 papers)
  2. Zhanyu Wang (22 papers)
  3. Luping Zhou (72 papers)
  4. Zhenghao Chen (30 papers)
  5. Xinyu Liang (11 papers)
Citations (5)