Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model (2405.01591v1)

Published 29 Apr 2024 in cs.CL, cs.AI, and eess.IV

Abstract: Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises questions about the feasibility of these models when encountering the inevitable variations and errors inherent in real-world medical data. In this paper, we introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain LLM to process multimodal data via image descriptions. MID-M achieves performance comparable or superior to task-specific fine-tuned LMMs and other general-domain models, without extensive domain-specific training or pre-training on multimodal data, and with significantly fewer parameters. This highlights the potential of leveraging general-domain LLMs for domain-specific tasks and offers a sustainable and cost-effective alternative to traditional LMM development. Moreover, the robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.
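
The pipeline the abstract describes can be sketched in a few lines: a captioning model converts the radiograph into a free-text description, and a general-domain, text-only LLM then performs the downstream task from a few-shot prompt built on such descriptions. The sketch below is a minimal illustration under assumed components (a BLIP captioner and an open instruction-tuned LLM via the Hugging Face pipeline API); the actual models, prompt format, and task setup used by MID-M may differ.

# Minimal sketch of an image-description bridge: a captioner turns the
# image into text, and a general-domain LLM answers from a few-shot
# prompt built on those descriptions. The specific models named here
# (BLIP captioner, Mistral instruct LLM) are illustrative assumptions,
# not the exact components used in MID-M.
from transformers import pipeline

# 1) Image -> textual description (stands in for a multimodal encoder).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# 2) General-domain LLM that consumes text only.
llm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def describe(image_path: str) -> str:
    """Return a free-text description of the image."""
    return captioner(image_path)[0]["generated_text"]

def answer(query_image: str, examples: list[tuple[str, str]]) -> str:
    """Build a few-shot prompt from (description, report) exemplars and the query image."""
    shots = "\n\n".join(
        f"Image description: {desc}\nReport: {report}" for desc, report in examples
    )
    prompt = (
        "You are assisting with radiology report drafting.\n\n"
        f"{shots}\n\n"
        f"Image description: {describe(query_image)}\nReport:"
    )
    out = llm(prompt, max_new_tokens=128, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()

Because the LLM only ever sees text, the design keeps the trainable or domain-specific footprint small and lets degraded image quality surface explicitly in the intermediate description, which is consistent with the robustness claim in the abstract.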

Authors (6)
  1. Seonhee Cho (5 papers)
  2. Choonghan Kim (4 papers)
  3. Jiho Lee (17 papers)
  4. Chetan Chilkunda (1 paper)
  5. Sujin Choi (1 paper)
  6. Joo Heung Yoon (4 papers)