PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging (2401.02797v2)
Abstract: Multimodal large language models (MLLMs) represent an evolutionary expansion of the capabilities of traditional LLMs, enabling them to tackle challenges beyond the scope of purely text-based applications. They leverage the knowledge previously encoded within these LLMs, thereby extending their applicability and functionality to multimodal contexts. Recent work has investigated adapting MLLMs as a universal solution that addresses medical multimodal problems as generative tasks. In this paper, we propose a parameter-efficient framework for fine-tuning MLLMs, validated on medical visual question answering (Med-VQA) and medical report generation (MRG) tasks using public benchmark datasets. We also introduce an evaluation metric based on a 5-point Likert scale and its weighted average value to measure the quality of generated reports for the MRG task, where the scale ratings are assigned both manually by humans and by the GPT-4 model. We further assess the consistency of performance metrics across traditional measures, GPT-4 ratings, and human ratings for both the VQA and MRG tasks. The results indicate that semantic similarity assessments using GPT-4 align closely with human annotators and provide greater stability, yet they diverge from conventional lexical similarity measurements. This calls into question the reliability of lexical similarity metrics for evaluating the performance of generative models on Med-VQA and report generation tasks. Moreover, our fine-tuned model significantly outperforms GPT-4V, indicating that without additional fine-tuning, general-purpose multimodal models such as GPT-4V do not perform effectively on medical imaging tasks. The code will be available here: https://github.com/jinlHe/PeFoMed.
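The parameter-efficient fine-tuning described in the abstract is in the low-rank-adapter (LoRA) family. The snippet below is a minimal, hypothetical sketch of wrapping a causal language model backbone with LoRA adapters via the Hugging Face `peft` library; the backbone name, rank, and target modules are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch: attach LoRA adapters so that only a small fraction of
# parameters is trainable during fine-tuning, while the backbone stays frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed backbone

lora_cfg = LoraConfig(
    r=8,                                  # illustrative low-rank dimension
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # reports the small trainable-parameter share
```

The abstract also describes aggregating 5-point Likert ratings of generated reports into a weighted average. Below is a minimal sketch of that aggregation, assuming each scale point is weighted by how often it was assigned; the example scores and the helper name are made up for illustration.

```python
from collections import Counter

def weighted_likert_score(ratings):
    """Frequency-weighted average of 5-point Likert ratings (1 = worst, 5 = best),
    one rating per generated report; ratings may come from humans or GPT-4."""
    counts = Counter(ratings)
    total = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total

# Made-up example: the same reports scored by a human rater and by GPT-4.
human_scores = [5, 4, 4, 3, 5, 2]
gpt4_scores = [4, 4, 5, 3, 4, 2]
print(round(weighted_likert_score(human_scores), 2))  # 3.83
print(round(weighted_likert_score(gpt4_scores), 2))   # 3.67
```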
- Jinlong He
- Pengfei Li
- Gang Liu
- Shenjun Zhong
- Genrong He
- Zhaolin Chen