Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models (2404.10237v3)
Abstract: Recent advances in general-purpose and domain-specific multimodal LLMs have brought remarkable progress to medical decision-making. However, these models are typically designed for specific classification or generative tasks, and they require training or fine-tuning on large-scale datasets with sizeable parameter counts and substantial compute, which hinders their clinical utility across the diverse resource-constrained scenarios encountered in practice. In this paper, we propose Med-MoE (Mixture-of-Experts), a novel and lightweight framework that tackles both discriminative and generative multimodal medical tasks. Med-MoE is trained in three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we instruction-tune the model for different multimodal medical tasks, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further strengthened by a meta expert. Comprehensive experiments on both open- and closed-ended medical visual question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE, and PathVQA demonstrate that our model achieves performance superior to or on par with state-of-the-art baselines while activating only approximately 30%–50% of the model parameters. Extensive analyses and ablations corroborate the effectiveness and practical utility of our method.
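To make the routing step concrete, below is a minimal PyTorch sketch of the kind of layer the abstract describes: a trainable router sparsely activates a few domain-specific experts per token, while a shared meta expert is always active. The class name `MedMoELayer`, the FFN expert design, and the specific top-k gating are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: a top-k sparsely-gated mixture of domain-specific
# experts plus an always-active meta expert, as described in the abstract.
# Layer name, expert design, and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MedMoELayer(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k

        def ffn():
            return nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )

        # Domain-specific experts (e.g. radiology, pathology, ...).
        self.experts = nn.ModuleList(ffn() for _ in range(num_experts))
        # Meta expert: shared across all inputs, always active.
        self.meta_expert = ffn()
        # Trainable router: one score per expert for each token.
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        logits = self.router(x)                              # (B, S, E)
        topk_w, topk_idx = logits.topk(self.top_k, dim=-1)   # top-k experts per token
        topk_w = F.softmax(topk_w, dim=-1)
        # Dense gate matrix: zero weight for experts outside each token's top-k.
        gates = torch.zeros_like(logits).scatter(-1, topk_idx, topk_w)
        # Computed densely here for clarity; efficient MoE implementations
        # dispatch only the tokens routed to each selected expert.
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, S, E, D)
        moe_out = (gates.unsqueeze(-1) * expert_out).sum(dim=-2)        # (B, S, D)
        return moe_out + self.meta_expert(x)
```

With `top_k = 2` of 4 experts active per token, only a fraction of the expert parameters participate in any given forward pass; this is the kind of sparse activation behind the abstract's 30%–50% figure.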
- VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
- Vision–language model for visual question answering in medical imagery. Bioengineering.
- MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478.
- Visual question answering: A survey on techniques and common trends in recent literature. arXiv preprint arXiv:2305.11033.
- Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314.
- PubMedCLIP: How much does CLIP benefit visual question answering in the medical domain? In Findings of the Association for Computational Linguistics: EACL 2023, pages 1151–1163.
- Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39.
- SPHINX-X: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935.
- Towards transparent AI systems: Interpreting visual question answering models. arXiv preprint arXiv:1608.08974.
- PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286.
- Adaptive mixtures of local experts. Neural Computation, 3(1):79–87.
- Mixtral of experts. arXiv preprint arXiv:2401.04088.
- A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 5(1):1–10.
- GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
- LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36.
- Self-supervised vision-language pretraining for medical visual question answering. arXiv preprint arXiv:2211.13594.
- MoE-LLaVA: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947.
- SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE.
- Visual instruction tuning. Advances in Neural Information Processing Systems, 36.
- Q2ATransformer: Improving medical VQA via an answer querying decoder. arXiv preprint arXiv:2304.01611.
- Research on visual question answering based on dynamic memory network model of multiple attention mechanisms. Scientific Reports, 12(1):16758.
- Med-Flamingo: A multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR.
- OpenAI. 2022. ChatGPT. https://openai.com/blog/chatgpt/.
- OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
- Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617.
- Open-ended medical visual question answering through prefix tuning of language models. arXiv preprint arXiv:2303.05977.
- CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
- ChatDoctor: A medical chat model fine-tuned on LLaMA using medical domain knowledge. arXiv preprint arXiv:2303.14070.
- Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915.