
Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models (2404.10237v3)

Published 16 Apr 2024 in cs.CV and cs.CL

Abstract: Recent advances in general-purpose and domain-specific multimodal LLMs have driven remarkable progress in medical decision-making. However, these models are designed for specific classification or generative tasks, and they require training or finetuning on large-scale datasets with sizeable parameter counts and tremendous compute, hindering their clinical utility across the diverse resource-constrained scenarios encountered in practice. In this paper, we propose Med-MoE (Mixture-of-Experts), a novel and lightweight framework that tackles both discriminative and generative multimodal medical tasks. Med-MoE is learned in three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we enable the model for different multimodal medical tasks through instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by a meta expert. Comprehensive experiments on both open- and closed-ended medical visual question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE, and Path-VQA demonstrate that our model achieves performance superior to or on par with state-of-the-art baselines while requiring only approximately 30%-50% of the activated model parameters. Extensive analyses and ablations corroborate the effectiveness and practical utility of our method.
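The core mechanism the abstract describes is sparse expert routing: a trainable router scores each token, only the top-scoring domain-specific experts run on it, and a shared meta expert is always active. Below is a minimal PyTorch sketch of such a layer. The class name `DomainMoELayer`, the feed-forward expert shape, and the top-2-of-4 routing are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DomainMoELayer(nn.Module):
    """Sketch of a domain-specific MoE block: a trainable router picks the
    top-k experts per token, and an always-active meta expert adds a shared term.
    Hyperparameters here are illustrative, not the paper's settings."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # trainable router
        make_ffn = lambda: nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.experts = nn.ModuleList(make_ffn() for _ in range(num_experts))
        self.meta_expert = make_ffn()  # always active, shared across domains

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        logits = self.router(x)                          # (B, S, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the top-k
        out = self.meta_expert(x)                        # shared meta-expert contribution
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] = out[mask] + (
                        weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
                    )
        return out


# Usage: route a batch of 2 sequences of 16 token embeddings of width 64.
layer = DomainMoELayer(dim=64)
y = layer(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 64])
```

Because only `top_k` of the `num_experts` expert networks (plus the meta expert) execute per token, the fraction of parameters activated at inference shrinks as experts are added; this sparse activation is the mechanism behind the roughly 30%-50% activated-parameter budget reported in the abstract.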
