A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models (2402.11217v2)

Published 17 Feb 2024 in cs.CL and cs.CV

Abstract: The significant breakthroughs of Medical Multi-Modal LLMs (Med-MLLMs) are renovating modern healthcare with robust information synthesis and medical decision support. However, these models are often evaluated on benchmarks that are unsuitable for Med-MLLMs because of the complexity of real-world diagnostics across diverse specialties. To address this gap, we introduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses Med-MLLMs along two axes: distinct medical specialties (cardiovascular, gastroenterology, etc.) and different diagnostic capacities (perception, disease analysis, etc.). Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties, stratifying clinical tasks into 3 main categories and 8 sub-categories, and avoiding overlap with existing VQA datasets. We further provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists, providing insights into their competencies and limitations in various medical contexts. Our work not only advances the understanding of Med-MLLMs' capabilities but also sets a precedent for future evaluations and the safe deployment of these models in clinical environments.
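
The abstract describes a two-axis evaluation: medical specialty crossed with diagnostic capacity (task category). As a rough illustration of how such a spectrum benchmark could be organized and scored, the sketch below uses a hypothetical item schema and an exact-match accuracy aggregator; the field names, taxonomy labels, and example items are assumptions for illustration, not the actual Asclepius data format.

```python
# Minimal sketch of a spectrum-style Med-MLLM benchmark layout and scoring.
# Field names and example items are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class BenchmarkItem:
    specialty: str       # e.g. "cardiovascular", "gastroenterology" (one of 15)
    task_category: str   # one of the 3 main categories, e.g. "perception"
    sub_category: str    # one of the 8 sub-categories
    question: str
    image_path: str
    answer: str          # gold answer

def accuracy_by_axis(items, predictions, axis="specialty"):
    """Aggregate exact-match accuracy along one evaluation axis
    (specialty or task category), so models and human specialists
    can be compared per slice of the spectrum."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        key = getattr(item, axis)
        total[key] += 1
        correct[key] += int(pred.strip().lower() == item.answer.strip().lower())
    return {k: correct[k] / total[k] for k in total}

# Usage: score a model's answers per specialty (toy data).
items = [
    BenchmarkItem("cardiovascular", "perception", "modality recognition",
                  "Which imaging modality is shown?", "img/cv_001.png", "echocardiogram"),
    BenchmarkItem("gastroenterology", "disease analysis", "diagnosis",
                  "What is the most likely diagnosis?", "img/gi_014.png", "ulcerative colitis"),
]
model_preds = ["echocardiogram", "crohn disease"]
print(accuracy_by_axis(items, model_preds, axis="specialty"))
# {'cardiovascular': 1.0, 'gastroenterology': 0.0}
```

Reporting results per specialty and per task category, rather than as a single aggregate score, is what lets such a benchmark expose where a model's competence differs from that of human specialists.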

Authors (11)
  1. Wenxuan Wang
  2. Yihang Su
  3. Jingyuan Huan
  4. Jie Liu
  5. Wenting Chen
  6. Yudi Zhang
  7. Cheng-Yi Li
  8. Kao-Jung Chang
  9. Xiaohan Xin
  10. Linlin Shen
  11. Michael R. Lyu
Citations (7)