GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
The paper presents GMAI-MMBench, an extensive benchmark designed to evaluate Large Vision-Language Models (LVLMs) in the medical domain. It addresses critical limitations in existing benchmarks by offering a comprehensive, well-categorized, multi-perceptual evaluation suite tailored to real-world clinical scenarios. The benchmark is distinguished by its diverse data sources, fine-grained categorization, and varying levels of perceptual detail.
Core Contributions
The paper outlines several key contributions:
- Comprehensive Database: GMAI-MMBench is constructed from 285 high-quality datasets, spanning 39 medical imaging modalities. This diversity ensures broad coverage of medical knowledge from various sources across the globe, thereby minimizing data leakage risk and emphasizing clinical relevance.
- Well-Categorized Lexical Tree: A novel categorization system organizes the dataset into 18 clinical Visual Question Answering (VQA) tasks across 18 departments and 4 levels of perceptual granularity. This enables tailored evaluations for specific clinical demands, enhancing the benchmark's usability and specificity.
- Multi-Perceptual Granularity: The benchmark assesses LVLMs' abilities at different granularity levels (image, box, mask, and contour), reflecting the varied perceptual requirements of clinical practice; a sketch of how a single benchmark item might encode these labels follows this list.
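To make the categorization concrete, the sketch below shows one plausible way a single benchmark item could be represented, carrying its lexical-tree labels (clinical task, department, modality) together with a perceptual-granularity tag. The field names and example values are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative schema only; field names and values are assumptions,
# not the benchmark's actual data format.
@dataclass
class BenchmarkItem:
    image_path: str                 # medical image shown to the model
    question: str                   # multiple-choice question text
    options: dict[str, str]         # e.g. {"A": "...", "B": "...", ...}
    answer: str                     # gold option letter, e.g. "A"
    clinical_task: str              # one of the 18 VQA tasks, e.g. "Disease Diagnosis"
    department: str                 # one of the 18 departments, e.g. "Dermatology"
    modality: str                   # imaging modality, e.g. "Dermoscopy"
    granularity: str                # "image", "box", "mask", or "contour"
    region_hint: Optional[str] = None  # path to a box/mask/contour overlay, if any

item = BenchmarkItem(
    image_path="images/case_0001.png",
    question="What abnormality is visible in the highlighted region?",
    options={"A": "Melanoma", "B": "Basal cell carcinoma", "C": "Nevus", "D": "No abnormality"},
    answer="A",
    clinical_task="Disease Diagnosis",
    department="Dermatology",
    modality="Dermoscopy",
    granularity="box",
    region_hint="overlays/case_0001_box.png",
)
```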
Evaluation and Findings
The authors evaluated 50 LVLMs, including state-of-the-art proprietary models like GPT-4o and open-source models like MedDr. The results reveal significant room for improvement, even for the best-performing models: GPT-4o reaches only about 52% accuracy, underscoring the benchmark's rigor and current models' limitations in addressing medical tasks comprehensively.
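As a rough illustration of how such accuracy figures are obtained for a multiple-choice VQA benchmark, the sketch below scores predicted option letters against gold answers. The exact way an option letter is extracted from free-form model output varies by evaluation harness; a simple regex match is assumed here purely for illustration.

```python
import re
from typing import Optional

def extract_option(model_output: str) -> Optional[str]:
    """Pull the first standalone option letter (A-E) out of a model's free-form answer.
    Real harnesses use more elaborate matching; this is a simplified assumption."""
    match = re.search(r"\b([A-E])\b", model_output.strip())
    return match.group(1) if match else None

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions whose extracted option letter matches the gold letter."""
    assert len(predictions) == len(gold)
    correct = sum(extract_option(pred) == ans for pred, ans in zip(predictions, gold))
    return correct / len(gold)

# Refusals or unparseable answers simply count as wrong under this scheme.
print(accuracy(["The answer is B.", "A", "I cannot answer that."], ["B", "A", "C"]))  # ~0.67
```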
Several key findings are highlighted:
- Performance Disparities: Model performance varies noticeably across clinical tasks and departments. For instance, models perform relatively well on Disease Diagnosis (DD) but struggle with tasks requiring more complex reasoning, such as Severity Grading (SG); see the per-task breakdown sketch after this list.
- Instruction Tuning in Medical-Specific Models: Most medical-specific models underperform, but MedDr stands out, surpassing even some proprietary models. This suggests that a well-constructed medical instruction-tuning dataset can significantly enhance model performance.
- Challenges with Multi-Perceptual Granularity: Models consistently struggle with tasks requiring bounding box-level perception, indicating a need for improved robustness across different perceptual types.
- Failure Types: Common issues include question misunderstanding, perceptual errors, knowledge gaps, and refusal to answer due to safety protocols. Proprietary models often decline to answer potentially risky queries, adhering to strict safety guidelines.
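The per-task disparities noted above can be surfaced with a simple grouped breakdown of scored results. The record fields below follow the illustrative schema sketched earlier and are assumptions, not the benchmark's actual output format.

```python
from collections import defaultdict

def accuracy_by_task(records: list[dict]) -> dict[str, float]:
    """Group scored items by clinical task and report per-task accuracy.
    Each record is assumed to carry a 'clinical_task' label and a boolean 'correct' flag."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["clinical_task"]] += 1
        hits[r["clinical_task"]] += int(r["correct"])
    return {task: hits[task] / totals[task] for task in totals}

# Hypothetical scored results for two tasks.
records = [
    {"clinical_task": "Disease Diagnosis", "correct": True},
    {"clinical_task": "Disease Diagnosis", "correct": True},
    {"clinical_task": "Severity Grading", "correct": False},
    {"clinical_task": "Severity Grading", "correct": True},
]
print(accuracy_by_task(records))  # {'Disease Diagnosis': 1.0, 'Severity Grading': 0.5}
```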
Implications and Future Directions
The implications of this benchmark are profound. GMAI-MMBench provides a rigorous, diversified, and clinically relevant framework for evaluating LVLMs in medical applications. This benchmark can guide the refinement of existing models, highlight areas in need of development, and ultimately propel advancements in general medical AI (GMAI).
Future Research:
- Enhanced Instruction Tuning: Future models can benefit from improved medical-specific instruction tuning, as evidenced by MedDr's performance.
- Balancing Across Departments and Tasks: Efforts should aim to balance LVLM capabilities across all clinical departments and tasks to develop truly general-purpose medical AI.
- Perceptual Robustness: Enhancing models' ability to handle different perceptual granularities, particularly at the bounding box level, is critical for interactive and precise medical applications; see the box-overlay sketch after this list.
- Human Evaluation Integration: Integrating human evaluations will provide a more grounded benchmark, aligning model capabilities closer to expert performance.
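One common way to pose a box-level question is to draw the region of interest directly onto the image before it is shown to the model. The sketch below does this with Pillow; it is an illustrative assumption about how such prompts might be constructed, not the benchmark's actual pipeline, and the file paths are hypothetical.

```python
from PIL import Image, ImageDraw

def draw_box_prompt(image_path: str, box: tuple[int, int, int, int], out_path: str) -> None:
    """Overlay a rectangular region of interest on a medical image.
    `box` is (left, top, right, bottom) in pixel coordinates."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline=(255, 0, 0), width=3)  # red box marks the queried region
    img.save(out_path)

# Hypothetical usage: highlight a lesion before asking a box-level question about it.
draw_box_prompt("images/case_0001.png", (120, 80, 260, 210), "overlays/case_0001_box.png")
```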
Conclusion
GMAI-MMBench marks a significant advancement towards developing robust, reliable general medical AI. By addressing the gaps in current benchmarks, it offers a comprehensive evaluation tool that can drive the next generation of LVLMs, ensuring they meet the diverse and complex demands of real-world clinical practice. While current models show promise, extensive improvements are necessary before LVLMs can fully meet clinical needs. The benchmark's future integration of human evaluations will further aid in this endeavor, providing crucial insights for enhancing model performance in the medical domain.