GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
The paper presents GMAI-MMBench, an extensive benchmark designed to evaluate Large Vision-Language Models (LVLMs) in the medical domain. It addresses critical limitations in existing benchmarks by offering a comprehensive, well-categorized, multi-perceptual evaluation suite tailored to real-world clinical scenarios. The benchmark is distinguished by its diverse data sources, fine-grained categorization, and varying levels of perceptual detail.
Core Contributions
The paper outlines several key contributions:
- Comprehensive Database: GMAI-MMBench is constructed from 285 high-quality datasets, spanning 39 medical imaging modalities. This diversity ensures broad coverage of medical knowledge from various sources across the globe, thereby minimizing data leakage risk and emphasizing clinical relevance.
- Well-Categorized Lexical Tree: A novel categorization system organizes the dataset into 18 clinical Visual Question Answering (VQA) tasks across 18 departments and 4 levels of perceptual granularity. This enables tailored evaluations for specific clinical demands, enhancing the benchmark's usability and specificity.
- Multi-Perceptual Granularity: The benchmark assesses LVLMs' abilities at different granularity levels (image, box, mask, and contour), reflecting the varied perceptual requirements of clinical practice; a sketch of how a single benchmark item might encode these labels follows this list.
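To make the categorization concrete, the sketch below shows one plausible way a single benchmark item could be represented, carrying its lexical-tree labels (clinical task, department, modality) together with a perceptual-granularity tag. The field names and example values are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative schema only; field names and values are assumptions,
# not the benchmark's actual data format.
@dataclass
class BenchmarkItem:
    image_path: str                 # medical image shown to the model
    question: str                   # multiple-choice question text
    options: dict[str, str]         # e.g. {"A": "...", "B": "...", ...}
    answer: str                     # gold option letter, e.g. "A"
    clinical_task: str              # one of the 18 VQA tasks, e.g. "Disease Diagnosis"
    department: str                 # one of the 18 departments, e.g. "Dermatology"
    modality: str                   # imaging modality, e.g. "Dermoscopy"
    granularity: str                # "image", "box", "mask", or "contour"
    region_hint: Optional[str] = None  # path to a box/mask/contour overlay, if any

item = BenchmarkItem(
    image_path="images/case_0001.png",
    question="What abnormality is visible in the highlighted region?",
    options={"A": "Melanoma", "B": "Basal cell carcinoma", "C": "Nevus", "D": "No abnormality"},
    answer="A",
    clinical_task="Disease Diagnosis",
    department="Dermatology",
    modality="Dermoscopy",
    granularity="box",
    region_hint="overlays/case_0001_box.png",
)
```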
Evaluation and Findings
The authors evaluated 50 LVLMs, including state-of-the-art proprietary models like GPT-4o and open-source models like MedDr. The results reveal significant room for improvement, even for the best-performing models: GPT-4o reaches only about 52% accuracy, underscoring the benchmark's rigor and current models' limitations in addressing medical tasks comprehensively.
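As a rough illustration of how such accuracy figures are obtained for a multiple-choice VQA benchmark, the sketch below scores predicted option letters against gold answers. The exact way an option letter is extracted from free-form model output varies by evaluation harness; a simple regex match is assumed here purely for illustration.

```python
import re
from typing import Optional

def extract_option(model_output: str) -> Optional[str]:
    """Pull the first standalone option letter (A-E) out of a model's free-form answer.
    Real harnesses use more elaborate matching; this is a simplified assumption."""
    match = re.search(r"\b([A-E])\b", model_output.strip())
    return match.group(1) if match else None

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions whose extracted option letter matches the gold letter."""
    assert len(predictions) == len(gold)
    correct = sum(extract_option(pred) == ans for pred, ans in zip(predictions, gold))
    return correct / len(gold)

# Refusals or unparseable answers simply count as wrong under this scheme.
print(accuracy(["The answer is B.", "A", "I cannot answer that."], ["B", "A", "C"]))  # ~0.67
```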
Several key findings are highlighted:
- Performance Disparities: Model performance varies noticeably across clinical tasks and departments. For instance, models perform relatively well on Disease Diagnosis (DD) but struggle with tasks requiring more complex reasoning, such as Severity Grading (SG); see the per-task breakdown sketch after this list.
- Instruction Tuning in Medical-Specific Models: Most medical-specific models underperform, but MedDr stands out, surpassing even some proprietary models. This suggests that a well-constructed medical instruction-tuning dataset can significantly enhance model performance.
- Challenges with Multi-Perceptual Granularity: Models consistently struggle with tasks requiring bounding box-level perception, indicating a need for improved robustness across different perceptual types.
- Failure Types: Common issues include question misunderstanding, perceptual errors, knowledge gaps, and refusal to answer due to safety protocols. Proprietary models often decline to answer potentially risky queries, adhering to strict safety guidelines.
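The per-task disparities noted above can be surfaced with a simple grouped breakdown of scored results. The record fields below follow the illustrative schema sketched earlier and are assumptions, not the benchmark's actual output format.

```python
from collections import defaultdict

def accuracy_by_task(records: list[dict]) -> dict[str, float]:
    """Group scored items by clinical task and report per-task accuracy.
    Each record is assumed to carry a 'clinical_task' label and a boolean 'correct' flag."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["clinical_task"]] += 1
        hits[r["clinical_task"]] += int(r["correct"])
    return {task: hits[task] / totals[task] for task in totals}

# Hypothetical scored results for two tasks.
records = [
    {"clinical_task": "Disease Diagnosis", "correct": True},
    {"clinical_task": "Disease Diagnosis", "correct": True},
    {"clinical_task": "Severity Grading", "correct": False},
    {"clinical_task": "Severity Grading", "correct": True},
]
print(accuracy_by_task(records))  # {'Disease Diagnosis': 1.0, 'Severity Grading': 0.5}
```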
Implications and Future Directions
The implications of this benchmark are profound. GMAI-MMBench provides a rigorous, diversified, and clinically relevant framework for evaluating LVLMs in medical applications. This benchmark can guide the refinement of existing models, highlight areas in need of development, and ultimately propel advancements in general medical AI (GMAI).
Future Research:
- Enhanced Instruction Tuning: Future models can benefit from improved medical-specific instruction tuning, as evidenced by MedDr's performance.
- Balancing Across Departments and Tasks: Efforts should aim to balance LVLM capabilities across all clinical departments and tasks to develop truly general-purpose medical AI.
- Perceptual Robustness: Enhancing models' ability to handle different perceptual granularities, particularly at the bounding box level, is critical for interactive and precise medical applications; see the box-overlay sketch after this list.
- Human Evaluation Integration: Integrating human evaluations will provide a more grounded benchmark, aligning model capabilities closer to expert performance.
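One common way to pose a box-level question is to draw the region of interest directly onto the image before it is shown to the model. The sketch below does this with Pillow; it is an illustrative assumption about how such prompts might be constructed, not the benchmark's actual pipeline, and the file paths are hypothetical.

```python
from PIL import Image, ImageDraw

def draw_box_prompt(image_path: str, box: tuple[int, int, int, int], out_path: str) -> None:
    """Overlay a rectangular region of interest on a medical image.
    `box` is (left, top, right, bottom) in pixel coordinates."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline=(255, 0, 0), width=3)  # red box marks the queried region
    img.save(out_path)

# Hypothetical usage: highlight a lesion before asking a box-level question about it.
draw_box_prompt("images/case_0001.png", (120, 80, 260, 210), "overlays/case_0001_box.png")
```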
Conclusion
GMAI-MMBench marks a significant advancement towards developing robust, reliable general medical AI. By addressing the gaps in current benchmarks, it offers a comprehensive evaluation tool that can drive the next generation of LVLMs, ensuring they meet the diverse and complex demands of real-world clinical practice. While current models show promise, extensive improvements are necessary before LVLMs can fully meet clinical needs. The benchmark's future integration of human evaluations will further aid in this endeavor, providing crucial insights for enhancing model performance in the medical domain.