
Extent of LVLMs’ capability to meet diverse clinical demands

Determine the extent to which large vision-language models (including general-purpose models such as DeepSeek-VL, GPT-4V/GPT-4o, Claude3-Opus, Gemini, and Qwen-VL, as well as medical-specific models such as MedDr, LLaVA-Med, Med-Flamingo, RadFM, and Qilin-Med-VL-Chat) can accommodate the diverse demands encountered in real-world clinical scenarios across modalities, tasks, departments, and perceptual granularities.


Background

Large Vision-Language Models (LVLMs) have shown promising performance on selected medical vision–language tasks, and both general-purpose and medical-specific systems are being actively evaluated and deployed. However, clinical environments present highly diverse demands across institutions, departments, imaging modalities, and task types, creating uncertainty about how broadly these models can satisfy real-world needs.

The paper introduces GMAI-MMBench to provide a comprehensive evaluation across 39 modalities, 18 clinical VQA tasks, 18 departments, and multiple perceptual granularities. This benchmark is motivated by the need to clarify the true breadth of LVLM capabilities with respect to clinical demands and to quantify how well current systems generalize to practical clinical settings.
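A benchmark organized along several axes (modality, task, department, perceptual granularity) naturally reports per-axis accuracy rather than a single score. As a minimal sketch of that kind of aggregation, the snippet below groups hypothetical per-question results by one axis; the record fields and example labels are illustrative assumptions, not GMAI-MMBench's actual schema.

```python
from collections import defaultdict

# Hypothetical per-question records: each carries axis labels and a
# correctness flag for one model answer (not the benchmark's real format).
records = [
    {"modality": "CT", "task": "disease diagnosis", "department": "radiology", "correct": True},
    {"modality": "MRI", "task": "attribute recognition", "department": "neurology", "correct": False},
    {"modality": "CT", "task": "disease diagnosis", "department": "radiology", "correct": True},
]

def accuracy_by(records, axis):
    """Aggregate accuracy along one evaluation axis (e.g. 'modality')."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[axis]] += 1
        hits[r[axis]] += int(r["correct"])
    return {label: hits[label] / totals[label] for label in totals}

print(accuracy_by(records, "modality"))   # per-modality accuracy
print(accuracy_by(records, "department"))  # per-department accuracy
```

Reporting scores per axis in this way is what lets a benchmark expose uneven coverage, e.g. a model that performs well on one modality but poorly on another.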

References

However, it remains unclear to what extent these LVLMs can accommodate the diverse demands in real clinical scenarios.

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI (Chen et al., arXiv:2408.03361, 6 Aug 2024), Abstract; Introduction.