MedQ-Bench: Medical Image Quality Benchmark
- MedQ-Bench is a comprehensive benchmark that assesses the medical image quality assessment capabilities of multi-modal large language models using interpretable, language-driven perception-reasoning tasks.
- It spans five imaging modalities and over 40 quality attributes by integrating low-level perceptual queries with high-level diagnostic reasoning reflective of expert judgments.
- The framework employs a multi-dimensional evaluation protocol with rigorous human-AI alignment, offering actionable insights for enhancing clinical AI safety and reliability.
MedQ-Bench is a comprehensive benchmark for evaluating the medical image quality assessment (IQA) capabilities of multi-modal LLMs (MLLMs). Unlike conventional approaches that rely on scalar metrics such as PSNR, SSIM, or LPIPS, MedQ-Bench explicitly operationalizes an interpretable, language-driven perception–reasoning framework that mirrors expert human judgment in clinical settings. It spans five imaging modalities and over 40 specific quality attributes, offering both granular perceptual probes and higher-level reasoning queries. Model outputs are subjected to a multi-dimensional evaluation protocol with rigorous human-AI alignment validation, setting a new standard for assessing and optimizing MLLMs in the context of medical imaging safety and reliability (Liu et al., 2 Oct 2025).
1. Motivation and Scope
MedQ-Bench is motivated by the inadequacy of traditional scalar image quality assessment metrics for clinical practice. Scalar scores fail to capture the nuanced, multi-factorial reasoning process that medical experts employ when diagnosing quality issues—such as spatially localized blurring, subtle noise, artifacts, or clinically significant hallucinations in reconstructed or AI-generated images. The benchmark addresses this gap by introducing a perception–reasoning paradigm, in which MLLMs are not only queried about the presence and type of visual degradations but are also expected to provide structured, stepwise reasoning about their clinical impact. This approach aligns machine assessment with real-world clinical quality control procedures, where interpretability is paramount for workflow safety.
2. Task Structure: Perception and Reasoning
MedQ-Bench defines two complementary tasks:
- MedQ-Perception:
- Probes low-level perceptual abilities through several types of human-curated queries regarding fundamental visual attributes.
- Question formats include:
- Yes-or-No (e.g., "Is this image clear?")
- What (e.g., "What type of degradation is present?")
- How (e.g., "How severe is the motion blur?")
- Tasks distinguish among binary judgments, multi-class attribute recognition, and severity grading, as well as between generic and modality-specific attributes (e.g., MRI susceptibility vs. CT metal artifact).
- MedQ-Reasoning:
- Encompasses both no-reference and comparison reasoning:
- No-Reference Reasoning: The model must independently identify the modality/anatomical location, characterize degradation(s), infer likely technical causes, and reach a quality verdict ("good," "usable," or "reject") via a natural language chain-of-thought.
- Comparison Reasoning: The model compares two images (original vs. reconstructed or between algorithmic outputs) and justifies which is better and why, with attention to both coarse-grained (obvious) and fine-grained (subtle) quality differences.
In total, the benchmark comprises 2,600 perception queries spanning 40+ quality attributes and 708 distinct reasoning assessment items, ensuring substantial breadth and depth across modalities, degradation scenarios, and clinical contexts.
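To make the two task families concrete, the following is a minimal sketch of hypothetical item structures for a perception probe and a reasoning item; the field names and schema are illustrative assumptions, not the benchmark's actual data format.

```python
# Hypothetical item structures (illustrative only; not the official MedQ-Bench schema).
from dataclasses import dataclass
from typing import List

@dataclass
class PerceptionItem:
    """A low-level perceptual probe in "Yes-or-No", "What", or "How" format."""
    image_path: str
    question_type: str   # "yes_no" | "what" | "how"
    question: str        # e.g., "How severe is the motion blur?"
    options: List[str]   # candidate answers curated by experts
    answer: str          # gold answer

@dataclass
class ReasoningItem:
    """A no-reference or comparison reasoning item with an expert quality verdict."""
    image_paths: List[str]    # one image (no-reference) or two (comparison)
    setting: str              # "no_reference" | "comparison"
    reference_rationale: str  # expert step-wise justification
    verdict: str              # "good" | "usable" | "reject" (or the preferred image)

# Example perception probe in the "How" format:
probe = PerceptionItem(
    image_path="mri_axial_017.png",
    question_type="how",
    question="How severe is the motion blur?",
    options=["none", "mild", "moderate", "severe"],
    answer="moderate",
)
print(probe.question, "->", probe.answer)
```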
3. Benchmark Dataset Composition
MedQ-Bench samples are drawn from three primary image sources, ensuring diverse and realistic evaluation coverage:
- Authentic clinical acquisitions: Real-world clinical images portraying naturally occurring artifacts and degradations.
- Simulated degradations: Images degraded via controlled, physics-based reconstructions and signal-processing transformations to introduce known, quantifiable impairments with ground truth.
- AI-generated images: Outputs from image enhancement, domain translation, and reconstruction algorithms, encompassing algorithmic artifacts (e.g., hallucinated structures, loss of detail).
The dataset spans five core imaging modalities:
- Magnetic Resonance Imaging (MRI)
- Computed Tomography (CT)
- Endoscopy
- Histopathology Imaging
- Fundus Photography
Coverage across over 40 quality attributes captures both generic and modality-specific degradations (such as noise, motion, metal/beam hardening, streak, aliasing, color casts, and AI-induced hallucination).
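As an illustration of the simulated-degradation source described above, the sketch below injects impairments of known, quantifiable severity into an image; the specific noise level and blur kernel are illustrative choices, not the benchmark's actual degradation pipeline.

```python
# Minimal sketch (assumed, not MedQ-Bench's actual pipeline) of controlled degradations
# with known ground-truth severity, applied to an image normalized to [0, 1].
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float) -> np.ndarray:
    """Additive Gaussian noise with known standard deviation (ground-truth severity)."""
    noisy = img + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_motion_blur(img: np.ndarray, length: int) -> np.ndarray:
    """Horizontal motion blur of known extent via a 1-D box kernel applied to each row."""
    kernel = np.ones(length) / length
    return np.apply_along_axis(lambda row: np.convolve(row, kernel, mode="same"), 1, img)

# Toy example on a synthetic 2-D "image":
clean = np.random.rand(64, 64)
degraded = add_motion_blur(add_gaussian_noise(clean, sigma=0.05), length=7)
print(degraded.shape)
```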
4. Multi-Dimensional Evaluation Protocol
MedQ-Bench employs a robust, multi-dimensional assessment protocol for evaluating natural language outputs:
- Completeness: Proportion of the gold key visual cues K_R that are covered by the cues K_O extracted from the model output, i.e., |K_O ∩ K_R| / |K_R|.
- Preciseness: Semantic alignment of the description with the observed degradations and the absence of contradictions.
- Consistency: Logical coherence between step-wise reasoning and the announced quality verdict.
- Quality Accuracy: Exact match between model’s quality verdict and expert reference.
Each dimension is scored (0, 1, 2), and the sum provides an overall reasoning quality score.
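A minimal sketch of this aggregation, assuming the four dimension scores are simply summed as stated above (the dimension names and validation logic are illustrative):

```python
# Illustrative aggregation of the four per-dimension scores, each in {0, 1, 2},
# into an overall reasoning-quality score (not the official MedQ-Bench scorer).
from typing import Dict

DIMENSIONS = ("completeness", "preciseness", "consistency", "quality_accuracy")

def overall_reasoning_score(scores: Dict[str, int]) -> int:
    """Sum the four dimension scores, checking each is a valid 0/1/2 rating."""
    total = 0
    for dim in DIMENSIONS:
        s = scores[dim]
        if s not in (0, 1, 2):
            raise ValueError(f"{dim} score must be 0, 1, or 2, got {s}")
        total += s
    return total

# Example: complete and consistent, but imprecise, with the correct final verdict.
print(overall_reasoning_score(
    {"completeness": 2, "preciseness": 1, "consistency": 2, "quality_accuracy": 2}
))  # -> 7
```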
This protocol is validated against assessments from board-certified radiologists, with human-AI agreement quantified via quadratic weighted Cohen's kappa: values between 0.774 and 0.985, together with >80% accuracy, indicate high reliability and close alignment between GPT-4o–based automatic scoring and clinical expert evaluation.
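For reference, a quadratic weighted Cohen's kappa of this kind can be computed with scikit-learn; the score lists below are made-up examples rather than MedQ-Bench data.

```python
# Quadratic weighted Cohen's kappa between automatic and expert ratings.
from sklearn.metrics import cohen_kappa_score

expert_scores = [2, 1, 2, 0, 1, 2, 2, 1]  # hypothetical per-item expert ratings (0-2)
model_scores  = [2, 1, 2, 1, 1, 2, 2, 0]  # hypothetical automatic ratings for the same items

kappa = cohen_kappa_score(expert_scores, model_scores, weights="quadratic")
print(f"quadratic weighted kappa = {kappa:.3f}")
```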
5. Model Evaluation and Findings
A total of 14 state-of-the-art MLLMs—encompassing open-source (e.g., Qwen2.5-VL-Instruct, InternVL3), commercially available (e.g., GPT-5, GPT-4o, Grok-4), and domain-specialized (e.g., MedGemma, BiMediX2)—were benchmarked under consistent zero-shot conditions. The principal findings are as follows:
- Commercial general-purpose models (GPT-5, GPT-4o) achieve the highest scores on both perception and reasoning tasks, outperforming open-source and specialized medical MLLMs in most cases.
- Underperformance on subtle degradations: All systems, including top commercial models, frequently miss mild or fine-grained quality defects, which are critical for clinical safety.
- Comparison tasks: Several MLLMs struggle with paired image reasoning, especially for nuanced differences required in real clinical decision-making contexts.
- Medical-specialized models: Contrary to expectation, some medical-finetuned systems (e.g., MedGemma, BiMediX2) trail behind general-purpose MLLMs, suggesting that domain-specific pre-training has not yet bridged the gap in foundational quality perception and language-grounded reasoning.
Despite progress, the top-performing model (GPT-5) remains ~13.5% below radiologist-level performance in perception accuracy, underscoring the need for further optimization before clinical deployment.
6. Implications for Clinical AI and Future Development
MedQ-Bench’s granular assessments reveal that current MLLMs display unstable and incomplete perceptual and reasoning abilities with respect to image quality—posing major limitations for adoption as “first-mile” safety gates in clinical AI workflows. These results suggest several avenues for future research:
- Fine-tuning and curriculum learning to strengthen low-level perceptual representations within MLLM frameworks.
- Multi-task optimization strategies unifying quality assessment and diagnostic reasoning.
- Targeted benchmarks and training sets emphasizing subtle, clinically significant degradations and modality-specific nuances.
- Using MedQ-Bench as a reference standard for rapid prototyping, validation, and regulatory compliance in automated medical image analysis pipelines.
MedQ-Bench is thus positioned as both a critical evaluation resource and a catalyst for the systematic improvement of multimodal models toward reliable, expert-aligned clinical image quality assessment.