MedQ-Perception Analysis
- MedQ-Perception is an evaluation framework that empirically tests MLLMs' ability to detect, describe, and discriminate low-level degradations in medical images across multiple modalities.
- The methodology employs human-curated questions targeting specific quality attributes, quantifying model performance with accuracy metrics for binary, categorical, and ordinal tasks.
- Findings reveal that while top models exceed random baselines, they struggle with subtle, mild degradations, highlighting the need for targeted fine-tuning and modality-aware optimization.
MedQ-Perception refers to the empirical and methodological study of medical image quality perception by multi-modal LLMs (MLLMs), as instantiated in the MedQ-Bench framework (Liu et al., 2 Oct 2025). MedQ-Perception differs fundamentally from legacy scalar image quality assessment (IQA) by operationalizing a "see-and-report" paradigm that probes models with human-curated questions targeting low-level, modality-specific visual attributes. The objective is to determine whether MLLMs can reliably recognize, describe, and discriminate visual quality features (such as clarity, artifact presence, or degradation severity) in medical images, without resorting to high-level or diagnostic reasoning.
1. Motivations and Scope
MedQ-Perception was developed to address shortcomings of conventional IQA methods, which are typically restricted to numerical score-based metrics (e.g., PSNR, SSIM; see the brief illustration below) and fail to capture the descriptive, context-dependent reasoning used by clinical experts. The paradigm evaluates MLLMs on their ability to emulate human-like perception: not only detecting image degradation but also distinguishing its types and severities in a manner interpretable by clinicians. This rigorously curated evaluation spans five major imaging modalities (CT, MRI, endoscopy, histopathology, fundus) and covers over forty visual quality attributes. Unlike diagnostic or disease classification benchmarks, MedQ-Perception is strictly limited to low-level visual evaluation and avoids inference beyond the observed image features.
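For contrast, the following is a minimal sketch of the scalar IQA paradigm that MedQ-Perception moves beyond: each metric emits a single number, with no description of what degraded or how. It assumes scikit-image is installed, and the synthetic image pair is purely illustrative.

```python
# Scalar IQA baseline: a single number per image, no descriptive output.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
reference = rng.random((256, 256))                             # stand-in for a clean scan
degraded = reference + 0.05 * rng.standard_normal((256, 256))  # simulated noise

psnr = peak_signal_noise_ratio(reference, degraded, data_range=1.0)
ssim = structural_similarity(reference, degraded, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")  # scalar scores only, no explanation
```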
2. Task Formulation and Mathematical Framework
MedQ-Perception formalizes perception tasks as a conditional mapping

$$f : \mathcal{I} \times \mathcal{Q} \rightarrow \mathcal{A},$$

where $\mathcal{I}$ is the space of input images, $\mathcal{Q}$ is the space of curated image quality questions, and $\mathcal{A}$ is the response space (binary, categorical, or ordinal). Task types include (see the sketch after this list):
- Yes/No (Binary): $\mathcal{A} = \{\text{yes}, \text{no}\}$ (e.g., "Is this image clear?").
- What (Categorical): $\mathcal{A}$ is an unordered label set (e.g., selecting from {blur, noise, artifact, …}).
- How (Ordinal): $\mathcal{A}$ is an ordered severity scale (e.g., rating severity as "none," "mild," or "severe").
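The sketch below makes the mapping $f$ concrete as an (image, question) → answer interface in Python. The class names, fields, and exact-match scoring are illustrative assumptions, not the MedQ-Bench implementation.

```python
# Sketch of the (image, question) -> answer interface implied by the mapping f.
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    YES_NO = "binary"        # A = {yes, no}
    WHAT = "categorical"     # A = unordered label set, e.g. {blur, noise, artifact}
    HOW = "ordinal"          # A = ordered scale, e.g. none < mild < severe

@dataclass
class PerceptionQuestion:
    text: str                # human-curated question, e.g. "Is this image clear?"
    task_type: TaskType
    answer_space: list[str]  # candidate answers; order matters for ordinal tasks

def score_response(question: PerceptionQuestion, prediction: str, ground_truth: str) -> float:
    """Exact-match accuracy: 1.0 if the model's answer equals the expert label."""
    assert ground_truth in question.answer_space
    return float(prediction.strip().lower() == ground_truth.strip().lower())

# Example with a hypothetical ordinal ("How") question
q = PerceptionQuestion("How severe is the motion artifact?", TaskType.HOW,
                       ["none", "mild", "severe"])
print(score_response(q, "Mild", "mild"))  # -> 1.0
```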
Each question is meticulously designed by imaging specialists to target fundamental quality attributes relevant to specific modalities and real-world acquisition scenarios, including simulated degradations and authentic clinical scans.
3. Human-Curated Question Design and Attribute Taxonomy
The MedQ-Bench protocol relies on domain experts for seed question engineering, categorizing questions along:
- Degradation Status: “No Degradation” (reference), “Mild Degradation,” “Severe Degradation.”
- Task Specificity: “General” (cross-modality, e.g., “Is the image clear?”), and “Modality-Specific” (e.g., “Are there metal streaks in this CT?”).
Underlying attributes span clarity, overall noise, artifact severity (metal streaks, motion, ghosting), contrast preservation, anatomical detail retention, and more, organized into a hierarchical taxonomy; a sketch of such a taxonomy appears below. Questions are instantiated for each image, with ground-truth expert annotation serving as the reference for evaluation.
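The following is a hedged sketch of a two-level attribute taxonomy in the spirit of the MedQ-Bench design; the attribute names, grouping, and question strings are illustrative only.

```python
# Illustrative hierarchical taxonomy: general vs. modality-specific attributes.
TAXONOMY = {
    "general": {                       # cross-modality attributes
        "clarity": ["Is the image clear?"],
        "noise": ["Is the overall noise level acceptable?"],
    },
    "modality_specific": {
        "CT": {"metal_streaks": ["Are there metal streak artifacts in this CT?"]},
        "MRI": {"motion_ghosting": ["Is motion ghosting visible in this MRI?"]},
    },
}

def questions_for(modality: str) -> list[str]:
    """Collect general questions plus those specific to the given modality."""
    general = [q for qs in TAXONOMY["general"].values() for q in qs]
    specific = [q for qs in TAXONOMY["modality_specific"].get(modality, {}).values()
                for q in qs]
    return general + specific

print(questions_for("CT"))
# ['Is the image clear?', 'Is the overall noise level acceptable?',
#  'Are there metal streak artifacts in this CT?']
```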
4. Dataset Construction and Evaluation Protocols
The benchmark comprises 3,308 images drawn from:
- Real clinical acquisitions (with natural and device-induced degradations),
- Controlled physics-based reconstructions simulating degradations (e.g., blurring, streak artifacts),
- AI-generated images for assessing model robustness to synthetic variation.
For every image, up to three question subtypes (Yes/No, What, and How) are applied. The evaluation reports per-model and per-modality accuracy for each question type, enabling granular analysis of perceptual strengths and failure modes. Statistical metrics are disaggregated by degradation level, modality, and attribute category; a sketch of this aggregation follows.
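The sketch below shows one way to compute such disaggregated accuracy. The record schema (field names) is an assumption for illustration, not MedQ-Bench's actual data format.

```python
# Group exact-match accuracy by (modality, question type, degradation level).
from collections import defaultdict

def disaggregated_accuracy(records: list[dict]) -> dict:
    """Exact-match accuracy per (modality, task_type, degradation) bucket."""
    buckets = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for r in records:
        key = (r["modality"], r["task_type"], r["degradation"])
        buckets[key][0] += int(r["prediction"] == r["ground_truth"])
        buckets[key][1] += 1
    return {key: correct / total for key, (correct, total) in buckets.items()}

records = [
    {"modality": "CT", "task_type": "yes_no", "degradation": "mild",
     "prediction": "no", "ground_truth": "yes"},
    {"modality": "CT", "task_type": "yes_no", "degradation": "severe",
     "prediction": "yes", "ground_truth": "yes"},
]
print(disaggregated_accuracy(records))
# {('CT', 'yes_no', 'mild'): 0.0, ('CT', 'yes_no', 'severe'): 1.0}
```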
5. Empirical Findings on Perceptual Capabilities
The MedQ-Perception evaluation reveals notable insights:
- Most MLLMs surpass random baselines, demonstrating inherent visual perception capability; however, best-in-class models (e.g., GPT-5, GPT-4o) still fall short of radiologist-level reliability.
- “No degradation” and “severe degradation” cases are easier for models to resolve (typically exceeding 70% accuracy), while “mild” degradations are frequently misclassified.
- General questions yield higher performance than modality-specific ones, indicating a lack of domain-adapted perceptual skills.
- Several medical-specialized MLLMs underperform compared to general-purpose analogs, suggesting current finetuning procedures insufficiently emphasize low-level perceptual cues.
- Alignment studies between LLM-based judgements and radiologist grading further highlight gaps in model robustness and completeness.
6. Recommendations for Model Optimization and Clinical Reliability
Expert review of MedQ-Perception results indicates the need for targeted optimization strategies:
- Expansion of training datasets to include richer and more subtle forms of visual degradation, particularly mild artifact cases.
- Deeper modality-specific fine-tuning, with explicit attention to acquisition physics and device artifacts.
- Implementation of human-in-the-loop verification and chain-of-thought prompting schemes to align model reasoning with expert cognitive processes (an illustrative prompt template follows this list).
- Exploitation of multi-dimensional evaluation (precision, completeness, consistency, etc.) for iterative refinement of MLLM architectures.
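The following is an illustrative chain-of-thought prompt template for low-level quality perception; the wording is an assumption for demonstration, not the prompt used in MedQ-Bench.

```python
# Hypothetical chain-of-thought prompt: observe first, grade severity, then answer.
COT_QUALITY_PROMPT = """\
You are assessing the low-level visual quality of a medical image.
Modality: {modality}

Step 1: Describe any visible degradations (blur, noise, artifacts) without
interpreting anatomy or pathology.
Step 2: Judge the severity of each degradation: none, mild, or severe.
Step 3: Answer the question below using only your observations above.

Question: {question}
Answer:"""

print(COT_QUALITY_PROMPT.format(modality="CT",
                                question="Are there metal streak artifacts?"))
```

Structuring the prompt to separate observation from judgement mirrors the see-and-report paradigm: the model must commit to low-level findings before answering, which makes its reasoning auditable against expert grading.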
7. Context and Implications in Medical Imaging Research
MedQ-Perception constitutes a paradigm shift in medical image quality assessment research, providing the first rigorous testbed for evaluating the foundational perceptual abilities of language-driven multi-modal models. By exposing the limitations of current MLLMs, particularly their instability and lack of clinical-grade reliability in low-level image quality judgements, MedQ-Bench catalyzes targeted development of safer, AI-driven medical imaging pipelines. This opens a pathway toward MLLMs serving as first-mile "safety gates" for clinical AI, verifying image fidelity before subsequent diagnostic processing.
In conclusion, MedQ-Perception operationalizes a perception-reasoning framework that serves as a foundation for advancing both technical fidelity and clinical interpretability in medical image quality assessment with MLLMs. Future strategies informed by this benchmark (enhanced training, prompt engineering, modality-aware reasoning) are necessary for closing the gap between model and expert, ultimately enabling trustworthy deployment in healthcare environments (Liu et al., 2 Oct 2025).