MIMIC-EchoQA: Echocardiography AI Benchmark
- MIMIC-EchoQA is a benchmark that assesses multimodal AI systems on echocardiography by combining view-aware cine videos with expert-validated clinical questions.
- The benchmark covers key tasks including disease classification, measurement retrieval, and threshold-based decision making using MIMIC-IV-ECHO data.
- State-of-the-art models like Echo-CoPilot demonstrate improved accuracy by integrating clinical rule-based reasoning with precise visual assessments.
The MIMIC-EchoQA benchmark is a publicly released evaluation set designed to assess multimodal artificial intelligence systems on clinically relevant, view-aware reasoning and language comprehension tasks in echocardiography. Each example consists of a single, quality-controlled transthoracic echocardiogram cine video paired with an expert-validated four-choice multiple-choice question, focused on disease classification, quantitative measurement, or clinical threshold decisions. The benchmark, derived from the MIMIC-IV-ECHO dataset, establishes a standardized testbed for models aiming to provide unified, guideline-aware assessments across diverse echocardiographic views and tasks (Heidari et al., 6 Dec 2025).
1. Origin, Construction, and Scope
MIMIC-EchoQA was introduced by Thapa et al. (2025) and is built using the publicly available MIMIC-IV-ECHO repository. Its explicit purpose is to drive evaluation of multimodal models on “view-aware visual reasoning and clinically meaningful language comprehension” for echocardiography. The resource comprises 622 examples, each including:
- A single cine video per example, drawn from 38 possible echocardiographic views,
- An expert-annotated clinical multiple-choice question (MCQ) with four candidate answers,
- The correct answer label, validated against clinical reports and imaging guidelines.
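A minimal sketch of how one such benchmark item could be represented programmatically is shown below; the field names and example values are illustrative assumptions, not the schema of the official release.

```python
# Hypothetical representation of one MIMIC-EchoQA item; field names and
# values are illustrative and not taken from the official release.
from dataclasses import dataclass
from typing import List

@dataclass
class EchoQAItem:
    video_path: str       # quality-controlled TTE cine video
    view_label: str       # one of the 38 possible echocardiographic views
    question: str         # expert-annotated clinical question
    options: List[str]    # exactly four candidate answers
    answer: str           # correct choice, validated against reports and guidelines

example = EchoQAItem(
    video_path="echo/study_0001/clip_a4c.avi",   # hypothetical path
    view_label="apical 4-chamber",
    question="What is the estimated left ventricular ejection fraction?",
    options=["<30%", "30-40%", "40-55%", ">55%"],
    answer=">55%",
)
```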
The benchmark covers clinically central reasoning tasks, including:
- Measurement retrieval (e.g., estimating left ventricular ejection fraction),
- Disease classification (presence and grade, e.g., of valvular regurgitation or pericardial effusion),
- Threshold-based decision-making (e.g., discriminating borderline ventricular hypertrophy given wall thickness cut-offs, stratifying effusion severity).
No official training or validation splits are provided; all 622 examples are reserved as held-out test cases for model evaluation (Heidari et al., 6 Dec 2025).
2. Data Preprocessing and Inclusion Criteria
Videos are standardized via frame sampling, resolution normalization, and temporal aggregation to ensure input consistency across models and tool pipelines. For image-centric baselines, such as MedGemma-4B, representative end-diastolic and end-systolic frames are automatically selected using an EchoNet-based key-frame selector. All included studies from MIMIC-IV-ECHO satisfy quality requirements set by the dataset authors; no further exclusion criteria are enumerated. The evaluation set encompasses the full heterogeneity of real-world hospital echocardiography, spanning 38 unique clinical views and a spectrum of cardiac pathologies (Heidari et al., 6 Dec 2025).
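A minimal sketch of the uniform frame sampling and resolution normalization described above is given below, assuming OpenCV and NumPy are available; the frame count and target resolution are illustrative choices, not the published pipeline's settings.

```python
# Illustrative preprocessing: uniform temporal sampling plus spatial resizing.
# The frame count and target resolution below are assumptions, not the
# benchmark's official settings.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 16, size: int = 224) -> np.ndarray:
    """Return an array of shape (num_frames, size, size, 3) of evenly spaced frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (size, size)))
    cap.release()
    return np.stack(frames) if frames else np.empty((0, size, size, 3), dtype=np.uint8)
```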
3. Evaluation Metrics and Protocol
Performance on MIMIC-EchoQA is measured exclusively by accuracy, defined as the proportion of examples for which the selected model response matches the benchmark label:

$$\text{Accuracy} = \frac{\text{Number of correctly answered questions}}{\text{Total number of questions}}$$

or, from a binary classification perspective,

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively.
For MCQs, models select the candidate with the highest confidence or predicted probability, without the application of auxiliary thresholds. The benchmark does not report F1, calibration, explanation quality, or other secondary metrics. Statistical significance measures, such as $p$-values or confidence intervals, are not provided in initial reports (Heidari et al., 6 Dec 2025).
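A minimal sketch of this scoring protocol, assuming the model exposes one score per answer option and the prediction is simply the argmax (function and variable names are illustrative):

```python
# Illustrative MCQ scoring: choose the highest-scoring option per question
# (no auxiliary thresholds), then report the fraction answered correctly.
import numpy as np

def mcq_accuracy(option_scores: np.ndarray, labels: np.ndarray) -> float:
    """option_scores: (n_examples, 4) per-option scores; labels: (n_examples,) correct indices."""
    predictions = option_scores.argmax(axis=1)
    return float((predictions == labels).mean())

# Toy example with 3 questions and 4 options each (values are arbitrary).
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.3, 0.2, 0.4, 0.1],
                   [0.6, 0.2, 0.1, 0.1]])
labels = np.array([1, 2, 3])
print(mcq_accuracy(scores, labels))  # 2 of 3 correct -> 0.666...
```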
4. Model Baselines and Comparative Performance
MIMIC-EchoQA serves to differentiate the capabilities of general vision-language models (VLMs), biomedical-specialized pipelines, and modular agentic frameworks. A broad range of models has been evaluated on the benchmark:
| Model Type | Model Name | Accuracy (%) |
|---|---|---|
| General V+L | Video-ChatGPT | 31.7 |
| General V+L | Video-LLaVA | 32.0 |
| General V+L | Phi-3.5-vision-instruct | 41.1 |
| General V+L | Phi-4-multimodal-instruct | 37.8 |
| General V+L | InternVideo2.5-Chat-8B | 40.3 |
| General V+L | GPT-4o | 41.6 |
| General V+L | o4-mini | 43.9 |
| Biomedical-specialized | MedGemma-4B | 33.4 |
| Biomedical-specialized | Qwen2-VL-2B-biomed | 42.0 |
| Biomedical-specialized | Qwen2-VL-7B-biomed | 49.0 |
| Modular agent | Echo-CoPilot | 50.8 |
Echo-CoPilot establishes a new state-of-the-art, achieving 50.8% accuracy. Absolute improvement over the best biomedical baseline (Qwen2-VL-7B-biomed, 49.0%) is +1.8 points (relative +3.7%), and over the strongest general VLM (o4-mini, 43.9%) is +6.9 points (relative +15.7%) (Heidari et al., 6 Dec 2025).
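For concreteness, the quoted relative gains follow directly from the absolute accuracies:

$$\frac{50.8 - 49.0}{49.0} \approx 3.7\%, \qquad \frac{50.8 - 43.9}{43.9} \approx 15.7\%$$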
5. Qualitative Reasoning and Case Studies
MIMIC-EchoQA is designed to probe not only perceptual acuity but clinically consistent reasoning, particularly at decision thresholds. Analysis of Echo-CoPilot’s performance reveals several modes of reasoning that are central to success on the benchmark:
- Hierarchical Tool Invocation: For tasks such as grading mitral regurgitation severity, the agent sequentially invokes disease prediction and quantitative measurement tools (e.g., regurgitant jet area, vena contracta width) and cross-references the results against established clinical thresholds (e.g., VC ≥ 0.7 cm for “severe”).
- Threshold and Grading Decisions: For borderline left ventricular hypertrophy, segmentation tools (MedSAM2) are used to quantify septal thickness, and the results are compared against ASE guideline cut-offs to assign a grade.
- Physiologic Context Integration: Where direct measurement is ambiguous or missing, the agent leverages surrogate features (e.g., cardiac dilation, hemodynamic context such as tamponade signs) to inform answer refinement and mitigate overcalling seen in purely vision-based systems.
These reasoning chains demonstrate that the benchmark’s query set surfaces limitations in single-shot VLMs that do not combine measurement precision and clinical rule-based logic (Heidari et al., 6 Dec 2025).
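A minimal sketch of the kind of guideline-threshold logic described in these case studies is given below. The VC ≥ 0.7 cm cut-off for severe mitral regurgitation is taken from the description above; the remaining cut-off values are placeholders and would need to be replaced with the applicable ASE reference limits.

```python
# Illustrative guideline-threshold logic of the kind described above.
# The VC >= 0.7 cm cut-off for "severe" mitral regurgitation is cited in the
# case studies; the other bands below are placeholder values, not ASE tables.

def grade_mitral_regurgitation(vena_contracta_cm: float) -> str:
    """Map a vena contracta width measurement to a coarse MR severity grade."""
    if vena_contracta_cm >= 0.7:      # threshold cited for "severe"
        return "severe"
    if vena_contracta_cm >= 0.3:      # placeholder lower bound for "moderate"
        return "moderate"
    return "mild"

def grade_septal_hypertrophy(septal_thickness_cm: float,
                             cutoffs=(1.1, 1.4, 1.7)) -> str:
    """Map septal wall thickness to a grade using supplied guideline cut-offs."""
    mild, moderate, severe = cutoffs  # placeholder cut-offs; consult ASE reference limits
    if septal_thickness_cm >= severe:
        return "severe hypertrophy"
    if septal_thickness_cm >= moderate:
        return "moderate hypertrophy"
    if septal_thickness_cm >= mild:
        return "mild hypertrophy"
    return "normal"

print(grade_mitral_regurgitation(0.75))   # severe
print(grade_septal_hypertrophy(1.2))      # mild hypertrophy
```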
6. Limitations and Directions for Future Benchmarking
Authors highlight several key constraints in the current MIMIC-EchoQA release:
- Absence of official training and validation splits hinders systematic development and ablation of specialized agents.
- Lack of per-task subbenchmarking (e.g., view classification, segmentation, measurement) prevents granular diagnosis of model strengths and weaknesses.
- Evaluation is restricted to accuracy; it does not consider calibration, explanation quality, or human-in-the-loop trust metrics.
- No statistical significance assessments (e.g., p-values) are reported.
Recommended future directions include releasing train/validation splits, adding per-task annotations to support modular, interpretable benchmarking, and incorporating richer metrics that capture reasoning-trace fidelity, calibration, and expert trust (Heidari et al., 6 Dec 2025).
7. Relationship to Larger Echocardiographic QA Datasets
MIMIC-EchoQA is distinct from the scale-focused EchoQA dataset, which comprises 771,244 text-based question-answer pairs derived from echocardiogram reports in MIMIC-IV (Moukheiber et al., 4 Mar 2025). The EchoQA dataset targets instruction tuning for LLMs using free-text clinical data, while MIMIC-EchoQA centers on multimodal video-language reasoning, measurement, and clinical guideline compliance in the context of imaging data. This suggests the two resources are complementary, addressing different aspects of echocardiographic AI benchmarking and development.
In summary, MIMIC-EchoQA provides a challenging, clinically grounded benchmark for the assessment of multimodal agents in echocardiography. It catalyzes the development and comparative study of models capable of unified reasoning over multi-view videos and complex guideline-based decisions, with Echo-CoPilot’s 50.8% accuracy establishing the current state-of-the-art (Heidari et al., 6 Dec 2025).