
MedTVT-QA: Multimodal Medical Reasoning Dataset

Updated 9 December 2025
  • MedTVT-QA is a multimodal dataset featuring expert-curated ECG, CXR, and LAB data for comprehensive medical reasoning.
  • It employs a Chain of Evidence approach, ensuring explicit cross-modal synthesis and rigorous expert validation.
  • The dataset supports training and benchmarking of MLLMs in clinical tasks such as multi-disease diagnosis and report automation.

MedTVT-QA is a large-scale, expert-curated multimodal instruction dataset designed for the development and evaluation of multimodal LLMs (MLLMs) in the context of medical reasoning and multi-disease diagnosis. Leveraging heterogeneous clinical data—including electrocardiograms (ECG), chest X-rays (CXR), and laboratory (LAB) test results—the dataset emphasizes both physiological-level interpretation and disease-level diagnostic reasoning. A Chain of Evidence approach underpins the dataset’s structure, ensuring explicit cross-modal evidence synthesis for every diagnostic conclusion. MedTVT-QA has been constructed as part of the MedTVT-R1 project to provide a foundation for training, benchmarking, and advancing MLLMs in clinically relevant scenarios (Zhang et al., 23 Jun 2025).

1. Cohort Construction and Dataset Scale

MedTVT-QA is assembled from three major sources: MIMIC-IV, MIMIC-IV-ECG, and MIMIC-CXR-JPG. Patient and study identifiers enable reliable alignment of multimodal data. For each patient episode, three temporal data windows are enforced to maintain clinical consistency:

  • ECGs from the first 24 hours of admission,
  • CXRs acquired within 24–72 hours,
  • LAB results from the first 24 hours.

After filtering to ensure temporal contiguity and data completeness, the resulting dataset comprises 8,706 multimodal records. These are divided into 8,331 samples for training and 375 for testing at the record (patient-time) level; there is no explicit validation split reported.
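
A minimal sketch of this temporal-window filter, assuming a pre-joined episode table with hypothetical timestamp columns (admit_time, ecg_time, cxr_time, lab_time); the actual MIMIC table joins and column names differ:

```python
from datetime import timedelta

import pandas as pd

def within_window(event_time, admit_time, lo_h, hi_h):
    """True if event_time falls within [admit_time + lo_h, admit_time + hi_h] hours."""
    delta = event_time - admit_time
    return timedelta(hours=lo_h) <= delta <= timedelta(hours=hi_h)

def keep_episode(row):
    """Enforce the three temporal windows described above."""
    return (
        within_window(row["ecg_time"], row["admit_time"], 0, 24)      # ECG: first 24 h
        and within_window(row["cxr_time"], row["admit_time"], 24, 72) # CXR: 24-72 h
        and within_window(row["lab_time"], row["admit_time"], 0, 24)  # LAB: first 24 h
    )

# "aligned_episodes.csv" is a hypothetical pre-joined table, not a released file.
episodes = pd.read_csv(
    "aligned_episodes.csv",
    parse_dates=["admit_time", "ecg_time", "cxr_time", "lab_time"],
)
dataset = episodes[episodes.apply(keep_episode, axis=1)]
print(f"{len(dataset)} records satisfy all three windows")
```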

Each sample yields four QA pairs: three at the physiological (modality-specific) level and one at the disease (integrated) level. In aggregate, MedTVT-QA provides 34,824 QA pairs, corresponding to 8,706 each for ECG-QA, CXR-QA, LAB-QA, and disease-level QA.
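
Concretely, a single record might be organized as follows; all field names are hypothetical, chosen only to illustrate the four-QA layout, and do not reflect the released schema:

```python
# Hypothetical layout of one MedTVT-QA record; field names are illustrative.
sample = {
    "record_id": "subject_00042_t0",
    "modalities": {"ecg": "ecg_00042.dat", "cxr": "cxr_00042.jpg", "lab": "lab_00042.csv"},
    "qa_pairs": [
        {"level": "physiological", "modality": "ecg", "question": "...", "answer": "..."},
        {"level": "physiological", "modality": "cxr", "question": "...", "answer": "..."},
        {"level": "physiological", "modality": "lab", "question": "...", "answer": "..."},
        {"level": "disease", "question": "...",
         "answer": "<think>...</think><answer>...</answer>"},
    ],
}
assert len(sample["qa_pairs"]) == 4  # 8,706 records x 4 QA pairs = 34,824 in total
```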

2. Structure and Annotation Pipeline

The QA pairs are generated and refined through a two-stage, expert-aided process:

  • Physiological QA: For each modality, a prompt (Role Setting → Task Description → Answer Guidance → Answer Format) is constructed. Automated GPT-4o generation is steered by key label inputs (e.g., "Sinus Rhythm," "LBBB") and templated questions. Generated outputs follow a three-part structure: introduction, label-wise interpretation, summary. Every automated answer is subsequently reviewed and revised by a medical expert to ensure accuracy and coherence.
  • Disease-Level QA: The Chain of Evidence prompt includes synthesized <ecg_report>, <cxr_report>, and <blood_test_report> sections, along with explicit ground-truth disease tags. Instructions mandate the citation of supporting evidence from each modality, and the diagnoses must use the exact provided disease labels. Answers are split into a <think> section (evidence synthesis) and an <answer> section (diagnosis list).

All responses are subject to human expert validation; no quantitative inter-annotator agreement metrics are reported, but the dual-stage process is designed for high clinical fidelity.
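
As a concrete illustration, here is a minimal sketch of the four-part prompt pattern (Role Setting → Task Description → Answer Guidance → Answer Format); the function name and phrasing are illustrative assumptions, not the exact prompts used for GPT-4o generation:

```python
# Hypothetical sketch of the four-part physiological-QA prompt.
def build_physio_prompt(modality: str, labels: list[str]) -> str:
    role = f"You are a clinical expert interpreting {modality} data."            # Role Setting
    task = f"Explain the clinical significance of: {', '.join(labels)}."         # Task Description
    guidance = "Structure the answer as introduction, label-wise interpretation, then summary."  # Answer Guidance
    fmt = "Answer in 250-350 words of continuous prose."                         # Answer Format
    return "\n".join([role, task, guidance, fmt])

print(build_physio_prompt("ECG", ["Sinus Rhythm", "LBBB"]))
```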

3. Representation of Modalities and Integration Protocol

MedTVT-QA encodes cross-modal clinical scenarios. Although modality encoding itself is handled by the encoder–projector alignment scheme of the companion MedTVT-R1 model, the dataset documentation defines the following conventions:

  • Modality placeholders <ecg>, <cxr>, <lab> in the QA prompts for model-inference compatibility.
  • Expected model-side encodings:
    • ECG: $f_E$ (encoder) followed by $g_E$ (projector), outputting $Z_E \in \mathbb{R}^d$.
    • CXR: $f_C$ and $g_C$, yielding $Z_C \in \mathbb{R}^d$.
    • LAB: $f_L$ and $g_L$, producing $Z_L \in \mathbb{R}^d$.
  • Disease-level QA items enforce direct inter-modal dependency, requiring explicit referencing of all three input sources to synthesize a diagnosis. This structure ensures that downstream models must combine physiological signals, radiographic evidence, and laboratory data.
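
A minimal structural sketch of the encoder–projector pattern above; the actual MedTVT-R1 encoders are modality-specific pretrained networks, so the linear stand-ins and layer sizes here are placeholder assumptions that only illustrate the $f \to g$ composition and the shared target dimension $d$:

```python
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """Encoder f followed by projector g, mapping raw input to Z in R^d."""

    def __init__(self, in_dim: int, enc_dim: int, d: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, enc_dim), nn.GELU())  # stand-in for f
        self.projector = nn.Linear(enc_dim, d)                               # stand-in for g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.projector(self.encoder(x))

# Placeholder input/embedding sizes; only the shared target dimension d matters here.
d = 768
ecg_branch = ModalityBranch(in_dim=5000, enc_dim=1024, d=d)  # Z_E
cxr_branch = ModalityBranch(in_dim=2048, enc_dim=1024, d=d)  # Z_C
lab_branch = ModalityBranch(in_dim=64, enc_dim=256, d=d)     # Z_L

z_e = ecg_branch(torch.randn(1, 5000))  # shape (1, 768)
```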

4. Data Coverage, Disease Taxonomy, and QA Characteristics

MedTVT-QA covers seven primary disease categories—Coronary Artery Disease, Acute Renal Failure, Hypertension, Atrial Fibrillation, Pneumonia, Diabetes Mellitus, and Sepsis—each decomposed into ICD-10-coded subtypes. Disease prevalence and detailed subcode counts are referenced in Figure 1(b) and Table A.2 of the dataset’s source publication.

Each sample combines three physiological QAs (ECG, CXR, LAB), each 250–350 words in length, and a single disease QA (100–150 words of synthesized evidence and diagnosis). The distribution of samples over diseases is stratified for balanced multimodal task representation. Every entry is subjected to structured annotation and professional validation.
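
For instance, a simple sanity check of the stated word budgets, reusing the hypothetical record layout sketched in Section 1, might look like this:

```python
# Word budgets per QA level, as stated above; field names follow the
# hypothetical record layout, not the released schema.
BUDGETS = {"physiological": (250, 350), "disease": (100, 150)}

def within_budget(sample: dict) -> bool:
    """True if every QA answer in a record respects its word budget."""
    for qa in sample["qa_pairs"]:
        lo, hi = BUDGETS[qa["level"]]
        n_words = len(qa["answer"].split())
        if not lo <= n_words <= hi:
            return False
    return True
```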

Disease Category Overview

| Major Category | Example ICD-10 Subtypes (see Table A.2) | Relative Frequency Reference |
|---|---|---|
| Coronary Artery Disease | I20, I21, I25, ... | Figure 1(b) |
| Acute Renal Failure | N17 | Figure 1(b) |
| Hypertension | I10, I11 | Figure 1(b) |
| Atrial Fibrillation | I48 | Figure 1(b) |
| Pneumonia | J18 | Figure 1(b) |
| Diabetes Mellitus | E11 | Figure 1(b) |
| Sepsis | A41 | Figure 1(b) |

A plausible implication is that the dataset’s coverage balances disease-level variety with multiclass, multimodal complexity.

5. Chain of Evidence Reasoning Format

A defining feature of MedTVT-QA is the adoption of a reasoning scaffold for disease-level QAs:

  • The <think> block captures explicit cross-modal synthesis steps: salient ECG features, CXR abnormalities, and LAB result correlations.
  • The <answer> block communicates the set of confirmed diagnoses, enumerated using only the ground-truth disease names.

This approach mandates that answers are both interpretable and clinically justified, training models not only to predict disease categories but also to articulate multimodal reasoning. It reflects an explicit combinatorial reasoning challenge aligned with contemporary research requirements for transparency and explainability in clinical AI.
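
A minimal parser for this scaffold, assuming the diagnoses in the <answer> block are comma-separated (an assumption; the dataset may delimit them differently):

```python
import re

def parse_chain_of_evidence(text: str) -> tuple[str, list[str]]:
    """Split a disease-level answer into its <think> evidence and <answer> diagnoses."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if think is None or answer is None:
        raise ValueError("response does not follow the <think>/<answer> scaffold")
    diagnoses = [d.strip() for d in answer.group(1).split(",") if d.strip()]
    return think.group(1).strip(), diagnoses

evidence, labels = parse_chain_of_evidence(
    "<think>ECG shows irregular RR intervals; CXR shows right lower lobe "
    "consolidation; LAB shows leukocytosis.</think>"
    "<answer>Atrial Fibrillation, Pneumonia</answer>"
)
print(labels)  # ['Atrial Fibrillation', 'Pneumonia']
```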

6. Evaluation Metrics and Reinforcement Alignment

Although primarily a dataset specification, MedTVT-QA was devised in conjunction with specific reward formulations within the MedTVT-R1 pipeline. Diagnoses are supervised with verifiable QA pairs and further optimized using reinforcement objectives:

  • Jaccard Reward: Given predicted and ground-truth label sets $L_C$ and $L_G$:

$$R_J(L_C, L_G) = \begin{cases} \dfrac{|L_C \cap L_G|}{|L_C \cup L_G|}, & |L_C \cup L_G| > 0 \\ 0, & \text{otherwise} \end{cases}$$

Together with a format reward $R_F$ (which scores adherence to the required output format), this enters a KL-regularized policy-optimization objective:

$$\max_{\theta} \; \mathbb{E}_{A\sim\pi_\theta(Q)}\Big[R(Q,A) - \beta\,\mathrm{KL}\big(\pi_\theta(A\mid Q)\,\|\,\pi_{\mathrm{ref}}(A\mid Q)\big)\Big],$$

where $R = R_F + R_J$ sums the format and Jaccard rewards.
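
A sketch of the combined reward under these definitions; the format-reward criterion (checking the <think>/<answer> scaffold) is an assumption, since the paper's exact format check is not reproduced here:

```python
import re

def jaccard_reward(pred: set[str], gold: set[str]) -> float:
    """R_J: intersection over union of predicted and ground-truth label sets."""
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0

def format_reward(text: str) -> float:
    """R_F stand-in: 1.0 if output follows the <think>/<answer> scaffold (assumed criterion)."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text, re.DOTALL) else 0.0

def total_reward(text: str, pred: set[str], gold: set[str]) -> float:
    return format_reward(text) + jaccard_reward(pred, gold)  # R = R_F + R_J

r = total_reward(
    "<think>evidence synthesis ...</think><answer>Sepsis, Pneumonia</answer>",
    pred={"Sepsis", "Pneumonia"},
    gold={"Sepsis"},
)
print(r)  # 1.0 (format) + 0.5 (Jaccard) = 1.5
```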

These formulations are integral not only to downstream model optimization but also guided the definition of verifiable QA targets during dataset construction.

7. Use Cases and Impact

MedTVT-QA serves as a resource for multiple clinical and technical tasks:

  • Training of Multimodal Medical LLMs: Enables end-to-end learning for both modality-specific interpretation and integrated multi-disease diagnosis.
  • Evaluation Benchmark: Facilitates rigorous benchmarking of physiology report generation, evidence-based disease prediction, and chain-of-evidence reasoning.
  • Downstream Clinical Scenarios: Applicability includes diagnostic report automation, comorbidity reasoning, risk stratification, and decision support systems requiring interpretability.

By imposing strict annotation, coverage across modalities and diseases, and verifiable reasoning structures, MedTVT-QA establishes a new standard for multi-evidence, multimodal medical NLP resources (Zhang et al., 23 Jun 2025).
