
MedTVT-QA: Multimodal Medical Reasoning Dataset

Updated 9 December 2025
  • MedTVT-QA is a multimodal dataset featuring expert-curated ECG, CXR, and LAB data for comprehensive medical reasoning.
  • It employs a Chain of Evidence approach, ensuring explicit cross-modal synthesis and rigorous expert validation.
  • The dataset supports training and benchmarking of MLLMs in clinical tasks such as multi-disease diagnosis and report automation.

MedTVT-QA is a large-scale, expert-curated multimodal instruction dataset designed for the development and evaluation of multimodal LLMs (MLLMs) in the context of medical reasoning and multi-disease diagnosis. Leveraging heterogeneous clinical data—including electrocardiograms (ECG), chest X-rays (CXR), and laboratory (LAB) test results—the dataset emphasizes both physiological-level interpretation and disease-level diagnostic reasoning. A Chain of Evidence approach underpins the dataset’s structure, ensuring explicit cross-modal evidence synthesis for every diagnostic conclusion. MedTVT-QA has been constructed as part of the MedTVT-R1 project to provide a foundation for training, benchmarking, and advancing MLLMs in clinically relevant scenarios (Zhang et al., 23 Jun 2025).

1. Cohort Construction and Dataset Scale

MedTVT-QA is assembled from three major sources: MIMIC-IV, MIMIC-IV-ECG, and MIMIC-CXR-JPG. Patient and study identifiers enable reliable alignment of multimodal data. For each patient episode, three temporal data windows are enforced to maintain clinical consistency:

  • ECGs from the first 24 hours of admission,
  • CXRs acquired within 24–72 hours,
  • LAB results from the first 24 hours.

After filtering to ensure temporal contiguity and data completeness, the resulting dataset comprises 8,706 multimodal records. These are divided into 8,331 samples for training and 375 for testing at the record (patient-time) level; there is no explicit validation split reported.
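
A minimal sketch of this temporal-window filter, assuming a pre-joined episode table with hypothetical timestamp columns (admit_time, ecg_time, cxr_time, lab_time); the actual MIMIC table joins and column names differ:

```python
from datetime import timedelta

import pandas as pd

def within_window(event_time, admit_time, lo_h, hi_h):
    """True if event_time falls within [admit_time + lo_h, admit_time + hi_h] hours."""
    delta = event_time - admit_time
    return timedelta(hours=lo_h) <= delta <= timedelta(hours=hi_h)

def keep_episode(row):
    """Enforce the three temporal windows described above."""
    return (
        within_window(row["ecg_time"], row["admit_time"], 0, 24)      # ECG: first 24 h
        and within_window(row["cxr_time"], row["admit_time"], 24, 72) # CXR: 24-72 h
        and within_window(row["lab_time"], row["admit_time"], 0, 24)  # LAB: first 24 h
    )

# "aligned_episodes.csv" is a hypothetical pre-joined table, not a released file.
episodes = pd.read_csv(
    "aligned_episodes.csv",
    parse_dates=["admit_time", "ecg_time", "cxr_time", "lab_time"],
)
dataset = episodes[episodes.apply(keep_episode, axis=1)]
print(f"{len(dataset)} records satisfy all three windows")
```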

Each sample yields four QA pairs: three at the physiological (modality-specific) level and one at the disease (integrated) level. In aggregate, MedTVT-QA provides 34,824 QA pairs, corresponding to 8,706 each for ECG-QA, CXR-QA, LAB-QA, and disease-level QA.
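
Concretely, a single record might be organized as follows; all field names are hypothetical, chosen only to illustrate the four-QA layout, and do not reflect the released schema:

```python
# Hypothetical layout of one MedTVT-QA record; field names are illustrative.
sample = {
    "record_id": "subject_00042_t0",
    "modalities": {"ecg": "ecg_00042.dat", "cxr": "cxr_00042.jpg", "lab": "lab_00042.csv"},
    "qa_pairs": [
        {"level": "physiological", "modality": "ecg", "question": "...", "answer": "..."},
        {"level": "physiological", "modality": "cxr", "question": "...", "answer": "..."},
        {"level": "physiological", "modality": "lab", "question": "...", "answer": "..."},
        {"level": "disease", "question": "...",
         "answer": "<think>...</think><answer>...</answer>"},
    ],
}
assert len(sample["qa_pairs"]) == 4  # 8,706 records x 4 QA pairs = 34,824 in total
```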

2. Structure and Annotation Pipeline

The QA pairs are generated and refined through a two-stage, expert-aided process:

  • Physiological QA: For each modality, a prompt (Role Setting → Task Description → Answer Guidance → Answer Format) is constructed. Automated GPT-4o generation is steered by key label inputs (e.g., "Sinus Rhythm," "LBBB") and templated questions. Generated outputs follow a three-part structure: introduction, label-wise interpretation, summary. Every automated answer is subsequently reviewed and revised by a medical expert to ensure accuracy and coherence.
  • Disease-Level QA: The Chain of Evidence prompt includes synthesized <ecg_report>, <cxr_report>, and <blood_test_report> sections, along with explicit ground-truth disease tags. Instructions mandate the citation of supporting evidence from each modality, and the diagnoses must use the exact provided disease labels. Answers are split into a <think> section (evidence synthesis) and an <answer> section (diagnosis list).

All responses are subject to human expert validation; no quantitative inter-annotator agreement metrics are reported, but the dual-stage process is designed for high clinical fidelity.
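
As a concrete illustration, here is a minimal sketch of the four-part prompt pattern (Role Setting → Task Description → Answer Guidance → Answer Format); the function name and phrasing are illustrative assumptions, not the exact prompts used for GPT-4o generation:

```python
# Hypothetical sketch of the four-part physiological-QA prompt.
def build_physio_prompt(modality: str, labels: list[str]) -> str:
    role = f"You are a clinical expert interpreting {modality} data."            # Role Setting
    task = f"Explain the clinical significance of: {', '.join(labels)}."         # Task Description
    guidance = "Structure the answer as introduction, label-wise interpretation, then summary."  # Answer Guidance
    fmt = "Answer in 250-350 words of continuous prose."                         # Answer Format
    return "\n".join([role, task, guidance, fmt])

print(build_physio_prompt("ECG", ["Sinus Rhythm", "LBBB"]))
```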

3. Representation of Modalities and Integration Protocol

MedTVT-QA encodes cross-modal clinical scenarios. Although modality encoding itself is handled by the encoder–projector alignment scheme of the companion MedTVT-R1 model, the dataset documentation defines the following conventions:

  • Modality placeholders <ecg>, <cxr>, <lab> in the QA prompts for model-inference compatibility.
  • Expected model-side encodings:
    • ECG: $f_E$ (encoder) followed by $g_E$ (projector), outputting $Z_E \in \mathbb{R}^d$.
    • CXR: $f_C$ and $g_C$, yielding $Z_C \in \mathbb{R}^d$.
    • LAB: $f_L$ and $g_L$, producing $Z_L \in \mathbb{R}^d$.
  • Disease-level QA items enforce direct inter-modal dependency, requiring explicit referencing of all three input sources to synthesize a diagnosis. This structure ensures that downstream models must combine physiological signals, radiographic evidence, and laboratory data.
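
A minimal structural sketch of the encoder–projector pattern above; the actual MedTVT-R1 encoders are modality-specific pretrained networks, so the linear stand-ins and layer sizes here are placeholder assumptions that only illustrate the $f \to g$ composition and the shared target dimension $d$:

```python
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """Encoder f followed by projector g, mapping raw input to Z in R^d."""

    def __init__(self, in_dim: int, enc_dim: int, d: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, enc_dim), nn.GELU())  # stand-in for f
        self.projector = nn.Linear(enc_dim, d)                               # stand-in for g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.projector(self.encoder(x))

# Placeholder input/embedding sizes; only the shared target dimension d matters here.
d = 768
ecg_branch = ModalityBranch(in_dim=5000, enc_dim=1024, d=d)  # Z_E
cxr_branch = ModalityBranch(in_dim=2048, enc_dim=1024, d=d)  # Z_C
lab_branch = ModalityBranch(in_dim=64, enc_dim=256, d=d)     # Z_L

z_e = ecg_branch(torch.randn(1, 5000))  # shape (1, 768)
```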

4. Data Coverage, Disease Taxonomy, and QA Characteristics

MedTVT-QA covers seven primary disease categories—Coronary Artery Disease, Acute Renal Failure, Hypertension, Atrial Fibrillation, Pneumonia, Diabetes Mellitus, and Sepsis—each decomposed into ICD-10-coded subtypes. Disease prevalence and detailed subcode counts are referenced in Figure 1(b) and Table A.2 of the dataset’s source publication.

Each sample combines three physiological QAs (ECG, CXR, LAB), each 250–350 words in length, and a single disease QA (100–150 words of synthesized evidence and diagnosis). The distribution of samples over diseases is stratified for balanced multimodal task representation. Every entry is subjected to structured annotation and professional validation.
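
For instance, a simple sanity check of the stated word budgets, reusing the hypothetical record layout sketched in Section 1, might look like this:

```python
# Word budgets per QA level, as stated above; field names follow the
# hypothetical record layout, not the released schema.
BUDGETS = {"physiological": (250, 350), "disease": (100, 150)}

def within_budget(sample: dict) -> bool:
    """True if every QA answer in a record respects its word budget."""
    for qa in sample["qa_pairs"]:
        lo, hi = BUDGETS[qa["level"]]
        n_words = len(qa["answer"].split())
        if not lo <= n_words <= hi:
            return False
    return True
```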

Disease Category Overview

| Major Category | Example ICD-10 Subtypes (see Table A.2) | Relative Frequency Reference |
|---|---|---|
| Coronary Artery Disease | I20, I21, I25, ... | Figure 1(b) |
| Acute Renal Failure | N17 | Figure 1(b) |
| Hypertension | I10, I11 | Figure 1(b) |
| Atrial Fibrillation | I48 | Figure 1(b) |
| Pneumonia | J18 | Figure 1(b) |
| Diabetes Mellitus | E11 | Figure 1(b) |
| Sepsis | A41 | Figure 1(b) |

A plausible implication is that the dataset’s coverage balances disease-level variety with multiclass, multimodal complexity.

5. Chain of Evidence Reasoning Format

A defining feature of MedTVT-QA is the adoption of a reasoning scaffold for disease-level QAs:

  • The <think> block captures explicit cross-modal synthesis steps: salient ECG features, CXR abnormalities, and LAB result correlations.
  • The <answer> block communicates the set of confirmed diagnoses, enumerated using only the ground-truth disease names.

This approach mandates that answers are both interpretable and clinically justified, training models not only to predict disease categories but also to articulate multimodal reasoning. It reflects an explicit combinatorial reasoning challenge aligned with contemporary research requirements for transparency and explainability in clinical AI.
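
A minimal parser for this scaffold, assuming the diagnoses in the <answer> block are comma-separated (an assumption; the dataset may delimit them differently):

```python
import re

def parse_chain_of_evidence(text: str) -> tuple[str, list[str]]:
    """Split a disease-level answer into its <think> evidence and <answer> diagnoses."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if think is None or answer is None:
        raise ValueError("response does not follow the <think>/<answer> scaffold")
    diagnoses = [d.strip() for d in answer.group(1).split(",") if d.strip()]
    return think.group(1).strip(), diagnoses

evidence, labels = parse_chain_of_evidence(
    "<think>ECG shows irregular RR intervals; CXR shows right lower lobe "
    "consolidation; LAB shows leukocytosis.</think>"
    "<answer>Atrial Fibrillation, Pneumonia</answer>"
)
print(labels)  # ['Atrial Fibrillation', 'Pneumonia']
```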

6. Evaluation Metrics and Reinforcement Alignment

Although primarily a dataset specification, MedTVT-QA was devised in conjunction with specific reward formulations within the MedTVT-R1 pipeline. Diagnoses are supervised with verifiable QA pairs and further optimized using reinforcement objectives:

  • Jaccard Reward: Given predicted and ground-truth label sets $L_C$ and $L_G$:

$$R_J(L_C, L_G) = \begin{cases} \dfrac{|L_C \cap L_G|}{|L_C \cup L_G|}, & |L_C \cup L_G| > 0 \\ 0, & \text{otherwise} \end{cases}$$

Together with a format reward $R_F$ (which scores adherence to the required output format), this enters a KL-regularized policy-optimization objective:

$$\max_{\theta} \; \mathbb{E}_{A\sim\pi_\theta(Q)}\Big[R(Q,A) - \beta\,\mathrm{KL}\big(\pi_\theta(A\mid Q)\,\|\,\pi_{\mathrm{ref}}(A\mid Q)\big)\Big],$$

where $R = R_F + R_J$ sums the format and Jaccard rewards.
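
A sketch of the combined reward under these definitions; the format-reward criterion (checking the <think>/<answer> scaffold) is an assumption, since the paper's exact format check is not reproduced here:

```python
import re

def jaccard_reward(pred: set[str], gold: set[str]) -> float:
    """R_J: intersection over union of predicted and ground-truth label sets."""
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0

def format_reward(text: str) -> float:
    """R_F stand-in: 1.0 if output follows the <think>/<answer> scaffold (assumed criterion)."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text, re.DOTALL) else 0.0

def total_reward(text: str, pred: set[str], gold: set[str]) -> float:
    return format_reward(text) + jaccard_reward(pred, gold)  # R = R_F + R_J

r = total_reward(
    "<think>evidence synthesis ...</think><answer>Sepsis, Pneumonia</answer>",
    pred={"Sepsis", "Pneumonia"},
    gold={"Sepsis"},
)
print(r)  # 1.0 (format) + 0.5 (Jaccard) = 1.5
```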

These formulations are integral not only to downstream model optimization but also guided the definition of verifiable QA targets during dataset construction.

7. Use Cases and Impact

MedTVT-QA serves as a resource for multiple clinical and technical tasks:

  • Training of Multimodal Medical LLMs: Enables end-to-end learning for both modality-specific interpretation and integrated multi-disease diagnosis.
  • Evaluation Benchmark: Facilitates rigorous benchmarking of physiology report generation, evidence-based disease prediction, and chain-of-evidence reasoning.
  • Downstream Clinical Scenarios: Applicability includes diagnostic report automation, comorbidity reasoning, risk stratification, and decision support systems requiring interpretability.

By imposing strict annotation, coverage across modalities and diseases, and verifiable reasoning structures, MedTVT-QA establishes a new standard for multi-evidence, multimodal medical NLP resources (Zhang et al., 23 Jun 2025).
