
MedTVT-R1: Multimodal Diagnostic Model

Updated 9 December 2025
  • MedTVT-R1 is a multimodal large language model combining ECG, CXR, and LAB data to provide interpretable, evidence-based diagnostic reasoning.
  • The model employs specialized encoders and novel modules like CMHA and CAO to fuse heterogeneous clinical data into a unified embedding space.
  • Reinforcement fine-tuning improves multi-disease prediction accuracy, as demonstrated by extensive empirical evaluations and ablation studies.

MedTVT-R1 is a Multimodal LLM (MLLM) specifically designed for interpretable, multi-disease diagnosis in clinical settings utilizing heterogeneous multimodal data. Unlike traditional single-modal diagnostic frameworks that struggle to synthesize the complexity inherent in real-world patient data, MedTVT-R1 integrates time-series electrocardiograms (ECG), chest X-ray images (CXR), and tabular laboratory results (LAB) to generate long-form, evidence-based diagnostic reasoning and disease prediction (Zhang et al., 23 Jun 2025). The model’s architecture, dataset design, reinforcement fine-tuning strategy, empirical validation, clinical deployment scenarios, and stated limitations collectively define a new paradigm for data-centric medical reasoning.

1. Model Architecture

MedTVT-R1 processes three modalities: ECG signals, CXR images, and LAB tabular tests, using specialized encoders and a unified embedding space. The ECG encoder $f_E$ leverages a pretrained time-series backbone (ECGFM-KED), the CXR encoder $f_C$ utilizes a Vision Transformer (ViT-B/16), and the LAB encoder $f_L$ incorporates a tabular feature extractor (from Symile). Each output is projected via modality-specific dense projectors $g_E$, $g_C$, $g_L$ to a shared dimension $d$.

$$\mathbf{Z}_E = g_E(f_E(\mathbf{X}_E)),\quad \mathbf{Z}_C = g_C(f_C(\mathbf{X}_C)),\quad \mathbf{Z}_L = g_L(f_L(\mathbf{X}_L)),\quad \mathbf{Z}_{E/C/L} \in \mathbb{R}^{d}$$

A Modality Perception Layer (MPL) enables cross-modal interaction and adaptive modality weighting. The MPL includes:

  • Cyclic Multi-Head Attention (CMHA): Each modality cyclically acts as Query, Key, Value in multi-head attention; outputs are averaged and residually added:

$$\mathbf{F} = \mathrm{AveragePooling}(\mathrm{CMHA}(\mathbf{Z}_E, \mathbf{Z}_C, \mathbf{Z}_L)),\quad \mathbf{M}_m = \mathbf{Z}_m + \mathbf{F}$$

  • Contribution-Aware Operator (CAO): Sigmoid-gated weights per modality are computed from concatenated updated features:

$$[\alpha_E, \alpha_C, \alpha_L] = \sigma(h[\mathbf{M}_E : \mathbf{M}_C : \mathbf{M}_L]),\quad \mathbf{T}_m = \alpha_m \odot \mathbf{M}_m$$

Features $\mathbf{T}_E$, $\mathbf{T}_C$, $\mathbf{T}_L$ are then injected at placeholder positions (<ecg>, <cxr>, <lab>) in the token sequence for the LLM backbone (LLaMA-3.2-1B, adapted via LoRA). The output consists of two main blocks: "think" (Chain of Evidence reasoning) and "answer" (predicted disease list).
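The Modality Perception Layer described above can be sketched numerically. The following NumPy toy uses a single attention head, mean-pooled token features, and a hypothetical linear gating layer `h`; all shapes and the gating layer are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head for brevity).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8                                  # shared embedding dim (2048 in the paper)
rng = np.random.default_rng(0)
Z_E, Z_C, Z_L = (rng.standard_normal((4, d)) for _ in range(3))  # 4 tokens/modality

# CMHA: each modality cyclically serves as Query, Key, Value; outputs averaged.
F = np.mean([attention(Z_E, Z_C, Z_L),
             attention(Z_C, Z_L, Z_E),
             attention(Z_L, Z_E, Z_C)], axis=0)

# Residual addition per modality: M_m = Z_m + F.
M_E, M_C, M_L = Z_E + F, Z_C + F, Z_L + F

# CAO: sigmoid-gated weights from the concatenated (mean-pooled) updated features.
h = rng.standard_normal((3 * d, 3)) * 0.01       # illustrative gating layer
concat = np.concatenate([M_E.mean(0), M_C.mean(0), M_L.mean(0)])
alpha = sigmoid(concat @ h)                      # [alpha_E, alpha_C, alpha_L]
T_E, T_C, T_L = alpha[0] * M_E, alpha[1] * M_C, alpha[2] * M_L
```

The cyclic assignment ensures every modality attends over the other two in turn, while the gate `alpha` lets the model down-weight a less informative modality before injection into the LLM.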

2. MedTVT-QA Dataset Construction

MedTVT-QA is a multimodal instruction dataset curated for both physiological interpretation and disease diagnosis tasks. It derives data from MIMIC-IV (LAB, diagnoses), MIMIC-IV-ECG (12-lead waveforms), MIMIC-CXR-JPG (images), and MIMIC-IV-ECG-EXT-ICD (ICD-10 codes). Dataset extraction yields 8,706 patient-timepoint triplets, partitioned into 8,331 training and 375 test samples, covering seven disease categories: Coronary Artery Disease, Acute Renal Failure, Hypertension, Atrial Fibrillation, Pneumonia, Diabetes Mellitus, and Sepsis.

Annotation is performed in two workflows:

  1. Physiological QA: GPT-4o converts the raw annotations for each modality into expert-reviewed explanatory reports of more than 300 words per modality.
  2. Disease-Level Chain of Evidence QA: The three reports and ground-truth disease sets are synthesized using strict prompts to produce a "think" block (cross-modal evidence) and an "answer" block (definitive diagnosis).

Representative QA samples include explicit identification of diagnostic support from individual modalities and cross-modal synthesis that yields interpretable clinical reasoning.

3. Reinforcement Fine-Tuning: GRPO and Jaccard Reward

After supervised fine-tuning, MedTVT-R1 adopts a reinforcement fine-tuning paradigm, specifically Group Relative Policy Optimization (GRPO), to enhance multi-disease prediction robustness. For a prompt $\mathbf{Q}$, a group of $G$ candidate outputs is generated and individually scored:

$$\hat{r}_i = \frac{r_i - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the group rewards, with $r_i = R(\mathbf{Q}, o_i)$.
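This group-relative normalization amounts to a per-group z-score over the $G$ candidate rewards. A minimal sketch (the `eps` term is an assumed numerical safeguard, not from the paper):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize G candidate rewards by their group mean and std (GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four hypothetical candidate rewards for one prompt.
adv = group_relative_advantages([0.2, 0.5, 0.5, 1.0])
# Advantages center on zero: candidates above the group mean get positive weight.
```

Because the baseline is the group's own mean, no learned value function is needed; candidates are ranked only against their siblings for the same prompt.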

The GRPO objective maximizes:

$$\mathbb{E}_{o \sim \pi_\theta(\cdot \mid \mathbf{Q})}\left[\hat{r}(o)\right] - \beta\,\mathrm{KL}\left[\pi_\theta(\cdot \mid \mathbf{Q}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid \mathbf{Q})\right]$$

A Jaccard reward $R_J$ incentivizes set-level overlap between the predicted disease labels $L_C$ and the ground-truth labels $L_G$:

$$R_J(L_C, L_G) = \begin{cases} \dfrac{|L_C \cap L_G|}{|L_C \cup L_G|} & \text{if } |L_C \cup L_G| > 0 \\[4pt] 0 & \text{otherwise} \end{cases}$$

The total reward also includes a format penalty $R_F$ for outputs missing the required tags. Optimizing the Jaccard reward directly improves both recall and precision for multi-disease reasoning.
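The reward terms can be illustrated with plain Python sets. The combination with the format penalty below is a hypothetical sketch (the paper's exact penalty value and combination rule are not specified here):

```python
def jaccard_reward(predicted, ground_truth):
    """Set-level Jaccard overlap between predicted and true disease labels."""
    p, g = set(predicted), set(ground_truth)
    union = p | g
    return len(p & g) / len(union) if union else 0.0

def total_reward(output_text, predicted, ground_truth, format_penalty=-1.0):
    # Hypothetical combination: penalize outputs missing the required tags.
    has_tags = "<think>" in output_text and "<answer>" in output_text
    r = jaccard_reward(predicted, ground_truth)
    return r if has_tags else r + format_penalty

r = jaccard_reward({"Hypertension", "Pneumonia"}, {"Hypertension", "Sepsis"})
# |{Hypertension}| / |{Hypertension, Pneumonia, Sepsis}| = 1/3
```

Unlike per-label accuracy, the Jaccard reward penalizes both over-prediction (inflating the union) and under-prediction (shrinking the intersection) in a single scalar.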

4. Empirical Evaluation and Ablations

MedTVT-R1’s training uses 8× NVIDIA A800 80GB GPUs. The model backbone is LLaMA-3.2-1B (LoRA rank 8), and encoders include ECGFM-KED, ViT-B/16, and Symile, with projector output dimension $d = 2048$. Training stages comprise 20 epochs of physiological-level pre-training, 20 epochs of supervised disease-level fine-tuning, and 500 iterations of RFT with group size $G = 8$.

MedTVT-R1 is benchmarked against eight MLLM baselines of comparable size. Metrics include NLG quality (BLEU, METEOR, ROUGE, BERTScore) and multi-label clinical efficacy (Precision, Recall, F1, AUC):

| Model     | BLEU   | METEOR | ROUGE  | BERTScore | Precision | Recall | F1     | AUC    |
|-----------|--------|--------|--------|-----------|-----------|--------|--------|--------|
| MedTVT-R1 | 0.1353 | 0.3536 | 0.2295 | 0.8652    | 0.5407    | 0.5908 | 0.5190 | 0.6554 |

MedTVT-R1 improves F1 by >0.32 and AUC by >0.15 over best baselines (e.g., InternVL3-1B, Qwen2.5-3B). Ablation studies demonstrate:

  • Removing physiological pre-training lowers F1 to 0.4672.
  • Removing RFT lowers F1 to 0.4992.
  • Excluding either CMHA or CAO reduces METEOR by 0.015–0.016 and F1 by 0.022–0.032.
  • Single-modality dropout during pre-training lowers F1 by ≥0.015, with maximum degradation on ECG removal.

On physiological QA (ECG, CXR, LAB), MedTVT-R1’s LAB-QA METEOR is 0.3827 versus baseline’s 0.2058.

5. Clinical Implementation and Reasoning Examples

MedTVT-R1 produces interpretable, evidence-tagged outputs applicable as clinical report drafts or comorbidity reasoning aids in EHR workflows. A representative multimodal reasoning scenario:

  • ECG: left ventricular hypertrophy substantiates Hypertension.
  • CXR: interstitial opacities suggest Pneumonia.
  • LAB: elevated WBC, altered pCO₂/pH support infectious diagnosis.

Model output format:

<think>ECG shows LVH (hypertensive change); CXR shows enlarged silhouette; WBC↑ supports infection…</think>
<answer>Hypertension; Pneumonia</answer>

This structure enables transparent attribution of diagnostic conclusions to observed physiological evidence.
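Downstream tooling that consumes such outputs must recover the two blocks. A minimal parsing sketch; the function name, the semicolon-delimited answer convention (taken from the example above), and the error handling are assumptions:

```python
import re

def parse_diagnosis(output: str):
    """Extract the reasoning text and the predicted disease list
    from a <think>/<answer>-formatted model output."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    if not (think and answer):
        return None  # malformed output (would incur the format penalty)
    diseases = [d.strip() for d in answer.group(1).split(";") if d.strip()]
    return think.group(1).strip(), diseases

out = ("<think>ECG shows LVH; WBC elevated.</think>"
       "<answer>Hypertension; Pneumonia</answer>")
reasoning, diseases = parse_diagnosis(out)
# diseases == ["Hypertension", "Pneumonia"]
```

Keeping the reasoning and answer in separate, machine-parsable tags is what allows each predicted disease to be audited against the cited physiological evidence.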

6. Limitations, Prospective Extensions, and Resources

MedTVT-R1 is constrained by limited availability of large-scale, temporally aligned multimodal triplets, potentially limiting cross-institutional generalizability. The current implementation includes only ECG, CXR, and LAB modalities; inclusion of other sources (clinical notes, genomics, vital signs) is anticipated to augment diagnostic coverage.

Proposed future directions include:

  • Expanding to additional modalities (clinical text, ultrasound, genomics).
  • Cohort enlargement with tighter temporal synchronization.
  • Development of causal interpretability modules for clinical decision support.

Complete datasets, codebase, and weights are made available at https://github.com/keke-nice/MedTVT-R1 (Zhang et al., 23 Jun 2025).


MedTVT-R1 establishes a rigorous multimodal reasoning pipeline, validated by extensive empirical studies and ablation analysis, with clear interpretability and extensibility for diverse diagnostic applications.
