MedTVT-R1: Multimodal Diagnostic Model
- MedTVT-R1 is a multimodal large language model combining ECG, CXR, and LAB data to provide interpretable, evidence-based diagnostic reasoning.
- The model employs specialized encoders and novel modules like CMHA and CAO to fuse heterogeneous clinical data into a unified embedding space.
- Reinforcement fine-tuning improves multi-disease prediction accuracy, as demonstrated by extensive empirical evaluations and ablation studies.
MedTVT-R1 is a Multimodal LLM (MLLM) specifically designed for interpretable, multi-disease diagnosis in clinical settings utilizing heterogeneous multimodal data. Unlike traditional single-modal diagnostic frameworks that struggle to synthesize the complexity inherent in real-world patient data, MedTVT-R1 integrates time-series electrocardiograms (ECG), chest X-ray images (CXR), and tabular laboratory results (LAB) to generate long-form, evidence-based diagnostic reasoning and disease prediction (Zhang et al., 23 Jun 2025). The model’s architecture, dataset design, reinforcement fine-tuning strategy, empirical validation, clinical deployment scenarios, and stated limitations collectively define a new paradigm for data-centric medical reasoning.
1. Model Architecture
MedTVT-R1 processes three modalities, ECG signals, CXR images, and LAB tabular tests, using specialized encoders and a unified embedding space. The ECG encoder leverages a pretrained time-series backbone (ECGFM-KED), the CXR encoder utilizes a Vision Transformer (ViT-B/16), and the LAB encoder incorporates a tabular feature extractor (from Symile). Each encoder's output is projected via a modality-specific dense projector to a shared embedding dimension d.
A Modality Perception Layer (MPL) enables cross-modal interaction and adaptive modality weighting. The MPL includes:
- Cyclic Multi-Head Attention (CMHA): each modality cyclically serves as Query, Key, and Value in multi-head attention; the per-modality attention outputs are averaged and added residually to the original features.
- Contribution-Aware Operator (CAO): sigmoid-gated weights for each modality are computed from the concatenation of the updated features, adaptively scaling each modality's contribution.
The weighted modality features are then injected as placeholder tokens (<ecg>, <cxr>, <lab>) into the token sequence of the LLM backbone (LLaMA-3.2-1B, adapted via LoRA). The output consists of two main blocks: a "think" block (Chain-of-Evidence reasoning) and an "answer" block (the predicted disease list).
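The Modality Perception Layer described above can be sketched as follows. This is a toy, single-head NumPy approximation of CMHA and CAO under stated assumptions: the real modules use learned multi-head attention and a trained gating layer, whereas here the gate matrix `W` and all features are random stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention (single head)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def cyclic_attention(feats):
    """CMHA sketch: each modality attends over the others in cyclic order;
    the outputs are averaged and added residually to the original features."""
    names = list(feats)
    out = {}
    for i, m in enumerate(names):
        partials = []
        for step in range(1, len(names)):
            other = feats[names[(i + step) % len(names)]]
            partials.append(attend(feats[m], other, other))
        out[m] = feats[m] + np.mean(partials, axis=0)  # residual connection
    return out

def contribution_weights(updated, W):
    """CAO sketch: sigmoid-gated per-modality weights computed from the
    concatenated (mean-pooled) updated features. W stands in for a learned gate."""
    concat = np.concatenate([u.mean(axis=0) for u in updated.values()])
    return 1.0 / (1.0 + np.exp(-(W @ concat)))

# Toy usage: three modalities with different token counts, shared dim d = 8.
rng = np.random.default_rng(0)
d = 8
feats = {"ecg": rng.standard_normal((12, d)),
         "cxr": rng.standard_normal((16, d)),
         "lab": rng.standard_normal((5, d))}
updated = cyclic_attention(feats)
weights = contribution_weights(updated, rng.standard_normal((3, 3 * d)) * 0.1)
weighted = {m: w * u for (m, u), w in zip(updated.items(), weights)}
```

Note that the cyclic scheme keeps each modality's token count intact (attention only mixes information across modalities), so the gated features can be dropped into the LLM token sequence at their placeholder positions.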
2. MedTVT-QA Dataset Construction
MedTVT-QA is a multimodal instruction dataset curated for both physiological interpretation and disease diagnosis tasks. It derives data from MIMIC-IV (LAB, diagnoses), MIMIC-IV-ECG (12-lead waveforms), MIMIC-CXR-JPG (images), and MIMIC-IV-ECG-EXT-ICD (ICD-10 codes). Dataset extraction yields 8,706 patient-timepoint triplets, partitioned into 8,331 training and 375 test samples, covering seven disease categories: Coronary Artery Disease, Acute Renal Failure, Hypertension, Atrial Fibrillation, Pneumonia, Diabetes Mellitus, and Sepsis.
Annotation is performed in two workflows:
- Physiological QA: GPT-4o converts raw modality annotations into >300-word expert-reviewed explanatory reports per modality.
- Disease-Level Chain of Evidence QA: The three reports and ground-truth disease sets are synthesized using strict prompts to produce a "think" block (cross-modal evidence) and an "answer" block (definitive diagnosis).
Representative QA samples include explicit identification of diagnostic support from individual modalities and cross-modal synthesis that yields interpretable clinical reasoning.
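A disease-level Chain-of-Evidence record of the kind described above can be sketched as a simple assembly step. The field names, prompt text, and report strings below are illustrative assumptions, not the dataset's actual schema:

```python
def build_coe_sample(ecg_report, cxr_report, lab_report, diseases):
    """Assemble a disease-level Chain-of-Evidence QA record from the three
    per-modality reports and the ground-truth disease set (illustrative only)."""
    question = ("Given the patient's ECG, CXR, and LAB findings, reason across "
                "modalities and list the supported diagnoses.")
    context = {"ecg": ecg_report, "cxr": cxr_report, "lab": lab_report}
    # Target follows the model's two-block output format.
    answer = ("<think>" + " ".join(context.values()) + "</think>"
              "<answer>" + "; ".join(sorted(diseases)) + "</answer>")
    return {"question": question, "context": context, "answer": answer}

sample = build_coe_sample(
    "ECG shows atrial fibrillation with rapid ventricular response.",
    "CXR shows right lower lobe consolidation.",
    "WBC elevated at 15.2 K/uL.",
    {"Atrial Fibrillation", "Pneumonia"},
)
```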
3. Reinforcement Fine-Tuning: GRPO and Jaccard Reward
Post-supervised fine-tuning, MedTVT-R1 adopts a reinforcement fine-tuning paradigm, specifically Group Relative Policy Optimization (GRPO), to enhance multi-disease prediction robustness. For a prompt q, a group of G candidate outputs {o₁, …, o_G} is sampled and each is individually scored with a reward rᵢ. The group-normalized advantage is

  Âᵢ = (rᵢ − μ) / σ,

where μ and σ are the mean and standard deviation of the group rewards {r₁, …, r_G}.

The GRPO objective maximizes the clipped surrogate

  J(θ) = E[(1/G) Σᵢ min(ρᵢ Âᵢ, clip(ρᵢ, 1 − ε, 1 + ε) Âᵢ)] − β D_KL(π_θ ‖ π_ref),

with importance ratio ρᵢ = π_θ(oᵢ | q) / π_old(oᵢ | q).

A Jaccard Reward incentivizes set-level overlap between the predicted disease set D̂ and the ground-truth set D:

  R_Jaccard = |D̂ ∩ D| / |D̂ ∪ D|.
Total reward includes a format penalty for outputs missing required tags. Optimizing Jaccard directly improves both recall and precision for multi-disease reasoning.
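The reward machinery above reduces to a few set and statistics operations. A minimal sketch, assuming a format-penalty magnitude of 1.0 (the paper's exact penalty value is not given in this summary):

```python
import statistics

def jaccard_reward(pred, gold):
    """Set-level Jaccard overlap between predicted and ground-truth diseases."""
    p, g = set(pred), set(gold)
    return len(p & g) / len(p | g) if (p | g) else 1.0

def format_penalty(text, penalty=1.0):
    """Penalize outputs missing the required <think>/<answer> tags.
    The penalty magnitude is an assumption."""
    tags = ("<think>", "</think>", "<answer>", "</answer>")
    return 0.0 if all(t in text for t in tags) else -penalty

def group_advantages(rewards):
    """GRPO group-normalized advantages: (r_i - mean) / std over the group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / sigma if sigma else 0.0 for r in rewards]

# One correct of two predicted, one gold missed: |{P}| / |{H, P}| = 1/2.
r = jaccard_reward(["Hypertension", "Pneumonia"], ["Pneumonia"])
adv = group_advantages([0.5, 1.0, 0.0, 0.5])
```

Because the Jaccard reward penalizes both spurious predictions (which grow the union) and missed diseases (which shrink the intersection), optimizing it pushes precision and recall up together.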
4. Empirical Evaluation and Ablations
MedTVT-R1’s training uses 8× NVIDIA A800 80GB GPUs. The model backbone is LLaMA-3.2-1B (LoRA rank 8), and encoders include ECGFM-KED, ViT-B/16, and Symile, with a shared projector output dimension d. Training stages comprise 20 epochs of physiological-level pre-training, 20 epochs of supervised disease-level fine-tuning, and 500 iterations of RFT.
MedTVT-R1 is benchmarked against eight comparably sized MLLM baselines. Metrics include NLG (BLEU, METEOR, ROUGE, BERTScore) and multi-label clinical efficacy (Precision, Recall, F1, AUC):
| Model/Metric | BLEU | METEOR | ROUGE | BERTScore | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|---|---|---|
| MedTVT-R1 | 0.1353 | 0.3536 | 0.2295 | 0.8652 | 0.5407 | 0.5908 | 0.5190 | 0.6554 |
MedTVT-R1 improves F1 by >0.32 and AUC by >0.15 over best baselines (e.g., InternVL3-1B, Qwen2.5-3B). Ablation studies demonstrate:
- Removing physiological pre-training lowers F1 to 0.4672.
- Removing RFT lowers F1 to 0.4992.
- Excluding either CMHA or CAO reduces METEOR by 0.015–0.016 and F1 by 0.022–0.032.
- Single-modality dropout during pre-training lowers F1 by ≥0.015, with maximum degradation on ECG removal.
On physiological QA (ECG, CXR, LAB), MedTVT-R1’s LAB-QA METEOR is 0.3827 versus baseline’s 0.2058.
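The multi-label clinical-efficacy metrics reported above can be computed per sample over disease sets. This sketch uses example-based averaging, which is one common convention; the paper's exact averaging scheme (micro vs. macro) is not specified in this summary:

```python
def prf(pred, gold):
    """Sample-level multi-label precision, recall, and F1 over disease sets."""
    p, g = set(pred), set(gold)
    tp = len(p & g)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_prf(pairs):
    """Average the per-sample scores across the test set."""
    scores = [prf(p, g) for p, g in pairs]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Two toy test samples: one over-prediction, one missed disease.
pairs = [(["Hypertension", "Sepsis"], ["Hypertension"]),
         (["Pneumonia"], ["Pneumonia", "Diabetes Mellitus"])]
P, R, F = macro_prf(pairs)
```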
5. Clinical Implementation and Reasoning Examples
MedTVT-R1 produces interpretable, evidence-tagged outputs applicable as clinical report drafts or comorbidity reasoning aids in EHR workflows. A representative multimodal reasoning scenario:
- ECG: left ventricular hypertrophy substantiates Hypertension.
- CXR: interstitial opacities suggest Pneumonia.
- LAB: elevated WBC, altered pCO₂/pH support infectious diagnosis.
Model output format:
```
<think>ECG shows LVH (hypertensive change); CXR shows enlarged silhouette; WBC↑ supports infection…</think>
<answer>Hypertension; Pneumonia</answer>
```
This structure enables transparent attribution of diagnostic conclusions to observed physiological evidence.
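For downstream EHR integration, this tagged format is straightforward to parse. A minimal sketch, assuming diseases in the answer block are semicolon-separated as in the example above:

```python
import re

def parse_diagnosis(text):
    """Split a model response into its reasoning trace and the predicted
    disease list, following the <think>/<answer> output format."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    diseases = ([d.strip() for d in answer.group(1).split(";") if d.strip()]
                if answer else [])
    return reasoning, diseases

reasoning, diseases = parse_diagnosis(
    "<think>ECG shows LVH; WBC elevated.</think>"
    "<answer>Hypertension; Pneumonia</answer>")
```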
6. Limitations, Prospective Extensions, and Resources
MedTVT-R1 is constrained by limited availability of large-scale, temporally aligned multimodal triplets, potentially limiting cross-institutional generalizability. The current implementation includes only ECG, CXR, and LAB modalities; inclusion of other sources (clinical notes, genomics, vital signs) is anticipated to augment diagnostic coverage.
Proposed future directions include:
- Expanding to additional modalities (clinical text, ultrasound, genomics).
- Cohort enlargement with tighter temporal synchronization.
- Development of causal interpretability modules for clinical decision support.
Complete datasets, codebase, and weights are made available at https://github.com/keke-nice/MedTVT-R1 (Zhang et al., 23 Jun 2025).
MedTVT-R1 establishes a rigorous multimodal reasoning pipeline, validated by extensive empirical studies and ablation analysis, with clear interpretability and extensibility for diverse diagnostic applications.