MedTVT-R1: Multimodal Clinical Inference
- MedTVT-R1 is a multimodal large language model that integrates ECG, CXR, and LAB data for comprehensive clinical inference and multi-disease diagnosis.
- It employs a sophisticated fusion mechanism with cyclic multi-head attention and a contribution-aware operator, complemented by reinforcement fine-tuning using GRPO and a Jaccard reward.
- Experimental results show that MedTVT-R1 improves diagnostic accuracy and factual reasoning over previous multimodal LLMs, reaching F1 = 0.5190 and AUC = 0.6554 on disease-level QA, where the compared baselines stay below 0.20 and 0.51 respectively.
MedTVT-R1 is a Multimodal LLM (MLLM) framework for clinical medical inference and multi-disease diagnosis, specifically designed to integrate heterogeneous data sources—electrocardiogram (ECG) time-series, chest X-ray (CXR) images, and laboratory (LAB) tabular measurements. The project comprises a carefully designed model architecture, the MedTVT-QA instruction dataset for physiological interpretation and chain-of-evidence (CoE) multi-label diagnosis, and a reinforcement fine-tuning regimen incorporating Group Relative Policy Optimization (GRPO) with a Jaccard similarity reward. MedTVT-R1 yields improved factual reasoning and diagnostic accuracy on multimodal clinical QA tasks compared to prior MLLMs (Zhang et al., 23 Jun 2025).
1. Model Architecture and Multimodal Fusion
The MedTVT-R1 framework consists of dedicated modality encoders, projectors, a modality perception layer (MPL) for cross-modal fusion, and an LLM backbone. Each input (ECG sequences, CXR images, and LAB tabular features) is processed by a domain-specific pre-trained encoder. The encoders used are ECGFM-KED for ECG, ViT-B/16 for CXR, and Symile for LAB data. Their outputs are subsequently mapped to a unified embedding space via one Dense block projector per modality.
The MPL performs adaptive inter-modal fusion in two stages:
- Cyclic Multi-Head Attention (CMHA): The three modality embeddings are cyclically assigned the roles of Query, Key, and Value in a transformer-style attention mechanism. The resulting attention outputs are average pooled, yielding a fused vector that is added residually to each modality embedding.
- Contribution-Aware Operator (CAO): This module produces context-dependent modality weights using a sigmoid gate. The reweighted embeddings replace modality placeholders in the input tokens for the LLM.
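The two stages above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head attention without learned projections (the source describes multi-head attention), assumes all three modalities share the same token count and embedding width after projection, and reads "average pooled" as pooling over tokens and then over the three cyclic role assignments. The function names and the gate parameters `w` and `b` are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head scaled dot-product attention (simplification of multi-head)."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def mean_pool(x):
    """Average a token sequence into one vector."""
    return [sum(col) / len(x) for col in zip(*x)]

def add_residual(x, fused):
    """Add the fused vector to every token of one modality."""
    return [[a + f for a, f in zip(tok, fused)] for tok in x]

def cyclic_fusion(ecg, cxr, lab):
    """Rotate the (Query, Key, Value) roles across the three modalities."""
    rotations = [(ecg, cxr, lab), (cxr, lab, ecg), (lab, ecg, cxr)]
    pooled = [mean_pool(attention(q, k, v)) for q, k, v in rotations]
    fused = [sum(vals) / len(pooled) for vals in zip(*pooled)]
    return add_residual(ecg, fused), add_residual(cxr, fused), add_residual(lab, fused)

def contribution_gate(x, w, b):
    """Sigmoid gate: one scalar weight per modality (w, b are hypothetical parameters)."""
    z = sum(wi * pi for wi, pi in zip(w, mean_pool(x))) + b
    weight = 1.0 / (1.0 + math.exp(-z))
    return weight, [[weight * a for a in tok] for tok in x]
```

In this sketch each modality is a list of token vectors; `cyclic_fusion` returns the three residually updated sequences, and `contribution_gate` would then be applied to each of them before they are substituted into the LLM input.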
The LLM backbone is LLaMA-3.2-1B augmented with Low-Rank Adaptation (LoRA) layers (rank 8) for parameter-efficient fine-tuning.
2. MedTVT-QA Dataset and Chain of Evidence Prompts
MedTVT-QA is a curated instruction dataset comprising 8,706 multimodal records (8,331 train, 375 test), each consisting of temporally aligned ECG (first 24 hours post-admission), LAB (first 24 hours), and CXR (24–72 hours) extracted from MIMIC-IV and MIMIC-CXR, with ICD-10 multi-label diagnoses spanning seven disease categories.
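The temporal alignment rule can be illustrated with a small filter. This is a sketch only: it assumes that all three windows are measured from the admission time and that the window boundaries are inclusive, neither of which the description above states explicitly. The function and argument names are hypothetical.

```python
from datetime import datetime, timedelta

def in_window(admit_time, event_time, start_hours, end_hours):
    """True if the event falls within [start_hours, end_hours] after admission."""
    delta = event_time - admit_time
    return timedelta(hours=start_hours) <= delta <= timedelta(hours=end_hours)

def is_aligned(admit_time, ecg_time, lab_time, cxr_time):
    """Check the three windows: ECG 0-24 h, LAB 0-24 h, CXR 24-72 h."""
    return (
        in_window(admit_time, ecg_time, 0, 24)
        and in_window(admit_time, lab_time, 0, 24)
        and in_window(admit_time, cxr_time, 24, 72)
    )

admit = datetime(2020, 1, 1, 8, 0)
print(is_aligned(
    admit,
    ecg_time=admit + timedelta(hours=3),
    lab_time=admit + timedelta(hours=10),
    cxr_time=admit + timedelta(hours=30),
))
```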
Two QA tasks are central:
- Physiological-Level QA: Single-modality interpretation. GPT-4o was prompted to generate detailed, multi-paragraph descriptions of physiological patterns (e.g., “Sinus Rhythm,” lab-group anomalies) for each data type, grouped by clinical attribute.
- Disease-Level QA with Chain of Evidence: Each prompt aggregates the three physiological reports and a target diagnosis set, instructing the model to produce explicitly synthesized evidence and relate findings to disease prediction in a prescribed output format. This explicit chaining enforces factual alignment between modality findings and diagnostic output.
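As a rough illustration of how a disease-level prompt might aggregate the three reports and the target diagnoses, consider the sketch below. The template wording, section labels, and function name are entirely hypothetical; the actual MedTVT-QA prompt and output format are not reproduced here.

```python
def build_coe_prompt(ecg_report, cxr_report, lab_report, target_diagnoses):
    """Assemble a disease-level prompt from three reports (hypothetical template)."""
    diagnoses = ", ".join(sorted(target_diagnoses))
    sections = [
        "ECG report:\n" + ecg_report.strip(),
        "CXR report:\n" + cxr_report.strip(),
        "LAB report:\n" + lab_report.strip(),
        "Target diagnoses: " + diagnoses,
        # Instruction text is illustrative only, not the dataset's wording.
        "Task: synthesize the evidence from all three reports, then explain "
        "how the findings support each target diagnosis.",
    ]
    return "\n\n".join(sections)

prompt = build_coe_prompt(
    "Sinus rhythm with left ventricular hypertrophy.",
    "Enlarged cardiac silhouette.",
    "Elevated creatinine.",
    {"Hypertensive heart disease"},
)
print(prompt)
```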
3. Reinforcement Fine-Tuning with GRPO and Jaccard Reward
To improve multi-label diagnostic reasoning, MedTVT-R1 uses a three-stage training regime:
- Pre-training (PT): On physiological-level QA, optimizing cross-entropy over answer tokens. Trainable parameters: modality projectors, LoRA layers. MPL and LLM weights are frozen.
- Supervised Fine-Tuning (SFT): On disease-level QA with CoE, updating the MPL and LoRA layers.
- Reinforcement Fine-Tuning (RFT): On the same data, applying GRPO with a Jaccard reward.
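The GRPO objective samples a group of $G$ candidate answers $\{o_i\}_{i=1}^{G}$ for each prompt $q$, obtains a reward $r_i$ for each candidate, and optimizes the expected clipped surrogate, written here in the standard GRPO form:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i \right) - \beta\, D_{\text{KL}}\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right]
$$

where $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$ and $A_i = \left(r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})\right) / \operatorname{std}(\{r_j\}_{j=1}^{G})$ is the group-relative advantage. The Jaccard reward $r_i$ can be written as

$$
r_i = \frac{|\hat{D}_i \cap D^{*}|}{|\hat{D}_i \cup D^{*}|} - p_i
$$

with $\hat{D}_i$ and $D^{*}$ denoting candidate and ground-truth disease sets, and $p_i$ penalizing format violations.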
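The reward and the group-relative normalization used by GRPO can be sketched as follows. This is a hedged illustration rather than the authors' code: the additive form and size of the format penalty, the convention that two empty sets score 1.0, and the use of the population standard deviation with a small `eps` are all assumptions.

```python
from statistics import mean, pstdev

def jaccard_reward(predicted, truth, format_ok=True, format_penalty=1.0):
    """Jaccard overlap between predicted and true disease sets, minus a penalty
    when the answer violates the expected format (penalty form is an assumption)."""
    pred, gold = set(predicted), set(truth)
    union = pred | gold
    # Convention (assumption): two empty sets count as a perfect match.
    score = 1.0 if not union else len(pred & gold) / len(union)
    return score - (0.0 if format_ok else format_penalty)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

truth = {"Pneumonia", "Sepsis"}
candidates = [
    {"Pneumonia", "Sepsis"},         # exact match
    {"Pneumonia"},                   # one of two found
    {"Pneumonia", "Heart failure"},  # one hit, one false positive
    set(),                           # nothing predicted
]
rewards = [jaccard_reward(c, truth) for c in candidates]
print(rewards)
print(group_relative_advantages(rewards))
```

Candidates that match the ground-truth set more closely than the group average receive positive advantages, and the rest receive negative ones, so the policy update needs no separate value model.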
4. Experimental Setup and Quantitative Results
Training used eight NVIDIA A800 80GB GPUs with the Hugging Face Trainer. Key training details: 20 epochs each for PT and SFT, 500 GRPO iterations for RFT, with batch size and learning rates left at the Hugging Face defaults.
MedTVT-R1 achieves substantial gains over prior multimodal LLMs (InternVL3-1B, LLaVA-1.5-7B, Qwen2.5-VL). On disease-level QA:
| Method | F1 | AUC |
|---|---|---|
| MedTVT-R1 | 0.5190 | 0.6554 |
| MedTVT-R1 w/o PT | 0.4672 | 0.5851 |
| MedTVT-R1 w/o RFT | 0.4992 | 0.6242 |
| SOTA baselines* | <0.20 | <0.51 |
Ablation studies reveal the importance of the full MPL (CMHA+CAO F1=0.5190; w/o CAO F1=0.4977; w/o CMHA F1=0.4867) and each modality; maximal performance is reached only with all three modalities.
For physiological-level QA (single modality):
| Modality | BLEU | METEOR | ROUGE | BERTScore |
|---|---|---|---|---|
| ECG-QA | 0.0831 | 0.3044 | 0.2202 | 0.8650 |
| CXR-QA | 0.0931 | 0.3073 | 0.2121 | 0.8673 |
| LAB-QA | 0.1807 | 0.3827 | 0.3081 | 0.8855 |
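Taken together with the ablations above, these results indicate that physiological-level pre-training and reinforcement fine-tuning both contribute to diagnostic and interpretive performance: removing PT lowers disease-level F1 from 0.5190 to 0.4672, and removing RFT lowers it to 0.4992.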
5. Clinical Applications and Interpretation
MedTVT-R1 supports automated generation of long-form, evidence-anchored diagnostic reports drawing on all available clinical data. The Chain of Evidence approach ensures each predicted disease in the output is explicitly linked to corroborating findings in the modalities (e.g., “ECG shows left ventricular hypertrophy; CXR confirms enlarged cardiac silhouette; elevated creatinine → hypertensive heart disease evidence”).
The architecture supports concurrent diagnosis of comorbidities (e.g., Pneumonia + Sepsis), each mapped to its supporting evidence, thereby enhancing explainability and alignment with clinical reasoning protocols.
A plausible implication is that MedTVT-R1’s integration of cross-modal perception and chain-of-evidence prompts addresses a major bottleneck in evidence-aligned, factual medical AI outputs, particularly in high-stakes, multi-disease settings.
6. Relationship to Other Medical AI Benchmarks and Recommendations
MedTVT-R1 demonstrates an application-driven evolution beyond single-frame or single-modal segmentation and reasoning models that dominate earlier literature (e.g., Trichomonas segmentation (Li et al., 2022), Video-TransUNet (Zeng et al., 2022)). The project exemplifies the need for high-quality, temporally and physiologically aligned data, robust cross-modal fusion via explicit attention and contribution-aware mechanisms, and reward-driven reinforcement learning that directly operationalizes clinical correctness.
Recommendations for extensions include domain adaptation to new data capture protocols, model compression for clinical deployment, and further refinement of reward schemes to better match consensus clinical practice.
7. Limitations and Future Directions
MedTVT-R1’s current regime is contingent upon the availability of three specific data modalities within strict temporal windows. Model training has not explicitly addressed domain generalization to new sources, real-time deployment constraints, or the handling of missing modalities. This suggests that future work should prioritize:
- Robustness to missing/incomplete modalities.
- Domain adaptation across institutions and imaging standards.
- Model distillation and quantization for resource-limited deployment settings.
- Expansion to other structured medical diagnostic tasks with richer annotation schemas.
MedTVT-R1, through instructional dataset curation, explicit multimodal fusion, and reward-based fine-tuning, establishes a new reference point for interpretable, evidence-grounded AI in multi-disease clinical inference (Zhang et al., 23 Jun 2025).