MedTVT-R1: A Multimodal LLM for Medical Reasoning and Diagnosis
MedTVT-R1 introduces a comprehensive multimodal LLM (MLLM) framework for interpretable medical reasoning and multi-disease diagnosis, addressing the limitations of single-modality approaches in clinical AI. The model is designed to integrate heterogeneous clinical data—electrocardiograms (ECG, time series), chest X-rays (CXR, images), and laboratory blood tests (LAB, tabular)—to enable both physiological-level understanding and disease-level diagnostic reasoning. This work is underpinned by the construction of MedTVT-QA, a curated instruction dataset that provides question-answer pairs for both physiological and disease-level tasks, leveraging a Chain of Evidence (CoE) methodology to ensure robust, evidence-based reasoning.
Methodological Contributions
1. MedTVT-QA Dataset Construction
MedTVT-QA is the first instruction dataset to simultaneously consider ECG, CXR, and LAB modalities, with 8,706 multimodal data combinations derived from the MIMIC-IV suite. The dataset is structured to support two levels of reasoning:
- Physiological-level QA pairs: Each modality is annotated and paired with prompts to elicit detailed, clinically relevant interpretations. These are generated using GPT-4o and manually reviewed for accuracy.
- Disease-level QA pairs: These pairs require the model to synthesize evidence across all modalities to justify the presence of specific diseases, enforcing a CoE approach. Seven major disease categories (e.g., coronary artery disease, sepsis, diabetes) are covered, with each QA pair demanding explicit, modality-grounded evidence for each diagnosis (an illustrative pair is sketched below).
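To make the CoE structure concrete, here is a hypothetical disease-level QA pair written as a Python dictionary. The field names and clinical findings are illustrative assumptions for this summary, not the dataset's actual schema:

```python
# Hypothetical shape of a disease-level MedTVT-QA pair; field names and
# clinical details are illustrative assumptions, not the real schema.
qa_pair = {
    "modalities": ["ECG", "CXR", "LAB"],
    "question": (
        "Given the ECG, chest X-ray, and laboratory panel, which disease "
        "categories are present, and what evidence supports each?"
    ),
    "answer": (
        "Coronary artery disease. ECG: ST-segment depression in the lateral "
        "leads. CXR: mild cardiomegaly. LAB: elevated troponin. The Chain of "
        "Evidence format requires each diagnosis to cite findings from every "
        "supporting modality."
    ),
}
```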
2. Model Architecture
MedTVT-R1 comprises:
- Modality-specific encoders and projectors: Pretrained encoders for ECG (ECGFM-KED), CXR (ViT-B/16), and LAB (Symile) extract features, which are projected into a shared embedding space compatible with the LLM.
- Modality Perception Layer (MPL): This layer includes a Cyclic Multi-Head Attention (CMHA) mechanism for cross-modal interaction and a Contribution-Aware Operator (CAO) that adaptively weights each modality's contribution based on diagnostic context (a minimal sketch follows this list).
- LLM backbone: LLaMA 3.2-1B with LoRA adaptation handles language modeling and reasoning.
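The following PyTorch sketch shows one plausible way to realize an MPL that combines cyclic cross-modal attention with contribution-aware gating. Only the CMHA/CAO terminology comes from the paper; the class internals, shapes, and gating parameterization are assumptions, not the authors' implementation:

```python
# A minimal sketch of the Modality Perception Layer, assuming per-modality
# features of shape (batch, tokens, dim). Internals are illustrative guesses.
import torch
import torch.nn as nn

class ModalityPerceptionLayer(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cyclic Multi-Head Attention: each modality attends to its cyclic
        # neighbor (ECG -> CXR -> LAB -> ECG); parameters are shared here
        # for brevity, though per-pair weights are equally plausible.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Contribution-Aware Operator: a learned scalar gate per modality,
        # predicted from its pooled features (an assumed parameterization).
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats = [ecg, cxr, lab], each shaped (batch, tokens, dim).
        fused = []
        for i, query in enumerate(feats):
            kv = feats[(i + 1) % len(feats)]   # cyclic neighbor as key/value
            attended, _ = self.cross_attn(query, kv, kv)
            x = self.norm(query + attended)    # residual cross-modal update
            weight = self.gate(x.mean(dim=1))  # (batch, 1) contribution score
            fused.append(weight.unsqueeze(-1) * x)
        # Concatenate weighted modality tokens into one sequence for the LLM.
        return torch.cat(fused, dim=1)
```

The concatenated, contribution-weighted token sequence would then be prepended to the text prompt as the LLM's multimodal prefix.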
3. Training Strategy
A three-stage training pipeline is employed:
- Pre-training (PT): The model is trained on physiological-level QA pairs to build foundational modality understanding.
- Supervised Fine-Tuning (SFT): Disease-level QA pairs with CoE logic are used to teach the model multimodal synthesis and diagnostic reasoning.
- Reinforcement Fine-Tuning (RFT): Group Relative Policy Optimization (GRPO) is applied with a novel Jaccard Reward function, directly optimizing for multi-disease diagnostic accuracy and output-format compliance (a sketch of such a reward follows this list).
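A minimal sketch of a Jaccard-based reward for multi-label diagnosis, assuming the policy emits its diagnoses as a comma-separated list inside an `<answer>...</answer>` tag; the tag convention and parsing are assumptions, not the paper's exact reward specification:

```python
# Sketch of a Jaccard reward combining set overlap with format compliance.
import re

def jaccard_reward(predicted: set[str], reference: set[str]) -> float:
    """Jaccard similarity |P & R| / |P | R| over disease-label sets."""
    if not predicted and not reference:
        return 1.0  # both empty: treat as perfect agreement
    return len(predicted & reference) / len(predicted | reference)

def grpo_reward(output: str, reference: set[str]) -> float:
    """Zero reward for unparsable output, Jaccard overlap otherwise."""
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    if match is None:
        return 0.0  # format violation earns no reward
    predicted = {d.strip().lower() for d in match.group(1).split(",") if d.strip()}
    return jaccard_reward(predicted, {r.lower() for r in reference})

# Example: {"sepsis", "diabetes"} vs. {"sepsis"} overlaps 1 of 2 labels.
print(grpo_reward("<answer>sepsis, diabetes</answer>", {"sepsis"}))  # 0.5
```

Because the Jaccard index penalizes both missed and spurious diagnoses, it is a natural verifiable reward for multi-label prediction, unlike accuracy-style rewards that score a single answer.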
Experimental Results
Quantitative Performance
MedTVT-R1 demonstrates state-of-the-art results on both natural language generation (NLG) and clinical efficacy (CE) metrics. In disease-level diagnostic reasoning, it achieves:
| Model | BLEU | METEOR | ROUGE | BERTScore | Precision | Recall | F1 Score | AUC |
|---|---|---|---|---|---|---|---|---|
| MedTVT-R1 | 0.1353 | 0.3536 | 0.2295 | 0.8652 | 0.5407 | 0.5908 | 0.5190 | 0.6554 |
| Best prior baseline | 0.0341 | 0.2031 | 0.1435 | 0.8181 | 0.3493 | 0.1397 | 0.1995 | 0.5053 |
Ablation studies confirm that both the MPL (CMHA and CAO) and the inclusion of all three modalities are critical for optimal performance. The RFT stage with GRPO and the Jaccard Reward further boosts diagnostic accuracy, as evidenced by significant improvements in F1 and AUC.
Physiological-level Understanding
On single-modality QA tasks, MedTVT-R1 outperforms all baselines, particularly excelling in long-form, detailed physiological analysis. Notably, performance is highest on LAB data, likely because tabular laboratory values map more directly to text than waveforms or images do, but the model also achieves strong results on CXR and ECG.
Qualitative Analysis
MedTVT-R1’s diagnostic outputs are characterized by explicit evidence tracing, with each diagnosis substantiated by findings from multiple modalities. The model’s reasoning chains are transparent, aligning with clinical expectations for interpretability and trustworthiness.
Practical Implications
Clinical Applications
- Diagnostic Report Generation: MedTVT-R1 can generate comprehensive, evidence-based diagnostic reports, supporting clinical decision-making and documentation.
- Comorbidity Reasoning: The model’s ability to synthesize multimodal evidence enables robust handling of complex, multi-disease cases.
- Medical Education and Audit: The explicit CoE reasoning chains facilitate model auditing and can serve as educational tools for clinicians and trainees.
Deployment Considerations
- Computational Requirements: Training requires significant GPU resources (e.g., 8×A800 80GB GPUs), but inference can be optimized for clinical settings through model quantization and efficient modality encoders (see the loading sketch after this list).
- Data Integration: Real-world deployment necessitates robust pipelines for synchronizing and preprocessing heterogeneous clinical data streams.
- Generalization and Limitations: The model’s performance is contingent on the availability of temporally aligned, multimodal data. Current limitations include the lack of additional modalities (e.g., patient history, genomics) and the challenge of scaling to broader disease categories.
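As one concrete route to the quantization mentioned above, here is an illustrative 4-bit loading snippet using Hugging Face transformers with bitsandbytes. The checkpoint path is a placeholder, since this summary does not name a released checkpoint, and the LLM backbone would still need the modality encoders and MPL wired in separately:

```python
# Illustrative 4-bit inference setup via transformers + bitsandbytes;
# the checkpoint path below is a placeholder, not a published artifact.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/medtvt-r1-checkpoint",  # placeholder checkpoint path
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available devices
)
```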
Theoretical and Future Directions
MedTVT-R1 advances the field by demonstrating that explicit cross-modal interaction and adaptive modality weighting are essential for interpretable, high-accuracy medical reasoning. The integration of reinforcement learning with verifiable, task-specific rewards (Jaccard similarity) sets a precedent for optimizing LLMs in structured, multi-label clinical tasks.
Future research may focus on:
- Expanding modality coverage (e.g., integrating clinical notes, genomics, or wearable data).
- Improving data efficiency through semi-supervised or self-supervised learning on limited multimodal datasets.
- Enhancing explainability by developing more granular, user-controllable reasoning chains.
- Clinical validation in prospective, real-world settings to assess generalizability and safety.
MedTVT-R1 establishes a robust foundation for the next generation of clinical AI systems, where multimodal, interpretable, and evidence-based reasoning is paramount for safe and effective deployment.