MedTVT-R1: A Multimodal LLM for Medical Reasoning and Diagnosis
MedTVT-R1 introduces a comprehensive multimodal LLM (MLLM) framework for medical reasoning and diagnosis, specifically targeting the integration of heterogeneous clinical data—electrocardiograms (ECG, time series), chest X-rays (CXR, images), and laboratory blood tests (LAB, tabular data). The work addresses the limitations of single-modality diagnostic models, which are unable to capture the complex, multifaceted nature of many diseases, and advances the field by enabling interpretable, evidence-based multi-disease diagnosis.
Dataset Construction: MedTVT-QA
A central contribution is the MedTVT-QA dataset, which is the first instruction dataset to simultaneously consider ECG, CXR, and LAB modalities. The dataset construction pipeline ensures temporal consistency by aligning data from the MIMIC-IV family of datasets, resulting in 8,706 multimodal patient samples. The QA pairs are designed at two levels:
- Physiological-level: Each modality is annotated and described in detail, with prompts guiding GPT-4o to generate clinically meaningful, long-form explanations. These are manually reviewed for accuracy.
- Disease-level: QA pairs require the model to synthesize evidence from all three modalities, following a Chain of Evidence (CoE) approach. The model must justify each diagnosis with explicit, modality-specific findings, covering seven major disease categories and their subtypes.
This dataset design enforces both granular physiological understanding and high-level diagnostic reasoning, providing a robust foundation for multimodal LLM training.
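For concreteness, a disease-level sample might be organized as sketched below; the field names and values are illustrative assumptions, not the released MedTVT-QA schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MedTVTQASample:
    """Hypothetical layout of one disease-level MedTVT-QA record.

    All field names are assumptions for illustration; consult the
    released dataset for the actual schema.
    """
    patient_id: str                  # MIMIC-IV-derived identifier
    ecg_path: str                    # path to the 12-lead ECG waveform
    cxr_path: str                    # path to the chest X-ray image
    lab_values: Dict[str, float]     # blood-test name -> measured value
    question: str                    # instruction prompt
    answer: str                      # Chain-of-Evidence style response
    diagnoses: List[str] = field(default_factory=list)  # gold disease labels

sample = MedTVTQASample(
    patient_id="p000001",
    ecg_path="ecg/p000001.dat",
    cxr_path="cxr/p000001.jpg",
    lab_values={"troponin_t": 0.12, "creatinine": 1.4},
    question="Based on the ECG, CXR, and lab results, which diseases are present?",
    answer="The ECG shows ... the CXR demonstrates ... the labs reveal ... therefore ...",
    diagnoses=["heart failure", "acute kidney injury"],
)
```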
Model Architecture
MedTVT-R1 is designed to maximize cross-modal information flow and support adaptive reasoning:
- Modality-specific Encoders and Projectors: Each modality is processed by a dedicated encoder (ECGFM-KED for ECG, ViT-B/16 for CXR, Symile for LAB), followed by a projector that aligns features into a shared embedding space compatible with the LLM.
- Modality Perception Layer (MPL): This layer consists of two key mechanisms:
  - Cyclic Multi-Head Attention (CMHA): Enables each modality to attend to the others in a cyclic fashion, capturing inter-modal dependencies.
  - Contribution-Aware Operator (CAO): Learns adaptive weights for each modality, allowing the model to emphasize the most relevant data for a given diagnostic context.
- LLM Backbone: LLaMA 3.2-1B with LoRA adaptation is used for efficient fine-tuning and integration of multimodal embeddings.
The architecture is modular, facilitating extension to additional modalities or alternative encoders as new data sources become available; a minimal sketch of the MPL fusion step follows.
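The sketch below illustrates how such a fusion layer could be wired in PyTorch, assuming a fixed ECG→CXR→LAB→ECG attention cycle and a softmax gate over pooled features; it is an illustration of the idea, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityPerceptionLayer(nn.Module):
    """Minimal MPL-style fusion block (illustrative, not the paper's code).

    Expects three token sequences already projected to a shared width `d`:
    ECG, CXR, and LAB. Cyclic cross-attention lets each modality query the
    next one in a fixed cycle; a contribution-aware gate re-weights the
    fused streams before they are handed to the LLM backbone.
    """

    def __init__(self, d: int = 768, heads: int = 8):
        super().__init__()
        # One cross-attention module per directed edge of the assumed cycle.
        self.ecg_to_cxr = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cxr_to_lab = nn.MultiheadAttention(d, heads, batch_first=True)
        self.lab_to_ecg = nn.MultiheadAttention(d, heads, batch_first=True)
        # Contribution-aware gate: one scalar weight per modality from pooled features.
        self.gate = nn.Linear(3 * d, 3)

    def forward(self, ecg, cxr, lab):
        # Cyclic cross-attention: each stream queries the next modality.
        ecg_f, _ = self.ecg_to_cxr(ecg, cxr, cxr)   # ECG attends to CXR
        cxr_f, _ = self.cxr_to_lab(cxr, lab, lab)   # CXR attends to LAB
        lab_f, _ = self.lab_to_ecg(lab, ecg, ecg)   # LAB attends to ECG

        # Pool each enriched stream and compute adaptive modality weights.
        pooled = torch.cat([ecg_f.mean(1), cxr_f.mean(1), lab_f.mean(1)], dim=-1)
        w = torch.softmax(self.gate(pooled), dim=-1)  # shape (batch, 3)

        # Scale each modality's tokens by its learned contribution and
        # concatenate along the sequence dimension for the LLM.
        fused = torch.cat([
            w[:, 0:1, None] * ecg_f,
            w[:, 1:2, None] * cxr_f,
            w[:, 2:3, None] * lab_f,
        ], dim=1)
        return fused

# Toy usage: batch of 2, arbitrary token counts per modality, width 768.
mpl = ModalityPerceptionLayer()
fused = mpl(torch.randn(2, 64, 768), torch.randn(2, 196, 768), torch.randn(2, 32, 768))
print(fused.shape)  # torch.Size([2, 292, 768])
```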
Training Strategy
A three-stage training protocol is employed:
- Pre-training (PT): The model is exposed to physiological-level QA pairs, learning to interpret each modality in isolation.
- Supervised Fine-Tuning (SFT): Disease-level QA pairs with CoE logic are used to train the MPL and LLM, enabling multimodal synthesis and diagnostic reasoning.
- Reinforcement Fine-Tuning (RFT): Group Relative Policy Optimization (GRPO) is applied with a novel Jaccard Reward function that directly optimizes multi-label disease prediction accuracy. GRPO compares responses within a sampled group, obviating the need for a separate critic model and aligning model outputs with verifiable clinical ground truth (a sketch of the reward and advantage computation follows this list).
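A minimal sketch of the reward signal and GRPO's group-relative advantage, assuming predicted diseases are recovered by simple string matching against a fixed label list (the paper's parsing rules and label set will differ):

```python
import statistics
from typing import List, Set

# Hypothetical disease vocabulary used only for this illustration.
DISEASES = ["heart failure", "pneumonia", "sepsis", "myocardial infarction"]

def extract_labels(response: str) -> Set[str]:
    """Pull predicted disease names out of a generated diagnosis."""
    text = response.lower()
    return {d for d in DISEASES if d in text}

def jaccard_reward(response: str, gold: Set[str]) -> float:
    """Jaccard overlap between predicted and ground-truth disease sets."""
    pred = extract_labels(response)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: standardize rewards within a sampled group,
    so no separate critic/value model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Example: four sampled responses for the same multimodal prompt.
gold = {"heart failure", "pneumonia"}
group = [
    "Evidence suggests heart failure and pneumonia.",
    "Findings are consistent with heart failure only.",
    "Likely sepsis.",
    "Heart failure, pneumonia, and myocardial infarction are indicated.",
]
rewards = [jaccard_reward(r, gold) for r in group]
advantages = group_relative_advantages(rewards)
```

Standardizing rewards within each sampled group is what lets GRPO dispense with a learned critic while still producing a relative training signal.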
Experimental Results
MedTVT-R1 is evaluated against eight state-of-the-art MLLMs, including InternVL3-1B, LLaVA-1.5-7B, Qwen2.5-VL-3B-Instruct, and Deepseek-VL-1.3B-Chat. The evaluation covers both natural language generation (BLEU, METEOR, ROUGE, BERTScore) and clinical efficacy (precision, recall, F1, AUC) metrics.
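The clinical-efficacy numbers are standard multi-label quantities; the snippet below shows one way they could be computed with scikit-learn on toy predictions (the data and averaging scheme here are assumptions, not the paper's evaluation code).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

np.random.seed(0)

# Toy multi-label evaluation over 7 disease categories for 4 patients.
# These arrays are illustrative only; they do not reproduce the paper's results.
y_true = np.array([[1, 0, 1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1, 0, 0],
                   [1, 1, 0, 0, 0, 0, 1],
                   [0, 0, 0, 1, 0, 1, 0]])
y_prob = np.random.rand(4, 7)          # model-assigned probability per disease
y_pred = (y_prob >= 0.5).astype(int)   # thresholded multi-label predictions

print("precision:", precision_score(y_true, y_pred, average="micro", zero_division=0))
print("recall:   ", recall_score(y_true, y_pred, average="micro", zero_division=0))
print("F1:       ", f1_score(y_true, y_pred, average="micro", zero_division=0))
print("AUC:      ", roc_auc_score(y_true, y_prob, average="macro"))
```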
Key findings:
- Superior Performance: MedTVT-R1 achieves the highest scores across all metrics, with a notable F1 of 0.5190 and AUC of 0.6554 for disease-level diagnosis, substantially outperforming all baselines.
- Ablation Studies: Both the CMHA and CAO components of the MPL are shown to be critical; removing either degrades performance. Similarly, omitting any modality during pre-training leads to a measurable drop in diagnostic accuracy, with ECG being particularly important for the covered disease spectrum.
- Long-Text Generation: The model demonstrates robust performance in generating detailed, clinically relevant reports exceeding 300 words, a task where other MLLMs struggle.
- Qualitative Analysis: MedTVT-R1 consistently provides evidence-based, multi-modal justifications for its diagnoses, aligning with clinical reasoning practices.
Practical Implications
MedTVT-R1's design and results have several practical implications:
- Clinical Decision Support: The model's ability to generate interpretable, evidence-backed diagnostic reports positions it as a valuable tool for assisting clinicians in complex cases, particularly those involving comorbidities.
- Comorbidity Reasoning: By explicitly modeling the complementarity and corroboration among modalities, MedTVT-R1 is well-suited for multi-disease scenarios, a common challenge in real-world healthcare.
- Dataset and Code Availability: The release of MedTVT-QA and the model code enables reproducibility and further research, facilitating benchmarking and extension to new modalities or disease categories.
Limitations and Future Directions
The authors acknowledge several limitations:
- Data Scale and Diversity: The model's generalization is constrained by the size and diversity of available multimodal datasets; larger, temporally aligned corpora covering more patients and conditions would further enhance performance.
- Modal Coverage: While ECG, CXR, and LAB are highly informative, additional modalities (e.g., patient history, genomics) are not yet integrated due to data availability constraints.
- Clinical Validation: Prospective validation in real-world clinical settings is necessary to assess the model's impact on diagnostic workflows and patient outcomes.
Future work should focus on expanding the range of modalities, improving temporal alignment, and integrating the model into clinical decision support systems with rigorous human-in-the-loop evaluation.
Conclusion
MedTVT-R1 represents a significant step forward in multimodal medical AI, demonstrating that carefully designed instruction datasets, adaptive modality fusion, and reinforcement-based fine-tuning can yield interpretable, high-accuracy multi-disease diagnostic models. The framework sets a new standard for multimodal LLMs in healthcare and provides a blueprint for future research at the intersection of AI and clinical medicine.