MedTVT-R1: A Multimodal LLM for Medical Reasoning and Diagnosis
MedTVT-R1 introduces a comprehensive multimodal LLM (MLLM) framework for interpretable medical reasoning and multi-disease diagnosis, addressing the limitations of single-modality approaches in clinical AI. The model is designed to integrate heterogeneous clinical data—electrocardiograms (ECG, time series), chest X-rays (CXR, images), and laboratory blood tests (LAB, tabular)—to enable both physiological-level understanding and disease-level diagnostic reasoning. This work is underpinned by the construction of MedTVT-QA, a curated instruction dataset that provides question-answer pairs for both physiological and disease-level tasks, leveraging a Chain of Evidence (CoE) methodology to ensure robust, evidence-based reasoning.
Methodological Contributions
1. MedTVT-QA Dataset Construction
MedTVT-QA is the first instruction dataset to simultaneously consider ECG, CXR, and LAB modalities, with 8,706 multimodal data combinations derived from the MIMIC-IV suite. The dataset is structured to support two levels of reasoning:
- Physiological-level QA pairs: Each modality is annotated and paired with prompts to elicit detailed, clinically relevant interpretations. These are generated using GPT-4o and manually reviewed for accuracy.
- Disease-level QA pairs: These pairs require the model to synthesize evidence across all modalities to justify the presence of specific diseases, enforcing a CoE approach. Seven major disease categories (e.g., coronary artery disease, sepsis, diabetes) are covered, with each QA pair demanding explicit, modality-grounded evidence for each diagnosis (an illustrative pair is sketched below).
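To make the CoE structure concrete, here is a hypothetical disease-level QA pair written as a Python dictionary. The field names and clinical findings are illustrative assumptions for this summary, not the dataset's actual schema:

```python
# Hypothetical shape of a disease-level MedTVT-QA pair; field names and
# clinical details are illustrative assumptions, not the real schema.
qa_pair = {
    "modalities": ["ECG", "CXR", "LAB"],
    "question": (
        "Given the ECG, chest X-ray, and laboratory panel, which disease "
        "categories are present, and what evidence supports each?"
    ),
    "answer": (
        "Coronary artery disease. ECG: ST-segment depression in the lateral "
        "leads. CXR: mild cardiomegaly. LAB: elevated troponin. The Chain of "
        "Evidence format requires each diagnosis to cite findings from every "
        "supporting modality."
    ),
}
```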
2. Model Architecture
MedTVT-R1 comprises:
- Modality-specific encoders and projectors: Pretrained encoders for ECG (ECGFM-KED), CXR (ViT-B/16), and LAB (Symile) extract features, which are projected into a shared embedding space compatible with the LLM.
- Modality Perception Layer (MPL): This layer includes a Cyclic Multi-Head Attention (CMHA) mechanism for cross-modal interaction and a Contribution-Aware Operator (CAO) that adaptively weights each modality's contribution based on diagnostic context (a minimal sketch follows this list).
- LLM backbone: LLaMA 3.2-1B with LoRA adaptation handles language modeling and reasoning.
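The following PyTorch sketch shows one plausible way to realize an MPL that combines cyclic cross-modal attention with contribution-aware gating. Only the CMHA/CAO terminology comes from the paper; the class internals, shapes, and gating parameterization are assumptions, not the authors' implementation:

```python
# A minimal sketch of the Modality Perception Layer, assuming per-modality
# features of shape (batch, tokens, dim). Internals are illustrative guesses.
import torch
import torch.nn as nn

class ModalityPerceptionLayer(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cyclic Multi-Head Attention: each modality attends to its cyclic
        # neighbor (ECG -> CXR -> LAB -> ECG); parameters are shared here
        # for brevity, though per-pair weights are equally plausible.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Contribution-Aware Operator: a learned scalar gate per modality,
        # predicted from its pooled features (an assumed parameterization).
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats = [ecg, cxr, lab], each shaped (batch, tokens, dim).
        fused = []
        for i, query in enumerate(feats):
            kv = feats[(i + 1) % len(feats)]   # cyclic neighbor as key/value
            attended, _ = self.cross_attn(query, kv, kv)
            x = self.norm(query + attended)    # residual cross-modal update
            weight = self.gate(x.mean(dim=1))  # (batch, 1) contribution score
            fused.append(weight.unsqueeze(-1) * x)
        # Concatenate weighted modality tokens into one sequence for the LLM.
        return torch.cat(fused, dim=1)
```

The concatenated, contribution-weighted token sequence would then be prepended to the text prompt as the LLM's multimodal prefix.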
3. Training Strategy
A three-stage training pipeline is employed:
- Pre-training (PT): The model is trained on physiological-level QA pairs to build foundational modality understanding.
- Supervised Fine-Tuning (SFT): Disease-level QA pairs with CoE logic are used to teach the model multimodal synthesis and diagnostic reasoning.
- Reinforcement Fine-Tuning (RFT): Group Relative Policy Optimization (GRPO) is applied with a novel Jaccard Reward function, directly optimizing for multi-disease diagnostic accuracy and output-format compliance (a sketch of such a reward follows this list).
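A minimal sketch of a Jaccard-based reward for multi-label diagnosis, assuming the policy emits its diagnoses as a comma-separated list inside an `<answer>...</answer>` tag; the tag convention and parsing are assumptions, not the paper's exact reward specification:

```python
# Sketch of a Jaccard reward combining set overlap with format compliance.
import re

def jaccard_reward(predicted: set[str], reference: set[str]) -> float:
    """Jaccard similarity |P & R| / |P | R| over disease-label sets."""
    if not predicted and not reference:
        return 1.0  # both empty: treat as perfect agreement
    return len(predicted & reference) / len(predicted | reference)

def grpo_reward(output: str, reference: set[str]) -> float:
    """Zero reward for unparsable output, Jaccard overlap otherwise."""
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    if match is None:
        return 0.0  # format violation earns no reward
    predicted = {d.strip().lower() for d in match.group(1).split(",") if d.strip()}
    return jaccard_reward(predicted, {r.lower() for r in reference})

# Example: {"sepsis", "diabetes"} vs. {"sepsis"} overlaps 1 of 2 labels.
print(grpo_reward("<answer>sepsis, diabetes</answer>", {"sepsis"}))  # 0.5
```

Because the Jaccard index penalizes both missed and spurious diagnoses, it is a natural verifiable reward for multi-label prediction, unlike accuracy-style rewards that score a single answer.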
Experimental Results
Quantitative Performance
MedTVT-R1 demonstrates state-of-the-art results on both natural language generation (NLG) and clinical efficacy (CE) metrics. In disease-level diagnostic reasoning, it achieves:
| Model | BLEU | METEOR | ROUGE | BERTScore | Precision | Recall | F1 Score | AUC |
|---|---|---|---|---|---|---|---|---|
| MedTVT-R1 | 0.1353 | 0.3536 | 0.2295 | 0.8652 | 0.5407 | 0.5908 | 0.5190 | 0.6554 |
| Best prior baseline | 0.0341 | 0.2031 | 0.1435 | 0.8181 | 0.3493 | 0.1397 | 0.1995 | 0.5053 |
Ablation studies confirm that both the MPL (CMHA and CAO) and the inclusion of all three modalities are critical for optimal performance. The RFT stage with GRPO and the Jaccard Reward further boosts diagnostic accuracy, as evidenced by significant improvements in F1 and AUC.
Physiological-level Understanding
On single-modality QA tasks, MedTVT-R1 outperforms all baselines, particularly excelling in long-form, detailed physiological analysis. Notably, performance is highest on LAB data, likely because tabular laboratory values map more directly to text than waveforms or images do, but the model also achieves strong results on CXR and ECG.
Qualitative Analysis
MedTVT-R1’s diagnostic outputs are characterized by explicit evidence tracing, with each diagnosis substantiated by findings from multiple modalities. The model’s reasoning chains are transparent, aligning with clinical expectations for interpretability and trustworthiness.
Practical Implications
Clinical Applications
- Diagnostic Report Generation: MedTVT-R1 can generate comprehensive, evidence-based diagnostic reports, supporting clinical decision-making and documentation.
- Comorbidity Reasoning: The model’s ability to synthesize multimodal evidence enables robust handling of complex, multi-disease cases.
- Medical Education and Audit: The explicit CoE reasoning chains facilitate model auditing and can serve as educational tools for clinicians and trainees.
Deployment Considerations
- Computational Requirements: Training requires significant GPU resources (e.g., 8×A800 80GB GPUs), but inference can be optimized for clinical settings through model quantization and efficient modality encoders (see the loading sketch after this list).
- Data Integration: Real-world deployment necessitates robust pipelines for synchronizing and preprocessing heterogeneous clinical data streams.
- Generalization and Limitations: The model’s performance is contingent on the availability of temporally aligned, multimodal data. Current limitations include the lack of additional modalities (e.g., patient history, genomics) and the challenge of scaling to broader disease categories.
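As one concrete route to the quantization mentioned above, here is an illustrative 4-bit loading snippet using Hugging Face transformers with bitsandbytes. The checkpoint path is a placeholder, since this summary does not name a released checkpoint, and the LLM backbone would still need the modality encoders and MPL wired in separately:

```python
# Illustrative 4-bit inference setup via transformers + bitsandbytes;
# the checkpoint path below is a placeholder, not a published artifact.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/medtvt-r1-checkpoint",  # placeholder checkpoint path
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available devices
)
```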
Theoretical and Future Directions
MedTVT-R1 advances the field by demonstrating that explicit cross-modal interaction and adaptive modality weighting are essential for interpretable, high-accuracy medical reasoning. The integration of reinforcement learning with verifiable, task-specific rewards (Jaccard similarity) sets a precedent for optimizing LLMs in structured, multi-label clinical tasks.
Future research may focus on:
- Expanding modality coverage (e.g., integrating clinical notes, genomics, or wearable data).
- Improving data efficiency through semi-supervised or self-supervised learning on limited multimodal datasets.
- Enhancing explainability by developing more granular, user-controllable reasoning chains.
- Clinical validation in prospective, real-world settings to assess generalizability and safety.
MedTVT-R1 establishes a robust foundation for the next generation of clinical AI systems, where multimodal, interpretable, and evidence-based reasoning is paramount for safe and effective deployment.