- The paper introduces a 3D-native multimodal framework that unifies segmentation, quantification, disease classification, VQA, and report generation for comprehensive CMR analysis.
- The methodology employs a dual-path U-Net architecture for segmentation and hierarchical disease classification, achieving AUCs above 0.97 for prevalent conditions and a BERT-score F1 of 0.898 in report generation.
- The integrated system cuts interpretation time from 1800 to 90 seconds with over 99% tool invocation precision and zero hallucinations, demonstrating robust clinical applicability.
Automated Multimodal Reasoning and Diagnosis for Cardiovascular Diseases: The BAAI Cardiac Agent
Introduction
Cardiovascular magnetic resonance imaging (CMR) provides multidimensional, temporally-resolved, and anatomically-detailed data for the assessment of structural heart diseases. Effectively leveraging these data for diagnosis and clinical decision-making demands substantial expert labor due to the multi-sequence, multi-phase character of CMR and the need for coordinated reasoning across quantitative measures and qualitative interpretation. Fragmented or single-task AI solutions fall short of addressing these practical challenges since they do not replicate the integrated reasoning pipelines inherent to clinical workflows. The BAAI Cardiac Agent presents a comprehensive, 3D-native, multimodal agent architecture that unifies segmentation, quantification, disease classification, visual question answering (VQA), and report generation into a single, coherent pipeline designed explicitly for end-to-end CMR interpretation.
System Architecture and Methodological Foundations
The agent comprises a large multimodal model (LMM) engine built on top of LLaVA-Med, extended natively for 3D input. It orchestrates a suite of expert models for:
- View-specific segmentation (SAX cine, 2CH cine, 4CH cine, SAX LGE), using a refined dual-path, coarse-to-fine U-Net-based architecture with strong robustness to anatomical variability.
- Hierarchically-structured disease classification for both cardiac disease screening (CDS: normal, IHD, NICM) and fine-grained non-ischemic cardiomyopathy subclassification (HCM, DCM, RCM, ACM, myocarditis).
- Report generation with structured integration of automated measurements and clinical inferences.
- RAG modules for evidence recall and context-enrichment.
- VQA and sequence recognition modules operating directly on temporally resolved volumetric data.
The system is instruction-following: the LMM acts as planner and executor, dynamically invoking expert models as clinical queries require, integrating their outputs, and synthesizing responses that combine quantitative data with narrative conclusions. Lightweight LoRA-based fine-tuning, extensive data augmentation, and efficient optimization schemes are used to achieve generalization and computational tractability.
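The planner-and-executor pattern described above can be illustrated with a minimal sketch. This is not the paper's implementation: the tool names, the keyword table, and the keyword-matching `plan` function (standing in for the LMM's planning step) are all hypothetical, and each tool is a stub that returns a placeholder string instead of running a model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical expert tools; each stub returns a placeholder string
# instead of running a real segmentation or classification model.
def segment(study: str) -> str:
    return f"segmentation masks for {study}"

def classify(study: str) -> str:
    return f"disease probabilities for {study}"

def write_report(study: str) -> str:
    return f"draft report for {study}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "segment": segment,
    "classify": classify,
    "report": write_report,
}

# Assumed keyword table; a real agent would let the LMM choose the tools.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "segment": ("segment", "volume"),
    "classify": ("diagnos", "disease"),
    "report": ("report",),
}

def plan(query: str) -> List[str]:
    """Stand-in for the LMM planner: select tools by keyword match."""
    q = query.lower()
    return [tool for tool, words in KEYWORDS.items()
            if any(word in q for word in words)]

def run_agent(query: str, study: str) -> List[str]:
    """Execute the planned tools in order and collect their outputs."""
    return [TOOLS[name](study) for name in plan(query)]
```

Calling `run_agent("Please segment the LV and write a report", "study-001")` would invoke the segmentation stub and then the report stub. Keeping the tool registry separate from the planner mirrors the modular design the summary describes, where expert models can be swapped without changing the orchestration logic.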
Empirical Results
Segmentation
Across multiple CMR sequence types, the BAAI Cardiac Agent's segmentation models outperform established baselines such as nnUNet, MedSAM2, ResUNet++, and DiffUNet. For example, on SAX cine segmentation, the agent's model achieves a Dice Similarity Coefficient (DSC) of 90.21% and outperforms the baselines on DSC, Hausdorff Distance, and Average Surface Distance. The variance of these metrics is minimal, indicating stability across anatomical presentations.
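For readers unfamiliar with the DSC, the sketch below shows how the metric is computed for a pair of binary masks. Masks are represented as sets of voxel coordinates to keep the example dependency-free; the toy masks are made up for illustration, and returning 1.0 for two empty masks is an assumed convention, not something stated in the source.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]  # (z, y, x) index of a foreground voxel

def dice(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|).

    Returns 1.0 when both masks are empty (assumed convention).
    """
    total = len(pred) + len(truth)
    if total == 0:
        return 1.0
    return 2.0 * len(pred & truth) / total

# Toy masks (illustrative only): 3 shared voxels, 4 voxels in each mask.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
truth = {(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1)}
score = dice(pred, truth)  # 2 * 3 / (4 + 4) = 0.75
```

A reported DSC of 90.21% corresponds to a value of 0.9021 on this 0-to-1 scale.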
Disease Diagnosis
On a multicenter cohort (n=2413), the CDS model achieves an internal test AUC of 0.980 for normal heart detection, 0.938 for IHD, and 0.960 for NICM, with consistently high sensitivity and specificity on the test set. NICM subclassification for prevalent subtypes yields AUCs above 0.97 (e.g., HCM 0.975, DCM 0.971), and external validation on a held-out site (n=275) remains robust (all AUCs >0.81 for the main categories).
Comparative analysis against Video-based Swin Transformer (VST), ViT, and ResNet backbones confirms that the agent's performance is not architecture-dependent but reflects the benefit of its integrative design over traditional, vision-centric models. In rare subtypes (e.g., ACM, RCM), absolute AUCs are lower, reflecting current limitations in cohort size and data distribution.
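The two-level scheme (screening into normal, IHD, or NICM, followed by NICM subtyping) can be sketched as a simple routing rule. This is an assumption-laden illustration, not the paper's method: it assumes the subtype classifier is consulted only when the screening stage predicts NICM and that each stage decides by taking the highest probability. The function names and probability values are hypothetical.

```python
from typing import Dict, Optional, Tuple

CDS_LABELS = ("normal", "IHD", "NICM")
NICM_SUBTYPES = ("HCM", "DCM", "RCM", "ACM", "myocarditis")

def argmax(probs: Dict[str, float]) -> str:
    """Return the label with the highest probability."""
    return max(probs, key=probs.get)

def hierarchical_predict(
    cds_probs: Dict[str, float],
    nicm_probs: Dict[str, float],
) -> Tuple[str, Optional[str]]:
    """Stage 1 screens the case; stage 2 runs only for NICM (assumed rule)."""
    top = argmax(cds_probs)
    if top != "NICM":
        return top, None
    return top, argmax(nicm_probs)

# Made-up probabilities for illustration only.
cds = {"normal": 0.1, "IHD": 0.2, "NICM": 0.7}
sub = {"HCM": 0.6, "DCM": 0.2, "RCM": 0.1, "ACM": 0.05, "myocarditis": 0.05}
result = hierarchical_predict(cds, sub)  # ("NICM", "HCM")
```

Splitting the decision this way lets the subtype classifier specialize on NICM cases, which is one plausible reason to prefer a hierarchy over a single flat classifier across all labels.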
Automated Quantification and Report Generation
The agent achieves strong correlation with clinical ground truth for all primary cardiac function parameters (Pearson r>0.90 for LVEF, LVEDV, LVESV, SV, LVM). Estimated LV wall thickness and its regional distribution reproduce clinically observed pathological patterns, which is crucial for conditions such as HCM.
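The functional parameters above are linked by standard definitions: stroke volume is SV = LVEDV - LVESV, and ejection fraction is LVEF = 100 × SV / LVEDV. The sketch below computes them from volumes and shows how a Pearson correlation between automated and reference values could be evaluated. The volumes and paired measurements are made-up illustrative numbers, not data from the paper, and the function names are hypothetical.

```python
from math import sqrt
from typing import Sequence

def stroke_volume(edv_ml: float, esv_ml: float) -> float:
    """SV = EDV - ESV, in millilitres."""
    return edv_ml - esv_ml

def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """LVEF in percent: 100 * (EDV - ESV) / EDV."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml

def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Illustrative volumes only: EDV 120 mL, ESV 50 mL.
sv = stroke_volume(120.0, 50.0)        # 70 mL
lvef = ejection_fraction(120.0, 50.0)  # roughly 58.3 %

# Illustrative paired LVEF values (automated vs. reference), not from the paper.
automated = [55.0, 62.0, 35.0, 48.0, 70.0]
reference = [57.0, 60.0, 33.0, 50.0, 68.0]
r = pearson_r(automated, reference)
```

Note that a high Pearson r indicates linear agreement in trend but does not rule out a systematic offset between automated and reference measurements.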
Reports generated by the agent achieve a BERT-score F1 of 0.898, approaching semantic equivalence with radiology reports. Multilevel radiologist review (junior, mid, senior) confirms high clinical trustworthiness: the agent scores highest on expert concordance and report-writing confidence among the compared LMM frameworks, rated at 87–88 by all groups versus 58 for Qwen3-VL-30B and <30 for other medical LMMs.
Tool invocation precision exceeds 99% across all tested scenarios, and no hallucinations are reported in task execution or sequence recognition, an essential property for clinical deployment.
Contradictory or Notable Claims
- With only 7B parameters and targeted fine-tuning, the agent surpasses larger LMMs (e.g., Qwen3-VL-30B) on CMR-specific reasoning, report generation, and sequence discrimination.
- The system achieves near-complete automation of the end-to-end CMR reasoning pipeline, reducing median interpretation time from 1800 seconds (manual) to 90 seconds, with no loss of clinical fidelity.
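To put the reported timing figures in relative terms, the arithmetic below derives the speedup factor and the fractional time saving from the two medians stated above. Only the 1800-second and 90-second values come from the source; the variable names are illustrative.

```python
# Median interpretation times reported in the summary, in seconds.
manual_s = 1800
agent_s = 90

# Speedup factor: how many times faster the agent-assisted workflow is.
speedup = manual_s / agent_s  # 20.0

# Fraction of the manual interpretation time that is saved (about 0.95).
reduction = 1 - agent_s / manual_s
```

In other words, the reported medians correspond to a 20-fold speedup, or a 95% reduction in interpretation time.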
Practical and Theoretical Implications
This paper substantiates the feasibility and clinical applicability of 3D-native multimodal agentic architectures in a domain characterized by high data complexity and heterogeneous workflows. The design aligns tool invocation, expert model inference, and structured report synthesis in a closed-loop manner reminiscent of radiologist practice, yet achieves dramatically higher throughput and reproducibility.
Practically, the agent can scale cardiac imaging analysis to environments with a shortage of clinical experts, improve cross-institutional standardization, and augment clinical throughput. Theoretically, the work motivates the development of specialized LMM-based agents that layer expert models and orchestrate their invocation according to domain logic, a paradigm likely to generalize to other complex, workflow-rich medical domains (e.g., neuroimaging, oncology).
The study identifies current limitations in the handling of rare subtypes due to data scarcity and the absence of non-CMR modalities; ongoing work should prioritize multi-center data harmonization and integration of CT, ultrasound, and EHR data for full-spectrum cardiovascular diagnostics.
Conclusion
The BAAI Cardiac Agent demonstrates robust, clinically-aligned performance for CMR-based cardiac diagnosis, achieving SOTA segmentation, disease classification, quantitative measurement, and narrative report generation within an integrated, agentic multimodal framework. Its empirical results highlight significant gains over mainstream deep learning and LMM approaches, and the architecture’s modularity paves the way for future multimodal, multi-agent system generalization in broader medical AI contexts (2604.04078).